Zsh Mailing List Archive

Re: Globbing autocorrects misencoded filenames?



On Tue, 19 Jan 2010 21:49:16 +0100
waba@xxxxxxx wrote:
> IOW, I create a file named "àbc" (a-grave b c) using latin1 encoding on
> my utf8 system, but it still matches as utf8 during globbing.

What's going on is that when the shell finds a byte sequence that is invalid
in the current encoding, it treats the first byte as a single-byte "character"
(really just a number) because it doesn't know what else to do.  Because
Unicode adopted ISO-8859-1 unchanged as its first 256 code points, the 128
characters after ASCII keep their Latin-1 values, so in this case the raw
single byte (0xE0) compares equal to the properly converted UTF-8 character
(U+00E0).
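
To make the coincidence concrete, here is a small standalone sketch (not
zsh code; the helper names latin1_to_codepoint and encode_utf8_small are
made up for illustration).  It shows that the Latin-1 byte for a-grave is
numerically the same as the Unicode code point that the UTF-8 form
decodes to, which is why the raw fallback value and the converted value
compare equal.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: ISO-8859-1 maps byte N to code point U+00NN,
 * so widening a Latin-1 byte to a code point is the identity.
 */
static unsigned int latin1_to_codepoint(unsigned char byte)
{
    return byte;
}

/*
 * Hypothetical helper: encode a code point below U+0800 as UTF-8.
 * Returns the number of bytes written to out (1 or 2).
 */
static size_t encode_utf8_small(unsigned int cp, unsigned char out[2])
{
    assert(cp < 0x800);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation byte */
    return 2;
}
```

With these, latin1_to_codepoint(0xE0) gives 0xE0, and encoding 0xE0 as
UTF-8 gives the two bytes C3 A0.  So a file name stored as the single byte
E0 and a pattern typed as C3 A0 both end up as the number 0xE0 once one is
taken raw and the other is converted.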

The fix is for the shell never to allow a properly converted character to
match an invalid character, but to continue to allow two invalid characters
(or, obviously, two valid characters) to match.
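
The rule can be sketched outside the shell roughly as follows.  This is
only an illustration of the idea, not the patch itself: utf8_decode is a
hand-written stand-in for mbrtowc() in a UTF-8 locale, and next_char and
chars_match are hypothetical names.  The actual change to Src/pattern.c
follows below.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for mbrtowc() in a UTF-8 locale (an assumption for this
 * sketch).  Returns the sequence length, or 0 if the bytes at s are
 * invalid or incomplete.
 */
static size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long min, v;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; min = 0x80;    v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; v = s[0] & 0x07; }
    else
        return 0;               /* stray continuation or bad lead byte */
    if (n < len)
        return 0;               /* incomplete sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;
    *cp = v;
    return len;
}

/* Get a character and advance; on error take one raw byte and set *bad. */
static unsigned long next_char(const char **p, const char *end, int *bad)
{
    unsigned long cp;
    size_t len;

    assert(*p < end);
    len = utf8_decode((const unsigned char *)*p, (size_t)(end - *p), &cp);
    if (len == 0) {
        *bad = 1;
        return (unsigned char)*(*p)++;
    }
    *p += len;
    return cp;
}

/* Equal values match only if both were converted or both were raw. */
static int chars_match(unsigned long a, int bad_a, unsigned long b, int bad_b)
{
    return a == b && bad_a == bad_b;
}
```

Reading the pattern bytes C3 A0 and the file-name byte E0 through
next_char yields the value 0xE0 both times, but only the second read sets
its flag, so chars_match rejects the pair.  Two raw E0 bytes still match
each other.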

Index: Src/pattern.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/pattern.c,v
retrieving revision 1.50
diff -u -r1.50 pattern.c
--- Src/pattern.c	29 May 2009 21:06:46 -0000	1.50
+++ Src/pattern.c	21 Jan 2010 10:54:31 -0000
@@ -1795,9 +1795,9 @@
 
 
 /* Get a character and increment */
-#define CHARREFINC(x, y)	charrefinc(&(x), (y))
+#define CHARREFINC(x, y, z)	charrefinc(&(x), (y), (z))
 static wchar_t
-charrefinc(char **x, char *y)
+charrefinc(char **x, char *y, int *z)
 {
     wchar_t wc;
     size_t ret;
@@ -1808,7 +1808,8 @@
     ret = mbrtowc(&wc, *x, y-*x, &shiftstate);
 
     if (ret == MB_INVALID || ret == MB_INCOMPLETE) {
-	/* Error.  Treat as single byte. */
+	/* Error.  Treat as single byte, but flag. */
+	*z = 1;
 	/* Reset the shift state for next time. */
 	memset(&shiftstate, 0, sizeof(shiftstate));
 	return (wchar_t) STOUC(*(*x)++);
@@ -1865,7 +1866,7 @@
 /* Increment a pointer past the current character. */
 #define CHARINC(x, y)	((x)++)
 /* Get a character and increment */
-#define CHARREFINC(x, y)	(STOUC(*(x)++))
+#define CHARREFINC(x, y, z)	(STOUC(*(x)++))
 /* Counter the number of characters between two pointers, smaller first */
 #define CHARSUB(x,y)	((y) - (x))
 
@@ -2419,9 +2420,21 @@
 	    while (chrop < chrend && patinput < patinend) {
 		char *savpatinput = patinput;
 		char *savchrop = chrop;
-		patint_t chin = CHARREFINC(patinput, patinend);
-		patint_t chpa = CHARREFINC(chrop, chrend);
-		if (!CHARMATCH(chin, chpa)) {
+		int badin = 0, badpa = 0;
+		/*
+		 * Care with character matching:
+		 * We do need to convert the character to wide
+		 * representation if possible, because we may need
+		 * to do case transformation.  However, we should
+		 * be careful in case one, but not the other, wasn't
+		 * representable in the current locale---in that
+		 * case they don't match even if the returned
+		 * values (one properly converted, one raw) are
+		 * the same.
+		 */
+		patint_t chin = CHARREFINC(patinput, patinend, &badin);
+		patint_t chpa = CHARREFINC(chrop, chrend, &badpa);
+		if (!CHARMATCH(chin, chpa) || badin != badpa) {
 		    fail = 1;
 		    patinput = savpatinput;
 		    chrop = savchrop;


-- 
Peter Stephenson <pws@xxxxxxx>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK




