Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: PATCH: multibyte FAQ (MacOS X)



(I have subscribed to zsh-workers so no need to reply to me)

At 6:26 PM +0300 05.12.18, Andrey Borzenkov wrote:
>I wonder, when decomposition happens?

It seems that open(2) and opendir(3) accept both precomposed and decomposed forms, but they internally convert the filename/dirname into decomposed form. For example,

/* a-umlaut: precomposed */
name[0] = 0xc3; name[1] = 0xa4;
name[2] = 0x00;
fd = open(name,O_CREAT,mode);

has the same effect as

/* a + umlaut: decomposed */
name[0] = 0x61;			/* 'a' */
name[1] = 0xcc; name[2] = 0x88;	/* umlaut */
name[3] = 0x00;
fd = open(name,O_CREAT,mode);

and in both cases the created file has a filename in decomposed form.
readdir(3) always returns filenames in decomposed form (and in UTF-8 encoding).
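
A minimal sketch of how one might check this, assuming a POSIX system. The helper names name_to_hex and create_and_list are made up for illustration and are not part of any existing test. On a volume that decomposes names as described above, the entry for a file created with the precomposed name would be expected to show up as 61 cc 88; other filesystems may simply return the bytes unchanged.

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write the bytes of name into out as
 * space-separated lowercase hex, e.g. "c3 a4". */
void name_to_hex(const char *name, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)name;
    size_t used = 0;

    assert(name != NULL && out != NULL && outsz > 0);
    out[0] = '\0';
    for (; *p != '\0'; p++) {
        int n = snprintf(out + used, outsz - used,
                         used ? " %02x" : "%02x", *p);
        if (n < 0 || (size_t)n >= outsz - used) {
            out[used] = '\0';   /* drop the partially written byte */
            break;
        }
        used += (size_t)n;
    }
}

/* Hypothetical helper: create `name` inside `dir`, then print every
 * entry that readdir() returns for `dir` as hex bytes.
 * Returns 0 on success, -1 on error. */
int create_and_list(const char *dir, const char *name)
{
    char path[1024], hex[1024];
    struct dirent *de;
    DIR *d;
    int fd, n;

    n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    d = opendir(dir);
    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL) {
        name_to_hex(de->d_name, hex, sizeof hex);
        printf("%s\n", hex);
    }
    closedir(d);
    return 0;
}
```

Calling create_and_list(".", "\xc3\xa4") in an empty scratch directory and comparing the printed bytes with the ones passed in shows whether the filesystem rewrote the name.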

If a user types a-umlaut on the keyboard, it arrives in precomposed form (at least with US and Japanese keyboards). But I think zsh does not need to convert it into decomposed form. For example,

zsh% echo hello > Xa-umlaut    (two characters, 'X' and 'a-umlaut')

This works fine even though the 'a-umlaut' is in precomposed form. The created file has a filename in decomposed form, and if I use filename completion

zsh% cat X<TAB>

then I get

zsh% cat Xa+umlaut

and the a+umlaut is in decomposed form. The decomposed character is displayed correctly both in Apple's Terminal (which runs under Apple's Aqua window system) and in xterm (with a Unicode font, of course), but I can't edit the command line correctly.

You can test a decomposed character as follows
(I assume insert-unicode-char is bound to ^XU):

zsh% echo ^XU61^XU^XU308^XU
a+umlaut
zsh% (go up in the history stack and try to edit it)
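
To illustrate how the two forms differ from an editor's point of view, here is a rough sketch that counts code points and combining marks in a UTF-8 string. The function names are hypothetical, the decoder is deliberately minimal (it does not reject overlong forms or surrogates), and is_combining_mark only knows the U+0300..U+036F block, which is where U+0308 lives. It is not taken from zsh and says nothing about how zle actually works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into *cp.
 * Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;

    assert(s != NULL && cp != NULL);
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xe0) == 0xc0) { *cp = s[0] & 0x1f; len = 2; }
    else if ((s[0] & 0xf0) == 0xe0) { *cp = s[0] & 0x0f; len = 3; }
    else if ((s[0] & 0xf8) == 0xf0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;           /* also catches a premature NUL */
        *cp = (*cp << 6) | (s[i] & 0x3f);
    }
    return len;
}

/* True only for the Combining Diacritical Marks block (an assumption
 * made to keep the sketch short; real code needs the full tables). */
int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036f;
}

/* Count code points and combining marks in NUL-terminated UTF-8 s.
 * Returns 0 on success, -1 if s is malformed. */
int count_chars(const char *s, size_t *codepoints, size_t *combining)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL && codepoints != NULL && combining != NULL);
    *codepoints = 0;
    *combining = 0;
    while (*p != '\0') {
        uint32_t cp;
        size_t n = utf8_decode(p, &cp);
        if (n == 0)
            return -1;
        (*codepoints)++;
        if (is_combining_mark(cp))
            (*combining)++;
        p += n;
    }
    return 0;
}
```

For the precomposed "Xa-umlaut" (58 c3 a4) this counts two code points and no combining marks; for the decomposed form (58 61 cc 88) it counts three code points, one of which is a combining mark, although both are displayed as two characters.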


>To be sure - do you mean that e.g. accented characters are internally kept as 
>two characters? Does it agree with <http://www.unicode.org/reports/tr15/>?

How Apple decomposes characters can be found in
http://developer.apple.com/technotes/tn/tn1150table.html

I don't know whether this exactly follows the "Canonical Decomposition (Normalization Form D)" in the Unicode Standard; probably not. 


