Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: Unicode, Korean, normalization form, Mac OS X and tab completion

X-seq: zsh-workers 32638
From: Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx>
To: "Zsh List Hackers'" <zsh-workers@xxxxxxx>
Subject: Re: Unicode, Korean, normalization form, Mac OS X and tab completion
Date: Sat, 31 May 2014 20:16:17 +0100
In-reply-to: <AB81F9FB-8D84-4656-9EFE-F2F98B196861@me.com>
List-help: <mailto:zsh-workers-help@zsh.org>
List-id: Zsh Workers List <zsh-workers.zsh.org>
List-post: <mailto:zsh-workers@zsh.org>
Mailing-list: contact zsh-workers-help@xxxxxxx; run by ezmlm
References: <AB81F9FB-8D84-4656-9EFE-F2F98B196861@me.com>

On Sat, 31 May 2014 12:56:06 +0900
Kwon Yeolhyun <yeolhyunkwon@xxxxxx> wrote:
> 4) Mac OS X uses normalized string as filename. Assuming there’s a
> file with the name of 가나다, it has the name of
> ㄱㅏㄴㅏㄷㅏ(decomposed into hangul jamos) internally. (Link to hangul
> jamos:
> http://www.utf8-chartable.de/unicode-utf8-table.pl?start=4352&number=1024)
> 5) I guess the reason why the tab completion has failed is that zsh
> compares the user input, 가나다, with the filename, ㄱㅏㄴㅏㄷㅏ.
> 가나다 and ㄱㅏㄴㅏㄷㅏ are canonically equivalent but have different
> binary representations.

You're right, this is a real problem that could do with solving.

The actual conversion between the two is easy enough --- though most of
us here don't use Macs or character sets that show up the problem, so
we'd need a volunteer to help with this (relatively) easy bit.
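
For illustration only, here is a minimal sketch of that conversion in
Python rather than in the shell's own C code. It assumes the two forms
in question are the standard Unicode normalisation forms NFC (composed)
and NFD (decomposed); the form the Mac filesystem actually stores is
reported to be a close variant of NFD, so treat this as an
approximation. The variable names are made up for the example.

```python
import unicodedata

# Escapes are used so the form of each string is unambiguous:
# U+AC00 U+B098 U+B2E4 are the precomposed syllables of "가나다".
composed = "\uac00\ub098\ub2e4"

# NFD splits each syllable into its conjoining jamo.
decomposed = unicodedata.normalize("NFD", composed)

# Canonically equivalent, but different code points and different bytes.
print(len(composed), len(decomposed))        # 3 6
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))       # 9 18
print(composed == decomposed)                # False

# Converting back to NFC recovers the composed string.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```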

The difficult bit, about which I suspect only Bart and I are likely to
have detailed opinions, is where to do the conversion.

Doing it at the point where data is read from the keyboard is
problematic, since what we put back onto the command line is quite
intricately tied to what we read from it in the first place, and
arbitrary transformations at this point make it hard to know what to put
back after the completion.

Doing it right down in the guts is even harder --- there are some
incredibly complicated things going on to support features like partial
word completion that currently treat data simply as octet strings, and
upgrading this is a huge job.

So if we can guarantee the keyboard input is in one form (and I'm not
sure we necessarily can) it might be easier to convert file names into
that format.  The trouble here is that to be consistent we need to
convert all data passed into the completion system, e.g. from file
contents passed as strings via functions.  (In principle it's
more correct to normalise all input anyway.)
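
As a rough sketch of that idea (Python for brevity, not the completion
code itself): bring every candidate string into the same form as the
typed prefix before comparing. The function name `complete` and the
choice of NFC as the keyboard form are assumptions for the example,
and plain prefix matching stands in for the much more involved
matching the completion system really does.

```python
import unicodedata

def complete(prefix, candidates, form="NFC"):
    """Return the candidates that start with prefix, comparing in one form.

    Every candidate is normalised, wherever it came from (file names,
    function output, ...), so the comparison is consistent.
    """
    norm_prefix = unicodedata.normalize(form, prefix)
    normalised = (unicodedata.normalize(form, c) for c in candidates)
    return [c for c in normalised if c.startswith(norm_prefix)]

# A file name as a decomposing filesystem might return it.
on_disk = unicodedata.normalize("NFD", "\uac00\ub098\ub2e4.txt")
typed = "\uac00\ub098"  # composed, as assumed to come from the keyboard

print(typed in on_disk)                        # False: raw comparison fails
print(complete(typed, [on_disk, "abc"]) ==
      ["\uac00\ub098\ub2e4.txt"])              # True
```

Note that the matches come back in the keyboard form, so anything
inserted on the command line would no longer be byte-identical to the
name on disk.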

I'm currently wondering if there is scope for normalising keyboard input
really early --- before we feed it back to the shell --- and turning it
back into the usual keyboard form right at the end, perhaps not worrying
too much if the original input was in a different form as long as
they're equivalent.  But I suspect it's not that easy.
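
A toy version of that round trip, again in Python and with made-up
names (`with_internal_form`, `work`), assuming NFD as the internal form
and NFC as the usual keyboard form:

```python
import unicodedata

def with_internal_form(line, work, internal="NFD", external="NFC"):
    """Normalise line, run work on it, convert the result back.

    work stands in for whatever the shell does with the input. The
    original form of line is not remembered; the result always comes
    back in the external form.
    """
    internal_line = unicodedata.normalize(internal, line)
    result = work(internal_line)
    return unicodedata.normalize(external, result)

composed = "\uac00\ub098\ub2e4"
decomposed = unicodedata.normalize("NFD", composed)

# Input that arrived decomposed comes back composed: equivalent to what
# was typed, but not identical to it.
out = with_internal_form(decomposed, lambda s: s + ".txt")
print(out == composed + ".txt")    # True
print(out == decomposed + ".txt")  # False
```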

So this will take a certain amount of thought.

pws

Follow-Ups:
- Re: Unicode, Korean, normalization form, Mac OS X and tab completion
  - From: Bart Schaefer

References:
- Unicode, Korean, normalization form, Mac OS X and tab completion
  - From: Kwon Yeolhyun
