Zsh Mailing List Archive

Re: PATCH: multibyte characters in patterns.

X-seq: zsh-workers 22410
From: Vincent Lefevre <vincent@xxxxxxxxxx>
To: Zsh hackers list <zsh-workers@xxxxxxxxxx>
Subject: Re: PATCH: multibyte characters in patterns.
Date: Mon, 10 Apr 2006 17:40:04 +0200

On 2006-04-09 22:38:58 +0100, Peter Stephenson wrote:
> This adds handling for multibyte characters in patterns when the shell
> is compiled with MULTIBYTE_SUPPORT.  This is activated in two ways:
> 
> - Set the new MULTIBYTE option.  This will eventually cover parameter
> expansion and anything else in the main shell that needs it.  It won't
> cover ZLE; that will always use the locale directly.  The reason for the
> difference is that scripts and functions may trip up on binary input or,
> for example, ISO-8859-1-encoded files that used to be handled properly
> before the locale was taken into account.  Whether it should be turned
> on by default is still to be determined.
> - Use the (#u) globbing flag.  Unfortunately (#m) was already taken;
> it's supposed to suggest "Unicode" or "UTF-8", even though we'll handle
> other character sets.  (#U) is the opposite, as expected.

It could also suggest mUltibyte, and looks like the Greek letter mu
(for MUltibyte). :)

Could you give examples of what exactly it does?
Do you mean that "?" can now match a multibyte character?
Will it also match a UTF-8 character in an ISO-8859-1 locale?
(The motivation would be to handle data whose encoding differs from the
locale's, mainly when data are shared amongst users who use different
locales, in which case the data are generally encoded in UTF-8.)
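
To make the question concrete, here is a minimal sketch in Python (not
zsh code) of the difference between a "?" that matches one byte and a
"?" that matches one character. It uses the standard fnmatch module as a
stand-in for shell globbing; the variable names are illustrative only,
and nothing here is taken from the patch itself.

```python
from fnmatch import fnmatchcase

word = "b\u00e0r"                 # "bàr": 3 characters
utf8 = word.encode("utf-8")       # 4 bytes: b'b\xc3\xa0r'

# Character-wise matching: "?" consumes the whole "à".
char_match = fnmatchcase(word, "b?r")

# Byte-wise matching: "?" consumes a single byte, so "b?r" fails and
# two "?" are needed to cover the two bytes that encode "à".
byte_match_one = fnmatchcase(utf8, b"b?r")
byte_match_two = fnmatchcase(utf8, b"b??r")

# UTF-8 data read under ISO-8859-1: every byte becomes a character of
# its own, so the string has 4 characters and "b?r" fails again.
as_latin1 = utf8.decode("iso-8859-1")
latin1_match = fnmatchcase(as_latin1, "b?r")

print(char_match, byte_match_one, byte_match_two, latin1_match)
# → True False True False
```

The last case is the ISO-8859-1 scenario asked about above: unless the
matcher is told that the data are UTF-8 regardless of the locale, it
sees two single-byte characters where the author intended one.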

And how about subscripts in UTF-8 locales?

dixsept:~> foo="bàr"
dixsept:~> echo $foo[2]
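
For comparison, a small Python sketch (again not zsh code) of what the
subscript in the transcript above selects under the two interpretations.
It assumes the string is UTF-8 encoded; zsh subscripts are 1-based, so
$foo[2] corresponds to index 1 here.

```python
foo = "b\u00e0r"                  # "bàr"
raw = foo.encode("utf-8")         # b'b\xc3\xa0r'

second_char = foo[1]              # character indexing: "à"
second_byte = raw[1:2]            # byte indexing: b'\xc3'

# The lone lead byte 0xC3 is not a complete UTF-8 sequence.
try:
    second_byte.decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False

print(repr(second_char), second_byte, lone_byte_is_valid)
```

With byte-wise subscripting the result is only the first half of "à",
which a UTF-8 terminal cannot display as a character.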

> I've done virtually no optimisation of the code, and this could make a
> big difference.  Where it used to skip over a character simply with an
> inline test for Meta and a couple of increments, it now always enters a
> function, and with multibyte mode in effect always loops over the system
> test for a character.  The latter is inevitable but that doesn't mean
> the code is as good as it could be.  It would be possible to convert to
> wide characters, although it's complicated by the fact that we need to
> support arbitrary bytes, too; it would have to be done with something
> like a discriminated union of a char or wchar string.  Then we would
> have to convert each test string as well.  I don't know how important
> this is likely to be.
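
The "discriminated union" idea in the quoted paragraph could look
roughly like the following Python sketch. This is an illustration under
stated assumptions, not the shell's implementation: the names Char,
RawByte and split_elements are hypothetical, and UTF-8 is hard-coded
here, whereas the shell would ask the C library to decode according to
the current locale.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Char:
    value: str   # one successfully decoded character


@dataclass(frozen=True)
class RawByte:
    value: int   # one byte that is not part of a valid character


Element = Union[Char, RawByte]


def _seq_len(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead byte, 0 if none."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def split_elements(data: bytes) -> List[Element]:
    """Split data into characters, keeping undecodable bytes as they are."""
    out: List[Element] = []
    i = 0
    while i < len(data):
        n = _seq_len(data[i])
        text = None
        if n and i + n <= len(data):
            try:
                text = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                text = None
        if text is None:
            out.append(RawByte(data[i]))   # arbitrary byte, step by one
            i += 1
        else:
            out.append(Char(text))         # whole character, step by n
            i += n
    return out


print(split_elements(b"b\xc3\xa0r\xff"))
```

Once both the pattern and the test string are in this form, a matcher
can compare element by element, at the cost of converting every test
string first, which is the overhead the quoted text mentions.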

Couldn't an "unused" area of Unicode be used for arbitrary bytes?
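
As an existing example of this approach, Python's "surrogateescape"
error handler maps each undecodable byte 0x80-0xFF to a code point in
U+DC80-U+DCFF and maps it back on encoding. That range is reserved for
UTF-16 surrogates rather than truly unused, and whether the same trick
would suit zsh is an open question; the sketch below only shows that
the round trip is lossless.

```python
data = b"b\xc3\xa0r\xff"          # valid UTF-8 followed by a stray byte

# Decoding: the stray 0xFF becomes the lone surrogate U+DCFF.
text = data.decode("utf-8", errors="surrogateescape")
code_points = [hex(ord(c)) for c in text]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(code_points)
print(round_trip == data)
```

The resulting string is a plain sequence of characters, so no union type
is needed, but every consumer has to know that code points in that range
stand for raw bytes.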

-- 
Vincent Lefèvre <vincent@xxxxxxxxxx> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA
