Zsh Mailing List Archive

Re: UTF-8 support



Peter wrote:
> I came to the conclusion that was going to be very time-consuming --- it
> means unmetafying a potentially long string (we don't know where the
> characters end) and calling a function every time we want to compare multibyte
> characters.  Doing it only for UTF-8 can be optimised to work with
> extensions to the current tests; it's simple to test for the length of a
> UTF-8 character (although some error checking is also necessary).
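
(The lead-byte test mentioned above is indeed short; here's a minimal
sketch, with illustrative names rather than anything from the zsh
source:

    /* Byte length of a UTF-8 character, judged from its lead byte.
     * Returns -1 for a continuation byte, an overlong lead (0xc0,
     * 0xc1) or a lead beyond U+10FFFF (0xf5 and up). */
    static int utf8_len(unsigned char c)
    {
        if (c < 0x80) return 1;     /* 7-bit ASCII */
        if (c < 0xc2) return -1;
        if (c < 0xe0) return 2;
        if (c < 0xf0) return 3;
        if (c < 0xf5) return 4;
        return -1;
    }

Trailing bytes still need validating, which is presumably the extra
error checking referred to.)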

If you want to find a short string in a long string, you can surely
metafy the short string instead of unmetafying the long one.
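
Something like the following minimal sketch is what I have in mind; it
assumes (from memory, so treat the details as illustrative) that Meta
only ever appears as the quoting prefix in a metafied string, never as
a quoted payload byte:

    #include <string.h>

    #define Meta ((char) 0x83)

    /* Search a metafied haystack for an already-metafied needle.
     * strstr() is usable because metafied strings contain no raw NUL
     * bytes; a hit preceded by Meta has landed on a quoted payload
     * byte and must be skipped. */
    static char *metastrstr(const char *haystack, const char *metaneedle)
    {
        const char *p = haystack;
        char *hit;

        while ((hit = strstr(p, metaneedle)) != NULL) {
            if (hit == haystack || hit[-1] != Meta)
                return hit;
            p = hit + 1;
        }
        return NULL;
    }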

The approach I was suggesting has the big advantage that we can add
support in isolated areas without first breaking the entire shell.

I think it would be a bad mistake to write our own, UTF-8-specific
versions of all the routines that libc already provides, even if we
could make one or two of them slightly more efficient by handling the
meta processing at the same time. And if we're going to restrict the
code to UTF-8, we could ditch the meta stuff and use overcoding, which
amounts to storing the null character as the overlong two-byte sequence
0xC0 0x80. The code for that would be a lot simpler, but you can't
expect to pass overlong sequences elsewhere without getting errors. At
least UTF-8 allows you to strchr for 7-bit ASCII characters in a UTF-8
string (other multibyte encodings allow this only for /). Could we
perhaps change the Meta character to 0xC0? We could then use overcoding
for UTF-8 while keeping the UTF-8-specific code in the metafy process
to a minimum.
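
To make the overcoding idea concrete, a minimal sketch (the helper is
illustrative, not a proposed API):

    #include <stddef.h>

    /* Store NUL as the overlong pair 0xC0 0x80, as Java's "modified
     * UTF-8" does, so that C string functions keep working.  Strict
     * UTF-8 decoders will reject the sequence, hence the caveat about
     * not passing it elsewhere. */
    static size_t overcode_byte(unsigned char c, unsigned char *out)
    {
        if (c == 0) {
            out[0] = 0xc0;  /* overlong lead byte */
            out[1] = 0x80;  /* continuation byte for U+0000 */
            return 2;
        }
        out[0] = c;
        return 1;
    }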

The most efficient way would be to maintain string lengths, Pascal
style (length in bytes, not characters), possibly even using wchar_t
instead of multibyte encodings. We could perhaps do that for limited
sections of code such as parameters. That would also cope better when
someone decides to change the current locale. If we extend it
elsewhere, however, we need to be careful to maintain the portability
of wordcode files.
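
A minimal sketch of the representation (the struct and names are
illustrative only):

    #include <stddef.h>

    /* Pascal-style string: explicit byte length instead of NUL
     * termination, so embedded NULs need no metafication and the
     * length is O(1) instead of a strlen() scan. */
    typedef struct {
        size_t len;     /* length in bytes, not characters */
        char *data;     /* may contain raw NUL bytes */
    } lenstr;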

> Given that the whole point of Unicode is to replace all other schemes,
> I'm not so keen about supporting other schemes if it's that much less
> efficient.

I'm not suggesting supporting alternatives to Unicode but alternatives
to UTF-8. I'd bet that single-byte 8-bit encodings will stick around on
small or embedded systems for longer than you might expect. My main
objection is to any suggestion of not using library calls to handle the
work. mblen may be easy to reimplement, but wcwidth is not, so we'd end
up with a mixture.
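
For instance, the display width of the next character falls straight
out of libc (a minimal sketch; the names are illustrative, and the
error handling is crude: mbtowc returning 0 for NUL is lumped in with
failure):

    #include <stdlib.h>
    #include <wchar.h>

    /* Column width of the next multibyte character of s in the
     * current locale; *bytes gets its length in bytes.  Returns -1
     * on NUL or an invalid/incomplete sequence. */
    static int next_char_width(const char *s, size_t n, int *bytes)
    {
        wchar_t wc;
        int len = mbtowc(&wc, s, n);

        if (len <= 0)
            return -1;
        *bytes = len;
        return wcwidth(wc);
    }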

I don't mind so much whether we support other multibyte encodings with
more limited ASCII compatibility than UTF-8. It'd be better to have
limited support than an error message followed by setting LC_CTYPE to
C, though.

Oliver
