Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

RE: UTF-8 fonts

Just to make it clear. Is the aim to use UTF-8 internally or to support
(arbitrary) multibyte encoding?

> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.
> My first thought about using UTF-8 instead of eight bit characters

This sounds like you want to convert input to UTF-8 internally?

> was
> that we would have to replace the current `Meta' system.  However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.
> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules.  There may also be some extra places where counting
> the length needs changing.

You also need to modify any place where the shell compares or translates
(upper <-> lower) characters. This is by definition locale-dependent:
collating order differs between languages even when they use the same
character set. That means you can use UTF-8 (or, more generally, any
multibyte encoding) only if your current locale supports it, which in effect
means using the wc* and mb* function suites anyway.
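
To make the wc*/mb* route concrete, here is a minimal sketch of a
case-insensitive, locale-aware comparison. The helper name mb_casecoll and
the fixed 255-character limit are assumptions made for illustration, not
anything taken from the zsh source; the library calls are standard C.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: compare two multibyte strings in the current
 * locale, ignoring case. Returns <0, 0 or >0 like strcoll(). Sets *ok
 * to 0 if either string is invalid in the locale's encoding. Strings
 * longer than 255 characters are silently truncated in this sketch. */
int mb_casecoll(const char *a, const char *b, int *ok)
{
    wchar_t wa[256], wb[256];
    size_t i;
    size_t na = mbstowcs(wa, a, 255);
    size_t nb = mbstowcs(wb, b, 255);

    *ok = 0;
    if (na == (size_t)-1 || nb == (size_t)-1)
        return 0;               /* invalid multibyte sequence */
    wa[na] = L'\0';
    wb[nb] = L'\0';

    /* Case mapping is per wide character and depends on LC_CTYPE. */
    for (i = 0; i < na; i++)
        wa[i] = (wchar_t)towlower((wint_t)wa[i]);
    for (i = 0; i < nb; i++)
        wb[i] = (wchar_t)towlower((wint_t)wb[i]);

    *ok = 1;
    return wcscoll(wa, wb);     /* ordering depends on LC_COLLATE */
}
```

A caller would normally run setlocale(LC_ALL, "") once at startup so that
the conversion, case mapping and collation follow the user's locale; without
that call the program stays in the "C" locale.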

But this also means you cannot assume anything about current character set
and cannot assume that it is transparent w.r.t. current string handling in

> Unicode characters are up to 6 bytes, so either with 64-bit integers we
> can do a direct comparison with some bit arithmetic, or we can just use
> strncmp.  (I don't fancy relying on internationalisation support for
> this, but in principle that's probably the right thing to do.)
> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes.  Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.
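
For reference, the byte-counting approach described in the quoted text could
look roughly like the sketch below. The function names are hypothetical, and
the table follows the original 1 to 6 byte UTF-8 scheme that the quoted text
assumes.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte c
 * (original 1-6 byte scheme); 0 for a continuation or invalid byte. */
size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;     /* 0xxxxxxx: ASCII */
    if (c < 0xC0) return 0;     /* 10xxxxxx: continuation byte */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    if (c < 0xFC) return 5;
    if (c < 0xFE) return 6;
    return 0;                   /* 0xFE and 0xFF never occur in UTF-8 */
}

/* Compare the characters starting at a and b byte by byte, without
 * decoding them to code points. Returns 1 if equal, 0 otherwise. */
int utf8_char_equal(const char *a, const char *b)
{
    size_t la = utf8_seq_len((unsigned char)a[0]);
    size_t lb = utf8_seq_len((unsigned char)b[0]);

    return la != 0 && la == lb && strncmp(a, b, la) == 0;
}
```

With this, the overlong two-byte form 0xC0 0xAF never compares equal to a
plain "/", as the quoted text predicts. The sketch simply trusts that the
data is UTF-8.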

How do you know that your input (and the strings you are processing) are
UTF-8? Besides, the standards do not provide a way to input a multibyte
character; you can only read wide characters.

> Probably we need a configuration option to switch this on or off.

Yes, either we rely on standard locale support (and do not care which
character set is being used) or we must provide some out-of-band means to
define the character set in use.

> Zle might be a bit more of a problem.  The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver.  So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
>   from the locale

Impossible. Locale names are just arbitrarily chosen strings; there is no
"character set code" defined in any locale definition, at least on Unix.

> - how UTF-8 encoded characters interfere with meta-bindings.  May be
>   good enough simply not to use these, at least while we work out what's
>   what
> - reading multi-byte characters --- timeouts and the like

Use standard OS interfaces to read wide characters.
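
As a minimal sketch of what that means in practice, the function below reads
one line with fgetwc() and lets the C library do the multibyte decoding
according to LC_CTYPE. The name read_wide_line is hypothetical. The sketch
blocks on input, so it says nothing about the timeout question raised in the
quoted text.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: read wide characters from fp up to a newline or
 * end of input, storing at most cap-1 of them plus a terminator.
 * Returns the number of characters stored. The newline is consumed but
 * not stored. */
size_t read_wide_line(FILE *fp, wchar_t *buf, size_t cap)
{
    size_t n = 0;
    wint_t wc;

    if (cap == 0)
        return 0;
    while (n + 1 < cap && (wc = fgetwc(fp)) != WEOF && wc != L'\n')
        buf[n++] = (wchar_t)wc;
    buf[n] = L'\0';
    return n;
}
```

Note that a stream must not mix byte-oriented and wide-oriented calls, so a
program going this way has to use the wide functions consistently on that
stream.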

> - getting the right length for displaying, deleting, copying
>   etc. multi-byte characters.  Apart from counting continuation
>   bytes, we may be stuck with using wcwidth for display.  This is a pain
>   because it involves explicit wchar_t's, and I have no experience at
>   all with these (except that they mess up compilation of otherwise
>   trivial string-handling functions).
> - all the stuff I've forgotten.
> Any comments?
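
The "counting continuation bytes" idea from the list above can be sketched
as follows, assuming the buffer is known to hold well-formed UTF-8. The
function names are hypothetical. This only yields character counts and
boundaries for deleting or copying; it does not give display width, which
for double-width or combining characters would still need something like
wcwidth.

```c
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx. */
static int is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Number of characters in the first len bytes of s: every byte that is
 * not a continuation byte starts a new character. */
size_t utf8_count_chars(const char *s, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (!is_cont((unsigned char)s[i]))
            n++;
    return n;
}

/* Byte offset at which the character ending just before pos begins,
 * i.e. where a backward delete from pos would start. Returns 0 when
 * pos is already at the start of the buffer. */
size_t utf8_prev_start(const char *s, size_t pos)
{
    if (pos == 0)
        return 0;
    pos--;
    while (pos > 0 && is_cont((unsigned char)s[pos]))
        pos--;
    return pos;
}
```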

