Zsh Mailing List Archive

Re: UTF-8 fonts



This certainly seems to be the most requested feature these days, so I
would echo what Clint said in his response.

On 19 Sep, you wrote:
> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.

The info pages for glibc also have useful sections on this. I should
probably also mention the patch someone has at:
  http://www.ono.org/software/zsh-euc/
Not that I've looked at it in detail - I just found it some time ago
and put it in my bookmarks.
 
> My first thought about using UTF-8 instead of eight bit characters was
> that we would have to replace the current `Meta' system.  However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.

As far as I can see the `Meta' system should work unchanged.
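
To make that concrete, here is a minimal sketch of the mechanism as I
understand it (heavily simplified: the real shell escapes a whole range
of marker bytes, not just these two, and the names are illustrative
rather than lifted from the source). The point is that metafication
works on raw bytes, so a UTF-8 string survives the round trip
untouched:

  #include <assert.h>
  #include <stdio.h>
  #include <string.h>

  #define Meta 0x83    /* the shell's escape marker byte */

  /* A byte needs escaping when it collides with an internal marker;
   * for this sketch, only NUL and Meta itself. */
  static int needs_meta(unsigned char c)
  {
      return c == '\0' || c == Meta;
  }

  /* Metafy: escape awkward bytes as Meta followed by (byte ^ 32).
   * Everything else, UTF-8 continuation bytes included, passes
   * through untouched, which is why UTF-8 survives the round trip. */
  static size_t metafy(const unsigned char *in, size_t len,
                       unsigned char *out)
  {
      size_t i, o = 0;
      for (i = 0; i < len; i++) {
          if (needs_meta(in[i])) {
              out[o++] = Meta;
              out[o++] = in[i] ^ 32;
          } else
              out[o++] = in[i];
      }
      return o;
  }

  /* Unmetafy: the exact inverse. */
  static size_t unmetafy(const unsigned char *in, size_t len,
                         unsigned char *out)
  {
      size_t i, o = 0;
      for (i = 0; i < len; i++)
          out[o++] = (in[i] == Meta) ? (in[++i] ^ 32) : in[i];
      return o;
  }

  int main(void)
  {
      const unsigned char utf8[] = "caf\xc3\xa9";  /* "café" in UTF-8 */
      unsigned char meta[16], back[16];
      size_t m = metafy(utf8, sizeof utf8 - 1, meta);
      size_t b = unmetafy(meta, m, back);
      assert(b == sizeof utf8 - 1 && !memcmp(back, utf8, b));
      puts("round trip ok");
      return 0;
  }

The crucial property is just that the transformation is byte-for-byte
reversible without knowing anything about the encoding.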

There are other multibyte character sets which we should ideally be
concerned about besides just UTF-8. Does anyone know if we need to
worry about any that are stateful? (ISO-2022-JP, for example, uses
escape sequences to shift between character sets, so a byte can't be
interpreted without knowing the current shift state.)

> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF-8 rules.  There may also be some extra places where counting
> the length needs changing.

Basically, anything dealing with string lengths, substrings or single
characters will need modifying.
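
For instance, counting characters rather than bytes might look
something like this, using the standard mbrlen() interface so that it
follows whatever encoding LC_CTYPE selects instead of being hard-wired
to UTF-8 (the function name is made up):

  #include <locale.h>
  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  /* Length in characters, not bytes, in whatever encoding LC_CTYPE
   * selects.  Invalid or incomplete sequences count as one byte each
   * so the loop always makes progress; substring extraction is the
   * same loop, stepping k bytes per character. */
  static size_t mb_strlen(const char *s)
  {
      mbstate_t st;
      size_t k, n = 0, len = strlen(s);
      memset(&st, 0, sizeof st);
      for (; len; s += k, len -= k, n++) {
          k = mbrlen(s, len, &st);
          if (k == (size_t)-1 || k == (size_t)-2) {
              memset(&st, 0, sizeof st);   /* resynchronise */
              k = 1;
          }
      }
      return n;
  }

  int main(void)
  {
      setlocale(LC_CTYPE, "");
      /* "naïve": 6 bytes, 5 characters in a UTF-8 locale */
      printf("%zu\n", mb_strlen("na\xc3\xafve"));
      return 0;
  }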

> Unicode characters are up to 6 bytes, so either with 64-bit integers we
> can do a direct comparison with some bit arithmetic, or we can just use
> strncmp.  (I don't fancy relying on internationalisation support for
> this, but in principle that's probably the right thing to do.)

I'm not sure what you're trying to achieve with the 64-bit integer
suggestion. strncmp() should be fine for string comparisons, and I
wouldn't bet against that 6-byte maximum ever being increased.
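
The comparison itself could then be as simple as the following
hypothetical helper: take the byte length of each character and compare
raw bytes, which also deals with the overlong-encoding point in your
next paragraph (it assumes a complete character is readable at each
pointer):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  /* Compare one multibyte character against another without decoding
   * either: find the byte length of each and compare raw bytes.  An
   * overlong encoding has a different length from the canonical form,
   * so the two can never compare equal. */
  static int mb_charcmp(const char *p, const char *q)
  {
      mbstate_t st;
      size_t lp, lq;
      memset(&st, 0, sizeof st);
      lp = mbrlen(p, MB_CUR_MAX, &st);
      memset(&st, 0, sizeof st);
      lq = mbrlen(q, MB_CUR_MAX, &st);
      if (lp == (size_t)-1 || lp == (size_t)-2) lp = 1;
      if (lq == (size_t)-1 || lq == (size_t)-2) lq = 1;
      if (lp != lq)
          return lp < lq ? -1 : 1;
      return memcmp(p, q, lp);
  }

  int main(void)
  {
      setlocale(LC_CTYPE, "");
      /* 2-byte "é" against 1-byte "e": lengths differ, so non-zero */
      printf("%d\n", mb_charcmp("\xc3\xa9", "e"));
      return 0;
  }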

> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes.  Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.

I think converting things to Unicode/wchar_t's would be a bad idea; it
would be better to stick to whatever encoding LC_CTYPE specifies for
everything internal (except temporarily in some operations). wchar_t is
implementation-defined (though ISO 10646 (Unicode) is fairly commonly
used for it), so it would break the portability of word code files if
they were stored using wchar_t's. Admittedly people could get
themselves into a mess by switching between different encodings, but
that is nothing you don't get anyway with text files. Staying with the
locale's encoding also makes implementing this easier because it can be
done gradually, fixing areas where the shell doesn't work for
multibyte characters, whereas using wchar_t's would mean rewriting
large chunks of everything at once.
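
To illustrate what I mean by "temporarily in some operations": case
conversion, say, can decode a single character to a wchar_t just long
enough to apply towupper() and then re-encode it, so nothing wide is
ever stored (a sketch only; the helper name is invented):

  #include <limits.h>
  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>
  #include <wctype.h>

  /* Upper-case one character: decode to wchar_t just long enough to
   * apply towupper(), then re-encode.  Storage stays in the LC_CTYPE
   * encoding throughout; nothing wide is ever kept. */
  static size_t mb_toupper_char(const char *in, char *out)
  {
      mbstate_t st;
      wchar_t wc;
      size_t k;
      memset(&st, 0, sizeof st);
      k = mbrtowc(&wc, in, MB_CUR_MAX, &st);
      if (k == (size_t)-1 || k == (size_t)-2) {
          *out = *in;           /* invalid: pass the byte through */
          return 1;
      }
      memset(&st, 0, sizeof st);
      return wcrtomb(out, (wchar_t)towupper(wc), &st);
  }

  int main(void)
  {
      char buf[MB_LEN_MAX];
      size_t n;
      setlocale(LC_CTYPE, "");
      n = mb_toupper_char("\xc3\xa9", buf);   /* é -> É in UTF-8 */
      fwrite(buf, 1, n, stdout);
      putchar('\n');
      return 0;
  }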

> Probably we need a configuration option to switch this on or off.

A good starting point would be a configuration option, the basic
autoconf tests and then much of the common stuff in string.c for things
like getting substrings and counting string lengths. Getting various
parts of the shell to work with multibyte character sets can then be
done piece by piece.
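
The shape I have in mind is roughly the following, with the macro name
invented for the sake of the example (it would be whatever configure
ends up defining):

  #include <string.h>
  #include <wchar.h>

  /* Hypothetical shape (the macro name is invented): one primitive,
   * "advance past one character", guarded by a configure-time option.
   * Everything in string.c that walks a string character by character
   * can then be converted to call it, one caller at a time. */
  static const char *nextchar(const char *s, const char *end)
  {
  #ifdef MULTIBYTE_SUPPORT
      mbstate_t st;
      size_t k;
      memset(&st, 0, sizeof st);
      k = mbrlen(s, end - s, &st);
      if (k == (size_t)-1 || k == (size_t)-2 || k == 0)
          k = 1;                /* invalid byte: step over it alone */
      return s + k;
  #else
      (void)end;
      return s + 1;             /* single-byte locales: one byte each */
  #endif
  }

Converting callers one at a time keeps the risk down, and the option
makes it trivial to disable on systems without the libc support.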

> Zle might be a bit more of a problem.  The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver.  So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
>   from the locale

That'll come from the locale (LC_CTYPE), but I think that by using
functions like mbrlen() you get libc to worry about that for you.
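
If we ever do need an explicit answer rather than letting mbrlen()
handle it implicitly, nl_langinfo(CODESET) gives us one after
setlocale():

  #include <langinfo.h>
  #include <locale.h>
  #include <stdio.h>
  #include <string.h>

  /* Ask libc, not the terminal: after setlocale(), the codeset name
   * tells us whether LC_CTYPE has selected UTF-8. */
  int main(void)
  {
      const char *cs;
      setlocale(LC_CTYPE, "");
      cs = nl_langinfo(CODESET);
      printf("codeset: %s (%sUTF-8)\n", cs,
             strcmp(cs, "UTF-8") ? "not " : "");
      return 0;
  }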

> - how UTF-8 encoded characters interfere with meta-bindings.  May be
>   good enough simply not to use these, at least while we work out what's
>   what

I suppose that is initially a question of finding out what something
like xterm generates for meta keys when in utf-8 mode. A quick test
with cat -v reveals empty square boxes as I type and things like `M-e'
when I look at the redirected file.

> - reading multi-byte characters --- timeouts and the like

> - getting the right length for displaying, deleting, copying
>   etc. multi-byte characters.  Apart from counting continuation

Note that there are multi-column characters to contend with as a
separate issue from multi-byte characters, though we already sort of
have these with things like ^[, which displays as two columns.
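
For the column-counting side of that, wcwidth() is presumably the right
tool. A rough sketch (the two-column fallback for unprintable
characters is just a guess at matching the current ^X-style display):

  #define _XOPEN_SOURCE 600     /* for wcwidth() */
  #include <locale.h>
  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  /* Columns are a third quantity, distinct from bytes and characters:
   * wcwidth() reports 2 for East Asian wide characters, 0 for
   * combining marks and -1 for unprintables (counted here as two
   * columns, guessing at the current ^X-style display). */
  static int mb_width(const char *s)
  {
      mbstate_t st;
      wchar_t wc;
      int w, cols = 0;
      size_t k, len = strlen(s);
      memset(&st, 0, sizeof st);
      while (len) {
          k = mbrtowc(&wc, s, len, &st);
          if (k == (size_t)-1 || k == (size_t)-2)
              return -1;        /* invalid in this locale */
          w = wcwidth(wc);
          cols += w < 0 ? 2 : w;
          s += k;
          len -= k;
      }
      return cols;
  }

  int main(void)
  {
      setlocale(LC_CTYPE, "");
      /* U+65E5 U+672C U+8A9E: 9 bytes, 3 characters, 6 columns */
      printf("%d\n", mb_width("\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"));
      return 0;
  }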

Oliver



