
Re: UTF-8 fonts

X-seq: zsh-workers 17730
From: Peter Stephenson <pws@xxxxxxx>
To: zsh-workers@xxxxxxxxxx (Zsh hackers list)
Subject: Re: UTF-8 fonts
Date: Wed, 25 Sep 2002 12:36:20 +0100
In-reply-to: "Borzenkov Andrey"'s message of "Wed, 25 Sep 2002 15:11:39 +0400." <6134254DE87BD411908B00A0C99B044F042E3E33@xxxxxxxxxxxxxxxxxxxxxxx>
Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm

Borzenkov Andrey wrote:
> Just to make it clear. Is the aim to use UTF-8 internally or to support
> (arbitrary) multibyte encoding?

The first with as much of the second as we can get in without too much
work.  My current plan is to rely on mbtowc/mblen to identify and match
multibyte characters in strings where we need to.  These are already
aware of the locale, so our only assumption is that characters without
the top bit set are ASCII; this is the major limitation on any non-UTF-8
multibyte encodings --- it's this feature of UTF-8 that makes it so
suitable for UNIX use.  With this assumption wide characters are
transparent to the vast majority of the shell and we only need to look
at the characters for comparisons between characters, lengths of strings
and testing the size for output.
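
As a rough illustration of that plan (a sketch only, not code from the zsh
sources), counting the characters in a string could look like the following.
The function name mb_char_count is hypothetical, and the choice to count an
undecodable byte as one character is an assumption made for the example.

```c
#include <stdlib.h>   /* mblen */
#include <string.h>   /* strlen */

/* Count characters (not bytes) in s under the current locale.
 * Assumption taken from the message: a byte with the top bit clear is ASCII,
 * so only bytes with the top bit set are handed to mblen(). */
size_t mb_char_count(const char *s)
{
    size_t n = strlen(s), i = 0, count = 0;

    (void)mblen(NULL, 0);                /* reset mblen's internal shift state */
    while (i < n) {
        if (((unsigned char)s[i] & 0x80) == 0) {
            i++;                         /* ASCII fast path, no locale lookup */
        } else {
            int len = mblen(s + i, n - i);
            /* Assumption for this sketch: an invalid or truncated sequence
             * is counted as a single one-byte character. */
            i += (len > 0) ? (size_t)len : 1;
        }
        count++;
    }
    return count;
}
```

For pure ASCII input the result equals strlen() in any locale; the locale only
matters once a byte with the top bit set turns up, and then only if the
program has called setlocale(LC_CTYPE, "") beforehand.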

> > See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> > the subject.
> > 
> > My first thought about using UTF-8 instead of eight bit characters
> 
> this sounds like you want to convert input to UTF-8 internally?

If you read the link, you will see that the plan as far as stream
applications are concerned is that the input is already UTF-8 and the
output will be treated as UTF-8.  So we don't do any conversion except
for the cases I mentioned.

> You also need to modify any place where shell compares or translates (upper
> <-> lower) characters.

We decided some time ago not to use strcoll, because it broke in some
nasty ways.  So it's now documented that we just use character positions
in the character set for comparisons.  This has generated far fewer
complaints (as far as I'm aware, none) than the previous version.  It
seems inevitable to extend this to multibyte characters.
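
One way to picture extending this to multibyte characters is to decode both
strings and compare the numeric value of each character instead of calling
strcoll. This is a hedged sketch: cmp_by_charpos and next_char are
hypothetical names, and falling back to the raw byte value for undecodable
input is an assumption of the example, not something decided in this thread.

```c
#include <string.h>   /* strlen, memset */
#include <wchar.h>    /* mbrtowc, mbstate_t */

/* Decode the next character of *p (at most *left bytes remain, including the
 * terminating NUL) and advance past it. */
static wchar_t next_char(const char **p, size_t *left, mbstate_t *st)
{
    wchar_t wc = L'\0';
    size_t len = mbrtowc(&wc, *p, *left, st);

    if (len == (size_t)-1 || len == (size_t)-2) {
        /* Undecodable input: use the raw byte value and reset the state. */
        memset(st, 0, sizeof *st);
        wc = (wchar_t)(unsigned char)**p;
        len = 1;
    } else if (len == 0) {
        len = 1;                         /* decoded the terminating NUL */
    }
    *p += len;
    *left -= len;
    return wc;
}

/* Compare by character value (position in the character set), ignoring the
 * locale's collation rules. Returns <0, 0 or >0 like strcmp. */
int cmp_by_charpos(const char *a, const char *b)
{
    size_t la = strlen(a) + 1, lb = strlen(b) + 1;   /* include the NUL */
    mbstate_t sa, sb;

    memset(&sa, 0, sizeof sa);
    memset(&sb, 0, sizeof sb);
    for (;;) {
        wchar_t wa = next_char(&a, &la, &sa);
        wchar_t wb = next_char(&b, &lb, &sb);

        if (wa != wb)
            return wa < wb ? -1 : 1;
        if (wa == L'\0')
            return 0;
    }
}
```

With this ordering "B" sorts before "a", because upper-case ASCII letters have
lower values than lower-case ones; that is exactly the kind of result strcoll
would have changed.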

> But this also means you cannot assume anything about current character set
> and cannot assume that it is transparent w.r.t. current string handling in
> zsh.

We are going to assume that bytes without the top bit set are ASCII, and
the remainder require mb* handling.

> How do you know your input (and strings you are processing) are UTF-8?
> Besides, standards do not provide a way to input multibyte character - you
> can only read wide character.

No, as I said above, the whole point of UTF-8 is that you can for the
most part just use normal strings.  I am not planning on supporting any
system that doesn't have this feature.

> > - determining whether the terminal is actually in UTF-8 mode, probably
> >   from the locale
> 
> Impossible. Locale names are just arbitrarily chosen strings; there is no
> "character set code" defined in any locale definition, at least on Unix.

Read the document at the link I gave, which suggests otherwise.  However,
I now think we can in any case leave this to the mb* suite to decide.
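
For reference, the usual way to ask the locale for its character set is the
POSIX call nl_langinfo(CODESET). The sketch below shows that check; the
function names are hypothetical, and accepting the spelling "utf8" alongside
"UTF-8" is an assumption about what some systems report.

```c
#include <langinfo.h>  /* nl_langinfo, CODESET (POSIX) */
#include <string.h>    /* strcmp */

/* Return 1 if the given codeset name denotes UTF-8, else 0.
 * Assumption: only the spellings "UTF-8" and "utf8" are recognised. */
int codeset_is_utf8(const char *name)
{
    if (name == NULL)
        return 0;
    return strcmp(name, "UTF-8") == 0 || strcmp(name, "utf8") == 0;
}

/* Return 1 if the current LC_CTYPE locale reports a UTF-8 codeset. */
int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

The answer is only meaningful after the program has called
setlocale(LC_CTYPE, ""); before that the process is still in the "C" locale.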

> > - reading multi-byte characters --- timeouts and the like
> 
> use standard OS interfaces to read wide characters.

No, because we are reading them as individual bytes.  If you don't have
the complete multibyte character you are pretty stuck unless you
interpret a partial character yourself, hence the problems with active
meta-characters in zle.
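
To make "interpret a partial character yourself" concrete, here is a minimal
sketch that assumes the input is UTF-8 as defined by RFC 3629 (sequences of at
most four bytes). It is not how zle does it; the names utf8_seq_len and
utf8_bytes_missing are hypothetical. Given the bytes read so far, it reports
how many more are needed before the character is complete.

```c
#include <stddef.h>

/* Length of the UTF-8 sequence introduced by lead byte c, or 0 if c cannot
 * start a sequence (a continuation byte or a byte never used in UTF-8). */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80)
        return 1;                        /* ASCII */
    if (c >= 0xC2 && c <= 0xDF)
        return 2;
    if (c >= 0xE0 && c <= 0xEF)
        return 3;
    if (c >= 0xF0 && c <= 0xF4)
        return 4;
    return 0;
}

/* buf holds the n bytes read so far for one character.
 * Returns the number of bytes still missing (0 means complete), or -1 if the
 * bytes cannot be the start of a UTF-8 sequence. Only the lead byte and the
 * continuation-byte pattern are checked, not overlong forms or surrogates. */
int utf8_bytes_missing(const unsigned char *buf, size_t n)
{
    size_t need, i;

    if (n == 0)
        return 1;                        /* nothing read yet */
    need = utf8_seq_len(buf[0]);
    if (need == 0 || n > need)
        return -1;
    for (i = 1; i < n; i++)
        if ((buf[i] & 0xC0) != 0x80)     /* continuation bytes are 10xxxxxx */
            return -1;
    return (int)(need - n);
}
```

A byte-at-a-time reader can call utf8_bytes_missing after each byte and keep
waiting, subject to whatever timeout applies, while the result is positive.
The locale-aware equivalent is mbrtowc, which returns (size_t)-2 when the
bytes seen so far form an incomplete character.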

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070


Follow-Ups:
- Re: UTF-8 fonts
  - From: Oliver Kiddle
- Re: UTF-8 fonts
  - From: Nadav Har'El

References:
- RE: UTF-8 fonts
  - From: Borzenkov Andrey
