Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: UTF-8 support

X-seq: zsh-workers 20447
From: David Gómez <david@xxxxxxxxxxxx>
To: Oliver Kiddle <okiddle@xxxxxxxxxxx>
Subject: Re: UTF-8 support
Date: Mon, 4 Oct 2004 18:08:58 +0200
Cc: David Gómez <david@xxxxxxxxxxxx>, Zsh-workers <zsh-workers@xxxxxxxxxx>
In-reply-to: <23473.1096659965@xxxxxxxxxxxxxxxxxxxxx>
Mail-followup-to: Oliver Kiddle <okiddle@xxxxxxxxxxx>, David Gómez <david@xxxxxxxxxxxx>, Zsh-workers <zsh-workers@xxxxxxxxxx>
Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
References: <20041001184122.GA9094@fargo> <23473.1096659965@xxxxxxxxxxxxxxxxxxxxx>

Hi Oliver ;),

> > So I conclude from your response that nobody is working on it ;).
> > I understand the time problem, everybody is short on time, including
> 
> Nothing has been done. A few people may have done some work that was
> never posted.

Is it still possible to find that work ;)?

> I got as far as reading up, thinking about what the right
> approach would be and adding support for stuff like the following to
> print characters given their unicode code point:
>   echo '\u20ac'
> It seemed a good point to start because it'll be useful for testing.

Yes, being able to give Unicode code points as input to echo is
useful for testing. I've used it to do some testing myself ;)
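
For illustration, here is a minimal sketch of the conversion that an
escape such as '\u20ac' implies: turning a code point into UTF-8 bytes.
The function name utf8_encode is hypothetical, and this is not a claim
about how zsh implements the escape; it only shows the encoding step,
assuming the output encoding is UTF-8.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above U+10FFFF).
 * Hypothetical helper for illustration; not taken from the zsh source. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                     /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                  /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* 11110xxx followed by 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x20AC, buf) fills buf with the three bytes
E2 82 AC, which is the euro sign in UTF-8.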

> Most parts of the source will need work but it is possible to add
> support in individual areas. So don't start with completion, find
> something simple like the print builtin (in particular -c and -C
> options).

I see: to split its arguments into columns, the print builtin needs to
know the real width of each one when the input is UTF-8.
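
A rough sketch of why this matters: padding computed from the byte
length comes out too short for any string with non-ASCII characters.
The helpers below (utf8_char_count and column_padding, both hypothetical
names) count characters rather than bytes. They assume valid UTF-8 and
one column per character, so double-width and combining characters would
still need extra handling.

```c
#include <stddef.h>

/* Count characters in a UTF-8 string by skipping continuation bytes
 * (bytes of the form 10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Number of spaces needed to pad s to a column `width` characters wide.
 * Returns 0 when s already fills or overflows the column. */
size_t column_padding(const char *s, size_t width)
{
    size_t len = utf8_char_count(s);

    return len < width ? width - len : 0;
}
```

For the 6-byte string "\xe2\x82\xac" "uro" (the euro sign followed by
"uro"), utf8_char_count returns 4, so an 8-character column needs 4
spaces of padding, where a strlen-based calculation would give only 2.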

> Builtins in general are simple because they are relatively
> self-contained. If you try to attack zle first, you'll just get fed up
> with it being too hard.

I think you're totally right. zle is too hard for a start, given that I
have no experience with the zsh source. I'll take a look at the print
builtin and play a bit with the zsh code to learn more.

> done, another idea for something simple would be to add a Test/U01 test
> and add code to make it search for a UTF-8 locale ($langinfo[CODESET] in
> the langinfo module will help) 

Good, I didn't know about that module ;)
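
At the C level, $langinfo[CODESET] corresponds to nl_langinfo(CODESET).
A minimal sketch of a "is this a UTF-8 locale?" check follows. The
function names codeset_is_utf8 and locale_is_utf8 are hypothetical, and
the loose matching is an assumption made because the exact spelling of
the codeset name may differ between systems.

```c
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>

/* Return 1 if a codeset name denotes UTF-8, else 0. The comparison
 * ignores case and hyphens, so "UTF-8", "utf-8" and "utf8" all match.
 * Hypothetical helper for illustration. */
int codeset_is_utf8(const char *codeset)
{
    const char *want = "utf8";

    for (; *codeset; codeset++) {
        if (*codeset == '-')
            continue;                    /* skip hyphens in the name */
        if (*want == '\0' || tolower((unsigned char)*codeset) != *want)
            return 0;                    /* extra or mismatching character */
        want++;
    }
    return *want == '\0';                /* all of "utf8" must be consumed */
}

/* Adopt the LC_CTYPE locale from the environment and test its codeset. */
int locale_is_utf8(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A test script could use the same idea: look for a locale whose codeset
is UTF-8 and skip the test if none is available.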

> The source and comments are the only documentation I know of but you can
> always ask on the list.

Thanks, I will ;)

> Do you know much about unicode/UTF-8? For the
> minimum, read http://www.joelonsoftware.com/articles/Unicode.html
> and then read http://www.cl.cam.ac.uk/~mgk25/unicode.html

I knew a bit. But I've been reading your links these past few days and
have refreshed my rusty UTF-8 concepts ;).

> In my opinion it would be sensible to support multibyte encodings in
> general and not just UTF-8.

I think the reason for using UTF-8 is to not have to use any other
encodings at all, so in my opinion adding support for other multibyte
encodings wouldn't be needed. But, on the other hand, using the mbs*
functions from libc would make it easy to support whatever multibyte
encoding the current locale has selected.

> stateful encodings. There are a few characters which are defined to
> display as double width even in proportional fonts so keep that in mind.

In which scripts do these characters occur?

> You can detect whether UTF-8 is enabled with the C library's locale
> functions but we shouldn't need to: functions such as mbrlen do all the
> work for us.

Shouldn't mbrlen and company only be used when a UTF-8 locale is selected?
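
To make the question concrete, here is a small sketch of stepping
through a string with mbrlen. The function name mb_char_count is
hypothetical. mbrlen follows the encoding of the current LC_CTYPE
locale, so the same loop works whether that is UTF-8, another multibyte
encoding, or a single-byte one, which is the sense in which no explicit
UTF-8 check is needed.

```c
#include <string.h>
#include <wchar.h>

/* Count the characters in s according to the current locale's
 * multibyte encoding. Returns (size_t)-1 if s contains an invalid or
 * incomplete multibyte sequence.
 * Hypothetical helper for illustration; not taken from the zsh source. */
size_t mb_char_count(const char *s)
{
    mbstate_t st;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);           /* start in the initial shift state */
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;           /* invalid or truncated sequence */
        if (n == 0)
            break;                       /* NUL character; defensive only */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

In the default "C" locale, mb_char_count("abc") is 3. After
setlocale(LC_CTYPE, "") in a UTF-8 locale, a string holding only the
three-byte euro sign should count as one character.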

Thanks,

-- 
David Gómez

"The question of whether computers can think is just like the question of
whether submarines can swim." -- Edsger W. Dijkstra

Follow-Ups:
- Re: UTF-8 support
  - From: Oliver Kiddle
- Re: UTF-8 support
  - From: Clint Adams

References:
- Re: UTF-8 support
  - From: David Gómez
- Re: UTF-8 support
  - From: Oliver Kiddle
