Zsh Mailing List Archive

Re: UTF-8 support

X-seq: zsh-workers 20455
From: Peter Stephenson <pws@xxxxxxx>
To: Zsh-workers <zsh-workers@xxxxxxxxxx>
Subject: Re: UTF-8 support
Date: Tue, 05 Oct 2004 12:32:01 +0100
In-reply-to: <29214.1096974092@xxxxxxxxxxxxxxxxxxxxx>
Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
References: <20041001184122.GA9094@fargo> <23473.1096659965@xxxxxxxxxxxxxxxxxxxxx> <200410041620.i94GKNro006000@xxxxxxxxxxxxxx> <29214.1096974092@xxxxxxxxxxxxxxxxxxxxx>

Oliver Kiddle wrote:
> If you want to find a short string in a long string you can surely
> metafy the short string instead of unmetafying the long string.

Both strings are likely to be metafied internally anyway, but that
doesn't help if you're using the library routines for comparisons, since
they don't know about Meta characters.  And because you don't know where
a character ends, you also can't tell at which byte two characters
differ without using library functions.  Unless you guess where the
character ends, you need the entire string from the first multibyte
character onwards in the representation used by the library.
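
As a rough illustration of why a metafied string can't be handed to the
library routines directly, here is a minimal sketch of unmetafying in C.
It assumes zsh's convention that a special byte is stored as Meta (0x83)
followed by the original byte XOR 32.  META_BYTE and sketch_unmetafy are
hypothetical names for this example, not the shell's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a special byte is stored as Meta (0x83) followed by
 * the original byte XOR 32. */
#define META_BYTE 0x83

/* Hypothetical helper: rewrite a metafied, NUL-terminated string in
 * place into raw bytes and return the raw length in bytes. */
static size_t sketch_unmetafy(char *s)
{
    assert(s != NULL);
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = (unsigned char *)s;

    while (*in) {
        if (*in == META_BYTE && in[1] != '\0') {
            /* Drop the Meta marker and restore the original byte. */
            *out++ = (unsigned char)(in[1] ^ 32);
            in += 2;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
    return (size_t)(out - (unsigned char *)s);
}
```

Only after a step like this could the buffer be passed to the multibyte
library functions.  The returned length matters because the raw bytes
may themselves contain a NUL.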

Indeed, unless we start with some assumption about the encoding we have
to compare every single character with library functions on an
unmetafied string.  This is very messy if we have to support systems
where the library functions aren't available (and we break quite a lot
unless we do that).  So, while I can't say for sure, I strongly suspect
we're going to end up having to make some of the assumptions which are
already encoded into the library.  Thus some kind of hybrid is forced on
us for practical reasons.  Given this, I suspect that assuming UTF-8 and
avoiding the library functions where we don't need them is actually
going to be the neatest approach.  However, this remains to be seen.
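
To show what "assuming UTF-8 and avoiding the library" could look like,
here is a minimal sketch that finds where a character ends from the
lead byte alone.  sketch_utf8_len is a hypothetical name; the sketch
checks only the lead byte and the continuation bytes, and does not
reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence starting
 * at s, given avail readable bytes.  Returns 0 if the sequence is
 * malformed or truncated. */
static size_t sketch_utf8_len(const unsigned char *s, size_t avail)
{
    assert(s != NULL);
    if (avail == 0)
        return 0;

    size_t len;
    if (s[0] < 0x80)
        len = 1;                    /* 0xxxxxxx: ASCII */
    else if ((s[0] & 0xE0) == 0xC0)
        len = 2;                    /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0)
        len = 3;                    /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0)
        len = 4;                    /* 11110xxx */
    else
        return 0;                   /* continuation or invalid lead byte */

    if (len > avail)
        return 0;                   /* truncated sequence */

    /* Every following byte must look like 10xxxxxx. */
    for (size_t i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return len;
}
```

With something like this, two strings can be compared a character at a
time with no locale or library call, at the cost of being wrong for any
encoding other than UTF-8.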

I can't see an advantage in assuming UTF-8 and then relying on the
library for comparisons etc.  This seems to give the worst of both
worlds.

> The approach I was suggesting has the big advantage that we can add
> support in isolated areas without first breaking the entire shell.

That can be done however we decide, at least if we keep the current Meta
scheme.  Indeed, that's probably the way to go; we can experiment with
different methods locally before altering the rest of the shell.  The
pattern code is probably the most time-critical for comparing multibyte
characters.  Maybe this is a good time to look at removing the
requirement for NUL-terminated strings after all.

> mblen may be easy to reimplement but wcwidth is not so we'd end up
> with a mixture.

Yes, we certainly need library calls in zle.  However, formatting
strings for interactive output doesn't need to go particularly fast.
As I said, I think that in practice we're stuck with a mixture anyway.
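
For the zle side, a minimal sketch of the library route might look like
the following, using the standard mbrtowc() and POSIX wcwidth() calls.
sketch_display_width is a hypothetical name, and the sketch assumes an
unmetafied, NUL-terminated string in the current locale's encoding; the
caller would need setlocale(LC_CTYPE, "") first for anything beyond
ASCII.

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() on glibc */
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of terminal columns needed to display s
 * in the current locale, or -1 if s contains an invalid or incomplete
 * multibyte sequence or a non-printable character. */
static int sketch_display_width(const char *s)
{
    assert(s != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);      /* initial conversion state */
    size_t left = strlen(s);
    int cols = 0;

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or incomplete sequence */
        if (n == 0)
            break;                  /* embedded NUL: stop */
        int w = wcwidth(wc);
        if (w < 0)
            return -1;              /* non-printable character */
        cols += w;
        s += n;
        left -= n;
    }
    return cols;
}
```

This is the slow path, one library call or two per character, which is
tolerable for formatting interactive output.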

pws


