Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Some groundwork for Unicode in Zle



> Peter wrote:
> > from how the line is encoded internally.  We can use wchar_t inside and
> > pass back a multibyte string.
> 
> Good to see this being addressed. How do you plan to cope with encoding
> nulls if you use wchar_t? (or does zle not bother?) The whole meta stuff
> is what really scared me off ever touching this.

Do you mean null-terminating a string?  I don't think we need that
inside ZLE, though it's easy to get into confusion with the conversions
needed for completion.  Or do you mean difficulties with L'\0'?

> > I've made a very dull patch that does a few things that might make adding
> > Unicode support to Zle easier.  Actually, I think within Zle it should
> > be easy to use generic wchar_t's and not worry about whether they're
> > really Unicode, but I still propose to rely on __STDC_ISO_10646__ to
> 
> Why? Relying on __STDC_ISO_10646__ will rule out a good number of
> systems that do otherwise have good support for multibyte encodings such
> as UTF-8. __STDC_ISO_10646__ is defined on surprisingly few systems. We
> really don't care about what wchar_t is internally if we let libc do our
> conversions.

This definition means we can use wchar_t in a way natural for Unicode
(yes, it doesn't matter if that really is Unicode or something else)
without worrying, and likewise there is compiler support.  For example,
it means L'\0' etc. works.

I'm really not interested in fudging round that sort of thing at this
stage, nor random #ifdef fudges.  Anything I do is likely to be on
Linux.  If someone else once to see what the effect of relaxing the test
is and how it can be fixed up, if necessary, I will be delighted.  For
me it's an unnecessary complication stopping me getting something going.

> > Before I get to details of what I've patched so far, one question: how
> > do we turn input into characters?  My first thought was to do it at a low
> > level around getkey, possibly in getkeybuf which already does
> 
> That would seem more sensible to me. Allowing partial multi-byte
> sequences to be bound is not very nice and probably not very useful.

I'm not sure I agree.  Suppose you have a Meta-sequence bound, i.e. a
binding for some ASCII character with the high bit set.  This conflicts
with a fully working UTF-8 input system, but you may well not have that,
just a system which happens to boot up with a UTF-8 locale.
Implementing only wchar_t based lookups will break this completely.
That seems pretty fatal to me.

> > The actual Unicode-related changes are minimal.  system.h shows how I
> 
> Did you mean to attach an actual patch?

No, it's huge with all the changes of names.

> > #if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined (__STDC_ISO_10646__)
> > # include <wchar.h>
> 
> You'll probably want to include wchar.h even if __STDC_ISO_10646__ is
> not defined. For \u/\U, wchar_t was only useful when converting from
> unicode to wchar_t could be done trivially: when __STDC_ISO_10646__ is
> defined. It otherwise uses iconv or a hardcoded UTF-8 conversion. For
> zle, I can't think of any instance where you would care whether whar_t
> is unicode.

You may well be right, but again it's just another unnecessary
complication at this stage and I don't have a system to test the
effect.

> > /*
> >  * More stringent requirements to enable complete Unicode conversion
> >  * between wide characters and multibyte strings.
> >  */
> > #if defined(HAVE_MBTOWC)
> > /*#define ZLE_UNICODE_SUPPORT	1*/
> 
> I don't quite follow the logic of that check.
> 
> I wouldn't have thought ZLE_UNICODE_SUPPORT is a good name for the
> define. The requirement is to support multibyte character encodings, not
> specifically "unicode" and the same define will probably be extended to
> areas outside of zle. How about ENABLE_MULTIBYTE, perhaps linked to a
> configure --disable-multibyte option.

The *requirement* is to support Unicode, in particular UTF-8.  The
*hope* is to be able to allow other schemes without much work.  The
whole point of Unicode is to replace other schemes anyway, and they are
unlikely ever to be well-tested even if they work.

I'm strongly of the opinion we should stick with multibyte strings
outside Zle, and continue to have the interface to ZLE pass back such a
string.  I think the difficulties and costs of extending the use of
wchar_t outside ZLE would be prohibitive for very little gain.  So
this definition does apply only to ZLE, and (although this is not
necessarily all it does) is specifically targeted at making Unicode
schemes work.

> > typedef wchar_t *ZLE_STRING_T;
> > #else
> > typedef int ZLE_CHAR_T;
> 
> Why int and not unsigned char? Is it really worth having the separate
> STRING type? Again, I wouldn't use "ZLE" in the name given that we may
> want to use it outside zle someday.

This is uncontroversial; it's what we do at the moment.  Functions
returning a single character return an int and functions dealing with a
string use an unsigned char *.  The separate definitions wouldn't be
necessary if we just had wchar_t arrays, it's the backward compatibility
that makes it necessary.

As I said, I have no intention of ever using this scheme outside ZLE.

> > All the tests still pass, so I will commit this some time today.
> 
> Would it be worth creating a separate branch for multibyte support? It
> could later become 4.3. If so I'd suggest we continue to commit
> everything non-multibyte related to the current branch to avoid the old
> issue of the current release being very old.

(We'd probably do it the other way --- create a separate stable 4.2
branch so the new one was a mainline.)  It may become necessary, but
this will be just another way of slowing things down, so I'd like to
avoid it until things become too hairy.

It would be good at least to get 4.2.2 out of the way before anything
more than groundwork appears, though.

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


**********************************************************************
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager.

This footnote also confirms that this email message has been swept by
MIMEsweeper for the presence of computer viruses.

www.mimesweeper.com
**********************************************************************



Messages sorted by: Reverse Date, Date, Thread, Author