Zsh Mailing List Archive

Some groundwork for Unicode in Zle



It seems clear that the line editor is where most people miss Unicode
support, so I suggest we start from there and work back.  It's
relatively self-contained in that we can protect the rest of the shell
from how the line is encoded internally.  We can use wchar_t inside and
pass back a multibyte string.

I've made a very dull patch that does a few things that might make adding
Unicode support to Zle easier.  Actually, I think within Zle it should
be easy to use generic wchar_t's and not worry about whether they're
really Unicode, but I still propose to rely on __STDC_ISO_10646__ to
assure us that we have a suitable environment, since doing without that
guarantee opens a huge can of worms.  (If it seems easy to fix up
afterwards, fine, but it could be a lot of work for ever-decreasing
gain.)

Before I get to details of what I've patched so far, one question: how
do we turn input into characters?  My first thought was to do it at a low
level around getkey, possibly in getkeybuf which already does
metafication.  This would loop until it picked up enough bytes for a
wide character, and would return that.  Then essentially all higher
level uses of characters in ZLE would be based around wchar_t, including
looking up keys.  (This would mean much greater use of sparse keymaps,
though we could keep a dense keymap for the first 128 characters and not
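lose much efficiency.)  The advantage is that this is transparent to
input systems that handle multibyte strings properly.  The disadvantage
is that with other 8-bit characters, ones that aren't part of a valid
multibyte sequence, the loop can get stuck waiting for bytes that never
arrive.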
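
The dense-plus-sparse keymap idea can be sketched as below.  This is
only an illustration under stated assumptions: every name in it
(wkeymap, sparse_entry, binding_t, wkeymap_lookup) is hypothetical and
zsh's real keymap structures differ.  It assumes the sparse part is a
sorted array searched by bisection; a hash table would do equally well.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* All names here are hypothetical; zsh's real keymaps look different. */
typedef int binding_t;		/* stand-in for a pointer to a widget */
#define UNBOUND 0

struct sparse_entry {
    wchar_t key;
    binding_t bind;
};

struct wkeymap {
    binding_t dense[128];		/* direct index for ASCII */
    const struct sparse_entry *sparse;	/* sorted by key, ascending */
    size_t nsparse;
};

/* Look up the binding for one wide character. */
static binding_t
wkeymap_lookup(const struct wkeymap *km, wchar_t c)
{
    size_t lo = 0, hi;

    assert(km != NULL);
    /* The cast keeps the test correct whether wchar_t is signed or not. */
    if ((unsigned long)c < 128)
	return km->dense[c];

    /* Everything else: binary search in the sorted sparse table. */
    hi = km->nsparse;
    while (lo < hi) {
	size_t mid = lo + (hi - lo) / 2;

	if (km->sparse[mid].key == c)
	    return km->sparse[mid].bind;
	if (km->sparse[mid].key < c)
	    lo = mid + 1;
	else
	    hi = mid;
    }
    return UNBOUND;
}
```

ASCII keys cost a single array index, as they do with a dense keymap
today; only characters at 128 and above pay for the search.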

To get around that it would be possible to keep the input as multibyte
strings until the keymap lookup.  That's much more conservative and
makes it easier to handle older input systems, bind single-byte
characters with the high bit set (if you still want to), etc.  Then
we possibly need some smart way of doing self-insert.  For example, it
could be made to test pending input for a complete multibyte character
and convert it.  I'm still not sure how to test whether a multibyte
string is invalid rather than incomplete.
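
One possible way to tell the two cases apart is the restartable
function mbrtowc() from <wchar.h>, which returns (size_t)-2 for an
incomplete but so far valid sequence and (size_t)-1 for an invalid one.
The sketch below assumes mbrtowc() is available, which is more than the
wctomb/mbtowc pair the patch tests for.  The names mb_status and
mb_classify are hypothetical.

```c
#include <assert.h>
#include <locale.h>	/* setlocale(): the caller must select a locale first */
#include <string.h>
#include <wchar.h>

/* Hypothetical names for illustration; not part of the zsh source. */
enum mb_status { MB_COMPLETE, MB_INCOMPLETE, MB_INVALID };

/*
 * Examine the first len pending input bytes in the current locale.
 * On MB_COMPLETE, *wc holds the character and *used the number of
 * bytes it occupied.
 */
static enum mb_status
mb_classify(const char *buf, size_t len, wchar_t *wc, size_t *used)
{
    mbstate_t st;
    size_t ret;

    assert(buf != NULL && wc != NULL && used != NULL);
    if (len == 0)
	return MB_INCOMPLETE;

    memset(&st, 0, sizeof(st));	/* start from the initial shift state */
    ret = mbrtowc(wc, buf, len, &st);
    if (ret == (size_t)-2)
	return MB_INCOMPLETE;	/* valid so far, needs more bytes */
    if (ret == (size_t)-1)
	return MB_INVALID;	/* can never become a character */
    *used = ret ? ret : 1;	/* a null byte converts with return 0 */
    return MB_COMPLETE;
}
```

A self-insert along these lines could keep reading while the result is
MB_INCOMPLETE and fall back to treating the bytes individually on
MB_INVALID.  The result depends on the locale in effect, so
setlocale(LC_CTYPE, "") needs to have been called beforehand.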


The present change:

The vast majority of the patch is simply to get rid of the lies in the
header about the names of variables.  cs and ll are now zlecs and zlell
throughout instead of being #define'd (they used to be zshcs and zshll
in the definition but I thought the new names were more consistent).
Also, the zle pointers are called by name instead of by a macro
defined to be the name of the zle function.  This isn't directly related
to Unicode but has been annoying me for ages.

The actual Unicode-related changes are minimal.  system.h shows how I
suggest deciding whether to compile Unicode support into zle.  At the
minimum we need wctomb and mbtowc as well as C support.  (For future
sophistication we will want wcwidth etc. but we can build a working
system for many character sets without it.)  ZLE_UNICODE_SUPPORT will be
defined to 1 when the conditions are met; for now the #define itself is
still commented out.  The header chunk looks like this.  Part of it is
moved from utils.c.

/*
 * This is a subset of ZLE_UNICODE_SUPPORT.  It is not all that likely
 * that only the subset is supported, however it's easy to make the
 * \u and \U escape sequences work with just the following.
 */
#if defined(HAVE_WCHAR_H) && defined(HAVE_WCTOMB) && defined (__STDC_ISO_10646__)
# include <wchar.h>

/*
 * More stringent requirements to enable complete Unicode conversion
 * between wide characters and multibyte strings.
 */
#if defined(HAVE_MBTOWC)
/*#define ZLE_UNICODE_SUPPORT	1*/
#endif
#else
# ifdef HAVE_LANGINFO_H
#   include <langinfo.h>
#   if defined(HAVE_ICONV) || defined(HAVE_LIBICONV)
#     include <iconv.h>
#   endif
# endif
#endif

#ifdef ZLE_UNICODE_SUPPORT
typedef wchar_t ZLE_CHAR_T;
typedef wchar_t *ZLE_STRING_T;
#else
typedef int ZLE_CHAR_T;
typedef unsigned char *ZLE_STRING_T;
#endif
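
As an illustration of how editor code might be written against these
typedefs, here is a minimal sketch of inserting one element into the
line.  The helper zle_insert_elem is hypothetical and not part of the
patch; the typedefs are repeated only so that the fragment compiles on
its own, with the non-Unicode branch selected as in the patch.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Left disabled, as in the patch; enable to try the wide build. */
/* #define ZLE_UNICODE_SUPPORT 1 */

#ifdef ZLE_UNICODE_SUPPORT
typedef wchar_t ZLE_CHAR_T;
typedef wchar_t *ZLE_STRING_T;
#else
typedef int ZLE_CHAR_T;
typedef unsigned char *ZLE_STRING_T;
#endif

/*
 * Hypothetical helper: insert c at index cs of a line holding *ll
 * elements.  The buffer must have room for one more element.
 */
static void
zle_insert_elem(ZLE_STRING_T line, int *ll, int cs, ZLE_CHAR_T c)
{
    assert(line != NULL && ll != NULL && cs >= 0 && cs <= *ll);
    /*
     * Scale the byte count by the element size, so the same source
     * is right for both unsigned char and wchar_t elements.
     */
    memmove(line + cs + 1, line + cs,
	    (size_t)(*ll - cs) * sizeof(*line));
    line[cs] = c;
    (*ll)++;
}
```

The point of the sketch is the sizeof(*line) factor: code that passes
raw element counts to memmove or memcpy is correct for bytes but
silently wrong once the elements become wchar_t.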

The other change is that I have made the variable "line" local to Zle.
This required adding the function zlegetline to return the line so far.
This is currently trivial, but will eventually look like part of the return
sequence from zlegetline.  As we need all the help we can get, I have
renamed "line" to "zleline": a local variable "line" is used internally
in many places in the completion code, and the word occurs in all sorts
of comments, so it was hard to locate uses of the variable.
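
Once the line is held as wide characters, handing it back to the rest
of the shell would presumably involve a conversion to a multibyte
string.  The following is a rough sketch of that step only, not the
real zlegetline: the name wline_to_mb is hypothetical, it assumes
wcrtomb() is available, and it ignores metafication entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the first len wide characters of wline
 * to a newly allocated, null-terminated multibyte string in the
 * current locale.  Returns NULL if allocation fails or a character
 * has no representation.  The caller frees the result.
 */
static char *
wline_to_mb(const wchar_t *wline, size_t len)
{
    mbstate_t st;
    char *out, *p;
    size_t i, n;

    assert(wline != NULL || len == 0);
    memset(&st, 0, sizeof(st));

    /* Each character, plus the terminator, needs at most MB_LEN_MAX. */
    out = malloc((len + 1) * MB_LEN_MAX);
    if (!out)
	return NULL;

    p = out;
    for (i = 0; i < len; i++) {
	n = wcrtomb(p, wline[i], &st);
	if (n == (size_t)-1) {
	    free(out);
	    return NULL;
	}
	p += n;
    }
    /* Converting L'\0' emits any closing shift sequence and the null. */
    if (wcrtomb(p, L'\0', &st) == (size_t)-1) {
	free(out);
	return NULL;
    }
    return out;
}
```

For example, wline_to_mb(L"echo hi", 7) yields the string "echo hi"
in any locale, since those characters are all in the portable set.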

Apart from the fact that no one understands how the command line is used
inside the completion code, there is also the problem that zlecs and
zlell (cursor position and line length) are exposed in lex.c and hist.c
for use when analysing a line for completion.  This will be a problem
when ZLE is measuring in characters and lex.c in bytes.  I tried to
separate out the variables into lexcs and lexll, but didn't get it to
work.
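
To make the mismatch concrete: a cursor position counted in characters
only maps to a byte offset by adding up the encoded length of every
character before it.  The sketch below shows that calculation under
the assumption that wcrtomb() is available; the function name
wcs_to_byte_offset is hypothetical, and it ignores the extra bytes
that metafication would add.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>	/* setlocale(): the caller must select a locale first */
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: given a cursor position cs counted in wide
 * characters, return the matching offset in bytes into the multibyte
 * form of the line, or (size_t)-1 if some character cannot be
 * converted in the current locale.
 */
static size_t
wcs_to_byte_offset(const wchar_t *wline, size_t cs)
{
    mbstate_t st;
    char scratch[MB_LEN_MAX];
    size_t i, n, bytes = 0;

    assert(wline != NULL || cs == 0);
    memset(&st, 0, sizeof(st));
    for (i = 0; i < cs; i++) {
	/* Only the length matters; the bytes themselves are discarded. */
	n = wcrtomb(scratch, wline[i], &st);
	if (n == (size_t)-1)
	    return (size_t)-1;
	bytes += n;
    }
    return bytes;
}
```

For pure ASCII the two measures agree, which is why the current code
gets away with sharing the variables.  In a UTF-8 locale they diverge
as soon as the line contains a character outside ASCII.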

All the tests still pass, so I will commit this some time today.

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

