Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Silent UTF-8 assumption?



Andrey Borzenkov wrote:
> This caught my attention:
> 
> static wchar_t
> charref(char *x, char *y)
> {
>     wchar_t wc;
>     size_t ret;
> 
>     if (!(patglobflags & GF_MULTIBYTE) || !(STOUC(*x) & 0x80))
>         return (wchar_t) STOUC(*x);
> 
> Well, this is definitely not valid for an arbitrary multibyte
> character set.

We're not using an arbitrary character set; we're using one that has the
portable character set (i.e. ASCII) as a 7-bit subset, and that shares
with UTF-8 the property that any true multibyte stream has the eighth bit
set in all octets.  That's entirely for the practical reason that, if we
don't make that assumption, all hell will break loose, because we would
have to make *every* part of the shell that ever tests a character, even
an ASCII character, multibyte aware.

There's a good chance the multibyte character set in question is UTF-8,
but it doesn't necessarily have to be.
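
To make the assumption concrete, here is a minimal sketch of the fast
path it allows.  This is not the rest of charref(): the name first_char()
and its signature are made up for illustration, and the fallback on a
conversion error is an assumption of mine.  Only the test of the eighth
bit mirrors the quoted code; the slow path just calls the standard
mbrtowc().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of zsh): return the first character of
 * the len bytes at s and store the number of bytes used in *consumed.
 */
static wchar_t
first_char(const char *s, size_t len, size_t *consumed)
{
    wchar_t wc;
    mbstate_t st;
    size_t ret;

    assert(s != NULL && consumed != NULL && len > 0);

    /*
     * Fast path: eighth bit clear.  Under the assumption described
     * above, such a byte can only be a complete ASCII character, never
     * part of a multibyte sequence, so no conversion is needed.
     */
    if (!((unsigned char)*s & 0x80)) {
        *consumed = 1;
        return (wchar_t)(unsigned char)*s;
    }

    /* Slow path: let the current locale decode the sequence. */
    memset(&st, 0, sizeof st);
    ret = mbrtowc(&wc, s, len, &st);
    if (ret == (size_t)-1 || ret == (size_t)-2 || ret == 0) {
        /* Invalid or incomplete input: fall back to the raw byte
         * (an assumption of this sketch). */
        *consumed = 1;
        return (wchar_t)(unsigned char)*s;
    }
    *consumed = ret;
    return wc;
}
```

Code that only ever looks for ASCII characters such as '/' or '*' can then
keep comparing single bytes.  UTF-8 has this property, and so do EUC-style
encodings such as EUC-JP.  An encoding like Shift_JIS, where the second
byte of a character can fall in the ASCII range, would break it.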

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070




