Zsh Mailing List Archive

Re: zsh doesn't understand some multibyte characters



On Fri, May 15, 2015 at 01:43:45AM +0900, Jun T. wrote:

> 
> 2015/05/14 03:29, Danek Duvall <duvall@xxxxxxxxxxxxxx> wrote
> > 
> > If I set
> > 
> >    comb_acute_mb[] = { (char)0xe2, (char)0x80, (char)0xa6 };
> > 
> > in the test, it thinks that character's wcwidth() is 2, not 1.
> 
> U+2026 is one of the characters whose "East Asian Width" property
> is set to "Ambiguous". Widths of these characters are *really* ambiguous;
> in western (monospaced) fonts they have a single width,
> while in (most of?) CJK fonts they have double width.
> 
> Usually, wcwidth() returns 1 for these characters so they are not
> displayed correctly in CJK fonts, unless applications take special care of
> them. For example, xterm has an option -cjk to handle this problem.
> 
> Your report indicates that Solaris is one of the rare systems in
> which wcwidth() returns 2 for U+2026.
> 
> Are there any fonts in which U+2026 has double width on Solaris?

Likely, but I don't know for sure, and I'm not sure how to tell.

As one of our globalization folks explained in a long-open bug against
Solaris' "broken" wcwidth(), we currently have a single width table, and
the ambiguous-width characters all(?) come back as width 2.  They're
proposing two tables, switched based on the locale -- if you're in an east
Asian locale, you'll get 2 for these, and otherwise 1, similarly to the way
that gnome-terminal uses VTE_CJK_WIDTH.
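A minimal sketch of the "two tables, switched on the locale" idea, reduced to the single decision it hinges on. This is not the Solaris implementation: the function names are hypothetical, and treating exactly the ja, ko and zh language codes as east Asian is an assumption made for illustration.

```c
#include <string.h>

/* Hypothetical helper: nonzero if the locale name starts with one of the
 * language codes assumed here to mean "east Asian" (ja, ko, zh). */
int is_east_asian_locale(const char *name)
{
    static const char *const langs[] = { "ja", "ko", "zh" };
    size_t i;

    if (name == NULL)
        return 0;
    for (i = 0; i < sizeof langs / sizeof langs[0]; i++) {
        /* Match "ja", "ja_JP", "ja.UTF-8", "ja@mod", but not "jam". */
        if (strncmp(name, langs[i], 2) == 0 &&
            (name[2] == '\0' || name[2] == '_' ||
             name[2] == '.'  || name[2] == '@'))
            return 1;
    }
    return 0;
}

/* Width to report for an ambiguous-width character under this policy:
 * 2 in an east Asian locale, 1 everywhere else. */
int ambiguous_width_for_locale(const char *name)
{
    return is_east_asian_locale(name) ? 2 : 1;
}
```

A caller would typically pass the current LC_CTYPE name, e.g. ambiguous_width_for_locale(setlocale(LC_CTYPE, NULL)).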

The only commentary mk_wcwidth() has about ambiguous character widths is in
the alternate _cjk implementation, which its author (Markus Kuhn) doesn't
recommend for general
use.  I don't know if the Solaris approach (double-width in CJK locales,
single-width elsewhere) is common enough to want to make this
runtime-configurable in programs that care; for instance, zsh could have a
setopt flag to switch to double-width when the user knew they were in that
environment.
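Roughly what such a runtime switch could look like inside a program, as a sketch only. Nothing here is zsh code: adjusted_width(), the ambiguous_is_wide flag and the one-entry table are hypothetical, and the table is deliberately not a complete list of ambiguous-width characters. The caller passes in whatever the system wcwidth() returned, so the same wrapper normalizes a libc that says 2 and one that says 1.

```c
#include <stddef.h>
#include <stdint.h>

struct cp_range { uint32_t first, last; };

/* Illustrative sample only, NOT the full set of ambiguous-width
 * characters: just the one discussed in this thread. */
static const struct cp_range ambiguous_sample[] = {
    { 0x2026, 0x2026 },   /* U+2026 HORIZONTAL ELLIPSIS */
};

static int in_ambiguous_sample(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof ambiguous_sample / sizeof ambiguous_sample[0]; i++)
        if (cp >= ambiguous_sample[i].first && cp <= ambiguous_sample[i].last)
            return 1;
    return 0;
}

/* Hypothetical wrapper: system_width is what the platform wcwidth()
 * returned for cp; ambiguous_is_wide is the user's setting.  Widths of
 * 0 or -1 (combining or non-printable) are passed through untouched. */
int adjusted_width(uint32_t cp, int system_width, int ambiguous_is_wide)
{
    if (system_width < 1 || !in_ambiguous_sample(cp))
        return system_width;
    return ambiguous_is_wide ? 2 : 1;
}
```

With the flag off, a system answer of 2 for U+2026 is folded down to 1; with it on, an answer of 1 is widened to 2.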

I'm a bit surprised that xterm's -cjk option isn't automatic -- shouldn't
it know whether the font it's loading is double-width or not?  Either way,
it could respond to some escape code that programs which care (or even
wcwidth() itself or a standard replacement) could use to query it about the
current width.  Perhaps that's the ideal solution?

I'd started talking to Thomas Dickey about this a couple of years ago (I
keep running into this problem, start talking to people about it, decide
it's too hard and I don't have enough time, and drop it until the next time
around); perhaps I could pick that thread up again with that suggestion?

FWIW, I tried xterm -cjk, both with my normal western font and with a CJK
font, and in both cases it handles U+2026 fine, putting it in a double-wide
box.  Vim seemed to handle it, too.

> > I don't know why the zero-width combining character was chosen as the
> > test.
> 
> The test was first introduced to detect a broken wcwidth() on Mac OS X,
> where wcwidth() returns 1 for combining characters.

Which seems unambiguously broken, unlike the one on Solaris.
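For anyone who wants to see what their own system reports, a small probe along the lines of the test being discussed. It is a sketch, not the actual configure test: probe_width() is a hypothetical name, and the locale name a caller passes (for instance "en_US.UTF-8") is an assumption that may not exist on a given machine.

```c
#define _XOPEN_SOURCE 700   /* for wcwidth(); must come before any #include */
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: switch LC_CTYPE to locale_name, decode the first
 * multibyte character in bytes[0..len) and return wcwidth() for it.
 * Returns -2 if the locale is unavailable or the bytes do not decode to
 * a non-null character.  Note that it changes the process locale. */
int probe_width(const char *locale_name, const char *bytes, size_t len)
{
    mbstate_t st;
    wchar_t wc;
    size_t n;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -2;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    n = mbrtowc(&wc, bytes, len, &st);
    if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
        return -2;

    return wcwidth(wc);
}
```

In a UTF-8 locale, probe_width(loc, "\xcc\x81", 2) asks about U+0301 COMBINING ACUTE ACCENT, where anything other than 0 points to the kind of breakage described for Mac OS X, and probe_width(loc, "\xe2\x80\xa6", 3) asks about U+2026, where both 1 and 2 are defensible answers.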

Thanks,
Danek


