Zsh Mailing List Archive

Re: Quoting problem and crashes with ${(#)var}



Bart Schaefer wrote:
> I'm a bit puzzled, given this test ...
> 
> }      if (isset(MULTIBYTE) && ires > 127) {
> 
> ... why ${(V)x} for x in 128 through 159 display as \u0080 through
> \u009f, but then 160 through 255 are treated as directly printable.

On my terminal I get a different effect, which worries me more: if I
assign the UTF-8 representation of character 128 to a variable, ${(V)x}
tries to print it out directly (and it only shows up if I send it through
xxd or equivalent).  Quite possibly the shell is linked with different
libraries.  (However, the ZLE function insert-unicode-char correctly
shows it as a control character, ^ followed by A with a grave accent.)
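
For anyone who wants to see the two representations side by side, here is
a rough sketch in Python rather than zsh.  It is not how zsh or ZLE does
it internally.  The helper name caret_notation is made up for the
example, and it assumes ZLE uses the usual "flip bit 0x40" convention for
controls in the 128 to 159 range, as is done for ASCII controls.

```python
import unicodedata


def caret_notation(code: int) -> str:
    """Return '^X' caret notation for a control code by flipping bit 0x40.

    Hypothetical helper for illustration only; the convention is assumed,
    not taken from the zsh source.
    """
    return "^" + chr(code ^ 0x40)


# The UTF-8 representation of character 128: the octets xxd would show.
utf8_bytes = chr(128).encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))

# Name of the character that follows the caret for code 128
# (printed by name so the output stays ASCII-only).
print(unicodedata.name(caret_notation(128)[1]))
```

Under that assumption the character after the caret for 128 is code
point 192, which matches what insert-unicode-char shows.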

Anyway, 128 to 159 aren't printable, but 160 onwards are.  In Unicode:

0080	<control>
...
009F	<control>
	= APPLICATION PROGRAM COMMAND
00A0	NO-BREAK SPACE
	= NBSP
	x (space - 0020)
	x (figure space - 2007)
	x (narrow no-break space - 202F)
	x (word joiner - 2060)
	x (zero width no-break space - FEFF)
	# <noBreak> 0020

(V) is documented as "make special characters visible".  That's exactly
what you're getting (but I'm not, for some reason---I'd be interested in
knowing where on your system the printability test is taking place).
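
The Unicode classification quoted above can be checked against the
Unicode character database with a short Python sketch.  This consults
Python's own copy of the database, not the C library printability test
the shell would use, so treat it as a cross-check of the table rather
than a reproduction of zsh's behaviour.  The function name is_control is
invented for the example.

```python
import unicodedata


def is_control(code: int) -> bool:
    """True if the code point has Unicode general category Cc (control)."""
    return unicodedata.category(chr(code)) == "Cc"


# Collect every control character in the range 128..255.
c1_controls = [c for c in range(128, 256) if is_control(c)]
print(f"controls: {c1_controls[0]:#06x}..{c1_controls[-1]:#06x}")

# The first code point after the control block.
print(unicodedata.name(chr(0xA0)))
```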

> Furthermore, if I run with LANG=C I get
> 
> % for x in {1..254}; h[x]=${(V#)x}
> zsh: character not in range
> 
> That seems wrong.  It does the right thing if "unsetopt multibyte"
> is also in effect, but why should I have to explicitly do so?

Well, because you've (explicitly or otherwise) got it set to a locale
with no knowledge of characters beyond 127; it only knows about the
portable character set.  It's simply telling you it doesn't know what to
do with them.  It can't guess, because there's nothing really for it to
guess; locale C is a statement of ignorance about the non-portable,
post-ASCII world.  You're probably expecting 128 to print out a single octet
corresponding to the value of a C unsigned char with 128 in it.  Yet for
all the computer's been told, character 128 on the terminal you're using
is "really" the symbol for a deity worshipped by a Venusian cargo cult
which is represented by a string of 17 0xff's followed by 0x73 0x57, the
magic number indicating the transfer of energy between worshippers when
the Earth is high in the sky.  (All right, maybe this particular example
wasn't very realistic.)
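
To make the point concrete, here is an analogy in Python, not the zsh
code path.  It assumes Python's "ascii" codec is a fair stand-in for a
locale that only knows the portable character set.  The function name
encode_or_complain is made up for the sketch.  The same character 128
has no representation, a one-octet representation, or a two-octet one
depending entirely on which character set you name.

```python
def encode_or_complain(code: int, charset: str) -> str:
    """Encode one character in `charset`, or report that it is out of range."""
    try:
        octets = chr(code).encode(charset)
    except UnicodeEncodeError:
        # The character set has no representation for this character.
        return "character not in range"
    return " ".join(f"{b:02x}" for b in octets)


for charset in ("ascii", "latin-1", "utf-8"):
    print(f"{charset:8} {encode_or_complain(128, charset)}")
```

The single octet 0x80 only appears once a character set that defines it
(here latin-1) has been named.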

What you're asking is for some kludged special case for LANG=C
(presumably we shouldn't second-guess any other character set).  It's
doable, I suppose, but I can't see the gain.  MULTIBYTE mode was never
intended to be backward compatible; that's exactly why NO_MULTIBYTE
exists.

-- 
Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/


