Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Test ./E03posix.ztst was expected to fail, but passed.



2022-03-23 03:26:44 +0100, Vincent Lefevre:
> On 2022-03-22 14:04:30 -0700, Bart Schaefer wrote:
> > Specifically in this instance, we consider it a POSIX bug that '%s'
> > always counts byte positions and that zsh has fixed this when it
> > counts character positions.
> 
> But, AFAIK, on the POSIX side, it has never been regarded as a bug
> (I haven't seen any bug report).
[...]

It's been raised several times on the POSIX mailing list. My
understanding is that the Open Group doesn't consider it a bug,
and they have made it clear that they would not address it. They
may consider specifying ksh93's %Ls (which pads based on display
width, not byte or character count) if enough implementations
start to support it.

That's why I didn't bother raising it as a bug personally, but
to me that position (where printf(1) is meant to be an
interface to printf(3) without decoding those bytes into
characters) does not make sense. printf is for printing
formatted text, not for padding binary strings. printf(3) was
extended with wprintf(3) to handle wide characters; printf(1)
should have been enhanced to switch to that or an equivalent,
just like every other text utility is now specified to cope
with wide characters.
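
To make the byte-counting behaviour concrete, here is a small
sketch (not part of the original exchange). It assumes a printf
that pads by bytes, which LC_ALL=C is used to force here; the
helper name pad5_hex is made up for illustration.

```shell
# Hypothetical helper: pad the argument with %5s and dump the bytes produced.
# LC_ALL=C is an assumption used to force byte counting in the printf builtin.
pad5_hex() {
  LC_ALL=C printf '%5s' "$1" | od -An -tx1
}

e_acute=$(printf '\303\251')   # "é" as UTF-8: one character, two bytes

pad5_hex "$e_acute"   # 3 bytes of padding: 20 20 20 c3 a9
pad5_hex e            # 4 bytes of padding: 20 20 20 20 65
```

A single-column character ends up with three spaces of padding
instead of four, so columns of text no longer line up.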

printf(1) needs to decode its arguments into text anyway, if
only because in the format and in %b arguments the "\" character
(and also "%" in the format) is interpreted specially. zsh
doesn't do that decoding, by the way (which may be considered a
bug, but then again those non-UTF-8 multibyte charsets are
poorly supported throughout, and to me it doesn't seem worth the
effort given that hardly anybody uses multibyte charsets other
than UTF-8 these days):

$ LC_ALL=zh_HK SHELL=/bin/zsh luit
zsh$ locale charmap
BIG5-HKSCS
zsh$ printf 'αb' | hd
00000000  a3 08                                             |..|
00000002

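(α is encoded as 0xa3 0x5c in BIG5-HKSCS, the charset used in
that locale, and 0x5c is also the encoding of \, so zsh reads
the second byte of α followed by b as the \b escape and outputs
0x08)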
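
The same effect can be reproduced without a BIG5-HKSCS locale by
building the bytes from octal escapes. This is only a sketch: it
assumes a printf that scans its format byte by byte, and the
helper names as_format and as_argument are made up for
illustration.

```shell
# "αb" as encoded in BIG5-HKSCS: α is 0xa3 0x5c, b is 0x62.
alpha_b=$(printf '\243\134b')

# Hypothetical helpers that dump the bytes printf emits.
as_format()   { printf "$1"      | od -An -tx1; }
as_argument() { printf '%s' "$1" | od -An -tx1; }

as_format "$alpha_b"     # byte-wise scan sees "\b": a3 08
as_argument "$alpha_b"   # %s leaves the bytes alone: a3 5c 62
```

The bytes are only misread where printf looks for escapes, that
is in the format (and likewise in %b arguments), not in plain %s
arguments.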

Yash is probably the only shell that implements the POSIX spec
as POSIX likely intends it:

~$ LC_ALL=zh_HK SHELL=yash luit
yash$ printf 'αb' | hd
00000000  a3 5c 62                                          |.\b|
00000003
yash$ printf %5s 'αb' | hd
00000000  20 20 a3 5c 62                                    |  .\b|
00000005
yash$ printf %5b 'αb' | hd
00000000  20 20 a3 5c 62                                    |  .\b|
00000005

That is, bytes are decoded into characters so that those
backslashes are interpreted "correctly" (yash decodes
everything, it's not specific to printf¹), and then encoded back
so that the result behaves as if the bytes were passed to
printf(3), as POSIX requires.

I've not verified it, but I've read somewhere that the C
standard was considering enhancing printf("%.3s") so that it
doesn't break characters in the middle (or maybe that's already
the case?). So printf '%.3s\n' Stéphane, where é is UTF-8
encoded in a locale using UTF-8, would output "St" instead of
"St<0xc3>".
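
For reference, the current byte-based truncation can be observed
with the sketch below. It assumes a printf that counts the
precision in bytes, forced here with LC_ALL=C; the helper name
trunc3_hex is made up for illustration.

```shell
name=$(printf 'St\303\251phane')   # "Stéphane" with é as UTF-8 bytes c3 a9

# Hypothetical helper: apply a byte-based %.3s and dump the result.
# LC_ALL=C is an assumption used to force byte counting in the printf builtin.
trunc3_hex() {
  LC_ALL=C printf '%.3s' "$1" | od -An -tx1
}

trunc3_hex "$name"   # 53 74 c3: "St" plus the first byte of é
```

The trailing 0xc3 is an incomplete UTF-8 sequence, which is the
breakage such a change to the C standard would avoid.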

My opinion would be:

- not change how %5s works in zsh. To me, zsh made an effort to
  fix that, and I wouldn't expect anyone to rely on the POSIX
  behaviour, which to me is a bug. One can always do

    printf() {
      set -o localoptions +o multibyte; builtin printf "$@"
    }

  if they want the POSIX behaviour.

- no need to fix the problems with backslashes in those
  messed-up multibyte encodings as I'd expect they're being
  phased out.

- maybe implement ksh93's %Ls (zsh does have a ${(ml[5])param}
  alternative, though it does both padding and truncation).

---
¹ That approach is not tenable IMO, as it means yash can't cope
with arbitrary file paths, arguments, or environment variables.

-- 
Stephane



