Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: bug report : printf %.1s outputting more than 1 character



quote :

============
This triggers a branch of the printf code introduced by this comment:
    /*
    * Invalid/incomplete character at this
    * point.  Assume all the rest are a
    * single byte.  That's about the best we
    * can do.
    */
============

Does the behavior shown below the "====" line look even reasonable, regardless of the spec? What the code ends up doing is treating the rest of the input string as a single byte and printing everything out, even though there are valid code points further down the input string.

The behavior is correct when LC_ALL=C is set, meaning zsh already has the code needed to generate the correct output. My point is that instead of treating the rest of the input string, regardless of size, as 1 byte/character, why not have it behave "as if" LC_ALL=C were in effect whenever it enters this branch:

if (chars < 0) {
    /*
     * Invalid/incomplete character at this
     * point.  Assume all the rest are a
     * single byte.  That's about the best we
     * can do.
     */
    lchars += lleft;
    lbytes = (ptr - b) + lleft;
    break;
}

and continue in this mode until a locale-valid character is found, then revert to multibyte behavior? Wouldn't that be more logical?

If that's too complex to implement, then perhaps treat the rest of the input string as a collection of individual bytes instead of just 1 byte?

I just find printf '%.3s' outputting a 179 KB string rather odd.
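
To make the suggestion concrete, here is a minimal sketch of the "count each invalid byte as one character, then resume multibyte decoding" idea. It is not zsh's code: it assumes a UTF-8 locale and uses a hand-rolled validity check rather than the locale's multibyte functions, and the names utf8_seq_len and precision_bytes are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of the well-formed UTF-8
 * sequence starting at s, or 0 if the bytes at s do not form one.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t left)
{
    unsigned char c;
    size_t need, i;

    if (left == 0)
        return 0;
    c = s[0];
    if (c < 0x80)
        return 1;                   /* ASCII */
    if (c >= 0xC2 && c <= 0xDF)
        need = 2;
    else if (c >= 0xE0 && c <= 0xEF)
        need = 3;
    else if (c >= 0xF0 && c <= 0xF4)
        need = 4;
    else
        return 0;                   /* stray continuation, C0/C1, F5..FF */
    if (left < need)
        return 0;                   /* truncated sequence */
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (c == 0xE0 && s[1] < 0xA0) return 0;
    if (c == 0xED && s[1] > 0x9F) return 0;
    if (c == 0xF0 && s[1] < 0x90) return 0;
    if (c == 0xF4 && s[1] > 0x8F) return 0;
    return need;
}

/*
 * Hypothetical: number of bytes occupied by the first maxchars
 * characters of str.  A byte that does not start a valid sequence
 * counts as one character, and decoding resumes at the next byte.
 */
size_t precision_bytes(const char *str, size_t len, size_t maxchars)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t pos = 0, chars = 0;

    assert(str != NULL);
    while (pos < len && chars < maxchars) {
        size_t n = utf8_seq_len(s + pos, len - pos);
        pos += n ? n : 1;           /* invalid byte: consume just one */
        chars++;
    }
    return pos;
}
```

With the 16-byte test string from this report, a precision of 4 would select 6 bytes under this scheme ("=", the 3-byte character, "#", and the lone \377) rather than all 16, and the whole string would count as 12 characters.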

=========================

 zsh --restricted --no-rcs --nologin --verbose -xtrace -f -c '___=$'\''=\343\276\255#\377\210\234\256A\301B\354\210\264_'\''; command printf "%s" "$___" | gwc -lcm; for __ in {1..16}; do builtin printf "%.${__}s" "$___" | gwc -lcm; done '
___=$'=\343\276\255#\377\210\234\256A\301B\354\210\264_'; command printf "%s" "$___" | gwc -lcm; for __ in {1..16}; do builtin printf "%.${__}s" "$___" | gwc -lcm; done
+zsh:1> ___=$'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> printf %s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=1
+zsh:1> printf %.1s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       1       1
+zsh:1> __=2
+zsh:1> printf %.2s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       2       4
+zsh:1> __=3
+zsh:1> printf %.3s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       3       5
+zsh:1> __=4
+zsh:1> printf %.4s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=5
+zsh:1> printf %.5s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=6
+zsh:1> printf %.6s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=7
+zsh:1> printf %.7s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=8
+zsh:1> printf %.8s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=9
+zsh:1> printf %.9s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=10
+zsh:1> printf %.10s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=11
+zsh:1> printf %.11s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=12
+zsh:1> printf %.12s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=13
+zsh:1> printf %.13s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=14
+zsh:1> printf %.14s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=15
+zsh:1> printf %.15s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=16
+zsh:1> printf %.16s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16

+zsh:1> ___=$'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> LC_ALL=C printf %s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=1
+zsh:1> LC_ALL=C +zsh:1> printf %.1s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       1       1
+zsh:1> __=2
+zsh:1> LC_ALL=C +zsh:1> printf %.2s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       1       2
+zsh:1> __=3
+zsh:1> LC_ALL=C +zsh:1> printf %.3s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       1       3
+zsh:1> __=4
+zsh:1> LC_ALL=C +zsh:1> printf %.4s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       2       4
+zsh:1> __=5
+zsh:1> LC_ALL=C +zsh:1> printf %.5s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       5
+zsh:1> __=6
+zsh:1> LC_ALL=C +zsh:1> printf %.6s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       6
+zsh:1> __=7
+zsh:1> LC_ALL=C +zsh:1> printf %.7s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       7
+zsh:1> __=8
+zsh:1> LC_ALL=C +zsh:1> printf %.8s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       8
+zsh:1> __=9
+zsh:1> LC_ALL=C +zsh:1> printf %.9s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       9
+zsh:1> __=10
+zsh:1> LC_ALL=C +zsh:1> printf %.10s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       4      10
+zsh:1> __=11
+zsh:1> LC_ALL=C +zsh:1> printf %.11s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       4      11
+zsh:1> __=12
+zsh:1> LC_ALL=C +zsh:1> printf %.12s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       5      12
+zsh:1> __=13
+zsh:1> LC_ALL=C +zsh:1> printf %.13s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       5      13
+zsh:1> __=14
+zsh:1> LC_ALL=C +zsh:1> printf %.14s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       5      14
+zsh:1> __=15
+zsh:1> LC_ALL=C +zsh:1> printf %.15s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       6      15
+zsh:1> __=16
+zsh:1> LC_ALL=C +zsh:1> printf %.16s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       7      16




On Tuesday, March 14, 2023 at 11:46:14 PM EDT, Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:


On Tue, Mar 14, 2023 at 7:40 PM Jason C. Kwan <jasonckwan@xxxxxxxxx> wrote:
>
> I'm using the macOS 13.2.1 OS-provided zsh, version 5.8.1, which I understand isn't the latest and greatest of 5.9, so perhaps this bug has already been addressed.

A related case has been addressed by declaring it an intentional
divergence from POSIX, see
https://www.zsh.org/mla/workers/2022/msg00240.html

However ...


> In the 4-byte sequence as seen below ( defined via explicit octal codes ), under no Unicode scenario should 4 bytes be printed out via a command of printf %.1s, by design.
>
>  - The first byte of \377 \xFF is explicitly invalid under UTF-8 (even allowing up to 7-byte in the oldest of definitions).


This triggers a branch of the printf code introduced by this comment:
    /*
    * Invalid/incomplete character at this
    * point.  Assume all the rest are a
    * single byte.  That's about the best we
    * can do.
    */

Thus, you've deliberately invoked a case where zsh's response to
invalid input is to punt.  This dates back to the original
implementation in workers/23098,
https://www.zsh.org/mla/workers/2007/msg00019.html, January 2007.


