Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: bug report : printf %.1s outputting more than 1 character



On Wed, 2023-03-15 at 08:31 -0700, Bart Schaefer wrote:
> On Tue, Mar 14, 2023 at 9:56 PM Jason C. Kwan <jasonckwan@xxxxxxxxx> wrote:
> >
> > does the following ( below the "====" line ) behavior look even
> > reasonable at all, regardless of your spec ? Because what the spec ends
> > up doing is treating the rest of the input string as 1 byte and printing
> > everything out, even though there are valid code points further down the
> > input string.
> 
> I'm not the resident expert on multibyte character sets, so I'm just
> reporting the situation and waiting for e.g. PWS to respond.  However,
> as far as my understanding of the multibyte library goes, once you've
> "desynchronized" the input by encountering an invalid byte, you're not
> guaranteed that anything further that you see can be correctly
> interpreted as a code point.  I agree that it's not ideal to just dump
> everything else "raw".

Elsewhere, we mostly treat invalid codes as if they're single octets, so
this is a bit inconsistent.  I think it's really just to try to avoid
overcomplicating %s output.  However, it would probably be more
consistent just to treat everything that doesn't make sense as single
bytes until we get back on track.  There doesn't seem to be any point in
doing anything different with incomplete characters here, either ---
we've already got all the characters we're going to get.  Something like
this, but feel free to tweak further --- I don't have any motivation to
do so myself.

This is probably good enough for the obvious simple case of "just
output the next thing you see whatever the heck it looks like".

pws

diff --git a/Src/builtin.c b/Src/builtin.c
index 70a950666..9719d26d1 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -5222,20 +5222,21 @@ bin_print(char *name, char **args, Options ops, int func)
 #ifdef MULTIBYTE_SUPPORT
 			if (isset(MULTIBYTE)) {
 			    chars = mbrlen(ptr, lleft, &mbs);
-			    if (chars < 0) {
-				/*
-				 * Invalid/incomplete character at this
-				 * point.  Assume all the rest are a
-				 * single byte.  That's about the best we
-				 * can do.
-				 */
-				lchars += lleft;
-				lbytes = (ptr - b) + lleft;
-				break;
-			    } else if (chars == 0) {
-				/* NUL, handle as real character */
+			    /*
+			     * chars <= 0 means one of
+			     *
+			     * 0: NUL, handle as real character
+			     *
+			     * -1: MB_INVALID: Assume this is
+			     *     a single character as we do
+			     *     elsewhere in the code.
+			     *
+			     * -2: MB_INCOMPLETE: We're not waiting
+			     *     for input on this occasion, so
+			     *     just treat this as invalid.
+			     */
+			    if (chars <= 0)
 				chars = 1;
-			    }
 			}
 			else	/* use the non-multibyte code below */
 #endif
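
For anyone who wants to try the "single bytes until we get back on track" behaviour outside the shell, here is a minimal standalone sketch. It is not zsh code: prefix_bytes() is a hypothetical helper written for illustration, and it assumes only the standard mbrlen() interface, where (size_t)-1 signals an invalid sequence and (size_t)-2 an incomplete one. It returns how many bytes the first max_chars characters occupy, counting every invalid or incomplete byte as one character, which is roughly what the precision in %.1s needs to know.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of zsh: return the number of bytes
 * taken up by the first max_chars characters of s, which is len
 * bytes long.  The current locale decides what a character is.
 */
static size_t prefix_bytes(const char *s, size_t len, size_t max_chars)
{
    mbstate_t mbs;
    size_t used = 0, count = 0;

    memset(&mbs, 0, sizeof mbs);
    while (used < len && count < max_chars) {
        size_t n = mbrlen(s + used, len - used, &mbs);

        if (n == 0 || n == (size_t)-1 || n == (size_t)-2) {
            /*
             * NUL, invalid or incomplete: consume exactly one byte.
             * The conversion state is unspecified after an error,
             * so start again from the initial state.
             */
            n = 1;
            memset(&mbs, 0, sizeof mbs);
        }
        used += n;
        count++;
    }
    return used;
}
```

A caller would typically run setlocale(LC_ALL, "") first and then print with something like printf("%.*s", (int)prefix_bytes(s, strlen(s), 1), s). Resetting the state after each bad byte is what allows valid characters later in the string to be recognised again, rather than dumping the remainder raw.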




