Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: num_in_chars incremented after each mbrtowc()

I start with the second patch (which you haven't look at?) as it is
more justified. To put it simply – ptr is incremented, num_in_char
isn't. That's wrong.

To put it less simply, I've written a function that returns raw
pointer into string and sets prevcharlen and nextcharlen. These two
values come from num_in_char. If I don't add one more incrementation
of num_in_char to the "if (*ptr == Meta) {" block, characters are
displayed as errors (i.e. the question marks symbols).

The first patch is less justified. Only bytes of incomplete characters
are counted in num_in_chars and that is what the code wants. But if
one defines the variable as "holds current byte width of character" it
is clear that *every* call to mbrtowc() should increment that
variable. Currently last byte will not be counted because there will
be no MB_INCOMPLETE returned.

To put it simply, call to mbrtowc() must be followed by num_in_char++,
it cannot be anything different.

Best regards,
Sebastian Gniazdowski

On 6 December 2015 at 16:49, Peter Stephenson
<p.w.stephenson@xxxxxxxxxxxx> wrote:
> On Sun, 6 Dec 2015 10:37:21 +0100
> Sebastian Gniazdowski <sgniazdowski@xxxxxxxxx> wrote:
>> Hello,
>> while working my hands off on implementing display width handling in
>> params.c rather than subst.c I encountered a bug in mb_metastrlenend.
>> It will reveal itself only on improper unicode strings.
> I don't understand your patch: the change is to increment num_in_char in
> exactly the cases where it is deliberately set to 0 later to reflect the
> fact we've now got a complete character and the effect is included in
> num instead.
> Somebody complained about this function a couple of months ago and I
> explained, then, too; it suggests it needs some more comments, so I've
> added some.
> It may be the real difficulty is with the API, in which case you'll need
> to say (in words, not videos, please) what you're expecting.  As long as
> this is consistently dealt with in callers of the function it might be
> possible to change --- I guess you're only worried about the case for
> returning a width, which is uncommon in the code and indeed doesn't
> really have a well defined result for incomplete/invalid characters.
> Maybe you have a particular strategy in mind.
> Consequently I haven't looked at your other patch.
> pws
> diff --git a/Src/utils.c b/Src/utils.c
> index d131383..45f8286 100644
> --- a/Src/utils.c
> +++ b/Src/utils.c
> @@ -5179,6 +5179,17 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
>         ret = mbrtowc(&wc, &inchar, 1, &mb_shiftstate);
>         if (ret == MB_INCOMPLETE) {
> +           /*
> +            * "num_in_char" is only used for incomplete characters.  The
> +            * assumption is that we will output this octet as a single
> +            * character (of single width) if we don't get a complete
> +            * character; if we do get a complete character, num_in_char
> +            * becomes irrelevant and is set to zero.
> +            *
> +            * This is in contrast to "num" which counts the characters
> +            * or widths in complete characters.  The two are summed,
> +            * so we don't count characters twice.
> +            */
>             num_in_char++;
>         } else {
>             if (ret == MB_INVALID) {

Messages sorted by: Reverse Date, Date, Thread, Author