Zsh Mailing List Archive

Re: read -d $'\200' doesn't work with set +o multibyte

X-seq: zsh-workers 51159
From: Oliver Kiddle <opk@xxxxxxx>
To: Zsh hackers list <zsh-workers@xxxxxxx>
Subject: Re: read -d $'\200' doesn't work with set +o multibyte
Date: Fri, 09 Dec 2022 21:05:02 +0100
Archived-at: <https://zsh.org/workers/51159>
In-reply-to: <20221209154225.2z3lbtf422ypnmjx@chazelas.org>
List-id: <zsh-workers.zsh.org>
References: <20221209154225.2z3lbtf422ypnmjx@chazelas.org>

Stephane Chazelas wrote:
> Even in a locale with a single-byte charmap, when multibyte is
> off, I can't make read -d work when the delimiter is a byte >=
> 0x80.

In my testing, it does work in a single-byte locale. I tested on
multiple systems.

Looking at the multibyte implementation of read, the approach taken
is to use a wchar_t for the delimiter and then maintain an mbstate_t for
the input. This supports a delimiter that can be any single Unicode
code point. In my testing this is working as intended. But note that \351
alone is incomplete in UTF-8 terms, so what wchar_t value should it be
mapped to?
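
To illustrate why \351 (0xE9) is incomplete on its own in UTF-8, here is
a minimal sketch that classifies a byte by its high bits. The function
name utf8_byte_role is hypothetical, and the check is structural only: it
does not validate whole sequences or reject overlong forms.

```shell
# Hypothetical helper: report the structural role of one byte in UTF-8.
# It only inspects the high bits; it does not reject overlong or
# otherwise invalid sequences.
utf8_byte_role() {
  b=$(( $1 ))
  if   [ $(( b & 0x80 )) -eq 0 ];   then echo "ascii"
  elif [ $(( b & 0xC0 )) -eq 128 ]; then echo "continuation"
  elif [ $(( b & 0xE0 )) -eq 192 ]; then echo "lead-of-2"
  elif [ $(( b & 0xF0 )) -eq 224 ]; then echo "lead-of-3"
  elif [ $(( b & 0xF8 )) -eq 240 ]; then echo "lead-of-4"
  else echo "invalid"
  fi
}

utf8_byte_role 0xE9   # \351: lead-of-3, so incomplete on its own
utf8_byte_role 0xC3   # first byte of UTF-8 é: lead-of-2
utf8_byte_role 0xA9   # second byte of UTF-8 é: continuation
```

In ISO-8859-1 the single byte 0xE9 is é, whereas in UTF-8 the same
character is the two bytes 0xC3 0xA9 and 0xE9 only opens a three-byte
sequence.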

Also interesting to consider is the range \x7f to \x9f in an ISO-8859-x
locale. Those are duplicates of the control characters. In my testing
with a single-byte locale, \x89 as a delimiter will end input at a tab
character, but the converse (\t as a delimiter) will not terminate at
\x89 in the input.

My understanding of the proposed POSIX wording is that it requires
the individual octet, regardless of any character mapping, to be the
delimiter. Does anyone track the Austin list? Would be good if they can
be persuaded to relax what they specify. The part I especially object to
is requiring that the input does not contain null bytes. The fact that
zsh can cope with nulls is often really useful. Why can't they leave
that unspecified? I can understand wanting to standardise a lowest
common denominator but that is punishing an existing richer
implementation.

One way forward would be to take the argument to -d as a literal and
potentially multi-byte delimiter. UTF-8 has the property that a valid
sequence can't occur within a longer sequence, so for UTF-8 you would not
need to worry about it finding a delimiter within a different
character. This is not the case with combining characters, but the
current implementation will also stop at the uncombined character.
There are other multi-byte encodings for which this property does not
hold. I've no idea how relevant things like EUC-JP and Shift-JIS still
are.
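
As a rough illustration of that difference, the sketch below does a
byte-wise search using hex dumps. The helpers hex_bytes and
contains_bytes are hypothetical names, and their arguments are printf
format strings with octal escapes, so they must not contain % signs. The
example character is 表, which is 0x95 0x5C in Shift-JIS and
0xE8 0xA1 0xA8 in UTF-8.

```shell
# Hypothetical helper: dump the bytes produced by a printf format string
# as space-separated hex.
hex_bytes() {
  printf "$1" | od -An -tx1 | tr -s ' \n' ' '
}

# Hypothetical helper: succeed if the byte sequence $2 occurs in $1.
contains_bytes() {
  haystack=$(hex_bytes "$1")
  needle=$(hex_bytes "$2")
  case "$haystack" in
    *"$needle"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Shift-JIS: the trail byte of 表 is 0x5C, the same byte as ASCII backslash.
if contains_bytes '\225\134' '\134'; then
  echo "Shift-JIS: backslash byte found inside another character"
fi

# UTF-8: the bytes of 表 are all >= 0x80, so no ASCII byte can match.
if ! contains_bytes '\350\241\250' '\134'; then
  echo "UTF-8: no backslash byte inside the character"
fi
```

A byte-wise delimiter search would therefore be safe for UTF-8 input but
could split a character in an encoding such as Shift-JIS.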

A side effect of this would be support for strings of quite distinct
characters as a multi-character delimiter.

Should we document the fact that -d '' works like -d $'\0'? Perhaps mark
this as being for compatibility with other shells? Fortunately, it does
work as specified but this may only be by accident. When the -d feature
was added, it was probably only checked that the behaviour with an empty
delimiter was sane.
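
For reference, a minimal sketch of the usual idiom that relies on this.
It assumes zsh, or another shell whose read accepts -d (bash does too);
the function name split_on_nul is hypothetical.

```shell
# Read NUL-terminated records from stdin and print each one in angle
# brackets. IFS= and -r keep whitespace and backslashes intact.
split_on_nul() {
  while IFS= read -r -d '' item; do
    printf '<%s>\n' "$item"
  done
}

printf 'one\0two words\0three\0' | split_on_nul
# prints:
# <one>
# <two words>
# <three>
```

This is the same pattern commonly paired with find -print0.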

> $ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte
> $ locale charmap
> ISO-8859-15

What do you get with the following? I'd sooner trust this:
  zmodload zsh/langinfo; echo $langinfo[CODESET]

Oliver

Follow-Ups:
- Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  - From: Stephane Chazelas

References:
- read -d $'\200' doesn't work with set +o multibyte
  - From: Stephane Chazelas
