Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: [PATCH] [[:blank:]] only matches on SPC and TAB



2018-05-14 09:47:33 +0100, Peter Stephenson:
> On Mon, 14 May 2018 07:44:31 +0100
> Stephane Chazelas <stephane.chazelas@xxxxxxxxx> wrote:
> > Tue Oct 13 21:42:47 1998  Andrew Main  <zefram@xxxxxxx>
> > 
> >         * Doc/Zsh/expn.yo, Src/glob.c: Add the [:blank:] character
> > class required by POSIX, which has no corresponding ctype macro.
> > 
> > Which explains why it's not using isblank() and strongly
> > suggests that it was not intentional.
> 
> I think that's correct, but I tend to agree with Sebastian that some
> caution is required here since it's not necessarily clear what action
> with non-ASCII spaces is actually wanted when this is used.  I'd be
> surprised if it actually broke anything, though.
[...]

I was going to say that surely, when someone uses [:blank:] that
means they want to trust the locale on the definition of
"blank", and I can't see why that should be different from other
character classes, but I just noticed that the documentation
actually says:

     [:blank:]
               The character is either space or tab

Instead of "horizontal whitespace". And on GNU systems,
"isblank(3)" also says it's SPC and TAB:

     Returns true if C is a blank character; that is, a space or a tab.
     This function was originally a GNU extension, but was added in
     ISO C99.

While iswblank(3) is careful to refer to locale classification.

In practice, the only system where I could find a locale with a
single-byte charset with "blank" characters other than SPC and
TAB was NetBSD. And there, isblank(0xa0) under setlocale() in a
locale that uses ISO8859-1 for instance does return true (as
POSIX requires if that's how 0xa0 is classified in the locale).
However, in the same locale, its sh (which is not multibyte
aware) outputs no in:

case $nbsp in
  [[:blank:][:space:]]) echo yes;;
  *) echo no
esac

(bash outputs yes for both blank and space as POSIX requires).
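
For anyone who wants to repeat that test, here is a rough sketch of
how it could be scripted. probe_nbsp is a hypothetical helper name,
and the locale name passed to it is an assumption: which ISO8859-1
locales exist and how they are spelled varies by system (see
`locale -a`). The variable holds the single byte 0xa0.

```shell
# Sketch only: probe_nbsp is a hypothetical helper, not part of any shell.
# $1 is a locale name; available names vary by system (see `locale -a`).
# The function body is a subshell, so the locale change does not leak out.
probe_nbsp() (
  LC_ALL=$1
  export LC_ALL
  nbsp=$(printf '\240')   # byte 0xa0: NO-BREAK SPACE in ISO8859-1
  case $nbsp in
    [[:blank:]]) echo "blank: yes";;
    *)           echo "blank: no";;
  esac
  case $nbsp in
    [[:space:]]) echo "space: yes";;
    *)           echo "space: no";;
  esac
)
```

Calling it as `probe_nbsp en_US.ISO8859-1` (the locale name being a
guess) and then as `probe_nbsp C`, under each shell of interest,
should show whether that shell consults the locale for these classes.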

I don't think many people complained when multi-byte support was
added and English people were starting to have their [[:alpha:]]
match on Greek or Korean letters in addition to English ones
(fair enough as "alpha" means the first letter of the Greek
alphabet).

The main problem if we want to align with other shells and make
the shell POSIX compliant is that the documentation currently
states explicitly that it matches on space and tab only.

The question is would any script be broken if we changed it?

People still keep using [a-z] when they mean to match English
lower case letters while in effect nowadays, except in zsh and a
very few other utilities that match ranges based on code points,
that matches on hundreds more (like à, œ, ć, if not ch, fi...). I
wouldn't be surprised if people use [[:alnum:]] thinking it only
matches on Latin letters without diacritics and Arabic decimal
digits.

But then again, that still works more or less for them, as they
use it anyway against text that only contains English data.

To me the correct way to do a strict match against ASCII blanks
(or English letters, or ASCII punctuation) would be to use the
C locale.
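
A minimal sketch of what that could look like in a script, assuming
a POSIX-style sh. The names is_ascii_blank and is_english_lower are
hypothetical helpers made up for illustration. Each function body is
a subshell, so the LC_ALL assignment stays local to the check.

```shell
# Hypothetical helpers: match strictly against ASCII by forcing the C locale.
# The ( ... ) function body runs in a subshell, so LC_ALL is not changed
# for the caller.
is_ascii_blank() (
  LC_ALL=C
  case $1 in
    [[:blank:]]) return 0;;   # exactly one SPC or TAB
    *)           return 1;;
  esac
)

is_english_lower() (
  LC_ALL=C
  case $1 in
    [a-z]) return 0;;         # exactly one of the 26 ASCII lower case letters
    *)     return 1;;
  esac
)
```

Something like `is_ascii_blank "$c" && echo blank` would then behave
the same way whatever locale the calling script runs in.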

-- 
Stephane


