Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: =~ doesn't work with NUL characters



2017-06-14 16:49:38 -0400, Phil Pennock:
[...]
> Without rematchpcre, this is ERE per POSIX APIs, which don't portably
> support size-supplied strings, relying instead upon C-string
> null-termination.
> 
> Current macOS has regnexec() but this is not in the system regexp
> library I see on Ubuntu Trusty or FreeBSD 10.3.  It appears to be an
> extension from when they switched to the TRE implementation in macOS
> 10.8.  <https://laurikari.net/tre/>
> 
> Trying to support this would result in variations in behaviour across
> systems in a way which I think might be undesirable.  The whole point of
> adding the non-PCRE implementation was to match Bash behaviour by
> default, and Bash does the same thing.
[...]

A dirty trick in UTF-8 locales (the norm these days) may be to
encode NUL as U+7FFFFF00 (and bytes 0x80 -> 0xff that don't
form part of valid characters as U+7FFFFF{80..FF}), in both the
string and the regexp.
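
As a rough sketch of what that mapping looks like at the byte level
(encode_escape is a made-up helper name, and I'm assuming the original
6-byte UTF-8 layout, 1111110x followed by five 10xxxxxx bytes, since
those code points need 31 bits):

```shell
# Hypothetical helper, for illustration only: print (in hex) the 6-byte
# sequence the original UTF-8 scheme would use for U+7FFFFF00 + byte.
# $1 is the byte value: 0 for NUL, or 0x80..0xFF for a stray byte.
encode_escape() {
  cp=$(( 0x7FFFFF00 + $1 ))
  printf '%02x%02x%02x%02x%02x%02x\n' \
    "$(( 0xFC | ((cp >> 30) & 0x01) ))" \
    "$(( 0x80 | ((cp >> 24) & 0x3F) ))" \
    "$(( 0x80 | ((cp >> 18) & 0x3F) ))" \
    "$(( 0x80 | ((cp >> 12) & 0x3F) ))" \
    "$(( 0x80 | ((cp >> 6) & 0x3F) ))" \
    "$(( 0x80 | (cp & 0x3F) ))"
}

encode_escape 0      # NUL  -> fdbfbfbfbc80
encode_escape 0x80   #      -> fdbfbfbfbe80
encode_escape 0xFF   #      -> fdbfbfbfbfbf
```

None of the resulting bytes is 0, so the encoded string survives the
C-string APIs intact.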

That wouldn't work with every regexp implementation though, as
some would treat those as invalid characters if they go by
the newer definition (RFC 3629) where valid characters are only
U+0000->U+D7FF, U+E000->U+10FFFF.
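
To illustrate the range check such an implementation would effectively
apply (is_valid_char is a hypothetical name, not something taken from
any regexp library):

```shell
# Hypothetical check: succeed if code point $1 is a valid character
# under the newer definition (0000->D7FF, E000->10FFFF), fail otherwise.
is_valid_char() {
  cp=$(( $1 ))
  [ "$(( (cp >= 0 && cp <= 0xD7FF) || (cp >= 0xE000 && cp <= 0x10FFFF) ))" -eq 1 ]
}

is_valid_char 0x41       && echo "U+0041 accepted"
is_valid_char 0x7FFFFF00 || echo "U+7FFFFF00 rejected"
```

The escape code points proposed above all fall outside both ranges, hence
the rejection.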

But with those where it does work, that would also make the
behaviour more consistent in cases like:

[[ $'\x80' = ? ]] vs [[ $'\x80' =~ '^.$' ]]

That wouldn't help in things like [[ x =~ $'[\0-\177]' ]] (which
anyway doesn't make sense in locales other than C/POSIX) though.

-- 
Stephane


