Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: PATCH: PCRE support for embedded NUL characters



On Sun, 16 Sep 2012 08:50:15 -0400
Phil Pennock <zsh-workers+phil.pennock@xxxxxxxxxxxx> wrote:
> This patch does not touch the docs or code for non-PCRE.  It just
> changes PCRE: code, docs, tests.  As I wrote this mail, I realised a
> couple of issues which lead me to think I shouldn't just commit this as
> is; there are open questions below for Peter/Bart.
> 
> Moritz Bunkus reported a problem which boiled down to regular
> expressions containing ASCII NUL characters; the support for UTF-8
> multibyte characters in regular expressions meant that we no longer
> passed zsh's internal metafied forms to the regular expression
> libraries, but this also meant that NUL characters in zsh now make it
> down.  There's no way I can see to deal with this for zsh/regex, but for
> zsh/pcre we can hack around it.

I'm not sure what the right answer is for zsh/regex, but documenting it
will do for now.

> More careful tracking of length, instead of using strlen(), lets us pass
> the search space (haystack) down intact.  For the regular expression, as
> part of the unmetafy() I now check to see if strlen() doesn't match the
> decoded length, in which case there's a NUL somewhere, and then all the
> NULs get replaced with \x00 in the pattern.
> 
> One thing that occurs to me now: what's the correct expectation if the
> pattern contains two characters, "backslash NUL"?  Before, that broke;
> with this, it becomes \\x00 which won't match.  \CHAR for a non-letter
> CHAR should remove special handling and treat the character as itself.
> Fixing this seems to require full string parsing in zsh with knowledge
> of the regexp escape sequences.  Document it as a limitation?

You can just do a pre-scan of the whole string for backslashes.  If
there's a backslash followed by a non-NUL, skip checking that next
character (which may itself be a backslash that's escaped); if there's a
backslash followed by a NUL the backslash can go.  It's such an unusual
case it hardly seems worth it, though.
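
For what it's worth, the pre-scan might look roughly like the sketch
below.  This is not the code from the patch: escape_nuls() is a
hypothetical name, it works on an already unmetafied buffer with an
explicit length, and it makes no attempt to understand \Q...\E quoting.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): copy an unmetafied pattern
 * of `len` bytes into a newly allocated NUL-terminated string in which
 * every embedded NUL byte is written as the four characters \x00.
 *
 * A backslash followed by a non-NUL byte is copied together with that
 * byte, so an escaped backslash is never taken as the start of another
 * escape.  A backslash followed by NUL is dropped, so "backslash NUL"
 * comes out as \x00 rather than \\x00.
 */
char *escape_nuls(const char *pat, size_t len)
{
    char *out;
    size_t i = 0, o = 0;

    assert(pat != NULL || len == 0);

    /* Worst case: every input byte is a NUL and grows to four bytes. */
    out = malloc(len * 4 + 1);
    if (!out)
        return NULL;

    while (i < len) {
        if (pat[i] == '\\' && i + 1 < len) {
            if (pat[i + 1] == '\0') {
                i++;                    /* drop the backslash */
            } else {
                out[o++] = pat[i++];    /* copy the backslash ... */
                out[o++] = pat[i++];    /* ... and the escaped byte */
                continue;
            }
        }
        if (pat[i] == '\0') {
            memcpy(out + o, "\\x00", 4);
            o += 4;
            i++;
        } else {
            out[o++] = pat[i++];
        }
    }
    out[o] = '\0';
    return out;
}
```

The caller owns the result and has to free() it.  The strlen() check
described above could still be used to skip the copy entirely when the
pattern has no NULs in it.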

> Another open question: are $mbegin/$mend offsets supposed to be in
> octets or in characters?  Given the MB_ prefix, I'm guessing I just
> broke this and will need to fix it tomorrow, before commit, after I get
> some sleep.  Do we have a decent way to count the number of wide
> characters in an unmetafied string which can contain NUL characters?

They are characters.  If the string is unmetafied you can skip the
MB_METACHARLEN() stuff and use the mbrtowc()/WCWIDTH() library calls
directly (WCWIDTH() is only defined in order to be able to replace an
unusable wcwidth()), but a NUL probably needs to be a special case
since I think the libraries assume it's a terminator.  It looks like the
existing pattern code uses metafied strings.
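
As a rough sketch of the counting, assuming an unmetafied buffer with a
known length: count_chars() is a made-up name, not an existing zsh
function, and counting an undecodable byte as one character is my
assumption rather than anything the existing code does.  The NUL is
special-cased because mbrtowc() returns 0, not a byte count, when it
decodes one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the characters in the first `len` bytes of
 * an unmetafied string that may contain embedded NUL bytes, using the
 * multibyte encoding of the current LC_CTYPE locale.
 */
size_t count_chars(const char *s, size_t len)
{
    mbstate_t st;
    size_t i = 0, n = 0;

    assert(s != NULL || len == 0);
    memset(&st, 0, sizeof st);

    while (i < len) {
        wchar_t wc;
        size_t r;

        if (s[i] == '\0') {
            /* Count a NUL as one character without asking the library. */
            memset(&st, 0, sizeof st);
            i++;
            n++;
            continue;
        }

        r = mbrtowc(&wc, s + i, len - i, &st);
        if (r == (size_t)-1 || r == (size_t)-2) {
            /* Invalid or truncated sequence: assume one byte is one
             * character and reset the conversion state. */
            memset(&st, 0, sizeof st);
            i++;
        } else {
            /* r is the number of bytes consumed; guard against 0. */
            i += r ? r : 1;
        }
        n++;
    }
    return n;
}
```

A character offset for $mbegin/$mend would then be count_chars() applied
to the bytes before the match, with setlocale(LC_CTYPE, "") having been
called beforehand so that multibyte sequences are recognised.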

-- 
Peter Stephenson <pws@xxxxxxx>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK




