Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

UTF-8 and PCRE and metafy



zsh 4.3.11 with rematch_pcre set:

% [[ 'foo→bar' =~ ^f.* ]]
zsh: pcre_exec() error: -10

The same happens with -pcre-match.

% locale charmap
UTF-8

Error -10 is PCRE_ERROR_BADUTF8.

In the pcre.c module, we explicitly enable PCRE_UTF8 if UTF8 is in
effect and supported.

Next to the existing:
  zwarn("pcre_exec() error: %d", r);
I shoved in a couple more zwarn()s to confirm that the string is in
non-meta form:
  zwarn("pcre_exec() error: %d", r);
  zwarn("lhstr: %s", lhstr);
  zwarn("rhre: /%s/", rhre);
→
  zsh: pcre_exec() error: -10
  zsh: lhstr: foo→bar
  zsh: rhre: /^f.*/

pcretest(1):
% pcretest
PCRE version 8.12 2011-01-15

  re> /^f.*/
data> foo→bar
 0: foo\xe2\x86\x92bar

Okay, so as long as the char is making it through intact as UTF-8 then
PCRE should be handling it.

Dumping each char in lhstr as an int shows that it is *not* in non-meta
form (it is still metafied) -- so why does it print just fine, then?  :(

% [[ 'foo→bar' =~ ^f.* ]]
zsh: pcre_exec() error: -10
zsh: lhstr: foo→bar
zsh: lhstr/%l: foo→bar
zsh: rhre: /^f.*/
zsh: utf-8 enabled?  1
zsh: lhstr char* item: 102
zsh: lhstr char* item: 111
zsh: lhstr char* item: 111
zsh: lhstr char* item: -30
zsh: lhstr char* item: -125
zsh: lhstr char* item: -90
zsh: lhstr char* item: -125
zsh: lhstr char* item: -78
zsh: lhstr char* item: 98
zsh: lhstr char* item: 97
zsh: lhstr char* item: 114

So after line 336 of pcre.c I add:

    unmetafy(lhstr, NULL);

Test:
% unset preexec_functions ; unfunction precmd
% [[ 'foo→bar' =~ ^f.* ]] ; print -l $? $MATCH foo $match
 pattern.c:1403: BUG: - missing from numeric glob
0
foo?^<bar
foo
zefram

I'm guessing I need a bunch of calls to metafy() to process the results
of extraction in zpcre_get_substrings()?  Where does the string
"zefram" come from?  I mean, Andrew is capable and all, but springing
into existence like that was surprising.

Is there guidance on correct API usage here for calling metafy() and
having lengths all match up?

-Phil


