Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Un-patch: new pattern matching code



I'm just about to make 3.1.6-pws-1 available, and it will contain the new
pattern matching code (or a new pattern matching code, if it turns out to
be duff we can go back).  Since it's quite long I haven't posted it
separately, but here is a description.

The user-visible changes are supposed to be entirely minimal: the
precedence of operators should be more rational, backtracking will work in
numeric ranges, and that's pretty much it.  I intend to document a few more
of these in the manual; I don't think anything there at the moment has
become wrong.  One think works, which used to in the old days but I
inadvertently messed it up because I didn't know about it:  patterns like
foo~bar~rod are valid, and are equivalent to, but less cumbersome than,
foo~(bar|rod).

It's based on a real regular expression matcher, so it should be
significantly faster, although I haven't done any real testing of speed and
I would be glad to hear about any comparisons (or maybe I wouldn't).  In
particular, it should do much less memory management --- in some cases,
such as simple patterns in ${...#...} or [[ ... = ... ]], possibly even
none, as there is now a static buffer.  It is also possible to copy
compiled patterns; just use memcpy and the size member of the struct.
Note that the default is to duplicate the static buffer into a
heap-allocated chunk of the appropriate size; this is needed for file
paths.

It now lives in its own file, pattern.c.  Originally everything here was
going to use unsigned char * to avoid bad comparisons with integers, but
then I found that all the tokens are defined in zsh.h as ((char) 0x83) and
so on, so that went out of the window and I resorted to char * after
updating the whole of the shell to pass unsigned char * down.  As you can
imagine, that was great fun.  Anyway, it now uses the usual STOUC(...)
hack, so problems are quite possible if I've missed any.  However, it's now
been compiled and tested on one machine with unsigned and one with signed
characters, so the basics are probably right.

There are two cpp definitions available: one is ZSH_PAT_DEBUG, which
enables the `patdebug' builtin, which prints out a compiled pattern with
option -p and with more than one argument tests the pattern against the
strings; this was inherited from Henry Spencer's code, but I thought I
might as well leave it.  There's no configure code to enable
ZSH_PAT_DEBUG, just add it by hand.

The other cpp definition is BACKREFERENCES, which enables backreferencing
of parentheses in patterns like sed and perl do.  I haven't enabled it
because I don't know what, if anything, I'm going to do with it.  Matching
is a little more efficient without because the extra recursive call on open
and close parentheses isn't there.  (There's code not to use backreferences
even if they are compiled in, so that doesn't need to be a performance
issue for ordinary globbing.)

Anyone who knows any computer science will probably notice that, while it's
pretty much a finite state machine, some states are more finite than
others.  In particularly, backtracking on excluded matches, where in
(foo~bar)rod the `bar' only has to match the `foo' and the `rod' is
irrelevant, need some extra state recording, and I have retained the old
trick that eliminates infinite loops and prevents exponential behaviour
when failing to match against things like `(f#o#)#' (the version of perl
here goes into an infinite loop when given `(f*o*)*', so it isn't trivial).
This has meant leaving holes for pointers in the compiled expression, which
ought to work (I've tried to make sure everything is aligned to pointer
size anyway) but is probably the most dubious part of the proceedings.

Finally, the usual warning: in case you hadn't guessed, bugs here can cause
problems with pretty much any part of the shell.  Some of the tests to make
sure things like globbing flags are saved and restored are a bit complex.
Alignment problems should have been handled, but there could be some
remaining, particularly on 64-bit architectures which I haven't tried.  So
if you're planning on doing any real work while I'm away, you might want to
stick with 3.1.6.

-- 
Peter Stephenson <pws@xxxxxxxxxxxxxxxxx>       Tel: +39 050 844536
WWW:  http://www.ifh.de/~pws/
Dipartimento di Fisica, Via Buonarroti 2, 56127 Pisa, Italy



Messages sorted by: Reverse Date, Date, Thread, Author