Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: compmatch behaviour



On Sun, 18 May 2008 16:57:53 -0700
Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
> So if I comprehend your question, it's not that you need help figuring
> out what the code is doing.  You want help figuring out what it should
> do instead, because the space of all possible wide characters is much
> to big to brute-force it the way Sven originally did.
> 
> Right?

Exactly.  Thanks for looking.

> There are two situations being handled
> simultaneously here, and maybe the first thing to do is to separate
> them.  The first situation is where wpat is a correspondence class
> and we need to select the corresponding position out of lpat.  The
> second case is where lpat is an equivalence class and we need to try
> every possible character in the class at line position *lp.

Hmm... terminology first... Sven's "correspondence class" appears to be the
one with the "equiv" flag set, i.e. {...}.  So here the characters are
numbered and we are searching for a particular one.  So currently (no
generic character sets) we can do this without looping (conceptually---the
implementation needs a loop to find the character but only needs to test
that one character).

However, in my rewrite I want to be able to say "any upper case character"
so that it can match the corresponding lower case character.  I can
count through the characters or [:stuff:] elements taking account of
ranges specially... but then I may just have a pattern such as [:upper:],
so I still need to match against the line.  So this becomes like the other
case.

(In case that's not clear: in the existing implementation we have "{a-z}"
where a is indexed 1, b is 2, etc. etc.  Now I have "{[:upper:]}".  Here
the only index is 1, but it comes along with a pattern matching instruction
"match any upper case character", and on the other side of the correpondence
"{[:lower:]}" is interpreted specially when we call pattern_match() as "the
corresponding lower case character".)

> I think the first think that's needed is to change the Cpattern struct
> from a dense array indexed by the ascii value of a character, into ...
> well, something else, so that it's not necessary to iterate over it
> in the first case, and so the iteration is more sparse in the second
> case.  Of course, that may make pattern_match() more complicated ....

I've already done this, that's the easy bit.  The two flies in the ointment
are this loop and wide character modifications.

> I'm not sure there's any way to avoid iterating in the second case.
> Besides handling two possible ways of interpreting the character class,
> I think (without tracing very far up the call chain) that this is also
> constructing the prefix string that's going to be shown to the user in
> the event of a common prefix shared by ambiguous matches.  It's not
> enough to check whether the test character is in the class, you might
> have to know *which* character it is in the class, so to speak.

(so you can see how this reflects on my "{[:upper:]}" problem for
correspondence classes.)

Even if iteration is necessary, which is plausible, I can't help thinking
there should be a way of iterating over positions in the word/line and
trying to match at each point with the pattern, looking to see which
character matched, rather than blindly trying characters at each position,
which won't work for every upper case character in Unicode.  If we can work
out how to change it along those lines then I think we've solved it.

(This is certainly a different problem from quadratic behaviour; it's more
along the lines of the way the ${foo//.../...} code works, or at least I
hope it is because that problem is soluble.)

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



Messages sorted by: Reverse Date, Date, Thread, Author