Zsh Mailing List Archive

Re: Stuff to do

X-seq: zsh-workers 22788
From: Peter Stephenson <pws@xxxxxxx>
To: zsh-workers@xxxxxxxxxx
Subject: Re: Stuff to do
Date: Fri, 29 Sep 2006 18:08:43 +0100
In-reply-to: <200609292037.17847.arvidjaar@xxxxxxxxxx>
Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
Organization: Cambridge Silicon Radio
References: <200609271211.k8RCBW5N023914@xxxxxxxxxxxxxx> <200609292037.17847.arvidjaar@xxxxxxxxxx>

Andrey Borzenkov <arvidjaar@xxxxxxxxxx> wrote:
> 1. matcher code assumes character == byte and is using 256 bytes array to
> build character equivalence classes. What is worse, it is passing this array
> around between different functions to supply results of previous matching. I
> have here patch (attached) that eliminates external dependency on this array
> so matcher internals can be more easily changed. This seems to make code a
> bit more understandable irrespectively :) OK to commit?

Yes, the more the calling conventions are sanitized like this the better
I like it.  The references to external data are one of my worst
nightmares.

> 2. Usage of magic array for character classes ([abcd]) can be naturally
> superseded by using either generic pattern matching or direct comparison.
> Pattern matching provides for using something like [[:lower:]] and possibly
> using matchers etc but potential side effects of extended globbing need
> review. I do not know what is faster. Is it OK?

I'd be quite keen on being able to do this by using globbing.  I think the
current uses of matcher specifications are limited enough (sometimes by
necessity, as we're seeing) that an extension wouldn't be a problem for
compatibility; however, I don't know how to mix this with the equivalence
class stuff.  It would be quite nice to keep it in one place in pattern.c,
but I doubt if that's going to work with all the additions we need.

> 3. Equivalence classes ({abcd}={xyzw}) do not scale beyond single byte
> characters. But if we check usage, I believe it has never been used for
> anything beyond case-insensitive matching. For this particular usage I
> suggest using new matcher type:
>
> m:LPAT>upper
> m:LPAT>lower
>
> with obvious semantic - character from line is converted to lower or upper and
> compared with character from potential match. So m:{a-z}={A-Z} becomes
> m:?>upper etc.
>
> We still can implement {...} for character _set_ but not for character range.
> So far I do not consider it major problem.

I think we'll need to keep it working for ASCII for compatibility, but not
extending it to other characters is, as you say, not a big problem.
However, maybe it's not a problem at all; see below.
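
As a rough illustration of the proposed m:LPAT>upper / m:LPAT>lower
semantics (convert the character from the line, then compare it with the
character from the potential match), here is a minimal sketch in C.  The
names fold_mode and fold_match are hypothetical and do not come from the
zsh source; the sketch assumes both characters are already available as
wide characters.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical fold modes for the proposed m:LPAT>upper / m:LPAT>lower. */
enum fold_mode { FOLD_UPPER, FOLD_LOWER };

/*
 * Convert the character typed on the line, then compare it with the
 * character from the candidate match.  Returns 1 on a match, 0 otherwise.
 * Case conversion follows the current locale via towupper()/towlower().
 */
int fold_match(wchar_t line_ch, wchar_t match_ch, enum fold_mode mode)
{
    wint_t folded = (mode == FOLD_UPPER) ? towupper((wint_t)line_ch)
                                         : towlower((wint_t)line_ch);
    return folded == (wint_t)match_ch;
}
```

With this, fold_match(L'a', L'A', FOLD_UPPER) plays the role that
m:{a-z}={A-Z} plays today, without needing a table indexed by byte value.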

> 4. The hardest part. Right anchor. For this matcher must match _backward_. I
> am not aware of any way to walk backward as long as we assume arbitrary
> encoding. Options apparently are
>...
> b) convert this code to use wide characters. Not sure if this is a viable
> option.

This is the option I was thinking about, and it removes the range problem
since it extends the ASCII logic in a natural way (it may be system
dependent, but that's the absolute least of our worries).
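
To make the point concrete: once both strings are arrays of wide
characters, walking backward for a right anchor is plain index
arithmetic, and a range test is the same comparison used for ASCII.  The
following is only a sketch under that assumption; in_range and
match_right_anchor are made-up names, not functions from the completion
code.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Range test on wide characters: the ASCII logic "lo <= c <= hi"
 * carried over unchanged.  The ordering of non-ASCII wchar_t values
 * is system dependent.
 */
int in_range(wchar_t c, wchar_t lo, wchar_t hi)
{
    return lo <= c && c <= hi;
}

/*
 * Return 1 if the last plen characters of str equal pat, comparing
 * from the right-hand end towards the left; return 0 otherwise.
 */
int match_right_anchor(const wchar_t *str, size_t slen,
                       const wchar_t *pat, size_t plen)
{
    if (plen > slen)
        return 0;
    while (plen > 0) {
        --plen;
        --slen;
        if (str[slen] != pat[plen])
            return 0;
    }
    return 1;
}
```

The same loop over raw bytes in an arbitrary multibyte encoding would
have no reliable way to find where the previous character starts.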

I don't think it's a problem using wide characters locally for the
comparisons.  Indeed, the pattern match code does all its character class
stuff with wide characters (or kludged wide characters which are just the
unsigned char values if a multibyte sequence doesn't convert).  It doesn't
really make sense to allow for unconvertible characters in matcher
comparisons---it's great to be able to insert them on the command line in
some fashion, but the matcher specs only make sense for characters that are
convertible.
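
For reference, the "kludged wide character" idea can be sketched with
the standard mbrtowc() call: convert what converts, and store the
unsigned char value for any byte that does not.  This is an
illustration of the technique, not the code in pattern.c, and the
function name to_wide_kludged is invented for the example.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert len bytes of src to wide characters in dst, which must have
 * room for len elements.  Bytes that do not convert in the current
 * locale are stored as their unsigned char value.
 * Returns the number of wide characters written.
 */
size_t to_wide_kludged(const char *src, size_t len, wchar_t *dst)
{
    mbstate_t st;
    size_t in = 0, out = 0;

    memset(&st, 0, sizeof st);
    while (in < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, src + in, len - in, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or incomplete sequence: keep the raw byte. */
            dst[out++] = (wchar_t)(unsigned char)src[in++];
            /* The conversion state is unusable after an error. */
            memset(&st, 0, sizeof st);
        } else {
            if (n == 0)
                n = 1;      /* an embedded NUL occupies one byte */
            dst[out++] = wc;
            in += n;
        }
    }
    return out;
}
```

A matcher built on this would see one element per character for
convertible text, and one element per byte for anything else.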

The worst problem is that we lose the ability to do matching control where
(say) much of the string is ASCII, and our match rules only use ASCII, but
there are also characters that don't work in the current locale.  I don't
think this is a big issue and there are possible ways round:
- partial conversion
- convert them at this stage to $'\...' sequences instead of later
- use marked wide characters where we record a byte that can't be converted
--- any of which could be bolted on later.  So I don't think that's a
showstopper.
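
The third option might look something like the sketch below.  It is
purely hypothetical: RAW_MARK and the three helpers are invented for
illustration, and the choice of 0xDC00 as a base assumes that wchar_t
holds Unicode code points, so that the marked values fall in the
surrogate range and never collide with a real character.  Neither
assumption is guaranteed by the C standard.

```c
#include <wchar.h>

/* Hypothetical marker base; assumed not to overlap real characters. */
#define RAW_MARK 0xDC00

/* Record an unconvertible byte as a marked wide character. */
wchar_t mark_raw_byte(unsigned char b)
{
    return (wchar_t)(RAW_MARK + b);
}

/* Return 1 if wc is a marked raw byte rather than a real character. */
int is_marked(wchar_t wc)
{
    return wc >= RAW_MARK && wc <= RAW_MARK + 0xFF;
}

/* Recover the original byte from a marked wide character. */
unsigned char marked_byte(wchar_t wc)
{
    return (unsigned char)(wc - RAW_MARK);
}
```

Matcher comparisons could then simply refuse to treat a marked value as
equivalent to anything but itself, while the original byte can still be
recovered when the string is converted back.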

I was wondering how much of the code we needed to convert to use wide
characters, and vaguely came to the conclusion the only reasonably sane way
was to do it fairly locally within the comparison function(s), since
otherwise the interface to the rest of the completion system gets very
hairy.  However, I haven't actually looked at the code again since
coming to that conclusion.

However, if there's an easy way of doing it by another method, fine.  I
suspect there isn't.

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070
