Zsh Mailing List Archive

Re: PATCH: =~ regex match



Phil Pennock <zsh-workers+phil.pennock@xxxxxxxxxxxx> wrote:
> I was also thinking about how to deal with UTF8, which is another
> potential advantage to sticking with PCRE.  Zsh isn't specifically
> UTF-8 when in widechar, is it?

That's correct, but that's actually an advantage of the system regular
expression libraries, which will use the locale in the same way as the
rest of the system to handle multibyte strings.

> #if defined(MULTIBYTE_SUPPORT) && defined(HAVE_NL_LANGINFO) && defined(CODESET)
>   {
>     static int have_utf8_pcre = -1;
> 
>     if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
>       if (have_utf8_pcre == -1) {
>         if (pcre_config(PCRE_CONFIG_UTF8, &have_utf8_pcre)) {
> 	  have_utf8_pcre = -2; /* erk, failed to ask */
> 	}
>       }
> 
>       if (have_utf8_pcre > 0) {
>         pcre_opts |= PCRE_UTF8;
>       }
>     }
>   }
> #endif
> 
> Which means that in non-UTF-8 multibyte locales, you'll get per-octet
> regexps, but in UTF-8 locales, a multibyte zsh with a libpcre also
> built with UTF-8 support will let you get "proper" matching.

You might want to add that to the pcre library, if appropriate; you
probably also need to test for isset(MULTIBYTE) since unsetting the
multibyte option is supposed to force all strings to be single bytes.
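
As a rough sketch of the combined test (not actual zsh source: the
helper name want_pcre_utf8 and its three parameters are made up for
illustration, standing in for isset(MULTIBYTE), nl_langinfo(CODESET)
and the value reported by pcre_config), the decision could look
something like this:

```c
#include <string.h>

/*
 * Hypothetical helper, not taken from zsh: decide whether PCRE_UTF8
 * should be requested.
 *   multibyte_set   - stand-in for isset(MULTIBYTE)
 *   codeset         - stand-in for nl_langinfo(CODESET), may be NULL
 *   pcre_utf8_built - value reported via PCRE_CONFIG_UTF8 (> 0 if built in)
 */
static int want_pcre_utf8(int multibyte_set, const char *codeset,
                          int pcre_utf8_built)
{
    /* Option unset: all strings are to be treated as single bytes. */
    if (!multibyte_set)
        return 0;
    /* Only UTF-8 locales get UTF-8 matching; others stay per-octet. */
    if (codeset == NULL || strcmp(codeset, "UTF-8") != 0)
        return 0;
    /* The library itself must have been built with UTF-8 support. */
    return pcre_utf8_built > 0;
}
```

The caller would then set PCRE_UTF8 in the options only when this
returns non-zero.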

> I'm envious of the =~ operator but that doesn't mean that I want to
> lose the funky stuff of PCRE when I use it -- I like negative
> lookahead assertions, freak that I am.

I don't think there's any question of removing -pcre-match.

> As to BASH_REMATCH ... how frowned upon are new zsh options which
> auto-set for compatibility?  It wouldn't be hard, since the
> infrastructure's all already in place.  Call the zsh option
> BASH_REMATCH to set the BASH_REMATCH variable.  :^)

That would be perfectly sensible.

> If I code this up, is it likely to make it in?  If not, I won't bother
> as full bash compatibility isn't so important to me, only having =~.
> It's not like POSIX is involved here ...

Well, actually it is, since basic shell features should use basic system
features wherever possible rather than requiring optional libraries.  If
we're going to add =~ because it's in bash I don't see any real point
in duplicating -pcre-match to do it, and the POSIX
regcomp/regexec/regerror/regfree should be available just about
everywhere.
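
For illustration only, a minimal use of those four functions might look
like the following.  The helper name ere_match and its return
convention (0 match, 1 no match, 2 error) are assumptions of this
sketch, not anything in zsh, and using extended syntax (REG_EXTENDED)
is likewise just a choice made here:

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: match subject against a POSIX extended regex.
 * Returns 0 on match, 1 on no match, 2 on error.  On a match, the
 * whole matched portion is copied (truncated if need be) into whole.
 */
static int ere_match(const char *pattern, const char *subject,
                     char *whole, size_t whole_len)
{
    regex_t re;
    regmatch_t m[1];
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0) {
        char errbuf[128];
        /* regerror turns the error code into a readable message. */
        regerror(rc, &re, errbuf, sizeof errbuf);
        fprintf(stderr, "regcomp: %s\n", errbuf);
        return 2;   /* nothing to regfree after a failed regcomp */
    }

    rc = regexec(&re, subject, 1, m, 0);
    if (rc == 0 && whole_len > 0) {
        /* m[0] holds the byte offsets of the entire match. */
        size_t n = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (n >= whole_len)
            n = whole_len - 1;
        memcpy(whole, subject + m[0].rm_so, n);
        whole[n] = '\0';
    }

    regfree(&re);
    return rc == 0 ? 0 : (rc == REG_NOMATCH ? 1 : 2);
}
```

The system library interprets the strings according to the current
locale, so nothing UTF-8 specific appears in the code.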

When that happens...

> I just double-checked something in passing and discovered that Bash
> uses the equivalent of KSH_ARRAYS, so the variable would need to be
> marked similarly to that and provided with the entire matched portion
> of the string in index 0.

We'll do it the usual way and respect the setting of KSH_ARRAYS.  This
is on in bash compatibility mode.  If that's not set, but BASH_REMATCH
is, we'll put the first match in $BASH_REMATCH[1].

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070




