Zsh Mailing List Archive

[PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).

X-seq: zsh-workers 48786
From: Stephane Chazelas <stephane@xxxxxxxxxxxx>
To: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>
Cc: Zsh hackers list <zsh-workers@xxxxxxx>
Subject: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
Date: Wed, 5 May 2021 12:45:21 +0100
Archived-at: <https://zsh.org/workers/48786>
In-reply-to: <CAH+w=7bHxSbFr60ZU0+oZ6+qEejhfBYTzvL7=aXadY5XzWtSzw@mail.gmail.com>
List-id: <zsh-workers.zsh.org>
Mail-followup-to: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>, Zsh hackers list <zsh-workers@xxxxxxx>
References: <20191216211013.6opkv5sy4wvp3yn2@chaz.gmail.com> <20191216212706.i3xvf6hn5h3jwkjh@chaz.gmail.com> <20191217073846.4usg2hnsk66bhqvl@chaz.gmail.com> <20191217111113.z242f4g6sx7xdwru@chaz.gmail.com> <2ea6feb3-a686-4d83-ab27-6a582424487c@www.fastmail.com> <20200101140343.qwfx2xaojumuds3d@chaz.gmail.com> <20210430061117.buyhdhky5crqjrf2@chazelas.org> <CAH+w=7bHxSbFr60ZU0+oZ6+qEejhfBYTzvL7=aXadY5XzWtSzw@mail.gmail.com>

2021-04-30 16:13:34 -0700, Bart Schaefer:
[...]
> I went back and looked at the patch again.

Thanks. Here's a third version with further improvements
addressing some of the comments here.

> Tangential question:  "pgrep" commonly refers to grepping the process
> list, and is linked to "pkill".  I know "zpgrep" precedes this patch,
> but I'm wondering if we should rename it.

I agree, zshpcregrep or zpmatch may be better names. There is
already a pcregrep command, and zpcregrep would likely be read
as zip-pcregrep. I'll leave the rename out for now. IMO, zpgrep
serves more as example code than as a command people would
actually use, so it probably doesn't matter much.

> More directly about regexp-replace:
> 
> If $argv[4,-1] are going to be ignored/discarded, perhaps there should
> be a warning?  (Another thing that predates the patch, I know)

Agreed. I've addressed that.

> What do you think about replacing the final eval with typeset -g, as
> mentioned in workers/48760 ?

I've compared:

(1) eval -- $lvalue=\$value
(2) Bart's typeset -g -- $lvalue=$value
(3) Daniel's (zsh-workers 45073) : ${(P)lvalue::="$value"}

(1) to me is the most legible but if $lvalue is not a valid
lvalue, it doesn't necessarily return a useful error message to
the user (like when lvalue='reboot;var'...)

(2) is also very legible. It has the benefit (or inconvenience,
depending on PoV) of returning an error if the lvalue is not a
scalar. It reports an error (and exits the shell process) upon
incorrect lvalues (except ones such as "var=foo"). A major
drawback though is that it chokes on lvalue='array[n=1]' or
lvalue='assoc[keywith=characters]'.

(3) is the least legible. It also causes the lvalue to be
dereferenced twice. For instance with lvalue='a[++n]', n is
incremented twice. However, it does report an error upon invalid
lvalue (even though ${(P)lvalue} alone doesn't), and as we use
${(P)lvalue} above already, that has the benefit of that lvalue
being interpreted consistently. Non-scalar variables are
converted to scalar (like with (1)). It works OK for
lvalue='assoc[$key]' and lvalue='assoc[=]' or
lvalue='assoc[\]]' for instance.

Performance-wise, for the usual cases (lvalue being a simple
variable name and value short enough), (1) seems to be the worst
in my tests and (3) the best, with (2) very close. But that's
reversed for the less usual cases.

So I've gone for (3) and changed the code to limit the number of
times the lvalue is dereferenced. I've also addressed an issue
whereby regexp-replace empty '^' x would not insert x in ERE
mode.
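
The empty-string case comes down to where the loop tests for
remaining input. A small sketch, with made-up counters pre and
post standing in for the real loop bodies (this is not the
patch's code, only the loop shapes):

```shell
# zsh: compare a pre-test loop with a post-test loop on empty input.
str=            # empty input, as in: regexp-replace empty '^' x
pre=0 post=0

# Pre-test shape: the body is skipped entirely for an empty string.
while [[ -n $str ]]; do
  (( pre++ ))
  str=${str[2,-1]}    # drop the first character
done

# Post-test shape: the work sits in the condition list, so it runs
# once before the emptiness check decides whether to loop again.
while
  (( post++ ))
  [[ -n $str ]]
do
  continue
done

print -r -- "pre=$pre post=$post"
```

The second shape is the one that gives a match attempt a chance
to happen even when the variable starts out empty.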

(Note that it is affected by the (e) failure exit code issue
I've just raised separately. I'm not attempting to work around
it here, though I've added the X flag for error reporting to be
more consistent.)

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 8bf1a208e..db06d7925 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4328,7 +4328,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable.  POSIX extended regular expressions are used,
+on a variable.  POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4346,6 +4346,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..c947a2043 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,43 +1,109 @@
-# Replace all occurrences of a regular expression in a variable.  The
-# variable is modified directly.  Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly.  Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string.  This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-integer pcre
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue,
+#    expected to be scalar.  That lvalue will be evaluated once to
+#    retrieve the current value, and two more times (not just one as a
+#    side effect of using ${(P)varname::=$value}; FIXME) for the
+#    assignment of the new value if a substitution was made.  So lvalues
+#    such as array[++n] where the subscript is dynamic should be
+#    avoided.
+#
+# 2. regular expression
+#
+# 3. replacement string.  This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+#
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable.
 
-[[ -o re_match_pcre ]] && pcre=1
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 <varname> <regexp> [<replacement>]"
+  return 2
+fi
 
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
-local MATCH MBEGIN MEND
+
+# $5 is the string to be matched
+5=${(P)1}
+
+local    MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $6, $7 reserved for later use
+  4=0 7=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "$5"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first character of $IFS as usual)
+    6=${(Xe)3}
+    argv+=(${(s: :)ZPCRE_OP} "$6")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$(( argv[-2] + (argv[-3] == argv[-2]) ))
+  done
+
+  (( $# > 7 )) || return # no match
+
+  set +o multibyte
+
+  # $6 contains the result, $7 the current offset
+  6= 7=1
+  for 2 3 4 in "$@[8,-1]"; do
+    6+=${5[$7,$2]}$4
+    7=$(( $3 + 1 ))
+  done
+  6+=${5[$7,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+  while
+    if [[ $5 =~ $2 ]]; then
+      # append initial part and substituted match
+      6+=${5[1,MBEGIN-1]}${(Xe)3}
+      # truncate remaining string
+      if (( MEND < MBEGIN )); then
+	# zero-width match, skip one character for the next match
+	(( MEND++ ))
+	6+=${5[1]}
+      fi
+      5=${5[MEND+1,-1]}
+      # indicate we did something
+      7=1
+    fi
+    [[ -n $5 ]]
+  do
+    continue
+  done
+  [[ -n $7 ]] || return # no match
+  6+=$5
+fi
+
+# assign result to target variable if at least one substitution was
+# made.  At this point, if the variable was originally array or assoc, it
+# is converted to scalar. If $1 doesn't contain a valid lvalue
+# specification, an exception is raised (exits the shell process if
+# non-interactive).
+: ${(P)1::="$6"}

Follow-Ups:
- Re: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
  - From: Bart Schaefer
- Re: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
  - From: Lawrence Velázquez

References:
- Re: [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
  - From: Stephane Chazelas
- Re: [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
  - From: Bart Schaefer
