Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: Surprising behaviour with numeric glob sort

X-seq: zsh-workers 41228
From: Stephane Chazelas <stephane.chazelas@xxxxxxxxx>
To: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>
Subject: Re: Surprising behaviour with numeric glob sort
Date: Mon, 5 Jun 2017 12:54:39 +0100
Cc: Zsh hackers list <zsh-workers@xxxxxxx>
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:to:cc:subject:message-id:mail-followup-to:references :mime-version:content-disposition:in-reply-to:user-agent; bh=E1g9BKlreswAbv/94jLrrmyvOlli9z94wNBupB0l6Gs=; b=dTfboxDtI4Ro9BvKc+PJAipbKaMW2I8SJF7A02De/lVwEO94wnqOcCC/gR8Ss5NPKg gVUvqr4TumViTGjoXKmt9qgKKW/c4z7GKOXq/cFE4iW276kZwsGD+7+bbW7zDIo8zSzx IHSCENC33NU6zs47sBsPczNRQcbi/q//kMIFht+LWTGuCJxcrJ6mHlnkHy2u6LJzRJat ExiC3LMNLznBOEJpZsqqB5ibVfqLS5ByCPIBm3BwHjv4GvnOzcdoQOc+Awxex4ThO3MM WmWnfl1ebAaN6ybUycXgOSkVLDhcQb6C2w6s7l60TjnI0KsYyrAaDE/k6yejQo7DGCN5 Jsnw==
In-reply-to: <170604150135.ZM13291@torch.brasslantern.com>
List-help: <mailto:zsh-workers-help@zsh.org>
List-id: Zsh Workers List <zsh-workers.zsh.org>
List-post: <mailto:zsh-workers@zsh.org>
Mail-followup-to: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>, Zsh hackers list <zsh-workers@xxxxxxx>
Mailing-list: contact zsh-workers-help@xxxxxxx; run by ezmlm
References: <20170531212453.GA31563@chaz.gmail.com> <170601152943.ZM4783@torch.brasslantern.com> <20170602090332.GA6574@chaz.gmail.com> <170602161905.ZM10488@torch.brasslantern.com> <20170603211645.GA17785@chaz.gmail.com> <170603170724.ZM15645@torch.brasslantern.com> <20170604173157.GB9094@chaz.gmail.com> <170604150135.ZM13291@torch.brasslantern.com>

2017-06-04 15:01:35 -0700, Bart Schaefer:
> On Jun 4,  6:31pm, Stephane Chazelas wrote:
> }
> } "Slow" (though probably quite negligible compared to strcoll()
> } which already does things like in addition to a lot more hard
> } wark) but working.
> 
> According to one strcoll man page I have, the right thing to do is
> convert all the strings with strxfrm and then strcmp those.
> 
> It provides no advice on how to order the original array to match the
> sorted result of the xfrm'd array (the transform is not reversible),
> nor how to determine how much space to allocate for each transform.
> 
> Zsh has the additional complication of needing to deal with strings
> having embedded '\0' bytes, which neither strcoll nor strxfrm is able
> to deal with.  I'm not 100% confident that zsh's current algorithm
> deals correct with this either.

From what I can see (by using ltrace -e strcoll), zsh removes
the NUL bytes before calling strcoll, so $'\0\0x' sorts the same
as $'\0x' or 'x'.

That's as if $'\0' had IGNORE for all weights in the collation
order which I suppose is as valid as any. Per POSIX, such input
would not be text, so "sorting" them may not be very meaningful.
Same for strings that contain byte sequences that don't form
valid characters where there's no saying what strcoll() may do
with them.

Having NUL sort before any other character would be preferable
in the C locale that sorts by code point though (like to sort
UTF-16 text based on codepoint).

> A possible approach would be to pre-allocate a hash table and fill
> it on the fly with keys the original strings and values the strxfrm
> results.  An additional strxfrm could be called on the trailing bits
> after any embedded nul.  Then sort the original array by comparing
> the values in the hash.  Doesn't solve the question of how much space
> to reserve, but it would allow the current algorithm for picking out
> numeric substrings to be used.

You mean a Schwartzian transform
(https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts
on (here using $'\0zsh\0-1.2\0x' as an example):

{
  original <= $'\0zsh\0-1.2\0x'
  parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")},
            {num: 1},
	    {string: strxfrm(".")},
	    {num: 2},
	    {string: $'\0' . strxfrm("x")}]
}, ...

With a comparison function that does memcmp() on the "string"
parts and a number comparison on the "num" parts?

Yes, that sounds like a good idea that trades memory for
efficiency (by doing the strxfrm() hard work up front and only
once), but if we're going to transform upfront anyway, the
memory is traded off already.

Or same using zero-padding

{
  original <= $'\0zsh\0-1.2\0x'
  transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") .
                 "001" .  strxfrm(".") . "002" . $'\0' . strxfrm("x")}]
}, ...

(and using memcmp() in the comparison function)

With same benefits (a-1 a+2 a-3 sorted in that order where "-"
and "+" are ignored in the first pass like without
numericglobsort) and same drawback (need to find out the right
padding width).

> For parameter array sorting we could extend the (m) flag to indicate
> that this transformation is required [it already means to count byte
> widths for (l), (r), etc.] so as to avoid the overhead when it is not
> needed.  For globbing, we'd have to rely on something else such as
> whether MULTIBYTE is set.

Note that for globbing, the "numeric" sort applies after the
"o+myfunc" or "oe:...:" transformation, so the strings to sort
on may still contain all sorts of things like NUL bytes and byte
sequences that don't form valid characters even if they were not
in the file names in the first place.

Like in a

by_content() REPLY=$mapfile[$REPLY]
echo *(no+by_content)

-- 
Stephane

Follow-Ups:
- Re: Surprising behaviour with numeric glob sort
  - From: Stephane Chazelas
- Re: Surprising behaviour with numeric glob sort
  - From: Bart Schaefer
- Re: Surprising behaviour with numeric glob sort
  - From: Stephane Chazelas

References:
- Surprising behaviour with numeric glob sort
  - From: Stephane Chazelas
- Re: Surprising behaviour with numeric glob sort
  - From: Bart Schaefer
- Re: Surprising behaviour with numeric glob sort
  - From: Stephane Chazelas
- Re: Surprising behaviour with numeric glob sort
  - From: Bart Schaefer
- Re: Surprising behaviour with numeric glob sort
  - From: Stephane Chazelas
- Re: Surprising behaviour with numeric glob sort
  - From: Bart Schaefer
- Re: Surprising behaviour with numeric glob sort
  - From: Stephane Chazelas
- Re: Surprising behaviour with numeric glob sort
  - From: Bart Schaefer

Messages sorted by: Reverse Date, Date, Thread, Author