Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Surprising behaviour with numeric glob sort



2017-06-04 15:01:35 -0700, Bart Schaefer:
> On Jun 4,  6:31pm, Stephane Chazelas wrote:
> }
> } "Slow" (though probably quite negligible compared to strcoll()
> } which already does things like in addition to a lot more hard
> } wark) but working.
> 
> According to one strcoll man page I have, the right thing to do is
> convert all the strings with strxfrm and then strcmp those.
> 
> It provides no advice on how to order the original array to match the
> sorted result of the xfrm'd array (the transform is not reversible),
> nor how to determine how much space to allocate for each transform.
> 
> Zsh has the additional complication of needing to deal with strings
> having embedded '\0' bytes, which neither strcoll nor strxfrm is able
> to deal with.  I'm not 100% confident that zsh's current algorithm
> deals correct with this either.

From what I can see (by using ltrace -e strcoll), zsh removes
the NUL bytes before calling strcoll, so $'\0\0x' sorts the same
as $'\0x' or 'x'.

That's as if $'\0' had IGNORE for all weights in the collation
order which I suppose is as valid as any. Per POSIX, such input
would not be text, so "sorting" them may not be very meaningful.
Same for strings that contain byte sequences that don't form
valid characters where there's no saying what strcoll() may do
with them.

Having NUL sort before any other character would be preferable
in the C locale that sorts by code point though (like to sort
UTF-16 text based on codepoint).

> A possible approach would be to pre-allocate a hash table and fill
> it on the fly with keys the original strings and values the strxfrm
> results.  An additional strxfrm could be called on the trailing bits
> after any embedded nul.  Then sort the original array by comparing
> the values in the hash.  Doesn't solve the question of how much space
> to reserve, but it would allow the current algorithm for picking out
> numeric substrings to be used.

You mean a Schwartzian transform
(https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts
on (here using $'\0zsh\0-1.2\0x' as an example):

{
  original <= $'\0zsh\0-1.2\0x'
  parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")},
            {num: 1},
	    {string: strxfrm(".")},
	    {num: 2},
	    {string: $'\0' . strxfrm("x")}]
}, ...

With a comparison function that does memcmp() on the "string"
parts and a number comparison on the "num" parts?

Yes, that sounds like a good idea that trades memory for
efficiency (by doing the strxfrm() hard work up front and only
once), but if we're going to transform upfront anyway, the
memory is traded off already.

Or same using zero-padding

{
  original <= $'\0zsh\0-1.2\0x'
  transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") .
                 "001" .  strxfrm(".") . "002" . $'\0' . strxfrm("x")}]
}, ...

(and using memcmp() in the comparison function)

With same benefits (a-1 a+2 a-3 sorted in that order where "-"
and "+" are ignored in the first pass like without
numericglobsort) and same drawback (need to find out the right
padding width).

> For parameter array sorting we could extend the (m) flag to indicate
> that this transformation is required [it already means to count byte
> widths for (l), (r), etc.] so as to avoid the overhead when it is not
> needed.  For globbing, we'd have to rely on something else such as
> whether MULTIBYTE is set.

Note that for globbing, the "numeric" sort applies after the
"o+myfunc" or "oe:...:" transformation, so the strings to sort
on may still contain all sorts of things like NUL bytes and byte
sequences that don't form valid characters even if they were not
in the file names in the first place.

Like in a

by_content() REPLY=$mapfile[$REPLY]
echo *(no+by_content)

-- 
Stephane



Messages sorted by: Reverse Date, Date, Thread, Author