Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Surprising behaviour with numeric glob sort



2017-06-05 12:54:39 +0100, Stephane Chazelas:
[...]
> You mean a Schwartzian transform
> (https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts
> on (here using $'\0zsh\0-1.2\0x' as an example):
> 
> {
>   original <= $'\0zsh\0-1.2\0x'
>   parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")},
>             {num: 1},
> 	    {string: strxfrm(".")},
> 	    {num: 2},
> 	    {string: $'\0' . strxfrm("x")}]
> }, ...
> 
> With a comparison function that does memcmp() on the "string"
> parts and a number comparison on the "num" parts?
> 
> Yes, that sounds like a good idea that trades memory for
> efficiency (by doing the strxfrm() hard work up front and only
> once), but if we're going to transform upfront anyway, the
> memory is traded off already.
> 
> Or same using zero-padding
> 
> {
>   original <= $'\0zsh\0-1.2\0x'
>   transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") .
>                  "001" .  strxfrm(".") . "002" . $'\0' . strxfrm("x")}]
> }, ...
[...]

That would not be valid I (now) believe.

The transformation done by strxfrm() is opaque. For all we now,
it could very well return:

  strxfrm("foo")    -> "455"
  strxfrm("bar")    -> "217"
  strxfrm("foobar") -> "45517"

So comparing "foo1bar" "foo6bar" "foobar" with our function
(with adapted width padding) would be memcmp()ing "4551217",
"4556217" and "45517" resulting in potentially surprising
orders.

Even only concatenating the result of several strxfrm() is
probably not valid an approach.

It's probably OK if they're interspersed with NUL bytes as that
byte can't be found in the result of strxfrm() (and sorts before
anything else for memcmp()), but even then, when doing "human
collation" sorting, it probably makes more sense to consider
them as padding (like space) than the thing that must sort
before anything else, so removing them (or replacing them with
spaces to have a deterministic order) instead of inserting them
as \0 in between several strxfrm()s is just as valid and maybe
even preferable.

I think that if I had to implement it, I'd take the lazy
approach of:
 - in the C locale, use memcmp() with sequences of digits
   replaced with their 20-width 0-padding.
 - in other locales, strip the NULs and use strcoll() with
   sequences of digits replaced with their 20-width 0-padding
   (or memcmp() of the stxfrm() of the above, but with that 20x
   factor for the padding combined with a typical 5x factor for
   strxfrm(), the memory usage penalty may be too high).

Or the first approach mentioned above even if Vincent wouldn't
like it.

-- 
Stephane



Messages sorted by: Reverse Date, Date, Thread, Author