Zsh Mailing List Archive

Re: Surprising behaviour with numeric glob sort



2017-06-01 15:29:43 -0700, Bart Schaefer:
> On May 31, 10:24pm, Stephane Chazelas wrote:
> }
> } Maybe a better approach would be to break down the strings
> } between non-numeric and numeric parts and use strcoll() on the
> } non-numeric and number comparison on the numeric parts, stopping
> } at the first difference.
> 
> I don't think that helps, in the general case.  It would still mean
> the sort is not stable where the numeric parts are the same but the
> non-numeric part is partially-ordered.
> 
> To stabilize the sort we'd have to, for example, replace strcoll()
> with something that falls back to byte value ordering whenever the
> collation order of two characters is equivalent, but that requires
> lookahead (doesn't work on prefixes).
[...]

Sorry, my choice of words was poor. I shouldn't have used
"total" there.

OK, in a locale where A, B and C sort the same, globbing is
non-deterministic (with or without numericglobsort, with the
current situation or with the change I propose), but sorting is
still possible.

But with the comparison algorithm of zsh's current *(n), which
for some values of A, B, C has A < B, B < C and C < A, sorting is
just not possible. Some qsort() implementations will give one
result (which can't satisfy all of those), some have been known
to SEGV, some might loop indefinitely. But more importantly, it
gives unexpected results in real-life cases.
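
To make the A < B, B < C, C < A point concrete, here is a small
Python sketch (not zsh code) that searches a list of names for
such a cycle under a given comparison function. find_cycle and
toy_cmp are names made up for this illustration, and toy_cmp is a
deliberately broken rock-paper-scissors comparison, not zsh's
actual *(n) algorithm.

```python
from itertools import permutations


def find_cycle(names, cmp):
    """Return a triple (a, b, c) with a < b, b < c and c < a, or None.

    cmp(x, y) follows the strcoll()/qsort() convention: negative,
    zero or positive.
    """
    for a, b, c in permutations(names, 3):
        if cmp(a, b) < 0 and cmp(b, c) < 0 and cmp(c, a) < 0:
            return (a, b, c)
    return None


def toy_cmp(x, y):
    # Hypothetical, intentionally non-transitive comparison:
    # A < B, B < C, C < A.
    smaller_than = {("A", "B"), ("B", "C"), ("C", "A")}
    if (x, y) in smaller_than:
        return -1
    if (y, x) in smaller_than:
        return 1
    return 0


def codepoint_cmp(x, y):
    # An ordinary total order, for contrast.
    return (x > y) - (x < y)
```

With toy_cmp no ordering of A, B, C can satisfy all three
comparisons, which is the situation where a sort routine's result
is undefined. Running the same check against a candidate
comparison function over a directory listing is one way to look
for the problem.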

In a locale where A, B and C sort the same, with
numericglobsort, A2 B10 C1 should sort as C1 A2 B10, just like
without numericglobsort it should (and does) sort as C1 B10 A2¹,
or like print -l A B C | sort -u would give one line (A here
because of the last-resort memcmp() comparison). I have no
problem with that. That's the intention of the collation
algorithm (though I'd argue those locales are broken: locale
collation algorithms, at least the system ones, should have a
total order; that was more or less the conclusion of a related
discussion at the opengroup). But:

$ echo *(n)
zsh-10 zsh2 zsh10 zsh-3

(here in my en_GB.UTF-8 GNU locale)

is unexpected/broken. "zsh" sorts before "zsh-" in my locale, so
I'd expect the zsh2, zsh10 to come before zsh-3, zsh-10 which is
the basis of my proposal. In any case, zsh-3 should come before
zsh-10, nobody can argue against that.
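
A rough Python sketch of the proposed comparison (split into
non-numeric and numeric parts, collate the former, compare the
latter as numbers, stop at the first difference) could look like
the following. It is only an illustration, not zsh's C code:
split_chunks, chunked_compare and codepoint_collate are made-up
names, and the default collation is plain code point order
standing in for strcoll(). Passing Python's locale.strcoll as
collate (after locale.setlocale()) would use the locale's order
instead.

```python
import re
from functools import cmp_to_key


def split_chunks(name):
    # "zsh-10" -> ["zsh-", "10", ""]: even indexes hold non-numeric
    # text (possibly empty), odd indexes hold runs of ASCII digits.
    return re.split(r"([0-9]+)", name)


def codepoint_collate(a, b):
    # Stand-in for strcoll(): plain code point comparison.
    return (a > b) - (a < b)


def chunked_compare(x, y, collate=codepoint_collate):
    cx, cy = split_chunks(x), split_chunks(y)
    for i, (a, b) in enumerate(zip(cx, cy)):
        if i % 2:
            # Numeric part: compare as numbers.
            r = (int(a) > int(b)) - (int(a) < int(b))
        else:
            # Non-numeric part: compare with the collation function.
            r = collate(a, b)
        if r:
            return r  # stop at the first difference
    # All shared chunks tie: the name with fewer chunks sorts first.
    return (len(cx) > len(cy)) - (len(cx) < len(cy))


def chunked_sort(names, collate=codepoint_collate):
    return sorted(
        names,
        key=cmp_to_key(lambda x, y: chunked_compare(x, y, collate)),
    )
```

With the default collation, where "zsh" sorts before "zsh-",
chunked_sort(["zsh-10", "zsh2", "zsh10", "zsh-3"]) returns
zsh2 zsh10 zsh-3 zsh-10. With a collation that treats every
non-numeric part as equal, A2 B10 C1 comes out as C1 A2 B10.
Leading zeros (01 against 1) still tie in this sketch.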

In a locale where "zsh-" sorts the same as "zsh", *(n) currently
gives either zsh2 zsh-3 zsh10 zsh-10 or zsh2 zsh-3 zsh-10 zsh10,
both of which are fine with me. And it wouldn't change with my
proposal. It would be nice to have a consistent order, for
instance by implementing a last-resort memcmp()-based comparison
like "sort" does without -s, but that's nowhere near as important
a problem: in my experience, real-life file names don't have
parts that sort the same in any locale (and only GNU systems, in
my experience, have locales with such non-total orders, for the
most part unintentionally, like the ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ that
sort the same in GNU locales).
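
The last-resort comparison idea can be sketched in Python as a
wrapper around any comparison function: when that function
reports a tie, fall back to comparing the encoded bytes, which is
roughly what memcmp() would do. with_bytewise_fallback and
all_equal are hypothetical names for this illustration, and
UTF-8 is an assumed encoding.

```python
from functools import cmp_to_key


def with_bytewise_fallback(cmp, encoding="utf-8"):
    """Wrap cmp so that ties are broken by raw byte order."""
    def wrapped(x, y):
        r = cmp(x, y)
        if r:
            return r
        bx, by = x.encode(encoding), y.encode(encoding)
        return (bx > by) - (bx < by)
    return wrapped


def all_equal(x, y):
    # Hypothetical collation in which every name sorts the same.
    return 0


stable_cmp = with_bytewise_fallback(all_equal)
first = sorted(["zsh10", "zsh-10"], key=cmp_to_key(stable_cmp))
second = sorted(["zsh-10", "zsh10"], key=cmp_to_key(stable_cmp))
```

Both first and second come out in the same order whatever the
input order was, since "-" (0x2d) is a smaller byte than "1"
(0x31). Note that the fallback is applied to whole strings only,
so it does not address the prefix/lookahead concern raised in the
quoted text.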

¹ Note that the fact that x-A xB x-C sorts in that order in GNU
non-C locales is not because "x" sorts the same as "x-" but
because the primary weight of a "-" is "IGNORE" so when
comparing x-A and xB, strcoll() first compares "xA" with "xB".
If it was xA against x-A, then the other weights would be
considered, which would sort "xA" before "x-A".


-- 
Stephane


