Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Surprising behaviour with numeric glob sort



2017-06-06 16:44:59 +0200, Vincent Lefevre:
> On 2017-06-03 22:16:46 +0100, Stephane Chazelas wrote:
> > 2017-06-02 16:19:05 -0700, Bart Schaefer:
> > > Well, one could argue that "-10" should be treated as negative ten
> > > and therefore should sort before negative three, but I'm not sure
> > > we want to get into that.
> > 
> > The (my at least) main usage for *(n) is to sort version numbers
> > like zsh-3.0, zsh-3.1, zsh-4. So handling negative numbers
> > wouldn't help in those cases.
> 
> I often use "-" as a separator, so that I expect "foo-1" to come
> before "foo-2".
> 
> And note that in Unicode, the real minus sign is U+2212.

Maybe so, but in virtually every programming language,
hyphen-minus is used for negation instead. In any case, I don't
think Bart (who
brought it up in the first place) was suggesting we should
change the behaviour in that regard.

> > When comparing "zsh-3" with "zsh2", we compare the non-numeric
> > prefix: "zsh-" and "zsh". And already, at that point, "zsh" is
> > less than "zsh-", so we stop here (zsh2 < zsh-3)
> 
> I don't think this is the correct method. In some locales, digits
> come after "-". So, IMHO, "zsh-0" should be compared with "zsh0".
> 
> I expect numeric sort and the normal sort to be equivalent when all
> numbers have a single digit. Numeric sort is just a generalization
> to an infinite digit set (a number being regarded as an element
> of this digit set).
[...]

Of the approaches mentioned so far, only the ones that call
strcoll() (not strxfrm()) on names whose numbers have been
0-padded would satisfy that:

$ echo *
zsh0 zsh-0 zsh1 zsh-1 zsh10 zsh-10 zsh2 zsh-2
$ echo *(n)
zsh0 zsh-0 zsh1 zsh-1 zsh2 zsh10 zsh-2 zsh-10
$ n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}}
$ echo *(o+n)
zsh0 zsh-0 zsh1 zsh-1 zsh2 zsh-2 zsh10 zsh-10
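
A minimal sketch of what the n() function above does, written in
C: pad every run of ASCII digits to 20 characters, then compare
the padded keys with strcoll(). The names pad_numbers and
padded_strcoll are hypothetical (they are not taken from the zsh
source), runs longer than 20 digits are simply kept whole, and
the scan is byte-wise, so it assumes an encoding in which the
bytes 0x30..0x39 never occur inside a multibyte character.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAD_WIDTH 20  /* same width as the l:20::0: padding in n() */

/* Return a malloc()ed copy of s in which every run of ASCII digits
 * is left-padded with zeros to PAD_WIDTH characters. Runs that are
 * already longer are copied unchanged. Returns NULL if malloc fails.
 * Byte-wise scan: an assumption that only holds for encodings where
 * digit bytes never appear inside a multibyte character. */
static char *pad_numbers(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    /* Worst case: every byte is a one-digit run padded to PAD_WIDTH. */
    char *out = malloc(len * PAD_WIDTH + 1);
    if (out == NULL)
        return NULL;

    char *p = out;
    while (*s != '\0') {
        if (*s >= '0' && *s <= '9') {
            size_t run = strspn(s, "0123456789");
            for (size_t i = run; i < PAD_WIDTH; i++)
                *p++ = '0';
            memcpy(p, s, run);
            p += run;
            s += run;
        } else {
            *p++ = *s++;
        }
    }
    *p = '\0';
    return out;
}

/* Compare two names by their padded keys, using the collation order
 * of the current LC_COLLATE locale. Returns 0 if a key could not be
 * allocated. */
static int padded_strcoll(const char *a, const char *b)
{
    char *ka = pad_numbers(a);
    char *kb = pad_numbers(b);
    int r = (ka != NULL && kb != NULL) ? strcoll(ka, kb) : 0;
    free(ka);
    free(kb);
    return r;
}
```

The resulting order depends on the locale selected with
setlocale(): in the C locale strcoll() behaves like strcmp(), so
the interleaved order shown in the transcript would only appear
under a locale whose collation gives "-" a low weight.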

The approaches that split into non-numerical and numerical parts
(even those that concatenate the result of several strxfrm()
with 0-padded numbers) would sort the above as

zsh0 zsh1 zsh2 zsh10 zsh-0 zsh-1 zsh-2 zsh-10

(Personally, I prefer that order, though I suppose you could
argue that I should switch to the C locale if that is the order
I expect. Do you have a real-life example where the other order
would be preferable?)
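
The split approach described above can be sketched in C as
follows. This is only an illustration under stated assumptions:
split_cmp and its helpers are hypothetical names, each
non-numeric chunk is compared with strcoll() directly (rather
than by concatenating strxfrm() results), and digits are found
with a plain byte scan.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* strcoll() on the first na bytes of a and the first nb bytes of b. */
static int coll_n(const char *a, size_t na, const char *b, size_t nb)
{
    char *ca = malloc(na + 1);
    char *cb = malloc(nb + 1);
    int r = 0;
    if (ca != NULL && cb != NULL) {
        memcpy(ca, a, na);
        ca[na] = '\0';
        memcpy(cb, b, nb);
        cb[nb] = '\0';
        r = strcoll(ca, cb);
    }
    free(ca);
    free(cb);
    return r;
}

/* Compare two digit runs by value without integer conversion:
 * strip leading zeros, then the longer run is the larger number,
 * then equal-length runs compare byte by byte. */
static int num_cmp(const char *a, size_t na, const char *b, size_t nb)
{
    while (na > 1 && *a == '0') { a++; na--; }
    while (nb > 1 && *b == '0') { b++; nb--; }
    if (na != nb)
        return na < nb ? -1 : 1;
    return memcmp(a, b, na);
}

/* Alternate between non-numeric chunks (locale collation) and
 * numeric chunks (numeric value); stop at the first difference. */
static int split_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;;) {
        size_t na = strcspn(a, "0123456789");
        size_t nb = strcspn(b, "0123456789");
        int r = coll_n(a, na, b, nb);
        if (r != 0)
            return r;
        a += na;
        b += nb;

        na = strspn(a, "0123456789");
        nb = strspn(b, "0123456789");
        if (na == 0 || nb == 0)      /* at least one string has ended */
            return (na > 0) - (nb > 0);
        r = num_cmp(a, na, b, nb);
        if (r != 0)
            return r;
        a += na;
        b += nb;
    }
}

/* Adapter so split_cmp() can be used with qsort() on char * arrays. */
static int split_cmp_qsort(const void *pa, const void *pb)
{
    return split_cmp(*(const char *const *)pa, *(const char *const *)pb);
}
```

Because "zsh" is a proper prefix of "zsh-", the first chunk
comparison already settles every zshN versus zsh-N pair, which
is why all the zshN names end up before the zsh-N ones.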

That 0-padding could still potentially be invalid if there were
collating elements that end in decimal digits in the locale.

Note that there are charsets like GB18030 that have characters
whose encoding ends in the 0x30 byte (the ASCII encoding of the
digit 0).

Â (\uc2) (and several thousand others) is one of them:

$ LC_ALL=zh_CN.gb18030 zsh -c 'printf "\uc2"' | hd
00000000  81 30 87 30                                       |.0.0|
00000004

My

n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}}

zsh function would not touch those 0x30 bytes, since they are
not actual 0s but part of other characters. I suppose the
equivalent done in C inside zsh would have to take that into
account, though (it could not simply walk the byte values
looking for ASCII decimal digits).
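
To illustrate the difference, here is a rough C sketch that
contrasts a naive byte walk with a character-aware scan based on
mbrtowc(). Both function names are hypothetical, and the second
one assumes the caller has already selected the right LC_CTYPE
locale with setlocale().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Naive version: counts every byte in 0x30..0x39, even when that
 * byte is the tail of a multibyte character. */
static size_t count_digit_bytes(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (*s >= '0' && *s <= '9')
            n++;
    return n;
}

/* Character-aware version: decodes s with mbrtowc() according to
 * the current LC_CTYPE locale and counts only those digits that
 * are complete single-byte characters. */
static size_t count_digit_chars(const char *s)
{
    assert(s != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);

    size_t n = 0;
    size_t left = strlen(s);
    while (left > 0) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, left, &st);
        if (len == 0)
            break;                   /* embedded NUL: stop scanning */
        if (len == (size_t)-1 || len == (size_t)-2) {
            /* Invalid or truncated sequence: skip one byte, reset. */
            memset(&st, 0, sizeof st);
            len = 1;
        } else if (len == 1 && *s >= '0' && *s <= '9') {
            n++;
        }
        s += len;
        left -= len;
    }
    return n;
}
```

On the four bytes 81 30 87 30 from the hd output above, the byte
walk reports two digits regardless of locale. Under a GB18030
locale the mbrtowc() version should consume all four bytes as a
single character and report none.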

-- 
Stephane


