Zsh Mailing List Archive

Re: Surprising behaviour with numeric glob sort



On Jun 5, 12:54pm, Stephane Chazelas wrote:
}
} > Zsh has the additional complication of needing to deal with strings
} > having embedded '\0' bytes, which neither strcoll nor strxfrm is able
} > to deal with.  I'm not 100% confident that zsh's current algorithm
} > deals correctly with this either.
} 
} From what I can see (by using ltrace -e strcoll), zsh removes
} the NUL bytes before calling strcoll, so $'\0\0x' sorts the same
} as $'\0x' or 'x'.

Like I said, I think it does this wrong.  If I'm reading the code
correctly, it first compares the strings for absolute identity while
searching for embedded nuls, and if they are identical up to the nul
it then orders the shorter string before the longer one; otherwise
it skips past the last nul and then relies on strcoll() for the rest
of both strings.  It would seem to me that the collation order should
be checked before any nul as well as after, otherwise the first loop
might conclude the strings differ when strcoll() would order them the
same.  (However, see below.)
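
To make the suggestion concrete, here is a minimal sketch (not the code
in Src/sort.c) of checking collation order on every nul-separated
segment, before and after each nul.  The function name coll_with_nuls()
and its calling convention are hypothetical; it assumes each buffer
also has a '\0' at index len, so strcoll() always sees a terminated
segment.

```c
#include <assert.h>
#include <string.h>

/*
 * Compare byte strings a[0..alen) and b[0..blen), either of which may
 * contain embedded '\0' bytes.  Hypothetical helper for illustration.
 * Precondition: a[alen] == '\0' and b[blen] == '\0'.
 */
static int coll_with_nuls(const char *a, size_t alen,
                          const char *b, size_t blen)
{
    size_t i = 0, j = 0;

    assert(a[alen] == '\0' && b[blen] == '\0');

    while (i <= alen && j <= blen) {
        /* strcoll() only sees the segment up to the next '\0'. */
        int c = strcoll(a + i, b + j);
        if (c)
            return c;
        /* Segments collate equal: step past them and their '\0'. */
        i += strlen(a + i) + 1;
        j += strlen(b + j) + 1;
    }
    /* All shared segments collate equal: fewer segments sorts first. */
    return (i <= alen) - (j <= blen);
}
```

With this ordering a plain strcoll("a\0b", "a\0c") still returns 0,
since it stops at the first nul, while coll_with_nuls() goes on to
compare the "b" and "c" segments.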

} You mean a Schwartzian transform

Yes, much like that.  Src/sort.c already has a SortElt structure that
is used to sort metafied strings by comparing their unmetafied forms.
We only [*] need to add strxfrm() of the unmetafied strings in front
and remove strcoll() of transformed strings at comparison, and then
we're in business.  For example, following strxfrm() the assumptions
about absolute identity for nul handling suddenly become valid, so we
don't have to fix that separately -- we just have to strxfrm() all
the nul-separated substrings.
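
A rough sketch of that precomputation, assuming nothing about the real
SortElt layout: struct xfrm_key, make_key() and key_cmp() are
hypothetical names.  Each nul-separated segment is passed through
strxfrm() once, the results are concatenated with the '\0' separators
kept, and the comparison is then a plain memcmp().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical precomputed sort key: transformed bytes and length. */
struct xfrm_key {
    char *buf;
    size_t len;
};

/*
 * Build a key for s[0..len), which may contain embedded '\0' bytes.
 * Precondition: s[len] == '\0'.  Returns 0 on success, -1 if malloc()
 * fails.  The caller frees out->buf.
 */
static int make_key(const char *s, size_t len, struct xfrm_key *out)
{
    size_t cap = 0, pos = 0, i;
    char *buf;

    assert(s[len] == '\0');

    /* First pass: space needed for every transformed segment. */
    for (i = 0; i <= len; i += strlen(s + i) + 1)
        cap += strxfrm(NULL, s + i, 0) + 1;

    buf = malloc(cap);
    if (!buf)
        return -1;

    /* Second pass: transform each segment, keeping '\0' separators. */
    for (i = 0; i <= len; i += strlen(s + i) + 1)
        pos += strxfrm(buf + pos, s + i, cap - pos) + 1;

    out->buf = buf;
    out->len = pos - 1;     /* drop the final terminator */
    return 0;
}

/* Byte-wise comparison of two keys; a proper prefix sorts first. */
static int key_cmp(const struct xfrm_key *a, const struct xfrm_key *b)
{
    size_t n = a->len < b->len ? a->len : b->len;
    int c = memcmp(a->buf, b->buf, n);

    if (c)
        return c;
    return (a->len > b->len) - (a->len < b->len);
}
```

The keys would be built once per element before sorting, so the
comparison function never calls strcoll() or strxfrm() itself.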

} With a comparison function that does memcmp() on the "string"
} parts and a number comparison on the "num" parts?

Equivalent to that, yes.  (I don't think zero-padding will work as we
don't know how many zeroes are needed to make the strings be the same
number of digits.)
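
As an illustration of why padding is unnecessary, a digit run can be
compared by skipping leading zeroes and then comparing lengths before
bytes.  The sketch below is not zsh's implementation; numeric_cmp() and
digit_run_cmp() are hypothetical, and it works on raw bytes, assuming
the non-digit parts have already been turned into collation keys if
locale order matters.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Compare the digit runs starting at *pa and *pb by numeric value,
 * without converting to an integer type, and advance both pointers
 * past the runs.
 */
static int digit_run_cmp(const char **pa, const char **pb)
{
    const char *a = *pa, *b = *pb;
    size_t la = 0, lb = 0;

    /* Leading zeroes do not affect the value. */
    while (*a == '0') a++;
    while (*b == '0') b++;
    while (isdigit((unsigned char)a[la])) la++;
    while (isdigit((unsigned char)b[lb])) lb++;
    *pa = a + la;
    *pb = b + lb;

    /* More significant digits means a larger number. */
    if (la != lb)
        return la < lb ? -1 : 1;
    return memcmp(a, b, la);
}

/* Byte comparison on the "string" parts, numeric on the "num" parts. */
static int numeric_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a && *b) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            int c = digit_run_cmp(&a, &b);
            if (c)
                return c;
        } else {
            if (*a != *b)
                return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }
    /* Whichever string ran out first sorts first. */
    return (*a != '\0') - (*b != '\0');
}
```

Note that in this sketch "a007" and "a7" compare equal; a real
implementation would need some tie-break to keep the sort stable
across such names.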

} > For globbing, we'd have to rely on something else such as
} > whether MULTIBYTE is set.
} 
} Note that for globbing, the "numeric" sort applies after the
} "o+myfunc" or "oe:...:" transformation, so the strings to sort
} on may still contain all sorts of things

Whether there have been other globbing transforms turns out not to
matter.  The point about MULTIBYTE is that we have no glob flag we
can push around to indicate that the shell should assume there are
[not] wide characters in the comparison strings.

[*] That "only" is deceptive; it's actually a fairly hefty ask, for
reasons such as needing to handle case-insensitive comparisons too
(currently everything is forced through towlower() in that event).


