Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Surprising behaviour with numeric glob sort



On Jun 7,  9:41am, Stephane Chazelas wrote:
}
} I think that if I had to implement it, I'd take the lazy
} approach of:
}  - in the C locale, use memcmp() with sequences of digits
}    replaced with their 20-width 0-padding.
}  - in other locales, strip the NULs and use strcoll() with
}    sequences of digits replaced with their 20-width 0-padding
}    (or memcmp() of the strxfrm() of the above, but with that 20x
}    factor for the padding combined with a typical 5x factor for
}    strxfrm(), the memory usage penalty may be too high).
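
A minimal sketch of the padding idea quoted above, for the C-locale case
only. The helper name pad_digit_runs() and the PAD_WIDTH macro are made up
for illustration and are not zsh code; the sketch also assumes ASCII digits
and NUL-terminated input, so it does not handle embedded NULs.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define PAD_WIDTH 20  /* the width suggested in the quoted text */

/*
 * Hypothetical helper (not zsh code): return a malloc'd copy of s in
 * which every run of ASCII digits shorter than PAD_WIDTH is left-padded
 * with '0' to PAD_WIDTH bytes.  Longer runs are copied unchanged.
 * The caller frees the result.
 */
static char *pad_digit_runs(const char *s)
{
    size_t n = strlen(s);
    /* Worst case: every byte is a one-digit run of its own. */
    char *out = malloc(n * PAD_WIDTH + 1);
    char *o = out;

    if (!out)
        return NULL;
    while (*s) {
        if (isdigit((unsigned char)*s)) {
            const char *start = s;
            size_t len, i;

            while (isdigit((unsigned char)*s))
                s++;
            len = (size_t)(s - start);
            for (i = len; i < PAD_WIDTH; i++)
                *o++ = '0';
            memcpy(o, start, len);
            o += len;
        } else {
            *o++ = *s++;
        }
    }
    *o = '\0';
    return out;
}
```

With both strings padded this way, a plain strcmp() or memcmp() orders
"file9.txt" before "file10.txt", at the cost of up to a 20x larger key.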

I looked at this for a while and decided that I have neither the free
time nor the correct runtime environment to tackle it.

By my reading of the code, the following steps are necessary:

A. Update configure.ac --
   1. check for availability of wcsxfrm() or if not then strxfrm()
   2. check for wcscoll() or strcoll()
B. In sort.c:strmetasort() --
   1. If *unmetalenp is NULL, unmetafy all the strings and generate
      the corresponding lengths array
   2. Starting with the unmetafied strings, step through them wide-
      character-wise to
      a. find any null bytes
      b. find any numeric substrings
      c. perform the appropriate transformation, e.g. expansion to
         20 digits with leading zeros; this regenerates the *cmp and
         len values in each SortElt
   3. Apply any available strxfrm variant to every *cmp in SortElt
      array, updating len as needed
C. Update sort.c:eltpcmp() such that --
   1. In C locale or if there is a strxfrm or if there is no strcoll,
      apply strcmp(as->cmp, bs->cmp)
   2. Else apply available variant of strcoll(as->cmp, bs->cmp)
D. Similarly adapt zstrcmp(), which will have the side-effect of
   making the cases where it is applied a lot more computationally
   expensive because it can't precalculate like strmetasort().
E. Do all of the above while preserving various optimizations such as
   skipping step (B.2.b) when not numerically sorting, skipping (B.3)
   and using (C.1) when the original has no wide/metafied characters,
   etc.
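
To make (B.3) and (C.1) concrete, here is a rough sketch of precomputing
strxfrm() keys once and then sorting with plain strcmp(). The struct
sort_key type and the functions make_key() and key_cmp() are hypothetical
stand-ins, not the real SortElt or eltpcmp(); the sketch ignores metafied
input, embedded NULs, and the wide-character variant.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a sort element; not the real SortElt. */
struct sort_key {
    const char *orig;  /* original string, kept for output */
    char *cmp;         /* transformed key used only for comparison */
};

/* Roughly step B.3: build the collation key once per string. */
static int make_key(struct sort_key *k, const char *s)
{
    /* With n == 0 the destination may be NULL; the return value is
     * the length needed, not counting the terminating NUL. */
    size_t need = strxfrm(NULL, s, 0) + 1;

    k->orig = s;
    k->cmp = malloc(need);
    if (!k->cmp)
        return -1;
    strxfrm(k->cmp, s, need);
    return 0;
}

/* Roughly step C.1: with keys precomputed, strcmp() is sufficient. */
static int key_cmp(const void *a, const void *b)
{
    const struct sort_key *ka = a;
    const struct sort_key *kb = b;

    return strcmp(ka->cmp, kb->cmp);
}
```

An array of such keys can be handed to qsort() with key_cmp as the
comparison function. The trade-off is the one already mentioned: each
key is held in memory for the duration of the sort.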

And of course we need to decide whether the inconsistency noted by
Stephane that started this discussion is actually significant enough
to undertake this effort and eat the side-effect noted in (D) and
the memory consumption noted by Stephane.

We might want to do (A.2) even if nothing else.
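
For illustration, the result of such a check could be consumed along
these lines. HAVE_STRCOLL is the macro name that autoconf's
AC_CHECK_FUNCS(strcoll) conventionally defines; whether the zsh build
defines or uses it this way is an assumption here, and collate_cmp() is
a made-up name.

```c
#include <string.h>

/*
 * Hypothetical wrapper (not zsh code): use strcoll() when configure
 * found it, otherwise fall back to byte-wise strcmp().
 */
static int collate_cmp(const char *a, const char *b)
{
#ifdef HAVE_STRCOLL
    return strcoll(a, b);
#else
    return strcmp(a, b);
#endif
}
```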

Technically when we don't already have the strings unmetafied by the
caller we could do (B.1) and (B.2) simultaneously, but that adds
complexity to the code in the (B.2) loop.

Volunteers welcome, I'm dropping this as of now.


