Zsh Mailing List Archive

Re: special characters in file names issue

On Fri, Nov 10, 2023 at 3:51 AM Roman Perepelitsa <roman.perepelitsa@xxxxxxxxx> wrote:
On Fri, Nov 10, 2023 at 12:17 AM Jim <linux.tech.guy@xxxxxxxxx> wrote:
-clip-
> ...
> for E (${(@)AllFileNames}) {
> [[ -v FileNameCkSum[$E] ]] || FileNameCkSum[$E]=${$(shasum -a 1 $E)[1]} }  # line that fails
> ...

-clip-

Associative arrays in zsh are finicky when it comes to the content of
their keys. The problem you are experiencing can be distilled to this:

    % typeset -A dict
    % key='('
    % [[ -v dict[$key] ]]
    zsh: invalid subscript
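
A subscript that goes through parameter expansion doesn't have this
problem, because the expanded text is used as a literal key. So one
common workaround is to test existence with ${+...} instead of -v:

    % (( ${+dict[$key]} )) && print set || print unset
    unset

In the failing line above, that would be (( ${+FileNameCkSum[$E]} ))
in place of [[ -v FileNameCkSum[$E] ]].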

-clip-
Roman.

P.S.

From the description of your problem I would think that you want file
hashes as keys. Something like this:

    # usage: detect-dup-files [file]..
    function detect-dup-files() {
      emulate -L zsh
      (( ARGC )) || return 0
      local -A seen
      local i files fname hash orig
      # One shasum fork for all files; -b prints "HASH *NAME" per file,
      # and unquoted $(...) word-splits, leaving hashes at odd indices.
      files=( $(shasum -ba 256 -- "$@") ) || return
      (( 2 * ARGC == $#files )) || return
      for i in {1..$ARGC}; do
        fname=$argv[i]
        # Strip the leading backslash shasum prepends to lines whose
        # file name contains backslashes or linefeeds.
        hash=${files[2*i-1]#\\}
        if [[ -n ${orig::=$seen[$hash]} ]]; then
          print -r -- "${(q+)fname} is a dup of ${(q+)orig}"
        else
          seen[$hash]=$fname
        fi
      done
    }

This code has the added advantage of forking only once. It also handles
file names with backslashes and linefeeds in them.
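
For example, with two hypothetical identical files:

    % print foo >a.txt; print foo >b.txt
    % detect-dup-files a.txt b.txt
    b.txt is a dup of a.txt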

I was only expecting, at best, an answer to that one failing line, so thanks
for the function. After 58 years of working on computers and in IT, this old
dog is still open to new code. I usually learn something from every example
I get my hands on, and this function was no exception. So a big thanks;
parts of it will be used in the future.

Since everyone was working with limited information about what I was doing,
there are some issues. The files I'm working on number in excess of 96K, and
with that many names most utilities, including shasum, report that the input
line is too long, so a few changes were needed. Even working in groups of
files, shasum takes over two and a half hours to do all 96K. So I implemented
gdbm to store the results; that way, even when I hit the "key" problem, I
could skip all the files that were already hashed.

And you were right, I was working with hashes as keys, but in a different way.
I used the following to create a second associative array (also ztied).
It took well over two hours to run; I thought it would be a lot faster. Oh well.

# For each unique hash V, NUL-join the sorted names of all files
# whose checksum is V.
for V ("${(@u)FileNameCkSum}") \
  CkSumFileNames[$V]=${(pj:\0:)${(o)${(@k)FileNameCkSum[(Re)$V]}}}

Each key's (hash's) value holds all the file names with that hash, separated
by NULs. The rest of the code uses this second associative array.
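
I suspect most of those two hours go to the (Re) reverse subscript, which
rescans the whole array once per unique hash. A single pass over the keys
should build the same table in linear time; a rough sketch, assuming the
same array names:

local K H
for K ("${(@ko)FileNameCkSum}") {        # keys in sorted order
  H=$FileNameCkSum[$K]
  if (( ${+CkSumFileNames[$H]} )); then
    CkSumFileNames[$H]+=$'\0'$K          # append NUL plus the next name
  else
    CkSumFileNames[$H]=$K
  fi
}

(With a ztied target every += is a database write, so filling a plain local
array first and copying it into the tied one at the end may be quicker still.)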

Again thanks, and best regards,

Jim

