Zsh Mailing List Archive

Re: Slurping a file (was: more splitting travails)



On Sun, Jan 14, 2024 at 11:10 PM Bart Schaefer
<schaefer@xxxxxxxxxxxxxxxx> wrote:
>
> On Sun, Jan 14, 2024 at 2:34 AM Roman Perepelitsa
> <roman.perepelitsa@xxxxxxxxx> wrote:
> >
> > On Sat, Jan 13, 2024 at 9:02 PM Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
> > >
> > >   IFS= read -rd '' file_content <file
> >
> > In addition to being unable to read files with nul bytes, this
> > solution suffers from additional drawbacks:
> >
> > - It's impossible to distinguish EOF from I/O error.
>
> Pretty sure you can do that by examining $ERRNO on nonzero status?

I wouldn't do that other than for debugging. In general, you can
examine errno only for functions that explicitly document how they set
it. If this part is not documented, you have to assume the function
may set errno to anything both on success and on error. Also, most
libc functions may set errno to anything on success.

In this specific case perhaps `read` calls `malloc` after an I/O
error, which may trash errno. Or perhaps at the end of `read <file`
the file descriptor is closed, which again may trash errno. I haven't
verified either of these things. I am merely suggesting why `read`
conceivably could fail to propagate errno from an I/O error in the
absence of explicit guarantees in the docs.

> I'm curious whether
>   setopt nomultibyte
>   read -u 0 -k 8192 ...
> is actually that much slower in a slurp-like loop.

It is slightly *faster*. For smaller files the difference is about
25%. From 512KB and up there is no discernible difference.

> Another thought:  Use -c count option to get number of bytes read and
> -s $size option to specify buffer size.  If (( $count == $size )) then
> double $size for the next read.

This does not seem to help, although this might be dependent on the
device and filesystem. Here's a benchmark for various file sizes
(rows) and various fixed buffer sizes (columns):

     n   fsize    1KB    2KB     4KB    8KB   16KB   32KB   64KB
     1       0     41     43      43     43     51     52     53
     2       1     47     48      49     48     57     57     59
     3       2     48     48      48     48     56     57     58
     4       4     49     49      48     49     62     61     59
     5       8     74     75      51     49     62     61     63
     6      16     47     51      49     49     57     61     63
     7      32     47     50      49     50     58     58     59
     8      64     54     53      49     50     59     58     71
     9     128     50     50      51     51     59     60     61
    10     256     49     52      51     51     60     61     63
    11     512     53     55      55     54     64     64     65
    12    1024     58     61      60     61     57     68     71
    13    2048     77     72      71     74     83     83     83
    14    4096    112    102      88     89    107    100    108
    15    8192    188    153     152    145    161    163    140
    16   16384    343    290     270    259    265    240    225
    17   32768    658    577     427    471    499    495    489
    18   65536   1281   1082     983    771    938    827    937
    19  131072   2659   2214    2046   1952   1893   1928   1506
    20  262144   4818   4608    4195   4254   3810   3955   3043
    21  524288  10174   8967    7502   6382   7632   6142   7148
    22 1048576  21591  18205   16424  15691  15243  14327  14889
    23 2097152  41156  36087   32731  31840  30104  30090  29913
    24 4194304  89814  72949   66447  62716  60998  60252  59485
    25 8388608 191579 147195  125987 116327 121544 122384 122631

4KB and 8KB buffers perform best in this benchmark across all file
sizes. Given that 8KB is the default for sysread, there is no apparent
reason to use `-s`.

> >       typeset -g REPLY=${(j::)content}
>
> Why the typeset here?  Just assign?

Just a habit from using warn_create_global in my scripts. It catches
typos and missing `local` declarations quite well.

> Sadly there's another utility named "slurp":
>
> slurp
>   cli utility to select a region in a Wayland compositor

That's too bad: "slurp" is a well-known moniker for reading the full
content of a file (https://www.google.com/search?q=file+slurp).

Perhaps zslurp?

Roman.



