Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: [PATCH] zsh/random module [UPDATED]



Clinton Bunch wrote on Mon, Nov 07, 2022 at 18:18:15 -0600:
> diff --git a/Completion/Zsh/Command/_getrandom b/Completion/Zsh/Command/_getrandom
> new file mode 100644
> index 000000000..d97e36918
> --- /dev/null
> +++ b/Completion/Zsh/Command/_getrandom
> @@ -0,0 +1,14 @@
> +#Thans to dana for providing this completion function

This crediting doesn't belong in the source file but in the log message.

> +local lmin=0 lmax=$(( 2 ** 32 - 2 )) umin=1 umax=$(( 2 ** 32 - 1 ))
> +
> +_arguments -s -S : \
> +  '(-r -s)-a+[assign result to specified array parameter]:array parameter:_parameters -g "*array*~*readonly*"' \
> +  '(-a)-s+[assign result to specified scalar parameter]:scalar parameter:_parameters -g "*(integer|scalar)*~*readonly*"' \
> +  '(-r)-i[produce random data as 32-bit unsigned integers]' \
> +  '-c+[specify count of data]: :_numbers -d8 -l1 -m64 -u "bytes or integer elements" "data length"' \
> +  '(-i -L -U)-r[produce random data as raw binary bytes]' \
> +  '(-r)-L+[specify integer lower bound (implies -i)]: :_numbers -d$lmin -l$lmin -m$max "integer lower bound"' \
> +  '(-r)-U+[specify integer upper bound (implies -i)]: :_numbers -d$umax -l$umin -m$max "integer upper bound"'

Suggest to have this state either "inclusive" or "exclusive".

> +++ b/Doc/Zsh/mod_random.yo
> @@ -0,0 +1,67 @@
> +startitem()
> +findex(getrandom)
> +cindex(random data, printing, array)
> +xitem(tt(getrandom) [ tt(-c) var(count) ] [ tt(-r) ] [ tt(-s) var(scalar) ] [tt(-i) ] [ tt(-a) var(array) ] [ tt(-L) var(lower_bound) ] [ tt(-U) var(upper_bound) ])

Never use xitem() without item() after it.

> +NL()Outputs a random hex-encoded string of var(count) bytes by default. NL()
> +
> +
> +startitem()
> +item(tt(-c) var(count)) (
> +Sets the number of returned random data. Defaults to 8, unless -i, -L, or
> +-U is specified without -a, in which case the default is 1. var(count) must be
> +between 1 and 64.
> +)
> +item(tt(-r)) (
> +Returns the raw binary bytes rather than a hex-encoded string.
> +)
> +item(tt(-s) var(scalar)) (
> +Saves the random data in var(scalar) instead of printing it. It is an error to
> +use this with -i, -L, or -U with a count greater than 1.
> +)
> +item(tt(-a) var(array)) (
> +Saves the random data as an array var(array) of var(count) containing a
> +decimal representation of the integers LPAR()tt(-i), tt(-L), or tt(-U)RPAR()
> +or individual bytes.
> +)
> +item(tt(-i)) (
> +Produces var(count) 32-bit unsigned integers.
> +)
> +item(tt(-L) var(lower_bound)) (
> +Outputs var(count) integers between var(lower_bound) and var(upper_bound).
> +Defaults to 0.  var(lower_bound) can not be negative and the maximum value is
> +4,294,967,294.
> +)
> +item(tt(-U) var(upper_bound)) (
> +Outputs var(count) integers between var(lower_bound) and var(upper_bound).
> +Defaults to 4,294,967,295. This is the maximum value of var(upper_bound) and
> +the minimum is 1.
> +)
> +enditem()
> +enditem()
> +

I don't understand this API.

- Magic numbers:

  + Why should -c default to 8 in the "random bytes" case?  Why
    shouldn't it default to some other value, or even be required to be
    specified explicitly by the user?

  + Why should -c0 be an error?  It should be valid to request an array
    of 0 random numbers, just like, say, «yes | head -n 0».

  + Why should -c65 be an error?  (Saw your comment elsethread that you
    imagine that would suffice.  That's not a good reason for this limit.)

  + Why should -L$(( 2**32 - 1)) be an error?  I should be able to request
    a random integer in the range { x | 42 ≤ x ≤ 42 }.  It's perfectly
    well-defined.

- Why should -c's default depend on anything at all?  You mentioned
  downthread you consider making -c1 the default in all cases; that'd be
  better.

- The builtin can generate either integers or bytes.  Most of the API
  (the type of "returned random data" that -c deals with, the default
  number of those, validity of -L/-U, etc.) is split right along this
  line.  Shouldn't this become two separate builtins, then?  One for
  bytes and one for integers?

  (More on this below, at the end of the implementation review.)

- Interdependencies.  This can generate either bytes or unsigned ints,
  and bytes can be emitted either in hex, raw, or in decimal, and either
  to stdout or to a scalar or to an array… but not all of these
  combinations are possible.  For instance, adding «-a» to «getrandom -c 1»
  changes the output encoding from hex to decimal AND the output channel
  from stdout to $reply.  It seems better to avoid coupling things like
  this.

> +subsect(Parameters)
> +
> +startitem()
> +vindex(SRANDOM)
> +item(tt(SRANDOM)) (
> +A random positive 32-bit integer between 0 and 4,294,967,295.  This parameter
> +is read-only.
> +)
> +enditem()
> +
> +subsect(Math Functions)
> +
> +startitem()
> +item(tt(zrandom+LPAR()RPAR())) (

Doesn't this need a findex()?

Do other math functions have index entries?  And if not (systell()
doesn't), shouldn't they?

> +++ b/Src/Modules/random.c
> @@ -0,0 +1,675 @@

There's a whole bunch of orthographic issues in this file (whitespace,
punctuation, spelling, etc.).  I won't point them out now since there
are bigger issues.

> +#include <stdbool.h>

C99-ism.  (zsh is written in C89.)

> +/* buffer to pre-load integers for SRANDOM to lessen the context switches */
> +uint32_t rand_buff[8];

C99-ism (the array element type).

Even in C99, wouldn't this require an explicit #include?  I guess it
works now due to transitive #include's, but I don't know that it's
guaranteed to work everywhere.

> +getrandom_buffer(void *buf, size_t len)
> +{
⋮
> +    do {
> +#ifdef HAVE_ARC4RANDOM
> +	/* Apparently, the creaters of arc4random didn't believe in errors */

Leave out the /ad homines/.

FWIW, I would sooner guess arc4random() was originally written on
a platform where it /couldn't/ error.

> +	arc4random_buf(buf,len);
> +	ret = len;
> +#elif defined(HAVE_GETRANDOM)
> +	ret=getrandom(bufptr,(len - val),0);
> +#else
> +	ret=read(randfd,bufptr,(len - val));

The premise of the module is to offer an interface to high-quality
random numbers.  Is /dev/urandom, which is the module falls back to
here, necessarily of sufficient quality?  If not, shouldn't there be
a way for the module's user to determine the source of randomness?

> +#endif
> +	if (ret < 0) {
> +	    if (errno != EINTR || errflag || retflag || breaks || contflag) {
> +		zwarn("Unable to get random data: %e.", errno);

Need to set «errno = 0» first, surely?

> +/*
> + * Implements the getrandom builtin to return a string of random bytes of
> + * user-specified length.  It either prints them to stdout or returns them
> + * in a parameter passed on the command line.
> + *
> + */
> +
> +/**/
> +static int
> +bin_getrandom(char *name, char **argv, Options ops, int func)
> +{
⋮
> +    bool integer_out = false;	 /*  boolean */

The target audience for comments knows C.  So, nuke this comment.

You might define inline enums here (for ints/bytes, scalar/array/stdout,
and decimal/hex/raw) to make the code more self-documenting.

Incidentally, this would also immediately raise the question of why some
combinations of these aren't valid (e.g., hex or raw bytes output to an
array).

> +    int i, j;		         /* loop counters */
> +
> +    /*maximum string length of a 32-bit integer + null */
> +    char int2str[11];

Have the comment point out that this doesn't include a sign byte?

FWIW, we have a 64-bit version of this in DIGBUFSIZE.

> +    /* Vailues for -U *nd -L */
> +    zulong   upper = UINT32_MAX, lower = 0;
> +    uint32_t diff  = UINT32_MAX;
> +    uint32_t min   = 0, rej = 0;  /* Reject below min and count */

«rej» is never read.

> +    bool     bound = false;       /* set if upper or lower bound specified */

Why is this variable needed?  Having it at all means that passing «-L 0
-U $((2**32-1))» is different to not passing any -L or -U option at all,
even though these values are ostensibly the defaults.

> +    if (OPT_ISSET(ops, 'U')) {
> +	if (!zstrtoul_underscore(OPT_ARG(ops, 'U'), &upper)) {
> +	    zwarnnam(name, "Couldn't convert -U %s to a number.",
> +		    OPT_ARG(ops, 'U'));
> +	    return 1;
> +	} else if (upper > UINT32_MAX || upper < 1) {
> +	    zwarnnam(name,
> +		    "Upper bound must be between 1 and %z you specified: %z.",
> +		     (zlong) UINT32_MAX, (zlong) upper);

If «upper > ZLONG_MAX», the downcast to zlong would result in
a negative number.

> +	    return 1;
⋮
> +    if (bound) {
⋮
> +	if (lower > upper) {
> +	    zwarnnam(name, "-U must be larger than -L.");
> +	    return 1;
> +	}
> +
> +	diff = upper == UINT32_MAX && lower == 0 ? UINT32_MAX :
> +		upper - lower + 1;

How can this possibly be right?  As written it means the meaning of
«diff» with the default values of «upper»/«lower» is different to its
meaning in literally every other case.

Ah, I see.  Without special-casing the default values, explicitly
passing «-L 0 -U $((2**32 - 1))» would incorrectly trigger the
zwarnnam() just below, because «diff» is a uint32_t.  Thus, the
special-casing is an incorrect fix to a real issue.

> +	if(!diff) {
> +	    zwarnnam(name, "Upper and Lower bounds cannot be the same.");
> +	    return 1;
> +	}

This warning will never be printed.  [Why?  Because «diff» is either
UINT32_MAX, which is true, or «(uint32_t)(zulong)(upper - lower + 1)»,
which, taking into account the range checks on «upper» and «lower», can
only work out to 0 if «upper == UINT32_MAX && lower == 0»… which is
precisely the case that's special-cased in the assignment to «diff» just
above.]

I suppose it was intended to be printed for «getrandom -L 42 -U 42».  In
practice, that incantation just prints 42.  As I mentioned in the API
review, I think that's desired behaviour.  In either case, please add
a test.

> +    }
> +
> +    if(!OPT_ISSET(ops,'c'))
> +	if ((!bound && !integer_out) || arrname)
> +	len = 8;

So one default value of -c is here and the other is up in the
declaration of «len».  That's somewhat confusing.

> +
> +
> +    if (!byte_len)
> +	byte_len = len;

The condition is always true.

Didn't you get a compiler warning for this?

> +
> +
> +    if (bound) {
> +	pad = len / 16 + 1;
> +	byte_len = (len + pad) * sizeof(uint32_t);

«pad» should be declared in this scope since it's not used elsewhere.
Ditto «outbuff» below and others.

Why «len / 16 + 1»?

> +    } else if (integer_out)
> +	byte_len = len * sizeof(uint32_t);
> +
> +    if (byte_len > 256)
> +	byte_len=256;
> +

Why not report an error to the user?

>
⋮
> +    min = -diff % diff;

See above about the special casing in the assignment to «diff».  I guess
it affects this line.

>
⋮
> +	sprintf(int2str, "%u", int_val);

Hold on.  %u expects an «unsigned int» and «int_val» is a «uint32_t».
So, this line is only correct when «unsigned int» happens to be a 32-bit
type… while there's nothing stopping that type from being a 64-bit type,
or even a 16-bit type, and in either case this line would read the wrong
number of bytes off the varargs.

> +	if (arrname) {
> +	    array[i]=ztrdup(int2str);
> +	} else if (scalar && len == 1) {
> +	    setiparam(scalar, int_val);
> +	} else {
> +	    printf("%s ",int2str);
> +	}
> +    }
> +    if (arrname) {
> +	setaparam(arrname,array);
> +    } else if (!scalar) {
> +	printf("\n");
> +    }
> +
> +
> +    zfree(buf, byte_len);
> +    return 0;
> +}
> +

In the API docs review I proposed to split «getrandom» into two separate
builtins.

Having read this function, I reiterate that proposal.  This function,
with all the if/then cases that try to keep track of which of the
function's different modes of operation was requested, is difficult to
read and to modify.

> +
> +/*
> + * Provides for the SRANDOM parameter and returns an unsigned 32-bit random
> + * integer.
> + */
> +
> +/**/
> +static zlong
> +get_srandom(UNUSED(Param pm)) {
> +
> +    if(buf_cnt < 0) {
> +	getrandom_buffer((void*) rand_buff,sizeof(rand_buff));
> +	buf_cnt=7;

Magic number.  You might want «sizeof(rand_buff) / sizeof(rand_buff[0]) - 1»…

…but it'd be more idiomatic to leave that «- 1» out and then change the
postfix decrement to a prefix decrement.  That would also be more
consistent with the variable's name.  (As written, «buf_cnt == 0» means
there's one more zlong in it; it's not a "count" but a "largest valid
index".)

> +    }
> +    return rand_buff[buf_cnt--];
> +}
> +
> +/*
> + * Implements the mathfunc zrandom and returns a random floating-point
> + * number between 0 and 1.  Does this by getting a random 32-bt integer
> + * and dividing it by UINT32_MAX.
> + *

Presumably this means to guarantee a /uniformly-distributed/ random
floating-point number in the given range.  So, why does "getting
a random 32-bit integer and dividing it by UINT32_MAX" actually
implement that guarantee?

Oh, wait, never mind.  The comment is out of date.  Fix it.

> + */
> +
> +/**/
> +static mnumber
> +math_zrandom(UNUSED(char *name), UNUSED(int argc), UNUSED(mnumber *argv),
> +	     UNUSED(int id))
> +{
> +    mnumber ret;
> +    double r;
> +
> +    r = random_real();
> +    if (r < 0) {
> +	zwarnnam(name, "Failed to get sufficient random data.");

Say why.  (Probably via errno?)

Rule of thumb: every error message should include at least one replaceable.

(Further instances of this elsewhere in the file.)

> +/**/
> +int
> +setup_(UNUSED(Module m))
> +{
> +#ifdef USE_URANDOM
> +    /* Check for the existence of /dev/urandom */
> +
> +    struct stat st;
> +
> +    if (lstat("/dev/urandom",&st) < 0) {

Why not stat()?

> +	zwarn("No kernel random pool found.");
> +	return 1;
> +    }
> +
> +    if (!(S_ISCHR(st.st_mode)) ) {
> +	zwarn("No kernel random pool found.");
> +	return 1;
> +    }
> +#endif /* USE_URANDOM */
> +    return 0;
> +}

> +/**/
> +int
> +_zclz64(uint64_t x) {
⋮
> +    if (!(x & 0xFFFFFFFF00000000)) {

Don't you need ZLONG_CONST() here?

> +/**/
> +uint64_t
> +random64(void) {
> +    uint64_t r;
> +
> +    if(getrandom_buffer(&r,sizeof(r)) < 0) {
> +	zwarn("zsh/random: Can't get sufficient random data.");
> +	return 1; /* 0 will cause loop */

Where is the docstring of random64() itself?  /That/ is what determines
whether or not returning 0 would be correct.

What loops and why?

Also, I suspect this name will collide with a third-party function one
day.  Recommending to rename.

> +    }
> +
> +    return r;
> +}
> +
> +/* Following code is under the below copyright, despite changes for error
> + * handling and removing GCCisms */
> +

-1.  Don't say "Following code"; identify the code explicitly, or better
yet, move it to its own *.c file.  Otherwise, any function we might
append to this file would be taken to be covered by this copyright
notice and license statement.

> +/*
> + * random_real: Generate a stream of bits uniformly at random and
> + * interpret it as the fractional part of the binary expansion of a
> + * number in [0, 1], 0.00001010011111010100...; then round it.
> + */
> +
> +/**/
> +double

Need «static».

(Further instances of this elsewhere in the file.)

> +random_real(void)
> +{
⋮
> +}
> +++ b/Test/V15random.ztst
> @@ -0,0 +1,51 @@
> +# Tests for the zsh/random module
> +%test
> +  getrandom -U 56
> +1f:Checking if system has a kernel random source
> +?(eval):1: No kernel random pool found.
> +

The test point's code has nothing to do with the test point's
description and expected errput.

The 'f' flag is why this test point's failure doesn't get reported.

> +  getrandom -U 64
> +0:Checking upper Bound is observed
> +*><0-64> 
> +
> +  tmpmsg="failed"
> +  getrandom -L 4 -U 5 -c 16 -a tmpa
> +  for i in ${tmpa[@]}
> +  do
> +    if (( i == 5 )); then
> +      tmpmsg="success"
> +      break
> +    fi
> +  done

Use ${tmpa[(I)5]} ?

> +  echo $tmpmsg
> +0:Checking that upper bound is inclusive
> +F:Try running the test again: make TESTNUM=V15 check

Baby and bathwater.  Say what the problem is first and what the
recommended fix is second.  In this case, something along the lines
of "This test WILL false positive once every 65536 runs on average.  To
rule this out, run the test again."

Oh, and bump that 16 to something 3 or 4 times as big, because a 1/65536
chance isn't really enough in a world where automated builds (CI,
distros' QA, etc.) is a thing.

> +>success
> +
> +  echo $(( zrandom() ))
> +0:Checking if zrandom() produces a floating-point number between 0 and 1
> +*>0(|.)[[:digit:]]#

Why is the period optional?  Something like "042" shouldn't be accepted.

> diff --git a/configure.ac b/configure.ac
> index 074141d38..f1fa01274 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -675,6 +675,7 @@ fi
>  AC_CHECK_HEADERS(sys/time.h sys/times.h sys/select.h termcap.h termio.h \
>  		 termios.h sys/param.h sys/filio.h string.h memory.h \
>  		 limits.h fcntl.h libc.h sys/utsname.h sys/resource.h \
> +                 sys/random.h \

Tabs/spaces are wrong.  Also in the next hunk.

>  		 locale.h errno.h stdio.h stdarg.h varargs.h stdlib.h \
>  		 unistd.h sys/capability.h \
>  		 utmp.h utmpx.h sys/types.h pwd.h grp.h poll.h sys/mman.h \
> @@ -1337,6 +1338,7 @@ AC_CHECK_FUNCS(strftime strptime mktime timelocal \
>  	       cygwin_conv_path \
>  	       nanosleep \
>  	       srand_deterministic \
> +               getrandom arc4random \
>  	       setutxent getutxent endutxent getutent)
>  AC_FUNC_STRCOLL
>  

Bottom line:

- Interfacing to kernel random number APIs seems reasonable enough.

- The API to shell scripts feels unintuitive.

- As the patch stands, it is not ready to be merged (even assuming
  arguendo the API is perfect as-is).

Thanks,

Daniel




Messages sorted by: Reverse Date, Date, Thread, Author