Zsh Mailing List Archive

Re: bug in zsh wait builtin - rhbz#1150541



On Thu, 23 Oct 2014 21:50:41 -0700
Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
> On Oct 23,  9:32am, Peter Stephenson wrote:
> }
> } On Tue, 21 Oct 2014 23:55:42 -0700
> } Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
> } >  Is "kill" supposed to work the same way?
> } 
> } There's no indication kill needs to have this.  Presumably this is
> } because for kill you don't need to have a sensible exit status, just a
> } reasonable likelihood the job is dead (or wedged in some state where
> } that signal doesn't work, but that's an entirely different problem).
> 
> My implied point was that both commands accept either job identifiers
> (%3, %?sleep?) or PIDs and presumably should act the same way for the
> same child process regardless of how it was identified; or else PIDs
> are something entirely different than job identifiers and the rules
> are different.  But for "wait" to treat PIDs magically while "kill"
> does not seems inconsistent.

In any case there appears to be no call for kill to do this.

> } > Note also that this is partly handled by the POSIX_JOBS option:
> } > 
> } >      When the option is set, it becomes possible to use the wait
> } >      builtin to wait for the last job started in the background (as
> } >      given by $!) even if that job has already exited.  This works even
> } >      if the option is turned on temporarily around the use of the wait
> } >      builtin.
> } > 
> } > I would say that any further change made for this should also be under
> } > the auspices (so to speak) of POSIX_JOBS.
> } 
> } That would already cover the cases in the "bug" report, in fact.
> 
> I don't think it would, because the report starts two background jobs
> and then waits for the one started first.  The current implementation
> only allows the most recent $! to be waited for after it exits.

I missed that.
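
For reference, the scenario in the report is roughly the following
(commands and timings only illustrative):

  #!/bin/zsh
  sleep 1 &
  pid1=$!        # PID of the *first* background job
  sleep 10 &
  sleep 2        # by now the first job has exited and been reaped
  wait $pid1     # currently fails: $pid1 is no longer $!, so even
                 # POSIX_JOBS doesn't help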
 
> } I'm not really sure why we wouldn't just implement this particular
> } feature generally, despite the current status.  Is there any reason why
> } you'd *want* "wait" to give you an error (which isn't a particularly
> } useful message) owing to a race condition you can't control?
> 
> There are a lot of error messages that a script probably doesn't want
> but an interactive user might.  Why do you want "wait %3" to report
> "%3: no such job"?  If nobody wants it, why did it take us 25 years
> to figure that out?

The point I was making wasn't that errors are necessarily bad if they
detect something useful, but that this error doesn't detect anything
particularly useful: there's an intrinsic race the user can't control,
so the shell is claiming, by incorrect guesswork, that the PID wasn't a
child of the shell.  (Arguing that's OK because the process is no
longer a child is the sort of argument that will annoy users, who want
things to work without long justifications.)  It's incorrect no matter
how long it's been there.

In any case I don't think using wait in this fashion is useful
interactively; it's replaced by job control.  If you turn off job
control to rely on a system designed for scripts, which therefore has
no notifications, you are shooting yourself in the foot.  If you use
the scripting-style mechanism you accept the consequences.

(Not sure it's even worth arguing about, but it's Saturday and I'm no
longer on git duty...)

Here's an implementation.  I've given it the obvious finger test, but
there may be some more stressful tests we could apply.  Note I'm only
recording background PIDs, since the user can't explicitly wait for
foreground PIDs; it's possible I've missed a case where something can
be in the background without being picked up here, but that would
suggest the job is wrongly recorded in the job table anyway.
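
The sort of quick check I have in mind is along these lines (just a
sketch, not a proper test case):

  #!/bin/zsh
  (exit 3) &
  pid=$!
  sleep 1             # let the child exit and be reaped
  wait $pid           # previously: "pid ... is not a child of this shell"
  print "status: $?"  # with the patch this should be the recorded status, 3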

One piece of unfinished business: I think lastpid_status can go, but
the logic associated with it is rather different from what I've just
implemented, so I'd like some further thought on whether there's a case
the new code doesn't cover.  As I've written it, lastpid_status is not
an optimisation, because you need to remove the PID from the list
anyway --- you can only wait for a given PID once, after which
returning an error is the correct behaviour.

Note it's not useful to store a job, as opposed to a process, in this
fashion, because we reuse job numbers immediately, which is standard
shell behaviour (it would be annoying for job numbers to increment
rapidly).  That fits with the fact that this isn't really designed for
use with job control.  I've tried to clarify the point in the manual
entry for "wait".

diff --git a/Doc/Zsh/builtins.yo b/Doc/Zsh/builtins.yo
index 46f40cc..edc335e 100644
--- a/Doc/Zsh/builtins.yo
+++ b/Doc/Zsh/builtins.yo
@@ -2059,6 +2059,22 @@ then all currently active child processes are waited for.
 Each var(job) can be either a job specification or the process ID
 of a job in the job table.
 The exit status from this command is that of the job waited for.
+
+It is possible to wait for recent processes (specified by process ID,
+not by job) that were running in the background even if the process has
+exited.  Typically the process ID will be recorded by capturing the
+value of the variable tt($!) immediately after the process has been
+started.  There is a limit on the number of process IDs remembered by
+the shell; this is given by the value of the system configuration
+parameter tt(CHILD_MAX).  When this limit is reached, older process IDs
+are discarded, least recently started processes first.
+
+Note there is no protection against the process ID wrapping, i.e. if the
+wait is not executed soon enough there is a chance the process waited
+for is the wrong one.  A conflict implies both process IDs have been
+generated by the shell, as other processes are not recorded, and that
+the user is potentially interested in both, so this problem is intrinsic
+to process IDs.
 )
 findex(whence)
 item(tt(whence) [ tt(-vcwfpams) ] var(name) ...)(
diff --git a/Doc/Zsh/options.yo b/Doc/Zsh/options.yo
index 068a253..452b258 100644
--- a/Doc/Zsh/options.yo
+++ b/Doc/Zsh/options.yo
@@ -1434,10 +1434,10 @@ shell is saved for output within a subshell (for example, within a
 pipeline).  When the option is set, the output of tt(jobs) is empty
 until a job is started within the subshell.
 
-When the option is set, it becomes possible to use the tt(wait) builtin to
-wait for the last job started in the background (as given by tt($!)) even
-if that job has already exited.  This works even if the option is turned
-on temporarily around the use of the tt(wait) builtin.
+In previous versions of the shell, it was necessary to enable
+tt(POSIX_JOBS) in order for the builtin command tt(wait) to return the
+status of background jobs that had already exited.  This is no longer
+the case.
 )
 enditem()
 
diff --git a/Src/jobs.c b/Src/jobs.c
index bd95afb..ec76c4f 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -1940,6 +1940,122 @@ maybeshrinkjobtab(void)
     unqueue_signals();
 }
 
+/*
+ * Definitions for the background process stuff recorded below.
+ * This would be more efficient as a hash, but
+ * - that's quite heavyweight for something not needed very often
+ * - we need some kind of ordering as POSIX allows us to limit
+ *   the size of the list to the value of _SC_CHILD_MAX and clearly
+ *   we want to clear the oldest first
+ * - cases with a long list of background jobs where the user doesn't
+ *   wait for a large number, and then does wait for one (the only
+ *   inefficient case) are rare
+ * - in the context of waiting for an external process, looping
+ *   over a list isn't so very inefficient.
+ * Enough excuses already.
+ */
+
+/* Data in the link list, a key (process ID) / value (exit status) pair. */
+struct bgstatus {
+    pid_t pid;
+    int status;
+};
+typedef struct bgstatus *Bgstatus;
+/* The list of those entries */
+LinkList bgstatus_list;
+/* Count of entries.  Reaches value of _SC_CHILD_MAX and stops. */
+long bgstatus_count;
+
+/*
+ * Remove and free a bgstatus entry.
+ */
+static void rembgstatus(LinkNode node)
+{
+    zfree(remnode(bgstatus_list, node), sizeof(struct bgstatus));
+    bgstatus_count--;
+}
+
+/*
+ * Record the status of a background process that exited so we
+ * can execute the builtin wait for it.
+ *
+ * We can't execute the wait builtin for something that exited in the
+ * foreground as it's not visible to the user, so don't bother recording.
+ */
+
+/**/
+void
+addbgstatus(pid_t pid, int status)
+{
+    static long child_max;
+    Bgstatus bgstatus_entry;
+
+    if (!child_max) {
+#ifdef _SC_CHILD_MAX
+	child_max = sysconf(_SC_CHILD_MAX);
+	if (child_max < 1) /* paranoia: sysconf() may return -1 */
+#endif
+	{
+	    /* Be inventive */
+	    child_max = 1024L;
+	}
+    }
+
+    if (!bgstatus_list) {
+	bgstatus_list = znewlinklist();
+	/*
+	 * We're not always robust about memory failures, but
+	 * this is pretty deep in the shell basics to be failing owing
+	 * to memory, and a failure to wait is reported loudly, so test
+	 * and fail silently here.
+	 */
+	if (!bgstatus_list)
+	    return;
+    }
+    if (bgstatus_count == child_max) {
+	/* Overflow.  List is in order, remove first */
+	rembgstatus(firstnode(bgstatus_list));
+    }
+    bgstatus_entry = (Bgstatus)zalloc(sizeof(*bgstatus_entry));
+    if (!bgstatus_entry) {
+	/* See note above */
+	return;
+    }
+    bgstatus_entry->pid = pid;
+    bgstatus_entry->status = status;
+    if (!zaddlinknode(bgstatus_list, bgstatus_entry)) {
+	zfree(bgstatus_entry, sizeof(*bgstatus_entry));
+	return;
+    }
+    bgstatus_count++;
+}
+
+/*
+ * See if pid has a recorded exit status.
+ * Note we make no guarantee that the PIDs haven't wrapped, so this
+ * may not be the right process.
+ *
+ * This is only used by wait, which must only work on each
+ * pid once, so we need to remove the entry if we find it.
+ */
+
+static int getbgstatus(pid_t pid)
+{
+    LinkNode node;
+    Bgstatus bgstatus_entry;
+
+    if (!bgstatus_list)
+	return -1;
+    for (node = firstnode(bgstatus_list); node; incnode(node)) {
+	bgstatus_entry = (Bgstatus)getdata(node);
+	if (bgstatus_entry->pid == pid) {
+	    int status = bgstatus_entry->status;
+	    rembgstatus(node);
+	    return status;
+	}
+    }
+    return -1;
+}
 
 /* bg, disown, fg, jobs, wait: most of the job control commands are     *
  * here.  They all take the same type of argument.  Exception: wait can *
@@ -2085,10 +2201,24 @@ bin_fg(char *name, char **argv, Options ops, int func)
 		}
 		if (retval == 0)
 		    retval = lastval2;
-	    } else if (isset(POSIXJOBS) &&
-		       pid == lastpid && lastpid_status >= 0L) {
+	    } else if (pid == lastpid && lastpid_status >= 0L) {
+		/*
+		 * Historical note: this used to be covered by
+		 * isset(POSIXJOBS), but reporting that the last
+		 * PID to exit isn't a child of the shell is not
+		 * obviously useful behaviour.
+		 */
 		retval = (int)lastpid_status;
-	    } else {
+		/*
+		 * We can't wait for a PID twice so ensure it's
+		 * not on the list, either.
+		 *
+		 * TODO: We could optimise this because pid must be at
+		 * the end of the list, if present, but I think we now
+		 * can get rid of lastpid_status anyway.
+ 		 */
+		(void)getbgstatus(pid);
+	    } else if ((retval = getbgstatus(pid)) < 0) {
 		zwarnnam(name, "pid %d is not a child of this shell", pid);
 		/* presumably lastval2 doesn't tell us a heck of a lot? */
 		retval = 1;
diff --git a/Src/linklist.c b/Src/linklist.c
index 1e364fb..3aa8125 100644
--- a/Src/linklist.c
+++ b/Src/linklist.c
@@ -118,6 +118,8 @@ znewlinklist(void)
     LinkList list;
 
     list = (LinkList) zalloc(sizeof *list);
+    if (!list)
+	return NULL;
     list->list.first = NULL;
     list->list.last = &list->node;
     list->list.flags = 0;
@@ -152,6 +154,8 @@ zinsertlinknode(LinkList list, LinkNode node, void *dat)
 
     tmp = node->next;
     node->next = new = (LinkNode) zalloc(sizeof *tmp);
+    if (!new)
+	return NULL;
     new->prev = node;
     new->dat = dat;
     new->next = tmp;
diff --git a/Src/signals.c b/Src/signals.c
index 2df69f9..7b212e6 100644
--- a/Src/signals.c
+++ b/Src/signals.c
@@ -530,6 +530,12 @@ wait_for_processes(void)
 	 */
 	if (pn != NULL && pid == lastpid && lastpid_status != -1L)
 	    lastpid_status = lastval2;
+	/*
+	 * Accumulate a list of older jobs (the above is basically an
+	 * optimisation for the last job).
+	 */
+	if (!(jn->stat & (STAT_CURSH|STAT_BUILTIN)) && jn - jobtab != thisjob)
+	    addbgstatus(pid, (int)lastval2);
     }
 }
 
 
-- 
Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/


