Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: 5.0.8 regression when waiting for suspended jobs

Long ramble with a question at the end.

On Aug 12, 10:31am, Bart Schaefer wrote:
} This is a bit ugly, suggestions welcome.  In the "is a child but has no
} job table entry" case, this won't SIGCONT the child where previous it
} would have, so there's that.

I've been trying to clean this up and have some ideas, but it occurs to
me that kill(pid, SIGCONT) might not be harmless if the process belongs
to the same user but is not a child of the current shell, or even if it
is a child of the current shell but for some reason has trapped CONT.

Practically speaking "wait" never gets as far as calling waitforpid()
on a process that is not a child, so we shouldn't need to worry about
that except in the abstract (e.g., some future programmer decides to
call waitforpid() in some other context).

However waitforpid() is used by getoutput() and getoutputfile() which
do not create job table entries, so we DO need to deal with that, and
if one of those is stopped it seems wise to continue it.  Problem is
we can't tell whether one of those is stopped or not, so it's a toss-
up whether to send the SIGCONT.

Doing so only on the second+ time around the loop (the old behavior)
seems to be making the assumption that we got SIGCHLD as a result of
the job being stopped.  If we got SIGCHLD because the job actually
exited, we'd have failed the kill(pid, 0) on the next attempt and
left the loop.  However, if the job is already stopped before we go
into waitpid(), that SIGCHLD has likely already been processed, so
we'll block in signal_suspend() without sending SIGCONT.

[Another bash aside --

"wait $pid" for a stopped job returns immediately with no message,

$ wait %1
bash: warning: wait_for_job: job 1 is stopped

This warning doesn't appear if the wait is already running at the
time the child is stopped; in that case you get "[1]+  Stopped ..."
as if the job had been running in the foreground.]

Now consider this pathological case:

% ( while : ; do kill -STOP $sysparams[pid]; done ) &

Even with the SIGTT* test in place, waiting for that is an infinite
loop if SIGCONT is being sent.

Finally, and here's the kicker, bin_fg() ALREADY attempts to SIGCONT
the job before calling waitforpid().

Thus I think we have two choices:

(1) Go for a minimal change of sending the SIGCONT only when !wait_cmd,
which ends up with us blocking forever as with 5.0.7 and before (but
blocked on signal_suspend() rather than busy-waiting, which is good).

(2) Actually determine whether the job is WIFSTOPPED() and break the
loop; the job status is printed when returning to the prompt.

Patch here for the minimal change.  Man that's a lot of analysis to
come up with a two-word edit.

diff --git a/Src/jobs.c b/Src/jobs.c
index ed647b8..b47ba8c 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -1384,10 +1384,17 @@ waitforpid(pid_t pid, int wait_cmd)
     child_block();		/* unblocked in signal_suspend() */
+    /* This function should never be called with a pid that is not a
+     * child of the current shell.  Consequently, if kill(0, pid)
+     * fails here with ESRCH, the child has already been reaped.  In
+     * the loop body, we expect this to happen in signal_suspend()
+     * via zhandler(), after which this test terminates the loop.
+     */
     while (!errflag && (kill(pid, 0) >= 0 || errno != ESRCH)) {
 	if (first)
 	    first = 0;
-	else
+	else if (!wait_cmd)
 	    kill(pid, SIGCONT);
 	last_signal = -1;

Messages sorted by: Reverse Date, Date, Thread, Author