Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Deadlock when receiving kill-signal from child process



On Aug 5, 12:52am, Mathias Fredriksson wrote:
}
} I have however managed to get a dump with strace on Gentoo

Based on this strace plus a GDB stack trace Mathias sent me off-list,
I think the problem may be here:

    zwaitjob(int job, int wait_cmd)
    {
        int q = queue_signal_level();
        Job jn = jobtab + job;

        dont_queue_signals();
        child_block();               /* unblocked during signal_suspend() */
        queue_traps(wait_cmd);
    ...
            while (!errflag && jn->stat &&
                   !(jn->stat & STAT_DONE) &&
                   !(interact && (jn->stat & STAT_STOPPED))) {
                signal_suspend(SIGCHLD, wait_cmd);

I suspect what's happening is that the child represented by "job" exits
during dont_queue_signals(), which is a macro that expands to a loop
calling zhandler(), which will process TRAPUSR1 (or other traps).

Somehow this results in jn->stat never being marked STAT_DONE.  Perhaps
this happens because the "thisjob" global gets temporarily changed in
the TRAP* function?  Anyway signal_suspend(SIGCHLD, wait_cmd) is then
called when there are no children left, so we never receive another
SIGCHLD to break out of the while-loop, and even if we do come out of
signal_suspend() the while-loop goes around and we block again.

I'm not sure what to do if this is in fact the problem, because e.g.
calling child_block() before dont_queue_signals() has other problems.

However, it's also possible that a child has exited even before its
job table entry has been created.  One way to find out if that has
happened is this patch:

diff --git a/Src/signals.c b/Src/signals.c
index 3950ad1..d72c7d6 100644
--- a/Src/signals.c
+++ b/Src/signals.c
@@ -519,6 +519,7 @@ wait_for_processes(void)
 	     * will get added on to the next found process that
 	     * terminates.
 	     */
+	    zwarn("no job table entry for pid %d", pid);
 	    get_usage();
 	}
 	/*

Mathias, if you could apply that patch and try again to reproduce the
deadlock, it might tell us something.
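
For readers unfamiliar with the check the patch instruments, the idea can
be sketched outside zsh as follows.  This is only an illustration under
stated assumptions: the fixed-size tracked[] array stands in for the job
table, and track(), is_tracked() and reap_and_check() are hypothetical
names, not zsh functions.  Passing record = 0 simulates a child that is
reaped without its table entry ever having been created.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_TRACKED 16

/* Stand-in for a job table: the pids we know we started. */
static pid_t tracked[MAX_TRACKED];
static int ntracked;

static void track(pid_t pid)
{
    assert(ntracked < MAX_TRACKED);
    tracked[ntracked++] = pid;
}

static int is_tracked(pid_t pid)
{
    int i;

    for (i = 0; i < ntracked; i++)
        if (tracked[i] == pid)
            return 1;
    return 0;
}

/* Fork a child that exits at once, optionally record it, then reap it.
 * Returns 1 if the reaped pid had a table entry, 0 if it did not (after
 * printing a warning), and -1 on error. */
int reap_and_check(int record)
{
    pid_t pid = fork();
    pid_t reaped;

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);
    if (record)
        track(pid);

    reaped = waitpid(pid, NULL, 0);
    if (reaped < 0)
        return -1;
    if (!is_tracked(reaped)) {
        fprintf(stderr, "no job table entry for pid %d\n", (int)reaped);
        return 0;
    }
    return 1;
}
```

Here reap_and_check(1) returns 1, while reap_and_check(0) prints the
warning to stderr and returns 0.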

-- 
Barton E. Schaefer


