Zsh Mailing List Archive

zsh hangs sometimes continued.



For about two years we have been suffering from the same bug that was
reported in: http://www.zsh.org/mla/users/2008/msg00432.html.

After adding more and more debug info to the zsh-4.3.10 sources I
figured out that the problem is in findproc, which can return a
process that has already terminated.

Geert (Cc) then pointed out that this problem matches the description
in msg00432.html above perfectly; however, the fix made at that time
was insufficient in our case.

The previous fix was to stop looking in jobs that were in status
STAT_DONE, i.e. jobs that do not contain any process in status
SP_RUNNING :

+	/*
+	 * We are only interested in jobs with processes still
+	 * marked as live.  Careful in case there's an identical
+	 * process number in a job we haven't quite got around
+	 * to deleting.
+	 */
+	if (jobtab[i].stat & STAT_DONE)
+	    continue;
+
 	for (pn = aux ? jobtab[i].auxprocs : jobtab[i].procs;
 	     pn; pn = pn->next)
 	    if (pn->pid == pid) {

However, this does not prevent findproc from returning a process that
is no longer running. In our case there was a job containing 2
processes, one of them running and one of them terminated. In that case
the job is not "STAT_DONE", but the for loop above still happily
returns the process that has already terminated, leading to the same
deadlock situation as in the original description.

So I added a condition to check that the returned pid is still running :

	for (pn = aux ? jobtab[i].auxprocs : jobtab[i].procs;
	     pn; pn = pn->next)
	    /*
	     * Additional condition required to avoid INC035 : When a
	     * job contains two pids, one terminated pid and one running
	     * pid, then the condition above jobtab[i].stat & STAT_DONE
	     * will not stop these pids from being candidates for the
	     * findproc result (which is supposed to be a RUNNING pid),
	     * and if the terminated pid is an identical process number
	     * for the pid identifying the running process we are trying
	     * to find (after pid number wrapping), then we need to avoid
	     * returning the terminated pid, otherwise the shell would
	     * block and wait forever for the termination of the process
	     * which pid we were supposed to return in a different job.
	     */
	    if ((pn->pid == pid)
		&& (pn->status == SP_RUNNING)) {

We had 2 scripts that suffered from this problem, the simplest one did
something like :

cat <file> | uniq | while read LINE
do
  quite a bit of fork-exec
done

From the traces I understood that the cat terminates well before the
uniq, and that inside the loop (a different job) a new process was
created that had the same pid as the cat. The job containing the cat
was not complete (because of the uniq), and hence the findproc code
above concluded that the cat had died a second time (called from
zhandler handling SIGCHLD).

Obviously this problem was not easy to reproduce, because it depended a
lot on all the parallel fork activity to make the pid numbers advance.
Executing a "while true; do usleep 100; done" in parallel significantly
increased the frequency of the script getting stuck (usually after 1..3
hours), but with the fix above the script has now run in a loop for
more than 2 days, so the fix looks promising.

We would appreciate it if this fix could be improved (if needed) and
validated/integrated.

Note : The problem was submitted to RedHat via HP, so you have probably
received the script and the input file before (it is very large, so I
don't attach it here). Anyway, now that you understand the problem I
guess it is not very difficult to reproduce it systematically: if the
cat just echoes its pid to a file and terminates, then the loop only
needs to wait until it has forked a child with the same pid and then
break, which should trigger the bug as well.



