Zsh Mailing List Archive

Final (?) info on signals/crashes when suspending "mutt" function



[I sent this once before but it seems to have vanished.  Sorry if it shows
up twice.]

Jump to the end for the big news that may finally get this fixed.  I've
been writing this message incrementally between debugging passes, so you
might as well get the whole play-by-play.

Recall that Jos Backus reported that suspending the function

    mutt () {
	command mutt "$@"
	echotc rs
    }

causes zsh to behave badly.  Sven has sent several patches but none of them
have completely fixed the problem.  Attempting to debug this, I've been
running gdb on zsh.  I reproduced the problem but so far I'm only able to
break at the point at which the SIGSTOP is received, so I'm not sure who
is sending that signal -- however, the parent zsh received first SIGSTOP
and *then* SIGTSTP when I hit ^Z, which is very suspicious.

However, because I was in gdb (attached to a PID from another xterm) I was
able to make zsh continue after each signal (so zsh's xterm never got hung).
Continuing through the second (TSTP) signal, I ended up with this:

zagzig% mutt () {
function>       command mutt "$@"
function>       echotc rs
function> }
zagzig% mutt
zsh: suspended (signal)  mutt
zagzig% pstree $$
zsh-+-mutt
    `-pstree
zagzig% fg
[1]  - trace trap (core dumped)  mutt

Simultaneously in the gdb terminal, the parent zsh got a SIGSEGV because it
tried to strcmp() a bad job table entry.  Here's the stack trace:

(gdb) where
#0  strcmp (p1=0x0, p2=0x80bfe70 "/usr/src/local/zsh/zsh-3.0.6-pre")
    at ../sysdeps/generic/strcmp.c:36
#1  0x804ba8b in bin_fg (name=0x80c25d8 "fg", argv=0x80c2770, 
    ops=0xbffff1a8 "", func=2) at builtin.c:629
#2  0x804a8c3 in execbuiltin (args=0x80c2710, bn=0x80b0ea0) at builtin.c:186
#3  0x805d7d3 in execcmd (cmd=0x80c26f0, input=0, output=0, how=2, last1=2)
    at exec.c:1779
#4  0x805af5e in execpline2 (pline=0x80c2740, how=2, input=0, output=0, 
    last1=0) at exec.c:912
#5  0x805a5b0 in execpline (l=0x80c26d8, how=2, last1=0) at exec.c:739
#6  0x805a183 in execlist (list=0x80c2750, dont_change_job=0, exiting=0)
    at exec.c:612
#7  0x806bee0 in loop (toplevel=1, justonce=0) at init.c:143
#8  0x806bbe4 in main (argc=2, argv=0xbffff6ec) at init.c:75
(gdb) up
#1  0x804ba8b in bin_fg (name=0x80c25d8 "fg", argv=0x80c2770, 
    ops=0xbffff1a8 "", func=2) at builtin.c:629
629			if (strcmp(jobtab[job].pwd, pwd)) {
(gdb) p job
$1 = 1
(gdb) p jobtab[1]
$3 = {gleader = 0, other = 0, stat = 0, pwd = 0x0, procs = 0x0, 
  filelist = 0x0, stty_in_env = 0, ty = 0x0}
(gdb) p jobtab[0]
$4 = {gleader = 0, other = 0, stat = 0, pwd = 0x0, procs = 0x0, 
  filelist = 0x0, stty_in_env = 0, ty = 0x0}
(gdb) p curjob
$5 = 2

Somewhere zsh has completely lost track of two (?) jobs, and failed to reset
curjob to -1.
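
As a rough illustration of the check that would have avoided the strcmp()
on a null pwd, here is a toy model in Python.  It is not zsh's code: only
the field names gleader, stat and pwd come from the gdb dump above, the
Job class and the resumable() helper are hypothetical, and it assumes that
an all-zero entry denotes an unused slot.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    # Field names follow the gdb dump; the class itself is illustrative only.
    gleader: int = 0
    stat: int = 0
    pwd: Optional[str] = None


def resumable(jobtab: List[Job], job: int) -> bool:
    """Return True only if jobtab[job] looks like a live entry."""
    if not 0 <= job < len(jobtab):
        return False  # index outside the table
    entry = jobtab[job]
    # Assumption: stat == 0 marks an unused slot, so there is nothing to
    # resume and no pwd to compare against.
    return entry.stat != 0 and entry.pwd is not None


# Mirror the state seen in gdb: slots 0 and 1 are zeroed out.
jobtab = [Job(), Job(), Job(gleader=1234, stat=1, pwd="/tmp")]
print(resumable(jobtab, 1))
print(resumable(jobtab, 2))
```

With a guard like this, "fg" on the zeroed slot would report an error
instead of dereferencing a null pointer, though it would not explain how
the table got into that state.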

Now, oddly, if I change the function to be:

    mutt() {
	cd /tmp
	command mutt "$@"
	echotc rs
    }

I still get the SIGSTOP followed by the SIGTSTP, but now zsh is able to
correctly "fg" the job:

zagzig% mutt () {
        cd /tmp
        command mutt "$@"
        echotc rs
}
zagzig% mutt
zsh: suspended (signal)  mutt
(pwd now: /tmp)
zagzig% cd -
/usr/src/local/zsh/zsh-3.0.6-pre
zagzig% fg
[1]  - continued  mutt
zsh: suspended (signal)  mutt
zagzig% fg
[1]  - continued  mutt

The extra builtin has caused something different to happen.  Following
the second "fg" I quit mutt with "q" -- and now zsh is hung, blocked in
sigsuspend() called from waitjob(); but that may be a side effect of gdb.

The strange thing is, I can't tell where the heck that SIGSTOP is coming
from.  I've even tried putting in debug print statements around places
where zsh performs a kill() or killpg(), and I don't get any output!  Is
some other process (mutt itself?) sending a SIGSTOP to the process group?

YES!  That's IT!  MUTT is calling kill(0, SIGSTOP) and blowing its parent
zsh out of the water!  Confirmed by changing "command" to "strace" in the
function above.  Mutt expects to be the process group leader, but is not.
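
To make the process-group point concrete, here is a minimal sketch in
Python, whose os module wraps the same POSIX calls.  The function name
child_stops_only_itself() is hypothetical.  The child first makes itself
the leader of a new process group and only then calls kill(0, SIGSTOP),
so the stop reaches the child alone and the parent stays runnable.

```python
import os
import signal


def child_stops_only_itself() -> bool:
    """Fork a child that signals its own (new) process group with SIGSTOP."""
    pid = os.fork()
    if pid == 0:
        # Child: become leader of a fresh process group, then do what mutt
        # does.  A pid of 0 means "every process in my process group".
        try:
            os.setpgid(0, 0)
            os.kill(0, signal.SIGSTOP)
        finally:
            os._exit(0)

    # Parent: WUNTRACED makes waitpid() report a stopped child as well.
    _, status = os.waitpid(pid, os.WUNTRACED)
    stopped = os.WIFSTOPPED(status) and os.WSTOPSIG(status) == signal.SIGSTOP

    if stopped:
        # Clean up: a stopped process still dies from SIGKILL.
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
    return stopped


if __name__ == "__main__":
    print("only the child stopped:", child_stops_only_itself())
```

If the setpgid(0, 0) call were dropped, the child would still be in the
caller's process group and the same kill() would stop the caller too,
which is what happened to the parent zsh here.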

So that pretty much tears it.  There is no way short of forking a "watcher"
subshell for EVERY external process to handle both:
(1) badly-behaved programs whose exit status does not reveal that they died
    from a signal, and
(2) badly-behaved programs that send uncatchable signals to their entire
    process group even when they are not the group leader.

The failure in case (1) is far less catastrophic than case (2), so I think
the right solution is to back off to the behavior from patch 6707 (that is,
scrap 6819 and most of 6824, but 6848 and 6850 are orthogonal and good).
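
For case (1), the following Python sketch shows why the parent cannot
tell what happened.  The two child programs and the helper names
run_child() and killed_by_signal() are made up for the illustration.  One
child dies from SIGTERM with the default action, the other catches the
same signal and exits normally, so its wait status carries only an
ordinary exit code.

```python
import signal
import subprocess
import sys

# Child that dies from SIGTERM via the default action: the wait status
# records the signal.
HONEST = "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"

# Child that catches SIGTERM and exits normally: the parent sees only
# exit code 1 and cannot tell a signal was involved.
HIDING = (
    "import os, signal, sys\n"
    "signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))\n"
    "os.kill(os.getpid(), signal.SIGTERM)\n"
)


def run_child(source: str) -> int:
    """Run a Python one-off child and return its subprocess returncode."""
    return subprocess.run([sys.executable, "-c", source]).returncode


def killed_by_signal(returncode: int) -> bool:
    # subprocess reports death by signal N as a returncode of -N.
    return returncode < 0


if __name__ == "__main__":
    for name, source in (("honest", HONEST), ("hiding", HIDING)):
        rc = run_child(source)
        print(name, rc, killed_by_signal(rc))
```

A shell looking only at the wait status has no way to treat the second
child as signalled, which is the failure being accepted by backing off
to the earlier behavior.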

I don't know, however, if that's directly related to the bogus curjob value
and "fg" crash noted above.  Probably so, but ...

-- 
Bart Schaefer                                 Brass Lantern Enterprises
http://www.well.com/user/barts              http://www.brasslantern.com


