Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: Zsh killed when autoloaded function calls mislinked program



Travis Spencer wrote:
> I've found that invoking an autoloaded function that calls a program
> that isn't linked correctly kills zsh.

I get this too, on Solaris 2.6 as it happens, since I have lots of
conveniently unloadable Solaris 8 binaries lying around.  I've
simplified it to this:

% fn() { if ~/solaris8/bin/touch /dev/null 2>/dev/null; then true; fi }
% echo | fn
zsh: killed     TEST_MODULES=1 ./zsh

The "if" and the function are both crucial.

You can get the same effect on Linux (and therefore presumably more
generally) with the following code:

% fn() { if sh -c 'kill -9 $$'; then true; fi }
% echo | fn
zsh: killed     zsh

so this is quite bad.

Good news... I think I've found out what's doing it.

Bad news... it's in Sven's hacks for being clever with jobs when stuff
is running in the last part of a pipeline and I've only a vague idea
what's going on.

The culprit appears to be this chunk in execpline, around line 1236 of
exec.c:

	    if (list_pipe && (lastval & 0200) && pj >= 0 &&
		(!(jn->stat & STAT_INUSE) || (jn->stat & STAT_DONE))) {
		deletejob(jn);
		jn = jobtab + pj;
		killjb(jn, lastval & ~0200);
	    }

pj is the old value of "thisjob" at the start of execpline(). jn refers
to the job created with the new process.  list_pipe is the extra special
Sven flag indicating we are doing something extra special with the
current process.

In that call to killjb, we send the signal that killed the failed
process (touch in my case, grep in Travis's) to the process group
containing that process, identified by the PID of its group leader.
This is presumably a hack to pass the signal on to the whole group when
the shell assumes the group should get it.  I don't know why it assumes
that here.

In this case the group leader is recorded as PID 0.  Signalling group 0
presumably means the current process group (the Solaris killpg
documentation isn't explicit, but this is the normal behaviour), and
that group includes the shell itself.  The signal is 9 (SIGKILL).  From
this point on it's all easy to understand.

The patch below seems to fix the immediate problem, but I'm not even
sure it's aimed at the right place.  Do we ever want to kill a process
group whose leader is marked as 0?  Or is this only working because it
now fails to kill things that should be killed?  Or is the entire chunk
I quoted misguided?  What has the old "thisjob", to which jn is being
set, got to do with the preceding jn at this point anyway, such that it
needs killing?

Help.

Index: Src/exec.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/exec.c,v
retrieving revision 1.79
diff -u -r1.79 exec.c
--- Src/exec.c	7 Dec 2004 16:55:03 -0000	1.79
+++ Src/exec.c	21 Dec 2004 11:03:29 -0000
@@ -1233,7 +1233,8 @@
 		(!(jn->stat & STAT_INUSE) || (jn->stat & STAT_DONE))) {
 		deletejob(jn);
 		jn = jobtab + pj;
-		killjb(jn, lastval & ~0200);
+		if (jn->gleader)
+		    killjb(jn, lastval & ~0200);
 	    }
 	    if (list_pipe_child ||
 		((jn->stat & STAT_DONE) &&

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070




