Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: Bug related to stdin/always/jobcontrol

X-seq: zsh-workers 39359
From: Peter Stephenson <p.stephenson@xxxxxxxxxxx>
To: zsh-workers@xxxxxxx
Subject: Re: Bug related to stdin/always/jobcontrol
Date: Fri, 16 Sep 2016 13:02:37 +0100
Cms-type: 201P
In-reply-to: <160915180855.ZM3928@torch.brasslantern.com>
List-help: <mailto:zsh-workers-help@zsh.org>
List-id: Zsh Workers List <zsh-workers.zsh.org>
List-post: <mailto:zsh-workers@zsh.org>
Mailing-list: contact zsh-workers-help@xxxxxxx; run by ezmlm
Organization: Samsung Cambridge Solution Centre
References: <87r392jgd0.fsf@juno.home.vuxu.org> <20160905164207.4630643b@pwslap01u.europe.root.pri> <CGME20160914173110eucas1p1f0567e319ae439b975b19f4e55676fed@eucas1p1.samsung.com> <20160914183105.69862fa9@pwslap01u.europe.root.pri> <20160914223553.3173c8ca@ntlworld.com> <160914202450.ZM32479@torch.brasslantern.com> <20160915092721.3d669bc9@pwslap01u.europe.root.pri> <20160915113315.1655019c@pwslap01u.europe.root.pri> <20160915184040.5fc1b9d0@pwslap01u.europe.root.pri> <160915180855.ZM3928@torch.brasslantern.com>

On Thu, 15 Sep 2016 18:08:55 -0700
Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
> On Sep 15,  6:40pm, Peter Stephenson wrote:
> } Subject: Re: Bug related to stdin/always/jobcontrol
> }
> } The race is related to the fact that when the new SUPERJOB takes over
> } it's actually in the same process group as the SUBJOB it's taking over,
> } which is different from it's own PID, list_pipe_pid [...]
> } 
> } However, fixing that up doesn't help --- apparently attaching it to the
> } TTY after the SUBJOB is finished works OK, sending it SIGCONT seems to
> } be OK, but then it always gets SIGSTOP
> 
> Is it definitely getting SIGSTOP and not SIGTSTP ?
> 
> The operating system will never[*] automatically send a true STOP signal
> because it's un-catchable.  If a STOP is being received, it's probably
> being sent from the "for (; !nowait;)" loop in execpline(), either at
> line 1674 [ killpg(jobtab[list_pipe_job].gleader, SIGSTOP); ] or at
> line 1688 [ kill(getpid(), SIGSTOP); ].

I've found the double SIGSTOP --- it was getting *both* of the above.

In the case in question, the newly forked child zsh was picking up the
pgrp of the vim process, so when we sent SIGSTOP to that pgrp it got
stopped, too.  When it got restarted, it immediately stopped itself.

There's some code in the child to decide on the pgrp.  It looks like it
needs to have its own pgrp if the old superjob has already gone, the
basis of the problems we are seeing here.

I'm not sure if this leaves some more races but, again, it looks like
it's an improvement.

The rest of the patch is cosmetic.

The bad news is there are other problems with this logic that have crept
in since 5.2 --- this appears to be unrelated, or at least not made any
worse by this change, so I'll start a separate thread.

pws

diff --git a/Src/exec.c b/Src/exec.c
index 21270c8..0755d55 100644
--- a/Src/exec.c
+++ b/Src/exec.c
@@ -1652,7 +1652,13 @@ execpline(Estate state, wordcode slcode, int how, int last1)
 		    int synch[2];
 		    struct timeval bgtime;
 
+		    /*
+		     * A pipeline with the shell handling the right
+		     * hand side was stopped.  We'll fork to allow
+		     * it to continue.
+		     */
 		    if (pipe(synch) < 0 || (pid = zfork(&bgtime)) == -1) {
+			/* Failure */
 			if (pid < 0) {
 			    close(synch[0]);
 			    close(synch[1]);
@@ -1666,6 +1672,18 @@ execpline(Estate state, wordcode slcode, int how, int last1)
 			thisjob = newjob;
 		    }
 		    else if (pid) {
+			/*
+			 * Parent: job control is here.  If the job
+			 * started for the RHS of the pipeline is still
+			 * around, then its a SUBJOB and the job for
+			 * earlier parts of the pipeeline is its SUPERJOB.
+			 * The newly forked shell isn't recorded as a
+			 * separate job here, just as list_pipe_pid.
+			 * If the superjob exits (it may already have
+			 * done so, see child branch below), we'll use
+			 * list_pipe_pid to form the basis of a
+			 * replacement job --- see SUBLEADER code above.
+			 */
 			char dummy;
 
 			lpforked =
@@ -1692,9 +1710,33 @@ execpline(Estate state, wordcode slcode, int how, int last1)
 			break;
 		    }
 		    else {
+			Job pjn = jobtab + list_pipe_job;
 			close(synch[0]);
 			entersubsh(ESUB_ASYNC);
-			if (jobtab[list_pipe_job].procs) {
+			/*
+			 * At this point, we used to attach this process
+			 * to the process group of list_pipe_job (the
+			 * new superjob) any time that was still available.
+			 * That caused problems when the fork happened
+			 * early enough that the subjob is in that
+			 * process group, since then we will stop this
+			 * job when we signal the subjob, and we have
+			 * no way here to know that we shouldn't also
+			 * send STOP to itself, resulting in two stops.
+			 * So don't do that if the original
+			 * list_pipe_job has exited.
+			 *
+			 * The choice here needs to match the assumption
+			 * made when handling SUBLEADER above that the
+			 * process group is our own PID.  I'm not sure
+			 * if there's a potential race for that.  But
+			 * setting up a new process group if the
+			 * superjob is still functioning seems the wrong
+			 * thing to do.
+			 */
+			if (pjn->procs &&
+			    (pjn->stat & STAT_INUSE) &&
+			    !(pjn->stat & STAT_DONE)) {
 			    if (setpgrp(0L, mypgrp = jobtab[list_pipe_job].gleader)
 				== -1) {
 				setpgrp(0L, mypgrp = getpid());
diff --git a/Src/jobs.c b/Src/jobs.c
index 9284c71..d1b98ac 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -232,11 +232,12 @@ super_job(int sub)
 static int
 handle_sub(int job, int fg)
 {
+    /* job: superjob; sj: subjob. */
     Job jn = jobtab + job, sj = jobtab + jn->other;
 
     if ((sj->stat & STAT_DONE) || (!sj->procs && !sj->auxprocs)) {
 	struct process *p;
-		    
+
 	for (p = sj->procs; p; p = p->next) {
 	    if (WIFSIGNALED(p->status)) {
 		if (jn->gleader != mypgrp && jn->procs->next)
diff --git a/Src/signals.c b/Src/signals.c
index 2eefc07..30dde71 100644
--- a/Src/signals.c
+++ b/Src/signals.c
@@ -723,7 +723,7 @@ killjb(Job jn, int sig)
 {
     Process pn;
     int err = 0;
- 
+
     if (jobbing) {
         if (jn->stat & STAT_SUPERJOB) {
             if (sig == SIGCONT) {
@@ -731,11 +731,21 @@ killjb(Job jn, int sig)
                     if (killpg(pn->pid, sig) == -1)
 			if (kill(pn->pid, sig) == -1 && errno != ESRCH)
 			    err = -1;
- 
+
+		/*
+		 * Note this does not kill the last process,
+		 * which is assumed to be the one controlling the
+		 * subjob, i.e. the forked zsh that was originally
+		 * list_pipe_pid...
+		 */
                 for (pn = jn->procs; pn->next; pn = pn->next)
                     if (kill(pn->pid, sig) == -1 && errno != ESRCH)
 			err = -1;
 
+		/*
+		 * ...we only continue that once the external processes
+		 * currently associated with the subjob are finished.
+		 */
 		if (!jobtab[jn->other].procs && pn)
 		    if (kill(pn->pid, sig) == -1 && errno != ESRCH)
 			err = -1;
@@ -744,7 +754,7 @@ killjb(Job jn, int sig)
             }
             if (killpg(jobtab[jn->other].gleader, sig) == -1 && errno != ESRCH)
 		err = -1;
-		
+
 	    if (killpg(jn->gleader, sig) == -1 && errno != ESRCH)
 		err = -1;

References:
- Bug related to stdin/always/jobcontrol
  - From: Christian Neukirchen
- Re: Bug related to stdin/always/jobcontrol
  - From: Peter Stephenson
- Re: Bug related to stdin/always/jobcontrol
  - From: Peter Stephenson
- Re: Bug related to stdin/always/jobcontrol
  - From: Peter Stephenson
- Re: Bug related to stdin/always/jobcontrol
  - From: Bart Schaefer
- Re: Bug related to stdin/always/jobcontrol
  - From: Peter Stephenson
- Re: Bug related to stdin/always/jobcontrol
  - From: Peter Stephenson
- Re: Bug related to stdin/always/jobcontrol
  - From: Peter Stephenson
- Re: Bug related to stdin/always/jobcontrol
  - From: Bart Schaefer

Messages sorted by: Reverse Date, Date, Thread, Author