Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: 5.0.8 regression when waiting for suspended jobs



On Tue, 11 Aug 2015 16:56:55 -0700
Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
> On Jul 31,  8:56am, Bart Schaefer wrote:
> I still only suspect what changed to make 5.0.8 different from 5.0.7 in
> this regard, but here's what's going on:

> - "wait $!" -
> } zsh-5.0.7
> }  - "wait $!" blocks (looping on repeated wait3() nonzero)
> } zsh-5.0.8
> }  - "wait $!" loops but also printing status every time
> 
> bin_fg() calls waitforpid() which discovers the job is stopped and goes
> into a loop calling kill(pid, SIGCONT) to try to get the job to run
> again.  In the 5.0.8 case, each time this happens the job briefly wakes
> up, gets stopped with SIGTTIN, thus causes another SIGCHLD to go to the
> parent zsh, which then prints the "suspended" message and loops right
> back to kill(pid, SIGCONT) again.
> 
> All of this is exactly the same as in 5.0.7 except that because of the
> SIGCONT change in workers/35032 we notice the stopped -> continued ->
> stopped again status change and therefore print the new status even
> though it's actually the same as the last time we printed the status,
> because we skipped printing the "continued" status.  Or so I surmise.
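
The stop/continue cycle described in the quoted text can be approximated outside zsh. What follows is only a rough sketch, not the code in waitforpid(): the function name, the event strings and the max_attempts cap are all made up for illustration, and the child stops itself with SIGSTOP as a stand-in for the SIGTTIN it would get from reading the terminal in the background.

```python
import os
import signal


def continue_until_exit(child_stops, max_attempts=10):
    """Fork a child that stops itself `child_stops` times, and keep
    sending SIGCONT from the parent until it exits or we give up.

    Hypothetical helper for illustration only; not zsh code.
    """
    pid = os.fork()
    if pid == 0:
        # Child: each SIGSTOP stands in for "woke up, got SIGTTIN again".
        try:
            for _ in range(child_stops):
                os.kill(os.getpid(), signal.SIGSTOP)
        finally:
            os._exit(0)

    events = []
    for _ in range(max_attempts):
        # WUNTRACED makes waitpid() report stops as well as exits.
        _, status = os.waitpid(pid, os.WUNTRACED)
        if os.WIFSTOPPED(status):
            # This is the point where 5.0.8 is described as printing
            # the "suspended" status on every pass round the loop.
            events.append("suspended")
            os.kill(pid, signal.SIGCONT)
        else:
            events.append("exited")
            return events

    # Assumption: a cap on retries, which the loop described above lacks.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    events.append("gave up")
    return events
```

With continue_until_exit(3) the parent records "suspended" three times and then "exited". With continue_until_exit(5, max_attempts=2) it records "suspended" twice and then "gave up", which is roughly the "don't keep continuing it without further user action" behaviour discussed below.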

So you might have thought the right thing to do was to note that it had
been stopped again immediately, possibly warn the user, and not try to
continue it again without further user action?  Is that easy?  Can we
pin down "immediately" well enough?  Clearly there's a race in the real
world where the programme could get SIGTTIN at any time, but in the
general case (i.e. where a background process got SIGTTIN when the
foreground was doing something irrelevant) you clearly *don't* want it
to continue every time.

In that case the difference between 5.0.7 and 5.0.8 becomes
basically moot (it's different but in a sane fashion).

Do we even understand what the loop with SIGCONT is doing for us?  Under
what circumstances would this help?  Some (other sort of) race where
something else (what?  Not zsh and not the process that's suspended)
takes a while to get going, so the SIGCONT only succeeds after a few
attempts?

> - "wait %1" -
> 
> bin_fg() calls zwaitjob() which does NOT do kill(pid, SIGCONT), instead
> simply blocking forever waiting for a SIGCHLD that will never arrive.
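
The "never arrives" part can be seen in a small experiment outside zsh. This is a hedged sketch rather than anything taken from zwaitjob(): the function name is invented, SIGSTOP again stands in for SIGTTIN, and a non-blocking waitpid() poll is used in place of actually blocking on SIGCHLD. Once the single stop has been reported, a stopped child that nobody continues has nothing further to report.

```python
import os
import signal


def stopped_child_reports_nothing_new():
    """Return (first_was_stop, polled_pid) for a child that stops itself
    and is never sent SIGCONT.  Hypothetical helper, not zsh code.
    """
    pid = os.fork()
    if pid == 0:
        # Child: stop once; it only reaches _exit if someone continues it.
        try:
            os.kill(os.getpid(), signal.SIGSTOP)
        finally:
            os._exit(0)

    # The one and only status change: the child has stopped.
    _, status = os.waitpid(pid, os.WUNTRACED)
    first_was_stop = os.WIFSTOPPED(status)

    # Poll again without blocking.  A returned pid of 0 means there is
    # no new state change, so a blocking wait here would simply hang.
    polled_pid, _ = os.waitpid(pid, os.WNOHANG | os.WUNTRACED)

    # Clean up: kill the stopped child and reap it.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return first_was_stop, polled_pid
```

The function returns (True, 0): the stop is reported once, and the second poll finds nothing, which matches the description of a waiter that never sends SIGCONT sitting there indefinitely.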

Hmm... I can't think of a good reason from the user point of view why
this should behave differently.  It just seems confusing.  It's
certainly not documented as a zsh feature, is it?

> - "wait" -
> 
> bin_fg() goes into a loop calling zwaitjob() on every entry in the job
> table; i.e., identical to "wait %1" repeated for every job number.

In which case I think the same reaction arises.

pws


