Crashing engines (Linux)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Crashing engines (Linux)

Post by hgm »

I received a complaint that XBoard does not notice when engines exit (or are killed). XBoard has always relied on failure of the communication pipes to detect this: when the sender process dies, readers of the pipe are supposed to receive an EOF, while writing on a pipe with no receivers gives a SIGPIPE signal.
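The pipe mechanism described above can be sketched as follows. This is a minimal illustration, not XBoard's actual code: the names `spawn_silent_child`, `read_until_eof` and `kill_and_wait_for_eof` are hypothetical, and the child merely calls `pause()` to stand in for a thinking engine. It assumes the parent closes its own copy of the write end, since a reader only sees EOF once every copy of the write end has been closed.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: start a child whose only link to us is the write
   end of a pipe. Returns the child's PID and stores the read end in
   *read_fd, or returns -1 on failure. */
pid_t spawn_silent_child(int *read_fd)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);          /* child keeps only the write end */
        for (;;)
            pause();            /* stands in for an engine that is thinking */
    }

    close(fds[1]);              /* parent must drop its copy of the write end,
                                   otherwise read() can never report EOF */
    *read_fd = fds[0];
    return pid;
}

/* Block until the pipe reports EOF. Returns 0 on EOF, -1 on a read error. */
int read_until_eof(int fd)
{
    char buf[256];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            return 0;           /* all writers are gone */
        if (n < 0 && errno != EINTR)
            return -1;
    }
}

/* Kill the child, then wait for the reader to see EOF (returns 0 on EOF). */
int kill_and_wait_for_eof(pid_t pid, int fd)
{
    if (kill(pid, SIGKILL) == -1)
        return -1;
    return read_until_eof(fd);
}
```

If a second process (for instance another child forked later) inherited the same write end, the read would keep blocking even after the first child is gone, so this sketch only covers the single-writer case.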

Now a GUI does not normally write to engines just to test whether they are still alive; when the engine is thinking this could actually cause the search to be aborted (according to the protocol specs). So it is an absolute no-no. It is reading all the time, however, so it depends on getting an EOF there.

Unfortunately killing the thinking engine doesn't appear to produce one. In fact the engine process does not even seem to die when you kill it. (And this was not even with the Immortal engine...) If you do a "ps l" after the kill, the process with that ID is still there; it just had its command line erased, and now has a WCHAN of exit. Apparently this is not enough to cause an EOF on the reading end.

So how can you ever know if the child process is unexpectedly terminated or not?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crashing engines (Linux)

Post by bob »

hgm wrote:I received a complaint that XBoard does not notice when engines exit (or are killed). XBoard has always relied on failure of the communication pipes to detect this: when the sender process dies, readers of the pipe are supposed to receive an EOF, while writing on a pipe with no receivers gives a SIGPIPE signal.

Now a GUI does not normally write to engines just to test whether they are still alive; when the engine is thinking this could actually cause the search to be aborted (according to the protocol specs). So it is an absolute no-no. It is reading all the time, however, so it depends on getting an EOF there.

Unfortunately killing the thinking engine doesn't appear to produce one. In fact the engine process does not even seem to die when you kill it. (And this was not even with the Immortal engine...) If you do a "ps l" after the kill, the process with that ID is still there; it just had its command line erased, and now has a WCHAN of exit. Apparently this is not enough to cause an EOF on the reading end.

So how can you ever know if the child process is unexpectedly terminated or not?
This is an easy one. Catch SIGCHLD signals. Whenever a process you start via fork() (which is what xboard has always used) terminates, you get a SIGCHLD signal delivered. Since you will usually have more than one process running with xboard, you then use the waitpid() system call to obtain the PID of the process that terminated.
hgm

Re: Crashing engines (Linux)

Post by hgm »

Indeed, this works. Thanks. Even when the engine process stays in this limbo state, XBoard already gets the SIGCHLD. So I will base the error detection on that. (This could be a bit cumbersome, as there are plenty of occasions where engine processes are supposed to terminate. E.g. with /xreuse.)
bob

Re: Crashing engines (Linux)

Post by bob »

hgm wrote:Indeed, this works. Thanks. Even when the engine process stays in this limbo state, XBoard already gets the SIGCHLD. So I will base the error detection on that. (This could be a bit cumbersome, as there are plenty of occasions where engine processes are supposed to terminate. E.g. with /xreuse.)
One thing is critical: if you catch SIGCHLD, you MUST do a waitpid(-1, etc) to clear the zombie/defunct process out of the system. If you don't, it will remain, and eventually fork() will fail with "no proc table entry". The wait() family of system calls returns the process's final exit status (e.g. the value from return 1 or return 0 in main()) and also gets rid of it completely...
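The reaping step could look roughly like this. It is a sketch under assumptions: `reap_children` and `spawn_child_exiting` are hypothetical names, and the function is meant to be called from ordinary code after a SIGCHLD has been noticed, not from inside the handler, because `fprintf` is not async-signal-safe. It loops because several children terminating close together may be reported by a single SIGCHLD.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper for the demonstration: a child that exits at once. */
pid_t spawn_child_exiting(int code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);
    return pid;                 /* child's PID, or -1 if fork failed */
}

/* Reap every child that has already terminated, without blocking.
   Returns the number of children reaped. */
int reap_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            reaped++;
            if (WIFEXITED(status))
                fprintf(stderr, "child %ld exited with status %d\n",
                        (long)pid, WEXITSTATUS(status));
            else if (WIFSIGNALED(status))
                fprintf(stderr, "child %ld was killed by signal %d\n",
                        (long)pid, WTERMSIG(status));
            continue;           /* look for further terminated children */
        }
        if (pid == -1 && errno == EINTR)
            continue;           /* interrupted by a signal, try again */

        /* pid == 0: children exist but none has terminated yet.
           pid == -1 with ECHILD: there are no children left. */
        break;
    }
    return reaped;
}
```

Distinguishing `WIFEXITED` from `WIFSIGNALED` is one possible way to separate an engine that quit on its own (as with /xreuse) from one that crashed or was killed, although the GUI would still need to know which exits it was expecting.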