Killing zombies (POSIX)

lucasart · Post by **lucasart** » Sun Nov 08, 2020 3:56 am

Normal engines will quit by themselves, when stdin becomes a broken pipe (which happens because the OS closes the write-end of the pipe when the parent process terminates). But some annoying ones don't play along, and remain as zombie processes (eg. Cheng).

Let's break this down, step by step:

Install a signal handler

First we install a signal handler for SIGINT and SIGTERM in main(). Simple enough RTM job:

Code: Select all

static void sig_handler(int sig, siginfo_t *siginfo, void *context)
{
    // Kill every process with the same PGID (should include all child processes)
    kill(0, SIGKILL);  
    _Exit(0);
}

static void set_sig_handler(void)
{
    struct sigaction act = {0};
    act.sa_flags = SA_SIGINFO;
    act.sa_sigaction = sig_handler;
    sigaction(SIGINT, &act, NULL);
    sigaction(SIGTERM, &act, NULL);
}

int main(...)
{
    set_sig_handler();
    ...
}

Block signals in worker threads

The problem is that the OS will send signals in any thread, and we can't know which one. Could be the main thread, or any of the worker thread (each running games in parrallel). So we must tell the OS to block signals in each worker thread, so that only the main thread can receive them. Here thread_start is the function given to pthread_create(), where every worker thread begins its journey.

Code: Select all

static void set_sig_blocker(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    sigaddset(&mask, SIGTERM);
    pthread_sigmask(SIG_BLOCK, &mask, NULL);
}

static void *thread_start(void *arg)
{
    set_sig_blocker();
    ...
}

All of the above works fine. I've tested it.

Spawning child processes with the PGID of the parent

I'll spare you all the boring details, and highlight ony the relevant part:

Code: Select all

static void engine_spawn(const Worker *w, Engine *e, const char *cwd, const char *run, char **argv)
{
    ...
    e->pid = fork();

    if (e->pid == 0) {
        // in the child process
        ...

        // Inherit process ID and group ID from parent
        setpgid(0, 0);

        // Set cwd as current directory, and execute run
        chdir(cwd);
        execvp(run, argv);
    } else {
        // in the parent process
        ...
    }
}

The setpgid(0, 0) should set the PGID of the child process to be the same as the parent process. But this doesn't work. Engines get the same PPID as the parent, but not the same PGID. Here's what I can see, when I run it with Cheng. I want all these Cheng processes to have the same PGID as c-chess-cli, , but they don't:

Code: Select all

$ ps xao pid,ppid,pgid,comm |head -1
    PID    PPID    PGID COMMAND
$ ps xao pid,ppid,pgid,comm |grep c-chess-cli
  60979    1855   60979 c-chess-cli
$ ps xao pid,ppid,pgid,comm |grep cheng
  60985   60979   60985 cheng_4.39
  60987   60979   60987 cheng_4.39
  60989   60979   60989 cheng_4.39
  60991   60979   60991 cheng_4.39
  60992   60979   60992 cheng_4.39
  60993   60979   60993 cheng_4.39
  60994   60979   60994 cheng_4.39
  60995   60979   60995 cheng_4.39

And when I Ctrl+C, which is handled by sig_handler() as SIGINT, the kill(0, SIGKILL) doesn't work as a result (since they don't have the PGID of the parent). Indeed, I see zombie processes, inherited by the init process 1:

Code: Select all

$ ps xao pid,ppid,pgid,comm |grep cheng
  79473       1   79473 cheng_4.39
  79475       1   79475 cheng_4.39
  79479       1   79479 cheng_4.39
  79482       1   79482 cheng_4.39

mar · Post by **mar** » Sun Nov 08, 2020 4:26 am

This has been fixed 4 years ago but that was after the last official release of 4.39 (thanks to Peter Osterlund I think? or maybe Michael Byrne, I'm not quite sure)
My apologies, I simply forgot to check for this in the main loop.

since 4.39 comes with source code, those who want to use it may fix it themselves:
just edit main.cpp and change the main loop to look like this this (add these two lines):

Code: Select all

	while ( !proto->shouldQuit() )
	{
		std::string line;
		std::getline( std::cin, line );

		if ( !std::cin.good() )
			line = "quit";

		proto->parse( line );
	}

So I think there's no need to "fix" c-chess-cli as this is apparently a rather stupid bug in Cheng.

lucasart · Post by **lucasart** » Sun Nov 08, 2020 4:36 am

lucasart wrote: ↑Sun Nov 08, 2020 3:56 am The setpgid(0, 0) should set the PGID of the child process to be the same as the parent process. But this doesn't work. Engines get the same PPID as the parent, but not the same PGID. Here's what I can see, when I run it with Cheng. I want all these Cheng processes to have the same PGID as c-chess-cli, , but they don't:
Code: Select all
$ ps xao pid,ppid,pgid,comm |head -1
    PID    PPID    PGID COMMAND
$ ps xao pid,ppid,pgid,comm |grep c-chess-cli
  60979    1855   60979 c-chess-cli
$ ps xao pid,ppid,pgid,comm |grep cheng
  60985   60979   60985 cheng_4.39
  60987   60979   60987 cheng_4.39
  60989   60979   60989 cheng_4.39
  60991   60979   60991 cheng_4.39
  60992   60979   60992 cheng_4.39
  60993   60979   60993 cheng_4.39
  60994   60979   60994 cheng_4.39
  60995   60979   60995 cheng_4.39

Answering my own question. Sorry for spamming

The problem is that

Code: Select all

setpgid(0, 0);

is called in the context of the chile process, when fork()==0. I need to set it explicitly to the PGID of the parent process, like so:

Code: Select all

// In the parent process, before fork()
const pid_t pgid = getpgid(0);

// In the child process, when fork() == 0
setpgid(0, pgid);

And it works.

Now the approach is still a bit crude. I have a single process group, which contains c-chess-cli itself, and all the engine processes it spawns. So, when I do kill(0, SIGKILL) from the parent process, the OS will kill everything, including the parent process itself! I suppose, I should use 2 PGID: one for c-chess-cli, and one for all engines. So I can SIGTERM them, wait, SIGKILL, do some cleanup in the parent process, and exit gracefully.

lucasart · Post by **lucasart** » Sun Nov 08, 2020 8:39 am

mar wrote: ↑Sun Nov 08, 2020 4:26 am This has been fixed 4 years ago but that was after the last official release of 4.39 (thanks to Peter Osterlund I think? or maybe Michael Byrne, I'm not quite sure)
My apologies, I simply forgot to check for this in the main loop.

since 4.39 comes with source code, those who want to use it may fix it themselves:
just edit main.cpp and change the main loop to look like this this (add these two lines):
Code: Select all
	while ( !proto->shouldQuit() )
	{
		std::string line;
		std::getline( std::cin, line );

		if ( !std::cin.good() )
			line = "quit";

		proto->parse( line );
	}
So I think there's no need to "fix" c-chess-cli as this is apparently a rather stupid bug in Cheng.

Thanks. But I will still "fix" c-chess-cli. I want it to be as robust as possible, and guarantee termination of engines, regardless of what these engines are doing (even if it's objectively wrong).

hgm · Post by **hgm** » Sun Nov 08, 2020 9:46 am

I am not sure what problem you are addressing here. It seems you give a prescription for what how Linux engines would have to be modified for making them easily killable by the GUI. But if an engine is to be modified, it should of course be modified to simply respond to the 'quit' command. Not by making it suitable for work-arounds that are applied when 'quit' is not working properly.

In WinBoard I never managed to make the killing of engines work properly. There are no Unix-like signals on Windows, but the Windows API does have a call TerminateProcess, which should be the equivalent of a hard kill. But I never got it to work reliably. The call never reports an error, but yet engine processes tend to survive. In Windows orphaned processes survive much more easily than in Linux anyway, at it apparently isn't a fatal error there to write to a broken pipe. But even in Linux an engine process can survive indefinitely if it just refrains from writing, and doesn't terminate on EOF of stdin.

Another nasty problem is engines that run through adapters: the GUI would kill the adapter, rather than the engine.

lucasart · Post by **lucasart** » Sun Nov 08, 2020 10:33 am

hgm wrote: ↑Sun Nov 08, 2020 9:46 am I am not sure what problem you are addressing here. It seems you give a prescription for what how Linux engines would have to be modified for making them easily killable by the GUI. But if an engine is to be modified, it should of course be modified to simply respond to the 'quit' command. Not by making it suitable for work-arounds that are applied when 'quit' is not working properly.

In WinBoard I never managed to make the killing of engines work properly. There are no Unix-like signals on Windows, but the Windows API does have a call TerminateProcess, which should be the equivalent of a hard kill. But I never got it to work reliably. The call never reports an error, but yet engine processes tend to survive. In Windows orphaned processes survive much more easily than in Linux anyway, at it apparently isn't a fatal error there to write to a broken pipe. But even in Linux an engine process can survive indefinitely if it just refrains from writing, and doesn't terminate on EOF of stdin.

Another nasty problem is engines that run through adapters: the GUI would kill the adapter, rather than the engine.

The signal based solution described above is a hellish mess, and is not (and cannot be!) a full proof solution. For example, it can't handle SIGKILL, so you'd still get zombies if c-chess-cli is killed this way. And it can't really handle SIGTERM either, or it would enter in a SIGTERM handling recursion (with the handler throwing SIGTERM to all the processes in the group, including itself...).

I think the only correct (and simple) solution is the Linux-only:

Code: Select all

prctl(PR_SET_PDEATHSIG, SIGTERM);

Unfortunately, this is not POSIX, and only works on Linux and Android, but not MacOS.

lucasart · Post by **lucasart** » Sun Nov 08, 2020 11:32 am

lucasart wrote: ↑Sun Nov 08, 2020 10:33 am
hgm wrote: ↑Sun Nov 08, 2020 9:46 am I am not sure what problem you are addressing here. It seems you give a prescription for what how Linux engines would have to be modified for making them easily killable by the GUI. But if an engine is to be modified, it should of course be modified to simply respond to the 'quit' command. Not by making it suitable for work-arounds that are applied when 'quit' is not working properly.

In WinBoard I never managed to make the killing of engines work properly. There are no Unix-like signals on Windows, but the Windows API does have a call TerminateProcess, which should be the equivalent of a hard kill. But I never got it to work reliably. The call never reports an error, but yet engine processes tend to survive. In Windows orphaned processes survive much more easily than in Linux anyway, at it apparently isn't a fatal error there to write to a broken pipe. But even in Linux an engine process can survive indefinitely if it just refrains from writing, and doesn't terminate on EOF of stdin.

Another nasty problem is engines that run through adapters: the GUI would kill the adapter, rather than the engine.
The signal based solution described above is a hellish mess, and is not (and cannot be!) a full proof solution. For example, it can't handle SIGKILL, so you'd still get zombies if c-chess-cli is killed this way. And it can't really handle SIGTERM either, or it would enter in a SIGTERM handling recursion (with the handler throwing SIGTERM to all the processes in the group, including itself...).

I think the only correct (and simple) solution is the Linux-only:
Code: Select all
prctl(PR_SET_PDEATHSIG, SIGTERM);
Unfortunately, this is not POSIX, and only works on Linux and Android, but not MacOS.

A nice consequence of this solution is that killing any of the processes (the N instances of engine A, the N instances of engine B, or their parent process c-chess-cli), eventually kills all of them:

terminating the parent c-chess-cli (however it happens, programmatically or through any lethal signal), causes the kernel to send SIGTERM to all the children.
terminating any child, will be detected by reading EOF in the parent, which will exit(), and the above kicks in to terminate all other engine instances.

hgm · Post by **hgm** » Sun Nov 08, 2020 11:35 am

This is one area where I doubt whether it is really a good thing to have it. It is important that engines terminate when they receive 'quit', and do not leave rogue processes. Engines that cannot manage that should not be used. Allowing them to be used without problems through complicated work-arounds just encourages release of buggy engines. In the end everyone will be off worse. It is like elevating a bug to a standard.

lucasart · Post by **lucasart** » Sun Nov 08, 2020 12:08 pm

hgm wrote: ↑Sun Nov 08, 2020 11:35 am This is one area where I doubt whether it is really a good thing to have it. It is important that engines terminate when they receive 'quit', and do not leave rogue processes. Engines that cannot manage that should not be used. Allowing them to be used without problems through complicated work-arounds just encourages release of buggy engines. In the end everyone will be off worse. It is like elevating a bug to a standard.

I know. That is why I started this post by saying that normal engines don't need this, because they check return values (eg. fgets() returns EOF, engine must exit).

But broken engines are a fact of life. Some coded by amateurs, that have no clue what they're doing. Others by competent programmers, who can still make mistakes by oversight (eg. Cheng). In the end the user will just complain and say the (G)UI is at fault, because they don't understand the whole chain of events as we do...

I played around with cutechess-cli and Cheng, and managed to create zombies

To be fair, cutechess-cli does everything that is possible to do, in a portable way, with signals. But there is one thing that it cannot do that way. If you send SIGKILL to cutechess-cli, it cannot handle it, and Cheng is left as a zombie.

hgm · Post by **hgm** » Sun Nov 08, 2020 12:26 pm

Users even complain the GUI is at fault when it allows e.p. capture. Or when it refuses to start an engine they have not on their computer.

Catering to bugs is really harmful. XBoard accepts moves with a missing promotion suffix, using a default choice (Queen in Chess). As a result many engines do not bother to add promotion suffixes to their moves anymore...

Killing zombies (POSIX)

Killing zombies (POSIX)

Re: Killing zombies (POSIX)

Re: Killing zombies (POSIX)

Re: Killing zombies (POSIX)

Re: Killing zombies (POSIX)

Re: Killing zombies (POSIX)

Re: Killing zombies (POSIX)

Re: Killing zombies (POSIX)

Re: Killing zombies (POSIX)

Re: Killing zombies (POSIX)