threads vs processes again

Posted: Tue Aug 05, 2008 2:57 am
by bob
Since I seem to be plagued by odd things, here is yet another. I recently moved back to POSIX threads, since the newer Linux kernels have a much more stable implementation. I had run a couple of tests for different reasons on one of the dual-processor quad-core boxes on a cluster here, and had noticed in passing that the NPS suddenly seemed higher. I decided to go back and investigate, and here's what I found.

First, I tested a bunch of positions using 8 processors to compare NPS speeds, and I was quite surprised to find that the thread-based version is not just faster, but is _significantly_ faster. Below I will give output from one position (all show about the same ratio). I ran the same position 4 times with the process (fork) based version, and 4 times with the new thread-based version. Everything else is identical between the two.

Code:


log.001:              time=30.12  mat=0  n=403743991  fh=94%  nps=13.4M
log.002:              time=31.35  mat=0  n=425472609  fh=94%  nps=13.6M
log.003:              time=38.80  mat=0  n=515449589  fh=94%  nps=13.3M
log.004:              time=31.21  mat=0  n=416896300  fh=94%  nps=13.4M
log.005:              time=19.06  mat=0  n=360009325  fh=94%  nps=18.9M
log.006:              time=19.35  mat=0  n=365336707  fh=94%  nps=18.9M
log.007:              time=25.68  mat=0  n=467414358  fh=94%  nps=18.2M
log.008:              time=16.90  mat=0  n=320243950  fh=94%  nps=18.9M

Bottom line is that for Crafty, there is a nearly 50% speed improvement using threads over processes. I am not sure why. I certainly understand how Linux creates processes with fork() and threads with clone(), and how the memory management works. Yet this is one of those things that make you go hmmm...

Again, to recap, all 8 runs above use the same position and the same .craftyrc (so the same hash size, etc.). The only difference is that the first 4 runs are version 22.1, and the last 4 are version 22.1 using threads rather than fork().

huge difference.
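
For reference, here is a minimal sketch of the two spawning styles being compared -- one worker per extra core either way. The names (SearchWorker, NCPUS) are illustrative placeholders, not Crafty's actual code.

Code:

/* fork()-based worker processes vs. pthread-based worker threads */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NCPUS 8                          /* hypothetical core count */

void *SearchWorker(void *arg) {          /* placeholder for the parallel search */
  printf("worker %ld searching\n", (long) arg);
  return NULL;
}

void SpawnProcesses(void) {              /* old style: the original process is worker 0 */
  for (long i = 1; i < NCPUS; i++)
    if (fork() == 0) {                   /* child shares code/data via copy-on-write */
      SearchWorker((void *) i);
      _exit(0);
    }
  while (wait(NULL) > 0);                /* reap the children */
}

void SpawnThreads(void) {                /* new style: one thread per extra CPU */
  pthread_t tid[NCPUS];
  for (long i = 1; i < NCPUS; i++)
    pthread_create(&tid[i], NULL, SearchWorker, (void *) i);
  for (long i = 1; i < NCPUS; i++)
    pthread_join(tid[i], NULL);          /* threads share the entire address space */
}

int main(void) {
  SpawnProcesses();
  SpawnThreads();
  return 0;
}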

Re: threads vs processes again

Posted: Tue Aug 05, 2008 3:10 am
by Zach Wegner
Weird. I actually finished my processes-to-threads conversion, and I came to the opposite conclusion. Threads, for some reason, are extremely slow on my system. I'm not really sure if it's my OS or a bug, but something weird is definitely going on there.

Re: threads vs processes again

Posted: Tue Aug 05, 2008 3:28 am
by bob
Zach Wegner wrote:Weird. I actually finished my processes-to-threads conversion, and I came to the opposite conclusion. Threads, for some reason, are extremely slow on my system. I'm not really sure if it's my OS or a bug, but something weird is definitely going on there.
What system? Recent Linux kernels finally have a really good thread library; the old POSIX threads library is long gone. NPTL addresses every complaint I had, including the "control thread" that I always thought was a stupid idea. They also now use a common process-ID mechanism, which might help memory address translation some...

I suppose I am going to have to start looking through the kernel again and see what is happening with memory management that might cause this. When NPTL was announced, I remember them talking about the speed of thread creation, where they created a bunch of threads (100K or 200K in 2 seconds on a 32-bit processor, where the old thread library approach took 15 minutes for the same task). But I create the threads once and am done, except in "smpnice=1" mode with no pondering, where I create them at the start of a normal search (not an iteration) and terminate them before making the move. So that doesn't explain anything for me. But that is a huge difference. I tried it on multiple processors, but I only used the Intel compiler, which is all I use unless forced to use GCC.

Might be they are doing something clever in placing things in memory for threads, where they don't for processes...
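
One way to picture the "create the threads once" setup described above is a pool of helpers started at program init that park on a condition variable until each new search begins. This is only a sketch under that assumption; HelperThread, NTHREADS, and the counter handshake are hypothetical, not Crafty's actual mechanism.

Code:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pool_cond = PTHREAD_COND_INITIALIZER;
static int search_counter = 0;           /* bumped each time a new search starts */
static int quit = 0;

static void *HelperThread(void *arg) {
  long id = (long) arg;
  int seen = 0;
  for (;;) {
    pthread_mutex_lock(&pool_lock);
    while (search_counter == seen && !quit)
      pthread_cond_wait(&pool_cond, &pool_lock);     /* park until work arrives */
    int done = quit;
    seen = search_counter;
    pthread_mutex_unlock(&pool_lock);
    if (done)
      return NULL;
    printf("thread %ld joins search %d here\n", id, seen);  /* placeholder for the SMP search */
  }
}

int main(void) {
  pthread_t tid[NTHREADS];
  for (long i = 1; i < NTHREADS; i++)                /* created once, at startup */
    pthread_create(&tid[i], NULL, HelperThread, (void *) i);
  for (int s = 0; s < 2; s++) {                      /* pretend two searches happen */
    pthread_mutex_lock(&pool_lock);
    search_counter++;
    pthread_cond_broadcast(&pool_cond);              /* wake the parked helpers */
    pthread_mutex_unlock(&pool_lock);
    sleep(1);
  }
  pthread_mutex_lock(&pool_lock);
  quit = 1;                                          /* shut the pool down at exit */
  pthread_cond_broadcast(&pool_cond);
  pthread_mutex_unlock(&pool_lock);
  for (long i = 1; i < NTHREADS; i++)
    pthread_join(tid[i], NULL);
  return 0;
}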

Re: threads vs processes again

Posted: Tue Aug 05, 2008 4:52 am
by plattyaj
Hmm, a few things spring to mind. Memory overhead for processes would be one, though I can't see why that should be a huge issue for a chess engine, given that only limited memory is needed outside the main hash tables.

But I suspect that the touted advantages of NPTL are showing up:

Locking primitives (fast mutex lock)
Context switching (faster than processes)

Andy.

Re: threads vs processes again

Posted: Tue Aug 05, 2008 6:34 am
by Zach Wegner
bob wrote:What system?
NetBSD. After writing the post, I did a bit more research and realized that pthreads use an environment variable, PTHREAD_CONCURRENCY, at least on NetBSD. I had never seen that before--I don't remember pthreads ever acting like that. Anyway, after that change, threads are much closer to processes--but still significantly slower. The idle time is two or three times greater, and the NPS is about half. It's possible that there's some debug code lying around somewhere, but since single-threaded mode is up to speed, I doubt it. It could just be NetBSD and pthreads--though I hope not. One other issue is that the above results are with assembler spinlocks; pthread mutexes are very, very slow. That's possibly a configuration problem too... I'll get the sources updated so that others can run tests on other systems; all I have here is NetBSD until I get my new laptop.

EDIT: it's there now, you can check out at the normal place with anonymous CVS, under the tag zct0_3_2472_threads
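
For anyone wanting to poke at the concurrency-level issue: the PTHREAD_CONCURRENCY environment variable is what was observed on NetBSD above; the portable API counterpart is pthread_setconcurrency(). A minimal sketch (the level of 8 is just an example):

Code:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  /* Hint to an M:N thread library that it should provide enough kernel
     execution contexts to run this many threads truly in parallel. */
  int rc = pthread_setconcurrency(8);
  if (rc != 0)
    fprintf(stderr, "pthread_setconcurrency: %s\n", strerror(rc));
  printf("concurrency level now %d\n", pthread_getconcurrency());
  return 0;
}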

Re: threads vs processes again

Posted: Tue Aug 05, 2008 7:16 am
by bob
Zach Wegner wrote:
bob wrote:What system?
NetBSD. After writing the post, I did a bit more research and realized that pthreads use an environment variable, PTHREAD_CONCURRENCY, at least on NetBSD. I had never seen that before--I don't remember pthreads ever acting like that. Anyway, after that change, threads are much closer to processes--but still significantly slower. The idle time is two or three times greater, and the NPS is about half. It's possible that there's some debug code lying around somewhere, but since single-threaded mode is up to speed, I doubt it. It could just be NetBSD and pthreads--though I hope not. One other issue is that the above results are with assembler spinlocks; pthread mutexes are very, very slow. That's possibly a configuration problem too... I'll get the sources updated so that others can run tests on other systems; all I have here is NetBSD until I get my new laptop.

EDIT: it's there now, you can check out at the normal place with anonymous CVS, under the tag zct0_3_2472_threads
What kernel is that based on? Not Linux, if I recall? If so, that is the problem. BSD/Solaris threads are not particularly efficient. And you might have to use pthread_attr() to make sure that logical threads and physical processes are matched up... Solaris doesn't do that by default; I don't know about NetBSD.

BTW I can run your code on our 8-core box to test if you want...
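
A minimal sketch of the pthread_attr() suggestion above: the standard call for this kind of binding is pthread_attr_setscope() with PTHREAD_SCOPE_SYSTEM. Whether it changes anything on NetBSD is exactly the open question here, and SearchWorker is just a placeholder name.

Code:

#include <pthread.h>
#include <stdio.h>

static void *SearchWorker(void *arg) {
  (void) arg;                                        /* the parallel search would run here */
  return NULL;
}

int main(void) {
  pthread_attr_t attr;
  pthread_t tid;

  pthread_attr_init(&attr);
  /* Request system contention scope: one kernel-scheduled entity per
     thread, instead of M:N multiplexing inside the library. */
  if (pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM) != 0)
    fprintf(stderr, "system scope not supported here\n");
  pthread_create(&tid, &attr, SearchWorker, NULL);
  pthread_join(tid, NULL);
  pthread_attr_destroy(&attr);
  return 0;
}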

Re: threads vs processes again

Posted: Tue Aug 05, 2008 7:20 am
by bob
plattyaj wrote:Hmm, a few things spring to mind. Memory overhead for processes would be one, though I can't see why that should be a huge issue for a chess engine, given that only limited memory is needed outside the main hash tables.

But I suspect that the touted advantages of NPTL are showing up:

Locking primitives (fast mutex lock)
Context switching (faster than processes)

Andy.
On good kernels this should not be an issue. Copy-on-write means that when you do a fork(), you already share the executable code just as you do with threads, and you share any data that was initialized before the fork and not modified afterward. So you end up with almost the same virtual address space either way, and I have confirmed that the VM footprint is roughly equivalent for the two, except when the EGTB code is used, which I did not use (or compile in) for this test.

I don't use mutex/futex locks, just my normal spin locks.

And there is no context switching, since I run 8 threads and use 8 cores.

This is not an easy one to explain, but there are some thread-specific additions in the 2.6.x Linux kernels, and I plan on looking at them in detail to see if anything there might explain this (although I have no idea how one could lose/gain that much speed from memory tweaks or whatever...).
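
For reference, the "normal spin locks" mentioned above look roughly like this. This is a sketch built on the GCC/ICC atomic builtins, not Crafty's exact lock macros.

Code:

#include <pthread.h>
#include <stdio.h>

typedef volatile int lock_t;

static void LockAcquire(lock_t *lock) {
  while (__sync_lock_test_and_set(lock, 1))   /* atomic exchange; returns old value */
    while (*lock)                             /* spin on a plain read to ease bus traffic */
      ;
}

static void LockRelease(lock_t *lock) {
  __sync_lock_release(lock);                  /* store 0 with release semantics */
}

static lock_t hash_lock = 0;
static long counter = 0;

static void *Worker(void *arg) {
  (void) arg;
  for (int i = 0; i < 1000000; i++) {
    LockAcquire(&hash_lock);
    counter++;                                /* pretend this is a shared-table update */
    LockRelease(&hash_lock);
  }
  return NULL;
}

int main(void) {
  pthread_t tid[4];
  for (int i = 0; i < 4; i++)
    pthread_create(&tid[i], NULL, Worker, NULL);
  for (int i = 0; i < 4; i++)
    pthread_join(tid[i], NULL);
  printf("counter=%ld (expect 4000000)\n", counter);
  return 0;
}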

Re: threads vs processes again

Posted: Tue Aug 05, 2008 7:42 am
by Aleks Peshkov
I suspect cross-process Transposition Table creates some memory management overhead.

Re: threads vs processes again

Posted: Tue Aug 05, 2008 7:50 am
by bob
Aleks Peshkov wrote:I suspect cross-process Transposition Table creates some memory management overhead.
It is simply shared memory either way (threads or processes). The threads all have the same hash table mapped into their virtual address spaces, and ditto for the processes, since I used the System V shared-memory approach (shmget()/shmat()/etc.).
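
For completeness, a minimal sketch of that System V shared-memory setup: the table is created with shmget() and attached with shmat() before the fork(), so every process sees the same physical memory. Sizes and names here are illustrative, not Crafty's.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  size_t hash_bytes = 64 * 1024 * 1024;                /* illustrative hash size */
  int shmid = shmget(IPC_PRIVATE, hash_bytes, IPC_CREAT | 0600);
  if (shmid < 0) { perror("shmget"); return 1; }

  unsigned long *hash_table = shmat(shmid, NULL, 0);   /* map into this process */
  if (hash_table == (void *) -1) { perror("shmat"); return 1; }
  shmctl(shmid, IPC_RMID, NULL);                       /* auto-remove once all detach */

  if (fork() == 0) {                                   /* child inherits the mapping */
    hash_table[0] = 0xdeadbeef;                        /* "store" a hash entry */
    _exit(0);
  }
  wait(NULL);
  printf("parent sees child's store: %lx\n", hash_table[0]);
  shmdt(hash_table);
  return 0;
}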