Parallel search once more

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Parallel search once more

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote: I have started re-running my SMP tests since the Intel compiler provides such a nice NPS improvement (multiple threads only; single thread seems to be the same as before).

Here are my first 20-core test runs (same 24 positions, each run 4 times). All results are the geometric mean for a 24-position run.

            run 1   run 2   run 3   run 4   avg
speedup     15.0    15.0    15.0    15.3    15.1
nps         16.3    16.4    16.3    16.3    16.3

Net result: NPS scaling improved from about 14.7x to 16.3x. I know part of the bottleneck is the 64GB hash table and TLB thrashing, so I plan to try mmap() with the rather clumsy 1GB huge pages. I have not done this yet and am still using the automatic 2MB pages (transparent huge pages in current Linux kernels). Speedup (geomean) went from 12.45 to 15.1 due to the Intel issue plus a few further SMP refinements. BTW, the 20-CPU search looked at about 5% more nodes on average than the 1-CPU test, so overhead is pretty well managed for the moment.

The speedup has really captured my attention, because it is right at the theoretical max (15.1 with max of 16.3 based on NPS numbers). I am fixing to create another set of test positions just to be sure that these positions don't somehow happen to artificially inflate the speedup results. I am not sure exactly how I am going to choose this test set. I was leaning toward either (a) taking a set of positions from a long time control game played on ICC, or else (b) a random set of positions extracted from GM games (the way I extract starting positions for cluster testing).

A much larger set of positions would be better, statistically, but not so good practically as the tests would take forever to run. More on this later....
These are exceptionally high speedup numbers, surpassing even your DTS results. I have never seen such numbers at 16 or 20 cores. I am tempted to think something is amiss.
That is based on 4 runs of 24 positions. All I have so far is the 1-CPU run, which didn't change at all, and these four 20-CPU runs. Yes, they are pretty high, but as to whether something is amiss or not, I do not know. I have run literally thousands of positions comparing results with 1 to N CPUs. There is always variability, but there have been no instances of bogus moves or bogus scores being chosen. And Crafty has been playing on ICC with this code for months. The only bug I have found in the last 30 days is a statistics-gathering bug in counting the number of reductions that were done, which overstepped an array bound. But that bug was in the search, period, not an SMP issue.

I am certainly looking at the code carefully and testing whatever I can. As of right now, it looks rock-solid to me. It might be that those 24 positions are very favorable to a search by Crafty. I have already run the set posted in this thread using 20 cores. I'm going to pick 60 of them and run to fixed depth with 1 and 20 cores to get a feel for whether the positions might be the issue. I'd rather run 1000 positions but that is simply not practical at these speeds...
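
One rough way to sanity-check the quoted figures: if time to depth is nodes searched divided by nodes per second, then the speedup should be about the NPS scaling divided by the node overhead. The sketch below applies that to the run averages in the table and to the "about 5% more nodes" remark. The function name is illustrative, the 1.05 factor is an assumption read off that remark, and using run averages in place of per-position values is only an approximation.

```python
from statistics import mean

# Run averages copied from the table above (20 threads vs. 1 thread).
speedup_runs = [15.0, 15.0, 15.0, 15.3]
nps_runs = [16.3, 16.4, 16.3, 16.3]


def predicted_speedup(nps_scaling: float, node_overhead: float) -> float:
    """Time-to-depth speedup implied by NPS scaling and extra nodes searched.

    time = nodes / nps, so T1 / Tn = (nps_n / nps_1) / (nodes_n / nodes_1).
    """
    return nps_scaling / node_overhead


avg_speedup = mean(speedup_runs)
avg_nps = mean(nps_runs)
# "About 5% more nodes" is taken as a factor of 1.05 (an assumption).
bound = predicted_speedup(avg_nps, 1.05)

print(f"average speedup      {avg_speedup:.1f}")
print(f"average NPS scaling  {avg_nps:.1f}")
print(f"predicted from nodes {bound:.1f}")
print(f"fraction of NPS max  {avg_speedup / avg_nps:.2f}")
```

With these inputs the node-based estimate comes out slightly above the measured average, so the reported speedup is at least no higher than the NPS and node counts would allow.
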
I am not saying that there is necessarily something wrong, but these numbers with the new compiler are pretty unbelievable. And they contradict my tests with Crafty 23.6 on up to 4 cores and Andreas Strangmüller's tests on up to 16 cores with Crafty 24.1:
http://www.talkchess.com/forum/viewtopi ... 59&start=0

I tested Crafty 23.6 with 30 positions, averaging 2 minutes per position on 4 cores (6 minutes per position on 1 core). The effective speedup I get on 4 cores is 2.7 and the NPS speedup is 3.2, nowhere near your 3.7 and 3.8 for Crafty 25.0 with the old compiler (maybe even higher with the new compiler).

Andreas Strangmüller, going up to 16 cores, got an effective speedup of maybe 6.0 for Crafty 24.1, with bad scaling from 8 to 16 threads, while your results on 16 cores with Crafty 25.0 using the old compiler show a speedup of 10-11 and a very good 8 -> 16 thread scaling of 1.6 for effective speedup and 1.7 for NPS speedup. Maybe even better with the new compiler.

The speedup of 15 on 20 cores with the new compiler is VERY high.

Anyway, if you confirm your Crafty 25.0 new-compiler numbers, you should not only publish the results but also write a paper explaining the parallel algorithm. I think nobody else gets such speedups. With this sort of SMP implementation, Jonny would have won the WCCC long ago.
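
To put the figures in this post on one scale, the sketch below divides each quoted effective speedup by its thread count (parallel efficiency). The numbers are the ones quoted above; reading "3.7 and 3.8" as effective speedup and NPS speedup at 4 cores, in that order, is an assumption, and the labels and the `efficiency` helper are only illustrative.

```python
# (label, threads, effective speedup) as quoted in the post above.
results = [
    ("Crafty 23.6 (Laskos)", 4, 2.7),
    ("Crafty 25.0, old compiler", 4, 3.7),
    ("Crafty 24.1 (Strangmüller)", 16, 6.0),
    ("Crafty 25.0, new compiler", 20, 15.1),
]


def efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / threads


for label, threads, speedup in results:
    print(f"{label:28s} {threads:2d} threads  "
          f"speedup {speedup:5.1f}  efficiency {efficiency(speedup, threads):.2f}")
```
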
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Parallel search once more

Post by cdani »

Is it possible to have a copy of this new Crafty?
I can do some tests at 32 cores (on an Amazon server), just for fun.
Of course I will publish the results.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Parallel search once more

Post by bob »

Laskos wrote:
I am not saying that there is necessarily something wrong, but these numbers with the new compiler are pretty unbelievable. And they contradict my tests with Crafty 23.6 on up to 4 cores and Andreas Strangmüller's tests on up to 16 cores with Crafty 24.1:
http://www.talkchess.com/forum/viewtopi ... 59&start=0

I tested Crafty 23.6 with 30 positions, averaging 2 minutes per position on 4 cores (6 minutes per position on 1 core). The effective speedup I get on 4 cores is 2.7 and the NPS speedup is 3.2, nowhere near your 3.7 and 3.8 for Crafty 25.0 with the old compiler (maybe even higher with the new compiler).

Andreas Strangmüller, going up to 16 cores, got an effective speedup of maybe 6.0 for Crafty 24.1, with bad scaling from 8 to 16 threads, while your results on 16 cores with Crafty 25.0 using the old compiler show a speedup of 10-11 and a very good 8 -> 16 thread scaling of 1.6 for effective speedup and 1.7 for NPS speedup. Maybe even better with the new compiler.

The speedup of 15 on 20 cores with the new compiler is VERY high.

Anyway, if you confirm your Crafty 25.0 new-compiler numbers, you should not only publish the results but also write a paper explaining the parallel algorithm. I think nobody else gets such speedups. With this sort of SMP implementation, Jonny would have won the WCCC long ago.
I don't doubt the 23.x version was significantly worse. 25.0 is a complete rewrite of the parallel search. And it is definitely MUCH better particularly with more cores, compared to any previously released version.

I have a new set of positions to try, but the test last night failed pretty badly. I picked 33 positions from the ones posted in this thread yesterday, positions for which it appeared that a depth of 28 would take 1-2 minutes on average. Didn't happen. Several completed in a couple of seconds, and my current parallel search does not do well at such short times, just as the older versions did not. It is a pain to set the depth for every position, but it appears necessary to get reasonable times. The whole thing ran so quickly that four 20-thread runs AND one 1-thread run completed in less than 7 hours overnight.

So until I can confirm the numbers with new positions, it is certainly possible that the old 24 position set was simply a favorable test set. They were 24 consecutive positions for black from the same game. I do clear the hash between positions to avoid carry-over search information between positions, but I specifically chose those positions for my original DTS paper (JICCA) as the question I was trying to ask was "What kind of speedup does Cray Blitz get in a real game" rather than "What kind of speedup does Cray Blitz get in random test positions?"

So, maybe those numbers are too high. Although for that set of positions they are what I see and they are reproducible within reasonable bounds. But I am still looking. I don't think there are bugs in the code after all the testing that has been done (this has been almost a year's effort in the rewrite by now). But it is possible that the test data might be misleading. More as I get a new set of positions working.
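
The measurement procedure described above (fixed depth per position, 1 thread versus N threads, geometric mean over the set) can be sketched as follows. The timings are made-up placeholder values, and the 10-second cutoff for discarding positions that finish too quickly is an arbitrary assumption, not a value from this thread.

```python
from math import exp, log

# Hypothetical (1-thread seconds, 20-thread seconds) per position at fixed depth.
timings = [
    (1200.0, 80.0),
    (900.0, 62.0),
    (30.0, 4.0),      # finishes too quickly to be a meaningful SMP test
    (1500.0, 95.0),
]

MIN_PARALLEL_SECONDS = 10.0  # assumed cutoff, tune per test set


def geometric_mean(values):
    """Geometric mean via logs; all values must be positive."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))


def geomean_speedup(pairs, min_parallel_seconds=MIN_PARALLEL_SECONDS):
    """Geometric mean of t1/tn over positions that ran long enough."""
    kept = [(t1, tn) for t1, tn in pairs if tn >= min_parallel_seconds]
    if not kept:
        raise ValueError("no positions left after filtering")
    return geometric_mean(t1 / tn for t1, tn in kept)


print(f"geomean speedup: {geomean_speedup(timings):.2f}")
```

The geometric mean is the usual choice for averaging ratios such as speedups, since one unusually favorable position pulls it up less than it would an arithmetic mean.
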
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Parallel search once more

Post by bob »

cdani wrote: Is it possible to have a copy of this new Crafty?
I can do some tests at 32 cores (on an Amazon server), just for fun.
Of course I will publish the results.
Sure, that would be useful in fact. I am not yet ready to release it to everyone as I don't want two copies of 25.0 scattered around (I made that mistake a few times in the past and it causes massive problems when someone reports a problem, and it is with the old version X vs the new version X.)

But if you want to test it (without releasing it) I'm more than willing. No issues with reporting results at all of course, I just don't want multiple 25.0 versions scattered around.
cetormenter
Posts: 170
Joined: Sun Oct 28, 2012 9:46 pm

Re: Parallel search once more

Post by cetormenter »

Could I get a copy too? I would like to test your claim of
4 is common in phones already with more coming. That's where most of the "sorta-SMP searches" break down. They often work OK for 2, and barely OK for 4, but beyond that, forget about it...
The only data I can find from rating lists is from CCRL, which shows Crafty gaining about 100 Elo going from 1 core to 4 and Nirvana gaining about 90, a difference in performance that is well within the margins of error. I would be fine with a previous version of Crafty, but it seems that all of the download sites on the Crafty chess website are nonfunctional.
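
As a rough illustration of the "within the margins of error" argument, the sketch below compares the 10 Elo gap between the two gains with a combined error margin. The individual margins are hypothetical placeholders (the post does not give them), and combining them in quadrature assumes the four ratings have independent errors, which may not hold for ratings taken from the same list.

```python
from math import sqrt


def combined_margin(*margins: float) -> float:
    """Combine independent error margins in quadrature."""
    return sqrt(sum(m * m for m in margins))


# Elo gains from 1 to 4 cores as quoted above.
crafty_gain, nirvana_gain = 100.0, 90.0
# Hypothetical +/- margins on each of the four list ratings involved.
margins = [15.0, 15.0, 15.0, 15.0]

difference = crafty_gain - nirvana_gain
margin = combined_margin(*margins)
print(f"difference {difference:.0f} Elo, combined margin +/-{margin:.0f} Elo")
print("distinguishable" if abs(difference) > margin else "within the margin of error")
```
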
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Parallel search once more

Post by cdani »

bob wrote:
Sure, that would be useful in fact. I am not yet ready to release it to everyone as I don't want two copies of 25.0 scattered around (I made that mistake a few times in the past and it causes massive problems when someone reports a problem, and it is with the old version X vs the new version X.)

But if you want to test it (without releasing it) I'm more than willing. No issues with reporting results at all of course, I just don't want multiple 25.0 versions scattered around.
No problem. I will not release it. Can you send me a pm or mail with it or a link? Thanks!
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Parallel search once more

Post by cdani »

cdani wrote:
No problem. I will not release it. Can you send me a pm or mail with it or a link? Thanks!
I forgot to mention that I need a Windows compile. I will not be able to do anything with a Linux one :-)
If you don't have one, maybe you could send me the source; I was able to compile the last version. If you would rather not, I will understand. Anyway, thanks!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Parallel search once more

Post by bob »

cetormenter wrote:Could I get a copy too? I would like to test your claim of
4 is common in phones already with more coming. That's where most of the "sorta-SMP searches" break down. They often work OK for 2, and barely OK for 4, but beyond that, forget about it...
The only data I can find from rating lists is from CCRL, which shows Crafty gaining about 100 Elo going from 1 core to 4 and Nirvana gaining about 90, a difference in performance that is well within the margins of error. I would be fine with a previous version of Crafty, but it seems that all of the download sites on the Crafty chess website are nonfunctional.
All the old sources are available at www.cis.uab.edu/hyatt/crafty

Not sure what you meant by "my claim of" followed by the quote. The quote was about the number of cores available on smart phones and the number will increase...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Parallel search once more

Post by bob »

cdani wrote:
No problem. I will not release it. Can you send me a pm or mail with it or a link? Thanks!
To be clear, we are talking about Linux, I assume? Not Windows?
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Parallel search once more

Post by cdani »

bob wrote: To be clear, we are talking about Linux, I assume? Not Windows?
Windows, I think, sorry :-) I do not know much about Linux.
Or the source code, if you don't mind me compiling it.