Threads test incl. Stockfish 5 and Komodo 8

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1+(Ncpus-1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up vs. number of threads for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementations. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable SMP performance (linear), will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
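
To make the conversion explicit: the effective speed-up here is taken as 2 raised to the measured Elo gain divided by the assumed Elo value of one time doubling. A minimal sketch, with a made-up Elo gain as the input:

Code:

def effective_speedup(elo_gain, elo_per_doubling=90.0):
    # Effective speed-up implied by an Elo gain, assuming each doubling
    # of thinking time is worth elo_per_doubling Elo at this TC.
    return 2.0 ** (elo_gain / elo_per_doubling)

# Hypothetical example: a 250 Elo gain going from 1 to 16 threads
# implies an effective speed-up of about 6.9x under the 90 Elo assumption.
print(effective_speedup(250.0))
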
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1+(Ncpus-1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up vs. number of threads for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementations. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable SMP performance (linear), will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
And as "approximation" goes, your exceptional linear SMP of Crafty, plotted here, and log scaling of Komodo and Stockfish, fitted well and plotted here too, gives something like that:

Image

1/ As "approximation" goes, on a similar box to Andreas, with 64 cores, Effective Speed-Up of Crafty is of order 45 or 10?

2/ Why don't you enter WCCC on such a box, with such an exceptional linear scaling of Crafty, compared to log scaling of top engines? Crafty seems to have good chances.
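
For concreteness, a small sketch of the two scaling shapes: your linear 1+(Ncpus-1)*0.7 next to a generic logarithmic curve of the kind fitted above. The coefficient in the log curve is only a placeholder, not the actual fitted value.

Code:

import math

def crafty_linear(n_cpus):
    # The linear estimate quoted above: 1 + (Ncpus - 1) * 0.7
    return 1.0 + (n_cpus - 1) * 0.7

def log_scaling(n_cpus, a=2.5):
    # Generic logarithmic scaling; 'a' is a placeholder coefficient,
    # not the value fitted to the Komodo/Stockfish data.
    return 1.0 + a * math.log2(n_cpus)

for n in (8, 16, 32, 64):
    print(n, round(crafty_linear(n), 1), round(log_scaling(n), 1))
# crafty_linear(64) = 45.1, which is where "of order 45" comes from,
# while a log-shaped fit gives a much smaller number.
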
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1+(Ncpus-1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up vs. number of threads for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementations. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable SMP performance (linear), will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
Always "I assume" or "I think" or "I believe" or "I heard" or "I saw somewhere" and such.

Notice my formula is right with Komodo through 8. Highly inaccurate. That it diverges a bit at 16 is cause for a major panic? When I have specifically stated that it is really well-tested only through 16? Where I have specifically stated that it is also architecturally dependent since multi-core chips have a bottleneck that single-core multi-chip machines do not. NO formula will predict SMP performance with two decimal place accuracy across all existing platforms using Intel/AMD processors. Nobody in their right mind would expect them to do so.

Those who actually know how to measure speedup would do the following, something you have not done and probably were not aware of needing to do:

1. Measure NPS with one thread, nothing else running.

2. If you have N cores, run N instances of a single-threaded search and measure the NPS. If you add the N NPS values together, it might or might not equal N times the NPS from step 1.

Now you know the hardware scaling limit. Assume a REALLY good hardware implementation so that in step 2 you really do get N times the NPS from step 1, which means raw speed scales perfectly.

3. Run a set of positions to fixed depth using 1 and N threads. For the N thread version, run them several times and average. Divide 1 thread total time by N thread total time. You have an actual SMP speedup. The JICCA DTS article gives this data for Cray Blitz for 1, 2, 4, 8 and 16 processors on a good architecture. You can find the numbers, or I can post them later. There was a thread, started by Vincent when he was not happy with the speedup he got on a supercomputer he was going to use for one of the WCCC events, and he was complaining that my numbers were grossly exaggerated. I ran a bunch of positions at the AMD development lab, using 1,2,4 and 8 processors (no multi-core) and gave the data to Fierz. He discovered that my formula was a low estimate and that the actual numbers were a bit better.
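
In numbers, with made-up figures purely to show the arithmetic:

Code:

# Hypothetical measurements for illustration only.
nps_one_thread = 2_000_000      # step 1: NPS, one thread, nothing else running
nps_n_instances = 7_600_000     # step 2: summed NPS of N independent 1-thread searches
n_cores = 4

hardware_scaling = nps_n_instances / nps_one_thread   # ideal would be n_cores

# Step 3: same positions to fixed depth with 1 and N threads,
# N-thread runs repeated several times and averaged.
time_1_thread = 900.0    # total seconds, hypothetical
time_n_threads = 260.0   # total seconds, hypothetical

smp_speedup = time_1_thread / time_n_threads
print(hardware_scaling, smp_speedup)   # 3.8x hardware scaling, ~3.46x SMP speedup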

I don't waste all of my time trying to measure parallel speedup unless I make significant changes to the parallel search. Could it be better or worse today? Quite possibly, since my search (not the parallel part) has changed a lot, with more aggressive LMR, more aggressive null-move pruning, more aggressive forward pruning, singular extensions, and on and on. One day I will take the time to test again and see if the formula needs an update. Until then, it is an ESTIMATE.

However, I consider your plot to be bogus, because you are mixing apples and oranges. My formula is purely SMP speedup. You are using Elo, plus an estimate of Elo per doubling, to compute an estimate of an estimate of what Komodo's parallel speedup would be. There is a real, scientific, accurate, no-guesswork method to compute parallel speedup. You are not using it.

Again, I don't see the point in (a) your broken calculations or (b) your concern with an estimate that actually matches your broken data almost perfectly through 8 processors and diverges later. Does the speedup grow linearly? No, and I have never claimed it did. The estimate is a linear function because one can compute it in one's head to get a quick estimate. Want an accurate number? Spend the time and run the tests. That is how _I_ measure and report speedup. ALWAYS.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Laskos wrote:
Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1+(Ncpus-1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up vs. number of threads for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementations. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable SMP performance (linear), will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
And as "approximation" goes, your exceptional linear SMP of Crafty, plotted here, and log scaling of Komodo and Stockfish, fitted well and plotted here too, gives something like that:

Image

1/ As "approximation" goes, on a similar box to Andreas, with 64 cores, Effective Speed-Up of Crafty is of order 45 or 10?

2/ Why don't you enter WCCC on such a box, with such an exceptional linear scaling of Crafty, compared to log scaling of top engines? Crafty seems to have good chances.
The only 64-core data I have is from an Itanium. Eugene Nalimov ran Crafty on a 64-core Itanium box quite a few years back. The speedup was around 32, which was not so good. That was the point where we both started looking at NUMA issues, because that was the first NUMA box either of us had ever used to run Crafty. We didn't get very far because the box was "going away" pretty soon at the time. We revisited NUMA when we started on the 8-core AMD boxes, which introduced NUMA issues once again.

So 32x faster on 64 cores. Pretty poor, but we didn't do any tuning. When we first started on the 8-core machine at the AMD developer lab, the speedup was down in the 4x range, but NPS was also way off, maybe 5x-6x at best. We eventually fixed that, got the NPS up to 8x and the speedup to above 6x, as the old CCC threads will show.

A speedup of 10 on 64 cores is certainly wrong, as I have run quite a bit on 16 cores and generally get over 10x there.

As far as your plots go, where did you get the 60-core data for Komodo and Stockfish? Surely you are not going to tell me "extrapolated" again? Some machines show a negative improvement as the number of threads goes up, so extrapolation is no good.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by syzygy »

ernest wrote:
syzygy wrote: How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
Not really. The "simulate N threads by a single thread" argument firstly does not apply to this case, and secondly does not take into account the loss of efficiency due to the simulation itself. It's just a plain stupid argument.

The more concrete argument regarding what Komodo is likely doing (extending/reducing differently at split nodes) also does not cut it. The efficiency cost involved in forcing Komodo to behave exactly the same at split nodes as at regular nodes may simply be prohibitive. This is not a bug at all.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Adam Hair »

I think that the differences in draw rates make Zappa look like it performs better at 2, 4, and 8 cores than Stockfish and Komodo. Draw rates are positively correlated with the average strength of the opponents, and higher draw rates contract Elo differences. So the higher strength of Stockfish and Komodo causes the measured Elo differences to be smaller compared to Zappa's.

So I recomputed the rating differences into Wilo differences (wins and losses only) and plotted them:

[Image: Wilo differences vs. number of cores for Zappa, Stockfish and Komodo]
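
A minimal sketch of that recomputation, taking "Wilo" to be the same logistic rating formula applied to wins and losses only (draws discarded), with a made-up score just to show the contraction effect:

Code:

import math

def elo_diff(wins, draws, losses):
    # Ordinary Elo difference: draws count as half a point.
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    return 400.0 * math.log10(score / (1.0 - score))

def wilo_diff(wins, losses):
    # "Wilo" difference: same formula, draws ignored.
    score = wins / (wins + losses)
    return 400.0 * math.log10(score / (1.0 - score))

# Hypothetical match result: +300 =500 -200.
# The high draw rate compresses the Elo difference relative to the Wilo one.
print(round(elo_diff(300, 500, 200)), round(wilo_diff(300, 200)))   # 35 vs 70
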
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1+(Ncpus-1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up vs. number of threads for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementations. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable SMP performance (linear), will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
Always "I assume" or "I think" or "I believe" or "I heard" or "I saw somewhere" and such.

Notice my formula is right with Komodo through 8. Highly inaccurate. That it diverges a bit at 16 is cause for a major panic? When I have specifically stated that it is really well-tested only through 16? Where I have specifically stated that it is also architecturally dependent since multi-core chips have a bottleneck that single-core multi-chip machines do not. NO formula will predict SMP performance with two decimal place accuracy across all existing platforms using Intel/AMD processors. Nobody in their right mind would expect them to do so.

Those who actually know how to measure speedup would do the following, something you have not done and probably were not aware of needing to do:

1. Measure NPS with one thread, nothing else running.

2. If you have N cores, run N instances of a single-threaded search and measure the NPS. If you add the N NPS values together, it might or might not equal N times the NPS from step 1.

Now you know the hardware scaling limit. Assume a REALLY good hardware implementation so that in step 2 you really do get N times the NPS from step 1, which means raw speed scales perfectly.

3. Run a set of positions to fixed depth using 1 and N threads. For the N thread version, run them several times and average. Divide 1 thread total time by N thread total time. You have an actual SMP speedup. The JICCA DTS article gives this data for Cray Blitz for 1, 2, 4, 8 and 16 processors on a good architecture. You can find the numbers, or I can post them later. There was a thread, started by Vincent when he was not happy with the speedup he got on a supercomputer he was going to use for one of the WCCC events, and he was complaining that my numbers were grossly exaggerated. I ran a bunch of positions at the AMD development lab, using 1,2,4 and 8 processors (no multi-core) and gave the data to Fierz. He discovered that my formula was a low estimate and that the actual numbers were a bit better.

I don't waste all of my time trying to measure parallel speedup unless I make significant changes to the parallel search. Could it be better or worse today? Quite possibly, since my search (not the parallel part) has changed a lot, with more aggressive LMR, more aggressive null-move pruning, more aggressive forward pruning, singular extensions, and on and on. One day I will take the time to test again and see if the formula needs an update. Until then, it is an ESTIMATE.

However, I consider your plot to be bogus, because you are mixing apples and oranges. My formula is purely SMP speedup. You are using Elo, plus an estimate of Elo per doubling, to compute an estimate of an estimate of what Komodo's parallel speedup would be. There is a real, scientific, accurate, no-guesswork method to compute parallel speedup. You are not using it.

Again, I don't see the point in (a) your broken calculations or (b) your concern with an estimate that actually matches your broken data almost perfectly through 8 processors and diverges later. Does the speedup grow linearly? No, and I have never claimed it did. The estimate is a linear function because one can compute it in one's head to get a quick estimate. Want an accurate number? Spend the time and run the tests. That is how _I_ measure and report speedup. ALWAYS.
On 1) and 3), there are Andreas' results:
http://www.talkchess.com/forum/viewtopic.php?p=570955
and Miguel's and mine:
http://www.talkchess.com/forum/viewtopic.php?p=527702

As for my data being bogus because I am using Elo: it is completely non-bogus if you accept that a doubling of time at that TC means 90-100 Elo points; with 80 the speed-up would go superlinear, and a higher value would contradict empirical data. It is also consistent with other empirical data. And it doesn't matter much anyway: Crafty will still be on top by far at 16 threads. The question is: is the SMP implementation in Crafty really so much superior?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1+(Ncpus-1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up vs. number of threads for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementations. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable SMP performance (linear), will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
Always "I assume" or "I think" or "I believe" or "I heard" or "I saw somewhere" and such.

Notice my formula is right with Komodo through 8. Highly inaccurate. That it diverges a bit at 16 is cause for a major panic? When I have specifically stated that it is really well-tested only through 16? Where I have specifically stated that it is also architecturally dependent since multi-core chips have a bottleneck that single-core multi-chip machines do not. NO formula will predict SMP performance with two decimal place accuracy across all existing platforms using Intel/AMD processors. Nobody in their right mind would expect them to do so.

Those who actually know how to measure speedup would do the following, something you have not done and probably were not aware of needing to do:

1. Measure NPS with one thread, nothing else running.

2. If you have N cores, run N instances of a single-threaded search and measure the NPS. If you add the N NPS values together, it might or might not equal N times the NPS from step 1.

Now you know the hardware scaling limit. Assume a REALLY good hardware implementation so that in step 2 you really do get N times the NPS from step 1, which means raw speed scales perfectly.

3. Run a set of positions to fixed depth using 1 and N threads. For the N thread version, run them several times and average. Divide 1 thread total time by N thread total time. You have an actual SMP speedup. The JICCA DTS article gives this data for Cray Blitz for 1, 2, 4, 8 and 16 processors on a good architecture. You can find the numbers, or I can post them later. There was a thread, started by Vincent when he was not happy with the speedup he got on a supercomputer he was going to use for one of the WCCC events, and he was complaining that my numbers were grossly exaggerated. I ran a bunch of positions at the AMD development lab, using 1,2,4 and 8 processors (no multi-core) and gave the data to Fierz. He discovered that my formula was a low estimate and that the actual numbers were a bit better.

I don't waste all of my time trying to measure parallel speedup unless I make significant changes to the parallel search. Could it be better or worse today? Quite possibly, since my search (not the parallel part) has changed a lot, with more aggressive LMR, more aggressive null-move pruning, more aggressive forward pruning, singular extensions, and on and on. One day I will take the time to test again and see if the formula needs an update. Until then, it is an ESTIMATE.

However, I consider your plot to be bogus, because you are mixing apples and oranges. My formula is purely SMP speedup. You are using Elo, plus an estimate of Elo per doubling, to compute an estimate of an estimate of what Komodo's parallel speedup would be. There is a real, scientific, accurate, no-guesswork method to compute parallel speedup. You are not using it.

Again, I don't see the point in (a) your broken calculations or (b) your concern with an estimate that actually matches your broken data almost perfectly through 8 processors and diverges later. Does the speedup grow linearly? No, and I have never claimed it did. The estimate is a linear function because one can compute it in one's head to get a quick estimate. Want an accurate number? Spend the time and run the tests. That is how _I_ measure and report speedup. ALWAYS.
On 1) and 3), there are Andreas' results:
http://www.talkchess.com/forum/viewtopic.php?p=570955

As for my data being bogus because I am using Elo: it is completely non-bogus if you accept that a doubling of time at that TC means 90-100 Elo points; with 80 the speed-up would go superlinear. It is also consistent with other empirical data. And it doesn't matter much anyway: Crafty will still be on top by far at 16 threads. The question is: is the SMP implementation in Crafty really so much superior?


Do you believe that a doubling in time is worth 90-100 Elo for ALL programs? In fact, that seems to be quite a bit higher than the number most have been using for years, namely 50-70.

As far as SMP implementation goes, I do not know whether what I am doing is better or worse than anyone else's. I don't have the time nor the interest to investigate that. I don't even investigate MY SMP performance that often, although I did some of late due to the new stuff I changed. All I can report is what I have in terms of data. I won't begin to answer as to what others do. Komodo is closed source. Stockfish is open source and seems to follow what I would call "the Crafty model" (as it was defined several years ago) overall, but I have not looked at it in detail; it takes a lot of time to really study a parallel search because there are a zillion details that are very important yet very subtle to spot.

When I have time, I will run Crafty on 1-12 cores, since I have a cluster of such nodes to speed the test up. The killer is time: you need decent searches with 12 cores, which turn into LONG searches with 1 core. I need to reconfigure my normal testing approach to work more efficiently with that problem.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Adam Hair wrote: I think that the differences in draw rates make Zappa look like it performs better at 2, 4, and 8 cores than Stockfish and Komodo. Draw rates are positively correlated with the average strength of the opponents, and higher draw rates contract Elo differences. So the higher strength of Stockfish and Komodo causes the measured Elo differences to be smaller compared to Zappa's.

So I recomputed the rating differences into Wilo differences (wins and losses only) and plotted them:

[Image: Wilo differences vs. number of cores for Zappa, Stockfish and Komodo]
Is it REALLY true that Stockfish gets nothing going from 8 to 16? That defies any logic. I could see that going from 256 to 512, but 8 to 16? I wonder if anyone bothered tuning Stockfish for that test? I saw the Crafty-ish minimum thread depth and max threads per split point ideas mentioned regarding Stockfish; those become more important as the number of cores increases.

I also have a hard time interpreting the results. +200 Elo for Komodo going from 8 to 16? That is super-linear territory and doesn't make any sense at all.
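
To spell out the arithmetic, using the 90 Elo per doubling figure assumed earlier in this thread:

Code:

elo_gain_8_to_16 = 200.0   # the figure being questioned above
elo_per_doubling = 90.0    # assumption used earlier in the thread

implied_factor = 2.0 ** (elo_gain_8_to_16 / elo_per_doubling)
print(implied_factor)   # ~4.66x speed-up from merely doubling the thread count,
                        # i.e. far past linear (2x) scaling
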
Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Uri Blass »

bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1+(Ncpus-1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up vs. number of threads for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementations. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable SMP performance (linear), will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
Always "I assume" or "I think" or "I believe" or "I heard" or "I saw somewhere" and such.

Notice my formula is right with Komodo through 8. Highly inaccurate. That it diverges a bit at 16 is cause for a major panic? When I have specifically stated that it is really well-tested only through 16? Where I have specifically stated that it is also architecturally dependent since multi-core chips have a bottleneck that single-core multi-chip machines do not. NO formula will predict SMP performance with two decimal place accuracy across all existing platforms using Intel/AMD processors. Nobody in their right mind would expect them to do so.

Those who actually know how to measure speedup would do the following, something you have not done and probably were not aware of needing to do:

1. Measure NPS with one thread, nothing else running.

2. If you have N cores, run N instances of a single-threaded search and measure the NPS. If you add the N NPS values together, it might or might not equal N times the NPS from step 1.

Now you know the hardware scaling limit. Assume a REALLY good hardware implementation so that in step 2 you really do get N times the NPS from step 1, which means raw speed scales perfectly.

3. Run a set of positions to fixed depth using 1 and N threads. For the N thread version, run them several times and average. Divide 1 thread total time by N thread total time. You have an actual SMP speedup. The JICCA DTS article gives this data for Cray Blitz for 1, 2, 4, 8 and 16 processors on a good architecture. You can find the numbers, or I can post them later. There was a thread, started by Vincent when he was not happy with the speedup he got on a supercomputer he was going to use for one of the WCCC events, and he was complaining that my numbers were grossly exaggerated. I ran a bunch of positions at the AMD development lab, using 1,2,4 and 8 processors (no multi-core) and gave the data to Fierz. He discovered that my formula was a low estimate and that the actual numbers were a bit better.

I don't waste all of my time trying to measure parallel speedup unless I make significant changes to the parallel search. Could it be better or worse today? Quite possibly, since my search (not the parallel part) has changed a lot, with more aggressive LMR, more aggressive null-move pruning, more aggressive forward pruning, singular extensions, and on and on. One day I will take the time to test again and see if the formula needs an update. Until then, it is an ESTIMATE.

However, I consider your plot to be bogus, because you are mixing apples and oranges. My formula is purely SMP speedup. You are using Elo, plus an estimate of Elo per doubling, to compute an estimate of an estimate of what Komodo's parallel speedup would be. There is a real, scientific, accurate, no-guesswork method to compute parallel speedup. You are not using it.

Again, I don't see the point in (a) your broken calculations or (b) your concern with an estimate that actually matches your broken data almost perfectly through 8 processors and diverges later. Does the speedup grow linearly? No, and I have never claimed it did. The estimate is a linear function because one can compute it in one's head to get a quick estimate. Want an accurate number? Spend the time and run the tests. That is how _I_ measure and report speedup. ALWAYS.
On 1) and 3), there are Andreas' results:
http://www.talkchess.com/forum/viewtopic.php?p=570955

As for my data being bogus because I am using Elo: it is completely non-bogus if you accept that a doubling of time at that TC means 90-100 Elo points; with 80 the speed-up would go superlinear. It is also consistent with other empirical data. And it doesn't matter much anyway: Crafty will still be on top by far at 16 threads. The question is: is the SMP implementation in Crafty really so much superior?


Do you believe that a doubling in time is worth 90-100 Elo for ALL programs? In fact, that seems to be quite a bit higher than the number most have been using for years, namely 50-70.

As far as SMP implementation goes, I do not know whether what I am doing is better or worse than anyone else's. I don't have the time nor the interest to investigate that. I don't even investigate MY SMP performance that often, although I did some of late due to the new stuff I changed. All I can report is what I have in terms of data. I won't begin to answer as to what others do. Komodo is closed source. Stockfish is open source and seems to follow what I would call "the Crafty model" (as it was defined several years ago) overall, but I have not looked at it in detail; it takes a lot of time to really study a parallel search because there are a zillion details that are very important yet very subtle to spot.

When I have time, I will run Crafty on 1-12 cores, since I have a cluster of such nodes to speed the test up. The killer is time: you need decent searches with 12 cores, which turn into LONG searches with 1 core. I need to reconfigure my normal testing approach to work more efficiently with that problem.
I believe that the gain from doubling is higher at faster time controls, so it is clearly possible to get 90-100 Elo per doubling at a bullet time control (30 seconds per game versus 1 minute per game) while getting only 50 Elo per doubling at a time control that is 200 times slower.
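
To illustrate how sensitive the implied speed-up is to that assumption, here is the same made-up Elo gain converted with different Elo-per-doubling values spanning the 50-100 range discussed above:

Code:

# Same hypothetical measured gain, different assumptions about the value of a doubling.
elo_gain_16_threads = 270.0   # placeholder, not a measured number

for elo_per_doubling in (50, 70, 90, 100):
    speedup = 2.0 ** (elo_gain_16_threads / elo_per_doubling)
    print(elo_per_doubling, round(speedup, 1))
# With 50 Elo/doubling the implied speed-up (~42x) would be far super-linear
# on 16 threads, while with 90-100 it stays well below 16x -- which is why
# the assumed value matters so much.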