Time to depth concerns

Werewolf · Post by **Werewolf** » Mon Aug 15, 2016 11:13 am

Recently I've been doing some research into real speedup on many threads. My results are alarming. Perhaps people can comment on them.

I'm running a dual E5-2697 v2 (24 physical cores) with the following settings:
- 64 GB RAM
- HT OFF
- clock speed is ALWAYS 3.00 GHz, measured with CPUZ.
- GUI Fritz 15
- Hash always 4 GB
- Windows 10, 64 bit.
- No other programs running.

1 THREAD
I took Stockfish Dev 10082016 and timed how long it took to reach depth 34 on the start position, running on 1 physical core (1 thread). To make sure Intel's turbo didn't raise the clock-speed I ran Aquarium in the background and had Komodo ticking over on 20 threads. I then checked the clockspeed to make sure it was 3.00 GHz.

RUN 1:
34 / 46 = 2175 seconds
34 / 52 = 2461 seconds

RUN 2:
34 / 46 = 2187 seconds
34 / 52 = 2475 seconds

Nothing surprising here. N/S were around 1.1 MNs. The two runs show what we all know - SF is deterministic on 1 thread and the results were what I expected.

24 THREADS
Next, I repeated the exercise but now I ran SF on 24 cores. No other programs were running. Clock speed was again confirmed at 3.00 GHz and N/S were around 25 million - about what was expected.

RUN 1:
34/49 = 714 s

RUN 2:
34/44 = 1052 s
34/47 = 1194 s

RUN 3:
34/45 = 537 s

RUN 4:
34/45 = 717 s
34/45 = 745 s
34/45 = 782 s

RUN 5:
34/43 = 507 s

RUN 6:
34/51 = 937 s

RUN 7:
34/49 = 671 s

RUN 8:
34/45 = 337 s
34/50 = 557 s

RUN 9:
34/43 = 497 s

RUN 10:
34/45 = 724 s

I realise it would be good to have more data. However, even with this limited sample it is fairly clear that if time-to-depth is a true indication of real speed up: the gain going from 1 thread to 24 is somewhere in the range of 3-4x quicker.

This is shocking, at least to me. I expected more like a 10-fold increase in search speed. Can anyone explain this?

Thanks

Carl

Edsel Apostol · Post by **Edsel Apostol** » Mon Aug 15, 2016 2:13 pm

Lazy SMP algorithm in SF is not conducive for time to depth tests. I think that is normal for Lazy SMP. You can try other engines with Lazy SMP like Andscacs/Texel to see if it has the same behavior. You can also compare it to engines with YBWC like Hannibal or Crafty.

gordonr · Post by **gordonr** » Mon Aug 15, 2016 2:22 pm

I can't explain your figures but in the past I've been interested in how best to conduct "time to depth" testing.

http://www.talkchess.com/forum/viewtopi ... highlight=

I ended up with an automated test framework that: used many various positions; did hundreds of tests runs; and used the median time. I believe this was necessary to make the figures as meaningful as possible.

zullil · Post by **zullil** » Mon Aug 15, 2016 2:22 pm

Edsel Apostol wrote:Lazy SMP algorithm in SF is not conducive for time to depth tests. I think that is normal for Lazy SMP. You can try other engines with Lazy SMP like Andscacs/Texel to see if it has the same behavior. You can also compare it to engines with YBWC like Hannibal or Crafty.

I agree----this seems like typical LazySMP behavior.

Werewolf · Post by **Werewolf** » Mon Aug 15, 2016 5:21 pm

Edsel Apostol wrote:Lazy SMP algorithm in SF is not conducive for time to depth tests. I think that is normal for Lazy SMP.

So are you saying Lazy SMP doesn't scale well with more threads, or that time to depth isn't a good indicator of its speed? I'm guessing the latter.

In which case, how do we measure speed with Lazy SMP?

What about Komodo, is that lazy SMP or YBWC?

Thanks.

APassionForCriminalJustic · Mon Aug 15, 2016 6:32 pm

Werewolf wrote:
Edsel Apostol wrote:Lazy SMP algorithm in SF is not conducive for time to depth tests. I think that is normal for Lazy SMP.
So are you saying Lazy SMP doesn't scale well with more threads, or that time to depth isn't a good indicator of its speed? I'm guessing the latter.

In which case, how do we measure speed with Lazy SMP?

What about Komodo, is that lazy SMP or YBWC?

Thanks.

Unless I am mistaken LAZY-SMP is about the nodes per second, mostly. Nobody knows what implementation Komodo uses as far as I know.

Eelco de Groot · Post by **Eelco de Groot** » Mon Aug 15, 2016 7:23 pm

APassionForCriminalJustic wrote:
Unless I am mistaken LAZY-SMP is about the nodes per second, mostly. Nobody knows what implementation Komodo uses as far as I know.

I don't know much more than you but those statements are both rather far from the truth, really. Back when Komodo did not have a working SMP implementation, unlike other programs from Don in the past, he did say he was working with an simpler form of YBW. Well something to that effect at least. I understood that as a form of Lazy SMP also because that was easier to implement. So it stands to reason that this could still be so if it worked well, and winning TCEC later qualifies as a success.

Has nothing to do with nodes per second. More with iterative deepening. The more threads, the fuzzier exactly what depth they are on, and what I have seen from SF they are all on different depths? But really that is about all I know of the implementation in SF. I just picture it as trying to get the same effect as restarting the search from scratch somewhere midtime, and do this many times over. You profit from the already partially ordered tree and start over. Some threads are at higher depths instead, but do not get the chance to finish in their iteration, but they could profit from a lucky transposition for instance or do some groundbreaking work for later iterations. Also, I think I saw the thread with the best result is chosen at the end in Stockfish, not just the highest completed depth. That has nothing to do with alpha beta anymore

Dann Corbit · Post by **Dann Corbit** » Mon Aug 15, 2016 8:53 pm

LazySMP does not get as deep as YBW, but it plays better, according to tests.

It also seems to perform shockingly well with hyperthreading turned on, at least according to some experiments.

I don't think anyone can tell you the why on either of those.

bob · Post by **bob** » Mon Aug 15, 2016 10:19 pm

Werewolf wrote:
Edsel Apostol wrote:Lazy SMP algorithm in SF is not conducive for time to depth tests. I think that is normal for Lazy SMP.
So are you saying Lazy SMP doesn't scale well with more threads, or that time to depth isn't a good indicator of its speed? I'm guessing the latter.

In which case, how do we measure speed with Lazy SMP?

What about Komodo, is that lazy SMP or YBWC?

Thanks.

For the record, there are two ways to use processing power through multiple cores.

(1) search the SAME tree faster, so that you get deeper into the tree in the same time;

(2) sure MORE of the tree at the same depth. This is almost always a purely accidental side-effect of a search that has excessive overhead. Which describes lazy-amp pretty accurately.

If you prune too much, then widening the tree is of some benefit. But I always point out that you could widen the tree without a parallel search, just as easily, if that is really a goal.

time-do-depth is by far the easiest way to analyze parallel speedup. Of course, playing games would be even better, but the computational cost is astronomical as in normal testing, you run one game per core to get enough games played to drive the error bar to your target. But if you test a parallel search, you use all cores on one game, which on a 20 core machine would make the testing take 20x longer. I don't think Elo will ever be a viable measuring approach. A couple of years ago we had a 70 node cluster with 8 cores per node. Giving me the option of playing 70 games at once, or 560 games at once. I always chose the latter.

APassionForCriminalJustic · Mon Aug 15, 2016 11:49 pm

Eelco de Groot wrote:
APassionForCriminalJustic wrote:
Unless I am mistaken LAZY-SMP is about the nodes per second, mostly. Nobody knows what implementation Komodo uses as far as I know.
I don't know much more than you but those statements are both rather far from the truth, really. Back when Komodo did not have a working SMP implementation, unlike other programs from Don in the past, he did say he was working with an simpler form of YBW. Well something to that effect at least. I understood that as a form of Lazy SMP also because that was easier to implement. So it stands to reason that this could still be so if it worked well, and winning TCEC later qualifies as a success.

Has nothing to do with nodes per second. More with iterative deepening. The more threads, the fuzzier exactly what depth they are on, and what I have seen from SF they are all on different depths? But really that is about all I know of the implementation in SF. I just picture it as trying to get the same effect as restarting the search from scratch somewhere midtime, and do this many times over. You profit from the already partially ordered tree and start over. Some threads are at higher depths instead, but do not get the chance to finish in their iteration, but they could profit from a lucky transposition for instance or do some groundbreaking work for later iterations. Also, I think I saw the thread with the best result is chosen at the end in Stockfish, not just the highest completed depth. That has nothing to do with alpha beta anymore

I'll go with what you say man.

You know way more than me.

Time to depth concerns

Time to depth concerns

Re: Time to depth concerns

Re: Time to depth concerns

Re: Time to depth concerns

Re: Time to depth concerns

Re: Time to depth concerns

Re: Time to depth concerns

Re: Time to depth concerns

Re: Time to depth concerns

Re: Time to depth concerns