Scaling of Asmfish with large thread count

Dann Corbit · Post by **Dann Corbit** » Fri Oct 07, 2016 9:21 pm

It is absurdly better than any alternative.
http://www.ipmanchess.yolasite.com/test ... hreads.php

I wonder why.

bob · Post by **bob** » Fri Oct 07, 2016 10:38 pm

Dann Corbit wrote:It is absurdly better than any alternative.
http://www.ipmanchess.yolasite.com/test ... hreads.php

I wonder why.

I don't follow the question. Lazy SMP is SUPPOSED to scale well in terms of raw NPS since there is so little interaction between threads. But NPS is only part of the question. I.e., one can get perfect NPS scaling by just running N copies of the same program, but it won't play any stronger.

The term "scaling" generally applies to performance, i.e., time to depth, which is the most direct way of measuring performance in a chess engine. I also notice that they even reported a 103% NPS scaling, which rings alarm bells for me. Tough to imagine how doubling cores would more than double NPS.

And more importantly, I notice that the trees are growing by a factor of 4+ from 8 to 72 cores, which certainly establishes an upper bound on useful speedup due to search overhead, which is what limits everyone regardless of raw NPS. This is much akin to trying to maximize your vehicle RPM by changing the final drive ratio. Tach reads higher but you won't be going nearly as fast in reality, which is not what you really want.

Also, as an aside, asmFish is significantly faster than the C++/C versions as well, which would be expected. It might be that part of that asm optimizing is reducing memory/cache conflicts somewhere.
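
A rough sketch of the arithmetic behind the search-overhead point above. Only the 8 and 72 core counts and the 4x tree growth come from the post; the function name, the `nps_efficiency` parameter, and the assumption of perfectly linear NPS scaling are illustrative and not taken from the benchmark page.

```python
def speedup_upper_bound(cores_before, cores_after, tree_growth, nps_efficiency=1.0):
    """Rough bound on time-to-depth speedup when adding cores.

    tree_growth:    factor by which the searched tree grows (search overhead).
    nps_efficiency: fraction of ideal NPS scaling achieved; 1.0 (perfect)
                    is an assumption, not a measured value.
    """
    nps_ratio = (cores_after / cores_before) * nps_efficiency
    # More nodes per second only helps to the extent the tree did not grow.
    return nps_ratio / tree_growth


bound = speedup_upper_bound(8, 72, tree_growth=4.0)
print(f"{bound:.2f}x")  # 2.25x
```

So even with perfect NPS scaling, going from 8 to 72 cores (9x the hardware) would buy at most about 2.25x in time to depth under these assumptions, and less if the tree grows by more than 4x.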

Dann Corbit · Post by **Dann Corbit** » Fri Oct 07, 2016 11:50 pm


Every engine on that page is a version of SF, and they all use lazy SMP.
But the C++ (Stockfish) and C (Cfish) versions have a 50% NPS loss at high core count compared to asmFish.

That is the thing I find both astounding and puzzling.

The conclusion of the page author is NUMA awareness.

zullil · Post by **zullil** » Sat Oct 08, 2016 1:15 am

Dann Corbit wrote:The conclusion of the page author is NUMA awareness.

Yes, it's NUMA-awareness, but only because his system is running Windows, which is total crap with more than 64 threads unless NUMA is used.

APassionForCriminalJustic · Sat Oct 08, 2016 1:40 am

Dann Corbit wrote:The conclusion of the page author is NUMA awareness.

Cfish is also NUMA aware. AsmFish is extremely fast though; it's definitely the only Stockfish that I use with my server-rig, 36 cores and 72 threads. I also keep hyperthreading enabled and use all cores since the NPS gain is close to 35 percent. People can say what they want - but speed matters. It's clear that asmFish is one heck of an engine.

mcostalba · Post by **mcostalba** » Sat Oct 08, 2016 7:03 am

Dann Corbit wrote:It is absurdly better than any alternative.
http://www.ipmanchess.yolasite.com/test ... hreads.php

I wonder why.

Because of this

setoption name threads value 72

bob · Post by **bob** » Sat Oct 08, 2016 7:14 am

mcostalba wrote:
Dann Corbit wrote:It is absurdly better than any alternative.
http://www.ipmanchess.yolasite.com/test ... hreads.php

I wonder why.
Because of this
setoption name threads value 72

Doesn't answer his question. ALL versions had 72 threads. asmfish scales MUCH better than the other versions.

Nay Lin Tun · Post by **Nay Lin Tun** » Sat Oct 08, 2016 7:14 am

Will ASM fish be competing in the TCEC superfinal?

mcostalba · Post by **mcostalba** » Sat Oct 08, 2016 7:24 am

bob wrote:
mcostalba wrote:
Dann Corbit wrote:It is absurdly better than any alternative.
http://www.ipmanchess.yolasite.com/test ... hreads.php

I wonder why.
Because of this
setoption name threads value 72
Doesn't answer his question. ALL versions had 72 threads. asmfish scales MUCH better than the other versions.

On Windows, a process cannot by default run on more than 64 'logical processors' (as Windows calls them).

asmFish works around this limitation by calling some OS-specific system calls that the official Windows docs call the 'NUMA library'. Note that this has nothing to do with NUMA; it is just the name Windows gives to the functions needed to work around this limitation.

In the case of asmFish, 'NUMA-awareness' means using these Windows-specific system calls.

As for TCEC, if the hardware for the superfinal has more than 64 logical processors, then we will have to use this Windows library too, to avoid a noticeable slowdown.

Below 64 logical processors, we have measured and tested on fishtest that the difference is zero within an error margin of about 5-8 Elo, which is the resolution we achieved with our test.
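
A small sketch of the 64-logical-processor limit described above, as a simplified model only. The function names are hypothetical, and the model assumes Windows fills each processor group up to the 64-processor maximum; on real hardware the split may differ (for example, groups may be divided along NUMA nodes), so the number of processors a single-group process can reach may be lower.

```python
import math

GROUP_SIZE = 64  # maximum logical processors in one Windows processor group


def min_processor_groups(logical_processors, group_size=GROUP_SIZE):
    """Fewest groups needed to cover all logical processors."""
    return math.ceil(logical_processors / group_size)


def usable_in_one_group(logical_processors, group_size=GROUP_SIZE):
    """Logical processors reachable by a process confined to a single group.

    Simplifying assumption: the process lands in a completely full group.
    """
    return min(logical_processors, group_size)


for n in (32, 64, 72):
    print(n, min_processor_groups(n), usable_in_one_group(n))
# 32 1 32
# 64 1 64
# 72 2 64
```

In this model a 72-thread machine needs two groups, so an engine that never asks for a second group cannot use every logical processor no matter what `threads` is set to, which matches the point that the difference only shows up above 64 logical processors.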

Leo · Post by **Leo** » Sat Oct 08, 2016 8:51 am

A good and logical question. I hope so.
