Scaling of Asmfish with large thread count

Dann Corbit
Posts: 10122
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Scaling of Asmfish with large thread count

Post by Dann Corbit » Fri Oct 07, 2016 7:21 pm

It is absurdly better than any alternative.
http://www.ipmanchess.yolasite.com/test ... hreads.php

I wonder why.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

bob
Posts: 20559
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Scaling of Asmfish with large thread count

Post by bob » Fri Oct 07, 2016 8:38 pm

Dann Corbit wrote:It is absurdly better than any alternative.
http://www.ipmanchess.yolasite.com/test ... hreads.php

I wonder why.
I don't follow the question. Lazy SMP is SUPPOSED to scale well in terms of raw NPS, since there is so little interaction between threads. But NPS is only part of the question. I.e., one can get perfect NPS scaling by just running N copies of the same program, but it won't play any stronger.

The term "scaling" generally applies to performance, i.e. time to depth, which is the most direct way of measuring performance in a chess engine. I also notice that they even reported a 103% NPS scaling, which rings alarm bells for me. Tough to imagine how doubling cores would more than double NPS.

And more importantly, I notice that the trees are growing by a factor of 4+ from 8 to 72 cores which certainly establishes an upper bound on useful speedup, due to search overhead, which is what limits everyone regardless of raw NPS. This is much akin to trying to maximize your vehicle RPM by changing the final drive ratio. Tach reads higher but you won't be going near as fast in reality, which is not what you really want.
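
To put rough numbers on that argument, here is a minimal Python sketch. It assumes time to depth is roughly nodes searched divided by NPS; the function names and the sample figures are illustrative assumptions, not measurements taken from the linked page.

```python
def nps_scaling(nps_base, nps_new, cores_base, cores_new):
    """Fraction of ideal (linear) NPS scaling achieved between two core counts."""
    ideal = cores_new / cores_base
    return (nps_new / nps_base) / ideal


def effective_speedup(nps_ratio, tree_growth):
    """Time-to-depth speedup, assuming time = nodes / NPS.

    nps_ratio   -- NPS at the higher core count divided by NPS at the lower one
    tree_growth -- nodes needed to reach the same depth, expressed as the same ratio
    """
    return nps_ratio / tree_growth


# Hypothetical case: perfect 9x NPS scaling from 8 to 72 cores,
# but the tree searched to the same depth is 4x larger.
print(effective_speedup(nps_ratio=72 / 8, tree_growth=4.0))  # 2.25
```

Under these assumptions, even perfect NPS scaling from 8 to 72 cores would buy only about a 2.25x time-to-depth speedup once the tree is four times larger.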

Also, as an aside, asmFish is significantly faster than the C++/C versions as well, which would be expected. It might be that part of that asm optimizing is reducing memory/cache conflicts somewhere.

Dann Corbit
Posts: 10122
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Scaling of Asmfish with large thread count

Post by Dann Corbit » Fri Oct 07, 2016 9:50 pm

Every engine on that page is a version of SF which all use lazy SMP.
But the C++ (Stockfish) and C (Cfish) versions have a 50% NPS loss at high core count compared to ASMFish.

That is the thing I find both astounding and puzzling.

The conclusion of the page author is NUMA awareness.

zullil
Posts: 5668
Joined: Mon Jan 08, 2007 11:31 pm
Location: PA USA
Full name: Louis Zulli

Re: Scaling of Asmfish with large thread count

Post by zullil » Fri Oct 07, 2016 11:15 pm

Dann Corbit wrote:
The conclusion of the page author is NUMA awareness.
Yes, it's NUMA-awareness, but only because his system is running Windows, which is total crap with more than 64 threads unless NUMA is used.

APassionForCriminalJustic
Posts: 415
Joined: Sat May 24, 2014 7:16 am

Re: Scaling of Asmfish with large thread count

Post by APassionForCriminalJustic » Fri Oct 07, 2016 11:40 pm

Dann Corbit wrote:
The conclusion of the page author is NUMA awareness.
Cfish is also NUMA-aware. AsmFish is extremely fast, though; it's definitely the only Stockfish that I use on my server rig, which has 36 cores and 72 threads. I also keep hyperthreading enabled and use all cores, since the NPS gain is close to 35 percent. People can say what they want, but speed matters. It's clear that asmFish is one heck of an engine.

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Scaling of Asmfish with large thread count

Post by mcostalba » Sat Oct 08, 2016 5:03 am

Dann Corbit wrote:It is absurdly better than any alternative.
http://www.ipmanchess.yolasite.com/test ... hreads.php

I wonder why.
Because of this

setoption name threads value 72

bob
Posts: 20559
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Scaling of Asmfish with large thread count

Post by bob » Sat Oct 08, 2016 5:14 am

mcostalba wrote:
Because of this

setoption name threads value 72
Doesn't answer his question. ALL versions had 72 threads. asmfish scales MUCH better than the other versions.

Nay Lin Tun
Posts: 520
Joined: Mon Jan 16, 2012 5:34 am

Re: Scaling of Asmfish with large thread count

Post by Nay Lin Tun » Sat Oct 08, 2016 5:14 am

Will ASM fish be competing in the TCEC superfinal?

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Scaling of Asmfish with large thread count

Post by mcostalba » Sat Oct 08, 2016 5:24 am

bob wrote:
Doesn't answer his question. ALL versions had 72 threads. asmfish scales MUCH better than the other versions.
On Windows, a process is not able to run by default on more than 64 'logical processors' (as Windows calls them).

asmFish works around this limitation by calling some OS-specific system calls that the official Windows docs call the 'NUMA library'. Note that this has nothing to do with NUMA; it is just the name Windows gives to the functions needed to work around this limitation.

In the case of asmFish, NUMA-awareness means using these Windows-specific system calls.
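
As a rough illustration of the limitation described above, the sketch below spreads engine threads across Windows processor groups in plain Python. It makes no system calls: the group sizes and the function name `assign_groups` are assumptions for illustration only, and this is not asmFish's actual code. On a real system the binding itself would go through Win32 calls such as `SetThreadGroupAffinity`.

```python
def assign_groups(n_threads, group_sizes):
    """Map each thread index to a processor group index.

    Groups are filled in order, one thread per logical processor;
    threads beyond the total capacity wrap around round-robin.
    """
    # One slot per logical processor, labelled with its group index.
    slots = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    return [slots[i % len(slots)] for i in range(n_threads)]


# Hypothetical machine: 72 logical processors exposed as two groups of 36.
groups = assign_groups(72, [36, 36])
print(groups.count(0), groups.count(1))  # 36 36

# If every thread stayed in group 0 (the default behaviour described above),
# 72 threads would compete for only 36 logical processors.
stuck = [0] * 72
print(stuck.count(0))  # 72
```

With the threads split across both groups, each logical processor gets one thread; left in a single group, the same 72 threads would be packed onto half the machine.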

As for TCEC, if the hardware for the superfinal has more than 64 logical processors, then we will have to use this Windows library too, to avoid a noticeable slowdown.

Below 64 logical processors, we have measured and tested on fishtest that the difference is zero within an error margin of about 5-8 Elo, which is the resolution we achieved with our test.

Leo
Posts: 836
Joined: Fri Sep 16, 2016 4:55 pm
Location: USA/Minnesota
Full name: Leo

Re: Scaling of Asmfish with large thread count

Post by Leo » Sat Oct 08, 2016 6:51 am

A good and logical question. I hope so.
Advanced Micro Devices fan.
