7-men Syzygy attempt

syzygy
Posts: 4820
Joined: Tue Feb 28, 2012 10:56 pm

Re: 7-men Syzygy attempt

Post by syzygy » Fri Mar 30, 2018 9:00 pm

Ozymandias wrote:Other than the empirical fact derived from my own tests at ultrafast TCs, and the absence of said experience at longer TCs, there's also the intuition that more time generates better chess, at all phases of the game, leading to smaller Elo gaps between players. As with any intuition, I could be wrong.
But that intuition applies exactly in the same way to any improvement of a chess engine. It says nothing about the advantageous effect of TBs.

Let's say that at some fixed TC 6-piece TBs give a boost over 5-piece TBs that is worth as much as X% higher nps.

If we now multiply the TC by 10, my guess is that 6-piece TBs over 5-piece TBs are still worth as much as X% higher nps (and perhaps more), for the same value of X.

(One correction: for DTZ probing at the root this is clearly not true, so I am thinking of the effect of WDL probing in the search tree. If a part of the 20 Elo you saw came from DTZ probing at the root, then it is to be expected that that part melts away as the TC increases.)

noobpwnftw
Posts: 435
Joined: Sun Nov 08, 2015 10:10 pm

Re: 7-man Syzygy attempt.

Post by noobpwnftw » Fri Mar 30, 2018 9:06 pm

Status update:

I have now rearranged the memory of my two machines so that I can build efficiently in parallel, one building 6+1 and the other building 5+2.

However there are a few glitches during the actual building work:

1. One machine has 384 cores. The generator won't build with MAX_THREADS=384 due to address space limitations on static variables. If I use -mcmodel=large, I suspect there will be performance issues from the indirect memory access code that gets generated.

2. The Linux kernel scheduler does not work well with a large number of cores. I saw serious issues while run_threaded tasks were executing: the kernel seemed to be trying to migrate threads to less busy cores, and its internal spin locks took too many cycles. I tried fiddling with the sched_migration_cost parameter, without success. Eventually I had to run ESXi on both machines and work in a virtual machine to get around the problem. ESXi limits the CPU count per VM to 128, so I can only use 1/3 of the cores, yet it is still faster than on bare metal.

Any workarounds may greatly speed up the building process, I think.

syzygy
Posts: 4820
Joined: Tue Feb 28, 2012 10:56 pm

Re: 7-man Syzygy attempt.

Post by syzygy » Fri Mar 30, 2018 9:11 pm

noobpwnftw wrote:1. One machine has 384 cores, the generator won't build with MAX_THREADS=384 due to address space limitations of static variables, if I use -mcmodel=large I suspect that there will be performance issues with indirect memory access code being generated.
I can look into dynamically allocating those arrays. That would also remove the need to set a compile-time maximum on the number of threads.
2. Linux kernel scheduler does not work well with large number of cores, I saw serious issues during the execution of run_threaded tasks, it seems that the kernel was trying to migrate threads among less busy cores and its internal spin locks took too much cycles. I tried fiddling with sched_migration_cost parameter but with no success, eventually I have to run ESXi on both machines and work in a virtual machine to get around the problem, however ESXi limits maximum CPU count per VM to 128 so only 1/3 of CPU utilization, yet still faster than bare metal.
I'm not sure what is happening here, but maybe it would help if the generator set thread affinities?
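
To illustrate the dynamic-allocation idea from point 1, here is a minimal C sketch. It is not the generator's actual code: `struct thread_data`, its fields, `thread_table` and `init_thread_data` are hypothetical names chosen for illustration. The only point is that sizing per-thread state at startup avoids a large static array and removes the need for a compile-time MAX_THREADS.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-thread work record; the real generator's fields differ. */
struct thread_data {
    uint64_t begin;   /* first index of this thread's slice */
    uint64_t end;     /* one past the last index of the slice */
    int      thread;  /* thread number, 0..n-1 */
};

/* Before: static struct thread_data thread_table[MAX_THREADS]; */
static struct thread_data *thread_table = NULL;
static int num_threads = 0;

/* Allocate one record per thread and split [0, total) into n contiguous
 * slices whose sizes differ by at most one. Returns 0 on success, -1 if
 * the allocation fails. */
int init_thread_data(int n, uint64_t total)
{
    assert(n > 0);
    struct thread_data *td = calloc((size_t)n, sizeof *td);
    if (td == NULL)
        return -1;

    uint64_t chunk = total / (uint64_t)n;
    uint64_t rem = total % (uint64_t)n;
    uint64_t pos = 0;
    for (int i = 0; i < n; i++) {
        td[i].thread = i;
        td[i].begin = pos;
        pos += chunk + ((uint64_t)i < rem ? 1 : 0);
        td[i].end = pos;
    }

    free(thread_table);        /* drop any earlier table */
    thread_table = td;
    num_threads = n;
    return 0;
}

/* Release the table, e.g. at program exit. */
void free_thread_data(void)
{
    free(thread_table);
    thread_table = NULL;
    num_threads = 0;
}
```

With this shape the thread count can come from a command-line option, for example `init_thread_data(384, table_size)`, instead of being fixed when the program is compiled.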

noobpwnftw
Posts: 435
Joined: Sun Nov 08, 2015 10:10 pm

Re: 7-man Syzygy attempt.

Post by noobpwnftw » Fri Mar 30, 2018 9:46 pm

For #1, I think it would be nice to make them dynamically allocated.

For #2, I'm not sure whether it will work. On bare-metal Linux everything actually runs slower for some unknown reason; it took me a couple of days just to get Linux to boot properly on this 384-core machine. :(

syzygy
Posts: 4820
Joined: Tue Feb 28, 2012 10:56 pm

Re: 7-man Syzygy attempt.

Post by syzygy » Fri Mar 30, 2018 10:00 pm

noobpwnftw wrote:For #1 I think this is nice to make them dynamic allocated.
Done.
For #2 I'm not sure if it will work or not, on a bare metal Linux it actually does everything slower for unknown reason, it took me a couple of days to get a Linux properly boot on this 384-core machine. :(
Strange. The Linux kernel mailing list might like to hear from you :)

Still it might be beneficial to set thread affinities on machines with many threads, so I will make it an option.

syzygy
Posts: 4820
Joined: Tue Feb 28, 2012 10:56 pm

Re: 7-man Syzygy attempt.

Post by syzygy » Fri Mar 30, 2018 10:24 pm

Setting of thread affinities added on Linux. Invoke with -a.
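
For readers unfamiliar with thread affinities on Linux, the following is a minimal sketch of what pinning a thread to a core can look like. It is an assumption about the general technique, not the code behind the generator's -a option; `pin_self_to_cpu` and `allowed_cpu_count` are hypothetical helper names. It uses the glibc wrappers `sched_setaffinity` and `sched_getaffinity` with pid 0, which on Linux applies to the calling thread.

```c
#define _GNU_SOURCE            /* needed for cpu_set_t and the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (errno is set by the kernel call).
 * A fixed-size cpu_set_t covers CPU numbers below CPU_SETSIZE. */
int pin_self_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}

/* Number of CPUs the calling thread may currently run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

A worker thread would typically call something like `pin_self_to_cpu(thread_number)` as its first action. Whether pinning helps depends on the machine: it stops the scheduler from migrating the thread, but it also fixes which NUMA node the thread runs on.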

Ozymandias
Posts: 1243
Joined: Sun Oct 25, 2009 12:30 am

Re: 7-men Syzygy attempt

Post by Ozymandias » Sat Mar 31, 2018 8:58 am

syzygy wrote:
Ozymandias wrote:Other than the empirical fact derived from my own tests at ultrafast TCs, and the absence of said experience at longer TCs, there's also the intuition that more time generates better chess, at all phases of the game, leading to smaller Elo gaps between players. As with any intuition, I could be wrong.
But that intuition applies exactly in the same way to any improvement of a chess engine. It says nothing about the advantageous effect of TBs.

Let's say that at some fixed TC 6-piece TBs give a boost over 5-piece TBs that is worth as much as X% higher nps.

If we now multiply the TC by 10, my guess is that 6-piece TBs over 5-piece TBs are still worth as much as X% higher nps (and perhaps more), for the same value of X.

(One correction: for DTZ probing at the root this is clearly not true, so I am thinking of the effect of WDL probing in the search tree. If a part of the 20 Elo you saw came from DTZ probing at the root, then it is to be expected that that part melts away as the TC increases.)
Well, it says something about the advantageous effect of TBs, in the sense that they represent an improvement for a chess engine. In my tests, it may go beyond that at very fast TCs, because I use fixed depth. Letting the engine allocate its own time is not the same as telling it to search to a given depth. Normally, engines don't save that much time for endgames, and yet it seems to be more than enough. With fixed depth, the time used for endgame positions will be much lower than the time used in the middle game, simply because of the complexity of the positions. In this context, it looks like TBs help more. (Another note: I always drop probe depth to the minimum, which surely increases the impact on performance.)

No TBs are probed at the root; I let cutechess adjudicate as soon as a 6-men position is on the board.

Nordlandia
Posts: 2690
Joined: Fri Sep 25, 2015 7:38 pm
Location: Sortland, Norway

Re: 7-men Syzygy attempt

Post by Nordlandia » Sat Mar 31, 2018 9:22 am

Juan Molina: I probe 5-men during search plus 6-man adjudication. Over the last few days I've tried to determine how much strength 6-man adjudication adds when probing 5-men.

Does 6-men adjudication help a 5-men engine if the opponent probes 6-men during search?

noobpwnftw
Posts: 435
Joined: Sun Nov 08, 2015 10:10 pm

Re: 7-man Syzygy attempt.

Post by noobpwnftw » Sat Mar 31, 2018 2:45 pm

Code:

vm.swappiness=0
kernel.nmi_watchdog=0
kernel.numa_balancing=0
kernel.sched_migration_cost_ns=5000000
kernel.sched_autogroup_enabled=0
Running without thread binding (-a) and with the above tweaks drops the sys% CPU usage shown in top from 60% to 2%. Thread binding has some negative impact on performance due to far-side NUMA accesses.

It looks like the workaround is to stop Linux from migrating memory pages across NUMA nodes, since we use a lot of huge pages that are accessed randomly in general, while still letting it migrate threads among cores gently to reduce the overall rate of far-side NUMA accesses.

It seems to need a lot more fiddling to make it work. For example, under bare-metal Linux even commands like perf and top become unresponsive, whereas inside a virtual machine this does not happen.
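
To check which of the settings listed above are actually in effect on a given machine, a small C sketch like the following can read them back. It relies on the usual Linux convention that a sysctl name such as kernel.numa_balancing maps to the file /proc/sys/kernel/numa_balancing; `sysctl_path` and `read_sysctl` are hypothetical helper names. Some of these entries may not exist on every kernel version, in which case `read_sysctl` simply reports failure.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Convert a dotted sysctl name into its /proc/sys path.
 * Returns 0 on success, -1 if buf is too small. */
int sysctl_path(const char *name, char *buf, size_t len)
{
    assert(name != NULL && buf != NULL);
    int n = snprintf(buf, len, "/proc/sys/%s", name);
    if (n < 0 || (size_t)n >= len)
        return -1;
    /* Replace the dots in the name part with path separators. */
    for (char *p = buf + strlen("/proc/sys/"); *p != '\0'; p++)
        if (*p == '.')
            *p = '/';
    return 0;
}

/* Read an integer-valued sysctl into *value.
 * Returns 0 on success, -1 if the entry is missing or not numeric. */
int read_sysctl(const char *name, long *value)
{
    char path[256];
    if (sysctl_path(name, path, sizeof path) != 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, `read_sysctl("kernel.numa_balancing", &v)` followed by a check that `v` is 0 would confirm that automatic NUMA page migration is switched off before a long build is started.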

Ozymandias
Posts: 1243
Joined: Sun Oct 25, 2009 12:30 am

Re: 7-men Syzygy attempt

Post by Ozymandias » Sat Mar 31, 2018 3:14 pm

Nordlandia wrote:Do 6-men adjudication help 5-men if opponent probe 6-men during search?
I never tested that. If you adjudicate 6-men positions and some of the participants only have access to 5-men TBs, it stands to reason that they will lose fewer points against those that can probe 6-men: you're giving them less of a chance to mess up. This assumes, of course, that the engines in question are close in strength and handle Syzygy reasonably; otherwise it could be the other way around.
