7-men Syzygy attempt

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: 7-men Syzygy attempt

Post by syzygy »

Ozymandias wrote:Other than the empirical fact derived from my own tests at ultrafast TCs, and the absence of said experience at longer TCs, there's also the intuition that more time generates better chess, at all phases of the game, leading to smaller Elo gaps between players. As with any intuition, I could be wrong.
But that intuition applies exactly in the same way to any improvement of a chess engine. It says nothing about the advantageous effect of TBs.

Let's say that at some fixed TC 6-piece TBs give a boost over 5-piece TBs that is worth as much as X% higher nps.

If we now multiply the TC by 10, my guess is that 6-piece TBs over 5-piece TBs are still worth as much as X% higher nps (and perhaps more), for the same value of X.

(One correction: for DTZ probing at the root this is clearly not true, so I am thinking of the effect of WDL probing in the search tree. If a part of the 20 Elo you saw came from DTZ probing at the root, then it is to be expected that that part melts away as the TC increases.)
noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: 7-man Syzygy attempt.

Post by noobpwnftw »

Status update:

I have now rearranged the memory of my two machines so they can build efficiently in parallel: one is building 6+1 and the other 5+2.

However, there are a few glitches in the actual building work:

1. One machine has 384 cores, but the generator won't build with MAX_THREADS=384 due to address-space limitations on static variables. If I use -mcmodel=large, I suspect there will be performance issues from the indirect memory-access code it generates.

2. The Linux kernel scheduler does not cope well with a large number of cores. I saw serious issues during the execution of run_threaded tasks: it seems the kernel kept trying to migrate threads to less busy cores, and its internal spin locks took too many cycles. I tried fiddling with the sched_migration_cost parameter, but with no success. Eventually I had to run ESXi on both machines and work inside a virtual machine to get around the problem. However, ESXi limits the maximum CPU count per VM to 128, so I can only use a third of the cores; still, that is faster than bare metal.

Any workarounds would greatly speed up the building process, I think.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: 7-man Syzygy attempt.

Post by syzygy »

noobpwnftw wrote:1. One machine has 384 cores, but the generator won't build with MAX_THREADS=384 due to address-space limitations on static variables. If I use -mcmodel=large, I suspect there will be performance issues from the indirect memory-access code it generates.
I can look into dynamically allocating those arrays. That would also remove the need to set a compile-time maximum on the number of threads.
2. The Linux kernel scheduler does not cope well with a large number of cores. I saw serious issues during the execution of run_threaded tasks: it seems the kernel kept trying to migrate threads to less busy cores, and its internal spin locks took too many cycles. I tried fiddling with the sched_migration_cost parameter, but with no success. Eventually I had to run ESXi on both machines and work inside a virtual machine to get around the problem. However, ESXi limits the maximum CPU count per VM to 128, so only a third of the CPUs can be used; still, that is faster than bare metal.
I'm not sure what is happening here, but maybe it would help if the generator set thread affinities?
noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: 7-man Syzygy attempt.

Post by noobpwnftw »

syzygy wrote:
noobpwnftw wrote:1. One machine has 384 cores, but the generator won't build with MAX_THREADS=384 due to address-space limitations on static variables. If I use -mcmodel=large, I suspect there will be performance issues from the indirect memory-access code it generates.
I can look into dynamically allocating those arrays. That would also remove the need to set a compile-time maximum on the number of threads.
2. The Linux kernel scheduler does not cope well with a large number of cores. I saw serious issues during the execution of run_threaded tasks: it seems the kernel kept trying to migrate threads to less busy cores, and its internal spin locks took too many cycles. I tried fiddling with the sched_migration_cost parameter, but with no success. Eventually I had to run ESXi on both machines and work inside a virtual machine to get around the problem. However, ESXi limits the maximum CPU count per VM to 128, so only a third of the CPUs can be used; still, that is faster than bare metal.
I'm not sure what is happening here, but maybe it would help if the generator set thread affinities?
For #1, it would be nice to make them dynamically allocated.

For #2, I'm not sure whether it will work. On bare-metal Linux everything actually runs slower, for reasons unknown; it took me a couple of days just to get Linux to boot properly on this 384-core machine. :(
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: 7-man Syzygy attempt.

Post by syzygy »

noobpwnftw wrote:For #1, it would be nice to make them dynamically allocated.
Done.
For #2, I'm not sure whether it will work. On bare-metal Linux everything actually runs slower, for reasons unknown; it took me a couple of days just to get Linux to boot properly on this 384-core machine. :(
Strange. The Linux kernel mailing list might like to hear from you :)

Still, it might be beneficial to set thread affinities on machines with many threads, so I will make it an option.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: 7-man Syzygy attempt.

Post by syzygy »

Setting of thread affinities has been added on Linux. Invoke with -a.
User avatar
Ozymandias
Posts: 1532
Joined: Sun Oct 25, 2009 2:30 am

Re: 7-men Syzygy attempt

Post by Ozymandias »

syzygy wrote:
Ozymandias wrote:Other than the empirical fact derived from my own tests at ultrafast TCs, and the absence of said experience at longer TCs, there's also the intuition that more time generates better chess, at all phases of the game, leading to smaller Elo gaps between players. As with any intuition, I could be wrong.
But that intuition applies exactly in the same way to any improvement of a chess engine. It says nothing about the advantageous effect of TBs.

Let's say that at some fixed TC 6-piece TBs give a boost over 5-piece TBs that is worth as much as X% higher nps.

If we now multiply the TC by 10, my guess is that 6-piece TBs over 5-piece TBs are still worth as much as X% higher nps (and perhaps more), for the same value of X.

(One correction: for DTZ probing at the root this is clearly not true, so I am thinking of the effect of WDL probing in the search tree. If a part of the 20 Elo you saw came from DTZ probing at the root, then it is to be expected that that part melts away as the TC increases.)
Well, it says something about the advantageous effect of TBs, in the sense that they represent an improvement to a chess engine. In my tests, the effect may go beyond that at very fast TCs, because I use fixed depth. Letting the engine allocate its own time is not the same as telling it to search to a given depth. Normally, engines don't reserve much time for endgames, and yet it seems to be more than enough. With fixed depth, the time spent on endgame positions will be much lower than on the middlegame, simply because of the complexity of the positions. In this context, it looks like TBs help more. (Another note: I always drop the probe depth to the minimum, which surely increases the impact on performance.)

No TBs are probed at the root; I let cutechess adjudicate as soon as a 6-men position is on the board.
User avatar
Nordlandia
Posts: 2821
Joined: Fri Sep 25, 2015 9:38 pm
Location: Sortland, Norway

Re: 7-men Syzygy attempt

Post by Nordlandia »

Juan Molina: I probe 5-men during search plus 6-men adjudication. For the last few days I've been trying to determine how much strength 6-men adjudication adds when probing 5-men.

Does 6-men adjudication help a 5-men prober if the opponent probes 6-men during search?
noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: 7-man Syzygy attempt.

Post by noobpwnftw »

Code: Select all

vm.swappiness=0
kernel.nmi_watchdog=0
kernel.numa_balancing=0
kernel.sched_migration_cost_ns=5000000
kernel.sched_autogroup_enabled=0
Running without thread binding (-a) and with the above tweaks drops the sys% CPU usage shown in top from 60% to 2%. Thread binding has some negative impact on performance due to far-side NUMA accesses.

It looks like the workaround is not to let Linux migrate memory pages across NUMA nodes (since we use a lot of huge pages, accessed largely at random), but instead to migrate threads among cores gently, reducing the overall rate of far-side NUMA accesses.

It seems to need a lot more fiddling to make it work. For example, under bare-metal Linux even commands like perf and top would become unresponsive, while inside a virtual machine this does not happen.
User avatar
Ozymandias
Posts: 1532
Joined: Sun Oct 25, 2009 2:30 am

Re: 7-men Syzygy attempt

Post by Ozymandias »

Nordlandia wrote:Does 6-men adjudication help a 5-men prober if the opponent probes 6-men during search?
I never tested that, but if you adjudicate a 6-men position and some of the participants only have access to 5-men TBs, it stands to reason that they will lose fewer points against those that can probe 6-men: you're giving them less of a chance to mess up. This is, of course, assuming the engines in question are close in strength and handle Syzygy reasonably; otherwise it could be the other way around.