7-men Syzygy attempt

syzygy · Post by **syzygy** » Sun Apr 01, 2018 6:55 pm

abulmo2 wrote:You forgot the dtz50 table:

You were not talking about dtz50 tables:

apart from a few trivial cases that a 1-ply search can decipher.

A 1-ply search is insufficient to solve KRvK and KQvK.

Moreover, you cut my quote where the important sentence was the following one:
Of course the same can be said for many 4-men to 8-men tables

Ah duh... Why do you think the 4-, 5- and 6-men tables do not efficiently store the trivial cases?

For example, are the 5-men positions with an isolated king useful?

They are very useful because they cut down the size of the 4v2 tables.

I understand that as a table maker, you want to be exhaustive; but I guess you could agree that an incomplete 7-men table restricted to hard-to-decipher positions is somewhat more useful than a bunch of trivial 3-, 4-, 5- and 6-men positions.

Do you realise that that bunch of trivial positions takes up less space than 0.1% of a non-trivial 7-men table? There is nothing to save here.

Dann Corbit · Post by **Dann Corbit** » Sun Apr 01, 2018 9:06 pm

syzygy wrote:
abulmo2 wrote:You forgot the dtz50 table:
You were not talking about dtz50 tables:
apart from a few trivial cases that a 1-ply search can decipher.
A 1-ply search is insufficient to solve KRvK and KQvK.

Moreover, you cut my quote where the important sentence was the following one:
Of course the same can be said for many 4-men to 8-men tables
Ah duh... Why do you think the 4-, 5- and 6-men tables do not efficiently store the trivial cases?

For example, are the 5-men positions with an isolated king useful?
They are very useful because they cut down the size of the 4v2 tables.

I understand that as a table maker, you want to be exhaustive; but I guess you could agree that an incomplete 7-men table restricted to hard-to-decipher positions is somewhat more useful than a bunch of trivial 3-, 4-, 5- and 6-men positions.
Do you realise that that bunch of trivial positions takes up less space than 0.1% of a non-trivial 7-men table? There is nothing to save here.

I want the "absurd" tables too.

Besides the interesting statistics, consider the number of drawn positions in
KQQQQQk.
It's huge.
Of course, in a case like that, a tempo is a precious quantity for the disadvantaged side.

Let's look at the down-side:
Very small amount of space consumed by tables that will not be called during game play.
Very small amount of time lost during computation.

And now the positives:
Simplification of computation of other tables.
Mathematical completeness.
Interesting statistics generated.
And when you are done, you can delete any files you don't want.

It seems a no-brainer to me.

noobpwnftw · Post by **noobpwnftw** » Thu Apr 05, 2018 5:22 pm

Some more progress:

All 6+1 pawnless tables are available. The pawnful ones are currently being built and uploaded, as are the 5+2 pawnless ones. So far there has been no occurrence of DTZ overflow; I guess those few cases are in 4+3.

Hint: most of the calculation work runs quite fast, but CAS atomics are used during intermediate memory access, more precisely "lock cmpxchg8b" on random unaligned memory addresses. This is by design and I don't think it can be avoided; I have already done enough fiddling.

This will hit the memory bandwidth limit at about 60-90 threads on Xeon V4 and Skylake-SP processors. With more threads it backfires on the latter, with up to a 10x slowdown affecting all threads, while on the former only the extra threads show high memory access latency. So there is no benefit from having more threads.

Binding does not seem to help, and I guess a hard cap of 64 threads should be there as a reminder to everyone else not to waste more time on this matter.
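The hard cap suggested above could look roughly like the sketch below. This is only an illustration in Python, not code from the generator: the function name `effective_threads` and the constant `MAX_THREADS` are made up for the example, and 64 is simply the figure mentioned in this post.

```python
import os

# Hypothetical cap, taken from the suggestion in this post;
# it is not a value read from the generator's source.
MAX_THREADS = 64


def effective_threads(requested, available=None, cap=MAX_THREADS):
    """Clamp a requested thread count to the cap and to the CPUs available."""
    if requested < 1:
        raise ValueError("requested thread count must be at least 1")
    if available is None:
        # os.cpu_count() may return None, so fall back to a single thread.
        available = os.cpu_count() or 1
    return min(requested, available, cap)
```

On a 240-core machine, `effective_threads(240, available=240)` would give 64, so extra threads are never started in the first place.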

Nordlandia · Post by **Nordlandia** » Thu Apr 05, 2018 6:36 pm

As of now, indications are that WDL will most likely be more or less 10 TB in size. For simple adjudication only WDL is needed. 10 TB HDDs are fairly affordable.

Dann Corbit · Post by **Dann Corbit** » Thu Apr 05, 2018 8:12 pm

Nordlandia wrote:As of now, indications are that WDL will most likely be more or less 10 TB in size. For simple adjudication only WDL is needed. 10 TB HDDs are fairly affordable.

There is a commercial 100 TB SSD drive.
I guess in 5 years a 100 TB SSD drive will be less than $1000.
Anyone who wants will be able to afford storage for the full 7 man files.

Jesse Gersenson · Post by **Jesse Gersenson** » Thu Apr 05, 2018 8:29 pm

noobpwnftw wrote:Some more progress:

All 6+1 pawnless tables are available. The pawnful ones are currently being built and uploaded, as are the 5+2 pawnless ones. So far there has been no occurrence of DTZ overflow; I guess those few cases are in 4+3.

Hint: most of the calculation work runs quite fast, but CAS atomics are used during intermediate memory access, more precisely "lock cmpxchg8b" on random unaligned memory addresses. This is by design and I don't think it can be avoided; I have already done enough fiddling.

This will hit the memory bandwidth limit at about 60-90 threads on Xeon V4 and Skylake-SP processors. With more threads it backfires on the latter, with up to a 10x slowdown affecting all threads, while on the former only the extra threads show high memory access latency. So there is no benefit from having more threads.

Binding does not seem to help, and I guess a hard cap of 64 threads should be there as a reminder to everyone else not to waste more time on this matter.

Was that on bare metal or on the VM? I was going to suggest you request a special build from VMware with the core limit increased.

Nordlandia · Post by **Nordlandia** » Thu Apr 05, 2018 8:57 pm

Dann Corbit wrote:
Nordlandia wrote:As of now, indications are that WDL will most likely be more or less 10 TB in size. For simple adjudication only WDL is needed. 10 TB HDDs are fairly affordable.
There is a commercial 100 TB SSD drive.
I guess in 5 years a 100 TB SSD drive will be less than $1000.
Anyone who wants will be able to afford storage for the full 7 man files.

The 7-man files are around the corner. Storing them on SSD is extremely expensive. Alternatively, using an HDD for adjudication in cutechess is less expensive. In the long run, as you're implying, more and more people will be able to afford SSD storage for raw analysis. An HDD simply imposes a severe bottleneck, so its only practical use is adjudication.

noobpwnftw · Post by **noobpwnftw** » Thu Apr 05, 2018 9:22 pm

Jesse Gersenson wrote: Was that on bare metal or on the VM? I was going to suggest you request a special build from VMware with the core limit increased.

On bare metal.

In a VM the problem is still there but less obvious, due to fewer running threads and the way the hypervisor schedules things. I tested it on an evaluation version of the latest Hyper-V VM, which supports a maximum of 240 cores, and it backfired similarly. So it would probably be the same if I had that special build.

The whole picture is: on older platforms, running too many threads just makes the extra threads run slower; on the latest platform it affects the entire system, and the total combined performance is way lower. This is a disturbing fact that I find hard to believe.

syzygy · Post by **syzygy** » Thu Apr 05, 2018 9:48 pm

noobpwnftw wrote:Hint: most of the calculation work runs quite fast, but CAS atomics are used during intermediate memory access, more precisely "lock cmpxchg8b" on random unaligned memory addresses. This is by design and I don't think it can be avoided; I have already done enough fiddling.

Yes, the locked instructions are necessary to guarantee correctness.

I would expect there to be hardly any contention even with very many threads when generating a 7-piece table, so inserting pause instructions shouldn't make a difference. Without contention, locked instructions don't generate much overhead on modern CPUs. But this might be different on NUMA machines, in particular if the accessed memory location is a remote location. (Even without the lock prefix things might be slowing down, though, if the NUMA interconnect starts to run out of bandwidth.)
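For readers unfamiliar with the pattern, here is a minimal sketch of a compare-and-swap retry loop of the kind being discussed. It is an illustration in Python, not the generator's code: Python has no equivalent of "lock cmpxchg8b", so a `threading.Lock` stands in for the hardware's atomicity guarantee, and the names `Word64Table` and `set_byte` are invented for the example. It also does not model unaligned accesses or NUMA effects.

```python
import threading


class Word64Table:
    """Array of 64-bit words with an emulated compare-and-swap."""

    def __init__(self, num_words):
        self._words = [0] * num_words
        # Stands in for the atomicity that the lock prefix gives in hardware.
        self._guard = threading.Lock()

    def load(self, index):
        return self._words[index]

    def compare_and_swap(self, index, expected, new):
        """Store `new` only if the word still equals `expected`."""
        with self._guard:
            if self._words[index] != expected:
                return False
            self._words[index] = new
            return True


def set_byte(table, word_index, byte_index, value):
    """Update one byte of a shared 64-bit word using a CAS retry loop."""
    shift = 8 * byte_index
    mask = 0xFF << shift
    while True:
        old = table.load(word_index)
        new = (old & ~mask) | ((value & 0xFF) << shift)
        if table.compare_and_swap(word_index, old, new):
            return
        # Another thread changed the word in between: reload and retry.
```

The retry only happens when two threads touch the same word at nearly the same time, which is why little contention is expected when the accesses are spread over a very large table.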

This will hit the memory bandwidth limit at about 60-90 threads on Xeon V4 and Skylake-SP processors. With more threads it backfires on the latter, with up to a 10x slowdown affecting all threads, while on the former only the extra threads show high memory access latency. So there is no benefit from having more threads.

So "Skylake Scalable Performance" is not so scalable?

Binding does not seem to help, and I guess a hard cap of 64 threads should be there as a reminder to everyone else not to waste more time on this matter.

Optimising NUMA memory allocation together with thread binding might still help a bit. I will have a look at it.
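As a rough idea of what combining thread binding with NUMA-aware allocation could involve, the sketch below plans which CPU each worker would be pinned to and which slice of a table would be placed on each node. Everything here is an assumption for illustration: the function names, the node-to-CPU layout passed in, and the even split of the table are hypothetical, and since the accesses are described as random, many of them would still be remote even with such a plan.

```python
def plan_thread_binding(num_threads, node_cpus):
    """Assign threads to (node, cpu) pairs, spreading them round-robin over nodes.

    node_cpus maps a NUMA node id to its list of CPU ids. The layout is an
    assumed input; a real program would query it from the operating system.
    """
    nodes = sorted(node_cpus)
    if not nodes or any(not node_cpus[n] for n in nodes):
        raise ValueError("every NUMA node needs at least one CPU")
    plan = []
    for thread in range(num_threads):
        node = nodes[thread % len(nodes)]
        cpus = node_cpus[node]
        # Walk through the node's CPUs, wrapping around if oversubscribed.
        cpu = cpus[(thread // len(nodes)) % len(cpus)]
        plan.append((thread, node, cpu))
    return plan


def table_slice_for_node(table_size, node, num_nodes):
    """Half-open index range of the table that would live on `node`.

    Assumes nodes are numbered 0 .. num_nodes - 1 and the table is split evenly.
    """
    start = table_size * node // num_nodes
    end = table_size * (node + 1) // num_nodes
    return start, end
```

On Linux, each worker could then pin itself with `os.sched_setaffinity(0, {cpu})`; that call is not available on every platform, so it is left out of the sketch.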

syzygy · Post by **syzygy** » Thu Apr 05, 2018 9:49 pm

Btw, are there differences in scalability between the generator, the permutator and the compressor? It might make sense to use different numbers of threads for these parts...
