Maybe a new use case for an old idea of mine

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Maybe a new use case for an old idea of mine

Post by Michael Sherwin »

Multicore processors (like the TR2990WX) have poor memory access mechanics. In code with large data sets there tends to be more waiting while cores compete for data. My idea for chess engines is to dedicate one thread to the hashtable. The dedicated thread would poll a global data structure for store and retrieval requests and would perform all hashtable accesses. A search thread can request a hashtable entry as early as the end of MakeMove(), so the entry will be waiting in the global data structure when needed. There would then be no need for locks. I do not know how many threads a dedicated hashtable thread could service before it became a bottleneck of its own, but I imagine it would be a lot. I also imagine that as the number of threads increases up to that limit, performance would improve.
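A minimal sketch of the request "mailbox" this scheme would need, assuming one cache-line-aligned slot per search thread. All names here (HashRequest, request_probe, try_get_entry, MAX_THREADS) are illustrative stand-ins, not from any actual engine:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-thread mailbox slots. The dedicated hashtable thread is the
// only one that turns ProbeRequested into Ready, and each search thread only
// touches its own slot, so no locks are needed -- just acquire/release ordering.
enum class ReqState : uint8_t { Empty, ProbeRequested, Ready };

struct alignas(64) HashRequest {            // one cache line per search thread
    std::atomic<ReqState> state{ReqState::Empty};
    uint64_t key = 0;                       // Zobrist key of the position
    uint64_t entry = 0;                     // packed TT entry, filled by the server
};

constexpr int MAX_THREADS = 64;
HashRequest requests[MAX_THREADS];

// Search thread: issue the probe right after MakeMove()...
void request_probe(int tid, uint64_t key) {
    requests[tid].key = key;
    requests[tid].state.store(ReqState::ProbeRequested, std::memory_order_release);
}

// ...and collect the answer later, when the node actually needs it.
bool try_get_entry(int tid, uint64_t &entry) {
    if (requests[tid].state.load(std::memory_order_acquire) != ReqState::Ready)
        return false;                       // server has not answered yet
    entry = requests[tid].entry;
    requests[tid].state.store(ReqState::Empty, std::memory_order_relaxed);
    return true;
}
```

The alignas(64) matters: without it, adjacent slots would share cache lines and the threads would fight over them (false sharing), defeating the point of the scheme.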
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Maybe a new use case for an old idea of mine

Post by mar »

Michael Sherwin wrote: Fri Mar 01, 2019 6:38 pm Multicore processors (like the TR2990WX) have poor memory access mechanics. In code with large data sets there tends to be more waiting while cores compete for data. My idea for chess engines is to dedicate one thread to the hashtable. The dedicated thread would poll a global data structure for store and retrieval requests and would perform all hashtable accesses. A search thread can request a hashtable entry as early as the end of MakeMove(), so the entry will be waiting in the global data structure when needed. There would then be no need for locks. I do not know how many threads a dedicated hashtable thread could service before it became a bottleneck of its own, but I imagine it would be a lot. I also imagine that as the number of threads increases up to that limit, performance would improve.
I don't see how this is a good idea.
So instead of prefetch, which gives very marginal gains, you propose to waste one full logical core and totally kill performance on synchronization.
(reminds me of a guy who fried all cores on a global spinlock while claiming great scaling because of 100% CPU utilization)
You would need 1 extra hashtable worker per search worker (or maybe 1 global hashtable worker? - that would be disastrous)
Have you actually tried and measured it? I bet that this will be much slower in general than the memory access itself (assuming a cache miss across all levels).
I also don't follow what you gain by "scheduling" the TT request after makemove. The first thing you do after that is recurse (and probe the TT), so there isn't much time to process the request.
It seems to me you're trying to invent something like software hyperthreading, except orders of magnitude slower than what the hardware already does.
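For reference, the prefetch approach mentioned above is usually a one-liner issued right after MakeMove(), so the memory fetch overlaps with move-ordering work instead of stalling at the probe. A sketch, with an illustrative table layout (tt_table, tt_mask, and TTEntry are assumptions, not any particular engine's code):

```cpp
#include <cstdint>

struct TTEntry { uint64_t key, data; };

TTEntry *tt_table = nullptr;   // allocated at startup, power-of-two entry count
uint64_t tt_mask = 0;          // number of entries minus one

// Hint the cache hierarchy to start fetching the bucket now; the actual probe
// happens later, at the top of search(), hopefully without a stall.
inline void tt_prefetch(uint64_t key) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(&tt_table[key & tt_mask]);
#endif
}
```

`__builtin_prefetch` is only a hint; it costs almost nothing when it misses its window, which is why the gains are real but marginal.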

Or perhaps this was some kind of a joke and I missed the point :) If that's the case, my apologies.
Martin Sedlak
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Maybe a new use case for an old idea of mine

Post by hgm »

I don't see why that would gain anything. It just complicates things. The caching hardware and memory control unit pretty much already do all that. Emulating it in software will just slow things down.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Maybe a new use case for an old idea of mine

Post by Michael Sherwin »

Yes, it needs to be tested. I did specify a CPU like the TR2990WX. The first 16 cores connect directly to the memory bus; the second 16 cores reach the memory bus through the first 16. That creates a huge memory bottleneck. Crafty, for example, runs no faster on a 32-core TR2990WX than it does on a TR2950X because of that bottleneck. If Crafty had a dedicated thread for the hashtable, that bottleneck with all its wait states would vanish. That is what is called a use case. A use case is a very specific set of circumstances.

It would not be throwing a whole thread away. It would be doing work that the other threads would not have to do. Locks are expensive, maybe costing as much as a 10% performance hit. Don't shoot me, I'm just guessing; I don't remember what the actual figure is. But whatever it is, a simple break-even formula should tell us how many threads would be the break-even point. As far as not having much time after MakeMove(): in my engine I would have a ton of time, but that might depend on the specific engine.
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Maybe a new use case for an old idea of mine

Post by hgm »

You don't need locks for hash tables. And the bottleneck will still be there; it is a hardware bottleneck. It doesn't matter if one thread or many request the data, it still has to pass all through the memory bus. And this has finite bandwidth.

And there still isn't any useful work for a thread to do between knowing what position to look up in the hash, and getting the result that tells it the result is already known.
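For context, the standard lock-free scheme alluded to here is the XOR trick described by Hyatt and Mann: store key ^ data instead of the raw key, so a torn write from a concurrent store fails the validation check instead of being used silently. A sketch (the two-word entry layout is illustrative):

```cpp
#include <atomic>
#include <cstdint>

struct XorEntry {
    std::atomic<uint64_t> keyx{0};   // key ^ data
    std::atomic<uint64_t> data{0};
};

void tt_store(XorEntry &e, uint64_t key, uint64_t data) {
    e.data.store(data, std::memory_order_relaxed);
    e.keyx.store(key ^ data, std::memory_order_relaxed);
}

// Returns false on a miss *or* when another thread's half-finished store left
// the two words inconsistent -- no lock required.
bool tt_probe(const XorEntry &e, uint64_t key, uint64_t &out) {
    uint64_t d = e.data.load(std::memory_order_relaxed);
    uint64_t k = e.keyx.load(std::memory_order_relaxed);
    if ((k ^ d) != key) return false;
    out = d;
    return true;
}
```

The worst case is a rare false miss, which costs a re-search of one node rather than any synchronization.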
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Maybe a new use case for an old idea of mine

Post by Michael Sherwin »

hgm wrote: Fri Mar 01, 2019 8:55 pm You don't need locks for hash tables. And the bottleneck will still be there; it is a hardware bottleneck. It doesn't matter if one thread or many request the data, it still has to pass all through the memory bus. And this has finite bandwidth.

And there still isn't any useful work for a thread to do between knowing what position to look up in the hash, and getting the result that tells it the result is already known.
I do not believe you are understanding the peculiar bottleneck of the TR2990WX. Memory accessed by the second set of 16 cores is extremely slow when all threads access big data sets. In that situation even the first 16 cores suffer from the bottleneck. So you want the big data accesses running on the first 16 cores and as little data as possible accessed on the rest of the cores. The second set of cores does have its own cache; it is only when they are forced to go to main memory that the whole system suffers on the TR2990WX.
Dann Corbit
Posts: 12542
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Maybe a new use case for an old idea of mine

Post by Dann Corbit »

Asmfish does seem to suffer a lot of unusual SMP loss on that chip (ordinarily we see NPS increase linearly with cores, though search depth does not increase linearly):
http://www.ipmanchess.yolasite.com/amd- ... -bench.php

NPS: 77,576,847
Machine: AMD Ryzen Threadripper 2 2990WX @3.4Ghz
Threadcount: 64 threads
Use std, popcount, or bmi build: pop
User name: Monstru

Same machine with 32 threads:
NPS: 58,856,709

So only a 1.31x speedup in NPS with double the threads.

A single AMD 7601 with 64 threads gets 66,596,640 NPS at 2.2 GHz (at equivalent clock speed to 2990WX it would scale to 102,922,080 NPS)
and two of them get 127,342,246 NPS at 2.2 GHz with 128 threads (very nearly linear at high core counts)

The difference is that the 7601 has twice the memory channels. It also costs about 2.22x as much money.
https://www.anandtech.com/show/13124/th ... -review/15
https://www.pcworld.com/article/3298859 ... mance.html

I am very keen to see what the dual-CPU 7 nm next-gen 7601-type systems will do. I think they will have 128-core versions, so two of them would give 256 cores times two threads per core = 512 threads. I am also guessing they will be serious money. :shock:
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Maybe a new use case for an old idea of mine

Post by hgm »

That no doubt counts as a poor design, but how does your scheme solve that? The threads on the 'orphaned' cores must still do hash probes, and these still have to come through the CPU that can communicate with memory. Hardware should always be able to do that faster than software.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Maybe a new use case for an old idea of mine

Post by Michael Sherwin »

hgm wrote: Fri Mar 01, 2019 11:05 pm That no doubt counts as a poor design, but how does your scheme solve that? The threads on the 'orphaned' cores must still do hash probes, and these still have to come through the CPU that can communicate with memory. Hardware should always be able to do that faster than software.

The orphaned cores do not do hash probes or stores in this scheme. Instead they place a request in a rather small global structure array, and it is the dedicated thread that accesses the large hashtable. The dedicated thread loops over (polls) the global request structure and either stores an entry in the hashtable or retrieves one and places it in the global data structure. By the time a search thread is ready for the retrieved entry, it should already be there. Sorry, I do not know how else to word it. The scheme's philosophy is to keep the data requirements of all other threads small and to avoid race conditions entirely.
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Maybe a new use case for an old idea of mine

Post by hgm »

You make it sound like there is shared storage between the orphaned cores and the others. But if that were true, they would certainly also share the memory interface. Data is either in one group of cores or in the other, and the group where it isn't has tough luck when it wants to access it, because it will have to be transferred over a 'slow' connection. This is why memory access from the orphaned cores is slow in the first place: it adds the (two-way) transfer time between the groups to the memory access time.