SMP question, dedicating a thread for managing hash

Michael Sherwin · Post by **Michael Sherwin** » Mon Oct 09, 2017 6:39 pm

How many search threads can be serviced by the dedicated hash thread before it will start getting behind. The idea is for a search thread to request an entry with a pointer to store the query, do some stuff and have the information waiting when it is needed. Or to send a pointer to store an entry and proceed without waiting. Since their will be only one thread accessing the hash table(s) race conditions would be non existent and locks would not be needed. Can this be efficient? How many search threads would be needed for this to be efficient?

Evert · Post by **Evert** » Mon Oct 09, 2017 7:12 pm

Michael Sherwin wrote:How many search threads can be serviced by the dedicated hash thread before it will start getting behind. The idea is for a search thread to request an entry with a pointer to store the query, do some stuff and have the information waiting when it is needed. Or to send a pointer to store an entry and proceed without waiting. Since their will be only one thread accessing the hash table(s) race conditions would be non existent and locks would not be needed. Can this be efficient? How many search threads would be needed for this to be efficient?

I don't think this will work too well. You need some global structure to pass the information around (or communicate through a pipe, which might be more elegant but carries overhead), but then the search thread still needs to synchronise with the hash thread to make sure the information it requested there (or you must accept spurious hash misses). The hash thread, meanwhile, needs to busy-wait for data, which is probably a waste of CPU time.
That's my expectation though, I'm not aware of anyone trying something like this.

hgm · Post by **hgm** » Mon Oct 09, 2017 7:50 pm

You are trying to do in software what the hardware already does better. The CPU does continue executing instructions while waiting for the hash entry to arrive, through speculative out-of-order execution. You would also have to speculate what 'stuff' you would have the thread doing while waiting for the hashed data. Because this is usually completely different depending on whether you have a miss, a hit or a hash cut. You can follow the most likely path, but that is what the branch prediction logic would also do for the speculative execution. (And it might be more clever in guessing.)

There is one aspect where you could probably beat the hardware, because the size of the reorder buffer is relatively small (something like 128 instructions, IIRC, and at 3 instructions per clock it might have executed them all before the DRAM read completes, and then the pipeline stalls anyway). If you can make a non-committal access to the memory location (with an instruction that completes without the memory data arriving, so that it can retire from the reorder buffer, and won't block the pipeline), you could make it so far in advance of the actual usage of the data that it would be fetched before the program needs it. A prefetch or a store instruction would do this. The latter would make the cache line dirty, however. (But if you update age counters, you would make it dirty anyway.) A store would force the data in cache, and then the actual reading of it, if doe sufficiently long after it, will be almost immediate.

But there isn't an awful lot you certainly have to do between knowing which entry you need, and having to know what it contained. You could update the key before MakeMove, but actually the whole probe could best be done before MakeMove. You could calculate the key for the next move, and prefetch the hash data for that, before making the current move, but you would have done that in vain if you got a cutoff. But it might be a good strategy in all-nodes, or for late moves, where the chances for a cutoff are very slim.

But you should always leave handling of the que of prefetch requests to the harware load/store unit, software has no chance to compete with that.

Michael Sherwin · Post by **Michael Sherwin** » Mon Oct 09, 2017 9:17 pm

hgm wrote:You are trying to do in software what the hardware already does better. The CPU does continue executing instructions while waiting for the hash entry to arrive, through speculative out-of-order execution. You would also have to speculate what 'stuff' you would have the thread doing while waiting for the hashed data. Because this is usually completely different depending on whether you have a miss, a hit or a hash cut. You can follow the most likely path, but that is what the branch prediction logic would also do for the speculative execution. (And it might be more clever in guessing.)

There is one aspect where you could probably beat the hardware, because the size of the reorder buffer is relatively small (something like 128 instructions, IIRC, and at 3 instructions per clock it might have executed them all before the DRAM read completes, and then the pipeline stalls anyway). If you can make a non-committal access to the memory location (with an instruction that completes without the memory data arriving, so that it can retire from the reorder buffer, and won't block the pipeline), you could make it so far in advance of the actual usage of the data that it would be fetched before the program needs it. A prefetch or a store instruction would do this. The latter would make the cache line dirty, however. (But if you update age counters, you would make it dirty anyway.) A store would force the data in cache, and then the actual reading of it, if doe sufficiently long after it, will be almost immediate.

But there isn't an awful lot you certainly have to do between knowing which entry you need, and having to know what it contained. You could update the key before MakeMove, but actually the whole probe could best be done before MakeMove. You could calculate the key for the next move, and prefetch the hash data for that, before making the current move, but you would have done that in vain if you got a cutoff. But it might be a good strategy in all-nodes, or for late moves, where the chances for a cutoff are very slim.

But you should always leave handling of the que of prefetch requests to the harware load/store unit, software has no chance to compete with that.

I was thinking it could be prefetched right after MakeMove and before -Search is called as there is a lot that is done before the result is needed after -Search is called. So maybe in RomiChess i'm not doing my hash check soon enough.

Michael Sherwin · Post by **Michael Sherwin** » Mon Oct 09, 2017 9:41 pm

Evert wrote:
Michael Sherwin wrote:How many search threads can be serviced by the dedicated hash thread before it will start getting behind. The idea is for a search thread to request an entry with a pointer to store the query, do some stuff and have the information waiting when it is needed. Or to send a pointer to store an entry and proceed without waiting. Since their will be only one thread accessing the hash table(s) race conditions would be non existent and locks would not be needed. Can this be efficient? How many search threads would be needed for this to be efficient?
I don't think this will work too well. You need some global structure to pass the information around (or communicate through a pipe, which might be more elegant but carries overhead), but then the search thread still needs to synchronise with the hash thread to make sure the information it requested there (or you must accept spurious hash misses). The hash thread, meanwhile, needs to busy-wait for data, which is probably a waste of CPU time.
That's my expectation though, I'm not aware of anyone trying something like this.

I was thinking that the hash thread would simply poll a global queue for a pointer remove the pointer and either fetch the information to store in the hash or deliver the probe data.

Edit: Actually two queues one for probing and one for storing.

hgm · Post by **hgm** » Tue Oct 10, 2017 1:18 pm

The point is that the CPU hardware already maintains such queues. You send it the pointers as the address of load or strore instructions, and after some time it returns you the data in a CPU register.

Doing a lot in Search() before examining the TT seems indeed bad; if you get a hash cutoff, you would have done it all in vain. You would even have called Search() in vain. And MakeMove()...

In Joker I probe the hash before MakeMove(), either before or after the repetition test. (This is a close call; most moves are ot repetitions, and the repetition test gives you something well-predictable to do while waiting for the probe, whichyou would have to do ayway, because even a probe that would give you a cutoff is invalid in the case of a repetition. OTOH, the probe is much more expensive than the repetition test, and if you did the test first, and you did have a repetition, you would not have to do the probe. But in this case the CPU could already have started the probe speculatively, and I am not sure if it could abort the memory fetch.) Only if there is no repeat and no hash cutoff it does a MakeMove() and a recursive call to Search().

SMP question, dedicating a thread for managing hash

SMP question, dedicating a thread for managing hash

Re: SMP question, dedicating a thread for managing hash

Re: SMP question, dedicating a thread for managing hash

Re: SMP question, dedicating a thread for managing hash

Re: SMP question, dedicating a thread for managing hash

Re: SMP question, dedicating a thread for managing hash