hyper threading and move generation

tej · Post by **tej** » Wed Aug 01, 2012 5:21 pm

Hi, I have an idea I would like comments about. I'm totally new to this forum, so just correct me if I'm wrong on something.

Anyways, I'm thinking about using a "slave thread" for move generation, while a "master thread" on the HT core is accessing a hash table, and waiting for RAM. I don't know how to determine a good hash table size, but with this framework cpu cache usage while using the hash table would not be an issue, so there wouldn't really be a limit on the number of entries in the table.

Has anyone tried something similar? / Or has a quick idea why this would not work?

Daniel Shawul · Post by **Daniel Shawul** » Wed Aug 01, 2012 5:56 pm

tej wrote:Hi, I have an idea I would like comments about. I'm totally new to this forum, so just correct me if I'm wrong on something.

Anyways, I'm thinking about using a "slave thread" for move generation, while a "master thread" on the HT core is accessing a hash table, and waiting for RAM. I don't know how to determine a good hash table size, but with this framework cpu cache usage while using the hash table would not be an issue, so there wouldn't really be a limit on the number of entries in the table.

Has anyone tried something similar? / Or has a quick idea why this would not work?

I think you got this wrong. Hash table size (number of entries) does not affect how much time is spent on accessing the hash table, since it is always done in constant time O(1).
HT technology uses idle superscalar processors by starting a new thread so you may as well start two search threads on one physical core. For chess programs this was not very much of a win because whenever you add a thread you add also search overhead. Unless the speed up from the additional hyper thread is enough to offset the overhead , the search would be slower than using single thread. You can find some threads in this forum about tests with/without HT.

Note that if there was any instruction level parallelism left the compiler probably would have taken care of it. So it is better to start a new thread for searching.

Desperado · Post by **Desperado** » Wed Aug 01, 2012 6:12 pm

tej wrote:Hi, I have an idea I would like comments about. I'm totally new to this forum, so just correct me if I'm wrong on something.

Anyways, I'm thinking about using a "slave thread" for move generation,
while a "master thread" on the HT core is accessing a hash table, and waiting for RAM. I don't know how to determine a good hash table size, but with this framework cpu cache usage while using the hash table would not be an issue, so there wouldn't really be a limit on the number of entries in the table.

Has anyone tried something similar? / Or has a quick idea why this would not work?

Hello Gabor,

first welcome to the forum

!

if i understand correctly, you want to use a "move-generation" thread while another thread is just picking the generated moves from a hashtable then. Correct ?

If so, i wonder how do you handle the alphaBeta framework.
Will you generate millions of nodes which will never be accessed later ?
Or do you have a nice solution to avoid massive overhead ?

Parallel computing power should be used for the search imo.

you can find more information here

Michael

tej · Post by **tej** » Wed Aug 01, 2012 6:34 pm

Quick responses : )

Anyway, I wasn't interested in paralell search, let me try to explain it better:

A thread is doing the nice alpha-beta work, and stores stuff in the transposition table, and sometimes reads from the transposition table. Also, Probably not doing mush work while waiting for the RAM, which can be a long time if it doesn't hit L2 or L3 cache. So the other thread would wake up, and help out with some move generation in the L1 cache of the same core. When the very useful bits arrive from RAM, the main thread takes back control.

hgm · Post by **hgm** » Wed Aug 01, 2012 7:40 pm

In principle this could work, provided you have an efficient way to start and stop the move-generation thread, and can prevent it from wasting too much CPU resoures when it has nothing to do. And you should be able to force the threads to run on the same physical core, so they share L1, or it would be a disaster.

One way would be to use a long-latency instruction like multiply in a spin-lock. Like:

Code: Select all

value = lock;
while&#40;value&#41; value *= lock;

This would fall through when the other thread sets lock=0. And it would not consume too many resoures, since the critical path of the loop contains a multiply, which takes many clocks, and the execution of the test-and-branch will have to wait for the result.

syzygy · Post by **syzygy** » Wed Aug 01, 2012 9:09 pm

tej wrote:A thread is doing the nice alpha-beta work, and stores stuff in the transposition table, and sometimes reads from the transposition table. Also, Probably not doing mush work while waiting for the RAM, which can be a long time if it doesn't hit L2 or L3 cache. So the other thread would wake up, and help out with some move generation in the L1 cache of the same core. When the very useful bits arrive from RAM, the main thread takes back control.

An easier way to achieve more or less the same effect is to issue a prefetch instruction to load the TT entry into cache while the cpu continues with some useful work.

In this paper, Ulrich Drepper proposes to use a hyperthread to do the prefetching (see par. 6.3.4 starting on page 63), which seems to get quite close to what you are thinking of. I don't think it will be easy to get synchronisation overhead low enough to make it work, though.

syzygy · Post by **syzygy** » Wed Aug 01, 2012 9:19 pm

hgm wrote:In principle this could work, provided you have an efficient way to start and stop the move-generation thread, and can prevent it from wasting too much CPU resoures when it has nothing to do. And you should be able to force the threads to run on the same physical core, so they share L1, or it would be a disaster.

One way would be to use a long-latency instruction like multiply in a spin-lock. Like:
Code: Select all
value = lock;
while&#40;value&#41; value *= lock;
This would fall through when the other thread sets lock=0. And it would not consume too many resoures, since the critical path of the loop contains a multiply, which takes many clocks, and the execution of the test-and-branch will have to wait for the result.

The monitor/mwait instructions were designed for efficiently implementing spinlocks when using hyperthreads. See e.g. this presentation.

tej · Post by **tej** » Wed Aug 01, 2012 10:07 pm

syzygy wrote: An easier way to achieve more or less the same effect is to issue a prefetch instruction to load the TT entry into cache while the cpu continues with some useful work.

Yes, prefetch is much more simple then what I proposed, with no overhead I think.

syzygy wrote: In this paper, Ulrich Drepper proposes to use a hyperthread to do the prefetching (see par. 6.3.4 starting on page 63), which seems to get quite close to what you are thinking of. I don't think it will be easy to get synchronisation overhead low enough to make it work, though.

So it wasn't a brand new idea after all : )

hyper threading and move generation

hyper threading and move generation

Re: hyper threading and move generation

Re: hyper threading and move generation

Re: hyper threading and move generation

Re: hyper threading and move generation

Re: hyper threading and move generation

Re: hyper threading and move generation

Re: hyper threading and move generation