
using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 4:36 pm
by Engin
Hello,

I am really confused these days and don't know what to do with SSE4.

I am using popcount with SSE4 and gain only about 1 second in a benchmark run on my machine (AMD Phenom II X6). That's OK, but when I try to get more speed by prefetching the hash table entries, it looks a bit slower than without prefetching.

I tried prefetching right after a move is made and right before the key is used; nothing helps.

So what is the right place to get a speedup?

It seems that SSE4 works fine on my machine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that Tornado SSE4 crashes on their machines.

Are the SSE functions not the same on all machines?

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 4:53 pm
by Daniel Shawul
Popcnt should give you some amount of speed up (depending on use) but prefetching hash table entries not so much...

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 5:05 pm
by Engin
Yes, it seems so, but my AMD CPU has a much larger L2 cache than L1 cache.

So wouldn't it be better to use

_mm_prefetch((char *)entry, _MM_HINT_T2);

instead of

_mm_prefetch((char *)entry, _MM_HINT_T1);


??

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 5:57 pm
by syzygy
Engin wrote: I tried prefetching right after a move is made and right before the key is used; nothing helps.

So what is the right place to get a speedup?
Right before the memory access it will be too late to do a prefetch. The idea of a memory prefetch is that the memory you are going to access is brought into the cache before the actual access takes place.

I think the best place to do the memory prefetch is right after you have finished calculating the hashkey for the new position, so probably within your make_move() routine.
It seems that SSE4 works fine on my machine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that Tornado SSE4 crashes on their machines.

Are the SSE functions not the same on all machines?
According to Wikipedia, POPCNT is available on Intel beginning with the Nehalem microarchitecture, and on AMD beginning with the Barcelona microarchitecture. POPCNT officially is not part of any SSE instruction set, but if the machine supports SSE4.2 or SSE4a, it seems it should also support POPCNT.

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 6:05 pm
by Gerd Isenberg
Engin wrote: Yes, it seems so, but my AMD CPU has a much larger L2 cache than L1 cache.

So wouldn't it be better to use

_mm_prefetch((char *)entry, _MM_HINT_T2);

instead of

_mm_prefetch((char *)entry, _MM_HINT_T1);


??
You may try Bypassing the Cache ...
http://lwn.net/Articles/255364/
Whether prefetching works depends on your memory footprint and processor. Some processors may as well treat the instruction as a nop ...

On popcnt: I still have no box with that instruction myself, but it seems AMD and Intel are binary compatible (F3 0F B8 /r), with the same CPUID flag.

AMD:
http://support.amd.com/us/Processor_Tec ... APM_v3.pdf
Support for the POPCNT instruction is indicated by ECX bit 23 (POPCNT) as returned by CPUID
function 0000_0001h. Software MUST check the CPUID bit once per program or library initialization
before using the POPCNT instruction, or inconsistent behavior may result.

Intel:
http://www.intel.com/content/www/us/en/ ... anual.html
CPUID.01H:ECX.POPCNT [Bit 23]
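
The CPUID check quoted above could be done once at startup, roughly as in the following sketch. It assumes GCC or Clang (`<cpuid.h>` and the `target` function attribute); on other compilers the equivalent calls differ. The names `popcount_soft`, `popcount_hw`, `cpu_has_popcnt`, `popcount64` and `popcount_init` are made up for illustration and are not taken from any engine.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* __get_cpuid (GCC/Clang only) */
#define HAVE_X86_CPUID 1
#endif

/* Portable fallback: clear the lowest set bit until none remain. */
static int popcount_soft(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

#ifdef HAVE_X86_CPUID
/* POPCNT code generation is enabled for this one function only,
   so the rest of the binary still runs on CPUs without it. */
static int __attribute__((target("popcnt"))) popcount_hw(uint64_t x) {
    return __builtin_popcountll(x);
}

/* CPUID function 0000_0001h, ECX bit 23 reports POPCNT support. */
static int cpu_has_popcnt(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return (int)((ecx >> 23) & 1u);
}
#endif

/* Hypothetical dispatch pointer; defaults to the safe fallback. */
static int (*popcount64)(uint64_t) = popcount_soft;

/* Call once at program start, before any popcount64() use. */
void popcount_init(void) {
#ifdef HAVE_X86_CPUID
    if (cpu_has_popcnt()) popcount64 = popcount_hw;
#endif
    assert(popcount64(0xF0F0ULL) == 8);   /* sanity check */
}
```

A binary built this way should not crash on a machine without POPCNT, because the hardware path is only selected when the CPUID bit is set. The indirect call costs something, so an engine might instead ship separate builds and use the check only to refuse to start on unsupported hardware.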

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 7:10 pm
by Edmund
Gerd Isenberg wrote:You may try Bypassing the Cache ...
http://lwn.net/Articles/255364/
Whether prefetching works depends on your memory footprint and processor. Some processors may as well treat the instruction as a nop ...
Fascinating article. Thanks for sharing it.

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 7:31 pm
by Gerd Isenberg
Edmund wrote:
Gerd Isenberg wrote:You may try Bypassing the Cache ...
http://lwn.net/Articles/255364/
Whether prefetching works depends on your memory footprint and processor. Some processors may as well treat the instruction as a nop ...
Fascinating article. Thanks for sharing it.
he he, just found it via cpw-engine ;-)

 _mm_prefetch((char *)&tt[b.hash & tt_size], _MM_HINT_NTA);
Googling _MM_HINT_NTA led me to the article by Ulrich Drepper, which is also available as PDF. See also Software prefetching considered harmful by Linus Torvalds

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 7:54 pm
by Engin
OK, it seems I benefit much more from popcount than from prefetch, but that's better than nothing.

If I prefetch in make move, it is really useless for me, because I don't need the entry after every move made: some moves are pruned, and some positions are threefold repetitions or 50-move-rule draws that return a draw score. So I prefetch only when the entry is really going to be used: after the draw check and after mate distance pruning, just before probing the hash. That gives me a minor speedup of some milliseconds, but it's OK.

And I found that the hint parameter _MM_HINT_T0 works better than T1 or T2.
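
For anyone who wants to repeat that comparison, here is a rough sketch of how the hints could be switched without recompiling. The hint argument of `_mm_prefetch` has to be a compile-time constant, so each hint gets its own wrapper. This assumes an x86 compiler that provides `<xmmintrin.h>`; the names `prefetch_variants`, `tt_prefetch` and `select_prefetch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

/* One wrapper per hint, because the hint must be a constant expression. */
static void prefetch_t0(const void *p)  { _mm_prefetch((const char *)p, _MM_HINT_T0); }
static void prefetch_t1(const void *p)  { _mm_prefetch((const char *)p, _MM_HINT_T1); }
static void prefetch_t2(const void *p)  { _mm_prefetch((const char *)p, _MM_HINT_T2); }
static void prefetch_nta(const void *p) { _mm_prefetch((const char *)p, _MM_HINT_NTA); }

typedef void (*prefetch_fn)(const void *);

typedef struct {
    const char *name;    /* label for benchmark output */
    prefetch_fn fn;
} PrefetchVariant;

static const PrefetchVariant prefetch_variants[] = {
    { "T0",  prefetch_t0  },
    { "T1",  prefetch_t1  },
    { "T2",  prefetch_t2  },
    { "NTA", prefetch_nta },
};

enum { N_PREFETCH_VARIANTS = sizeof prefetch_variants / sizeof prefetch_variants[0] };

/* Variant currently used by the (hypothetical) engine. */
static prefetch_fn tt_prefetch = prefetch_t0;

/* Select a variant by index, e.g. from a benchmark option.
   Returns 1 on success, 0 if the index is out of range. */
int select_prefetch(size_t index) {
    if (index >= N_PREFETCH_VARIANTS) return 0;
    tt_prefetch = prefetch_variants[index].fn;
    return 1;
}
```

The function pointer is only meant for experiments. Once a hint has been chosen by measurement on the target machine, calling `_mm_prefetch` directly with that constant avoids the indirect call.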

Many thanks to all for your help!

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sat May 19, 2012 8:17 pm
by syzygy
Engin wrote: If I prefetch in make move, it is really useless for me, because I don't need the entry after every move made: some moves are pruned, and some positions are threefold repetitions or 50-move-rule draws that return a draw score. So I prefetch only when the entry is really going to be used: after the draw check and after mate distance pruning, just before probing the hash. That gives me a minor speedup of some milliseconds, but it's OK.
Did you try doing it earlier? I agree that would result in some prefetching of entries that aren't used, but that might not hurt much...

Just to be clear: I don't know what is better, but the fact that you sometimes don't need the TT entry is not necessarily a good reason to dismiss early prefetching.

Re: using Popcount and Prefetch with SSE4 hardware support

Posted: Sun May 20, 2012 4:51 pm
by bob
Engin wrote: Hello,

I am really confused these days and don't know what to do with SSE4.

I am using popcount with SSE4 and gain only about 1 second in a benchmark run on my machine (AMD Phenom II X6). That's OK, but when I try to get more speed by prefetching the hash table entries, it looks a bit slower than without prefetching.

I tried prefetching right after a move is made and right before the key is used; nothing helps.

So what is the right place to get a speedup?

It seems that SSE4 works fine on my machine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that Tornado SSE4 crashes on their machines.

Are the SSE functions not the same on all machines?
Prefetching is "iffy".

1. Where to do it? As SOON as you possibly can. As soon as you update the hash signature, because you want to minimize the wait for data to reach cache.

2. What's the trade-off? When you pre-fetch something into cache, you by definition replace something that is already there. If you need what you replace before you need the TT entry, you have to rely on victim buffer hits to avoid a double-memory penalty - one to re-fetch that which was replaced and then one to re-fetch that which you had already pre-fetched...

3. Can't answer the question about compatibility. 4.2 should be 4.2, but not all machines have 4.2 of course.
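
Point 1 above, prefetching as soon as the hash signature is updated, might look like the sketch below. Everything here is a simplified assumption for illustration: the 16-byte `TTEntry` layout, the power-of-two table indexed with `tt_mask`, the `Position` struct that holds only the key, and a `make_move` that receives the key delta directly. It also assumes an x86 compiler with `<xmmintrin.h>`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical 16-byte transposition table entry. */
typedef struct {
    uint64_t key;
    uint32_t move;
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;
} TTEntry;

static TTEntry *tt;        /* table storage */
static uint64_t tt_mask;   /* number of entries minus one */

/* Allocate a zeroed table; entries must be a power of two. */
int tt_init(size_t entries) {
    assert(entries != 0 && (entries & (entries - 1)) == 0);
    tt = calloc(entries, sizeof *tt);
    tt_mask = (uint64_t)entries - 1;
    return tt != NULL;
}

static inline TTEntry *tt_slot(uint64_t key) {
    return &tt[key & tt_mask];
}

/* Hypothetical position: only the hash key matters here. */
typedef struct { uint64_t hash; } Position;

void make_move(Position *pos, uint64_t move_key) {
    pos->hash ^= move_key;   /* incremental hash update */
    /* Issue the prefetch immediately, so the memory latency overlaps
       with the rest of make_move and the checks before the probe. */
    _mm_prefetch((const char *)tt_slot(pos->hash), _MM_HINT_T0);
    /* ... remaining board updates would go here ... */
}

/* Probe later; ideally the cache line has arrived by now. */
TTEntry *tt_probe(const Position *pos) {
    TTEntry *e = tt_slot(pos->hash);
    return e->key == pos->hash ? e : NULL;
}
```

The trade-off from point 2 still applies: every prefetched line replaces something else in the cache, including for positions that are pruned before the probe. Whether the early or the late placement wins is something only a benchmark on the actual engine and machine can settle.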