using Popcount and Prefetch with SSE4 hardware support

Engin
Posts: 823
Joined: Mon Jan 05, 2009 6:40 pm
Location: Germany

using Popcount and Prefetch with SSE4 hardware support

Post by Engin » Sat May 19, 2012 2:36 pm

Hello,

I am really confused these days and don't know what to do with SSE4.

I am using popcount with SSE4 and I gain only about 1 second in a benchmark test on my machine (AMD Phenom II X6). That's OK, but when I try to get more speed by prefetching the hash table entries, it looks a bit slower than without it.

I have tried prefetching right after a move is made, and right before the key is used; nothing helps.

So what is the right place to get a speedup?

It seems that on my machine SSE4 is working fine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that Tornado SSE4 crashes on their machines.

Are the SSE functions not the same on all machines?

Daniel Shawul
Posts: 3508
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Daniel Shawul » Sat May 19, 2012 2:53 pm

Popcnt should give you some amount of speed up (depending on use) but prefetching hash table entries not so much...
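
A minimal sketch of how a bit count can be written so that one source file builds with or without the hardware instruction. The function names `popcount64` and `popcount64_soft` are hypothetical, and the `__POPCNT__` macro is what GCC and Clang define when the build targets POPCNT (for example with `-mpopcnt`); other compilers need a different check.

```c
#include <stdint.h>

/* Portable fallback: classic SWAR bit count, no special instruction needed. */
static inline int popcount64_soft(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           /* 2-bit sums  */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* 8-bit sums  */
    return (int)((x * 0x0101010101010101ULL) >> 56);                      /* add 8 bytes */
}

/* Uses the compiler builtin when the build targets POPCNT (GCC/Clang define
   __POPCNT__ in that case); otherwise falls back to the SWAR version. */
static inline int popcount64(uint64_t x)
{
#if defined(__POPCNT__)
    return __builtin_popcountll(x);
#else
    return popcount64_soft(x);
#endif
}
```

A binary built with the hardware path will fault on a CPU that lacks the instruction, so a build like this is only safe on machines known to support POPCNT.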

Engin
Posts: 823
Joined: Mon Jan 05, 2009 6:40 pm
Location: Germany

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Engin » Sat May 19, 2012 3:05 pm

Yes, it seems so, but my AMD CPU has a larger L2 cache than L1 cache.

So wouldn't it be better to use

_mm_prefetch((char *)entry, _MM_HINT_T2);

instead of

_mm_prefetch((char *)entry, _MM_HINT_T1);


??

syzygy
Posts: 4319
Joined: Tue Feb 28, 2012 10:56 pm

Re: using Popcount and Prefetch with SSE4 hardware support

Post by syzygy » Sat May 19, 2012 3:57 pm

Engin wrote: I have tried prefetching right after a move is made, and right before the key is used; nothing helps.

So what is the right place to get a speedup?
Right before the memory access it will be too late to do a prefetch. The idea of a memory prefetch is that the memory you are going to access is brought into the cache before the actual access takes place.

I think the best place to do the memory prefetch is right after you have finished calculating the hashkey for the new position, so probably within your make_move() routine.
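
The placement described above could look roughly like the following sketch. Everything here is hypothetical and not taken from any engine in this thread: the `Position`, `TTable` and `TTEntry` types, the `make_move` signature, and the `move_hash_delta` parameter standing in for the incremental Zobrist update. The `_MM_HINT_T0` hint is just one choice, and the table size is assumed to be a power of two so that masking yields the slot index.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
/* Ask the CPU to start loading the cache line that holds p. */
#define TT_PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define TT_PREFETCH(p) ((void)(p))   /* no-op on targets without SSE */
#endif

typedef struct {
    uint64_t key;     /* full hash signature, used to verify a hit */
    uint32_t move;
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;
} TTEntry;

typedef struct {
    TTEntry *entries;
    size_t   mask;    /* entry count - 1; the count must be a power of two */
} TTable;

typedef struct {
    uint64_t hash;    /* board state omitted in this sketch */
} Position;

static inline TTEntry *tt_slot(const TTable *tt, uint64_t hash)
{
    return &tt->entries[hash & tt->mask];
}

/* Only the hash update is shown; the real board update would follow. */
static void make_move(Position *pos, const TTable *tt, uint64_t move_hash_delta)
{
    pos->hash ^= move_hash_delta;          /* new hash key is known here   */
    TT_PREFETCH(tt_slot(tt, pos->hash));   /* issue the prefetch right away */
    /* ... remaining make_move work overlaps with the memory fetch ... */
}

/* Called later in the search, when the entry is actually needed. */
static TTEntry *tt_probe(const TTable *tt, const Position *pos)
{
    TTEntry *e = tt_slot(tt, pos->hash);
    return e->key == pos->hash ? e : NULL;
}
```

The prefetch only pays off if enough work happens between `make_move` and `tt_probe` to cover the memory latency, so it has to be measured on the target machine.
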
Engin wrote: It seems that on my machine SSE4 is working fine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that Tornado SSE4 crashes on their machines.

Are the SSE functions not the same on all machines?
According to Wikipedia, POPCNT is available on Intel beginning with the Nehalem microarchitecture, and on AMD beginning with the Barcelona microarchitecture. POPCNT officially is not part of any SSE instruction set, but if the machine supports SSE4.2 or SSE4a, it seems it should also support POPCNT.

Gerd Isenberg
Posts: 2107
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Gerd Isenberg » Sat May 19, 2012 4:05 pm

Engin wrote: Yes, it seems so, but my AMD CPU has a larger L2 cache than L1 cache.

So wouldn't it be better to use

_mm_prefetch((char *)entry, _MM_HINT_T2);

instead of

_mm_prefetch((char *)entry, _MM_HINT_T1);


??
You may try bypassing the cache ...
http://lwn.net/Articles/255364/
Whether prefetching works depends on your memory footprint and processor. Some processors may treat the instruction as a nop ...

On popcnt, I still have no box with that instruction myself, but it seems AMD and Intel are binary compatible: F3 0F B8 /r. Same CPUID flags.

AMD:
http://support.amd.com/us/Processor_Tec ... APM_v3.pdf
Support for the POPCNT instruction is indicated by ECX bit 23 (POPCNT) as returned by CPUID
function 0000_0001h. Software MUST check the CPUID bit once per program or library initialization
before using the POPCNT instruction, or inconsistent behavior may result.

Intel
http://www.intel.com/content/www/us/en/ ... anual.html
CPUID.01H:ECX.POPCNT [Bit 23]
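
A minimal sketch of the one-time check that both manuals describe, assuming GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid`. The names `cpu_has_popcnt`, `popcount_fn` and `init_popcount` are hypothetical. The hardware path is compiled with a per-function target attribute so that the rest of the program stays runnable on CPUs without POPCNT.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Returns 1 if CPUID function 1 reports POPCNT (ECX bit 23), else 0. */
static int cpu_has_popcnt(void)
{
#if HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                      /* CPUID leaf 1 not available */
    return (int)((ecx >> 23) & 1u);
#else
    return 0;                          /* not an x86 target */
#endif
}

/* Software fallback: clears the lowest set bit until none remain. */
static int popcount_soft(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

#if HAVE_X86_CPUID
/* Only this function is compiled with POPCNT enabled. */
__attribute__((target("popcnt")))
static int popcount_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}
#endif

/* Selected once at startup; defaults to the safe software version. */
static int (*popcount_fn)(uint64_t) = popcount_soft;

static void init_popcount(void)
{
#if HAVE_X86_CPUID
    if (cpu_has_popcnt())
        popcount_fn = popcount_hw;
#endif
}
```

Calling `init_popcount()` once during program initialization and then using `popcount_fn(bitboard)` everywhere keeps a single binary usable on both kinds of machine, at the cost of an indirect call per count.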

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Edmund » Sat May 19, 2012 5:10 pm

Gerd Isenberg wrote: You may try bypassing the cache ...
http://lwn.net/Articles/255364/
Whether prefetching works depends on your memory footprint and processor. Some processors may treat the instruction as a nop ...
Fascinating article. Thanks for sharing it.

Gerd Isenberg
Posts: 2107
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Gerd Isenberg » Sat May 19, 2012 5:31 pm

Edmund wrote:
Gerd Isenberg wrote: You may try bypassing the cache ...
http://lwn.net/Articles/255364/
Whether prefetching works depends on your memory footprint and processor. Some processors may treat the instruction as a nop ...
Fascinating article. Thanks for sharing it.
he he, just found it via cpw-engine ;-)

 _mm_prefetch((char *)&tt[b.hash & tt_size], _MM_HINT_NTA);
Googling _MM_HINT_NTA led me to the article by Ulrich Drepper, which is also available as a PDF. See also "Software prefetching considered harmful" by Linus Torvalds.

Engin
Posts: 823
Joined: Mon Jan 05, 2009 6:40 pm
Location: Germany

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Engin » Sat May 19, 2012 5:54 pm

OK, it seems I benefit mostly from popcount rather than prefetch, but that's better than nothing.

Prefetching in make move is really useless for me, because I don't need the entry after every move made: some moves are pruned, and some positions return draw scores through threefold repetition or the 50-move rule. So I now prefetch only when the entry is really used, after the draw check and mate distance pruning, just before probing the hash. That gives me a minor speedup of some milliseconds, but that's OK.

I also found that the hint parameter _MM_HINT_T0 works better than T1 or T2.

Many thanks to all for your help!

syzygy
Posts: 4319
Joined: Tue Feb 28, 2012 10:56 pm

Re: using Popcount and Prefetch with SSE4 hardware support

Post by syzygy » Sat May 19, 2012 6:17 pm

Engin wrote: Prefetching in make move is really useless for me, because I don't need the entry after every move made: some moves are pruned, and some positions return draw scores through threefold repetition or the 50-move rule. So I now prefetch only when the entry is really used, after the draw check and mate distance pruning, just before probing the hash. That gives me a minor speedup of some milliseconds, but that's OK.
Did you try doing it earlier? I agree that would result in some prefetching of entries that aren't used, but that might not hurt much...

Just to be clear: I don't know what is better, but the fact that you sometimes don't need the TT entry is not necessarily a good reason to dismiss early prefetching.

bob
Posts: 20357
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: using Popcount and Prefetch with SSE4 hardware support

Post by bob » Sun May 20, 2012 2:51 pm

Engin wrote: Hello,

I am really confused these days and don't know what to do with SSE4.

I am using popcount with SSE4 and I gain only about 1 second in a benchmark test on my machine (AMD Phenom II X6). That's OK, but when I try to get more speed by prefetching the hash table entries, it looks a bit slower than without it.

I have tried prefetching right after a move is made, and right before the key is used; nothing helps.

So what is the right place to get a speedup?

It seems that on my machine SSE4 is working fine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that Tornado SSE4 crashes on their machines.

Are the SSE functions not the same on all machines?
Prefetching is "iffy".

1. Where to do it? As SOON as you possibly can. As soon as you update the hash signature, because you want to minimize the wait for data to reach cache.

2. What's the trade-off? When you pre-fetch something into cache, you by definition replace something that is already there. If you need what you replace before you need the TT entry, you have to rely on victim buffer hits to avoid a double-memory penalty - one to re-fetch that which was replaced and then one to re-fetch that which you had already pre-fetched...

3. Can't answer the question about compatibility. 4.2 should be 4.2, but not all machines have 4.2 of course.
