Hello,
I am really confused these days and don't know what to do with SSE4.
I am using popcount with SSE4 and gain only about 1 second in a benchmark run on my machine (AMD Phenom II X6). That's OK, but when I try to get more speed by prefetching the hash table entries, it looks a bit slower than without prefetching.
I have tried prefetching right after a move is made, and right before the key is used; nothing helps.
So what is the right place to prefetch to get a speed-up?
SSE4 seems to work fine on my machine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that the Tornado SSE4 build crashes on their machines.
Are the SSE functions not the same on all machines?
using Popcount and Prefetch with SSE4 hardware support
-
- Posts: 918
- Joined: Mon Jan 05, 2009 7:40 pm
- Location: Germany
- Full name: Engin Üstün
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: using Popcount and Prefetch with SSE4 hardware support
Popcnt should give you some amount of speed up (depending on use) but prefetching hash table entries not so much...
-
- Posts: 918
- Joined: Mon Jan 05, 2009 7:40 pm
- Location: Germany
- Full name: Engin Üstün
Re: using Popcount and Prefetch with SSE4 hardware support
Yes, it seems so, but my AMD CPU has a bigger L2 cache than L1 cache.
So would it not be better to use
_mm_prefetch((char *)entry, _MM_HINT_T2);
instead of
_mm_prefetch((char *)entry, _MM_HINT_T1);
?
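A minimal sketch for comparing the hints, assuming an x86 compiler that ships xmmintrin.h. The hint argument of _mm_prefetch must be a compile-time constant, so the sketch selects it with a macro and the engine is rebuilt and benchmarked once per hint. TTEntry, tt_prefetch and TT_PREFETCH_HINT are made-up names for illustration, not Tornado's. Roughly, _MM_HINT_T0 asks for the line in all cache levels, T1 and T2 for outer levels only, and NTA for a non-temporal fetch; how a given CPU treats each hint is implementation-specific, so only measurement on the target machine settles the question.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical build switch: compile with e.g. -DTT_PREFETCH_HINT=_MM_HINT_T2 */
#ifndef TT_PREFETCH_HINT
#define TT_PREFETCH_HINT _MM_HINT_T0
#endif

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */
#define TT_PREFETCH(p) _mm_prefetch((const char *)(p), TT_PREFETCH_HINT)
#else
#define TT_PREFETCH(p) ((void)(p))   /* no-op on targets without SSE */
#endif

/* Hypothetical 16-byte entry; the real layout is engine-specific. */
typedef struct {
    uint64_t key;
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;
    uint32_t move;
} TTEntry;

/* A prefetch is only a hint: it never faults and never changes the data. */
static inline void tt_prefetch(const TTEntry *entry)
{
    TT_PREFETCH(entry);
}
```

With 16-byte entries, four of them share one 64-byte cache line, so a single prefetch covers a whole four-entry bucket if the table is aligned to 64 bytes.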
-
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: using Popcount and Prefetch with SSE4 hardware support
Engin wrote: "I have tried prefetching right after a move is made, and right before the key is used; nothing helps.
So what is the right place to get a speed-up?"
Right before the memory access it will be too late to do a prefetch. The idea of a memory prefetch is that the memory you are going to access is brought into the cache before the actual access takes place.
I think the best place to do the memory prefetch is right after you have finished calculating the hash key for the new position, so probably within your make_move() routine.
Engin wrote: "SSE4 seems to work fine on my machine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that the Tornado SSE4 build crashes on their machines.
Are the SSE functions not the same on all machines?"
According to Wikipedia, POPCNT is available on Intel beginning with the Nehalem microarchitecture, and on AMD beginning with the Barcelona microarchitecture. POPCNT officially is not part of any SSE instruction set, but if the machine supports SSE4.2 or SSE4a, it seems it should also support POPCNT.
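The suggestion to prefetch inside make_move() could look roughly like the sketch below. It is not Tornado's code: TTEntry, Position, tt_init, tt_probe and the key_delta parameter are assumptions made for illustration, and key_delta stands in for the XOR of all Zobrist terms that the move changes. The point is only the ordering: the prefetch is issued as soon as the new key exists, and the rest of the move bookkeeping runs while the cache line is in flight.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH(p) ((void)(p))   /* no-op on targets without SSE */
#endif

typedef struct { uint64_t key; uint64_t data; } TTEntry;   /* hypothetical */
typedef struct { uint64_t hash; } Position;                /* board omitted */

static TTEntry *tt;        /* table with a power-of-two number of entries */
static uint64_t tt_mask;   /* number of entries minus one */

static int tt_init(size_t entries)
{
    /* Masking with (entries - 1) only works for powers of two. */
    assert(entries != 0 && (entries & (entries - 1)) == 0);
    tt = (TTEntry *)calloc(entries, sizeof *tt);
    tt_mask = entries - 1;
    return tt != NULL;
}

static void make_move(Position *pos, uint64_t key_delta)
{
    pos->hash ^= key_delta;               /* 1. the new key is known here  */
    PREFETCH(&tt[pos->hash & tt_mask]);   /* 2. start the memory fetch now */
    /* 3. update piece lists, castling rights, and so on while it loads */
}

/* Returns the matching entry, or NULL if the slot holds another position. */
static TTEntry *tt_probe(const Position *pos)
{
    TTEntry *e = &tt[pos->hash & tt_mask];
    return e->key == pos->hash ? e : NULL;
}
```

A search would call make_move(), run its draw and pruning checks, and only then call tt_probe(); the work between the two calls is what gives the prefetch time to complete.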
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: using Popcount and Prefetch with SSE4 hardware support
Engin wrote: "Yes, it seems so, but my AMD CPU has a bigger L2 cache than L1 cache.
So would it not be better to use
_mm_prefetch((char *)entry, _MM_HINT_T2);
instead of
_mm_prefetch((char *)entry, _MM_HINT_T1);
?"
You may try Bypassing the Cache ...
http://lwn.net/Articles/255364/
Whether prefetching works depends on your memory footprint and processor. The instruction might as well be a nop on some processors ...
On popcnt, I still have no box with that instruction myself, but it seems AMD and Intel are binary compatible, F3 0F B8 /r. Same CPUID flags.
AMD:
http://support.amd.com/us/Processor_Tec ... APM_v3.pdf
Support for the POPCNT instruction is indicated by ECX bit 23 (POPCNT) as returned by CPUID
function 0000_0001h. Software MUST check the CPUID bit once per program or library initialization
before using the POPCNT instruction, or inconsistent behavior may result.
Intel:
http://www.intel.com/content/www/us/en/ ... anual.html
CPUID.01H:ECX.POPCNT [Bit 23]
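The CPUID check quoted above can be sketched as follows, assuming GCC or Clang on x86, where <cpuid.h> provides __get_cpuid (MSVC offers a similar __cpuid intrinsic in <intrin.h>). The names cpu_has_popcnt, popcount_soft, popcount_hw, popcount64 and popcount_init are invented for this example. Running POPCNT on a CPU that lacks it raises an illegal-instruction fault, which is one plausible cause of the reported crashes, though the thread does not confirm it.

```c
#include <assert.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

/* Returns 1 if CPUID function 1 reports ECX bit 23 (POPCNT), else 0. */
static int cpu_has_popcnt(void)
{
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return (int)((ecx >> 23) & 1u);
#endif
    return 0;
}

/* Portable fallback: parallel (SWAR) bit count, no special instructions. */
static int popcount_soft(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int)((x * 0x0101010101010101ULL) >> 56);
}

#ifdef HAVE_CPUID_H
/* The target attribute allows POPCNT in this one function only. */
__attribute__((target("popcnt")))
static int popcount_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}
#endif

/* Dispatch pointer; starts on the safe path until popcount_init() runs. */
static int (*popcount64)(uint64_t) = popcount_soft;

static void popcount_init(void)
{
#ifdef HAVE_CPUID_H
    if (cpu_has_popcnt())
        popcount64 = popcount_hw;
#endif
}
```

Calling through a pointer costs part of the gain in a hot loop, so an engine might instead use this check once at startup to refuse to run, or to print a clear message, when an SSE4 build is started on a CPU without POPCNT.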
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: using Popcount and Prefetch with SSE4 hardware support
Gerd Isenberg wrote: "You may try Bypassing the Cache ...
http://lwn.net/Articles/255364/
Depends on your memory footprint and processor whether prefetching works. Instruction might as well be nop by some processors ..."
Fascinating article. Thanks for sharing it.
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: using Popcount and Prefetch with SSE4 hardware support
Gerd Isenberg wrote: "You may try Bypassing the Cache ...
http://lwn.net/Articles/255364/
Depends on your memory footprint and processor whether prefetching works. Instruction might as well be nop by some processors ..."
Edmund wrote: "Fascinating article. Thanks for sharing it."
he he, just found it via cpw-engine:
_mm_prefetch((char *)&tt[b.hash & tt_size], _MM_HINT_NTA);
-
- Posts: 918
- Joined: Mon Jan 05, 2009 7:40 pm
- Location: Germany
- Full name: Engin Üstün
Re: using Popcount and Prefetch with SSE4 hardware support
OK, it seems I benefit most from popcount rather than prefetch, but that is better than nothing.
Prefetching in make move is really useless for me, because I don't need the entry after every move made: some moves are pruned, and some positions are threefold repetitions or fifty-move-rule draws that return a draw score. So I now prefetch only when the entry is really used: after the draw check and after mate distance pruning, just before probing the hash. That gives me a minor speed-up, measured in milliseconds, but that's OK.
I also found that the hint parameter _MM_HINT_T0 works better than T1 or T2.
Many thanks to all for your help!
-
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: using Popcount and Prefetch with SSE4 hardware support
Engin wrote: "Prefetching in make move is really useless for me, because I don't need the entry after every move made: some moves are pruned, and some positions are threefold repetitions or fifty-move-rule draws that return a draw score. So I now prefetch only when the entry is really used: after the draw check and after mate distance pruning, just before probing the hash. That gives me a minor speed-up, measured in milliseconds, but that's OK."
Did you try doing it earlier? I agree that would result in some prefetching of entries that aren't used, but that might not hurt much...
Just to be clear: I don't know what is better, but the fact that you sometimes don't need the TT entry is not necessarily a good reason to dismiss early prefetching.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: using Popcount and Prefetch with SSE4 hardware support
Engin wrote: "Hello,
I am really confused these days and don't know what to do with SSE4.
I am using popcount with SSE4 and gain only about 1 second in a benchmark run on my machine (AMD Phenom II X6). That's OK, but when I try to get more speed by prefetching the hash table entries, it looks a bit slower than without prefetching.
I have tried prefetching right after a move is made, and right before the key is used; nothing helps.
So what is the right place to prefetch to get a speed-up?
SSE4 seems to work fine on my machine, and I think on AMD it is SSE4a and not SSE4.2 as on Intel machines. Some testers report that the Tornado SSE4 build crashes on their machines.
Are the SSE functions not the same on all machines?"
Prefetching is "iffy".
1. Where to do it? As SOON as you possibly can. As soon as you update the hash signature, because you want to minimize the wait for data to reach cache.
2. What's the trade-off? When you pre-fetch something into cache, you by definition replace something that is already there. If you need what you replace before you need the TT entry, you have to rely on victim buffer hits to avoid a double-memory penalty - one to re-fetch that which was replaced and then one to re-fetch that which you had already pre-fetched...
3. Can't answer the question about compatibility. 4.2 should be 4.2, but not all machines have 4.2 of course.
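Point 1 is about lead time: the prefetch only pays off if enough other work happens between issuing it and reading the data. A small experiment along these lines, assuming nothing about any particular engine, is to probe a table with scattered keys and prefetch the slot that will be needed `lead` iterations later. The functions next_key and probe_sum are made up for this sketch; timing the loop for different lead values and table sizes on the target machine shows whether prefetching helps there at all, and the sketch itself makes no claim about the outcome.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH(p) ((void)(p))   /* no-op on targets without SSE */
#endif

/* xorshift64 generator, used only to produce scattered keys. */
static uint64_t next_key(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/*
 * Sums table[key & mask] over all keys. With lead > 0, the slot needed
 * `lead` iterations later is prefetched first; lead == 0 disables it.
 * The result is the same either way, only the memory timing differs.
 */
static uint64_t probe_sum(const uint64_t *table, size_t mask,
                          const uint64_t *keys, size_t n, size_t lead)
{
    uint64_t sum = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (lead != 0 && i + lead < n)
            PREFETCH(&table[keys[i + lead] & mask]);
        sum += table[keys[i] & mask];
    }
    return sum;
}
```

For a meaningful measurement the table has to be much larger than the last-level cache; a table that fits in cache will show no benefit, which matches the remark above that results depend on the memory footprint.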