using Popcount and Prefetch with SSE4 hardware support

Engin
Posts: 918
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Engin »

Hi Robert,
many thanks for your comment. With 4.2 I mean the difference between the AMD and Intel SSE4 functions: on my AMD it means SSE4a, and on Intel it probably means SSE4.2.
Engin
Posts: 918
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Engin »

So I am not sure whether the two are compatible with each other.
Engin
Posts: 918
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Engin »

lol, "Removing the prefetch actually IMPROVES PERFORMANCE"

That's what I found in my experiments too :)

NTA is the non-temporal hint, which tries to minimize cache pollution, so you can completely disable that too.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: using Popcount and Prefetch with SSE4 hardware support

Post by syzygy »

From Wikipedia:
Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in Penryn. Additionally, SSE4.2, a second subset consisting of the 7 remaining instructions, is first available in Nehalem-based Core i7. Intel credits feedback from developers as playing an important role in the development of the instruction set.

AMD supports 4 instructions from the SSE4 instruction set, but have also added four new SSE instructions, naming the group SSE4a. These instructions are not found in Intel's processors supporting SSE4.1 and AMD processors only started supporting Intel's SSE4.1 and SSE4.2 in the Bulldozer-based FX processors. Support was added for SSE4a for unaligned SSE load-operation instructions (which formerly required 16-byte alignment).

If I understand this correctly, SSE4a does NOT implement all of Intel's SSE4.

On the AMD side, the full set of Intel's SSE4 instructions can only be used on the Bulldozer-based FX processors, which implement both SSE4.1 and SSE4.2.

If you are only using the POPCNT instruction, your program should run on all Intel processors that support SSE4.2 and on all AMD processors that support SSE4a.
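One way to avoid depending on the SSE4.2 / SSE4a naming at all is to test for POPCNT directly at run time, since the instruction has its own CPUID feature bit. The following is only a minimal sketch, assuming GCC or Clang on x86-64; the names popcount_soft, popcount_hw, cpu_has_popcnt and popcount64 are made up for illustration and do not come from any engine in this thread.

```cpp
#include <cstdint>

// Portable fallback: clears the lowest set bit once per loop iteration.
inline int popcount_soft(std::uint64_t b) {
    int n = 0;
    while (b) {
        b &= b - 1;
        ++n;
    }
    return n;
}

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#define HAS_X86_POPCNT_DISPATCH 1

// Only this function is compiled with POPCNT enabled, so the rest of the
// program still runs on CPUs that lack the instruction.
__attribute__((target("popcnt")))
inline int popcount_hw(std::uint64_t b) {
    return __builtin_popcountll(b);
}

// Asks the CPU (via CPUID) whether POPCNT is available.
inline bool cpu_has_popcnt() {
    return __builtin_cpu_supports("popcnt") != 0;
}
#else
// Assumption for this sketch: on other compilers/targets always fall back.
inline bool cpu_has_popcnt() { return false; }
#endif

// Hypothetical entry point an engine would call on its bitboards.
inline int popcount64(std::uint64_t b) {
#ifdef HAS_X86_POPCNT_DISPATCH
    static const bool use_hw = cpu_has_popcnt();  // detected once
    if (use_hw) return popcount_hw(b);
#endif
    return popcount_soft(b);
}
```

For example, popcount64(0xFF00) counts the eight set bits of a rank mask on either path. A real engine would more likely select the implementation once at startup (or at build time) instead of branching on every call.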

This is how I understand it.

For what it's worth: for me, prefetching inside make_move() immediately after calculating the new hashkey improves performance by about 2%, but only if I prefetch with HINT_NTA. Using other flags or moving the prefetch closer to the actual read access either gave inconsistent results or made no difference compared with not prefetching. I have not yet tried prefetching the tt entry before storing.
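The placement described above (issue the prefetch as soon as the new hash key is known, long before the probe) might look roughly like this. It is only a sketch under assumptions: TTEntry, TransTable and make_move_key are hypothetical names and layouts, not the actual code of any engine here, and the non-x86 branch assumes GCC or Clang.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
// Non-temporal prefetch hint (PREFETCHNTA on x86).
inline void prefetch_nta(const void* p) {
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_NTA);
}
#else
// Assumption: GCC/Clang builtin; locality 0 is the closest equivalent.
inline void prefetch_nta(const void* p) { __builtin_prefetch(p, 0, 0); }
#endif

// Hypothetical entry layout, for illustration only.
struct TTEntry {
    std::uint64_t key = 0;
    std::int16_t score = 0;
    std::uint8_t depth = 0;
    bool used = false;
};

class TransTable {
public:
    explicit TransTable(unsigned log2_entries)
        : entries_(std::size_t{1} << log2_entries),
          mask_(entries_.size() - 1) {}

    // Hint only: starts loading the cache line that holds the slot for key.
    void prefetch(std::uint64_t key) const {
        prefetch_nta(&entries_[key & mask_]);
    }

    void store(std::uint64_t key, int score, int depth) {
        TTEntry& e = entries_[key & mask_];
        e.key = key;
        e.score = static_cast<std::int16_t>(score);
        e.depth = static_cast<std::uint8_t>(depth);
        e.used = true;
    }

    // Returns the matching entry, or nullptr on a miss.
    const TTEntry* probe(std::uint64_t key) const {
        const TTEntry& e = entries_[key & mask_];
        return (e.used && e.key == key) ? &e : nullptr;
    }

private:
    std::vector<TTEntry> entries_;
    std::uint64_t mask_;
};

// Stand-in for the hash part of make_move(): update the key incrementally,
// then prefetch right away so the load overlaps the rest of the move update.
inline std::uint64_t make_move_key(const TransTable& tt, std::uint64_t key,
                                   std::uint64_t zobrist_delta) {
    const std::uint64_t new_key = key ^ zobrist_delta;
    tt.prefetch(new_key);
    // ... the board and piece-list updates would follow here ...
    return new_key;
}
```

The prefetch never changes program results, only timing, so whether it helps has to be measured on the target machine; switching the hint or moving the call is a one-line change in this layout.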

I think you need to use performance counters to properly test this (which I have not done).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: using Popcount and Prefetch with SSE4 hardware support

Post by bob »

Engin wrote: lol, "Removing the prefetch actually IMPROVES PERFORMANCE"

That's what I found in my experiments too :)

NTA is the non-temporal hint, which tries to minimize cache pollution, so you can completely disable that too.

The issue is that one has to be very careful when choosing to pre-fetch something you might not need, and replacing something that you might need. Done wisely, it should never hurt. Done carelessly, it can hurt as Linus pointed out...

Compilers are notoriously poor at dealing with this, because they don't understand the code...
Engin
Posts: 918
Joined: Mon Jan 05, 2009 7:40 pm
Location: Germany
Full name: Engin Üstün

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Engin »

Yeah, but we don't need all the SSE4 instructions for chess; we are mostly interested in popcount and prefetch. What confuses me is that on AMD it works fine, while on Intel it sometimes crashes.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: using Popcount and Prefetch with SSE4 hardware support

Post by Daniel Shawul »

I get a 3% speedup with prefetching now. I was prefetching the wrong table. Prefetching the main TT, which is not probed in qsearch, does not give any speedup, and in fact it may have slowed things down, since I was prefetching it where it was not needed. BUT prefetching the eval and pawn hash table entries combined gave me a good speedup. There is a lot of work to do before the eval and pawn hash tables are actually checked, which is what makes the difference: the prefetch has time to complete. Also, prefetching into L1 seems to give a small but measurable improvement over prefetching into L2/L3 only, so _MM_HINT_T0 may be better. I realized that I should have focused on prefetching in qsearch after I ran a cache simulation, which shows the eval cache to be the most critical. Avoiding L1 cache misses may also help, given the high D1mr counts (57k vs 25k). Here is the relevant part of the cachegrind output for probe_eval_hash and probe_hash.

3,900,855      8    8  1,447,712 31,470 25,258   520,518     12      0  hash.cpp:SEARCHER::probe_hash(int, unsigned long const&, int, int, int&, unsigned int&, int, int, int&, int&) 
 3,089,440      9    7  1,505,679  7,066     44   730,381  3,466     18  scorpio.h:SEARCHER::do_move(unsigned int const&) 
 2,966,961     28   18    434,388  2,826    180    88,488      0      0  attack.cpp:SEARCHER::is_legal_fast(unsigned int) const 
 2,480,202  4,268   87    431,564    154     20   120,372  1,284    106  moves.cpp:SEARCHER::gen_noncaps() 
 2,462,737     93   61    554,965  2,241      7   296,840    908      0  eval.cpp:SEARCHER::eval_pawn_cover(int, int, unsigned char*, unsigned char*) 
 2,399,779    597   15    541,011    392      3         0      0      0  util.cpp:SEARCHER::draw() const 
 2,063,724      1    1  1,119,950 70,376 57,085   232,308      0      0  hash.cpp:SEARCHER::probe_eval_hash(unsigned long const&, int&, int&, tagEVALREC&) 
 2,020,320      4    4    597,780  2,142      9   777,808    308      5  scorpio.h:SEARCHER::undo_move() 
 1,583,864     11   11    502,992  1,997     37   298,329      0      0  hash.cpp:SEARCHER::record_hash(int, unsigned long const&, int, int, int, int, unsigned int, int) 
 1,386,415      5    5    437,004      0      0   391,721      0      0  scorpio.h:search(PROCESSOR*) 
 1,063,726      1    1    169,138    202      0    50,160      0      0  scorpio.h:SEARCHER::get_qmove() 
 1,020,805      2    2    492,090 24,942  6,552   146,500     16      0  hash.cpp:SEARCHER::probe_pawn_hash(unsigned long const&, SCORE&, tagPAWNREC&) 
OTOH, popcnt is not giving any improvement on the Opteron for some reason :?
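As a rough illustration of the eval and pawn hash prefetch described in this post (both entries requested with the T0 hint before the other work, then probed later), here is a small sketch. Everything in it is an assumption made for the example: HashTable, ScoreEntry and cached_eval are invented names, not Scorpio's real data structures, and the non-x86 branch assumes GCC or Clang.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
// T0 hint: fetch into all cache levels, including L1.
inline void prefetch_t0(const void* p) {
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
}
#else
// Assumption: GCC/Clang builtin; locality 3 is the closest equivalent.
inline void prefetch_t0(const void* p) { __builtin_prefetch(p, 0, 3); }
#endif

// Hypothetical entry shared by the eval and pawn tables in this sketch.
struct ScoreEntry {
    std::uint64_t key = 0;
    std::int32_t score = 0;
    bool used = false;
};

struct HashTable {
    std::vector<ScoreEntry> slots;
    std::uint64_t mask;

    explicit HashTable(unsigned log2_size)
        : slots(std::size_t{1} << log2_size), mask(slots.size() - 1) {}

    void prefetch(std::uint64_t key) const { prefetch_t0(&slots[key & mask]); }

    void store(std::uint64_t key, int score) {
        ScoreEntry& e = slots[key & mask];
        e.key = key;
        e.score = score;
        e.used = true;
    }

    // Returns true and writes score on a hit, false on a miss.
    bool probe(std::uint64_t key, int& score) const {
        const ScoreEntry& e = slots[key & mask];
        if (!e.used || e.key != key) return false;
        score = e.score;
        return true;
    }
};

// Stand-in for the start of qsearch/eval: request both entries first, do the
// unrelated work, and only then read the tables.
inline int cached_eval(const HashTable& eval_tt, const HashTable& pawn_tt,
                       std::uint64_t pos_key, std::uint64_t pawn_key) {
    eval_tt.prefetch(pos_key);
    pawn_tt.prefetch(pawn_key);
    // ... draw checks, move generation setup, etc. would run here ...
    int score = 0;
    if (eval_tt.probe(pos_key, score)) return score;  // full eval cached
    if (pawn_tt.probe(pawn_key, score)) return score; // pawn term cached
    return 0;  // placeholder for computing the evaluation from scratch
}
```

The point of the ordering is only that the two cache lines are requested well before they are read; whether T0 or another hint is better is something the cachegrind-style measurements above would have to confirm for a given engine.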