using Popcount and Prefetch with SSE4 hardware support
-
- Posts: 918
- Joined: Mon Jan 05, 2009 7:40 pm
- Location: Germany
- Full name: Engin Üstün
Re: using Popcount and Prefetch with SSE4 hardware support
hi Robert,
many thanks for your commit. By 4.2 I mean the difference between the AMD and Intel SSE4 functions: on my AMD it means SSE4a, and on Intel it probably means SSE4.2,
so I am not very sure whether the two are compatible with each other.
-
- Posts: 918
- Joined: Mon Jan 05, 2009 7:40 pm
- Location: Germany
- Full name: Engin Üstün
Re: using Popcount and Prefetch with SSE4 hardware support
lol, "Removing the prefetch actually IMPROVES PERFORMANCE"
that's what I found in my experiments too.
NTA is the non-temporal hint (keep the data out of the cache as far as possible), so you can disable that completely too.
-
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: using Popcount and Prefetch with SSE4 hardware support
From Wikipedia:
Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in Penryn. Additionally, SSE4.2, a second subset consisting of the 7 remaining instructions, is first available in Nehalem-based Core i7. Intel credits feedback from developers as playing an important role in the development of the instruction set.
AMD supports 4 instructions from the SSE4 instruction set, but have also added four new SSE instructions, naming the group SSE4a. These instructions are not found in Intel's processors supporting SSE4.1 and AMD processors only started supporting Intel's SSE4.1 and SSE4.2 in the Bulldozer-based FX processors. Support was added for SSE4a for unaligned SSE load-operation instructions (which formerly required 16-byte alignment).
If I understand this correctly, SSE4a does NOT implement all of Intel's SSE4.
The full set of Intel's SSE4 instructions can only be used on Bulldozer-based FX processors. These processors also implement SSE4.1 and SSE4.2.
If you are only using the POPCNT instruction, your program should run on all Intel processors that support SSE4.2 and on all AMD processors that support SSE4a.
This is how I understand it.
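To make the POPCNT point concrete: the instruction has its own CPUID feature bit, separate from the SSE4.2 and SSE4a flags, so a program can test for it directly instead of guessing from the SSE4 level. Below is a minimal sketch, assuming GCC or Clang on x86. The names `popcount_sw`, `popcount_hw`, `cpu_has_popcnt` and `select_popcount` are made up for illustration and do not come from any engine in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Portable fallback: clears the lowest set bit once per iteration.
static int popcount_sw(std::uint64_t b) {
    int n = 0;
    while (b) {
        b &= b - 1;
        ++n;
    }
    return n;
}

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
// The target attribute lets this one function use POPCNT even when the
// rest of the program is built without -mpopcnt or -msse4.2.
__attribute__((target("popcnt")))
static int popcount_hw(std::uint64_t b) {
    return __builtin_popcountll(b);
}

// Queries the dedicated POPCNT feature bit at run time.
static bool cpu_has_popcnt() {
    return __builtin_cpu_supports("popcnt") != 0;
}
#else
// Assumption: on other compilers or targets we simply use the fallback.
static int popcount_hw(std::uint64_t b) { return popcount_sw(b); }
static bool cpu_has_popcnt() { return false; }
#endif

using PopcountFn = int (*)(std::uint64_t);

// Pick the implementation once at startup; call through the pointer afterwards.
static PopcountFn select_popcount() {
    PopcountFn fn = cpu_has_popcnt() ? popcount_hw : popcount_sw;
    assert(fn(0xF0F0ULL) == 8);  // cheap self-check of the chosen routine
    return fn;
}
```

Calling through a function pointer costs a little compared with an inlined POPCNT, so engines that care usually ship separate builds instead. The sketch is only meant to show that one binary can avoid an illegal-instruction crash on CPUs without POPCNT.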
For what it's worth: for me, prefetching inside make_move() immediately after calculating the new hashkey improves performance by about 2%, but only if I prefetch with HINT_NTA. Using other flags or moving the prefetch closer to the actual read access either gave inconsistent results or made no difference compared with not prefetching. I did not yet try prefetching the tt entry before storing.
I think you need to use performance counters to properly test this (which I have not done).
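For readers who want to try the same thing, here is a rough sketch of prefetching the transposition table slot right after the new hash key is known. It assumes a power-of-two table indexed by masking the key. `TTEntry`, `TransTable`, `after_key_update` and the `HAVE_MM_PREFETCH` macro are hypothetical names for illustration, not code from my engine. Only `_mm_prefetch` and `_MM_HINT_NTA` are the real intrinsics from `<xmmintrin.h>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA
#define HAVE_MM_PREFETCH 1
#endif

// Hypothetical 16-byte entry; real engines pack these fields differently.
struct TTEntry {
    std::uint64_t key = 0;
    std::int32_t  score = 0;
    std::uint16_t move = 0;
    std::uint8_t  depth = 0;
    std::uint8_t  flags = 0;
};

class TransTable {
public:
    // The entry count must be a power of two so that masking selects a slot.
    explicit TransTable(std::size_t entries) : table_(entries) {
        assert(entries != 0 && (entries & (entries - 1)) == 0);
    }

    TTEntry* slot(std::uint64_t key) {
        return &table_[key & (table_.size() - 1)];
    }

    // Ask the CPU to start loading the slot's cache line; no data is read here.
    void prefetch(std::uint64_t key) {
#ifdef HAVE_MM_PREFETCH
        _mm_prefetch(reinterpret_cast<const char*>(slot(key)), _MM_HINT_NTA);
#else
        (void)key;  // no prefetch intrinsic available on this target
#endif
    }

private:
    std::vector<TTEntry> table_;
};

// Sketch of the tail end of make_move(): the new key is known, the probe comes later.
inline void after_key_update(TransTable& tt, std::uint64_t new_key) {
    tt.prefetch(new_key);
}
```

The idea is that the work between make_move() and the actual probe hides the memory latency. Whether NTA or another hint wins is something to measure on your own hardware.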
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: using Popcount and Prefetch with SSE4 hardware support
Engin wrote:
lol, "Removing the prefetch actually IMPROVES PERFORMANCE"
that's what I found in my experiments too.
NTA is the non-temporal hint (keep the data out of the cache as far as possible), so you can disable that completely too.
The issue is that one has to be very careful when choosing to pre-fetch something you might not need, and replacing something that you might need. Done wisely, it should never hurt. Done carelessly, it can hurt as Linus pointed out...
Compilers are notoriously poor at dealing with this, because they don't understand the code...
-
- Posts: 918
- Joined: Mon Jan 05, 2009 7:40 pm
- Location: Germany
- Full name: Engin Üstün
Re: using Popcount and Prefetch with SSE4 hardware support
yeah, but we don't need all the SSE4 instructions for chess; we are mostly interested only in popcount and prefetch. I am confused that it works fine on AMD and sometimes crashes on Intel.
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: using Popcount and Prefetch with SSE4 hardware support
I got a 3% speedup with prefetching now. I was prefetching the wrong table. Prefetching the main TT, which is not probed in qsearch, does not give any speedup, and in fact it may have slowed things down since I was prefetching it where it was not needed. BUT prefetching the eval and pawn hash table entries combined gave me a good speedup. There is a lot of work to do before the eval and pawn hash tables are checked, which is what made the difference. Also, prefetching into L1 seems to give a slight but measurable improvement over prefetching into L2/L3 only, so MM_HINT_T0 may be better. I realized that I should have focused on prefetching in qsearch after I did a cache simulation, which shows the eval cache to be the most critical. Avoiding L1 cache misses may also help, given the high D1mr misses (57k vs 25k). Here is the relevant part of the cachegrind result for probe_eval_hash and probe_hash.
OTOH popcnt is not giving any improvement on the Opteron for some reason.
Code:
3,900,855 8 8 1,447,712 31,470 25,258 520,518 12 0 hash.cpp:SEARCHER::probe_hash(int, unsigned long const&, int, int, int&, unsigned int&, int, int, int&, int&)
3,089,440 9 7 1,505,679 7,066 44 730,381 3,466 18 scorpio.h:SEARCHER::do_move(unsigned int const&)
2,966,961 28 18 434,388 2,826 180 88,488 0 0 attack.cpp:SEARCHER::is_legal_fast(unsigned int) const
2,480,202 4,268 87 431,564 154 20 120,372 1,284 106 moves.cpp:SEARCHER::gen_noncaps()
2,462,737 93 61 554,965 2,241 7 296,840 908 0 eval.cpp:SEARCHER::eval_pawn_cover(int, int, unsigned char*, unsigned char*)
2,399,779 597 15 541,011 392 3 0 0 0 util.cpp:SEARCHER::draw() const
2,063,724 1 1 1,119,950 70,376 57,085 232,308 0 0 hash.cpp:SEARCHER::probe_eval_hash(unsigned long const&, int&, int&, tagEVALREC&)
2,020,320 4 4 597,780 2,142 9 777,808 308 5 scorpio.h:SEARCHER::undo_move()
1,583,864 11 11 502,992 1,997 37 298,329 0 0 hash.cpp:SEARCHER::record_hash(int, unsigned long const&, int, int, int, int, unsigned int, int)
1,386,415 5 5 437,004 0 0 391,721 0 0 scorpio.h:search(PROCESSOR*)
1,063,726 1 1 169,138 202 0 50,160 0 0 scorpio.h:SEARCHER::get_qmove()
1,020,805 2 2 492,090 24,942 6,552 146,500 16 0 hash.cpp:SEARCHER::probe_pawn_hash(unsigned long const&, SCORE&, tagPAWNREC&)
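The eval and pawn hash prefetch described in this post could be sketched roughly as follows. This is not the actual Scorpio code. `HashTable`, `EvalEntry`, `PawnEntry`, `prefetch_t0` and `prefetch_eval_tables` are invented names, and the entry layouts are placeholders. The only real API used is `_mm_prefetch` with `_MM_HINT_T0`, which asks for the line to be brought into all cache levels, L1 included.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#endif

// Hint that the cache line holding p should be pulled into every cache level.
inline void prefetch_t0(const void* p) {
#if defined(__SSE__) || defined(_M_X64)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#else
    (void)p;  // no prefetch intrinsic available on this target
#endif
}

// Hypothetical entry layouts, only here to make the sketch compile.
struct EvalEntry { std::uint64_t key = 0; std::int32_t score = 0; };
struct PawnEntry { std::uint64_t key = 0; std::int32_t mg = 0; std::int32_t eg = 0; };

template <typename Entry>
struct HashTable {
    std::vector<Entry> slots;

    // The slot count must be a power of two so that masking selects a slot.
    explicit HashTable(std::size_t n) : slots(n) {
        assert(n != 0 && (n & (n - 1)) == 0);
    }

    Entry& at(std::uint64_t key) { return slots[key & (slots.size() - 1)]; }
};

// Meant to be called on entry to a qsearch node, before move generation and
// other work, so the lines have time to arrive before evaluation probes them.
inline void prefetch_eval_tables(HashTable<EvalEntry>& eval_hash,
                                 HashTable<PawnEntry>& pawn_hash,
                                 std::uint64_t pos_key,
                                 std::uint64_t pawn_key) {
    prefetch_t0(&eval_hash.at(pos_key));
    prefetch_t0(&pawn_hash.at(pawn_key));
}
```

The gain depends on how much work sits between the prefetch and the probe, so rerun cachegrind (or hardware counters) after moving the call around.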