Code: Select all
struct thash {
    uint64_t w1;
    /* ClauGuardaHash clauhash;
       Moviment millor_moviment;
       Valor valor;              // evaluation */
    uint8_t tipuspuntuaciohash;  // tt_alpha, beta, exacte (bound type: alpha / beta / exact)
    uint8_t edat;                // age
    Valor score;
    int16_t depth;
};
Code: Select all
typedef struct {
    int16 move;
    int16 score;
    int32 key[2];
    char  flags;
    char  depth;
    char  age;
    char  whatever;
} HashEntry;
Code: Select all
if (hashKey == *(int64 *)hashEntry.key) ...
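To make that probe concrete, here is a minimal sketch of a full-key comparison against such an entry. The probe helper, the hashTable/hashMask globals, and the use of <stdint.h> fixed-width types in place of the engine's int16/int32 typedefs are all assumptions made for illustration, not code from this thread.

Code: Select all
#include <stdint.h>
#include <string.h>

/* Same layout as the HashEntry above, but with standard fixed-width types
   (the engine's int16/int32 are assumed to be equivalent typedefs). */
typedef struct {
    int16_t move;
    int16_t score;
    int32_t key[2];   /* the full 64-bit hash key, stored as two words */
    char    flags;
    char    depth;
    char    age;
    char    whatever;
} HashEntry;

extern HashEntry *hashTable;  /* hypothetical: allocated at startup */
extern uint64_t   hashMask;   /* hypothetical: table size - 1, a power of two */

/* Return the entry if its stored key matches the probe key, else NULL. */
static HashEntry *probe(uint64_t hashKey)
{
    HashEntry *e = &hashTable[hashKey & hashMask];
    uint64_t stored;
    memcpy(&stored, e->key, sizeof stored);  /* portable form of the cast above */
    return stored == hashKey ? e : NULL;     /* note '==': this is a comparison */
}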
bob wrote:One thing you MUST do to make this work well... each 4 byte lock MUST be in a separate 64 byte block of memory, so that you don't share locks within a cache line... The read-for-ownership type overhead becomes untenable...

sje wrote:This is apparently not a problem with 256 locks, even if there is shared cache-line residency. If it were otherwise, there would be a big time difference between 256 regions and 512 regions -- there was none.
I started the testing with 64 regions and doubled that until there was no further time saving.
Perhaps the region count should be a one-time calculation at startup, set to 16 times the maximum in-use thread count or core count.
I've also run tests with up to 256 threads running on one to 16 cores. It works surprisingly well, in part because a lock is held for only a few nanoseconds, so there is little chance of a timer interrupt suspending the locking thread.

256 locks = 8 cache lines. So there is contention when you get to a significant number of cores. But more importantly, you generate a TON of cache traffic. My 20-core box will definitely show the problem, as you have 20 cores fighting over 8 cache blocks. The number of locks is irrelevant here; it is the cache blocks that are the issue.
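Bob's objection is about false sharing: 256 packed 4-byte locks occupy only 8 cache lines, so 16 unrelated locks bounce around in each 64-byte line. Below is a hedged sketch of the padded layout he describes, one lock per cache line; the region count, the names, and the use of C11 atomics are illustrative assumptions rather than either engine's code.

Code: Select all
#include <stdatomic.h>
#include <stdint.h>

#define NUM_REGIONS 256   /* illustrative; could be sized once at startup */
#define CACHE_LINE   64

/* Pad every lock out to a full cache line so that two threads spinning on
   different regions never invalidate each other's line (no false sharing). */
typedef struct {
    atomic_int lock;                           /* 0 = free, 1 = held */
    char pad[CACHE_LINE - sizeof(atomic_int)];
} RegionLock;

static _Alignas(CACHE_LINE) RegionLock regionLocks[NUM_REGIONS];

static inline void lock_region(uint64_t hashKey)
{
    atomic_int *l = &regionLocks[hashKey & (NUM_REGIONS - 1)].lock;
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(l, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;                          /* lost the race: reset and spin */
}

static inline void unlock_region(uint64_t hashKey)
{
    atomic_int *l = &regionLocks[hashKey & (NUM_REGIONS - 1)].lock;
    atomic_store_explicit(l, 0, memory_order_release);
}

Without the pad member the 256 locks collapse back onto 8 lines and every acquisition drags a whole line between cores in exclusive state; the padding costs 16 KB (256 × 64 bytes) but lets each spinning core stay on its own line.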
lucasart wrote:OK, so the short answer is no. I was a bit over-optimistic here.
I will try to fit 4 TT entries and an atomic lock in a cache line, and see how it goes, performance-wise.

I'd bet that with a decent number of cores (16 and up) this will crash and burn performance-wise on a position like fine 70, since two cores can never have the same TT bucket in a local cache block at the same time, due to the atomic access required for the embedded lock.
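For what it's worth, here is a sketch of what "four TT entries plus an atomic lock in one cache line" could look like. The field widths, names, and the 12-byte slot layout are hypothetical, chosen only so the arithmetic comes out to exactly 64 bytes; lucasart's actual layout isn't shown in the thread.

Code: Select all
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

/* Hypothetical 12-byte slot: a 32-bit key signature plus move/score/depth/flags. */
typedef struct {
    uint32_t key32;   /* upper 32 bits of the hash key as a signature */
    uint16_t move;
    int16_t  score;
    int8_t   depth;
    uint8_t  flags;   /* bound type and age packed together */
    uint16_t pad;
} TTEntry;

/* One bucket = one cache line: four slots, the lock, and filler. */
typedef struct {
    TTEntry    slot[4];   /* 48 bytes */
    atomic_int lock;      /*  4 bytes on typical ABIs: 0 = free, 1 = held */
    char       fill[12];
} TTBucket;

static_assert(sizeof(TTBucket) == 64, "bucket should be exactly one cache line");

/* The bucket array itself must also start on a 64-byte boundary, e.g. (C11)
   TTBucket *table = aligned_alloc(64, numBuckets * sizeof(TTBucket)); */

This is also where the objection above bites: every probe must touch the embedded lock atomically, so a hot bucket (as in fine 70) ping-pongs between cores in exclusive state instead of being shared read-only.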
bob wrote:256 locks = 8 cache lines. So there is contention when you get to a significant number of cores. But more importantly, you generate a TON of cache traffic. My 20-core box will definitely show the problem, as you have 20 cores fighting over 8 cache blocks. The number of locks is irrelevant here; it is the cache blocks that are the issue.

A while back I ran a test on a quad-core box with and without spinlocked regions. The test without spinlocks showed some data errors, as expected, but it also ran only about three percent faster than the locking version. This was done running perft(), stashing and fetching perft(2)-and-deeper sums at every opportunity.
bob wrote:I'd bet that with a decent number of cores (16 and up) this will crash and burn performance-wise on a position like fine 70, since two cores can never have the same TT bucket in a local cache block at the same time, due to the atomic access required for the embedded lock.

We shall see. But I can only test up to 8 cores, and that's using HT…
lucasart wrote:We shall see. But I can only test up to 8 cores, and that's using HT…

That's not going to say much. Each pair of cores shares an L1 and L2 cache in that configuration, since each pair is really one physical core. Contention for locks won't be so noticeable.