atomic TT

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: atomic TT

Post by cdani »

In Andscacs I use the lockless trick, but without even any xor: the key (uint32_t) + move (uint16_t) + static evaluation (int16_t) fit in a single uint64_t, and for the rest of the entry I ignore TT races:

Code:

struct thash{
	uint64_t w1;                 // packed: 32-bit key + 16-bit move + 16-bit static evaluation
	/* previously stored as separate fields:
	ClauGuardaHash clauhash;     // hash key
	Moviment millor_moviment;    // best move
	Valor valor;                 // static evaluation */
	uint8_t tipuspuntuaciohash;  // score bound type: tt_alpha, beta, exacte
	uint8_t edat;                // age
	Valor score;
	int16_t depth;
};
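For illustration, the packing and unpacking of w1 could look something like this (the exact bit order here is just an example, not necessarily the real layout):

Code:

#include <stdint.h>

// Pack the 32-bit key slice, the 16-bit move and the 16-bit static eval into one word.
static inline uint64_t pack_w1(uint32_t key32, uint16_t move, int16_t eval) {
    return (uint64_t)key32
         | ((uint64_t)move << 32)
         | ((uint64_t)(uint16_t)eval << 48);
}

static inline uint32_t w1_key (uint64_t w1) { return (uint32_t)w1; }
static inline uint16_t w1_move(uint64_t w1) { return (uint16_t)(w1 >> 32); }
static inline int16_t  w1_eval(uint64_t w1) { return (int16_t)(uint16_t)(w1 >> 48); }

Since w1 is one aligned 64-bit value, it is stored and loaded as a single operation on x86-64, so the key, move and eval that come back always belong together.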
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: atomic TT

Post by hgm »

So what exactly is the 'trick' then? Or do you mean that ignoring the problem completely is tricky?
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: atomic TT

Post by cdani »

Well, the "trick" is that I have the atomicity of part of the hash entry, the atomicity produced by storing a single uint64_t.

I think it is better than what is done in Stockfish, for example.
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: atomic TT

Post by hgm »

Doesn't the compiler do that automatically? How else could it store them? If the struct contains an int64 for the signature, C alignment rules force the address to be aligned to an 8-byte boundary, so if the entry is not larger than 16 bytes (which it almost never is), all other items must be packed in the second 8-byte machine word.

Or is it that you first make a copy of that part of the struct before accessing any of the individual fields? I guess accessing them separately is indeed ill-advised in a multi-threaded environment, and one should always make a copy first.

This guarantees that the score, move, flags etc. all belong with each other. But is that of any real help when they do not belong to the sought position? If I have a score that is completely wrong because it comes from another position, is there really any value in knowing correctly that it is (say) an upper bound? It seems to me that when you got it wrong, you got it wrong. There doesn't seem to be much benefit in getting it consistently wrong.

It would actually be much better if the key were distributed over the two int64 words of the entry. E.g. declare

Code:

typedef struct {
  int16 move;
  int16 score;
  int32 key[2];
  char flags;
  char depth;
  char age;
  char whatever;
} HashEntry;
If you align the hash table with cache-line boundaries this would make key[0] fall in the first word, and key[1] in the second word. You could then (after copying the entire entry) compare the key as

Code:

if(hashKey == *(int64 *)hashEntry.key) ...
to read the key[2] array as a single 64-bit quantity. If the two halves of the key came from different entries because of the race, the combined value is unlikely to match the key you are probing for.
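Put together, and after copying the entire entry first, such a probe could look something like this (the memcpy is just a portable way to do the cast above; the names are only for illustration):

Code:

#include <stdint.h>
#include <string.h>

typedef struct {
    int16_t move;
    int16_t score;
    int32_t key[2];   /* key[0] in the first 8-byte word, key[1] in the second */
    char    flags, depth, age, whatever;
} HashEntry;

static int probe(const HashEntry *table, size_t index, uint64_t hashKey, HashEntry *copy) {
    *copy = table[index];                        /* copy the whole entry before using any field */
    uint64_t stored;
    memcpy(&stored, copy->key, sizeof stored);   /* read key[2] as one 64-bit value */
    if (stored != hashKey)
        return 0;   /* wrong position, or the two halves came from different writes */
    return 1;
}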
Last edited by hgm on Mon Sep 14, 2015 3:28 pm, edited 2 times in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: atomic TT

Post by bob »

sje wrote:
bob wrote:One thing you MUST do to make this work well... each 4 byte lock MUST be in a separate 64 byte block of memory, so that you don't share locks within a cache line... The read-for-ownership type overhead becomes untenable...
This is apparently not a problem with 256 locks, even if there is shared cache-line residency. If it were otherwise, then there would be a big time difference between 256 regions and 512 regions -- there was none.

I started the testing with 64 regions and doubled that until there was no more time savings.

Perhaps the region count should be a one-time calculation at startup where it is given a value of 16 times the maximum in-use thread count or core count.

I've also run tests with up to 256 threads running on from one to 16 cores. It works surprisingly well, in part because a lock is held for only a few nanoseconds and so there's little chance of a timer interrupt suspending the locking thread.
256 locks = 8 cache lines. So there is contention when you get to a significant number of cores. But more importantly, you generate a TON of cache traffic. My 20 core box will definitely show the problem as you have 20 cores fighting over 8 cache blocks. Number of locks is irrelevant here, it is the cache blocks that are the issue.
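What I mean is something like this, assuming 64-byte cache lines (the names are just for illustration):

Code:

#include <stdatomic.h>

/* One lock per 64-byte line, so two cores spinning on different locks
   never drag the same cache block back and forth.                     */
typedef struct {
    _Alignas(64) atomic_uint lock;   /* 0 = free, 1 = held; padded out to 64 bytes */
} PaddedLock;

static PaddedLock locks[256];        /* 256 locks -> 256 cache lines */

static void acquire(PaddedLock *l) {
    while (atomic_exchange_explicit(&l->lock, 1u, memory_order_acquire))
        ;                            /* spin */
}

static void release(PaddedLock *l) {
    atomic_store_explicit(&l->lock, 0u, memory_order_release);
}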
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: atomic TT

Post by bob »

lucasart wrote:OK, so the short answer is no. I was a bit over optimistic here.

I will try to fit 4 TT entries and an atomic lock in a cache line, and see how it goes, performance-wise.
I'd bet that with a decent # of cores (16 and up) this will crash and burn performance-wise on a position like fine 70, since two cores can never have the same tt bucket in a local cache block at the same time due to the atomic access required for the embedded lock.
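To be concrete about what is being proposed, a bucket along these lines (the field widths are only an illustration) packs 4 entries plus the lock into one 64-byte line, and every probe or store then has to take that lock first:

Code:

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 12-byte entry; the exact fields don't matter for the layout argument. */
typedef struct {
    uint32_t key32;
    uint16_t move;
    int16_t  score;
    int16_t  eval;
    uint8_t  depth;
    uint8_t  flags;
} SlimEntry;

/* One bucket = one 64-byte cache line: lock + 4 entries + padding. */
typedef struct {
    _Alignas(64) atomic_uint lock;
    SlimEntry entry[4];
    char pad[64 - 4 * sizeof(SlimEntry) - sizeof(atomic_uint)];
} Bucket;

_Static_assert(sizeof(Bucket) == 64, "a bucket must be exactly one cache line");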
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: atomic TT

Post by sje »

bob wrote:256 locks = 8 cache lines. So there is contention when you get to a significant number of cores. But more importantly, you generate a TON of cache traffic. My 20 core box will definitely show the problem as you have 20 cores fighting over 8 cache blocks. Number of locks is irrelevant here, it is the cache blocks that are the issue.
A while back I ran a test on a quad core box with and without spinlocked regions. The test without spinlocks showed some data errors as expected. But it also ran only about three percent faster than the locking version. This was done running perft(), stashing/fetching perft(2) and deeper sums at every opportunity.

With the spinlocked region approach, I don't have to know processor-dependent details about 64-bit atomicity, I don't have to be concerned with 32-bit vs 64-bit architecture differences, and with 128-bit signatures, I don't need to worry about false positives or move verification. So, for only a few speed percentage points, I get peace of mind.
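Roughly, the scheme looks like this (the region count, signature layout and names here are only illustrative):

Code:

#include <stdatomic.h>
#include <stdint.h>

#define TABLE_BITS 16                      /* tiny table, just for the sketch */
#define REGIONS    256                     /* in practice chosen once at startup */

typedef struct { uint64_t sig[2]; uint64_t data; } XEntry;   /* 128-bit signature + payload */

static XEntry      table[1u << TABLE_BITS];
static atomic_uint region_lock[REGIONS];   /* 0 = free, 1 = held */

static void store_entry(uint64_t sig0, uint64_t sig1, uint64_t data) {
    size_t   index = (size_t)(sig0 & ((1u << TABLE_BITS) - 1));
    unsigned r     = (unsigned)(index & (REGIONS - 1));       /* entry -> lock region */
    while (atomic_exchange_explicit(&region_lock[r], 1u, memory_order_acquire))
        ;                                                     /* held only briefly */
    table[index].sig[0] = sig0;
    table[index].sig[1] = sig1;
    table[index].data   = data;
    atomic_store_explicit(&region_lock[r], 0u, memory_order_release);
}

Fetching works the same way: take the region lock, copy the entry out, release, and only then compare the full 128-bit signature against the position.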
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: atomic TT

Post by cdani »

Previously I was storing field by field, and retrieving each field as needed, so maybe the key was the right one but the move was not, because in between another thread had modified it.

Now, storing and retrieving the whole uint64_t, I can be sure that at least this uint64_t is coherent. In fact, now I don't find moves unrelated to the position, which previously happened several times in a single game. That was my simple objective.

I know I could do something to make the second part of the hash entry safe as well, but what I already do is better, if I understand the code correctly, than what Stockfish does (nothing apart from validating the move), so it's good enough for me for the moment.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: atomic TT

Post by lucasart »

bob wrote:
lucasart wrote:OK, so the short answer is no. I was a bit over optimistic here.

I will try to fit 4 TT entries and an atomic lock in a cache line, and see how it goes, performance-wise.
I'd bet that with a decent # of cores (16 and up) this will crash and burn performance-wise on a position like fine 70, since two cores can never have the same tt bucket in a local cache block at the same time due to the atomic access required for the embedded lock.
We shall see. But I can only test up to 8 cores, and that's using HT…
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: atomic TT

Post by bob »

lucasart wrote:
bob wrote:
lucasart wrote:OK, so the short answer is no. I was a bit over optimistic here.

I will try to fit 4 TT entries and an atomic lock in a cache line, and see how it goes, performance-wise.
I'd bet that with a decent # of cores (16 and up) this will crash and burn performance-wise on a position like fine 70, since two cores can never have the same tt bucket in a local cache block at the same time due to the atomic access required for the embedded lock.
We shall see. But I can only test up to 8 cores, and that's using HT…
That's not going to say much. Each pair of logical cores has a shared L1 and L2 cache in that configuration, since each pair is really one physical core. Contention for locks won't be so noticeable.

If you have something you want to test, I can test on 20 real cores, if you can automate the testing to keep it simple...