Hash cache

hgm · Post by **hgm** » Mon Oct 12, 2015 2:41 pm

Would it pay to keep a small software cache where you store the most-recently probed engines of the main TT, that is so small that it can permanently reside in one of the hardware caches? The hardware cache would of course also keep most-recently probed entries in cache, but can do so less efficiently, because of the length of the cache line: if you have 16 bytes per entry every accessed entry would bring 3 others with it in cache, an it would be quite unlikely that any of these would be needed. Picking out only the entry that you need and copy it to an in-cache table that only contains entries you need would use cache memory 4x more efficiently. So probed entries could be kept up to 4 times longer in cache as when you would leave it up to the hardware.

You could implement this as a write-through cache with always-replace, or least-recent-used-of-two replacement, so that the contents of the TT cache always matches that of the main TT. You would always probe the TT cache, and only if that was a miss, probe the main TT (and copy it to the TT cache if it was a hit).

An alternative would be to have an exclusive cache (which could hold stuff not in the main TT) only used for probing and storing in QS nodes. (Again, small enough to always reside in cache.) QS trees branch not nearly as much as full-width nodes, so you can store a tree of sizable depth in a comparatively small table (perhaps already 6 levels in 2K entries), and catch all transposiitons within it. This is a less drastic solution than not probing TT in QS at all.

Aleks Peshkov · Post by **Aleks Peshkov** » Mon Oct 12, 2015 4:01 pm

QS node easily fits in 8 bytes (no need to keep 'depth' and 'age' fields).

Joost Buijs · Post by **Joost Buijs** » Mon Oct 12, 2015 4:18 pm

hgm wrote:Would it pay to keep a small software cache where you store the most-recently probed engines of the main TT, that is so small that it can permanently reside in one of the hardware caches? The hardware cache would of course also keep most-recently probed entries in cache, but can do so less efficiently, because of the length of the cache line: if you have 16 bytes per entry every accessed entry would bring 3 others with it in cache, an it would be quite unlikely that any of these would be needed. Picking out only the entry that you need and copy it to an in-cache table that only contains entries you need would use cache memory 4x more efficiently. So probed entries could be kept up to 4 times longer in cache as when you would leave it up to the hardware.

You could implement this as a write-through cache with always-replace, or least-recent-used-of-two replacement, so that the contents of the TT cache always matches that of the main TT. You would always probe the TT cache, and only if that was a miss, probe the main TT (and copy it to the TT cache if it was a hit).

An alternative would be to have an exclusive cache (which could hold stuff not in the main TT) only used for probing and storing in QS nodes. (Again, small enough to always reside in cache.) QS trees branch not nearly as much as full-width nodes, so you can store a tree of sizable depth in a comparatively small table (perhaps already 6 levels in 2K entries), and catch all transposiitons within it. This is a less drastic solution than not probing TT in QS at all.

Iv'e been thinking about a separate cache for storing QS nodes as well.
Not because it can eventually reside in the hardware cache, but because storing all these QS nodes (there are a lot of them) is polluting my main TT.

On average I don't see a lot of difference in playing strength between using or not using the TT in quiescense.
When I use the TT in quiescensce the slowdown is about 25%, together with the pollution it annihilates the gain I get from using it.
By using large pages the situation gets somewhat better because it lowers the burden upon the TLB.

Using a small separate cache for the QS sounds like something I want to try.
I don't expect that keeping a small cache with the most recently probed entries from the main TT will give you any gain because overhead in the main search is hardly noticeable in the engine overall.
The largest portion of the time is spent in the QS, at least is this the case in my engine.

Joost Buijs · Post by **Joost Buijs** » Mon Oct 12, 2015 4:26 pm

Aleks Peshkov wrote:QS node easily fits in 8 bytes (no need to keep 'depth' and 'age' fields).

I don't know if you can omit depth.
For instance at depth 0 I do pawn moves to the 7th rank.
At depths 0 and -1 I also do checking moves.
I have to distinguish between depths 0, -1, and -2.

I also use the move from the TT in quiescense and try that first, it decreases the number of nodes by a small percentage.

hgm · Post by **hgm** » Mon Oct 12, 2015 6:13 pm

Well, 4 bytes signature, 2 bytes score, 2 bytes move would make 8 bytes. The move only needs 12 bits (prune under-promotions in QS), so you would have 4 bits left. Two bits for the depth, and two bits for the bound type.

For purists that do want to be able to encode under-promotions: there are only 3 bound types, so setting the two bits encoding the bound type to the invalid combination can be used as an escape to flag an alternative encoding scheme, where the to-square encodes the bound type, the move direction (left, right or forward) and the promotion choice (N, B, R), 3x3x3 = 27 possibilities. It can be decoded by using it to index three 27-entry tables for flags, move and promo choice.

jdart · Post by **jdart** » Mon Oct 12, 2015 7:29 pm

With NUMA, local caches don't necessarily stay local, since the thread scheduler can migrate the thread to another CPU. You can avoid this by pinning theads to CPUs. But if you don't do this you may find more caching does not help as much as you might think.

--Jon

jdart · Post by **jdart** » Mon Oct 12, 2015 7:30 pm

I used to have a separate eval cache that was mostly used during quiescence (since the first thing all non-check q-search nodes is an evaluation). But I took it out and store the eval in the main hashtable now.

--Jon

Joost Buijs · Post by **Joost Buijs** » Mon Oct 12, 2015 7:35 pm

When you forget about minor promotions and you don't encode additional information in your move, it is possible to pack a move into 12 bits, sure.
Unfortunately I have some additional information stored in my moves like enpassant capture, castling and other stuff.
I don't want to use a different move format for quiescense.

A key of 32 bits is not going to work, even when you add the number of bits from your index this is way to unreliable.
I guess I need at least 12 bytes for 1 entry, maybe even more.

mvk · Post by **mvk** » Mon Oct 12, 2015 7:40 pm

A depth0-only cache-local mini table can be indexed by hash^alpha and store just the returned score, as long as you don't probe it in PV. No need to even clear it. That is how I do it. The only thing to confirm is that your zobrist scheme doesn't collide between ^alpha vs ^(-(alpha+1)) == ^~alpha, for example in the side-to-move term.

Joost Buijs · Post by **Joost Buijs** » Mon Oct 12, 2015 7:50 pm

jdart wrote:I used to have a separate eval cache that was mostly used during quiescence (since the first thing all non-check q-search nodes is an evaluation). But I took it out and store the eval in the main hashtable now.

--Jon

In the past I also had a separate eval cache, but not in the current engine.

There are some stupid things in my engine, it always calls the evaluation function including a pawn-table probe after a domove when it is not a checking move.
That means the time spent for probing the pawn-table and evaluation is wasted in case of a TT hit.
There is enough room for improvement.

Hash cache

Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache