Improving TT replacement scheme

tcusr
Posts: 325
Joined: Tue Aug 31, 2021 10:32 pm
Full name: tcusr

Re: Improving TT replacement scheme

Post by tcusr »

hgm wrote: Fri Jun 24, 2022 9:28 am Also note that calculating the modulo by a number that is not known at compile time is very expensive (about as much as 70 instructions). And your TT size is likely not known at compile time, if your engine is UCI and supports the Hash option. Bob Hyatt reported once that Crafty slowed down by some 30% by using modulo instead of the usual ANDing with a mask and limiting the TT size to a power of 2. If you really want to allow other TT sizes, there are faster methods to obtain an index than using modulo TT_size.
There is no speedup for me and I don't understand why.
I use a size known at compile time, but it's hidden behind some abstractions, so the compiler is not able to optimize it; I verified this by looking at the assembly.
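
For reference, one well-known way to map a key onto an arbitrary (non-power-of-2) table size without a division is a fixed-point multiply: take the high half of key * tt_size. A minimal sketch, not necessarily the method hgm had in mind (it uses the GCC/Clang __int128 extension):

Code: Select all

    #include <cstdint>

    // Maps a 64-bit key uniformly onto [0, tt_size) without a division:
    // keep the high 64 bits of the 128-bit product key * tt_size.
    inline uint64_t hash_index(uint64_t key, uint64_t tt_size)
    {
        return (uint64_t)(((unsigned __int128)key * tt_size) >> 64);
    }
A single widening multiply is far cheaper than the ~70-instruction division mentioned above, and it leaves the Hash option free to be any size.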
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Re: Improving TT replacement scheme

Post by jmcd »

So I think I have made all of the major changes suggested so far in this thread. Please correct me if I'm wrong.

Code: Select all

    constexpr size_t tt_size = 2097152;
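    // 2097152 = 2^21 buckets; tt_size must stay a power of two for the mask in hash_index() below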
    
    inline int TTable::hash_index(Key key) const {
        return key & (tt_size - 1);
    }
    
     Bucket* ht;

    struct TTEntry {
        Key key;        // 8 bytes
        uint8_t depth;  // 1 byte
        uint8_t flags;  // 1 byte
        short eval;     // 2 bytes
        Move move;      // 4 bytes
     };

    struct Bucket {
        TTEntry e1;
        TTEntry e2;
    };

    TTable::TTable()
    {
        ht = new Bucket[tt_size];
    }

    // empty transposition table
    void TTable::clear()
    {
        std::memset(ht, 0, sizeof(ht));
    }

    TTEntry* TTable::probe(Key key, bool& found) 
    {
        Bucket* b = ht + hash_index(key);
        found = (b->e1.key == key);
        if (found)
            return &b->e1;
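        // aging: on a probe miss, decay the first entry's stored depth so that
        // a stale deep entry eventually becomes replaceable in new_entry()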
        if(b->e1.depth > 0)
            --b->e1.depth;
        found = (b->e2.key == key);
        return &b->e2;
    }

    void TTable::new_entry(Key key, int d, int e, HashFlag f, Move m)
    {
        Bucket* b = ht + hash_index(key);
        if(b->e1.depth <= d)
            b->e1 = TTEntry(key, d, e, f, m);
        else
            b->e2 = TTEntry(key, d, e, f, m);
    }
After all this, though, it is still weaker than the "always replace" scheme. I assume that if there is no problem in the above code, then there must be an issue in the circumstances where I use the transposition table. I only create new transposition entries in 3 places in the search: 1. beta cutoffs in negamax, 2. at the end of the move loop in negamax once all moves have been searched, and 3. beta cutoffs in quiescence search.

Code: Select all

	    // beta cutoffs in negamax
	    tt.new_entry(pos.get_key(), depth, beta, HASH_BETA, curr_move.m);

	    // end of negamax move loop; stores both hash alpha and hash exact entries, regardless of whether a new pv was found
            tt.new_entry(pos.get_key(), depth, alpha, eval_type, best_move);

	    // quiescent beta cutoff
	    tt.new_entry(pos.get_key(), 0, beta, HASH_BETA, curr_move.m);
Edit: wait, is it because I am entering a score of alpha during fail lows at the end of the negamax move loop? In the case where no move can even meet alpha, I suppose this would be overstating the strength of that position.
Clovis GitHub
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Improving TT replacement scheme

Post by algerbrex »

jmcd wrote: Fri Jun 24, 2022 10:01 pm Edit: wait, is it because I am entering a score of alpha during fail lows at the end of the negamax move loop? In the case where no move can even meet alpha, I suppose this would be overstating the strength of that position.
Don't think that should be an issue, Jonathan.

I do the same in my search, and it makes sense: when you visit that same position again and want to use the hash table, you'll be able to return early with a value of alpha, since you know from searching the position before that the node failed low and that the current alpha, or an even worse score, was the best you could achieve.
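
For context, a minimal sketch of the probe-side logic this relies on, written against the TTable/TTEntry code posted above (HASH_ALPHA and HASH_EXACT are assumed flag names; only HASH_BETA appears in the snippets so far):

Code: Select all

    // Sketch only: near the top of negamax, before searching any moves.
    // (Mate-score adjustment omitted for brevity.)
    bool found;
    TTEntry* tte = tt.probe(pos.get_key(), found);
    if (found && tte->depth >= depth)
    {
        if (tte->flags == HASH_EXACT)                        // exact score stored
            return tte->eval;
        if (tte->flags == HASH_ALPHA && tte->eval <= alpha)  // earlier fail-low: score is an upper bound
            return alpha;
        if (tte->flags == HASH_BETA && tte->eval >= beta)    // earlier fail-high: score is a lower bound
            return beta;
    }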
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Re: Improving TT replacement scheme

Post by jmcd »

:| I found it....

Code: Select all

    void TTable::clear()
    {
        std::memset(ht, 0, sizeof(ht));
    }
ht is a pointer, so sizeof(ht) is just the size of the pointer rather than of the whole table; the clear function wasn't working properly.
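
For reference, a corrected version just has to pass the real table size; a sketch based on the declarations posted earlier in the thread:

Code: Select all

    // empty transposition table: sizeof(ht) would only give the pointer size,
    // so the byte count has to come from the bucket count
    void TTable::clear()
    {
        std::memset(ht, 0, tt_size * sizeof(Bucket));
    }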
Clovis GitHub
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Improving TT replacement scheme

Post by lithander »

hgm wrote: Fri Jun 24, 2022 3:31 pm Your way of determining age seems a bit sub-optimal. Low-depth probes are enormously more frequent than those with very high depth (e.g. d=0 vs d=8), so a d=8 entry would get way more than 256 misses by d=0 probes before it gets a chance of being probed once.
Intuitively I agreed, but I looked into the numbers today and I'm not so sure anymore.

I don't have any TT interactions in Qsearch, so despite searching 5M nps I measured only 500K overwrites per second. Even with a rather small hash table size of only 50MB, that means I can store 3.2M entries of 16 bytes each: each entry only risks being overwritten once every few seconds.
Most slots are filled with shallow entries, so for a precious high-depth entry to get overwritten it needs to be the n-th replacement attempt, where n is the depth of the entry. That means that in 99% of cases a high-depth entry will only get overwritten after it hasn't been found useful for a minute or so.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Improving TT replacement scheme

Post by hgm »

The number of unique positions in a search tree is typically 10% of its size. The 5M nps and the 500K overwrites show this is also the case here. It would take 6.5 sec to fill the hash table. So you will only get significant replacement of entries used in the current search when you search longer than 6 sec. And you would probably need 10 times larger searches before you start to see the difference between good replacement / aging and somewhat poorer replacement (or even just always-replace).

I agree that memory nowadays is so cheap and abundant that it is unlikely that replacement will be any issue at all during games. However, there are also people that analyze a position for hours, or even days. (I have even done that myself, with some of my engines.) And if you don't make it work well when it really matters, you might as well not bother at all and just use always-replace. Micro-Max gets by with that well enough.

To make the most of the table, you would want each higher depth to survive about 3 times longer (if the EBF = 3) than the next-lower depth. Because the time it takes to reach a transposition is proportional to the size of the sub-tree. Your method of using the number of probes as the timescale could work optimally, if you replace d=N not after N misses, but after 3^N misses. But a byte-size counter is not able to count far enough as would be needed for d > 5.
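
For illustration, this is roughly what the literal 3^N rule would require: a per-entry miss counter wider than a byte and a small threshold table (the names below are just placeholders):

Code: Select all

    #include <cstdint>

    // Thresholds 3^d for d = 0..20; 3^20 is about 3.5 billion,
    // so the per-entry miss counter would have to be 32 bits wide.
    static const uint32_t replace_after[21] = {
        1u, 3u, 9u, 27u, 81u, 243u, 729u, 2187u, 6561u, 19683u,
        59049u, 177147u, 531441u, 1594323u, 4782969u, 14348907u,
        43046721u, 129140163u, 387420489u, 1162261467u, 3486784401u
    };

    // Replace a depth-d entry only after it has missed 3^d probes.
    inline bool should_replace(uint32_t misses, int d)
    {
        return misses >= replace_after[d < 20 ? d : 20];
    }
Needing those extra bits in every entry is exactly the cost the undercut idea below avoids.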

This is why I like the 'undercut replacement': there the event that determines whether you will replace is a probe at the next-lower depth (rather than at any depth). And the frequency of such probes will also decrease with the EBF when the depth gets higher. And even when you don't want to replace on every such probe, but (say) on half of those, you could make that statistical, and replace with a 50% probability. (E.g. based on the lowest bit of the node counter.) Then you don't need an aging counter in the entry at all, saving space, and avoiding dirtying the cache line other than when a replacement occurs.
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Improving TT replacement scheme

Post by lithander »

hgm wrote: Sat Jun 25, 2022 3:51 pm To make the most of the table, you would want each higher depth to survive about 3 times longer (if the EBF = 3) than the next-lower depth. Because the time it takes to reach a transposition is proportional to the size of the sub-tree. Your method of using the number of probes as the timescale could work optimally, if you replace d=N not after N misses, but after 3^N misses. But a byte-size counter is not able to count far enough as would be needed for d > 5.
I changed my current implementation to

Code: Select all

return (++e0.Age - e0.Depth * e0.Depth) > (++e1.Age - e1.Depth * e1.Depth) ? index : index ^ 1;
which is equivalent to what you suggested, but assuming an EBF of 2. It would work as intended for depths up to 16. 3^N might be even better, but this should already be a big improvement over what I originally did.

But it proved quite hard to demonstrate that superiority in self-play. Even with 8 cores playing games all day I can't really play enough long-time-control games to prove its worth. On shorter time controls (e.g. 5s+200ms increment or even 10s+2s) both versions perform equally.

So I resolved to do a synthetic test: search the 300 WAC positions to a fixed depth of 20. With a 50MB hash size the new code needs 20% fewer nodes on average. And the advantage really starts to shine when you limit the TT to only 5 MB of space.

My original code took 196 minutes to complete. The revised version with the depth squared completed after 136 minutes. I didn't test a version with the 3^N you suggested because it would have required bigger changes.
hgm wrote: Sat Jun 25, 2022 3:51 pm This is why I like the 'undercut replacement': there the event that determines whether you will replace is a probe at the next-lower depth (rather than at any depth). And the frequency of such probes will also decrease with the EBF when the depth gets higher. And even when you don't want to replace on every such probe, but (say) on half of those, you could make that statistical, and replace with a 50% probability. (E.g. based on the lowest bit of the node counter.) Then you don't need an aging counter in the entry at all, saving space, and avoiding dirtying the cache line other than when a replacement occurs.
That's clever! :D I'm currently running the same synthetic test (hashsize=5MB, depth=20) for the undercut approach, and it's too early to say for sure, but even the simplest implementation is already looking good!

Code: Select all

 
        static int Index(in ulong hash, int newDepth)
        {
            int index = (int)(hash % (ulong)_table.Length);
            ref HashEntry e0 = ref _table[index];
            ref HashEntry e1 = ref _table[index ^ 1];

            if (e0.Hash == hash)
                return index;

            if (e1.Hash == hash)
                return index ^ 1;

            //choose the shallower entry unless the new entry's depth is just one lower;
            //this is how we prevent stale, deep entries from sticking around forever.
            return (e0.Depth < e1.Depth || e0.Depth == newDepth + 1) ? index : index ^ 1;
        }
Would be perfect not only for Leorik but especially for MinimalChess.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Improving TT replacement scheme

Post by lithander »

The undercut approach is a regression at short time controls and with a small hash size.

Code: Select all

-each tc=5+0.250 book=varied.bin option.Hash=5 
Score of Leorik 2.1.9 vs Leorik 2.1.8: 1572 - 1648 - 2641  [0.494] 5861
Elo difference: -4.5 +/- 6.6, LOS: 9.0 %, DrawRatio: 45.1 %
It's also performing worse than the age counting when you do the deep searches on the 300 WAC positions (hashsize=5MB, depth=20) without clearing the hash table in between positions. In other words: Deep but stale entries now persist for too long.

This could be fixed by re-purposing the age field to store a global counter that you increment each new search like hgm suggested earlier. That would allow you to identify old positions and overwrite them. But then it's not a simplification anymore.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Improving TT replacement scheme

Post by lithander »

I have now settled on this replacement scheme, implementing HGM's two main suggestions!

Code: Select all

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        static int Index(in ulong hash, int newDepth)
        {
            int index = (int)(hash % (ulong)_table.Length);
            ref HashEntry e0 = ref _table[index];
            ref HashEntry e1 = ref _table[index ^ 1];

            if (e0.Hash == hash)
                return index;

            if (e1.Hash == hash)
                return index ^ 1;

            if (e0.Depth < e1.Depth)
                return index;

            //undercut replacement prevents deep entries from sticking around forever.
            if (newDepth >= e0.Depth - 1 - (byte)(_age - e0.Age))
                return index;

            return index ^ 1;
        }
It's still the same 2-bucket system. Each time an entry is stored it goes into one of two consecutive slots in the table. Either slot could be accessed 'first'; there's no difference between the two, and which one is which depends on the hash. Usually the one holding the shallower entry is chosen to be overwritten, but this rule alone is not enough to get rid of stale, deep entries. So I added something like the undercut replacement HGM explained in his earlier posts: if the depth of the new entry is equal to or just one below the stored depth, we may place it in the first bucket even if it is holding the deeper entry.

Without considering age this didn't remove stale entries aggressively enough (my last post was about that). I have since changed the way I define age. Instead of increasing an entry's age field on any miss and resetting it to 0 on a hit, it's now tied to a global byte counter that is only increased when a new search is started. When an entry is newly created or successfully retrieved (a hit), the age of the entry is set to this global value again. By subtracting the entry's age from the global age counter you know how many 'searches' ago an entry was made. I factor this into the 'undercut replacement' by lowering the threshold by the age of the entry. Assuming an EBF of 3, an entry from the previous search is 3x more likely to get replaced on a collision than one made or used in the current search. If it's two searches old it's already 9x as likely to be replaced on a collision, and so on...

This seems to perform well under all relevant conditions. On fast time controls it's equal (200ms is not enough for a better replacement scheme to make a difference), but on slower time controls and with only a small hash size (10s + 10s increment, 5MB hash) the new version starts to perform significantly better. I need to run more tests to know exactly how much better, but after 180 games it's +30 Elo. And last but not least: when you just run a search on the start position for a few minutes, it now takes less time to reach depth 23 than it previously took to reach depth 22.

Thanks for pointing out my old version's weakness and the tips on how to improve it! :)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Improving TT replacement scheme

Post by algerbrex »

lithander wrote: Mon Jun 27, 2022 2:02 am I have now settled on this replacement scheme, implementing HGM's two main suggestions!

Code: Select all

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        static int Index(in ulong hash, int newDepth)
        {
            int index = (int)(hash % (ulong)_table.Length);
            ref HashEntry e0 = ref _table[index];
            ref HashEntry e1 = ref _table[index ^ 1];

            if (e0.Hash == hash)
                return index;

            if (e1.Hash == hash)
                return index ^ 1;

            if (e0.Depth < e1.Depth)
                return index;

            //undercut replacement prevents deep entries from sticking around forever.
            if (newDepth >= e0.Depth - 1 - (byte)(_age - e0.Age))
                return index;

            return index ^ 1;
        }
It's still the same 2-bucket system. Each time an entry is stored it goes into one of two consecutive slots in the table. Either slot could be accessed 'first'; there's no difference between the two, and which one is which depends on the hash. Usually the one holding the shallower entry is chosen to be overwritten, but this rule alone is not enough to get rid of stale, deep entries. So I added something like the undercut replacement HGM explained in his earlier posts: if the depth of the new entry is equal to or just one below the stored depth, we may place it in the first bucket even if it is holding the deeper entry.

Without considering age this didn't remove stale entries aggressively enough (my last post was about that). I have since changed the way I define age. Instead of increasing an entry's age field on any miss and resetting it to 0 on a hit, it's now tied to a global byte counter that is only increased when a new search is started. When an entry is newly created or successfully retrieved (a hit), the age of the entry is set to this global value again. By subtracting the entry's age from the global age counter you know how many 'searches' ago an entry was made. I factor this into the 'undercut replacement' by lowering the threshold by the age of the entry. Assuming an EBF of 3, an entry from the previous search is 3x more likely to get replaced on a collision than one made or used in the current search. If it's two searches old it's already 9x as likely to be replaced on a collision, and so on...

This seems to perform well under all relevant conditions. On fast time controls it's equal (200ms is not enough for a better replacement scheme to make a difference), but on slower time controls and with only a small hash size (10s + 10s increment, 5MB hash) the new version starts to perform significantly better. I need to run more tests to know exactly how much better, but after 180 games it's +30 Elo. And last but not least: when you just run a search on the start position for a few minutes, it now takes less time to reach depth 23 than it previously took to reach depth 22.

Thanks for pointing out my old version's weakness and the tips on how to improve it! :)
Nice Thomas!

I was inspired by this thread, and a day or two ago I tested a basic two-bucket scheme in Blunder, where the second entry is always replaced and the first is depth-preferred. Age is very simple: it's a single-bit counter that's flipped between 0 and 1 each time the engine begins searching, and each entry made in the current search is stored with the current value of the counter. When considering whether a depth-preferred entry can be replaced, I simply compare the current search counter value with the one in the entry. If they match, which means the entry is from the current search, the entry is not overwritten. If they don't match, I know the entry was from the last search and can be overwritten.
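
For illustration, one way the store side of such a scheme can look (a C++ sketch with placeholder names, not Blunder's actual code):

Code: Select all

    #include <cstdint>

    struct Entry  { uint64_t key; uint8_t depth; uint8_t age; /* score, flag, move ... */ };
    struct Bucket { Entry depth_pref; Entry always; };

    uint8_t search_age;  // flipped between 0 and 1 each time a new search starts

    void store(Bucket& b, const Entry& fresh)  // fresh.age == search_age
    {
        // The depth-preferred slot may be overwritten if it is stale
        // (left over from the previous search) or no deeper than the new entry;
        // everything else falls through to the always-replace slot.
        if (b.depth_pref.age != search_age || fresh.depth >= b.depth_pref.depth)
            b.depth_pref = fresh;
        else
            b.always = fresh;
    }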

Even this very simple scheme of replacing entries just one search apart seemed to work quite well, and Blunder searched several positions faster. And while it looked good in testing at first (10+0.1s and a 32MB hash), indicating a gain of around 35 Elo, it slowly evaporated to a small, perhaps even negligible gain.

But by all other accounts, the changes made Blunder faster, and as you mentioned, I really should do longer time control testing to get an accurate estimate. For now, I accepted my current simple scheme into the code-base, even lieu of not being able to test it at the longer time-control it should be tested at. Soon I'll likely make sure to test a longer time control, and also change my approach to only starting to replace entries that are 2-4 searches old, instead of just 1 search, since I'm sure this makes things very inefficient.