Fixing bug in TT-store function dropped playing strength...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Fixing bug in TT-store function dropped playing strength...

Post by mvanthoor »

lithander wrote: Mon Jul 12, 2021 3:30 pm But the decision between always-replace and your 2-bucket-system seems hard to make.
When it comes to testing engine changes, and not only TT changes, I've found it exceedingly hard to reliably test a code change at all. I went back and forth between versions, and the results are different every time.

Let's say I have a version X, and I test it against some random engine E for 4,000 games. In one run it obtains a rating performance of -35, in another run it's -45, and in yet another run it's -67, all with error margins of around 12 Elo. Should I go and test for 10,000 games? 20,000? If an engine starts out poorly, by losing 10 games in a row for example, it sometimes takes a thousand games to catch up to a run where it started out even, scoring for example 10-10 in the first 20 games.

This makes testing feel like a roll of the dice. I don't have the time to run 10,000 test games for each change I make.
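For what it's worth, those ±12 Elo error bars follow from the normal approximation of the match score. A quick sketch (the standard textbook formula, not necessarily the exact computation your tournament manager does):

```rust
/// Rough 95% confidence computation for a match result, using the
/// normal approximation. Returns (elo_difference, error_margin).
fn elo_and_margin(wins: u32, draws: u32, losses: u32) -> (f64, f64) {
    let n = (wins + draws + losses) as f64;
    let (w, d) = (wins as f64 / n, draws as f64 / n);
    let score = w + d / 2.0; // mean score per game
    // Per-game variance: E[X^2] - mean^2, with X in {0, 0.5, 1}.
    let variance = w + d / 4.0 - score * score;
    let std_err = (variance / n).sqrt();
    // Logistic Elo curve and its slope at the measured score.
    let elo = -400.0 * (1.0 / score - 1.0).log10();
    let slope = (400.0 / std::f64::consts::LN_10) / (score * (1.0 - score));
    (elo, 1.96 * std_err * slope)
}

fn main() {
    // A near-even 2000-game match with a ~31% draw ratio:
    let (elo, margin) = elo_and_margin(684, 629, 687);
    println!("Elo difference: {elo:.1} +/- {margin:.1}");
}
```

The margin shrinks with the square root of the game count, which is why halving the error bar needs four times the games.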
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
lithander
Posts: 880
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Fixing bug in TT-store function dropped playing strength...

Post by lithander »

Joost Buijs wrote: Mon Jul 12, 2021 3:52 pm Another thing is whether you store in quiescence or not, this will make a huge difference.
Ah, good point! I don't do that currently and will try it before making any rash decisions ;)
mvanthoor wrote: Mon Jul 12, 2021 3:57 pm I've found that it's exceedingly hard to just reliably test a code change.
I try to avoid changes that are expected to make only a 10 Elo difference and can only be tested in actual play. At the level our engines currently play at, there is still a lot of low-hanging fruit that can be picked for rather large gains. (Add null-move pruning if you want to see a huge jump in strength, for example.)

Some changes I make are not easily measurable in a gauntlet, but that's not always a problem. For example, if I expect no change in search behavior but just a speed improvement, that can be measured against a test set of positions.

If I expect better pruning but the same PVs to be found, that can also be verified against a set of test positions.

Or, if the change was about simplifying the code or making something more expressive, then the simplification is attractive enough to warrant a small price in lost strength: I say I'm willing to pay up to 20 Elo for it, and then I only have to run the test until the current result minus the current error margin is better than that. I'm generous with what I'm willing to pay for a simplification (that is MMC-specific, I guess), so these tests all conclude rather quickly.
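That stopping rule (accept once even the pessimistic bound of the confidence interval clears the Elo budget) could be sketched like this; the function name and the symmetric reject case are illustrative, not taken from MinimalChess:

```rust
/// Decide whether a simplification test can stop, given the measured
/// Elo difference and the current +/- error margin (both straight
/// from the tournament manager's output). `budget` is the Elo loss
/// we are willing to pay for the simplification (e.g. 20.0).
fn simplification_verdict(measured_elo: f64, error_margin: f64, budget: f64) -> Option<bool> {
    let lower = measured_elo - error_margin;
    let upper = measured_elo + error_margin;
    if lower > -budget {
        Some(true) // even the pessimistic bound is within budget: accept
    } else if upper < -budget {
        Some(false) // even the optimistic bound exceeds the budget: reject
    } else {
        None // interval still straddles the threshold: keep testing
    }
}
```

A generous budget makes the interval clear the threshold after relatively few games, which is why these tests conclude quickly.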

And last but not least you can make a few small changes in bulk and then measure the cumulative benefit. This should also provide a shortcut.

But your explicit goal of getting the most out of every feature you include sets the bar very high. If that's what you need to prove, then oftentimes there won't be any good shortcuts. For example, those fast games you run with only a 100 ms increment to get thousands of games quicker are probably hurting your measurements more than they help! Especially when it's about something like the TT. The error bars go down quickly, but your tests are now based on unrealistic conditions, and you won't know how well they transfer to more realistic scenarios like the rating lists we all care so much about. ;)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Fixing bug in TT-store function dropped playing strength...

Post by mvanthoor »

lithander wrote: Mon Jul 12, 2021 6:22 pm I try to avoid changes that will only make an expected 10 ELO difference and can only be tested in actual play. At the level our engines are currently playing at there are still a lot of low hanging fruits that can be picked for rather large gains. (Add nullMove pruning if you want to see a huge jump in strength for example)
Sure enough, but at the very least I want any bugs or inaccuracies to be gone. I changed the store function to this:

Code: Select all

    pub fn store(&mut self, verification: u32, data: D, used_entries: &mut usize) {
        let mut low = 0;
        let mut found: Option<usize> = None;

        // Find the index of the entry with the lowest depth.
        for (i, entry) in self.bucket.iter().enumerate() {
            // We found the position.
            if entry.verification == verification {
                found = Some(i);
                // So we can stop looking.
                break;
            }

            // We may have a new entry with lowest depth.
            if entry.data.depth() < self.bucket[low].data.depth() {
                low = i;
            }
        }

        // If the incoming position was found...
        if let Some(f) = found {
            // Then replace it if incoming depth is greater.
            if data.depth() > self.bucket[f].data.depth() {
                self.bucket[f] = Entry { verification, data };
            }
        } else {
            // Position not found. Overwrite entry with lowest depth.
            if self.bucket[low].verification == 0 {
                *used_entries += 1;
            }
            self.bucket[low] = Entry { verification, data };
        }
    }
Basically, it runs through the entries in the bucket and finds the one with the lowest depth, while also keeping an eye out for the incoming position: if that is found, its index is saved and the loop is broken. In the next part, the position is replaced if the incoming version of it has a higher depth; otherwise the incoming data is dropped, so there are no duplicates. If the position wasn't found, the incoming data is saved in the entry with the lowest depth. I have 4 buckets.

Now test with an 8MB TT...

Code: Select all

Score of Rustic Alpha 3.2.106 8M vs Rustic Alpha 3.2.105 8M: 684 - 687 - 629 [0.499]
...      Rustic Alpha 3.2.106 8M playing White: 392 - 300 - 308  [0.546] 1000
...      Rustic Alpha 3.2.106 8M playing Black: 292 - 387 - 321  [0.453] 1000
...      White vs Black: 779 - 592 - 629  [0.547] 2000
Elo difference: -0.5 +/- 12.6, LOS: 46.8 %, DrawRatio: 31.4 %
2000 of 2000 games finished.
LOL. The correct code is worth -0.5 Elo. Up until 1700 games or so, the 3.2.106 version had been 7-8 Elo ahead for the entire run. It seems that _HAVING_ a TT with two or more buckets is what matters: the replacement scheme doesn't seem to affect performance at all. The old function just put the incoming data into whatever entry happened to be the last one with a lower depth than the incoming data. The new function specifically updates an entry with a higher-depth version of the same position, or puts the data into the actual entry with the lowest depth. The first is almost a random replacement scheme; the second is the "correct" one, but with an 8MB TT (full at move 13...) they both seem to perform the same.

lithander wrote: Mon Jul 12, 2021 6:22 pm But your explicit goal to get the most out of every feature you include is setting the bar very high. If that's what you need to prove then oftentimes there won't be any good shortcuts. For example these fast games that you do with only 100ms increment to get thousands of games quicker are probably hurting your measurements more than they help! Especially when it is about something like the TT. The error bars go down quickly but your tests are now based on unrealistic conditions and you won't know how well they transfer to more realistic scenarios like the rating lists we all care so much about. ;)
Most engines seem to be tested at 10s + 0.1s. I don't know if your 5s + 0.5s is better... those 5 seconds are gone in the blink of an eye, and then the game is basically played at 0.5 seconds per move. To make sure the engine actually thinks for longer than 50 ms, maybe I'll test at 10s + 1s.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Angrim
Posts: 97
Joined: Mon Jun 25, 2012 10:16 pm
Location: Forks, WA
Full name: Ben Nye

Re: Fixing bug in TT-store function dropped playing strength...

Post by Angrim »

A possible issue with the new code is that it overwrites the shallowest entry found even when the node being stored is shallower than that entry. I would add a check that the new node was searched at least as deeply as the node you are considering replacing.
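A minimal, self-contained sketch of that guard, applied to the eviction branch of the store function above; the types here are simplified stand-ins, not Rustic's actual Entry/Bucket:

```rust
// Sketch of the suggested guard: when evicting the shallowest entry
// for an unknown position, only overwrite it if the incoming node was
// searched at least as deeply.
#[derive(Clone, Copy, Default)]
struct Entry {
    verification: u32,
    depth: u8,
}

struct Bucket {
    entries: [Entry; 4],
}

impl Bucket {
    fn store(&mut self, verification: u32, depth: u8) {
        let mut low = 0;
        let mut found = None;

        for (i, entry) in self.entries.iter().enumerate() {
            if entry.verification == verification {
                found = Some(i);
                break;
            }
            if entry.depth < self.entries[low].depth {
                low = i;
            }
        }

        match found {
            // Known position: only replace it with a deeper search.
            Some(f) if depth > self.entries[f].depth => {
                self.entries[f] = Entry { verification, depth };
            }
            Some(_) => (),
            // Unknown position: evict the shallowest entry, but only
            // if the incoming node is at least as deep (the guard).
            None if depth >= self.entries[low].depth => {
                self.entries[low] = Entry { verification, depth };
            }
            None => (),
        }
    }
}

fn main() {
    let mut bucket = Bucket {
        entries: [
            Entry { verification: 1, depth: 5 },
            Entry { verification: 2, depth: 3 },
            Entry { verification: 3, depth: 7 },
            Entry { verification: 4, depth: 2 },
        ],
    };
    // A depth-1 node no longer evicts the depth-2 entry...
    bucket.store(99, 1);
    assert_eq!(bucket.entries[3].verification, 4);
    // ...but a depth-4 node still does.
    bucket.store(99, 4);
    assert_eq!(bucket.entries[3].verification, 99);
}
```

Whether the guard gains anything in practice would of course have to be tested; as the thread shows, replacement-scheme details can be hard to measure.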