OliverBr wrote: Wed May 26, 2021 1:27 pm
I have to remove 5.9.6 anyway, because for whatever reason very long tests indicate that 5.9.6 is weaker than 5.9.5.
It begins tourneys well, but with time the tide changes in favor of 5.9.5... very late actually.
So 5.9.5 is preferable at the moment.
I have never understood this. Sometimes, I actually feel like CuteChess is rigging the tournaments.
Engine R vs. A: +50 for A
Engine R vs. B: +20 for B
Engine R vs. C: -30 for C
Then I add a feature to Rustic that speeds it up but changes nothing else, such as PVS or aspiration windows, and test it in a gauntlet of 5000 games, 500 games per engine.
Engine R-II vs. A: +20 for A (30 Elo gain)
Engine R-II vs. B: -10 for B (30 Elo gain)
Engine R-II vs. C: -5 for C (30 Elo loss against C?!)
I've also seen instances where my engine is, for example, +30 against the field, with another 50 games to go per engine. It stays +30 against the field, but the entire field reshuffles, as if CuteChess thinks: "Nah. This distribution of points is not how I like it. Let's change it."
I've seen one instance where a 500-game match was equal up to game 450 (Rustic's opponent was around the same level on CCRL), after which Rustic suddenly lost almost 50 games in a row. That feels completely illogical, unless CuteChess happened to select 25 openings that Rustic can't play well yet.
There are also instances where engines are about equal to mine at 1m+0.6s or 2m+1s, but one of them completely falls apart at 10s+0.1s (losing, not forfeiting or crashing). Hint: it isn't mine...
Stuff like this makes testing feel completely arbitrary and illogical. I have neither the time nor the computer resources to test a new feature against 20+ engines (if I can even FIND 20+ engines in the 1800-2200 range that work well enough for such long sustained tests at fast TC), and then run 10K games per match.
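For a sense of scale on match lengths like the ones above, here is a rough sketch of how the Elo difference and an approximate 95% error margin can be estimated from one match's win/draw/loss counts. It uses the usual logistic Elo formula and a normal approximation. The function name `elo_with_margin` and the sample counts are made up for illustration and are not taken from any of the tests above.

```python
import math


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo_diff, margin) for one match, using a normal approximation.

    z=1.96 corresponds to a roughly 95% interval. Hypothetical helper,
    not part of CuteChess or any rating tool.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n  # average score per game, 0..1

    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    stderr = math.sqrt(variance / n)

    def to_elo(s):
        # Logistic Elo model; clamp to avoid division by zero at 0% or 100%.
        s = min(max(s, 1e-9), 1.0 - 1e-9)
        return -400.0 * math.log10(1.0 / s - 1.0)

    low = to_elo(score - z * stderr)
    high = to_elo(score + z * stderr)
    return to_elo(score), (high - low) / 2.0


# Made-up example: a dead-even 500-game match.
elo, margin = elo_with_margin(wins=200, draws=100, losses=200)
print(f"{elo:+.1f} Elo, +/- {margin:.1f}")
```

With these made-up counts the margin comes out at roughly ±27 Elo for a single 500-game match, so under this approximation, swings of 20 to 30 Elo against an individual opponent sit inside the noise. The margin shrinks only with the square root of the number of games.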