OliThink 5.9.5 is very small

OliverBr · Post by **OliverBr** » Wed May 26, 2021 1:27 pm

Ras wrote: ↑Wed May 26, 2021 11:45 am Hm besides the missing null pointer check in the memory allocation, 5.9.6 also has a memory leak because the old pointer is neither freed nor reallocated.

Thank you!
I have to withdraw 5.9.6 anyway because, for whatever reason, very long tests indicate that it is weaker than 5.9.5.
It starts tournaments well, but over time the tide turns in favor of 5.9.5, though only very late.

So 5.9.5 is preferable at the moment.
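
For reference, the two allocation issues Ras mentions (no null check, and an old pointer that is neither freed nor reallocated) are commonly avoided with a pattern like the one below. This is only a minimal sketch and not OliThink's actual code: the `Table` struct and the `table_resize` function are hypothetical names chosen for illustration.

```c
#include <stdlib.h>

/* Hypothetical container; OliThink's real data structures may differ. */
typedef struct {
    unsigned long long *entries;
    size_t count;
} Table;

/* Resize the table to new_count entries.
 * Returns 0 on success and -1 on failure. On failure the old buffer
 * is left untouched, so nothing is leaked or lost. */
int table_resize(Table *t, size_t new_count)
{
    unsigned long long *p;

    if (new_count == 0) {              /* release everything */
        free(t->entries);
        t->entries = NULL;
        t->count = 0;
        return 0;
    }
    if (new_count > (size_t)-1 / sizeof *t->entries)
        return -1;                     /* size computation would overflow */

    /* realloc either resizes the old block in place or moves it and
     * frees the old one, so the previous pointer is never leaked. */
    p = realloc(t->entries, new_count * sizeof *p);
    if (p == NULL)
        return -1;                     /* the null check that was missing */

    t->entries = p;
    t->count = new_count;
    return 0;
}
```

Newly added entries are uninitialized in this sketch; a real hash table would probably clear them before use.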

OliverBr · Post by **OliverBr** » Thu May 27, 2021 7:11 pm

OliverBr wrote: ↑Tue May 25, 2021 11:20 pm OliThink 5.9.5d is over 50 lines smaller and it is more elegant, but is this worth -7 ELO?
What do you think?

If you are interested, here is the commit that I did not include in 5.9.6:

https://github.com/olithink/OliThink/co ... c39904c6bc

Actually that commit has a small bug. The line

a = (t & 128) ? PCAP(f, c) : (t & 32) ? PCA3(f, c) : PCA4(f, c);

is supposed to be

a = (t & 128) ? PCAP(f, c) : (t & 32) ? PCA3(f, c) : (t & 64) ? PCA4(f, c) : 0LL;

and this bug is responsible for losing about 10 Elo.
The bug is a little bit mean, because in the analysis of a typical opening position it makes no difference for the first 20+ plies (see the PS below).
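
To make the difference between the two lines concrete, here is a small stand-alone sketch. It assumes nothing about what the flag bits 128, 32 and 64 mean in OliThink (the post does not say), and the three `*_stub` functions are hypothetical stand-ins for the `PCAP`, `PCA3` and `PCA4` macros that simply return distinct dummy values.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical stand-ins for PCAP, PCA3 and PCA4: each returns a
 * distinct dummy bitboard so the selected branch is visible. */
static u64 pcap_stub(void) { return 0x1ULL; }
static u64 pca3_stub(void) { return 0x2ULL; }
static u64 pca4_stub(void) { return 0x4ULL; }

/* Buggy form: any t without bit 128 or bit 32 lands in the PCA4 branch. */
u64 select_buggy(int t)
{
    return (t & 128) ? pcap_stub() : (t & 32) ? pca3_stub() : pca4_stub();
}

/* Fixed form: the PCA4 branch requires bit 64; otherwise the result is empty. */
u64 select_fixed(int t)
{
    return (t & 128) ? pcap_stub()
         : (t & 32)  ? pca3_stub()
         : (t & 64)  ? pca4_stub()
         : 0ULL;
}
```

Both versions agree whenever one of the three bits is set. They only differ when none of them is set: the buggy form still returns the `PCA4` result, while the fixed form returns an empty set.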

I will release 5.9.7 soon, once OliThink has shed more than 50 lines of code without losing any strength.

Note: 5.9.6 has some issues. I recommend 5.9.5 until 5.9.7 is released.

PS: Analysis after (1.e4 e5 2.d4 exd4 3.Qxd4 Nc6 4.Qe3) without this bug:

...
19    32    278   7310873  g8f6 f1e2 d7d6 g1h3 f8e7 b1c3 d6d5 c3d5 f6d5 e4d5 c6b4 e2d3 c8h3 e3h3 d8d5 d3f5 b4a2 e1g1 a2c1 f1c1 
20    31    342   9073293  g8f6 f1e2 d7d6 g1h3 f8e7 b1c3 d6d5 c3d5 f6d5 e4d5 c6b4 e3e4 e8g8 c1e3 d8d5 e4d5 b4d5 e1g1 f8d8 h3f4 
21    20    475  12746158  g8f6 f1e2 d7d6 g1h3 f8e7 b1c3 e8g8 h3f4 f6g4 e3d2 a7a6 f2f3 g4e5 e1g1 c8d7 f4d5 f7f5 d5e7 c6e7 f1d1 f5f4 
22    22   1307  35504681  g8f6 f1e2 d7d6 b1c3 f8e7 e3g3 c6d4 g3d3 d4e2 g1e2 e8g8 f2f3 f6d7 d3d4 d7c5 c1e3 c8d7 e1g1 e7h4 d4d5

With this bug, the analysis is identical through ply 21 (apart from timing) and only diverges at ply 22:

...
19    32    270   7310873  g8f6 f1e2 d7d6 g1h3 f8e7 b1c3 d6d5 c3d5 f6d5 e4d5 c6b4 e2d3 c8h3 e3h3 d8d5 d3f5 b4a2 e1g1 a2c1 f1c1 
20    31    335   9073293  g8f6 f1e2 d7d6 g1h3 f8e7 b1c3 d6d5 c3d5 f6d5 e4d5 c6b4 e3e4 e8g8 c1e3 d8d5 e4d5 b4d5 e1g1 f8d8 h3f4 
21    20    475  12746158  g8f6 f1e2 d7d6 g1h3 f8e7 b1c3 e8g8 h3f4 f6g4 e3d2 a7a6 f2f3 g4e5 e1g1 c8d7 f4d5 f7f5 d5e7 c6e7 f1d1 f5f4 
22    27   1149  31033266  f8e7 g1f3 g8f6 f1d3 d7d5 e4e5 f6d7 c2c3 a7a6 b2b4 d7b6 e1g1 e8g8 a2a4 c8g4 e3f4 g4f3 g2f3 d5d4

OliverBr · Post by **OliverBr** » Sun Jun 06, 2021 2:07 pm

OliThink 5.9.8 has been released.

It has 50 fewer lines of code, yet it is about 20 Elo stronger than 5.9.5.

mvanthoor · Post by **mvanthoor** » Sun Jun 06, 2021 2:32 pm

OliverBr wrote: ↑Wed May 26, 2021 1:27 pm I have to withdraw 5.9.6 anyway because, for whatever reason, very long tests indicate that it is weaker than 5.9.5.
It starts tournaments well, but over time the tide turns in favor of 5.9.5, though only very late.

So 5.9.5 is preferable at the moment.

I have never understood this. Sometimes, I actually feel like CuteChess is rigging the tournaments.

Engine R vs. A: +50 for A
Engine R vs. B: +20 for B
Engine R vs. C: -30 for C

Then I add a feature to Rustic that only speeds up the search without otherwise changing it, such as PVS or aspiration windows, and test it in a gauntlet of 5000 games, 500 games per engine.

Engine R-II vs. A: +20 for A (30 Elo gain)
Engine R-II vs. B: -10 for B (30 Elo gain)
Engine R-II vs. C: -5 for C (25 Elo loss against C?!)

I've also seen instances where my engine is, for example, +30 against the field, with another 50 games to go per engine. It stays +30 against the field, but the entire field reshuffles, as if CuteChess thinks: "Nah. This distribution of points is not how I like it. Let's change it."

I've seen one instance where a 500 game match was equal up to game 450 (Rustic's opponent was around the same level on CCRL), after which Rustic suddenly lost almost 50 games in a row. That feels completely illogical, except if CuteChess decided to select 25 openings in which Rustic can't play well yet.

There are also instances where engines are about equal to mine at 1m+0.6s or 2m+1s, but one of them completely falls apart at 10s+0.1s (losing, not forfeiting or crashing). Hint: it isn't mine...

Stuff like this makes testing feel completely arbitrary and illogical. I have neither the time nor the computer resources to test a new feature against 20+ engines (if I can even FIND 20+ engines in the 1800-2200 range that work well enough for such long sustained tests at fast TC), and then run 10K games per match.

OliverBr · Post by **OliverBr** » Sun Jun 06, 2021 3:06 pm

mvanthoor wrote: ↑Sun Jun 06, 2021 2:32 pm I have never understood this. Sometimes, I actually feel like CuteChess is rigging the tournaments.

I have often thought so too, but it actually isn't.

Then I add a feature to Rustic that only speeds up the search without otherwise changing it, such as PVS or aspiration windows, and test it in a gauntlet of 5000 games, 500 games per engine.

In my experience, 5000 games are often not enough to get a precise number. If two engines or versions are close to one another, you need at least 20,000 games, and the error is still about +/-5 Elo.

   # PLAYER             :  RATING  ERROR   POINTS  PLAYED   (%)     W     D     L  D(%)  CFS(%)
   1 OliThink 5.9.6f    :       2      4  10059.0   20000  50.3  6016  8086  5898  40.4      86
   2 OliThink 5.9.6e    :       0   ----   9941.0   20000  49.7  5898  8086  6016  40.4     ---

White advantage = 74.72 +/- 1.92
Draw rate (equal opponents) = 42.04 % +/- 0.34

(... if I can even FIND 20+ engines in the 1800-2200 range that work well enough for such long sustained tests at fast TC), and then run 10K games per match.

This is why I created this article, in which stable engines are listed for every league.

http://talkchess.com/forum3/viewtopic.php?f=2&t=75718

I have to update this list soon.

Ras · Post by **Ras** » Sun Jun 06, 2021 3:27 pm

mvanthoor wrote: ↑Sun Jun 06, 2021 2:32 pm 500 games per engine.

What would you expect? The error margin is roughly +/- 30 Elo at such a small number of games. Which basically means that even if you don't do any changes to your engine and repeat the match, you can expect the result to be different by 30 Elo, or in extremely unlucky cases even by 60 Elo (if the one result is on one end of the error interval and the next on the opposite).
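
As a rough cross-check of these figures, the number of games needed for a given error margin can be estimated with a normal approximation. The sketch below is not how Ordo or any other rating tool computes its error column. It assumes two near-equal opponents (about 50% score), linearises the Elo curve around that point, and uses a 95% interval. The function name `games_needed` and its parameters are made up for this example.

```c
/* Natural logarithm of 10, written out so the sketch needs no <math.h>. */
#define LN10 2.302585092994046

/* Slope of the Elo curve at a 50% score:
 * d(Elo)/d(score) = 400 / (ln(10) * p * (1 - p)) with p = 0.5. */
#define ELO_PER_SCORE (400.0 / (LN10 * 0.25))

/* Approximate number of games needed so that the 95% error margin is
 * about +/- margin_elo, assuming two near-equal opponents.
 * draw_rate is the fraction of drawn games (0.0 to 1.0). */
double games_needed(double margin_elo, double draw_rate)
{
    const double z = 1.96;                          /* 95% two-sided normal quantile */
    double var_per_game = (1.0 - draw_rate) / 4.0;  /* variance of one game's score at 50% */
    double k = z * ELO_PER_SCORE;

    /* margin = k * sqrt(var / N), solved for N */
    return k * k * var_per_game / (margin_elo * margin_elo);
}
```

Under these assumptions, `games_needed(30.0, 0.0)` comes out at a little over 500 games, and `games_needed(4.0, 0.404)` at roughly 17,000 games, which is in the same ballpark as the figures quoted in this thread. Because the margin enters squared, halving the error margin takes about four times as many games.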

mvanthoor · Post by **mvanthoor** » Sun Jun 06, 2021 4:14 pm

Ras wrote: ↑Sun Jun 06, 2021 3:27 pm
mvanthoor wrote: ↑Sun Jun 06, 2021 2:32 pm 500 games per engine.
What would you expect? The error margin is roughly +/- 30 Elo at such a small number of games. Which basically means that even if you don't do any changes to your engine and repeat the match, you can expect the result to be different by 30 Elo, or in extremely unlucky cases even by 60 Elo (if the one result is on one end of the error interval and the next on the opposite).

I know, but how do other people test their engines to obtain somewhat realistic ratings and progression per feature? Self-play is known to inflate the rating of the stronger engine. Running a 10 engine gauntlet with 5000 games per engine is completely infeasible, let alone 10K or 20K per engine.

I wanted to create a somewhat realistic progression table for new features by running gauntlets, but it seems it's not doable.

This also means that, in essence, CCRL could be considered useless, because they run only 32 games per match to test an engine. The only reason why the rating list may work is that all the engines play one another (at around the same strength, obviously), whereas in my gauntlets they don't.

So, what would be better:
- run a 10 engine gauntlet, with 500 games per engine...
- run an SPRT-test between two versions of the same engine, and if the new version is stronger, call it a day and release it, using the self-play Elo in the progression chart.

Rebel · Post by **Rebel** » Sun Jun 06, 2021 4:26 pm

OliverBr wrote: ↑Sun Jun 06, 2021 2:07 pm OliThink 5.9.8 has been released.

It has 50 fewer lines of code, yet it is about 20 Elo stronger than 5.9.5.

800 games.

http://rebel13.nl/a/grl.htm

Ras · Post by **Ras** » Sun Jun 06, 2021 5:47 pm

mvanthoor wrote: ↑Sun Jun 06, 2021 4:14 pm I know, but how do other people test their engines to obtain somewhat realistic ratings and progression per feature?

I test 50k games in self-play and 10k games each against five selected engines. TC is 10s/game - at the speed of today's computers, that's still a lot. Since my quadcore has hyperthreading, it can run eight games in parallel. For pure speed optimisations without impact on the search tree, I have some test positions where I let the engine calculate until depth 18-20 or so and then see whether it's faster, and by how much.

This also means that, in essence, CCRL could be considered useless, because they run only 32 games per match to test an engine.

The aggregate number of games is what counts, and the margin of error with 1000 games is +/- 20 Elo.

So, what would be better:

If you make improvements that are below the error margin with regard to the number of games, you have no way of telling whether the improvement actually works.

What you could do is select the openings yourself so that you have a "typical" set of positions. That would at least get around the problem of getting lucky or unlucky with the opening set, because it would always be the same. And of course, always play each opening twice, once with white and once with black.
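
The pairing idea above (a fixed opening set, each opening played once with each colour) can be sketched as a simple schedule generator. This is only an illustration: the `Game` struct and `schedule_pairs` are hypothetical names, and a real tournament manager would take care of this itself.

```c
#include <stddef.h>

/* One scheduled game: which opening is used and who has the white pieces. */
typedef struct {
    size_t opening;      /* index into the fixed opening set */
    int    a_is_white;   /* 1 if engine A plays white, 0 if engine B does */
} Game;

/* Fill `out` with two games per opening, colours reversed in the second.
 * `out` must have room for 2 * n_openings entries.
 * Returns the number of games written. */
size_t schedule_pairs(size_t n_openings, Game *out)
{
    size_t i, n = 0;

    for (i = 0; i < n_openings; i++) {
        out[n].opening = i;
        out[n].a_is_white = 1;   /* first game: A has white */
        n++;
        out[n].opening = i;
        out[n].a_is_white = 0;   /* return game: colours swapped */
        n++;
    }
    return n;
}
```

Because every opening appears exactly once per colour for each engine, an opening that favours one side affects both engines equally.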

mvanthoor · Post by **mvanthoor** » Sun Jun 06, 2021 6:26 pm

Ras wrote: ↑Sun Jun 06, 2021 5:47 pm I test 50k games in self-play and 10k games each against five selected engines. TC is 10s/game - at the speed of today's computers, that's still a lot. Since my quadcore has hyperthreading, it can run eight games in parallel. For pure speed optimisations without impact on the search tree, I have some test positions where I let the engine calculate until depth 18-20 or so and then see whether it's faster, and by how much.

50K games would run for days and days on end... I can't do that yet, until I have a dedicated computer to run those tests. (I don't want to run them on my laptop, to be honest.)

So, what would be better:

If you make improvements that are below the error margin with regard to the number of games, you have no way of telling whether the improvement actually works.

I assume killer moves and PVS (and possibly AW) are not below the error margins.

What you could do is select the openings yourself so that you have a "typical" set of positions. That would at least get around the problem of getting lucky or unlucky with the opening set, because it would always be the same. And of course, always play each opening twice, once with white and once with black.

I've been thinking about that. Maybe just create a very small opening book.
