testing procedure

cdani · Post by **cdani** » Sun Feb 23, 2014 11:33 am

Hi!
I use cutechess-cli for testing, like most people.
I see that when I run matches between my engine and an older version of it, there are a lot of repeated games. I use pgn-extract -D to remove them.
But is this normal? Maybe the book I use (varied.bin) is too small. Also, I don't know how random cutechess-cli's line selection is.
When the match is against another engine, this happens a lot less, but I suspect there are still many essentially equal games with small differences.
Thanks.

hgm · Post by **hgm** » Sun Feb 23, 2014 11:39 am

With random selection of openings, you always run the risk of duplicate games. How large the risk is depends on the width and depth of the book. For this reason I usually run from a list of opening lines, rather than from a book.
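
To put a rough number on that risk, here is a minimal sketch. It assumes openings are drawn uniformly at random with replacement, which may not match how a weighted book is actually sampled, and the function name is made up for illustration.

```cpp
#include <cmath>

// Expected number of games that start from an opening line already used
// earlier in the match, when `games` openings are drawn uniformly at random
// (with replacement) from `lines` distinct opening lines.
// Assumption: uniform selection; a weighted book will repeat more often.
double expectedRepeatedOpenings(double lines, double games) {
    // Expected count of distinct lines seen after `games` draws.
    double distinct = lines * (1.0 - std::pow(1.0 - 1.0 / lines, games));
    return games - distinct;
}
```

A repeated opening only becomes a repeated game if both engines are deterministic and play the same colors, so this is an upper bound on exact duplicates, not a prediction.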

velmarin · Post by **velmarin** » Sun Feb 23, 2014 1:07 pm

I always use FEN positions. There are loads of them, for example Ballicora's on the Gaviota site; there are so many...

Bloodbane · Post by **Bloodbane** » Sun Feb 23, 2014 1:22 pm

Just download the book used by the Stockfish development team: 8_moves_GM. It has 32000 different starting positions and I have never had problems. Of course, you could also just make your engine nondeterministic by generating different random hash keys every time the program is started. That way, even with a small book, every game is going to be different.
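
A minimal sketch of the random hash key idea, assuming a Zobrist-style table of 12 piece types by 64 squares. The table layout and the names `KeyTable`, `makeHashKeys` and `startupSeed` are assumptions for illustration, not taken from any particular engine.

```cpp
#include <array>
#include <cstdint>
#include <random>

// Hypothetical key table: 12 piece types x 64 squares.
using KeyTable = std::array<std::array<std::uint64_t, 64>, 12>;

// Fill the table from a 64-bit Mersenne Twister seeded with `seed`.
// The same seed always gives the same keys, which is handy for debugging.
KeyTable makeHashKeys(std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    KeyTable keys{};
    for (auto& piece : keys)
        for (auto& key : piece)
            key = rng();
    return keys;
}

// A fresh seed per program start, so two runs normally use different keys.
std::uint64_t startupSeed() {
    std::random_device rd;
    return (static_cast<std::uint64_t>(rd()) << 32) | rd();
}
```

Calling `makeHashKeys(startupSeed())` at startup gives different keys on each run, while passing a fixed seed keeps a run reproducible when you need to chase a bug.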

cdani · Post by **cdani** » Sun Feb 23, 2014 1:41 pm

Thanks to all! Gracias!
Sure I will do it.

Another idea I have used with some success is to generate a large list of positions, each with a list of acceptable moves, for example within a range of 0.20:

1B1k4/8/1KR5/1P5q/8/8/7p/8 w - - 0 72
b8c7;b8h2;
1b1r1nk1/1p3qpp/p1p2p2/2P5/1PN2P2/P1B3PP/4Q2K/4R3 b - - 10 40
f7d5;h7h5;f7g6;f7d7;f8g6;
1b1r1rk1/1p2q2p/p1p1npp1/1PPp4/P2BbP2/4P2P/1Q2N1P1/2R1RBK1 b - - 0 26
a6b5;g6g5;c6b5;
1b1r1rk1/4q2p/2p1np2/2Pp4/3BbN2/4P1PP/1Q6/2R1RBK1 b - - 0 30
d8d7;e6g5;e6d4;b8c7;b8f4;
1b1r1rk1/pp2q1p1/2p1p1p1/5pPP/n1PP1P2/P2BBQ2/1P4K1/1R1R4 b - - 0 27
e7e8;e7f7;
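
Reading such a list could look roughly like this. The sketch assumes exactly the layout shown above (one FEN line followed by one line of semicolon-terminated coordinate moves); the struct and function names are made up for illustration.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One test position: a FEN string plus the moves considered acceptable.
struct TestPosition {
    std::string fen;
    std::vector<std::string> acceptable;
};

// Read pairs of lines: FEN first, then moves separated by ';'.
std::vector<TestPosition> readPositions(std::istream& in) {
    std::vector<TestPosition> positions;
    std::string fen, moves;
    while (std::getline(in, fen) && std::getline(in, moves)) {
        TestPosition p;
        p.fen = fen;
        std::istringstream ms(moves);
        std::string mv;
        while (std::getline(ms, mv, ';'))
            if (!mv.empty()) p.acceptable.push_back(mv);
        positions.push_back(p);
    }
    return positions;
}

// True if the engine's chosen move is in the acceptable list.
bool isAcceptable(const TestPosition& p, const std::string& move) {
    return std::find(p.acceptable.begin(), p.acceptable.end(), move)
           != p.acceptable.end();
}
```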

For the moment I have used this with some of the pruning and reduction parameters:

static const bool UseRazoring = true;
int RazorDepth = 3;
int ValorEvalMargin=30;
int DepthRazoring=5;
int ValorRazoring1=105;
int ValorRazoring2=245;
int LevelFutilityPruning=4;
int ValorFutilityPruning=250;
int AuxiliarNullReduction=4;
int BigGain=190;
int ValorDeltaCutoff=140;
int fmargin[3] = {0, 120, 370}; // futility pruning

changing them randomly, and seeing for how many positions the engine gives an acceptable move and how many moves it had to analyze to get there. So I am optimizing the speed of all these things.
I was able to gain some analysis speed while also gaining a little strength.
It is a type of test suite, but not as restrictive.
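
The loop described above could be sketched like this. It is only one possible reading: the acceptance rule (more acceptable moves first, then less search effort) and the names `SuiteResult`, `RunSuite` and `randomTune` are assumptions, and `runSuite` stands for whatever code runs the engine over the position list with the given parameters.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <vector>

// Outcome of one run over the whole position list.
struct SuiteResult {
    int solved;            // positions where the chosen move was acceptable
    std::uint64_t nodes;   // total search effort spent
};

// Assumed rule: more solved positions wins; on a tie, fewer nodes wins.
bool isBetter(const SuiteResult& candidate, const SuiteResult& best) {
    if (candidate.solved != best.solved) return candidate.solved > best.solved;
    return candidate.nodes < best.nodes;
}

using Params = std::vector<int>;
using RunSuite = std::function<SuiteResult(const Params&)>;

// Change one random parameter by a random step and keep the change
// only if the suite result improves.
Params randomTune(Params best, const RunSuite& runSuite,
                  int iterations, int maxStep, std::uint64_t seed) {
    if (best.empty()) return best;
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, best.size() - 1);
    std::uniform_int_distribution<int> step(-maxStep, maxStep);
    SuiteResult bestResult = runSuite(best);
    for (int i = 0; i < iterations; ++i) {
        Params trial = best;
        trial[pick(rng)] += step(rng);
        SuiteResult r = runSuite(trial);
        if (isBetter(r, bestResult)) {
            best = trial;
            bestResult = r;
        }
    }
    return best;
}
```

A result tuned this way is fitted to the position list, so it still needs to be confirmed with real games.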

Maybe someone is interested in this or has a better idea.
I plan to try this to test other things. I am not sure to what extent this will be able to replace the tedious work of playing lots of games.
Dani.

Sven · Post by **Sven** » Sun Feb 23, 2014 3:02 pm

Hi,

two points to add:

1) Regardless of which strategy you choose (book or fixed list of positions or whatever), obtaining statistically reliable test results will always require playing lots of games. You can't do anything to avoid that. How many games are needed depends on the Elo differences you want to be able to measure.

2) Some people prefer testing with a fixed set of starting positions over any method involving some randomness also for another reason: reproducibility of test results. If you play 3000 different games from an opening book, then make a software change and repeat the test, an improved score may or may not indicate that your software change is actually an improvement, even if the Elo difference resulting from the two test runs exceeds the error bounds. Some people will actually disagree here, so take this point as a suggestion, not as a proven fact, even though I am pretty sure it is correct to say that always using the exact same set of starting positions is superior to using an opening book.

Sven
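
For point 1, a back-of-the-envelope estimate can be sketched as follows. It uses the standard logistic Elo formula and a simple normal approximation. The per-game standard deviation is an input you have to supply (0.5 is the worst case with no draws; draws make it smaller), and the function names are made up for illustration.

```cpp
#include <cmath>

// Expected score (between 0 and 1) for a player rated `eloDiff` points
// above the opponent, using the logistic Elo formula.
double expectedScore(double eloDiff) {
    return 1.0 / (1.0 + std::pow(10.0, -eloDiff / 400.0));
}

// Rough number of games needed so that the expected score lies `z`
// standard errors away from 50%. Assumes independent games and a known
// per-game standard deviation; eloDiff must be nonzero.
double gamesNeeded(double eloDiff, double z, double perGameStdDev) {
    double margin = expectedScore(eloDiff) - 0.5;
    double root = z * perGameStdDev / margin;
    return std::ceil(root * root);
}
```

Under these assumptions the count grows roughly with the inverse square of the Elo difference, so halving the difference you want to detect needs about four times as many games. With z = 2 and a standard deviation of 0.5, a 10 Elo difference already needs several thousand games.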

cdani · Post by **cdani** » Sun Feb 23, 2014 7:36 pm

Thanks also!

Another question/idea: for some types of changes, I just play a lot of depth 1 games. I think that is enough for some evaluation changes.

bob · Post by **bob** » Mon Feb 24, 2014 12:11 am

cdani wrote: Thanks also!

Another question/idea: for some types of changes, I just play a lot of depth 1 games. I think that is enough for some evaluation changes.

It can be both dangerous and misleading. Suppose you add a complex evaluation term. Testing to fixed depth ignores the speed penalty the new term introduces, so all you see is the upside. But in real games, where time is measured, your program gets out-searched and loses badly.

If you want to play timed games in tournaments, you need to play timed games in testing. You can play very fast games, of course, but they do need to be timed or you can distort the results and never realize it.

cdani · Post by **cdani** » Mon Feb 24, 2014 12:24 am

bob wrote: It can be both dangerous and misleading. Suppose you add a complex evaluation term. Testing to fixed depth ignores the speed penalty the new term introduces, so all you see is the upside. But in real games, where time is measured, your program gets out-searched and loses badly.

Sure. I was thinking about just simple things, like changing the value of something that already exists. Then, after several changes like this, I would run another round of slower games to validate all the previous changes. I agree with you that in the end everything should be evaluated with serious testing procedures.
Thanks!
