testing procedure

cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

testing procedure

Post by cdani »

Hi!
I use cutechess-cli for testing, like most people.
When I run matches between my engine and an older version of it, I get a lot of repeated games. I use pgn-extract -D to remove them.
But is this normal? Maybe the book I use (varied.bin) is too small. Also, I don't know how random the line selection of cutechess-cli is.
When the match is against another engine this happens a lot less, but I suspect there are still many nearly identical games with only small differences.
Thanks.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: testing procedure

Post by hgm »

With random selection of openings, you always run the risk of duplicate games. How large the risk is depends on the width and depth of the book. For this reason I usually run from a list of opening lines, rather than from a book.
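
To put rough numbers on that risk, here is a minimal sketch based on the birthday problem. It assumes every game's opening is drawn uniformly and independently from a book of equally likely lines, which real books and real selection code need not do, and it treats "same opening" as "same game", which only holds for fully deterministic engines. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Expected number of distinct openings seen after `games` uniform random
// draws from `lines` equally likely opening lines (idealized model).
double expectedDistinctOpenings(double lines, double games) {
    assert(lines >= 1.0 && games >= 0.0);
    return lines * (1.0 - std::pow(1.0 - 1.0 / lines, games));
}

// Expected number of games that start from an opening already used
// earlier in the same match.
double expectedRepeatedOpenings(double lines, double games) {
    return games - expectedDistinctOpenings(lines, games);
}
```

For example, expectedRepeatedOpenings(32000, 3000) comes out at roughly 136 under this model, and the figure grows quickly as the book gets smaller or the match gets longer.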
velmarin
Posts: 1600
Joined: Mon Feb 21, 2011 9:48 am

Re: testing procedure

Post by velmarin »

I always use FEN positions. There are loads of them, for example Ballicora's on the Gaviota site, there are so many...
Bloodbane
Posts: 154
Joined: Thu Oct 03, 2013 4:17 pm

Re: testing procedure

Post by Bloodbane »

Just download the book used by the Stockfish development team: 8_moves_GM. It has 32000 different starting positions and I have never had problems. Of course you could also just make your engine nondeterministic by generating different random hash keys every time the program is started. That way, even with a small book, every game is going to be different.
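
A minimal sketch of the random hash key idea, assuming a Zobrist-style table indexed by piece and square. The table layout and the names makeZobristTable and freshSeed are hypothetical, and side-to-move, castling and en passant keys are left out to keep it short.

```cpp
#include <array>
#include <cstdint>
#include <random>

// 12 piece kinds (6 per colour) x 64 squares. This layout is an assumption
// made for the sketch, not any particular engine's.
using ZobristTable = std::array<std::array<std::uint64_t, 64>, 12>;

// Fill the table from a 64-bit Mersenne Twister seeded with `seed`.
// The same seed always gives the same table, which is handy for debugging.
ZobristTable makeZobristTable(std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    ZobristTable table{};
    for (auto& piece : table)
        for (auto& key : piece)
            key = rng();
    return table;
}

// Draw a fresh seed at program start so that every run uses different keys.
std::uint64_t freshSeed() {
    std::random_device rd;
    const std::uint64_t high = rd();
    const std::uint64_t low = rd();
    return (high << 32) | low;
}
```

Calling makeZobristTable(freshSeed()) once at startup gives different keys on every run. Different keys map positions to different hash table slots, so replacements and collisions differ between runs and the search can end up choosing different moves. Keeping a fixed-seed option around helps when a bug has to be reproduced.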
Functional programming combines the flexibility and power of abstract mathematics with the intuitive clarity of abstract mathematics.
https://github.com/mAarnos
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: testing procedure

Post by cdani »

Thanks to all! Gracias!
Sure I will do it.

Another idea I have used with some success is to generate a long list of positions, each with a list of acceptable moves, for example the moves within an evaluation range of 0.20:

1B1k4/8/1KR5/1P5q/8/8/7p/8 w - - 0 72
b8c7;b8h2;
1b1r1nk1/1p3qpp/p1p2p2/2P5/1PN2P2/P1B3PP/4Q2K/4R3 b - - 10 40
f7d5;h7h5;f7g6;f7d7;f8g6;
1b1r1rk1/1p2q2p/p1p1npp1/1PPp4/P2BbP2/4P2P/1Q2N1P1/2R1RBK1 b - - 0 26
a6b5;g6g5;c6b5;
1b1r1rk1/4q2p/2p1np2/2Pp4/3BbN2/4P1PP/1Q6/2R1RBK1 b - - 0 30
d8d7;e6g5;e6d4;b8c7;b8f4;
1b1r1rk1/pp2q1p1/2p1p1p1/5pPP/n1PP1P2/P2BBQ2/1P4K1/1R1R4 b - - 0 27
e7e8;e7f7;

So far I have used this with some of the parameters for the cuts applied before moves are searched (razoring, futility pruning and so on):

static const bool UseRazoring = true;
int RazorDepth = 3;
int ValorEvalMargin = 30;
int DepthRazoring = 5;
int ValorRazoring1 = 105;
int ValorRazoring2 = 245;
int LevelFutilityPruning = 4;
int ValorFutilityPruning = 250;
int AuxiliarNullReduction = 4;
int BigGain = 190;
int ValorDeltaCutoff = 140;
int fmargin[3] = {0, 120, 370}; // futility pruning

I change them randomly and see how many positions yield an acceptable move, and how many moves had to be analyzed to get there. In effect this optimizes the speed of all these things.
I was able to gain some analysis speed while also gaining a little strength.
It is a kind of test suite, but not such a restrictive one.
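
A rough sketch of how such a list could be read and scored, assuming the file strictly alternates one FEN line with one line of moves separated by semicolons, as in the sample above. The type and function names are made up, and bestMove stands for whatever hook returns the engine's chosen move for a FEN.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct TestPosition {
    std::string fen;
    std::vector<std::string> acceptableMoves;
};

// Read alternating lines: a FEN, then "move;move;...;".
// Assumes no blank lines between entries.
std::vector<TestPosition> readPositions(std::istream& in) {
    std::vector<TestPosition> positions;
    std::string fen, moveLine;
    while (std::getline(in, fen) && std::getline(in, moveLine)) {
        TestPosition pos;
        pos.fen = fen;
        std::istringstream moves(moveLine);
        std::string move;
        while (std::getline(moves, move, ';'))
            if (!move.empty())
                pos.acceptableMoves.push_back(move);
        positions.push_back(pos);
    }
    return positions;
}

// True if `move` is one of the moves listed as acceptable for `pos`.
bool isAcceptable(const TestPosition& pos, const std::string& move) {
    return std::find(pos.acceptableMoves.begin(), pos.acceptableMoves.end(),
                     move) != pos.acceptableMoves.end();
}

// Count the positions where the move returned by `bestMove(fen)` is acceptable.
// `bestMove` is a hypothetical callback into the engine under test.
template <typename BestMoveFn>
std::size_t countSolved(const std::vector<TestPosition>& positions,
                        BestMoveFn bestMove) {
    std::size_t solved = 0;
    for (const auto& pos : positions)
        if (isAcceptable(pos, bestMove(pos.fen)))
            ++solved;
    return solved;
}
```

Each random parameter set can then be ranked by countSolved together with the search effort it needed.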

Maybe someone is interested in this or has a better idea.
I plan to try this for testing other things. I am not sure to what extent it will be able to replace the tedious work of playing lots of games.
Dani.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: testing procedure

Post by Sven »

Hi,

two points to add:

1) Regardless of which strategy you choose (book, fixed list of positions, or whatever), obtaining statistically reliable test results will always require playing lots of games. You can't do anything to avoid that. How many games are needed depends on the Elo differences you want to be able to measure.

2) Some people prefer testing with a fixed set of starting positions over any method involving some randomness for another reason as well: reproducibility of test results. If you play 3000 different games from an opening book, then make a software change and repeat the test, an improved score may or may not indicate that your software change is actually an improvement, even if the Elo difference between the two test runs exceeds the error bounds. Some people will actually disagree here, so take this point as a suggestion, not as a proven fact, even though I am pretty sure it is correct to say that always using the exact same set of starting positions is superior to using an opening book.
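
To illustrate point 1, here is a minimal sketch that turns a win/draw/loss count into an Elo difference with an approximate 95% interval. It assumes the usual logistic Elo model, independent games and a normal approximation for the mean score. The names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Elo difference implied by an expected score in (0, 1), logistic model.
double eloFromScore(double score) {
    assert(score > 0.0 && score < 1.0);
    return -400.0 * std::log10(1.0 / score - 1.0);
}

struct EloEstimate {
    double elo;   // point estimate
    double low;   // lower end of the approximate 95% interval
    double high;  // upper end of the approximate 95% interval
};

EloEstimate estimateElo(int wins, int draws, int losses) {
    const double n = static_cast<double>(wins) + draws + losses;
    assert(n > 0.0);
    const double s = (wins + 0.5 * draws) / n;  // mean score per game
    // Per-game variance of the result (1, 0.5 or 0) around the mean score.
    const double variance = (wins * (1.0 - s) * (1.0 - s) +
                             draws * (0.5 - s) * (0.5 - s) +
                             losses * s * s) / n;
    const double stdError = std::sqrt(variance / n);
    const double z = 1.96;  // two-sided 95% under the normal approximation
    return {eloFromScore(s), eloFromScore(s - z * stdError),
            eloFromScore(s + z * stdError)};
}
```

Because the standard error shrinks only with the square root of the number of games, halving the error bar takes about four times as many games.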

Sven
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: testing procedure

Post by cdani »

Thanks also!

Another idea I use, for some types of changes, is to just play a lot of depth-1 games. I think that is enough for some changes to the evaluation.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing procedure

Post by bob »

cdani wrote: Thanks also!

Another idea I use, for some types of changes, is to just play a lot of depth-1 games. I think that is enough for some changes to the evaluation.
It can be both dangerous and misleading. Suppose you add a complex evaluation term. Testing to fixed depth ignores the speed penalty the new term introduces, so all you see is the upside. But in real games, where time is measured, your program gets out-searched and loses badly.

If you want to play timed games in tournaments, you need to play timed games in testing. You can play very fast games, of course, but they do need to be timed or you can distort the results and never realize it.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: testing procedure

Post by cdani »

bob wrote: It can be both dangerous and misleading. Suppose you add a complex evaluation term. Testing to fixed depth ignores the speed penalty the new term introduces, so all you see is the upside. But in real games, where time is measured, your program gets out-searched and loses badly.
Sure. I was thinking about just simple things, like changing the value of something that already exists. Then, after a few changes like this, I do another round of slower games to validate all the previous changes. I agree with you that in the end everything should be evaluated with serious testing procedures.
Thanks!