Setting up a testing framework


mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Setting up a testing framework

Post by mcostalba »

Initial test results of SF 2.1 are much less than stellar, to say the least. And, more importantly, they are somewhat unexpected, because in our internal testing the gain vs 2.0.1 was about +30 ELO. So I am now rethinking the whole testing framework, because I strongly believe that reliable and consistent testing is critical for engine advancement: you cannot advance a strong engine without a reliable testing framework.

So I would like to start testing against a (small) engine pool instead of against the previous SF version, as we do currently. Just to be clear, I think that self-testing is a good thing and has proven very useful for us: we have gained hundreds of ELO since Glaurung times relying only on this scheme, which I consider proven and effective and IMHO the _best_ way to test features in the 10-15 ELO resolution range.

But today, for a top engine, 10 ELO resolution is not enough; you really want to push down to 5 ELO, otherwise you miss a lot of possible small but effective tweaks that, summed up, could make a difference. We have experienced with the last release that when dealing with 5 ELO features simply increasing the number of played games is not enough; we need something different, and so testing vs an engine pool comes into play. Please note that I still don't know if pool testing is better, equal or even worse; it is just a new road that I would like to try (yes, some people here have been down this road before, but I really don't care ;-) because I want to test it myself).
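As a rough back-of-the-envelope check (a Python sketch, assuming a per-game score standard deviation of about 0.3, which is typical with a high draw rate, and a score near 50%, where 1% of score is worth about 7 Elo):

Code: Select all

import math

def elo_error_margin(n_games, sigma_per_game=0.3, z=1.96):
    """Approximate 95% error margin in Elo for a match of n_games.

    Assumes a per-game score standard deviation of ~0.3 and a score
    close to 50%, where the Elo curve has slope 400/(ln 10 * 0.25).
    """
    score_stderr = sigma_per_game / math.sqrt(n_games)
    elo_per_score_point = 400.0 / (math.log(10) * 0.25)
    return z * score_stderr * elo_per_score_point

for n in (10000, 40000, 60000):
    print(n, round(elo_error_margin(n), 1))

With these assumptions 10K games give roughly a +/-4 Elo margin, so 5 ELO resolution really sits at the limit of a single 10K-game run.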

So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets. But before starting I have to validate the testing framework; here is what I am planning to do:

STEP 1: RELIABILITY
I'll run a gauntlet of 10K games of SF compiled by me against a pool of engines including Jim's official release of SF 2.1. I have chosen a TC of 1"+0.1" (LittleBlitzer's default), single thread. I will repeat the test 3 times; if the results of the 3 runs are not the same with high accuracy, then LB is not reliable and I will stop the validation process without further attempts.

STEP 2: SCALABILITY
I'll run the same gauntlet but at a 10"+0.1" TC (it will take a while!), still single thread. If the results are far apart, then we have scalability problems and will need to find a better TC (I really hope not!).

STEP 3: MULTI-TOURNAMENT
In case we are lucky and also pass step 2, I will run the same gauntlet again at 1"+0.1", but this time running 2 tournaments in parallel (I have a QUAD so I can allocate 1 engine per CPU). Also in this case the results of the 10K games gauntlet should be consistent with the previous tests.


Unfortunately our main testing framework is under Linux and LittleBlitzer is available for Windows only, but if it proves to be good I can use it on my QUAD as a useful validation/verification tool.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Setting up a testing framework

Post by Michel »

So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets.
Perhaps you gave up too easily on cutechess-cli. It is trivial to write a little script (my favorite language is python) that uses cutechess-cli as a driver to run all kinds of tournaments (gauntlet, round robin, team fight). This is what I currently use for testing.

No GUI can beat the flexibility of a scripting language.
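For example, a minimal sketch of such a driver in Python (the cutechess-cli options used below, -engine/-each/-rounds/-pgnout, are those of a recent build; older versions use -fcp/-scp instead, and all engine names and paths here are placeholders):

Code: Select all

import subprocess

# Placeholder engine names/paths; adjust to your own setup.
CANDIDATE = ("sf-dev", "./stockfish-dev")
OPPONENTS = [("sf-2.1", "./stockfish-2.1"),
             ("opponent-a", "./opponent-a"),
             ("opponent-b", "./opponent-b")]
GAMES_PER_OPPONENT = 2000

def run_match(name_a, cmd_a, name_b, cmd_b, games):
    """Play one head-to-head match with cutechess-cli, appending to one PGN."""
    subprocess.check_call([
        "cutechess-cli",
        "-engine", "name=" + name_a, "cmd=" + cmd_a,
        "-engine", "name=" + name_b, "cmd=" + cmd_b,
        "-each", "proto=uci", "tc=1+0.1",
        "-rounds", str(games),
        "-pgnout", "gauntlet.pgn"])

# A gauntlet is just the candidate against each pool member in turn.
for name, cmd in OPPONENTS:
    run_match(CANDIDATE[0], CANDIDATE[1], name, cmd, GAMES_PER_OPPONENT)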
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Setting up a testing framework

Post by bob »

mcostalba wrote:Initial test results of SF 2.1 are much less than stellar, to say the least. And, more importantly, they are somewhat unexpected, because in our internal testing the gain vs 2.0.1 was about +30 ELO. So I am now rethinking the whole testing framework, because I strongly believe that reliable and consistent testing is critical for engine advancement: you cannot advance a strong engine without a reliable testing framework.

So I would like to start testing against a (small) engine pool instead of against the previous SF version, as we do currently. Just to be clear, I think that self-testing is a good thing and has proven very useful for us: we have gained hundreds of ELO since Glaurung times relying only on this scheme, which I consider proven and effective and IMHO the _best_ way to test features in the 10-15 ELO resolution range.

But today, for a top engine, 10 ELO resolution is not enough; you really want to push down to 5 ELO, otherwise you miss a lot of possible small but effective tweaks that, summed up, could make a difference. We have experienced with the last release that when dealing with 5 ELO features simply increasing the number of played games is not enough; we need something different, and so testing vs an engine pool comes into play. Please note that I still don't know if pool testing is better, equal or even worse; it is just a new road that I would like to try (yes, some people here have been down this road before, but I really don't care ;-) because I want to test it myself).

So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets. But before starting I have to validate the testing framework; here is what I am planning to do:

STEP 1: RELIABILITY
I'll run a gauntlet of 10K games of SF compiled by me against a pool of engines including Jim's official release of SF 2.1. I have chosen a TC of 1"+0.1" (LittleBlitzer's default), single thread. I will repeat the test 3 times; if the results of the 3 runs are not the same with high accuracy, then LB is not reliable and I will stop the validation process without further attempts.

STEP 2: SCALABILITY
I'll run the same gauntlet but at a 10"+0.1" TC (it will take a while!), still single thread. If the results are far apart, then we have scalability problems and will need to find a better TC (I really hope not!).

STEP 3: MULTI-TOURNAMENT
In case we are lucky and also pass step 2, I will run the same gauntlet again at 1"+0.1", but this time running 2 tournaments in parallel (I have a QUAD so I can allocate 1 engine per CPU). Also in this case the results of the 10K games gauntlet should be consistent with the previous tests.


Unfortunately our main testing framework is under Linux and LittleBlitzer is available for Windows only, but if it proves to be good I can use it on my QUAD as a useful validation/verification tool.
If you mean 1 sec + 0.1 sec increment, that's not very good. It is essentially 0.1 seconds per move, period. You will use up that 1 second quickly, or not at all. My fastest testing is 10 secs + 0.1 secs, and even that is a challenge for most programs, and you see too many time losses stack up.

The first thing you have to do is run several thousand games against each opponent and extract the time losses. In my 30K-game matches, I see 1 or 2 at most, none by Crafty...
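If all the games land in one PGN file, a quick scan of the termination text is enough to extract them. A rough sketch (the exact wording of the termination comment differs between referees and GUIs, so the pattern below is only a placeholder):

Code: Select all

import collections
import re

# Adjust the pattern to whatever your referee/GUI writes for time forfeits.
TIME_LOSS = re.compile(r"loses on time|time forfeit", re.IGNORECASE)

def count_time_losses(pgn_path):
    """Count, per player, games whose termination text mentions a time loss."""
    losses = collections.Counter()
    white = black = None
    with open(pgn_path, errors="replace") as pgn:
        for line in pgn:
            if line.startswith('[White "'):
                white = line.split('"')[1]
            elif line.startswith('[Black "'):
                black = line.split('"')[1]
            elif TIME_LOSS.search(line):
                # The termination comment usually names the side that forfeited.
                loser = white if "white" in line.lower() else black
                losses[loser] += 1
    return losses

print(count_time_losses("gauntlet.pgn"))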
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Setting up a testing framework

Post by michiguel »

Michel wrote:
So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets.
Perhaps you gave up too easily on cutechess-cli. It is trivial to write a little script (my favorite language is python) that uses cutechess-cli as a driver to run all kinds of tournaments (gauntlet, round robin, team fight). This is what I currently use for testing.

No GUI can beat the flexibility of a scripting language.
+1

Ruby here.

Miguel
nionita
Posts: 175
Joined: Fri Oct 22, 2010 9:47 pm
Location: Austria

Re: Setting up a testing framework

Post by nionita »

bob wrote: If you mean 1 sec + 0.1 sec increment, that's not very good. It is essentially 0.1 seconds per move, period. You will use up that 1 second quickly, or not at all. My fastest testing is 10 secs + 0.1 secs, and even that is a challenge for most programs, and you see too many time losses stack up.

The first thing you have to do is run several thousand games against each opponent and extract the time losses. In my 30K-game matches, I see 1 or 2 at most, none by Crafty...
I just let one quick test run with 1s + 0.1s between 4 engines (see below). In this setup it takes about 15 seconds per game. In this test there were no time losses; it looks like those engines can manage such short time controls very well.

Code: Select all

Games Completed = 120 of 120 (Avg game length = 14.902 sec)
Settings = RR/128MB/1000ms+100ms/M 800cp for 12 moves, D 450 moves/PGN:J:\Chess\swcr-fq-openings-v4.1-main-files\swcr-fq-openings-v4.1.pgn(3395)
Time = 1858 sec elapsed, 0 sec remaining
 1.  Stockfish 2.1 JA 64bit   	29.5/60	20-21-19  	(L: m=1 t=0 i=0 a=20)	(D: r=7 i=5 f=7 s=0 a=0)	(tpm=97.2 d=15.5 nps=4577493)
 2.  Critter 1.01 64-bit SSE4 	16.0/60	6-34-20  	(L: m=10 t=0 i=0 a=24)	(D: r=13 i=4 f=3 s=0 a=0)	(tpm=97.5 d=14.0 nps=5098380)
 3.  Houdini 1.5 x64          	41.0/60	32-10-18  	(L: m=1 t=0 i=0 a=9)	(D: r=9 i=3 f=6 s=0 a=0)	(tpm=98.6 d=14.8 nps=7787903)
 4.  IvanHoe 9.47b x64        	33.5/60	22-15-23  	(L: m=3 t=0 i=0 a=12)	(D: r=13 i=6 f=4 s=0 a=0)	(tpm=96.8 d=14.6 nps=5610791)
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Setting up a testing framework

Post by Milos »

bob wrote:If you mean 1 sec + 0.1 sec increment, that's not very good. It is essentially 0.1 seconds per move, period. You will use up that 1 second quickly, or not at all. My fastest testing is 10 secs + 0.1 secs, and even that is a challenge for most programs, and you see too many time losses stack up.
Plain and simple: whatever client/GUI (or custom scripts) you use for your testing, it sucks, and it sucks big. Do yourself a favor and stop doing it.
I regularly test different engines at a 1''+0.1'' TC in cutechess-cli, in LittleBlitzer and in WinBoard (all under Windows) and I never have any time losses (unless it's a problematic engine version, which I take out easily). And I'm talking about 100Ks of games.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Setting up a testing framework

Post by Ferdy »

Michel wrote:
So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets.
Perhaps you gave up too easily on cutechess-cli. It is trivial to write a little script (my favorite language is python) that uses cutechess-cli as a driver to run all kinds of tournaments (gauntlet, round robin, team fight). This is what I currently use for testing.

No GUI can beat the flexibility of a scripting language.
Using cutechess-cli I also tried an MS-DOS batch command :), even making it dynamic: if the engine under test is 50 Elo or worse (settable) than the model engine after 2k games (settable), the batch command stops the test; if it passes, the test continues until the test engine gets a significant Elo lead (settable) after a number of games (settable). The Elo check can also be done after every 2k games (settable).
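Something like the following Python sketch captures that stop/continue logic (the thresholds are only placeholders for the "settable" parameters above, and the Elo conversion is the usual logistic formula):

Code: Select all

import math

def elo_diff(score, games):
    """Usual logistic conversion of a match score into an Elo difference."""
    p = min(max(score / games, 1e-6), 1.0 - 1e-6)   # keep the log finite
    return -400.0 * math.log10(1.0 / p - 1.0)

# Placeholder values for the "settable" parameters.
FAIL_BELOW = -50.0         # stop if the test engine trails the model by this much
PASS_LEAD = 10.0           # accept once the lead is at least this big...
MIN_GAMES_FOR_PASS = 8000  # ...and at least this many games have been played

def decide(score, games):
    """Return 'stop-fail', 'stop-pass' or 'continue' at a 2k-game checkpoint."""
    diff = elo_diff(score, games)
    if diff <= FAIL_BELOW:
        return "stop-fail"
    if diff >= PASS_LEAD and games >= MIN_GAMES_FOR_PASS:
        return "stop-pass"
    return "continue"

print(decide(1040.0, 2000))   # 52% score after 2k games -> 'continue'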
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Setting up a testing framework

Post by Laskos »

mcostalba wrote:Initial test results of SF 2.1 are much less than stellar, to say the least. And, more importantly, they are somewhat unexpected, because in our internal testing the gain vs 2.0.1 was about +30 ELO. So I am now rethinking the whole testing framework, because I strongly believe that reliable and consistent testing is critical for engine advancement: you cannot advance a strong engine without a reliable testing framework.

So I would like to start testing against a (small) engine pool instead of against the previous SF version, as we do currently. Just to be clear, I think that self-testing is a good thing and has proven very useful for us: we have gained hundreds of ELO since Glaurung times relying only on this scheme, which I consider proven and effective and IMHO the _best_ way to test features in the 10-15 ELO resolution range.

But today, for a top engine, 10 ELO resolution is not enough; you really want to push down to 5 ELO, otherwise you miss a lot of possible small but effective tweaks that, summed up, could make a difference. We have experienced with the last release that when dealing with 5 ELO features simply increasing the number of played games is not enough; we need something different, and so testing vs an engine pool comes into play. Please note that I still don't know if pool testing is better, equal or even worse; it is just a new road that I would like to try (yes, some people here have been down this road before, but I really don't care ;-) because I want to test it myself).

So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets. But before starting I have to validate the testing framework; here is what I am planning to do:

STEP 1: RELIABILITY
I'll run a gauntlet of 10K games of SF compiled by me against a pool of engines including Jim's official release of SF 2.1. I have chosen a TC of 1"+0.1" (LittleBlitzer's default), single thread. I will repeat the test 3 times; if the results of the 3 runs are not the same with high accuracy, then LB is not reliable and I will stop the validation process without further attempts.

STEP 2: SCALABILITY
I'll run the same gauntlet but at a 10"+0.1" TC (it will take a while!), still single thread. If the results are far apart, then we have scalability problems and will need to find a better TC (I really hope not!).

STEP 3: MULTI-TOURNAMENT
In case we are lucky and also pass step 2, I will run the same gauntlet again at 1"+0.1", but this time running 2 tournaments in parallel (I have a QUAD so I can allocate 1 engine per CPU). Also in this case the results of the 10K games gauntlet should be consistent with the previous tests.


Unfortunately our main testing framework is under Linux and LittleBlitzer is available for Windows only, but if it proves to be good I can use it on my QUAD as a useful validation/verification tool.
I think you are talking about only one test, which showed a 7 +/- 11 Elo points (95% confidence) increase. This particular test IMHO had problems. Yes, self-testing exaggerates the differences, but I am expecting at least a 15-20 Elo points increase in a gauntlet. Use LittleBlitzer 2.72 with some not too middle-gamish openings; I use swcr.pgn and 8moves.epd. Randomize them (there is an option in LittleBlitzer), and you can use them even for 60K-game matches (2 Elo points error margins at 95% confidence). 1s+0.1s should be fine; with less than that some engines do not use their time properly (check the tpm values). Don't totally discard self-testing; it is actually the most sensitive (maybe too sensitive) to changes. If in LittleBlitzer there are time losses or illegal-move losses at a rate of more than 1/1000, increase the TC.

Kai
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Setting up a testing framework

Post by bob »

Milos wrote:
bob wrote:If you mean 1 sec + 0.1 sec increment, that's not very good. It is essentially 0.1 seconds per move, period. You will use up that 1 second quickly, or not at all. My fastest testing is 10 secs + 0.1 secs, and even that is a challenge for most programs, and you see too many time losses stack up.
Plain and simple: whatever client/GUI (or custom scripts) you use for your testing, it sucks, and it sucks big. Do yourself a favor and stop doing it.
I regularly test different engines at a 1''+0.1'' TC in cutechess-cli, in LittleBlitzer and in WinBoard (all under Windows) and I never have any time losses (unless it's a problematic engine version, which I take out easily). And I'm talking about 100Ks of games.
I don't use a GUI. I use a custom-written referee that lets Crafty play thousands of game-in-1-second games with no time losses. If one knows what one is doing, this is not hard. On the other hand, plenty of programs have problems with that fast a time control; I've reported on them in the past. The engines I use don't lose on time either, but almost every time I try to add something new, that program doesn't like very fast time controls...

Not much I can do about it. You, on the other hand, might actually try reading a post every now and then. In the last sentence, which you cleverly snipped...
bob wrote: In my 30K-game matches, I see 1 or 2 at most, none by Crafty...
I don't see _any_ time losses with the set of programs I currently use. From that, one might infer that the referee program I use works well also. One that reads...
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Setting up a testing framework

Post by mcostalba »

Laskos wrote: Yes, self-testing exaggerates the differences
Just to be clear, exaggerating the differences is a good thing IMHO and is one of the main reasons why we use self-testing. We really want the differences to be exaggerated, so as to require fewer games to separate potentially good stuff from garbage.

Our main testing framework is based on Linux + cutechess + self-testing and I think we will stick to that. What I am thinking, though, is that this is not enough. We really need a testing pipeline where the potentially good changes filtered out by self-testing are further processed with a LittleBlitzer gauntlet step before being committed.

Our main test framework currently uses 4 threads per engine (it is a QUAD); if the scalability and multi-tournament tests prove successful, I could switch to using 1 thread per engine (this should also have the side benefit of reducing noise due to high SMP non-determinism) and run 2, 3 or even 4 instances of cutechess in parallel (no pondering), depending on the scalability test results.
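For reference, a sketch of launching such parallel single-threaded matches from Python (assuming the option.NAME=VALUE syntax of a recent cutechess-cli build for setting UCI options; pondering is off unless explicitly enabled; paths and game counts are placeholders):

Code: Select all

import subprocess

INSTANCES = 4  # one single-threaded engine pair per core on a QUAD

def launch(i):
    """Start one independent cutechess-cli match, writing to its own PGN."""
    return subprocess.Popen([
        "cutechess-cli",
        "-engine", "name=sf-dev", "cmd=./stockfish-dev",
        "-engine", "name=sf-2.1", "cmd=./stockfish-2.1",
        "-each", "proto=uci", "tc=1+0.1", "option.Threads=1",
        "-rounds", "2500",
        "-pgnout", "run-%d.pgn" % i])

procs = [launch(i) for i in range(INSTANCES)]
for p in procs:
    p.wait()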