Initial test results of SF 2.1 are much less than stellar, to say the least. And, more important, they are somewhat unexpected, because in our internal testing the gain vs 2.0.1 was about +30 Elo. So I am now rethinking the whole testing framework, because I strongly believe that reliable and consistent testing is critical for engine advancement: you cannot advance a strong engine without a reliable testing framework.
So I would like to start testing against a (small) engine pool instead of against the previous SF version, as we do currently. Just to be clear, I think that self-testing is a good thing and has proven very useful for us: we have gained hundreds of Elo since Glaurung times relying only on this scheme, which I consider proven, effective, and IMHO the _best_ way to test features in the 10-15 Elo resolution range.
But today, for a top engine, 10 Elo resolution is not enough; you really want to push down to 5 Elo, otherwise you miss a lot of small but effective tweaks that, summed up, could make a difference. We learned with the last release that, when dealing with 5 Elo features, increasing the number of played games is not enough; we need something different, and this is where testing vs an engine pool comes into play. Please note that I still don't know whether pool testing is better, equal, or even worse; it is just a new road that I would like to try (yes, some people here have tried this before, but I really don't care, because I want to test it myself).
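To see why 5 Elo resolution is so demanding, here is a rough back-of-envelope sketch (an editorial illustration, not part of the original post; the 35% draw rate and the normal approximation are assumptions):

```python
import math

def elo_error_95(games, draw_rate=0.35):
    """Approximate 95% error margin in Elo for a match of `games` games
    between near-equal engines, given an assumed draw rate."""
    # Per-game score variance at a 50% score with draw rate d is (1 - d) / 4
    sigma_score = math.sqrt((1.0 - draw_rate) / 4.0 / games)
    # Slope of the logistic Elo curve at 50%: 400 / (ln 10 * 0.25) ~ 695 Elo per unit of score
    elo_per_score = 1600.0 / math.log(10.0)
    return 1.96 * elo_per_score * sigma_score
```

Under these assumptions a 10K-game match resolves only about +/-5.5 Elo at 95% confidence, while roughly 60K games are needed to get near +/-2 Elo, which matches the numbers quoted later in this thread.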
So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets. But before starting I have to validate the testing framework; here is what I am planning to do:
STEP 1: RELIABILITY
I'll run a gauntlet of 10K games of SF compiled by me against a pool of engines, including Jim's official SF 2.1 release. I have chosen a TC of 1"+0.1" (LittleBlitzer's default), single thread. I will repeat the test 3 times; if the results of the 3 runs are not the same with high accuracy, then LB is not reliable and I will stop the validation process without further attempts.
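One way to make "the same with high accuracy" precise is a chi-square homogeneity test across the three runs. A minimal sketch (the per-run W/D/L layout, the example numbers, and the 95% critical value are my assumptions, not part of the plan above):

```python
def chi2_homogeneity(runs):
    """runs: list of (wins, draws, losses) tuples, one per gauntlet run.
    Returns the chi-square statistic for the hypothesis that all runs
    come from the same underlying W/D/L distribution."""
    row_totals = [sum(r) for r in runs]
    col_totals = [sum(c) for c in zip(*runs)]
    grand = sum(row_totals)
    stat = 0.0
    for i, run in enumerate(runs):
        for j, observed in enumerate(run):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# df = (3 runs - 1) * (3 outcomes - 1) = 4; 95% critical value ~ 9.488
CRITICAL_95 = 9.488

# Hypothetical results of three 10K-game runs
runs = [(4700, 2600, 2700), (4650, 2640, 2710), (4720, 2590, 2690)]
consistent = chi2_homogeneity(runs) < CRITICAL_95
```

If the statistic exceeds the critical value, the three runs differ by more than sampling noise alone would explain.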
STEP 2: SCALABILITY
I'll run the same gauntlet but at a 10"+0.1" TC (it will take a while!), again single thread. If the results are far apart, then we have scalability problems and will need to find a better TC (I really hope not!)
STEP 3: MULTI-TOURNAMENT
If we are lucky and also pass step 2, then I will run the same gauntlet again at 1"+0.1", but this time running 2 tournaments in parallel (I have a QUAD, so I can allocate 1 engine per CPU). In this case too, the results of the 10K-game gauntlet should be consistent with the previous tests.
Unfortunately our main testing framework is under Linux and LittleBlitzer is available for Windows only, but if it proves good I can use it on my QUAD as a useful validation/verification tool.
Setting up a testing framework
Moderators: hgm, Rebel, chrisw
Re: Setting up a testing framework
mcostalba wrote:So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets.
Perhaps you gave up too easily on cutechess-cli. It is trivial to write a little script (my favorite language is Python) that uses cutechess-cli as a driver to run all kinds of tournaments (gauntlet, round robin, team fight). This is what I currently use for testing.
No GUI can beat the flexibility of a scripting language.
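For example, such a gauntlet driver can be a few lines of Python that shell out to cutechess-cli once per pairing. This is only a sketch: the engine paths are placeholders, and the exact option set depends on your cutechess-cli version.

```python
import subprocess  # used only when actually launching the pairings

def gauntlet_commands(hero, opponents, games=1000, tc="1+0.1"):
    """Build one cutechess-cli invocation per hero-vs-opponent pairing."""
    cmds = []
    for opp in opponents:
        cmds.append([
            "cutechess-cli",
            "-engine", "cmd=" + hero,
            "-engine", "cmd=" + opp,
            "-each", "proto=uci", "tc=" + tc,
            "-games", str(games),
            "-pgnout", "gauntlet.pgn",
        ])
    return cmds

# Launch sequentially (uncomment to actually run the gauntlet):
# for cmd in gauntlet_commands("./stockfish-dev",
#                              ["./sf-2.1", "./critter", "./houdini"]):
#     subprocess.call(cmd)
```

Because the pairings are built as plain command lists, the same script can trivially be extended to run them concurrently or to reorder them into a round robin.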
Re: Setting up a testing framework
mcostalba wrote:I have chosen a TC of 1"+0.1" (LittleBlitzer's default), single thread. [...]
If you mean 1 sec + 0.1 sec increment, that's not very good. It is essentially 0.1 seconds per move, period. You will use that 1 second quickly, or not at all. My fastest testing is 10 secs + 0.1 secs, and even that is a challenge for most programs, and you see too many time losses stack up.
The first thing you have to do is run several thousand games against each opponent and extract the time losses. In my 30K-game matches, I see 1 or 2 at most, none by Crafty...
Re: Setting up a testing framework
Michel wrote:It is trivial to write a little script (my favorite language is Python) that uses cutechess-cli as a driver to run all kinds of tournaments (gauntlet, round robin, team fight). This is what I currently use for testing. No GUI can beat the flexibility of a scripting language.
+1
Ruby here.
Miguel
Re: Setting up a testing framework
bob wrote:If you mean 1 sec + 0.1 sec increment, that's not very good. [...] My fastest testing is 10 secs + 0.1 secs, and even that is a challenge for most programs, and you see too many time losses stack up. The first thing you have to do is run several thousand games against each opponent and extract the time losses. In my 30K-game matches, I see 1 or 2 at most, none by Crafty...
I just ran one quick test with 1s + 0.1s between 4 engines (see below). In this setup it takes about 15 seconds per game. In this test there was no time loss; it looks like those engines can manage such short time controls very well.
Code:
Games Completed = 120 of 120 (Avg game length = 14.902 sec)
Settings = RR/128MB/1000ms+100ms/M 800cp for 12 moves, D 450 moves/PGN:J:\Chess\swcr-fq-openings-v4.1-main-files\swcr-fq-openings-v4.1.pgn(3395)
Time = 1858 sec elapsed, 0 sec remaining
1. Stockfish 2.1 JA 64bit 29.5/60 20-21-19 (L: m=1 t=0 i=0 a=20) (D: r=7 i=5 f=7 s=0 a=0) (tpm=97.2 d=15.5 nps=4577493)
2. Critter 1.01 64-bit SSE4 16.0/60 6-34-20 (L: m=10 t=0 i=0 a=24) (D: r=13 i=4 f=3 s=0 a=0) (tpm=97.5 d=14.0 nps=5098380)
3. Houdini 1.5 x64 41.0/60 32-10-18 (L: m=1 t=0 i=0 a=9) (D: r=9 i=3 f=6 s=0 a=0) (tpm=98.6 d=14.8 nps=7787903)
4. IvanHoe 9.47b x64 33.5/60 22-15-23 (L: m=3 t=0 i=0 a=12) (D: r=13 i=6 f=4 s=0 a=0) (tpm=96.8 d=14.6 nps=5610791)
Re: Setting up a testing framework
bob wrote:If you mean 1 sec + 0.1 sec increment, that's not very good. It is essentially 0.1 seconds per move, period. You will use that 1 second quickly, or not at all. My fastest testing is 10 secs + 0.1 secs, and even that is a challenge for most programs, and you see too many time losses stack up.
Plain and simple: whatever client/GUI (or custom scripts) you use for your testing, it sucks, and it sucks big. Do yourself a favor and stop doing it.
I regularly test different engines at a 1''+0.1'' TC in cutechess-cli, in LittleBlitzer, and in WinBoard (all under Windows), and I never have any time losses (unless it's a problematic engine version, which I weed out easily). And I'm talking about hundreds of thousands of games.
Re: Setting up a testing framework
Michel wrote:It is trivial to write a little script (my favorite language is Python) that uses cutechess-cli as a driver to run all kinds of tournaments (gauntlet, round robin, team fight). This is what I currently use for testing. No GUI can beat the flexibility of a scripting language.
Using cutechess-cli, I also tried an MS-DOS batch script, even making it dynamic: if the engine under test is 50 Elo or worse (settable) than the model engine after 2K games (settable), the batch script stops the test; if it passes, the test continues until the test engine gets a significant Elo lead (settable) after a number of games (settable). The Elo check can also be done after every 2K games (settable).
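The dynamic stopping rule described above can be sketched in a few lines. Note this is a hypothetical Python reimplementation, not the poster's batch file, and the plain Elo thresholds stand in for a proper statistical significance test:

```python
import math

def elo_diff(score_fraction):
    """Convert a match score fraction (0..1, exclusive) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)

def check_progress(wins, draws, losses, fail_margin=-50.0, pass_margin=50.0):
    """Run after every batch (e.g. every 2K games): stop early on a clear
    fail or pass, otherwise keep playing. Margins are settable."""
    games = wins + draws + losses
    diff = elo_diff((wins + 0.5 * draws) / games)
    if diff <= fail_margin:
        return "fail"      # e.g. 50 Elo or worse below the model engine
    if diff >= pass_margin:
        return "pass"      # clearly ahead, accept the change
    return "continue"
```

A real implementation would also account for the error margin at each checkpoint, since repeated looks at the data inflate the false-positive rate.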
Re: Setting up a testing framework
mcostalba wrote:Initial test results of SF 2.1 are much less than stellar, to say the least. [...]
I think you are talking about only one test, which showed a 7 +/- 11 Elo (95% confidence) increase. That particular test IMHO had problems. Yes, self-testing exaggerates the differences, but I am expecting at least a 15-20 Elo increase in a gauntlet. Use LittleBlitzer 2.72 with some not-too-middlegamish openings; I use swcr.pgn and 8moves.epd. Randomize them (there is an option in LittleBlitzer), and you can use them even for 60K-game matches (2 Elo error margins at 95% confidence). 1s+0.1s should be fine; below that, some engines do not use their time properly (check the tpm values). Don't totally discard self-testing; it is actually the most sensitive (maybe too sensitive) to changes. If in LittleBlitzer there are time losses or illegal-move losses at a rate of more than 1/1000, increase the TC.
Kai
Re: Setting up a testing framework
Milos wrote:Plain and simple, whatever client/GUI (or custom scripts) you use for your testing, it sucks, and it sucks big. Do yourself a favor and stop doing it. I regularly test different engines at a 1''+0.1'' TC in cutechess-cli, in LittleBlitzer, and in WinBoard (all under Windows), and I never have any time losses (unless it's a problematic engine version, which I weed out easily). And I'm talking about hundreds of thousands of games.
I don't use a GUI. I use a custom-written referee that lets Crafty play thousands of game-in-1-second games with no time losses. If one knows what one is doing, this is not hard. On the other hand, plenty of programs have problems with that fast a time control; I've reported on them in the past. The engines I use don't lose on time either, but almost every time I try to add something new, that program won't like very fast time controls...
Not much I can do about it. You, on the other hand, might actually try reading a post every now and then. In the last sentence, which you cleverly snipped:
bob wrote:In my 30K-game matches, I see 1 or 2 at most, none by Crafty...
I don't see _any_ time losses with the set of programs I currently use. From that, one might infer that the referee program I use works well also. One that reads...
Re: Setting up a testing framework
Laskos wrote:Yes, self-testing exaggerates the differences
Just to be clear, exaggerating the differences is a good thing IMHO, and is one of the main reasons why we use self-testing. We really want the differences to be exaggerated, so as to require fewer games to separate potentially good stuff from garbage.
Our main testing framework is based on Linux + cutechess + self-test, and I think we will stick to that. What I am wondering is whether this alone is enough. We really need a testing pipeline where the potentially good changes that pass the self-test filter are further processed with a LittleBlitzer gauntlet step before being committed.
Our main test framework currently uses 4 threads per engine (it is a QUAD). If the scalability and multi-tournament tests prove successful, I could switch to 1 thread per engine (this should also have the side benefit of reducing noise due to SMP nondeterminism) and run 2, 3, or even 4 instances of cutechess in parallel (no pondering), depending on the scalability test results.