I've read about LOS, Elo error margins and similar topics in this forum, and in the end it all comes down to playing a great number of games between your engine and older versions of it, or against other sparring engines.
My question is: if I want to test my engine A against an older version B and know with confidence whether the new version is better, how can I generate so many games? I suspect that using an opening book is not recommended, because many games can be repeated when the moves are chosen randomly from the book. It happened to me in Arena.
Right now I'm using a small opening database with 80 different lines, and I switch colors to produce a set of different games.
Should I get a larger database with, for example, 1000 games? Or is there another way?
Can anybody give me a clue (or a big opening db)?
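To make the "LOS" and "error margin" terms concrete, here is a minimal Python sketch of how they are commonly estimated from a match result. It is an illustration only: the function name `match_stats` is hypothetical, the Elo conversion is the standard logistic formula, and both the 95% interval and the LOS use a normal approximation, which is an assumption that only holds reasonably well for large numbers of games.

```python
import math


def match_stats(wins, draws, losses):
    """Estimate (elo, elo_low, elo_high, los) for engine A against engine B.

    wins/draws/losses are counted from engine A's point of view.
    Hypothetical helper; normal approximation, so treat small samples with care.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")

    # Mean score per game (win = 1, draw = 0.5, loss = 0).
    score = (wins + 0.5 * draws) / n

    # Per-game variance of the score, then standard error of the mean.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)

    def to_elo(s):
        # Clamp to avoid log of zero or division by zero at 0% / 100%.
        s = min(max(s, 1e-6), 1.0 - 1e-6)
        return -400.0 * math.log10(1.0 / s - 1.0)

    # Likelihood of superiority: draws carry no information here.
    decisive = wins + losses
    if decisive == 0:
        los = 0.5
    else:
        los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

    return (to_elo(score),
            to_elo(score - 1.96 * se),
            to_elo(score + 1.96 * se),
            los)


if __name__ == "__main__":
    elo, low, high, los = match_stats(350, 330, 320)
    print(f"Elo {elo:+.1f} [{low:+.1f}, {high:+.1f}], LOS {los:.3f}")
```

The interval shrinks only with the square root of the number of games, which is why resolving a difference of a few Elo takes thousands of games rather than hundreds.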
A newbie question about testing
-
- Posts: 214
- Joined: Thu Sep 01, 2011 5:38 pm
- Location: Seville, Spain
-
- Posts: 893
- Joined: Mon Jan 15, 2007 11:23 am
- Location: Warsza
Re: A newbie question about testing
If your engine uses the UCI protocol, I'd recommend downloading LittleBlitzer (http://www.kimiensoftware.com/software/downloads) and a decent set of EPD positions (Prof. Robert Hyatt has a nice set of 4000 of them on his FTP site). You can play much faster games with LittleBlitzer than with Arena, which can have an unpleasant lag at times. Nowadays I use Arena only when I want to watch my program's games as they are played.
If you use the xboard protocol, there are WinBoard-compatible tournament manager programs.
Pawel Koziol
http://www.pkoziol.cal24.pl/rodent/rodent.htm
-
- Posts: 214
- Joined: Thu Sep 01, 2011 5:38 pm
- Location: Seville, Spain
Re: A newbie question about testing
Thanks a lot.
My engine uses UCI. I'm going to download both now, LittleBlitzer and the EPD set.
-
- Posts: 557
- Joined: Sun Feb 18, 2007 11:07 pm
- Location: Almeria. SPAIN
Re: A newbie question about testing
asanjuan wrote: "Thanks a lot. My engine uses UCI, I'm going to download them now, both LittleBlitzer and the EPD set."
LB is a great tool but, IMHO, the EPD set is not a good idea. Your engine won't play tournaments from EPDs but from a book. Use a good book for your engine and your opponents instead.
Saludos, Andres
-
- Posts: 214
- Joined: Thu Sep 01, 2011 5:38 pm
- Location: Seville, Spain
Re: A newbie question about testing
Andrés, am I supposed to delete repeated games after using a book? That is what you told me once, but it is exactly what I want to avoid.
I'd like to hear more opinions.
-
- Posts: 613
- Joined: Sun Jan 18, 2009 7:03 am
Re: A newbie question about testing
Andres Valverde wrote: "LB is a great tool, but IMHO, the EPD set is not a good idea. Your engine won't play tournaments from EPDs but from a book. Use a good book for your engine and your opponents instead."
I think this is only a question of taste. An EPD set or a large book will do just as well, but the end positions must be balanced, giving chances to both sides.
The only thing that really matters for the reliability of the result is to get enough variance...
Joona Kiiski
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: A newbie question about testing
zamar wrote: "I think this is only a question of taste. EPD or large book will do just as well. But the end positions must be balanced giving chances for both sides. The only thing that really matters for the reliability of the result is to get enough variance..."
That particular EPD file from Bob Hyatt is not the best thing to use with LB. It is indeed balanced, but a bit too deep into the middlegame. Besides that, many positions differ by one move only. One could try some 8-12 move PGN files, for example from the SWCR games (I think Frank put some files up for download). I mean, if one needs pretty full, independent games of chess.
Kai
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A newbie question about testing
Laskos wrote: "That particular EPD file from Bob Hyatt is not the best thing to use with LB. It is balanced indeed, but a bit too deep into middlegame. Besides that, many positions vary by one move only. One could try some 8-12 movers PGN files, for example from SWCR games (I think Frank put some files for download). I mean, if one needs pretty full, independent games of chess."
Couple of points. First, those positions are just 12 moves into a game, every one. So I am not sure what you mean by "a bit too deep into the middlegame". Second, the goal was to provide a representative sample of all popular openings. I did that by choosing PGN from strong players, and eliminating duplicate positions. This set of positions represents the most popular 4,000 positions from millions of PGN games between IM/GM players only...
I don't like testing with books if your goal is to tune an engine in the general sense.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: A newbie question about testing
bob wrote: "Couple of points. First, those positions are just 12 moves into a game, every one. So I am not sure what you mean by 'a bit too deep into the middlegame'. Second, the goal was to provide a representative sample of all popular openings. I did that by choosing PGN from strong players, and eliminating duplicate positions. This set of positions represents the most popular 4,000 positions from millions of PGN games between IM/GM (only) games... I don't like testing with books. If your goal is to tune an engine in the general sense."
Sorry, then I am missing something. Did you modify it? I am sure I saw some positions that differed by one half-move only. I am also sure I saw some with quite diminished material. I checked the file about a year ago; maybe it's different now? If all of them are unique, balanced, representative 12-movers, then it's probably very adequate. Yes, I don't like testing with books either, and with LB, testing with an identical, arbitrarily long book is probably plainly wrong.
Kai
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A newbie question about testing
Laskos wrote: "Sorry, then I miss something. Did you modify it? I am sure I saw some positions which were different by 1 half-move only. I am also sure I saw some with a quite diminished material. I checked the file ~ a year ago, maybe it's different now? If all of them are unique, balanced, representative 12 movers, then it's probably very adequate. Yes, I don't like testing with books, and with LB, testing using an identical, arbitrarily long book is probably plainly wrong."
I doubt you missed anything with regard to the 1/2 move idea. Here's my algorithm to produce those positions.
(1) I modified Crafty's book create code so that whenever it reaches move 11 with white to move (10 full moves have been played) then it spits out a FEN string for that position. If you have 10M games, you get 10M FEN strings, assuming every game went to at least 11 moves (no GM draws).
(2) I then sort that huge batch of FEN positions to get identical positions consecutive in the file.
(3) I use uniq -c which collapses all duplicated positions into one line in the file, where each resulting position has a count on the front showing how many times that FEN was duplicated.
(4) I then sort again, but this time using the count, and I sort in descending order so that the most frequently played positions come first.
(5) To clean it up, I remove the "count" and choose the first N entries since they were the most popular.
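Steps (2) to (5) above can be sketched in a few lines of Python for anyone who wants to try this on their own game collection. This is only an illustration and not the actual Crafty or shell code: the function name `most_popular_positions` is hypothetical, and it assumes step (1) has already produced one FEN string per game. Keeping only the first four FEN fields (so that move counters cannot make identical positions look different) is also an assumption of this sketch, not something stated in the post.

```python
from collections import Counter


def most_popular_positions(fens, n):
    """Return the n most frequent positions, most popular first.

    fens: iterable of FEN strings, one per game (output of step 1).
    Hypothetical helper mirroring sort | uniq -c | sort by count | take first N.
    """
    def key(fen):
        # Assumption: compare positions on the first four FEN fields only
        # (pieces, side to move, castling rights, en passant square).
        return " ".join(fen.split()[:4])

    # Steps (2) and (3): group identical positions and count them.
    counts = Counter(key(fen) for fen in fens)

    # Steps (4) and (5): order by descending count, drop the count,
    # keep the first n entries.
    return [position for position, _ in counts.most_common(n)]


if __name__ == "__main__":
    sample = [
        "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
        "rnbqkbnr/pppppppp/8/8/3P4/8/PPP1PPPP/RNBQKBNR b KQkq d3 0 1",
        "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
    ]
    for position in most_popular_positions(sample, 2):
        print(position)
```

The sample positions are after one move only to keep the example short; in the procedure described above they would be positions at move 11.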
For some openings, it is likely there are several positions that are very close. If there are two popular moves at (say) move 8, then you might get 1/2 of the resulting games for the first move, and 1/2 for the second. If they are still among the most popular, even though they "split the vote", they would both be included.
To make sure things were not terribly unbalanced, such that white (or black) is winning (hard to do in IM/GM games at move 11, of course), I then played a bunch of cluster matches and extracted the positions where one player won both games (a split of win-lose or draw-draw suggests pretty equal chances). I then took those problematic positions and played them at a longer time control. Out of the 4,000 I posted, I ended up with two to four that looked unbalanced until you searched deeply enough to discover they were still balanced; they just required a little time to handle properly.
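The first-pass filter described above might look roughly like the following Python sketch. It rests on an interpretation that the post does not spell out: "one player won both games" is read here as the same side of the board (White or Black) winning both games played from a position, with results given as standard PGN result strings. The function name `flag_unbalanced` and the input layout are hypothetical.

```python
def flag_unbalanced(results):
    """Return positions where the same colour won both games.

    results: dict mapping a position (e.g. a FEN string) to a pair of
    PGN result strings, "1-0", "0-1" or "1/2-1/2", one per game played
    from that position. Hypothetical layout, assumed for illustration.
    """
    flagged = []
    for position, (first, second) in results.items():
        # A win-loss split or two draws suggests roughly equal chances;
        # the same decisive result twice marks the position for a re-check.
        if first == second and first in ("1-0", "0-1"):
            flagged.append(position)
    return flagged


if __name__ == "__main__":
    sample = {
        "position A": ("1-0", "1-0"),
        "position B": ("1-0", "0-1"),
        "position C": ("1/2-1/2", "1/2-1/2"),
    }
    for position in flag_unbalanced(sample):
        print("re-test at a longer time control:", position)
```

Flagged positions are only candidates: as the post notes, replaying them at a longer time control showed that nearly all of them were balanced after all.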
I then randomized the file (sort -R) just to avoid having a bunch of related positions played first, so that when I look at early results I don't see a whopping win or loss advantage, only to discover that as the matches are played out, things begin to balance out more normally...
I've never claimed these positions were optimal, nor even good. But they have provided a stable test platform for tuning, and since they include all popular openings (Ruy, Giuoco, French, Sicilian, queen-pawn, etc.) it gives me confidence that I am not tuning to favor one opening over another, which I have done in the past...