A newbie question about testing

asanjuan
Posts: 214
Joined: Thu Sep 01, 2011 5:38 pm
Location: Seville, Spain

Post by asanjuan »

I've read about LOS, Elo error margins, and things like that in this forum, and in the end it all comes down to playing a large number of games between your engine and older versions or other sparring engines.
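For readers new to these terms, here is a minimal sketch of how an Elo estimate, its error margin, and LOS (likelihood of superiority) are commonly computed from a win/draw/loss count. The function names are made up for illustration, the 95% interval uses a normal approximation, and the LOS formula is the usual erf approximation that ignores draws; none of this is taken from a specific tool.

```python
import math


def elo_from_score(score):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    # Clamp to avoid a division by zero or log of zero at exactly 0 or 1.
    score = min(max(score, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_stats(wins, draws, losses):
    """Return (elo, elo_low, elo_high, los) from engine A's point of view."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games

    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    stderr = math.sqrt(variance / games)

    # Approximate 95% interval on the score, converted to Elo.
    elo_low = elo_from_score(score - 1.96 * stderr)
    elo_high = elo_from_score(score + 1.96 * stderr)

    # LOS: probability that A is the stronger engine, from decisive games only.
    decisive = wins + losses
    if decisive == 0:
        los = 0.5
    else:
        los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

    return elo_from_score(score), elo_low, elo_high, los
```

Because the standard error shrinks only with the square root of the number of games, halving the error margin takes about four times as many games, which is why so many games are needed.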

My question is: if I want to test my engine A against an older version B, and want to know with confidence whether the new version is better, how can I generate so many games? I suspect that using an opening book is not recommended, because many games can be repeated when the moves are chosen randomly from the book. It happened to me in Arena.

Right now I'm using a small opening database with 80 different lines, and I switch colors to produce a set of different games.

Should I get a larger database with, for example, 1000 games? Or is there another way?

Can anybody give me a clue (or a big opening db)?
PK
Posts: 893
Joined: Mon Jan 15, 2007 11:23 am
Location: Warsza

Re: A newbie question about testing

Post by PK »

If your engine uses the UCI protocol, I'd recommend downloading LittleBlitzer (http://www.kimiensoftware.com/software/downloads) and a decent set of EPD positions (Prof. Robert Hyatt has a nice set of 4000 of them on his FTP site). You can play much faster games with LittleBlitzer than with Arena, which can lag unpleasantly at times. Nowadays I use Arena only when I want to watch my program's games as they are played.

If you use the xboard protocol, there are WinBoard-compatible tournament manager programs.
asanjuan
Posts: 214
Joined: Thu Sep 01, 2011 5:38 pm
Location: Seville, Spain

Re: A newbie question about testing

Post by asanjuan »

Thanks a lot.
My engine uses UCI. I'm going to download both now, LittleBlitzer and the EPD set.
Andres Valverde
Posts: 557
Joined: Sun Feb 18, 2007 11:07 pm
Location: Almeria. SPAIN

Re: A newbie question about testing

Post by Andres Valverde »

asanjuan wrote:Thanks a lot.
My engine uses UCI, i'm going to download them now, both LittleBlitzer and the epd set.
LB is a great tool, but IMHO, the EPD set is not a good idea. Your engine won't play tournaments from EPDs but from a book. Use a good book for your engine and your opponents instead.
Saludos, Andres
asanjuan
Posts: 214
Joined: Thu Sep 01, 2011 5:38 pm
Location: Seville, Spain

Re: A newbie question about testing

Post by asanjuan »

Andrés, am I supposed to delete repeated games after using a book? That is what you told me once, but it is exactly what I want to avoid.
I'd like to know more opinions.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: A newbie question about testing

Post by zamar »

Andres Valverde wrote: LB is a great tool, but IMHO, the EPD set is not a good idea. Your engine won't play tournaments from EPDs but from a book. Use a good book for your engine and your opponents instead.
I think this is only a question of taste. An EPD set or a large book will do just as well, but the end positions must be balanced, giving chances to both sides.

The only thing that really matters for the reliability of the result is to get enough variance...
Joona Kiiski
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: A newbie question about testing

Post by Laskos »

zamar wrote:
I think this is only a question of taste. EPD or large book will do just as well. But the end positions must be balanced giving chances for both sides.

The only thing that really matters for the reliability of the result is to get enough variance...
That particular EPD file from Bob Hyatt is not the best thing to use with LB. It is balanced indeed, but a bit too deep into middlegame. Besides that, many positions vary by one move only. One could try some 8-12 movers PGN files, for example from SWCR games (I think Frank put some files for download). I mean, if one needs pretty full, independent games of chess.

Kai
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A newbie question about testing

Post by bob »

Laskos wrote:
That particular EPD file from Bob Hyatt is not the best thing to use with LB. It is balanced indeed, but a bit too deep into middlegame. Besides that, many positions vary by one move only. One could try some 8-12 movers PGN files, for example from SWCR games (I think Frank put some files for download). I mean, if one needs pretty full, independent games of chess.

Kai
A couple of points. First, those positions are just 12 moves into a game, every one, so I am not sure what you mean by "a bit too deep into the middlegame". Second, the goal was to provide a representative sample of all popular openings. I did that by choosing PGN from strong players and eliminating duplicate positions. This set represents the 4,000 most popular positions from millions of PGN games between IM/GM players only...

I don't like testing with books if your goal is to tune an engine in the general sense.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: A newbie question about testing

Post by Laskos »

bob wrote:
Couple of points. First, those positions are just 12 moves into a game, every one. So I am not sure what you mean by "a bit too deep into the middlegame". Second, the goal was to provide a representative sample of all popular openings. I did that by choosing PGN from strong players, and eliminating duplicate positions. This set of positions represents the most popular 4,000 positions from millions of PGN games between IM/GM (only) games...

I don't like testing with books. If your goal is to tune an engine in the general sense.
Sorry, then I am missing something. Did you modify it? I am sure I saw some positions that differed by only one half-move. I am also sure I saw some with quite diminished material. I checked the file about a year ago, so maybe it's different now? If all of them are unique, balanced, representative 12-movers, then it's probably very adequate. Yes, I don't like testing with books either, and with LB, testing with an identical, arbitrarily long book is probably plainly wrong.

Kai
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A newbie question about testing

Post by bob »

Laskos wrote:
Sorry, then I miss something. Did you modify it? I am sure I saw some positions which were different by 1 half-move only. I am also sure I saw some with a quite diminished material. I checked the file ~ a year ago, maybe it's different now? If all of them are unique, balanced, representative 12 movers, then it's probably very adequate. Yes, I don't like testing with books, and with LB, testing using an identical, arbitrarily long book is probably plainly wrong.

Kai
I doubt you missed anything with regard to the 1/2 move idea. Here's my algorithm to produce those positions.

(1) I modified Crafty's book create code so that whenever it reaches move 11 with white to move (10 full moves have been played) then it spits out a FEN string for that position. If you have 10M games, you get 10M FEN strings, assuming every game went to at least 11 moves (no GM draws).

(2) I then sort that huge batch of FEN positions to get identical positions consecutive in the file.

(3) I use uniq -c which collapses all duplicated positions into one line in the file, where each resulting position has a count on the front showing how many times that FEN was duplicated.

(4) I then sort again, but this time using the count, and I sort in descending order so that the most frequently played positions come first.

(5) To clean it up, I remove the count and keep the first N entries, since they are the most popular.
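Steps (2) to (5) can be sketched in a few lines of Python. This is only an illustration of the same sort / uniq -c / sort-by-count / take-first-N idea, not the actual scripts used; the function name is made up, and positions are compared as exact strings, just as uniq would compare them.

```python
from collections import Counter


def most_popular_positions(fens, n):
    """Return the n most frequent FEN strings, most popular first.

    fens: iterable of FEN strings, one per game (step 1 output).
    """
    # Steps (2)+(3): count how many times each identical position occurs.
    counts = Counter(fen.strip() for fen in fens if fen.strip())

    # Step (4): order by count, descending.
    # Step (5): drop the counts and keep only the first n positions.
    return [fen for fen, _count in counts.most_common(n)]
```

With toy labels in place of real FEN strings, `most_popular_positions(["a", "b", "a", "c", "b", "a"], 2)` returns `["a", "b"]`.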

For some openings, it is likely there are several positions that are very close. If there are two popular moves at (say) move 8, then you might get half of the resulting games for the first move and half for the second. If they are still among the most popular, even though they "split the vote", they would both be included.

To make sure things were not terribly unbalanced, such that white (or black) is winning (hard to do in IM/GM games at move 11, of course), I then played a bunch of cluster matches and extracted the positions where one player won both games (a split of win-lose or draw-draw suggests pretty equal chances). I then took those problematic positions and played them at a longer time control. Out of the 4,000 I posted, I ended up with two to four that looked unbalanced until you searched deeply enough to discover they were still balanced; they just required a little time to handle properly.
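The first filtering pass described above could look roughly like this. It is a sketch under assumptions: the data layout (a dict mapping each position to the winners of its two games, with None for a draw) and the function name are hypothetical, and what counts as a "player" is left to whoever fills in the dict.

```python
def suspicious_positions(results):
    """Return positions where the same player won both games.

    results: dict mapping a position (e.g. a FEN string) to a pair
             (winner_of_game_1, winner_of_game_2); None means a draw.
    """
    flagged = []
    for position, (first, second) in results.items():
        # A win-lose split, a draw-draw pair, or a win plus a draw is
        # treated as "balanced enough"; only a double win is flagged.
        if first is not None and first == second:
            flagged.append(position)
    return flagged
```

The flagged positions are only candidates: as described, they would then be replayed at a longer time control before being judged unbalanced.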

I then randomized the file (sort -R) so that a bunch of related positions don't all get played first. Otherwise, when I look at early results, I might see a whopping win or loss advantage, only to discover that things balance out more normally as the matches are played out...
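The same shuffle can be done in Python if a reproducible order is wanted. A minimal sketch; the function name and the optional seed parameter are illustrative additions, not part of the procedure described above.

```python
import random


def shuffle_positions(positions, seed=None):
    """Return a shuffled copy of the position list, leaving the input untouched.

    Passing the same seed gives the same order, so a test run can be repeated.
    """
    shuffled = list(positions)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```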

I've never claimed these positions were optimal, nor even good. But they have provided a stable test platform for tuning, and since they include all popular openings (Ruy, Giuoco, French, Sicilian, queen-pawn, etc.), they give me confidence that I am not tuning to favor one opening over another, which I have done in the past...