What is your opinion about this testing methodology?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: What is your opinion about this testing methodology?

Post by Don »

Kempelen wrote: Should it work?

I have been thinking about ways to improve the results I get from my testing time. People usually run tournaments from the startup position, or tournaments with a very limited set of positions (e.g. 32), or tournaments with a lot of random positions. I assume everyone is doing this with a minimum of 1000 to 4000 games.

.... but ....

What about repeating the same tournament, with the same opponents and the same positions per opponent, assuming the set of positions is very large?

Example:
Game 1, against Crafty, black, position from FEN file 'myfenpositions.epd', position number 540
Game 2, against Critter, white, position from FEN file 'myfenpositions.epd', position number 3251
....
etc.

The idea is that the position numbers would always be the same and not chosen randomly, without repeating any FEN, but with enough variety.
The tournament file from the tournament manager would always be the same, with no need to recreate the tournament. The test would always repeat in exactly the same way.

Would results between tests be more accurate than with randomly chosen starting positions?
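To make the idea concrete, here is a minimal sketch of such a fixed schedule (my own illustration, not Kempelen's actual setup; the opponent names, file name, game counts and seed are all placeholders): the positions are drawn once with a fixed seed and written to a schedule file, which is then reused unchanged for every test run.

Code: Select all

# Minimal sketch of the fixed-schedule idea: pick the positions once with a
# fixed seed, write the schedule to a file, then reuse that file for every
# test run.  Opponent names, file name and counts are placeholders.
import random

OPPONENTS = ["Crafty", "Critter"]       # placeholder opponent list
EPD_FILE = "myfenpositions.epd"         # large, varied position set
GAMES_PER_OPPONENT = 1000               # must not exceed the number of FENs

def build_schedule(seed=12345):
    """Draw positions once (seeded); the same schedule comes out every time."""
    with open(EPD_FILE) as f:
        n_positions = sum(1 for line in f if line.strip())
    rng = random.Random(seed)
    schedule, game = [], 1
    for opponent in OPPONENTS:
        # sample without replacement so no FEN is repeated for this opponent
        indices = rng.sample(range(n_positions), GAMES_PER_OPPONENT)
        for i, idx in enumerate(indices):
            colour = "white" if i % 2 == 0 else "black"
            schedule.append((game, opponent, colour, idx))
            game += 1
    return schedule

def save_schedule(schedule, path="fixed_schedule.txt"):
    with open(path, "w") as f:
        for game, opponent, colour, idx in schedule:
            f.write(f"{game}\t{opponent}\t{colour}\t{idx}\n")

if __name__ == "__main__":
    save_schedule(build_schedule())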
I think your suggestion is very sensible.

With Komodo we do all sorts of exploratory testing, but our primary test to verify that a version is good is a gauntlet-style test where the candidate version of Komodo plays against at least 2 foreign programs for about 20,000 games.

The opening set is very shallow and very large - we don't tune Komodo to any particular book because we want it to play all positions well; it will be the future job of a book maker to build an optimized book. Our book is not deep because we also want Komodo to be able to play the opening reasonably well on its own. It is 5 moves per side, so that the programs are "on their own" as soon as possible while still getting lots of variety.

We actually have something like 35,000 openings, but they are chosen in a way that we don't just repeat the first 10,000 over and over again (for a 20,000-game match). They are played in pairs, so that each pair of programs always plays both sides of the same opening. A hash of the program names determines which subset (or, more accurately, which order) to play from the 35,000-opening set.
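Something along these lines, as a rough sketch only (this is not Komodo's actual code; the hash function and the numbers are assumptions for illustration): a hash of the two engine names picks a per-pair starting offset into the opening set, and each opening is then played twice with colours reversed.

Code: Select all

# Rough sketch of the pairing idea (not Komodo's actual code): both engines
# play both sides of every opening, and a hash of the pair of names picks
# the order in which the 35,000-opening set is walked.
import hashlib

N_OPENINGS = 35000

def opening_order(engine_a, engine_b):
    """Deterministic, pair-specific starting offset into the opening set."""
    key = f"{engine_a}|{engine_b}".encode()
    start = int(hashlib.sha256(key).hexdigest(), 16) % N_OPENINGS
    return [(start + i) % N_OPENINGS for i in range(N_OPENINGS)]

def schedule_games(engine_a, engine_b, n_games):
    """Return (opening_index, white, black) tuples; openings come in pairs."""
    games = []
    for opening in opening_order(engine_a, engine_b):
        games.append((opening, engine_a, engine_b))   # first game of the pair
        games.append((opening, engine_b, engine_a))   # colours reversed
        if len(games) >= n_games:
            break
    return games[:n_games]

# e.g. schedule_games("Komodo-dev", "Crafty", 20000) uses 10,000 distinct
# openings, each played once with either colour.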

We have some evidence that indicates playing foreign opponents is superior to self-play.

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What is your opinion about this testing methodology?

Post by bob »

Kempelen wrote: Should it work?

I have been thinking about ways to improve the results I get from my testing time. People usually run tournaments from the startup position, or tournaments with a very limited set of positions (e.g. 32), or tournaments with a lot of random positions. I assume everyone is doing this with a minimum of 1000 to 4000 games.

.... but ....

What about repeating the same tournament, with the same opponents and the same positions per opponent, assuming the set of positions is very large?

Example:
Game 1, against Crafty, black, position from FEN file 'myfenpositions.epd', position number 540
Game 2, against Critter, white, position from FEN file 'myfenpositions.epd', position number 3251
....
etc.

The idea is that the position numbers would always be the same and not chosen randomly, without repeating any FEN, but with enough variety.
The tournament file from the tournament manager would always be the same, with no need to recreate the tournament. The test would always repeat in exactly the same way.

Would results between tests be more accurate than with randomly chosen starting positions?
:) This is what many of us have been doing for years. I even posted a large set of starting positions on my ftp box, the very positions I use in my cluster testing. For each new version of Crafty, I play against the same opponents, using the same set of positions, playing each position with black and white to make sure there is no biased position that has an unwanted influence.

If you repeat for each version, the only thing that changes between tests is your program, which makes comparison pretty easy. You do need enough games. I get an error bar of +/- 4 Elo playing 30,000 game matches...
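Roughly speaking, an error bar like that comes from the per-game variance of the score. A back-of-the-envelope sketch (normal approximation; the W/D/L counts below are invented, and the exact figure depends on the draw rate and the confidence level used):

Code: Select all

# Back-of-the-envelope error bar on an Elo estimate from a fixed match
# (normal approximation).  The W/D/L counts below are made up.
import math

def elo(score):
    """Logistic score -> Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_error_bar(wins, draws, losses, z=1.96):
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n                  # mean score per game
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n         # per-game variance
    se = math.sqrt(var / n)                           # std. error of the mean
    return elo(score), elo(score - z * se), elo(score + z * se)

if __name__ == "__main__":
    # 30,000 games with roughly a 40% draw rate (invented numbers)
    print(elo_with_error_bar(9200, 11800, 9000))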
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What is your opinion about this testing methodology?

Post by bob »

Sven Schüle wrote:
Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if they are not in your engine's repertoire. Only use test positions where you want your engine to be. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine manages to improve on them, add other positions to be considered part of its repertoire.

But I have a bad feeling about it; to me the engine should be able to handle all kinds of positions: blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, but with enough positions that the engine's play is varied.
The point is that the positions are selected once at random, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long while now, and so have lots of other people, so it is not a new method but rather a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of choosing a fresh set of random positions each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Sven
It really doesn't influence the BayesElo error bar, obviously, but if you choose random openings you introduce one more variable: one run might choose openings you play well while the next uses openings you play poorly, producing a false impression.
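A toy simulation of that effect (all numbers invented, and a deliberately small 32-position set per run so the extra run-to-run spread is easy to see): when each position carries its own bias, re-drawing the positions for every run adds a "which positions did I get?" component on top of ordinary game noise, while a fixed set keeps that component constant between runs.

Code: Select all

# Toy illustration of the point above: re-drawing the opening set for every
# run adds a "which openings did I get?" variable on top of ordinary game
# noise, while a fixed set keeps that variable constant between runs.
# A deliberately small 32-position set is used so the effect is obvious;
# all numbers are invented.
import random
import statistics

random.seed(1)
POOL = 4000          # candidate positions to draw from
SET_SIZE = 32        # positions used per test run
GAMES_PER_POS = 60   # games played from each position in a run
RUNS = 200           # repeated runs of the same engine version

# expected score of the engine from each position (some favour one side)
bias = [max(-0.25, min(0.25, random.gauss(0.0, 0.12))) for _ in range(POOL)]
fixed_set = random.sample(range(POOL), SET_SIZE)     # drawn once, reused

def play_run(positions):
    """Average score of one run; win=1, draw=0.5, loss=0, 40% draw rate."""
    total, games = 0.0, 0
    for idx in positions:
        expected = 0.5 + bias[idx]
        p_win, p_draw = expected - 0.2, 0.4
        for _ in range(GAMES_PER_POS):
            r = random.random()
            total += 1.0 if r < p_win else (0.5 if r < p_win + p_draw else 0.0)
            games += 1
    return total / games

fixed = [play_run(fixed_set) for _ in range(RUNS)]
redrawn = [play_run(random.sample(range(POOL), SET_SIZE)) for _ in range(RUNS)]

print("run-to-run stdev, fixed positions   :", statistics.stdev(fixed))
print("run-to-run stdev, re-drawn positions:", statistics.stdev(redrawn))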