Testing procedure

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

gingell

Testing procedure

Post by gingell »

I've been doing my testing along the following lines: Make a change, compile with and without that change, and have the new and the old versions play maybe 100 games against each other. I understand that the error bars are still pretty wide with this number of games, but I've found it's a good enough starting point to see whether a change should be pursued further.
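
For scale, the score of an N-game match is just a mean of per-game results, so a back-of-the-envelope error bar can be computed like this (the draw rate and the example score below are assumed numbers, purely for illustration):

Code: Select all

#include <cmath>
#include <cstdio>

// Rough 95% margin on the score of an N-game match, treating each game
// result (0, 0.5 or 1) as an independent sample.
int main() {
    const double games    = 100.0;  // e.g. a 100-game match
    const double score    = 0.55;   // observed score fraction of the new version
    const double drawRate = 0.30;   // assumed; only affects the variance

    const double winRate = score - drawRate / 2.0;
    const double var     = winRate + 0.25 * drawRate - score * score; // E[x^2]-E[x]^2

    const double margin = 2.0 * std::sqrt(var / games);          // ~95% interval
    const double elo    = -400.0 * std::log10(1.0 / score - 1.0); // logistic model

    std::printf("score %.3f +/- %.3f (about %.0f Elo)\n", score, margin, elo);
    return 0;
}

With numbers like these the margin comes out around +/- 8% in score, which is several tens of Elo, so 100 games really only catch fairly large changes.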

The problem I have is the following. Since everything my engine does is deterministic, many of the games I see played out during testing look very similar to each other. For instance just eyeballing the games I see identical or at least very similar positions crop up numerous times.

I've tried two approaches to adding randomness. One is to add a small random number to the position evaluation function, and the other is to add small random time bonuses. This works, but I find that once I've added some randomness, my results over a series of tests take a very long time to converge. For instance, I can play 200 games of the identical binary against itself and still see very wide differences in the outcomes.
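
Concretely, the eval-noise version is just a tiny symmetric offset added to the static evaluation, something along these lines (the helper name and the 2-centipawn range here are only illustrative, not my exact code):

Code: Select all

#include <random>

// Hypothetical helper: perturb the static evaluation by a couple of
// centipawns so that deterministic play stops repeating the same games.
int addEvalNoise(int staticEval, int noiseCentipawns = 2) {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(-noiseCentipawns, noiseCentipawns);
    return staticEval + dist(rng);
}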

In some respects I suppose this outcome is exactly what I should have expected, and even what I wanted. I don't have access to hardware, though, that would make playing many more games than this practical.

I'm curious what approaches people have taken to this kind of problem. I do my testing with 1-second moves - if I reduce that to maybe 0.1 seconds, are the results I see going to be meaningful? If I add less randomness, how do I know I'm not playing out very similar games and getting garbage results?

Thanks for any comment.
kranium
Posts: 2129
Joined: Thu May 29, 2008 10:43 am

Re: Testing procedure

Post by kranium »

Hi Matthew-

I would highly recommend playing a lot more games (~1024), but at a very fast time control...for example: currently I test each and every change (even the tiniest) with 1024 games at 15 secs per side...this usually takes about 14 hours.

I would also highly recommend testing against a reasonable variety of opponents...for example: choose 8 engines, maybe 50 Elo apart, clustered around the current strength of your engine. One idea: set up Arena to load 32 common opening positions...and then create a gauntlet (where your engine plays as black and as white from each position against each of the reference engines)...for a total of 512 games. It's probably a good idea to run this gauntlet twice.
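
Just to make the bookkeeping concrete, that schedule works out as follows (pure arithmetic, nothing Arena-specific):

Code: Select all

#include <cstdio>

// Game count for the gauntlet described above: every opening position is
// played with both colors against each reference engine.
int main() {
    const int opponents = 8;
    const int openings  = 32;
    const int colors    = 2;

    const int perRun = opponents * openings * colors;  // 512
    std::printf("one run: %d games, two runs: %d games\n", perRun, 2 * perRun);
    return 0;
}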

If you have more than one computer, you can test multiple changes simultaneously, tracking the results with a spreadsheet.

In this manner you can be assured that your engine is taking tiny steps in the right direction (after a couple of months, you may realize a significant Elo improvement).

To do this well (thoroughly) you of course need multiple computers, really strict change control, documentation, lots of time, and loads of patience and determination.

Good luck!

PS: What's the name of your engine?
hgm
Posts: 27791
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing procedure

Post by hgm »

In principle, orthogonal multi-testing would allow you to evaluate 10 changes in 1024 games.
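
Roughly, the idea is to enable each candidate change independently (on or off) in every test version, and then estimate each change separately by comparing the games where it was on against the games where it was off. Below is only a toy sketch of that bookkeeping, with a random on/off assignment and a simulated game result standing in for real matches; it is not necessarily how HGM actually runs it:

Code: Select all

#include <array>
#include <cstdio>
#include <random>

constexpr int N_CHANGES = 10;
constexpr int N_GAMES   = 1024;

// Placeholder for "play one game with exactly this set of changes enabled".
// Here it just simulates a result so the sketch compiles and runs; a real
// harness would build or configure the engine and play an actual game.
double playGame(const std::array<bool, N_CHANGES>& enabled, std::mt19937& rng) {
    double p = 0.50 + (enabled[0] ? 0.02 : 0.0);   // pretend change 0 helps a bit
    std::bernoulli_distribution win(p);
    return win(rng) ? 1.0 : 0.0;
}

int main() {
    std::mt19937 rng(12345);
    std::bernoulli_distribution coin(0.5);

    std::array<double, N_CHANGES> scoreOn{}, scoreOff{};
    std::array<int, N_CHANGES>    gamesOn{}, gamesOff{};

    for (int g = 0; g < N_GAMES; ++g) {
        std::array<bool, N_CHANGES> enabled{};
        for (auto& e : enabled) e = coin(rng);     // random on/off per change

        double result = playGame(enabled, rng);    // game result in [0, 1]

        for (int i = 0; i < N_CHANGES; ++i) {
            if (enabled[i]) { scoreOn[i]  += result; ++gamesOn[i];  }
            else            { scoreOff[i] += result; ++gamesOff[i]; }
        }
    }

    // Each change's estimated effect: average score with it on minus off.
    for (int i = 0; i < N_CHANGES; ++i)
        std::printf("change %2d: %+.3f\n",
                    i, scoreOn[i] / gamesOn[i] - scoreOff[i] / gamesOff[i]);
    return 0;
}

The caveat is that this assumes the changes are more or less independent; a change that only helps in combination with another gets diluted in this kind of estimate.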
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Testing procedure

Post by mcostalba »

I think one of the most valuable things a knowledgeable contributor can give to the community is a paper with the title "The most resource-efficient way of testing a chess engine",

where "resource" is the product of time * hardware power.

I'm thinking of something smarter than saying "get a cluster and play 10,000,000 games per change". I don't have enough experience on this subject, but IMHO the best way to test CANNOT be independent of the change itself. If I were able to write such a paper, I would start by categorizing the kinds of changes, in particular with two categorizations. One I would call "opponent diversity":

- Speed optimizations only

- Algorithm efficiency (for example, how to store values in the TT)

and so on, up to

- Evaluation changes.

And the other, "time control diversity":

- Evaluation

- Speed optimizations

and so on, up to

- Extensions or something similar.

The goal of this approach is to classify a change, from the easy optimizations that can be tested in the simplest Mod vs. Orig fashion, up to the complex evaluation changes that require a pool of opponent engines to get reliable results. The same logic applies to time control: from the "ultra-fast timing is enough" changes to the "you need to verify at 40/40 to be sure" ones.

This hypothetical paper should give easy-to-follow steps for analyzing a patch in these respects and choosing the most appropriate testing framework. I think such a rational and resource-efficient "method" is something that is badly missing. At least for me.
gingell

Re: Testing procedure

Post by gingell »

My engine is "Chesley the Chess Engine." I began working on it back in January, when Chesley Sullenberger, the pilot who successfully landed a plane on the Hudson River, was in the news. My office overlooks the Hudson just a little further downtown, and I liked the idea of naming it after him.

Now that it's in reasonable shape and I've plucked most of the low-hanging fruit, I'm trying to develop a more rigorous testing strategy. I'm coming to understand that chess programming is as much an experimental science as anything else.

Thanks very much for your comments. I especially see the benefit of using a variety of openings and other opponents.

Matt
Ron Murawski
Posts: 397
Joined: Sun Oct 29, 2006 4:38 am
Location: Schenectady, NY

Re: Testing procedure

Post by Ron Murawski »

gingell wrote:My engine is "Chesley the Chess Engine." I began working on it back in January, when Chesley Sullenberger, the pilot who successfully landed a plane on the Hudson River, was in the news. My office overlooks the Hudson just a little further downtown, and I liked the idea of naming it after him.

Matt
Chesley the Chess Engine is added to the Private Engine List
http://computer-chess.org/doku.php?id=c ... ngine_list

If I've missed any other private engines, please let me know.

Ron
kranium
Posts: 2129
Joined: Thu May 29, 2008 10:43 am

Re: Testing procedure

Post by kranium »

hgm wrote:In principle, orthogonal multi-testing would allow you to evaluate 10 changes in 1024 games.
Dag H.G.-

how does it work?
kranium
Posts: 2129
Joined: Thu May 29, 2008 10:43 am

Re: Testing procedure

Post by kranium »

mcostalba wrote:I think one of the most valuable things a knowledgeable contributor can give to the community is a paper with the title "The most resource-efficient way of testing a chess engine",

where "resource" is the product of time * hardware power.

I'm thinking of something smarter than saying "get a cluster and play 10,000,000 games per change". I don't have enough experience on this subject, but IMHO the best way to test CANNOT be independent of the change itself. If I were able to write such a paper, I would start by categorizing the kinds of changes, in particular with two categorizations. One I would call "opponent diversity":

- Speed optimizations only

- Algorithm efficiency (for example, how to store values in the TT)

and so on, up to

- Evaluation changes.

And the other, "time control diversity":

- Evaluation

- Speed optimizations

and so on, up to

- Extensions or something similar.

The goal of this approach is to classify a change, from the easy optimizations that can be tested in the simplest Mod vs. Orig fashion, up to the complex evaluation changes that require a pool of opponent engines to get reliable results. The same logic applies to time control: from the "ultra-fast timing is enough" changes to the "you need to verify at 40/40 to be sure" ones.

This hypothetical paper should give easy-to-follow steps for analyzing a patch in these respects and choosing the most appropriate testing framework. I think such a rational and resource-efficient "method" is something that is badly missing. At least for me.
Hi Marco,
I agree completely...it's a really difficult aspect of development. Wouldn't it be a tremendous luxury to be able to implement a small change and know quickly and with certainty whether it is good or bad?

Unfortunately, this is impossible...I think the closest thing is ultra-fast testing.

i.e., if an engine wins at 15, 20, or 30 secs per game, will the same result manifest itself at longer time controls? Perhaps... maybe! ...probably? Why not? ...who knows?

I think yes, and by playing a lot of games the error margin decreases.

I'm fairly sure that Dr. Hyatt and Vas also test (at least somewhat) in this manner...(with so many years of testing experience, I'm sure Dr. Hyatt has it down to an exact science by now... :D)
hgm
Posts: 27791
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing procedure

Post by hgm »

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Testing procedure

Post by mcostalba »

Hi Norman,

thanks for your words. I am fully aware that you can appreciate such things, because working on improving a very strong engine, as you did with Cyclone, is the poster child of testing issues. Much more so than starting from scratch, because while you are adding important features to your engine the development itself takes the bigger part of the work...also because testing, for example, that null-move search is a good thing to have does not require enormous effort ;-)

But please consider the part of the title that says "most resource-efficient way": this is the key, especially for amateurs or for people who don't have big iron to play with.

I know that for Dr. Hyatt it is a science, but his science is to run 32,000 games per change. With that big an arsenal he can simply skip the analysis of what kind of change he is testing and which engines it is best to test against.

His answer to the question "What is the best gear for riding your bicycle to the top of that mountain?" is simply "Get a car!"...lucky him!

And I think the Rybka author can afford big iron just the same. So they simply don't need such a paper. Of course I don't claim they are not able to write it; especially Dr. Hyatt could do it easily, IMHO...but he doesn't, because he doesn't need to.

Returning to your ultra-fast idea: it is indeed _almost_ the best thing, but you have to be careful. As a trivial example, consider selective search. Futility pruning in Glaurung starts at ply 7, and it is more or less the same in Cyclone. So if your ultra-fast search does not reach at least ply 9-10 and the change you are testing is about futility, then you can get artifacts from ultra-fast games.
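
Just to illustrate the kind of condition I mean (the constants below are made up, not Glaurung's or Cyclone's actual values): a futility rule like this covers the last few plies before the horizon, so in an ultra-fast game where the whole search is only 8-9 plies deep almost the entire tree falls inside that window, which is not representative of how the rule behaves inside a much deeper search.

Code: Select all

// Generic depth-gated futility check, illustrative constants only. The rule
// covers the last FUTILITY_MAX_DEPTH plies before the horizon; whether that
// is a thin slice of the tree or nearly all of it depends on search depth.
constexpr int FUTILITY_MAX_DEPTH = 7;   // plies of remaining depth

inline bool canFutilityPrune(int depth, bool inCheck, bool pvNode,
                             int staticEval, int alpha) {
    if (pvNode || inCheck || depth > FUTILITY_MAX_DEPTH)
        return false;
    const int margin = 100 * depth;     // centipawns, purely illustrative
    return staticEval + margin <= alpha;
}
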

In any case a conscious analysis of _what_ you want to test is IMHO always required especially when testing with conditions (as ultra fast time controls) that are very far from official 40/40 games where your engine will be evaluated when released.