My new testing scheme

hristo

Re: My new testing scheme

Post by hristo »

Ross Boyd wrote: Absolutely the worst scenario is not having the ability to retreat completely when a promising idea turns out bad.... and that happens very frequently.
Indeed!
How do you handle the case where you are working on a particular change (implementing/testing a new idea) which might take months and at the same time you want to try something different (starting from your tested stable version)?

For instance, this is what has happened to me:
1) Modify the evaluation algorithm, which leads to several possibilities that are mutually exclusive.
1.1) Evaluation modification One
1.2) Evaluation modification Two
1.3) Evaluation modification Three
1.n) Evaluation modification ...
2) Modify the search ("Alpha-Beta", Negascout, something else)
2.1) Alpha-Beta
2.2) Negascout
3) Modify the transposition table handling
3.1) Position aging
3.2) Refutation table

I cannot easily reject some of the changes, since it takes a long time to determine their effect, but at the same time I don't like being stuck on a single issue when there are other avenues to explore. At some point, obviously, you have to move on and stop working on a given set of changes (branch) even though you might not have conclusive results, but it is still nice to have the steps that were taken on that branch available for evaluation (merge) at some later point.

Regards,
Hristo
adams161
Posts: 626
Joined: Sun May 13, 2007 9:55 pm
Location: Bay Area, CA USA
Full name: Mike Adams

Re: My new testing scheme

Post by adams161 »

Testing against your own program or just a few programs can be dicey.
For example, I've gone through something like 10 versions of Pulsar playing atomic in one month, and one program can be better against another without really being better when people play it. It might just exploit one particular vulnerability in the original program while also introducing other weaknesses. This is particularly true for evaluation changes.

Basically, a small sample of opponents, such as testing against yourself, causes you to learn the opponent but not necessarily the crowd.

The best testing is rated internet games on an ICS, all at the same time control.

Mike
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: My new testing scheme

Post by Zach Wegner »

adams161 wrote: The best testing is rated internet games on an ICS, all at the same time control.
I don't know about that. It seems playing on an ICS would increase randomness a lot, and you don't really know the real ratings of everyone else. I play against "anchors": the same engines over and over, which have established ratings from thousands of games played by me and everyone else.

I do have a problem with the variety of opponents, though. I am on OS X, so I have to settle for Unix-compatible open source stuff. And I need stuff in the 2300-2500 range...


I am a little disappointed that nobody thought the application of UCB (upper confidence bounds) was interesting. That was the most important part of the post, IMO.

Anyway, since my original post I have found a formula that works well for me for distributing test games:

Code: Select all

score = ((float)(version[n].elo - 2300) / 200) +
    1.3*sqrtf(log((float)total_games / 2) / (5 * version[n].games / 2)) +
    1. / (versions - n + 2);
So it takes into account the Elo rating, the number of games played by both the version and all versions combined, and how recent the version is.
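
To make the UCB part concrete, here is a rough sketch of how a score like this can drive the distribution of test games: compute it for every version and give the next game to the version with the highest score. The version_info struct and the zero-games guard are illustrative assumptions, not my actual harness.

Code: Select all

#include <math.h>

struct version_info {
    int elo;    /* current rating estimate of this version */
    int games;  /* games this version has played so far */
};

/* pick which version gets the next test game by maximizing the score */
int pick_next_version(struct version_info *version, int versions,
                      int total_games)
{
    int best = 0;
    float best_score = -1.0e9f;

    for (int n = 0; n < versions; n++) {
        /* unplayed versions get immediate priority, which also avoids
           a division by zero in the exploration term below */
        if (version[n].games == 0)
            return n;

        float score =
            ((float)(version[n].elo - 2300) / 200) +    /* exploitation: rating */
            1.3f * sqrtf(log((float)total_games / 2) /
                         (5 * version[n].games / 2)) +  /* exploration: UCB-style */
            1.0f / (versions - n + 2);                  /* bonus for newer versions */

        if (score > best_score) {
            best_score = score;
            best = n;
        }
    }
    return best;
}
This assumes versions are indexed oldest first, so the last term favors the versions added most recently.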
Tony Thomas

Re: My new testing scheme

Post by Tony Thomas »

adams161 wrote: Testing against your own program or just a few programs can be dicey.
For example, I've gone through something like 10 versions of Pulsar playing atomic in one month, and one program can be better against another without really being better when people play it. It might just exploit one particular vulnerability in the original program while also introducing other weaknesses. This is particularly true for evaluation changes.

Basically, a small sample of opponents, such as testing against yourself, causes you to learn the opponent but not necessarily the crowd.

The best testing is rated internet games on an ICS, all at the same time control.

Mike
When he said a few opponents, he was talking about 30-game matches versus 14 different opponents, which gives 420 games. He also plays a few 100-game matches versus different opponents (not one of the 120) at a different time control, so you can't really say that the methodology is less accurate.
Ross Boyd
Posts: 114
Joined: Wed Mar 08, 2006 9:52 pm
Location: Wollongong, Australia

Re: My new testing scheme

Post by Ross Boyd »

hristo wrote:
Ross Boyd wrote: Absolutely the worst scenario is not having the ability to retreat completely when a promising idea turns out bad.... and that happens very frequently.
Indeed!
How do you handle the case where you are working on a particular change (implementing/testing a new idea) which might take months and at the same time you want to try something different (starting from your tested stable version)?

For instance, this is what has happened to me:
1) Modify the evaluation algorithm, which leads to several possibilities that are mutually exclusive.
1.1) Evaluation modification One
1.2) Evaluation modification Two
1.3) Evaluation modification Three
1.n) Evaluation modification ...
2) Modify the search ("Alpha-Beta", Negascout, something else)
2.1) Alpha-Beta
2.2) Negascout
3) Modify the transposition table handling
3.1) Position aging
3.2) Refutation table

I cannot easily reject some of the changes, since it takes a long time to determine their effect, but at the same time I don't like being stuck on a single issue when there are other avenues to explore. At some point, obviously, you have to move on and stop working on a given set of changes (branch) even though you might not have conclusive results, but it is still nice to have the steps that were taken on that branch available for evaluation (merge) at some later point.

Regards,
Hristo
It's really tough, isn't it!?
Particularly if you are someone who comes up with tons of ideas and has limited time available to experiment.

I have no answer to your/our problem, except to choose very carefully where you decide to spend your valuable programming time. Pursue only the most promising avenues first... then try the ones that seemed counter-intuitive... of course, you already know all this.

Last week I spent tons of time coding 'safe' mobility. It made everything slower and weaker... it just didn't pay off. I suspect that if I had persisted by (additionally) generating attack tables while traversing the mobility rays, it might have paid off by giving very accurate king attack data. However, I decided to cut my losses, pack up the source and return to the main development tree... such is life....
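
For what it's worth, the core idea was roughly the following: count mobility only to squares that are not controlled by enemy pawns. This is just a minimal sketch assuming a bitboard engine with hypothetical knight_attacks() and popcount() helpers, not the code I actually wrote.

Code: Select all

#include <stdint.h>

typedef uint64_t Bitboard;

/* hypothetical helpers, assumed to come from the engine's bitboard framework */
extern Bitboard knight_attacks(int square);
extern int popcount(Bitboard b);

/* 'safe' mobility for a knight: ignore destination squares covered by enemy pawns */
int safe_knight_mobility(int square, Bitboard own_pieces,
                         Bitboard enemy_pawn_attacks)
{
    Bitboard moves = knight_attacks(square) & ~own_pieces;
    Bitboard safe  = moves & ~enemy_pawn_attacks;

    return popcount(safe);
}
The catch is that enemy_pawn_attacks (or fuller attack information) has to be available before mobility is scored, which is extra work on every evaluation.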

Good luck,

Ross
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: My new testing scheme

Post by mjlef »

My testing scheme is a little different, but also much like that of many chess programmers I know.

I start off with an idea, and begin to implement it. About halfway through I change it a lot, then add in more stuff. I begin a test run, but a few games into the run I have another idea and implement that. A few more games get played, then I ignore the test results to try something else. It is important during this process to never get enough games played to reach any statistical conclusions at all!

After playing a bunch of games, I toss out the results and go with what feels good!

Mark :-)
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: My new testing scheme

Post by BubbaTough »

I have tried both serious formal test beds and Mark Lefler's approach, and I can confirm with 87% confidence that Mark's approach is the more effective one.

-Sam