If testing was done like this ...

Discussion of chess software programming and technical issues.

Moderator: Ras

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

If testing was done like this ...

Post by Michael Sherwin »

ver_A plays an 80-game match against each of 5 stronger opponents, and the number of distinct positions won is counted. So any number of wins from one side of one position is only counted as one. Pick the opponent engines so that anywhere from 40 to 60 points are acquired.

Then ver_B plays the same.

If ver_B scores more points (based on getting at least one win in each position), could that be a good indication that ver_B is better? How many more points are needed?

The idea is to ignore the random accumulation of pure score in favor of seeing if a new version can win positions that the earlier version could not win.

My gut feeling is that fewer games might be needed in a test like this.
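The counting rule proposed above can be sketched in a few lines of Python. The record format (position id, side, result) is just an assumption for illustration, not anything a particular testing tool produces:

```python
# Hypothetical sketch of the "distinct positions won" metric.
# Each game record is assumed to be (position_id, side, result), where
# result is "win", "loss", or "draw" from the tested engine's viewpoint.

def distinct_positions_won(games):
    """Count (position, side) pairs with at least one win.
    Multiple wins from the same side of the same position count once."""
    won = set()
    for position_id, side, result in games:
        if result == "win":
            won.add((position_id, side))
    return len(won)

games = [
    (1, "white", "win"),
    (1, "white", "win"),   # repeat win of the same position/side, counted once
    (1, "black", "loss"),
    (2, "white", "draw"),
    (2, "black", "win"),
]
print(distinct_positions_won(games))  # 2: (1, white) and (2, black)
```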
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: If testing was done like this ...

Post by bob »

Michael Sherwin wrote:ver_A plays an 80-game match against each of 5 stronger opponents, and the number of distinct positions won is counted. So any number of wins from one side of one position is only counted as one. Pick the opponent engines so that anywhere from 40 to 60 points are acquired.

Then ver_B plays the same.

If ver_B scores more points (based on getting at least one win in each position), could that be a good indication that ver_B is better? How many more points are needed?
The first thing to try is to run a few of these 80-game matches and measure the variability of the results, which will be _amazingly_ large. That is where I started in late 2006/early 2007.


The idea is to ignore the random accumulation of pure score in favor of seeing if a new version can win positions that the earlier version could not win.

My gut feeling is that fewer games might be needed in a test like this.
Unfortunately, once you test this, reality sets in, and you will discover you need to add three more zeros or so to the total number of games required, unless the changes being tested are _huge_ improvements.
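The variability bob describes is easy to see with a rough Monte Carlo sketch. The win/draw probabilities below are assumed for illustration (roughly a 50% expected score); with them, repeated 80-game matches swing by many points from run to run:

```python
import random

# Illustrative only: simulate repeated 80-game matches with assumed
# per-game win and draw probabilities and look at the score spread.
random.seed(1)

def match_score(n_games=80, p_win=0.30, p_draw=0.40):
    """Score of one simulated match (1 per win, 0.5 per draw)."""
    score = 0.0
    for _ in range(n_games):
        r = random.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score

scores = [match_score() for _ in range(1000)]
print(min(scores), max(scores))  # spread of many points across identical setups
```

With these parameters the per-match standard deviation is about 3.5 points, so individual 80-game results routinely differ by 10+ points even though nothing about the engines changed.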