Engine Testing - Statistics

Discussion of chess software programming and technical issues.

Moderators: Harvey Williamson, Dann Corbit, hgm

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Engine Testing - Statistics

UncombedCoconut wrote: If you're still interested in the idea I'll develop it further. I think one would want to pick a threshold ELO difference to detect, and optimize the expected # of games needed to see it. I'd bet the resulting rules would be related to a Sequential Probability Ratio Test (with an extra rule about when to call the programs equal.)
Yes, it is interesting, but I would think you need to validate your program with real data.

For instance simulating a match using rand(), like:

Code: Select all

``````int do_game&#40;int p_eng1, int p_eng2, int p_draw&#41;
&#123;
/* Sum of winning probabilities + draw must be 100 */
assert&#40;p_eng1 + p_eng2 + p_draw == 100&#41;;

/* number between 1-100 */
result = rand&#40;) % 100 + 1;

if &#40;result <= p_en1&#41;
return WHITE_WIN;
else if &#40;result <= p_en1 + p_eng2&#41;
return BLACK_WIN;
else
return DRAW;
&#125;
``````
And verify that calculated probabilities corresponds to the one measured through simulated matches.

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain
Contact:

Re: Engine Testing - Statistics

I just wrote a simulator myself now; with it I was able to reproduce Joseph Ciarrochi's table.

Furthermore I tried how many percent of the early stops, according to the cutoff values were proven good. Here four example results:

Code: Select all

``````alpha 5%
n = 10   ... cutoff = 70%
n = 20   ... cutoff = 65%
=> 29.7% of the early cutoffs at 10 instead of 20 were good

n = 100   ... cutoff = 57%
n = 200   ... cutoff = 55.2%
=> 33.8% of the early cutoffs at 100 instead of 200 were good

alpha 1%
n = 10   ... cutoff = 80%
n = 20   ... cutoff = 72.5%
=> 14.2% of the early cutoffs at 10 instead of 20 were good

n = 100   ... cutoff = 60%
n = 200   ... cutoff = 57%
=> 26.2% of the early cutoffs at 100 instead of 200 were good``````
So I would interpret these figures now as such as if for example I was running a 200 games tournament with alpha = 5%; and I wanted to cutoff after 100 games I would at least need a cutoff based on alpha = 5% * 33.8% = 1.69%
In this case the cutoff would be 59%

does this sound valid?

regards,
Edmund

bob
Posts: 20920
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Engine Testing - Statistics

zamar wrote:
bob wrote:
MattieShoes wrote:Stopping early because you're out of time should be the same as a shorter tournament.

Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.

Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
This is the same nonsensical stuff
What here is nonsense? For me it makes sense perfectly. (the only possible loophole is that there might be connection between being bored and the result, but no example is perfect)
I hear on the blackjack forums where people claim that there are ways to "beat the game" without counting cards or shuffle-tracking. One common theme. Play until you get ahead and then quit.
I invented this technique when I was ten, used it succesfully for gambling six months, then thought more about it, realized it doesn't work and stopped Common sense can be badly misleading with statistics.
SO stopping a test at some arbitrary point chosen _before_ the match starts eliminates using that kind of logic.
Matt said exactly this, didn't he?
Wasn't arguing with him. Was arguing _with_ him and against the idea of stopping when you like or don't like the result.

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain
Contact:

Re: Engine Testing - Statistics

If anyone is interested on how to build a LOS table, please take a look on the ChessProgramming Wiki. I just added a new page there.

http://chessprogramming.wikispaces.com/Statistics

regards,
Edmund

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Engine Testing - Statistics

bob wrote: Wasn't arguing with him. Was arguing _with_ him and against the idea of stopping when you like or don't like the result.
I think you cannot state something like this without doing the math.

An easier way, for people that is not able to produce the needed statistical formula (almost everybody in this forum I would guess , perhaps only Rémi Coulom has the theoretical background to be able to write some formulas) is to use a game simulator and plot the corresponding table. Also this is not trivial because you have to ask the correct questions to your simulator.

Admittedly I didn't think a lot on this and I would need a bit of time to come up with a proper simulation procedure, but I see this task doable by me, while instead deduce the statistical formulas is beyond my skills.

P.S: Of course just stating "can be done" or "cannot be done" is pure handwaving, I would think we all agree on this point

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Engine Testing - Statistics

Edmund wrote:If anyone is interested on how to build a LOS table, please take a look on the ChessProgramming Wiki. I just added a new page there.

http://chessprogramming.wikispaces.com/Statistics

regards,
Edmund
Thanks very interesting, but to be usable you would need to add at least 200, 500, 1000, 2000 games for match column, because 100 is useless !

You can also remove most of the columns like 10, 20, 30 etc.. so to reduce cluttering.

IMHO I would just write values for 50, 100, 200, 500, 1000, 2000 games

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain
Contact:

Re: Engine Testing - Statistics

mcostalba wrote:
Edmund wrote:If anyone is interested on how to build a LOS table, please take a look on the ChessProgramming Wiki. I just added a new page there.

http://chessprogramming.wikispaces.com/Statistics

regards,
Edmund
Thanks very interesting, but to be usable you would need to add at least 200, 500, 1000, 2000 games for match column, because 100 is useless !

You can also remove most of the columns like 10, 20, 30 etc.. so to reduce cluttering.

IMHO I would just write values for 50, 100, 200, 500, 1000, 2000 games
My limitation for calculating it is the lacking precission, to calculate more than 200 games I need a different programs. Note that one of the formulas has to handle fact(n) at one point.

Do you know any tools for that?

MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 7:59 pm

Re: Engine Testing - Statistics

Edmund wrote:
mcostalba wrote:
Edmund wrote:If anyone is interested on how to build a LOS table, please take a look on the ChessProgramming Wiki. I just added a new page there.

http://chessprogramming.wikispaces.com/Statistics

regards,
Edmund
Thanks very interesting, but to be usable you would need to add at least 200, 500, 1000, 2000 games for match column, because 100 is useless !

You can also remove most of the columns like 10, 20, 30 etc.. so to reduce cluttering.

IMHO I would just write values for 50, 100, 200, 500, 1000, 2000 games
My limitation for calculating it is the lacking precission, to calculate more than 200 games I need a different programs. Note that one of the formulas has to handle fact(n) at one point.

Do you know any tools for that?
Java has the BigInt and BigDecimal class which will handle any precision you like (within memory constraints of course...)

Python handles huge integers natively I believe.

Of course, there are lots of third party libraries for doing such in C or C++

If you have the actual formulas, writing something to spit out a CSV for different results should be pretty easy.

UncombedCoconut
Posts: 319
Joined: Fri Dec 18, 2009 10:40 am
Location: Naperville, IL

Re: Engine Testing - Statistics

Some comments (all grouped together; sorry if this messes with the threaded view):
Edmund wrote:it is a much stricter margin indeed, but I would have imagined the closer you approach the final value the smaller the difference should become. More like in the following sketch:
There was indeed a bug. Here's a fixed version. If you clamp the cutoffs to zero, the max cutoffs at each round now match Joseph Ciarrochi's numbers. (I also don't know when roundoff starts to kill the calculation, but the order of operations makes me think it would take quite a while.)
mcostalba wrote:Yes, it is interesting, but I would think you need to validate your program with real data.

For instance simulating a match using rand(), like: [...]
And verify that calculated probabilities corresponds to the one measured through simulated matches.
The code must be verified independently, and that's a good way. But in fact my code is just a continuous version of that simulation: it just counts the percentage of players with each score, as they play simulated games.
Edmund wrote:If anyone is interested on how to build a LOS table, please take a look on the ChessProgramming Wiki. I just added a new page there.

http://chessprogramming.wikispaces.com/Statistics
Nice work. I agree that a multiple precision library would help (esp. if it has log-of-factorial or log-of-gamma-function built in). You can also be smarter with the computations. You can build a table of this f(w,d,l) recursively:

Code: Select all

``````f&#40;w+1,d,l&#41; = f&#40;w,d,l&#41; * p_win * &#40;w+1+d+l&#41; / &#40;w+1&#41;
f&#40;w,d+1,l&#41; = f&#40;w,d,l&#41; * p_draw * &#40;w+d+1+l&#41; / &#40;d+1&#41;
f&#40;w,d,l+1&#41; = f&#40;w,d,l&#41; * p_loss * &#40;w+d+l+1&#41; / &#40;l+1&#41;``````
This way you won't ever overflow.
BTW, what's windows.h doing in your code?
mcostalba wrote:An easier way, for people that is not able to produce the needed statistical formula (almost everybody in this forum I would guess , perhaps only Rémi Coulom has the theoretical background to be able to write some formulas) is to use a game simulator and plot the corresponding table. Also this is not trivial because you have to ask the correct questions to your simulator.
The simulator is not less powerful than a formula, if you (1) apply it to the probability density rather than one simulated player at a time and (2) don't write buggy code. Of course #2 means I don't have a chance.

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain
Contact:

Re: Engine Testing - Statistics

UncombedCoconut wrote:
Edmund wrote:it is a much stricter margin indeed, but I would have imagined the closer you approach the final value the smaller the difference should become. More like in the following sketch:
There was indeed a bug. Here's a fixed version. If you clamp the cutoffs to zero, the max cutoffs at each round now match Joseph Ciarrochi's numbers. (I also don't know when roundoff starts to kill the calculation, but the order of operations makes me think it would take quite a while.)
Maybe I am doing something wrong here, but I am still not getting quite the same values. eg n=100; alpha=5% --> should be 57.5%

Code: Select all

``After   100 games, continue if  19.0 <= score <=  81.0 ( 19.0% -  81.0%). Cumulative type-I error 0.0500``
Furthermore, would you elaborate on the "Optimize for very different strength." feature, I don't quite understand that.

UncombedCoconut wrote:
Edmund wrote:If anyone is interested on how to build a LOS table, please take a look on the ChessProgramming Wiki. I just added a new page there.

http://chessprogramming.wikispaces.com/Statistics
Nice work. I agree that a multiple precision library would help (esp. if it has log-of-factorial or log-of-gamma-function built in). You can also be smarter with the computations. You can build a table of this f(w,d,l) recursively:

Code: Select all

``````f&#40;w+1,d,l&#41; = f&#40;w,d,l&#41; * p_win * &#40;w+1+d+l&#41; / &#40;w+1&#41;
f&#40;w,d+1,l&#41; = f&#40;w,d,l&#41; * p_draw * &#40;w+d+1+l&#41; / &#40;d+1&#41;
f&#40;w,d,l+1&#41; = f&#40;w,d,l&#41; * p_loss * &#40;w+d+l+1&#41; / &#40;l+1&#41;``````
This way you won't ever overflow.
Great idea! I didn't think of that. The problem I see though is that you may worsen rounding errors by that. Do you have any ideas on how to travers the permutations most efficiently given that win is changed at last (to generate the output) and to minimize the rounding errors (so maybe go from biggest to smallest)

In the meantime I found a nice multiple precision library. But unfortuntaly it takes much longer to compute. So I tried to pregenerate tabels for factorials and powers. This sped up the process and I was able to compute the values for n=1000 (which you find on the chessprogrammingwiki now) but I soon encountered the limitations of my stack. So a version without the risk for overflow sounds definitely promising.

Back to the original problem; any ideas on how we could use these new findings to answer the question on when to stop without the need of a simulation.

regards,
Edmund