margin of error

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
When I test Komodo versions head to head on one particular level on one particular machine, I get about 51% draws. Should I set the draw ratio to 0.51?
BTW It occurred to me that with your simulation program you could actually verify if the results of wald are correct.

I have never confirmed them by simulation, only by some obvious sanity checking (like verifying that certain probabilities sum to 1).

The mathematics for deriving the formulas is a bit complicated so one has to be on the lookout for mistakes.
I finally got it to outperform the simple 20,000 game method. But I had to make the resolution very low, I think I used 0.2 ELO and I had to use very high alpha and beta values - something close to 0.50! So those terms are not making sense to me, but the simulation is returning a more certain evaluation with fewer games.

It also seems that if I am willing to raise the 20,000 to 30,000 or 40,000 games I can get way up there - for example, I can get 93% with 32,000 games.

If I want to almost guarantee no false positives, I'm sure this becomes much easier. It's obvious that you can get arbitrarily close to 100% by throwing out any version you are not sure of - but that of course is also a tradeoff - how many good versions are you willing to throw away because the result is not clear enough?
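For what it's worth, the kind of simulation I have in mind is roughly this - a minimal sketch, not my actual tester, with the 2 ELO difference and the draw ratio as assumed inputs, and "right decision" simply meaning the truly better version ends up with the higher score:

Code: Select all

import random

def elo_to_score(elo):
    # expected score for a version that is 'elo' points stronger
    return 1.0 / (1.0 + 10 ** (-elo / 400.0))

def play_match(games, elo=2.0, draw_ratio=0.5):
    # one fixed-length match; returns the candidate's score fraction
    win = elo_to_score(elo) - draw_ratio / 2.0
    score = 0.0
    for _ in range(games):
        r = random.random()
        score += 1.0 if r < win else (0.5 if r < win + draw_ratio else 0.0)
    return score / games

def right_decision_rate(games, trials=1000):
    # fraction of matches where the 2-ELO-better version finishes ahead
    return sum(play_match(games) > 0.5 for _ in range(trials)) / trials

for n in (20000, 32000):
    print(n, right_decision_rate(n))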

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

I'm running the simulation now and it is not looking good.
Hmm in that case I will have to do simulations myself.

Just to be sure, were your parameters alpha=0.05, beta=0.05, epsilon=2?
I am asking since I get slightly different values for the run-time parameters
(the difference is in the third decimal or so). Still, I am a bit surprised.

Code: Select all

Design parameters
=================
False positives            :   5.0%
False negatives            :   5.0%
Resolution                 :  2.0 Elo
Truncation                 :  304937 games
Draw ratio                 :  27.3%
DrawElo                    :  97.3 Elo

Runtime parameters
==================
Win                        :  +0.0073123572041512
Draw                       :  -0.0000306679224112
Loss                       :  -0.0073430251265621
H1 threshold               :  +3.5628530206607243
H0 threshold               :  -3.5627844975008256
Truncation point           :  304936
H1 threshold at truncation :  -0.0000102246122515
Using the following parameters with a resolution of 2 ELO I get slightly over 97% which is great, but I have to play 145,715 games on average.


This seems to be close to the prediction of wald if you look at the graph.
Here is a comparison. Using simple 20,000 game matches I make the "right decision" 82 percent of the time for a 2 ELO improvement.
This means beta=0.18. To compare this with a Wald test you should also tell me what alpha is for your usual test (i.e. the probability of a false positive when elo=0).
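For reference, the run-time parameters above are meant to be consumed roughly like this during a match - a minimal sketch of the stopping rule, not the actual wald.py code; game_results stands for a hypothetical stream of 'w'/'d'/'l' outcomes:

Code: Select all

WIN  = +0.0073123572041512
DRAW = -0.0000306679224112
LOSS = -0.0073430251265621
H1_THRESHOLD = +3.5628530206607243
H0_THRESHOLD = -3.5627844975008256
TRUNCATION   = 304936
H1_AT_TRUNCATION = -0.0000102246122515

def sequential_test(game_results):
    # accumulate the per-game increments and stop at the first threshold crossed
    increment = {'w': WIN, 'd': DRAW, 'l': LOSS}
    score, games = 0.0, 0
    for result in game_results:
        games += 1
        score += increment[result]
        if score >= H1_THRESHOLD:
            return 'accept H1 (new version is better)', games
        if score <= H0_THRESHOLD:
            return 'accept H0 (no improvement)', games
        if games >= TRUNCATION:
            # forced decision at the truncation point
            return ('accept H1' if score >= H1_AT_TRUNCATION else 'accept H0'), games
    return 'undecided (ran out of games)', games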
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Houdini wrote:
hgm wrote:An alternative strategy could therefore be to immediately crank up the number of games from 15k to 30k for a version that qualified based on the 15k vs (old) 30k comparison. And then run the next version for 15k to compare against that.
Correct, I'd rather play the extra 15k (or whatever number) games for the new reference version immediately after its "qualification".
You have to take a lot of care that you don't get into the habit of manipulating the results. We have to fight that, because we sometimes "really like" some version - we have a strong a priori sense that it should be good. That sense is REALLY strong when all you have done is speed up the program and it's otherwise functionally identical. But what if it doesn't come out on top? You KNOW for a fact that it has to be better than the one it is based on.

So let's say that you make some version, it comes out in second place but just misses the mark by a fraction of an ELO, and you have a strong belief that it is better. I guess what I am asking is: what role does a priori intuition play in testing? Larry and I have had many arguments about this - I don't trust our intuition and he believes it should play a much larger role. So we make all kinds of decisions based on our gut feeling; I don't trust that, Larry does.

Nevertheless, he has a point. A priori information is not completely worthless. Once in a while some search change prunes fewer moves and yet searches the same average depth. That is not as clear as the pure speedup case because it can still be wrong - but we know that 95% of the time it's right. But when you get into the mode of manipulating how you test based on your intuition, how much is too much?

My solution to this is to use your intuition but specify the testing conditions in advance - though even that seems a bit shaky to me.
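One way to at least put a number on the question - a back-of-the-envelope sketch, with the prior and the error rates below purely made-up inputs - is to combine the prior belief with the test's error rates using Bayes' rule:

Code: Select all

def posterior_good(prior=0.95, alpha=0.05, beta=0.18):
    # prior : prior probability the change is an improvement (the "intuition")
    # alpha : probability a bad change passes the test (false positive)
    # beta  : probability a good change fails the test (false negative)
    p_fail_given_good = beta
    p_fail_given_bad = 1.0 - alpha
    p_fail = prior * p_fail_given_good + (1.0 - prior) * p_fail_given_bad
    # probability the change is really good, given that it FAILED the test
    return prior * p_fail_given_good / p_fail

print(posterior_good())

With these made-up numbers a failed test still leaves roughly a 78% chance that the change was good - which is exactly why the temptation to re-test is so strong.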
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
I'm running the simulation now and it is not looking good.
Hmm in that case I will have to do simulations myself.
Do you want me to send the code? It now actually runs your python script and parses the output to make it easy to set up.

Just to be sure, were your parameters alpha=0.05, beta=0.05, epsilon=2?
I am asking since I get slightly different values for the run-time parameters
(the difference is in the third decimal or so). Still, I am a bit surprised.

Code: Select all

Design parameters
=================
False positives            :   5.0%
False negatives            :   5.0%
Resolution                 :  2.0 Elo
Truncation                 :  304937 games
Draw ratio                 :  27.3%
DrawElo                    :  97.3 Elo

Runtime parameters
==================
Win                        :  +0.0073123572041512
Draw                       :  -0.0000306679224112
Loss                       :  -0.0073430251265621
H1 threshold               :  +3.5628530206607243
H0 threshold               :  -3.5627844975008256
Truncation point           :  304936
H1 threshold at truncation :  -0.0000102246122515
Using the following parameters with a resolution of 2 ELO I get slightly over 97% which is great, but I have to play 145,715 games on average.


This seems to be close to the prediction of wald if you look at the graph.
Here is a comparison. Using simple 20,000 game matches I make the "right decision" 82 percent of the time for a 2 ELO improvement.
This means beta=0.18. To compare this with a Wald test you should also tell me what alpha is for your usual test (i.e. the probability of a false positive when elo=0).
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

I finally got it to outperform the simple 20,000 game method. But I had to make the resolution very low, I think I used 0.2 ELO and I had to use very high alpha and beta values
This is wrong. A Wald test will considerably outperform a fixed length test for the same design parameters (alpha,beta,epsilon). You can see this on the graphs that wald generates.

I think that when you make the comparison with your usual testing you only look at beta (the probability of a false negative when elo=epsilon) and not
at alpha (the probability of a false positive when elo=0). You have
to look at both alpha and beta since otherwise you are comparing
apples and oranges.
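To make the comparison concrete: a small sketch (normal approximation; the draw ratio and the acceptance threshold are assumptions of mine, not your actual rule) that gives both alpha and beta for a fixed-length match:

Code: Select all

from math import sqrt
from scipy.stats import norm

def elo_to_score(elo):
    return 1.0 / (1.0 + 10 ** (-elo / 400.0))

def alpha_beta(games=20000, epsilon=2.0, draw_ratio=0.5, threshold=0.5):
    # per-game score variance: draws score exactly 0.5, decisive games 0 or 1
    var = (1.0 - draw_ratio) * 0.25
    sigma = sqrt(var / games)            # std. dev. of the final match score
    alpha = 1.0 - norm.cdf(threshold, loc=0.5, scale=sigma)
    beta = norm.cdf(threshold, loc=elo_to_score(epsilon), scale=sigma)
    return alpha, beta

# with the threshold at exactly 50%, beta looks fine but alpha is a full 50%
print(alpha_beta(threshold=0.5))
print(alpha_beta(threshold=0.502))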
Do you want me to send the code?
Sure. Thanks a lot (parsing the output of the scripts seems indeed very convenient). But you have not convinced me yet that the results
of your simulations contradict the predictions by wald....
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
I finally got it to outperform the simple 20,000 game method. But I had to make the resolution very low, I think I used 0.2 ELO and I had to use very high alpha and beta values
This is wrong. A Wald test will considerably outperform a fixed length test for the same design parameters (alpha,beta,epsilon). You can see this on the graphs that wald generates.

I think that when you make the comparison with your usual testing you only look at beta (the probability of a false negative when elo=epsilon) and not
at alpha (the probability of a false positive when elo=0). You have
to look at both alpha and beta since otherwise you are comparing
apples and oranges.
Do you want me to send the code?
Sure. Thanks a lot (parsing the output of the scripts seems indeed very convenient). But you have not convinced me yet that the results
of your simulations contradict the predictions by wald....
Another set of eyes on the code will reveal whether I'm doing anything wrong. Please send your email address to me by PM, or just send it to dailey.don@gmail.com, and I'll send you the code.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: margin of error

Post by hgm »

Don wrote:You have to take a lot of care that you don't get into the habit of manipulating the results.
The proposed method does not leave any room for manipulating the result. It just provides more info at lower cost. You accept a new version with a confidence that is exactly equal to the one you would accept it with in the "always 20k" method. At that point you decide to base future versions on that new best version.

The difference then is that you will get your 20k-vs-20k comparison accuracy for those new versions by a 30k-vs-15k comparison instead. This takes 10k games more than needed (because you had to play 2 x 15k games instead of only 20k for the new one), which should be seen as an investment. Each rejected change after that earns you back 5k, because you now only do 15k games. If you accept less than 33% of what you try, you save on average. (Note the 15k number came out of an optimization for 1-in-5 acceptance; if you knew you were doing 1 in 3, you would have chosen other numbers, and run at a profit already after 2 rejections.)

In addition you get extra information, showing you after the fact which changes you accepted just because their test results were flukes, while in fact they were a step backwards. But if you would never use that info, the method would be 100% equivalent to the 20k-always method. Of course it is likely that using that info in some objective way, rather than foolhardily ignoring it, could make you do a lot better. (At no extra cost in games! 8-) )
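The break-even point is easy to check with round numbers (a quick sketch; p is the fraction of candidates that gets accepted):

Code: Select all

# expected games per candidate: 15k for the comparison, plus 15k extra games
# to promote the new version to a 30k reference whenever it is accepted
def expected_games(p, small=15000, extra=15000):
    return small + p * extra

for p in (0.2, 1.0 / 3.0, 0.5):
    print(f"acceptance rate {p:.0%}: {expected_games(p):,.0f} games vs 20,000 flat")
# break-even at p = 1/3, i.e. the "less than 33%" figure above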
Tom Likens
Posts: 303
Joined: Sat Apr 28, 2012 6:18 pm
Location: Austin, TX

Re: margin of error

Post by Tom Likens »

Don wrote:
Michel wrote:
drd@greencheeks ~/tmp $ ./wald.py
Traceback (most recent call last):
File "./wald.py", line 20, in <module>
from scipy.stats import norm
ImportError: No module named scipy.stats
Are you using Ubuntu? Then fire up synaptic and search for scipy.
Yes, I am using Mint which is basically Ubuntu.

It works now, that did the trick. Now let me see if I understand it.

Don
Interesting, I just made the switch to Mint 13 running Mate. Previously, I ran Ubuntu
but I can't stand Unity.

Tom
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Tom Likens wrote:
Don wrote:
Michel wrote:
drd@greencheeks ~/tmp $ ./wald.py
Traceback (most recent call last):
File "./wald.py", line 20, in <module>
from scipy.stats import norm
ImportError: No module named scipy.stats
Are you using Ubuntu? Then fire up synaptic and search for scipy.
Yes, I am using Mint which is basically Ubuntu.

It works now, that did the trick. Now let me see if I understand it.

Don
Interesting, I just made the switch to Mint 13 running Mate. Previously, I ran Ubuntu
but I can't stand Unity.

Tom
I also switched in order to run Mate. It's more like the classic gnome but maybe even more solid and I'm loving it.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Rémi Coulom wrote:That paper is very probably not the best thing to read. I'll try to find that older thread if none of its participants gives us a link to it.

Even if you don't understand deep theory, there are simple ways to test any stopping method you design: just simulate it. For each elo point difference (1, 2, 3...) you can run many simulations of your early stopping method, and measure at which frequency it makes the wrong decision.

Rémi
I did exactly that and came up with something pretty useful. One finding that should be pretty obvious when you think about it, but that is surprising if you don't, is illustrated by the following thought experiment:

Assume that 50% of your experiments are regressions and 50% are improvements, and that in total this is a zero-sum game (the total ELO of the regressions and improvements sums to zero). Also assume that you can generate these experiments at any rate of speed desired (in other words, no setup time between experiments). What is the ideal number of games per match to run to decide whether to keep a change before moving on to the next experiment? The answer is non-intuitive, at least to me.
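Something along these lines lets you find the answer empirically - a simplified sketch, not my actual tester: every change is +e or -e ELO with equal probability, it is kept when it scores above 50% in the match, the draw ratio is an assumed input, and the quantity to maximize is the ELO gained per game played:

Code: Select all

import random

def elo_to_score(elo):
    return 1.0 / (1.0 + 10 ** (-elo / 400.0))

def match_score(elo, games, draw_ratio=0.5):
    # score fraction of a candidate that is truly 'elo' points stronger (or weaker)
    win = elo_to_score(elo) - draw_ratio / 2.0
    score = 0.0
    for _ in range(games):
        r = random.random()
        score += 1.0 if r < win else (0.5 if r < win + draw_ratio else 0.0)
    return score / games

def elo_per_game(games, e=2.0, trials=2000):
    # average ELO banked per game played when changes are +e or -e with equal probability
    gained = 0.0
    for _ in range(trials):
        true_elo = random.choice((+e, -e))
        if match_score(true_elo, games) > 0.5:
            gained += true_elo
    return gained / (trials * games)

for n in (250, 500, 1000, 2000, 5000, 10000, 20000):
    print(n, elo_per_game(n))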
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.