## Engine Testing - Statistics


bob
Posts: 20920
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: Engine Testing - Statistics

BubbaTough wrote:
bob wrote: The idea is this. If you start a session, and play until you are one unit ahead, and then quit, you are one unit up. And with the significant variance in the game of 21, the odds of your going up at one particular instant are quite high. And each time you do this, you win one bet. Suppose you play on a \$5 min bet table. Every session will end up at +5 bucks. Except for the occasional session where you start off with a loss and continue to play and lose.

The point is that lifetime winnings occur over a lifetime, counting everything. If you quit because you are ahead, you start off at the same probability of winning the next hand whether you quit and don't play until tomorrow, or you stay at the table and play the next hand. But I know of many gamblers that swear they are winning overall, just by playing the game without any sort of "advantage play" to really give them an actual edge over the house. It is always the "I win more sessions than I lose" idea. If you play 20 sessions of 500 hands each (3-4-5 hours of play depending on crowd), you play 10,000 hands. If you flat-bet \$10 per hand, you then had \$100,000 of "action", and the house will keep pretty close to \$500 of your money. It doesn't matter whether you change that and play 1 session of 10,000 hands, or 10,000 sessions of one hand each. You end up the same \$500 in the hole.
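As a rough check on the arithmetic in this paragraph, here is a minimal sketch. The 0.5% house edge is inferred from the quoted figures (\$500 kept out of \$100,000 of action), and the function name is purely illustrative.

```python
def expected_loss(session_lengths, bet=10.0, house_edge=0.005):
    """Expected loss in dollars when flat-betting `bet` on every hand.

    house_edge=0.005 is an assumption inferred from the post's numbers.
    Only the total number of hands matters, not how they are grouped.
    """
    total_hands = sum(session_lengths)
    return total_hands * bet * house_edge


twenty_sessions = expected_loss([500] * 20)      # 20 sessions of 500 hands
one_long_session = expected_loss([10_000])       # 1 session of 10,000 hands
single_hand_sessions = expected_loss([1] * 10_000)
print(twenty_sessions, one_long_session, single_hand_sessions)
```

All three groupings give the same expected loss, because the expectation is linear in the number of hands played.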

Now if you do as some con-artists do, and play multiple sessions and report the number of times you walked away a winner vs a loser, you can win more sessions, while still losing that same 500 bucks, because you can lose more in one session than you win in all the rest. But if you claim "more winning sessions than losing sessions" a "noob" will buy that book, which is the way the author gets rich, not by playing the game with his losing system. And the "noob" will verify that he wins more sessions than he loses, all the time asking himself "where in the devil is my money going?"
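The "more winning sessions, still losing money" effect can be computed exactly for a simplified game. The sketch below assumes even-money bets won with probability 0.4975 (a stand-in for a 0.5% house edge; real blackjack has pushes, doubles and 3:2 payouts) and a hypothetical rule of quitting at +1 unit or after 500 hands.

```python
def session_stats(p_win=0.4975, max_hands=500):
    """Exact outcome of 'quit as soon as you are +1 unit, else stop at max_hands'.

    Returns (probability of a winning session, expected units won per
    session, expected hands played per session). p_win is an assumption.
    """
    in_play = {0: 1.0}          # bankroll (units) -> probability, still playing
    p_quit_ahead = 0.0
    expected_hands = 0.0
    for _ in range(max_hands):
        expected_hands += sum(in_play.values())   # one more hand for these
        nxt = {}
        for bankroll, p in in_play.items():
            if bankroll + 1 == 1:
                p_quit_ahead += p * p_win         # reached +1: walk away
            else:
                nxt[bankroll + 1] = nxt.get(bankroll + 1, 0.0) + p * p_win
            nxt[bankroll - 1] = nxt.get(bankroll - 1, 0.0) + p * (1 - p_win)
        in_play = nxt
    expected_units = p_quit_ahead + sum(b * p for b, p in in_play.items())
    return p_quit_ahead, expected_units, expected_hands


win_rate, units, hands = session_stats()
print(f"winning sessions: {win_rate:.1%}, expected units per session: {units:.3f}")
```

Under these assumptions the large majority of sessions end as winners, yet the expected result per session is still negative and equals the house edge times the expected number of hands played.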

If you test and stop when you like the result, all you did was find a place where the god of variance was kind to you, and reach the conclusion you wanted to reach. If you play toward some set goal (say N thousand games), then you ignore that variance, and the more games you play, the smaller that variance becomes.

If you look at the data I posted the other day, after a thousand games, I could have stopped, and been "sure" that this version was better, because the variance had taken the Elo up quite a bit. It did come down to reality, given enough games as you could see.
I understand the fallacy you are referring to, and that many gamblers fall into. I am just not sure the analogy is fitting just right.

Is the "bankroll" equivalent to ELO?
Yes. If your bankroll, or Elo goes positive, you can stop the test and conclude you are better, but that is not valid. Play enough hands, or enough games, and your bankroll goes to zero, or the Elo drops to below where it was previously.

Is the "stopping after you are down \$500" equivalent to stopping your test when you are down a certain number of games?
That's the idea. The "game" never ends. You just break it up into chunks. A single small chunk can show most anything, while if you play long enough, you break the casino or they break you...

Is the "lifetime" equivalent to playing an infinite number of games in your test?
No. In blackjack, we often refer to "n0", which is the number of games required to get you to approach the expected value with a +1.0% edge. Typical values are 15-20,000 hands to give you a small enough SD to make -1 SD still a positive bankroll effect...

I guess it's a little silly for me to nitpick the analogy. I think what is throwing me is that, generally speaking in gambling, you figure out your odds with math, and deduce how to control your bankroll based on that. If I am understanding the analogy correctly, in computer chess you are kind of doing the opposite. You are watching your results, and trying to deduce the math.

-Sam
No, I actually have watched the error bar, and decided how many games are needed to make the error bar small enough to be useful. For huge rating differences, you don't need a small error bar to reach a 95% confidence that A is worse than B or vice versa. But for more reasonable values, say 10-20 Elo at the most, a few hundred just won't do.
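To make the error-bar remark concrete, here is a minimal sketch of how the 95% error bar shrinks with the number of games. It assumes two roughly equal engines, the standard logistic Elo formula, and a normal approximation; the function names and the `draw_rate` parameter are illustrative, not anything from the posts.

```python
import math
from statistics import NormalDist


def score_to_elo(score):
    """Convert an expected score (0..1) into an Elo difference."""
    return -400 * math.log10(1 / score - 1)


def elo_error_bar(games, draw_rate=0.0, confidence=0.95):
    """Approximate +/- Elo error bar around 0 Elo after `games` games.

    Assumes equal engines, so the per-game score variance is
    0.25 - draw_rate / 4. draw_rate=0.0 is the most pessimistic case.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided
    sigma = math.sqrt(0.25 - draw_rate / 4)
    margin = z * sigma / math.sqrt(games)            # in score units
    return score_to_elo(0.5 + margin)


for n in (300, 1000, 4000, 40000):
    print(n, round(elo_error_bar(n), 1))
```

Under these assumptions a few hundred games leave an error bar far wider than a 10-20 Elo difference, and the bar only halves each time the number of games is quadrupled.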

UncombedCoconut
Posts: 319
Joined: Fri Dec 18, 2009 10:40 am
Location: Naperville, IL

### Re: Engine Testing - Statistics

Edmund wrote:I was taking conservative values: eg LOS when 20 points ahead after 200 games is 95.67% or 95.50% when 14 points ahead after 100 games. So maybe if you combine the two you actually don't need to be 34 points ahead, but 33 would do. But that is no major problem I think.

Or what other false positives do you see?
Actually I might be wrong if you're doing a 2-sided test (though I'm not sure). I assumed for some reason you meant a 1-sided test; sorry if that was wrong.

If you're doing a 1-sided test, you're modifying a neutral-or-positive test by adding more positive cases, hence you get more false positives. Whether you bring it up above 5% false positives, and if so how much you need to compensate by making the test after 200 games harder to pass, should IMO be answered by calculation. I guess that's my bias as a mathematician though.
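One way to do that calculation is sketched below. It computes, exactly, the false-positive rate of the combined rule "pass if 14 points ahead after 100 games, otherwise pass if 20 points ahead after 200 games" for two equal engines. The 32% draw rate is an assumption chosen for illustration, and "points ahead" is taken to mean wins minus losses.

```python
def advance(dist, games, p_win, p_draw, p_loss):
    """Play `games` more games; dist maps (wins - losses) -> probability."""
    for _ in range(games):
        nxt = {}
        for lead, p in dist.items():
            nxt[lead + 1] = nxt.get(lead + 1, 0.0) + p * p_win
            nxt[lead] = nxt.get(lead, 0.0) + p * p_draw
            nxt[lead - 1] = nxt.get(lead - 1, 0.0) + p * p_loss
        dist = nxt
    return dist


def tail(dist, threshold):
    """Probability that the lead is at least `threshold`."""
    return sum(p for lead, p in dist.items() if lead >= threshold)


def false_positive_rates(draw_rate=0.32):
    """Return (early-only, 200-game-only, combined) false-positive rates."""
    p_win = p_loss = (1 - draw_rate) / 2          # equal engines
    after_100 = advance({0: 1.0}, 100, p_win, draw_rate, p_loss)
    early = tail(after_100, 14)
    fixed_200 = tail(advance(after_100, 100, p_win, draw_rate, p_loss), 20)
    # Runs that did not pass at game 100 keep playing to game 200.
    survivors = {lead: p for lead, p in after_100.items() if lead < 14}
    late = tail(advance(survivors, 100, p_win, draw_rate, p_loss), 20)
    return early, fixed_200, early + late


early, fixed_200, combined = false_positive_rates()
print(f"100 games: {early:.4f}  200 games: {fixed_200:.4f}  combined: {combined:.4f}")
```

The combined rate is necessarily higher than either single test and lower than their sum, so the later threshold would have to be raised by some amount to keep the overall rate at 5%.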

BTW, congrats on the fast generation of LOS tables!

bob
Posts: 20920
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: Engine Testing - Statistics

zamar wrote:
BTW sequential probability ( http://en.wikipedia.org/wiki/Sequential ... ratio_test ) is not easy and you don't get it reading some introductory statistical book.
No idea. But you do get it by visiting any of the online blackjack forums and starting to discuss "progressive betting strategies". There are world-class statisticians who will be quite happy to explain why the idea is broken. It happens every day.
Sorry Bob, but I don't get what you are trying to say.

I understand fully well why "progressive betting strategies" won't work, because I could prove it when I was a kid, but I can't see the connection between "progressive betting strategies" and "determining chess engine relative strength through sequential probability ratio test".
The idea being discussed was this. You start a test. And you note that the results are quite bad after 100 games, so you stop the test and throw the change out. That's wrong. Same thing when you start the test, and after 100 or 1000 games, the Elo is better than the old best version. So you stop the test and declare that the new version is better. Which is also wrong.

The results bounce around a lot early on, and slowly settle down as the number of games increases. If you want a short test, pick the number of rounds _first_ and then run the test. It will not be as accurate as more games, but it is better than waiting until you see some high point (or low point) and deciding at that instant in time.

That was the idea I was describing as being flawed. I even posted some results here last week showing this very thing. That you could have picked a stop point where the change looked very bad, or one where it looked very good. And neither would have been correct.

I didn't say "no way it can be done". And I gave a methodology where one could create a simple table indexed by Elo difference on one axis and acceptable error on the other, and the intersection would give the number of games you need before terminating the test.

But just looking at the results after (say) 200 games where the new version is doing better or worse than the old, and stopping at that point without some real statistical support behind you, will lead to errors.

So yes, it can be done. But the number of games is absolutely a function of the acceptable error range and the difference in elo between the two engines. The wider the gap, the fewer games you need. The higher the acceptable error can be, the fewer games you need. Just stopping because one is a bit (or a lot) ahead is not going to work.
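A table of that kind could be sketched as follows. This is only one possible reading of the methodology: it uses a one-sided normal approximation, the logistic Elo formula, and a no-draw variance by default, and it returns the number of games at which the expected score gap equals the critical value. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def games_needed(elo_diff, confidence=0.95, draw_rate=0.0):
    """Games at which a true `elo_diff` sits exactly on the one-sided
    `confidence` threshold (normal approximation, near-equal engines)."""
    expected_score = 1 / (1 + 10 ** (-elo_diff / 400))
    z = NormalDist().inv_cdf(confidence)
    sigma = math.sqrt(0.25 - draw_rate / 4)      # per-game score std. dev.
    return math.ceil((z * sigma / (expected_score - 0.5)) ** 2)


elo_diffs = (5, 10, 20, 50)
confidences = (0.90, 0.95, 0.99)
table = {(e, c): games_needed(e, c) for e in elo_diffs for c in confidences}

print("Elo  " + "  ".join(f"{c:>7.0%}" for c in confidences))
for e in elo_diffs:
    print(f"{e:>3}  " + "  ".join(f"{table[(e, c)]:>7}" for c in confidences))
```

Because the count grows with the inverse square of the score gap, halving the Elo difference roughly quadruples the games required. A real test would want more games than these figures, since at exactly this count a true difference of that size is detected only about half the time.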

I'll explain my view once more: As the match between two engines goes on, every new result provides us with information. And whenever we get information, probabilities change. Now the question is how we can use that information to adjust the match length so that we reach the wanted confidence level. You say that there is no way, and that's false, and very easy to prove. Here is a counter-example: Suppose we want to play a 1000-game match and accept the confidence level we get from there. After 700 games, the match A-B stands at 700-0. You claim "because progressive betting strategies won't work, you have to play the full 1000 games before we can know anything with the wanted confidence level", although it can clearly be seen that the match could at worst end 700-300, so A would be the clear winner with the wanted confidence level and we can immediately stop the match.
That's not what I said at all, unfortunately. Nowhere has anyone discussed 700-0 results. We are talking about results between two versions that may or may not be pretty close. With a small number of games, you might get 50-10 and conclude something dead wrong, had you gone to 500 games where the results may well even out. That's the flaw here.

Of course 700-0 is just extreme, but then you start "milding" it step by step 699-1, 698-2, ..., but when can we stop? And that's the question where we are looking for the answer...
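The sequential probability ratio test linked earlier in the thread gives one principled answer to "when can we stop?". Below is a minimal sketch of Wald's test for decisive games only. The hypotheses (`p0` = 0.5 versus `p1` = 0.55 win probability), the error rates, and the choice to ignore draws are all assumptions made for illustration.

```python
import math


def sprt_decision(wins, losses, p0=0.5, p1=0.55, alpha=0.05, beta=0.05):
    """Wald's SPRT on decisive games (draws ignored, an assumption).

    H0: win probability is p0.  H1: win probability is p1.
    Returns 'accept H1', 'accept H0', or 'continue'.
    """
    # Log-likelihood ratio of H1 against H0 for the observed results.
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    lower = math.log(beta / (1 - alpha))      # stop and accept H0 below this
    upper = math.log((1 - beta) / alpha)      # stop and accept H1 above this
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


for record in ((700, 0), (10, 8), (0, 100)):
    print(record, sprt_decision(*record))
```

The thresholds are fixed before the match starts, which is what separates this from stopping whenever the running score happens to look good.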

jwes
Posts: 778
Joined: Sat Jul 01, 2006 5:11 am

### Re: Engine Testing - Statistics

What do you do about the statistic that if you use a 95% confidence level, 1 out of 20 tests will give wrong results?

Uri Blass
Posts: 8836
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

### Re: Engine Testing - Statistics

jwes wrote:What do you do about the statistic that if you use a 95% confidence level, 1 out of 20 tests will give wrong results?
No

I do not know how many tests are going to give wrong results with 95% confidence for the following reasons:

1) You may have many small improvements for which the test reports no improvement, and that is also a wrong result.

2) If the change is negative, the probability of getting a wrong positive result is clearly less than 5%.

The probability of a wrong positive result is 5% only if the change is neither positive nor negative.
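These points can be illustrated by computing how often a fixed-length, one-sided 95% test passes a change as a function of its true Elo effect. The sketch assumes a normal approximation and the logistic Elo formula; the function name, the 10,000-game length, and the `draw_rate` parameter are illustrative choices.

```python
import math
from statistics import NormalDist


def pass_probability(true_elo, games, alpha=0.05, draw_rate=0.0):
    """Chance that a one-sided test at level `alpha` declares the change
    an improvement, given its true Elo effect (normal approximation)."""
    norm = NormalDist()
    score = 1 / (1 + 10 ** (-true_elo / 400))
    # Pass threshold is set from the null hypothesis of equal strength.
    null_se = math.sqrt((0.25 - draw_rate / 4) / games)
    threshold = 0.5 + norm.inv_cdf(1 - alpha) * null_se
    # Sampling spread of the observed score under the true strength.
    true_se = math.sqrt((score * (1 - score) - draw_rate / 4) / games)
    return 1 - norm.cdf((threshold - score) / true_se)


for elo in (-5, 0, 5, 10):
    print(f"true {elo:+d} Elo: passes {pass_probability(elo, 10_000):.1%} of the time")
```

Under these assumptions the 5% figure applies only at exactly zero Elo. A slightly harmful change passes far less often, while a small genuine improvement can still fail the test more often than not.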

Uri

Uri Blass
Posts: 8836
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

### Re: Engine Testing - Statistics

mcostalba wrote:
Uri Blass wrote: 4)In other cases choose the winner
So after 6 months you have grown a total mess out of the engine.

It is important to know if a change works or doesn't because many changes are correlated and you come out with idea B and C only after you have found idea A is good.

Another problem is that your engine becomes feature-inflated and you end up throwing in any garbage together with good stuff.

I prefer to add 1 proven good idea than 5 good and 5 bad ones even if the result is less in terms of ELO, but for program maintainability and for reference with future research / studies I want to know that what is in is useful and does work.
I understand your reasons in the case that the change alters the code (and not only default numbers), but if you change something like the tuned value of pieces then I see no reason not to accept a very small improvement (of 1 Elo) in which you have only 60% confidence.

Even if you do not want to accept a change that you are only 60% sure is good, I still think that the idea of stopping the match if the leader leads by some constant difference may be a good idea, and I think that the difference between it and the case where you do not stop the match is not big.

My guess is that if you normally need a result of 50.9% to get X% confidence after 10,000 games, then with my suggestion you may need a difference of slightly more than 180 points to get X% confidence after 10,000 games, but not much more than that (a difference of 180 points after 10,000 games is a result of 5090-4910, which is exactly 50.9%).
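This guess can be checked under simplifying assumptions: equal engines, no draws, the lead checked after every game, and only a lead by the new version counting as a pass. For a symmetric random walk the reflection principle says the chance of ever reaching a lead is about twice the chance of holding that lead at the end, which the sketch below applies with a normal approximation. The function names are illustrative.

```python
import math
from statistics import NormalDist

NORM = NormalDist()


def fixed_length_alpha(lead, games):
    """False-positive rate of 'ahead by `lead` after exactly `games` games'."""
    return 1 - NORM.cdf(lead / math.sqrt(games))


def stop_on_lead_alpha(lead, games):
    """False-positive rate of 'stop as soon as ahead by `lead`' within
    `games` games (reflection principle, normal approximation)."""
    return 2 * fixed_length_alpha(lead, games)


def lead_needed_with_stopping(alpha, games):
    """Lead that keeps the stop-on-lead rule at false-positive rate `alpha`."""
    return NORM.inv_cdf(1 - alpha / 2) * math.sqrt(games)


alpha = fixed_length_alpha(180, 10_000)
lead = lead_needed_with_stopping(alpha, 10_000)
print(f"fixed-length alpha at a 180-point lead: {alpha:.4f}")
print(f"lead needed when stopping early: {lead:.0f}")
```

Under these assumptions the required lead comes out at roughly 210 points rather than 180. Draws would reduce the variance and shift both numbers, so this is an order-of-magnitude check on the guess, not a recommendation.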

Uri