nondeterministic testing

hgm
Posts: 28357
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

bob wrote:This is wrong. I can produce thousands of games played that start from Albert Silver's 40 starting positions. I can play the same position over and over and not get the same results every time. In fact, I can show you games played from the same starting position where I win all 4 and lose 2 of them the second time around. You should test this and you will see what I mean.
The point is that you could do billions of tests on Crafty, and it would still tell you zilch about how Eden or uMax behave....

I did play Eden against uMax, and it is indeed as Nicolai claims: all games are always the same. (Well, it would be strange if he did not know his own engine, wouldn't it? :roll: ) At least in so far as I tried; I did not have the stamina to play the same game over and over again for thousands of times, just in the hope that I finally might get one that is different.

Actually this is expected for engines that always finish an iteration once they start it. The only possible source of randomness is an iteration finishing extremely close to the time limit at which you would start a new one. Then, due to timing variability, you might sometimes decide to do a new iteration and other times not. The probability of that is extremely small, as timing variability is usually only a few milliseconds while search times are many seconds. And even doing an extra iteration doesn't mean you will pick another move. Most of the time the move stays the same....
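
For illustration, a minimal sketch (hypothetical code, not hgm's or any particular engine's) of the kind of time check meant here; the only place randomness can enter is when the elapsed time lands within a few milliseconds of the threshold:

Code: Select all

#include <stdio.h>

/* Hypothetical time-management check: start another iteration only if
 * less than half of the move budget has been used.  With search times
 * of many seconds and OS timing jitter of only a few milliseconds, the
 * elapsed time almost never lands close enough to the threshold for
 * the jitter to flip the decision, hence the (near-)deterministic
 * games described above. */
static int start_new_iteration(long elapsed_ms, long budget_ms)
{
    return elapsed_ms < budget_ms / 2;
}

int main(void)
{
    long budget = 10000;            /* 10 s per move */
    long runs[] = { 4997, 5003 };   /* two runs, 6 ms of jitter apart */
    for (int i = 0; i < 2; i++)
        printf("elapsed %ld ms -> start new iteration: %s\n",
               runs[i], start_new_iteration(runs[i], budget) ? "yes" : "no");
    return 0;
}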
nczempin

Re: nondeterministic testing

Post by nczempin »

bob wrote:
nczempin wrote:A little perspective on weak engines; I think the situation is different once the strength exceeds about 2000 (in any semi-realistic scale).

My engine is pretty weak, and at this level the variability depends on a few factors, I'll list the ones that come immediately to mind.

1. Engines that have no book, or that (like mine) always choose the same line given the same previous moves, will play exactly the same game each time when playing each other. So the variability will be zero if you play two games, swapping colours.
This is wrong. I can produce thousands of games played that start from Albert Silver's 40 starting positions. I can play the same position over and over and not get the same results every time. In fact, I can show you games played from the same starting position where I win all 4 and lose 2 of them the second time around. You should test this and you will see what I mean.
Well, I did test this, and so far I am usually right in this assessment:
When two weak engines that have no book (or have a "deterministic book", always playing the same moves if the other side does too) play each other, the results are the same overall. Being weak usually means there are no non-deterministic factors involved; I can't speak for stronger engines, and most surely not for any SMP ones. I am only talking about my conditions, which involve Arena; I can't speak for e.g. winboard, polyglot and all the other stuff that might introduce non-determinism (oh, and I use deterministic seeds for the Zobrist hashes too, btw).
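
As an aside, here is a minimal sketch of what "deterministic seeds for the Zobrist hashes" can mean in practice (hypothetical C code for illustration only; Eden is written in Java and its real code differs): the key table is filled from a generator with a hard-coded seed, so every run produces bit-for-bit identical keys and hashing never adds run-to-run variation.

Code: Select all

#include <stdint.h>
#include <stdio.h>

/* Hypothetical deterministic Zobrist initialisation: the hard-coded
 * seed makes the key table identical on every run, so hashing cannot
 * introduce run-to-run differences.  Sketch only, not Eden's code. */
static uint64_t zobrist[2][6][64];      /* [side][piece type][square] */
static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;   /* fixed seed */

static uint64_t next_rand64(void)       /* xorshift64, deterministic */
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

static void init_zobrist(void)
{
    for (int side = 0; side < 2; side++)
        for (int piece = 0; piece < 6; piece++)
            for (int sq = 0; sq < 64; sq++)
                zobrist[side][piece][sq] = next_rand64();
}

int main(void)
{
    init_zobrist();
    /* this key is the same on every run of the program */
    printf("first key: %016llx\n", (unsigned long long)zobrist[0][0][0]);
    return 0;
}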

At higher levels this behaviour is less desirable; in fact, non-determinism is good because your opponents have a harder time preparing against you. But at these low levels the competition is not so strong, and e.g. having reproducible behaviour to help find bugs is much more important.

If you play a weak engine against a strong engine, what you say might actually happen, but then the weaker engine is going to lose no matter what. And there is no way to measure improvement if you can't win some games.

But your basic premise about an opening book being needed to avoid repeated games is simply wrong. If you want data I can send you data from a million games or so run on my cluster here. The variability (non-deterministic behavior) is absolutely remarkable.
I don't think I said that a book is needed to avoid repetitions. After all, Eden has a book (albeit a deterministic one, so to speak). If anything, I said that a non-deterministic book will lead to more variability. Your statement is not logically equivalent to mine.


2. Most engines at this weak level are comparatively stronger in calculation than in evaluation. They regularly play 1. Nh3 or just make more or less random pawn moves as long as there's no material to be gained. If your engine (or let's say, my engine :-) plays those engines (which also have no book), I can easily "prepare" my book against them. My engine will seem stronger than it really is if it can get into positions where their superior tactics are no longer sufficient.
A good reason for not using a book. It introduces yet another random aspect into your games that is bad for testing/evaluation purposes. Yes, you need a good book for tournaments. No, you don't need any kind of book to evaluate changes. Just start the games at the same point and see if your changes produce better results. A book adds too much random noise, and when you factor in learning, it becomes useless for evaluating changes to the program itself.
Again, remember what kind of book Eden currently has: For any position there's always just one move, or none. I will change this in the future, and I am certain that my variability will skyrocket. This is one reason why I haven't done it yet.

I also haven't mentioned this (or perhaps I have, in that case I'd like to repeat/stress it): When I update Eden's book, I always do so _after_ I am satisfied that the new version is stronger. So in effect I'm already doing what you're suggesting; starting the games at the same point (at least with those opponents that are deterministic; still the majority, but it seems the percentage is decreasing as Eden moves up the ladder).
3. At this weak level, improvements are likely fairly significant. We are usually concerned with huge bugs, plus techniques that have been proven.


So for my engine, where I explicitly have the goal that the new version should be _significantly_ stronger than the previous one, but not exceedingly so, this is the technique I use (and so far it has been pretty consistent):

I find a group of engines that are reasonably close in strength (this selection gets better with time) and play a double-round tournament with them, together with my old version and the new version that I consider worthy of testing for sufficient strength.

Usually I then decide that my version is sufficiently strong if it has at least one other engine placed between it and the old version. Sometimes it places much higher.
And if you don't play thousands of games, sometimes the results are going to be exactly the opposite of what is really going on...
Well, you're invited to test each version of Eden against the older versions (and against other engines too, of course, since each new version's opening book changes the "main line" that gets played when new meets old, and that line is likely to be improved) and see if your results are different from mine. That is, see whether you find an older version that is stronger than some newer version, at the significance level you've chosen.

This may actually be the case for the change from 0.0.11 to 0.0.12, where I explicitly mentioned that the newer version may be only "slightly" better than the older one. Some people have told me that 11 is better, or at least that 12 is not stronger. I was willing to risk that, because there was a lot of time and a lot of bad things between 11 and 12, and I just had to get something out again. I was actually surprised that the results indeed showed it was better.


So far I have had no reason to take back my claim that for me at least, my method is working.


BTW you should also note that there are many different versions of Eden even for just one release:
1. The basic java version
2. The "server" version, which uses the JIT VM, which even fewer people have installed than just a JRE for 1)
3. The "JA" version, compiled by Jim Ablett. Although it does have really weird behaviour regarding the nps etc. reporting (that I can't explain for now; it's "lying") it is significantly stronger than the basic version. There are theoretically sound reasons for that, plus my few experiments have confirmed the theory.

And for each of these versions there is a WB version and a UCI version. They actually play differently (the UCI version may lose a little bit of time when having to go through an adapter, and the WB version doesn't claim draws, which has actually led to losing a drawn game because I threw an IllegalStateException when the position was repeated for the fourth time :-)).

But for my tests, I always keep the conditions as much the same as before, precisely to reduce variability.

And even when I have received a result which superficially would indicate that the strength has increased, I check the games to see if there are any anomalies. When there's e.g. just one engine in between the old version's placement in the tourney and the new one's, I am more doubtful than when the old version came last out of 15 engines and the new one takes first or even third place.
nczempin

Re: nondeterministic testing

Post by nczempin »

bob wrote: Fixed-depth is better, if you only change evaluation stuff.
BTW right now I'm more or less changing anything BUT the evaluation, because my engine is severely nps-bound/time-to-depth-bound. Just a few low-level optimizations, or rather the removal of huge inefficiencies, gained more or less a whole ply at the 2/6 time controls I'm using for my tests (from an average of around 5 to an average of around 6, which your engine probably reaches in the time it takes me to do one eval :-). My opponents usually play horrendously, I mean truly badly (as I said, stuff like Nh6 on the first move), and then Eden gets outsearched by one or two plies and misses something in the late middlegame.

I intensely prefer the way my engine plays to the way most of its competitors play.

I wanted to build it on the way I would teach a beginner about chess, and so far I am very happy about the results. Right now I would teach my pupil to work on the tactics :-)
nczempin

Re: nondeterministic testing

Post by nczempin »

hgm wrote:
bob wrote:This is wrong. I can produce thousands of games played that start from Albert Silver's 40 starting positions. I can play the same position over and over and not get the same results every time. In fact, I can show you games played from the same starting position where I win all 4 and lose 2 of them the second time around. You should test this and you will see what I mean.
The point is that you could do billions of tests on Crafty, and it would still tell you zilch about how Eden or uMax behave....

I did play Eden against uMax, and it is indeed as Nicolai claims: all games are always the same. (Well, it would be strange if he did not know his own engine, wouldn't it? :roll: ) At least in so far as I tried; I did not have the stamina to play the same game over and over again for thousands of times, just in the hope that I finally might get one that is different.

Actually this is expected for engines that always finish an iteration once they start it. The only possible source of randomness is an iteration finishing extremely close to the time limit at which you would start a new one. Then, due to timing variability, you might sometimes decide to do a new iteration and other times not. The probability of that is extremely small, as timing variability is usually only a few milliseconds while search times are many seconds. And even doing an extra iteration doesn't mean you will pick another move. Most of the time the move stays the same....
Just a quick note: Eden does NOT finish the iteration. So far there's only a simplistic time control. I know this is a problem, of course, yet another angle for future improvement (but by itself probably not significant for a new version :-) )
Marc Lacrosse
Posts: 511
Joined: Wed Mar 08, 2006 10:05 pm

Re: nondeterministic testing

Post by Marc Lacrosse »

bob wrote: The bottom line is that the only practical solution is a _bunch_ of games. Anything else leads to reduced (significantly reduced) accuracy. I'm working on a paper for the JICGA that really illustrates how this variability is created, and the effect it has on games. It will convince anyone testing that 100 games is absolutely worthless unless the change you make is major, as maybe in adding null-move vs no null move or something drastic like that. Reasonable changes (hopefully improvements) require thousands of games to verify. That seems to be sadly overlooked in most of the posts here.
Hi Bob : hundreds of games yes, thousands no when the change is significant.

Here is the result of a test I just finished for a certain engine over the last few days.

An engine has a new adjustable parameter (let's say it's called "QCE") for which five values (2, 3, 4, 5, 6) are possible.
I just let it play 256 games with each possible value of QCE against my standard opponents (Rybka 1.0b, Toga 1.3x4, Spike 1.2, Naum 2.0).

Here are the raw results :

Code: Select all

Percentage achieved against  Points  Comment
 Ry   To   Sp   Na   All     /256
51.6 53.9 75.8 74.2  63.9%   163.5  QCE = 4
43.7 64.1 64.8 64.8  59.4%   152.0  QCE = 2
50.8 53.9 63.3 68.0  59.0%   151.0  QCE = 3
46.1 49.2 69.5 65.6  57.6%   147.5  QCE = 5
46.9 51.6 63.3 68.0  57.4%   147.0  QCE = 6
According to Bayeselo the significance of these results is :

Code: Select all

LoS table for QCE values :
         04 02 03 05 06
-----------------------
QCE = 4  xx 89 92 95 95
QCE = 2  10 xx 57 66 67
QCE = 3   7 42 xx 59 60
QCE = 5   4 33 40 xx 50
QCE = 6   4 32 39 49 xx
This reads as: there is an 89% probability that QCE=4 is superior to QCE=2, 92% that it is superior to QCE=3, and so on.

In this case the observed difference between QCE=4 and QCE=2 (163.5/256 vs 152/256) is about 32 Elo points. OK, this is not a tiny difference, but it is much less than the difference between the same engine with and without a major feature like null-move.
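
As a back-of-the-envelope check of that figure, here is a small sketch converting the two score percentages to Elo against the common opposition with the usual logistic formula (just the arithmetic, not how Bayeselo actually models things):

Code: Select all

#include <math.h>
#include <stdio.h>

/* Convert a score fraction against common opponents to an Elo
 * advantage using the standard logistic model.  Rough check only;
 * Bayeselo handles draws and opponent strengths more carefully. */
static double elo_from_score(double score)
{
    return -400.0 * log10(1.0 / score - 1.0);
}

int main(void)
{
    double qce4 = 163.5 / 256.0;    /* 63.9 % */
    double qce2 = 152.0 / 256.0;    /* 59.4 % */
    printf("QCE=4: %+.0f Elo, QCE=2: %+.0f Elo, difference: %.0f Elo\n",
           elo_from_score(qce4), elo_from_score(qce2),
           elo_from_score(qce4) - elo_from_score(qce2));
    return 0;
}

This lands at roughly 33 Elo, in the same ballpark as the quoted difference.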

My example was run in four days on a single PC.

I do not consider that per se it is full proof, but it gives me (and the author) interesting preliminary information:
- this tunable feature is worth working on (it really affects strength)
- QCE = 4 could well be the best value

Not too bad for just four days on a single processor, is it?

Marc
MartinBryant

Re: nondeterministic testing

Post by MartinBryant »

You aren't disagreeing with Bob.
If you read his post carefully, he is talking about thousands of games to try to measure small changes, which, unfortunately, is what the vast majority of changes are.
Uri Blass
Posts: 10820
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: nondeterministic testing

Post by Uri Blass »

The main question is whether tests at super-fast time control (1 second per game) are a good indicator for longer time controls (at least for evaluation changes).

I wonder what Bob's experience with this is, because if tests at super-fast time control are a good indicator, then you can practically try a lot of candidate changes at super-fast time control and later test the promising changes at a slower time control.

Uri
MartinBryant

Re: nondeterministic testing

Post by MartinBryant »

Does anyone have any theories/thoughts on why a change would be more or less effective at different time controls?

Apart from the odd exception, I would think that most changes would be equally valid at any time control?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

Marc Lacrosse wrote:
bob wrote: The bottom line is that the only practical solution is a _bunch_ of games. Anything else leads to reduced (significantly reduced) accuracy. I'm working on a paper for the JICGA that really illustrates how this variability is created, and the effect it has on games. It will convince anyone testing that 100 games is absolutely worthless unless the change you make is major, as maybe in adding null-move vs no null move or something drastic like that. Reasonable changes (hopefully improvements) require thousands of games to verify. That seems to be sadly overlooked in most of the posts here.
Hi Bob : hundreds of games yes, thousands no when the change is significant.
I believe I said that. But an eval change, even if it is fairly extensive, generally doesn't produce a quantum-leap forward in playing skill, because tactics are still tactics, depth is still depth, and the rest of the (unchanged) knowledge still contributes. Most of the changes we make nowadays are not going to make a significant and repeatable difference.

However, I am not convinced that a hundred or two hundred games is enough for _anything_. I previously posted two different 80-game (40-position) matches where one showed A was better and the second showed B was better. That kind of dynamic variability must be weeded out, and it really does take _hundreds_ (not one hundred) of games to get it under control...
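
To put a rough number on that, here is a sketch of the one-sigma error of a match result expressed in Elo, assuming a per-game score spread of about 0.4 (a typical value at moderate draw rates; the exact figure is an assumption). It shows why two 80-game matches can easily point in opposite directions while thousands of games pin the result down:

Code: Select all

#include <math.h>
#include <stdio.h>

/* Rough error bars on a match result: assume a per-game score standard
 * deviation of ~0.4 and convert the score error to Elo near the 50 %
 * point, where the slope of the logistic curve is 400/(ln 10 * 0.25)
 * Elo per unit of score.  Illustration of the statistics only. */
static double elo_error(int games)
{
    double per_game_sd = 0.4;                     /* assumed */
    double score_sd = per_game_sd / sqrt((double)games);
    return score_sd * 400.0 / log(10.0) * 4.0;    /* d(Elo)/d(score) at 50 % */
}

int main(void)
{
    int sizes[] = { 80, 160, 1000, 5000, 20000 };
    for (int i = 0; i < 5; i++)
        printf("%5d games: +/- %4.1f Elo (one standard deviation)\n",
               sizes[i], elo_error(sizes[i]));
    return 0;
}

With these assumptions an 80-game match has a one-sigma error of roughly 30 Elo, while 5000 games bring it down to about 4 Elo.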


Here is the result of a test I just finished for a certain engine over the last few days.

An engine has a new adjustable parameter (let's say it's called "QCE") for which five values (2, 3, 4, 5, 6) are possible.
I just let it play 256 games with each possible value of QCE against my standard opponents (Rybka 1.0b, Toga 1.3x4, Spike 1.2, Naum 2.0).

Here are the raw results :

Code: Select all

Percentage achieved against  Points  Comment
 Ry   To   Sp   Na   All     /256
51.6 53.9 75.8 74.2  63.9%   163.5  QCE = 4
43.7 64.1 64.8 64.8  59.4%   152.0  QCE = 2
50.8 53.9 63.3 68.0  59.0%   151.0  QCE = 3
46.1 49.2 69.5 65.6  57.6%   147.5  QCE = 5
46.9 51.6 63.3 68.0  57.4%   147.0  QCE = 6
According to Bayeselo the significance of these results is :

Code: Select all

LoS table for QCE values :
         04 02 03 05 06
-----------------------
QCE = 4  xx 89 92 95 95
QCE = 2  10 xx 57 66 67
QCE = 3   7 42 xx 59 60
QCE = 5   4 33 40 xx 50
QCE = 6   4 32 39 49 xx
Forget the Elo stuff. What happens if you run the _same_ test a second time? That's the part that is worrisome. You will likely get results that are different. Personally I wouldn't conclude anything from that. I tried a similar test with varying the history threshold for LMR (when I used history info). The results were so variable I could not use 160 games to draw any conclusion...

The interesting thing is to run your same test twice and see how closely the two results agree. (I bet they won't.) That means the Bayeselo stuff is simply flawed, because it doesn't include the dynamic variability that a program introduces compared to a human.
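
One way to see the point is to simulate it: play two 256-game runs with exactly the same underlying strength and look at how far apart the two scores land from sampling noise alone. The win/draw probabilities below are assumptions chosen to roughly match the scores in the table, not real data:

Code: Select all

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy simulation of "run the same test twice": both runs use identical
 * true per-game probabilities (assumed values), so any gap between
 * their scores is pure sampling noise. */
static double play_match(int games, double p_win, double p_draw)
{
    double score = 0.0;
    for (int g = 0; g < games; g++) {
        double r = (double)rand() / RAND_MAX;
        if (r < p_win)               score += 1.0;
        else if (r < p_win + p_draw) score += 0.5;
    }
    return score;
}

int main(void)
{
    srand(12345);                    /* fixed seed, reproducible */
    int big_gaps = 0, trials = 10000;
    for (int t = 0; t < trials; t++) {
        double a = play_match(256, 0.59, 0.10);   /* run 1 */
        double b = play_match(256, 0.59, 0.10);   /* run 2, same engine */
        if (fabs(a - b) > 11.5)      /* the QCE=4 vs QCE=2 gap above */
            big_gaps++;
    }
    printf("%.0f%% of repeated 256-game runs differ by more than 11.5 points\n",
           100.0 * big_gaps / trials);
    return 0;
}

With these assumptions, roughly a quarter of such repeated pairs differ by more than the 11.5-point gap that separates QCE=4 from QCE=2 in the table.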
This reads as: there is an 89% probability that QCE=4 is superior to QCE=2, 92% that it is superior to QCE=3, and so on.

In this case the observed difference between QCE=4 and QCE=2 (163.5/256 vs 152/256) is about 32 Elo points. OK, this is not a tiny difference, but it is much less than the difference between the same engine with and without a major feature like null-move.

My example was run in four days on a single PC.

I do not consider that per se it is full proof, but it gives me (and the author) interesting preliminary information:
- this tunable feature is worth working on (it really affects strength)
- QCE = 4 could well be the best value

Not too bad for just four days on a single processor, is it?

Marc
Depends. Would I choose to use that value in an important tournament? Not a chance. It suggests that tuning helps, but I would personally resort to my current testing to really make the decision about which value is correct. Because with my current testing I do get repeatable results, by averaging 5000+ games per opponent to get stability. When I now conclude that B is better, I am not guessing; I conclude that with a high degree of reliability.

But the testing does take time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

Uri Blass wrote:The main question is if tests at super fast time control(1 second per game) are good indicator for longer time control(at least for evaluation changes).

I wonder what is Bob's experience about this because if tests at super fast time control are good indicator then you can practically try a lot of candidate changes at super fast time control and test later the promising changes at slower time control.

Uri
Two issues. There are many things you can do that are depth independent. Evaluation changes are, for the most part. But if you add aggressive king safety stuff, you will begin to lean on the tactical search to help refute things that are too risky. And testing at shallow depths might give different answers.

My approach is to try them all (long and short) to see if they are pretty well in agreement. If so, I will use short for testing. But I have found at least one case where the shorter the time control, the more significant my (Crafty's) advantage over a particular opponent (which shall remain unnamed).

It seems to me that there are two issues. If you are trying to prove program A is better than program B, the longer the time control, the better. But if you are comparing A to B and A' to B (where A' is a slightly modified A), then you don't care so much about the final result as about the difference between the results for A and A'. And there, faster is better from a development point of view. I am doing a lot of 2+1 games, since with 5000+ games against each of 4 opponents, that takes a day on my cluster. And occasionally I will do a 60+60 run just to see how the results track, and to be sure that my changes are not good in fast games but bad in longer games, which has happened.

It is not easy to test. It is not easy to benchmark. For both of those, I always prefer to test or benchmark under the identical conditions that I plan on running under, as a final sanity check.