Nondeterministic Testing in Weak engines such as Eden 0.0.13

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nczempin

Nondeterministic Testing in Weak engines such as Eden 0.0.13

Post by nczempin »

I'm opening this new thread because I think we are getting into quite some detail regarding Eden, which may stray a little too far from the topic of the original thread.
hgm wrote:Well, it would be interesting to see at which move they first deviate, and if the game was already decided by then. (At what time control were you doing this)?

As you can see from my tests, Eden is quite steadfast in its moves, and it was uMax that caused the variation.

But the problem with this kind of variability (i.e. deciding whether or not to make one more iteration) is that, even if you can calculate that there is a 5% probability per move that this will happen, in practice, when you play from a fixed initial position, it will always be the _same_ 95% of the moves where the iteration finishes so far from the timeout that they cannot be affected. And on the 5% of the moves that can be affected, there are only 2 choices: it either starts the extra iteration (resulting in another move) or it doesn't (sticking to the old move). Which moves are subject to this 1-out-of-2 variability is in itself fully deterministic.

So even with 5% of the moves deviating, if games on average last 100 moves, you would have only ~5 decision points, and you could play only 32 different games. This means that in a match of 10 games you might imagine there is no problem, but if you play 1000 games, every game would appear about 32 times. This makes playing thousands of games totally useless. The exact probability that it chooses one of the 32 possible continuations over the alternative is of little interest.
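
(As an aside, a minimal sketch of the mechanism described above; this is illustrative Java, not uMax's or Eden's actual code, and all names and the time budget are made up. The only nondeterministic branch is whether the clock allows one more iteration, so each affected move has exactly two possible outcomes, and k such moves give at most 2^k distinct games.)

Code:

    // Sketch of an iterative-deepening loop with a time check between
    // iterations; all names and the time budget are illustrative.
    public final class IterativeDeepeningSketch {

        static final long TIME_BUDGET_MS = 2000;      // time allotted for this move

        static int searchToDepth(int depth) {
            return depth;                              // placeholder for the real search
        }

        public static int think() {
            long start = System.currentTimeMillis();
            int bestMove = searchToDepth(1);
            for (int depth = 2; ; depth++) {
                long elapsed = System.currentTimeMillis() - start;
                // The only nondeterministic branch: tiny timing differences
                // decide whether one more iteration is started. Per move there
                // are just two possible outcomes, so k affected moves yield at
                // most 2^k distinct games.
                if (elapsed > TIME_BUDGET_MS / 2) {
                    break;                             // keep the last completed iteration's move
                }
                bestMove = searchToDepth(depth);
            }
            return bestMove;
        }
    }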

You have 2 games of 94 ply. Are these move-for-move the same?
I'll post the pgns later. In the meantime, I'll do the test that I originally meant to do: Eden against Eden, match of 10.
nczempin

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by nczempin »

Eden 0.0.12 JA against Eden 0.0.13 results in exactly the same games.

Regarding Hashing: My replacement mechanism is a "never replace" scheme: when the HT is full, it is full. I also clear it before each move.
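
(For reference, a minimal sketch of what such a "never replace, clear before every move" table could look like; this is illustrative Java with made-up names, not Eden's actual source.)

Code:

    // Minimal "never replace" transposition table sketch (illustrative only).
    public final class NeverReplaceTable {

        static final class Entry {
            long key;        // full zobrist key, to detect index collisions
            int  depth;      // draft of the stored search
            int  score;
            int  bestMove;
        }

        private final Entry[] table;

        public NeverReplaceTable(int size) {
            table = new Entry[size];
        }

        // Cleared before every root move.
        public void clear() {
            java.util.Arrays.fill(table, null);
        }

        // "Never replace": once a slot is occupied it is never overwritten,
        // so when the table is full, it is full.
        public void store(long key, int depth, int score, int bestMove) {
            int index = (int) ((key & 0x7fffffffffffffffL) % table.length);
            if (table[index] != null) {
                return;                          // slot taken, do not overwrite
            }
            Entry e = new Entry();
            e.key = key; e.depth = depth; e.score = score; e.bestMove = bestMove;
            table[index] = e;
        }

        public Entry probe(long key) {
            int index = (int) ((key & 0x7fffffffffffffffL) % table.length);
            Entry e = table[index];
            return (e != null && e.key == key) ? e : null;
        }
    }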
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by hgm »

So perhaps Philemon was not as reproducible as you thought?
nczempin

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by nczempin »

hgm wrote:So perhaps Philemon was not as reproducible as you thought?
Yes, that is definitely the case. However, Eden against Eden is a special case, in that because "both" use a book without varying lines, the one line both sides play is quite long, and probably already decisive.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by bob »

what we need is science and not random experiments. In this case, to test reproducibility of the _engine_ you need to eliminate the book, any sort of learning outside of the book, etc. Then you can tell whether there is reproducibility or not. Sticking one side in a lost position out of book won't tell a thing about the issue at hand...
nczempin

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by nczempin »

bob wrote:what we need is science and not random experiments. In this case, to test reproducibility of the _engine_ you need to eliminate the book, any sort of learning outside of the book, etc. Then you can tell whether there is reproducibility or not. Sticking one side in a lost position out of book won't tell a thing about the issue at hand...
Hmm, but we're not yet at the stage of submitting a research proposal (figuratively speaking). At least I am still in an exploratory stage.

I proposed a hypothesis, performed some experiments, and those contradicted my hypothesis. Eden does not play 2x5 identical games against Philemon.

The next hypothesis was that Philemon is the less-deterministic engine and that it is the one causing the fluctuations. So I played Eden 0.0.13 against 0.0.12 JA and found support for the hypothesis.

You're saying that the conditions under which I test are insufficient. That's because you seem to be interested in a much stronger hypothesis than the one I'm using as a working model: "Under the current conditions I use for testing each new Eden version, my methods are successful in ensuring that each new version is sufficiently stronger than the one before".

Some of these conditions should be stated:
    - I always test at 2'/6". I am willing to accept for now (other factors apparently being more significant) the assumption that time settings are not significant overall. Any statement of result would have to include "assuming that this result can be extended to other time controls". The reason for this time control, incidentally, is that it is long enough for Eden to reach a non-trivial ply depth of around 6 (especially considering hash tables), but still short enough to allow me to leave the one computer I use for development running the tourneys for the majority of the day and night when I'm not developing the engine.
    - I spend a lot of time determining the members of the tournament each time. Exactly how I do it I haven't yet described. I know you said that making the opponents close to my engine in strength would increase variability. I'm still not convinced that this is true, at least when looking at the extremes: If I chose to use engines that are much stronger (without loss of generality), any improvements would have to be extremely significant before any conclusion could be drawn. For example, if I used only Rybkas and Fruits and Craftys, the only way Eden could even theoretically get a draw would be by an extremely drawish line in the common book. Not using any book would make any progress totally invisible; all versions of Eden so far would show the same result: 0-n (n being the number of games in the match :-)
    - I consciously include "own book" because this is the level of competition I am interested in. Just like a human player would work on his repertoire, Eden can improve by improving its book. Your statement about "reproducibility of the _engine_" loses the intended meaning in this context, because by my definition the own book is part of the engine. I am well aware that at higher levels this definition makes far less sense. In the Eden vs. Eden case, just for that one common position they reach in the one book, the new version winning _both_ games is IMHO a good indicator (albeit only one, certainly not sufficient for a decision to release) that the new version is stronger.
    - I try to find at least around ten engines that I play a double-round RR against (with both the old released version and the "release candidate" playing in the tournament). One factor that truly does seem to decrease the overall variability is related to the Central Limit Theorem, which you most likely know much more about than I do: the fluctuations tend to cancel out overall (see the sketch after this list).
    - At least my engine so far has only minimal exposure to the factors you mentioned earlier that could adversely affect variability, and among the opponents at its level, the overall variability is much less than it would be at a much higher (actually, probably starting at the "non-trivial" :-) level.
    - Again, the release "go or not-yet-go" decision is not made objective; it is a subjective process, and thus diametrically opposed to any idea of science. Perhaps this is the most significant factor; if I had objective release criteria, my guess is that the number of games needed would be much higher. You can guess why I won't go that route...
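
(A small sketch of the statistics behind the "fluctuations cancel out" point above; purely illustrative, assuming independent games and ignoring draws.)

Code:

    // How the error in a measured match score shrinks with the number of games.
    public final class MatchScoreError {
        public static void main(String[] args) {
            double p = 0.5;                            // assumed true score rate
            int[] gameCounts = {20, 100, 1000};
            for (int n : gameCounts) {
                double sigma = Math.sqrt(p * (1 - p) / n);
                System.out.printf("n = %4d games: score = %.0f%% +/- %.1f%% (one sigma)%n",
                        n, 100 * p, 100 * sigma);
            }
            // With 20 games the one-sigma error is about 11 percentage points,
            // so a difference of a few points between two versions is well
            // inside the noise.
        }
    }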
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by bob »

nczempin wrote:
bob wrote:what we need is science and not random experiments. In this case, to test reproducibility of the _engine_ you need to eliminate the book, any sort of learning outside of the book, etc. Then you can tell whether there is reproducibility or not. Sticking one side in a lost position out of book won't tell a thing about the issue at hand...
Hmm, but we're not yet at the stage of submitting a research proposal (figuratively speaking). At least I am still in an exploratory stage.
But even while exploring, you want to explore correctly, otherwise any conclusions you draw are flawed.

I proposed a hypothesis, performed some experiments, and those contradicted my hypothesis. Eden does not play 2x5 identical games against Philemon.

The next hypothesis was that Philemon is the less-deterministic engine and that it is the one causing the fluctuations. So I played Eden 0.0.13 against 0.0.12 JA and found support for the hypothesis.
You are not quite testing your hypothesis. You need to factor in the book as well, because if one side starts off dead lost, the chances of reproducing the game over and over are quite high. If one side starts off in a position where everything is forced for many moves, same problem.

So get rid of the book, and the problems it introduces, so that you can test the randomness (or lack thereof) of the actual engine itself. I use the Silver positions Albert posted here a year or two back. You could also use the Nunn positions, or any other set that is reasonably spread out over several opening systems and is fairly balanced...
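
(A sketch of what such a book-less A-vs-A' test amounts to; the Engine interface, playGame and the position list are placeholders for illustration, not Arena's or any real tool's API. Each position from a fixed, balanced set is played twice with colours reversed, so the openings are held constant while the two versions are compared.)

Code:

    // Book-less test of an old vs. a new version over fixed start positions.
    import java.util.List;

    public final class FixedPositionMatch {

        interface Engine { String name(); }

        // Result of one game from White's point of view: 1 (win), 0 (draw), -1 (loss).
        static int playGame(Engine white, Engine black, String startFen) {
            return 0;    // placeholder: the actual game would be played here
        }

        public static void run(Engine oldVersion, Engine newVersion, List<String> startFens) {
            int newScore = 0;    // relative to the new version: wins minus losses
            for (String fen : startFens) {
                // Each position is played twice with colours reversed, so an
                // unbalanced position penalises both versions equally.
                newScore += playGame(newVersion, oldVersion, fen);
                newScore -= playGame(oldVersion, newVersion, fen);
            }
            System.out.println("Score of " + newVersion.name() + " vs "
                    + oldVersion.name() + ": " + newScore
                    + " (over " + 2 * startFens.size() + " games)");
        }
    }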


You're saying that the conditions under which I test are insufficient. That's because you seem to be interested in a much stronger hypothesis than the one I'm using as a working model: "Under the current conditions I use for testing each new Eden version, my methods are successful in ensuring that each new version is sufficiently stronger than the one before".

Some of these conditions should be stated:
- I always test at 2'/6". I am willing to accept for now (other factors apparently being more significant) the assumption that time settings are not significant overall. Any statement of result would have to include "assuming that this result can be extended to other time controls". The reason for this time control, incidentally, is that it is long enough for Eden to reach a non-trivial ply depth of around 6 (especially considering hash tables), but still short enough to allow me to leave the one computer I use for development running the tourneys for the majority of the day and night when I'm not developing the engine.
- I spend a lot of time determining the members of the tournament each time. Exactly how I do it I haven't yet described. I know you said that making the opponents close to my engine in strength would increase variability. I'm still not convinced that this is true, at least when looking at the extremes: If I chose to use engines that are much stronger (without loss of generality), any improvements would have to be extremely significant before any conclusion could be drawn. For example, if I used only Rybkas and Fruits and Craftys, the only way Eden could even theoretically get a draw would be by an extremely drawish line in the common book. Not using any book would make any progress totally invisible; all versions of Eden so far would show the same result: 0-n (n being the number of games in the match :-)
Again, you are measuring the wrong thing. You want to play against equal opponents. You can learn just as much (if not more) by playing against stronger opponents. If you know little about passed pawns, and only play against equal opponents that also know little about passed pawns, that missing component of your evaluation won't show up clearly.

A simple example: You play against program X (nearly equal) with your old version and score +50/-50 (ignoring draws completely for now). Then you play the same test with your new version and score +51/-49. Then you play against Rybka with the original and score +1/-99, and the new version scores +4/-96. That second result is far more significant than your first test, because even though you are getting killed by Rybka, you made a clear improvement. The problem with choosing nearly equal opponents is that it now takes 10x (or more) the number of games to detect an improvement that it does with much weaker and much stronger opponents added into the mix.

- I consciously include "own book" because this is the level of competition I am interested in. Just like a human player would work on his repertoire, Eden can improve by improving its book. Your statement about "reproducibility of the _engine_" loses the intended meaning in this context, because by my definition the own book is part of the engine. I am well aware that at higher levels this definition makes far less sense. In the Eden vs. Eden case, just for that one common position they reach in the one book, the new version winning _both_ games is IMHO a good indicator (albeit only one, certainly not sufficient for a decision to release) that the new version is stronger.

And there's the lack of science I talked about. You make a change. You score worse, because you play worse book lines? Or because your change is bad? Which is it? It violates the age-old principle in testing of "change only one thing at a time..."

Yes, if you want to test the entire system to see how it compares to another engine, books are necessary. But we are talking about comparing version A to version A' to see if A' is better or worse, not whether A or A' is better than any single opponent we test against. That's the key to improving an engine.

- I try to find at least around ten engines that I play a double-round RR against (with both the old released version and the "release candidate" playing in the tournament). One factor that truly does seem to decrease the variability overall seems to be related to some Central Limit Theorem that you most likely know much more about than I do, that the fluctuations tend to cancel out overall.

You might as well flip a coin if that is how you decide whether A' is better than A or not. Too many random factors involved. Your book. The books of the other opponents. Learning. Timing variance. 20 games is worthless.


- At least my engine so far has only minimal exposure to the factors you mentioned earlier that could adversely affect variability, and among the opponents at its level, the overall variability is much less than it would be at a much higher (actually, probably starting at the "non-trivial" :-) level.
- Again, the release "go or not-yet-go" decision is not made objective; it is a subjective process, and thus diametrically opposed to any idea of science. Perhaps this is the most significant factor; if I had objective release criteria, my guess is that the number of games needed would be much higher. You can guess why I won't go that route...

If you don't get out of that mode, you are going to make lots of mistakes, false steps, and giant steps backward. "Go/no-go" can certainly be made very objective. But it has to be based on just the engine; or use the same engine, tune only the book, and then evaluate the results with respect to go/no-go on the new book. But don't test both at once, as it is impossible to determine which is the cause of the better or worse result... engine... book... simple randomness...


nczempin

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by nczempin »

bob wrote:...
You made some very valid comments.


I guess for now I just have to say: My engine is severely challenged in nps and time-to-depth compared to other engines at a similar level. The changes I am making are 99% just optimizations, which should to a large extent lead to the engine getting stronger (except in those rare cases where looking deeper causes you to dismiss the better move that you would find were you to look deeper still).

So under this condition it is mainly a question of: have I optimized enough to get a whole ply more on average (yes, my engine still has a lot of potential in that area)? And when that is the case, my tests are there to find out whether that one ply was actually significant (which it doesn't have to be).

I am not changing the eval, the move ordering, or introducing any known techniques such as null-move. All I'm doing is finding bottlenecks, changing Java objects into primitives, representing "blackness" with a bit instead of <0, etc.
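
(A tiny sketch of the kind of representation change meant here; the constants are made up for illustration and are not Eden's actual encoding.)

Code:

    // Piece encoding with the colour in a dedicated bit of an int, instead of
    // a Java object per piece or the sign of the value.
    public final class PieceEncoding {
        static final int PAWN = 1, KNIGHT = 2, BISHOP = 3, ROOK = 4, QUEEN = 5, KING = 6;
        static final int BLACK_BIT = 8;            // set for black pieces

        static int makePiece(int type, boolean black) {
            return black ? (type | BLACK_BIT) : type;
        }

        static boolean isBlack(int piece) {
            return (piece & BLACK_BIT) != 0;       // a mask test instead of "piece < 0"
        }

        static int typeOf(int piece) {
            return piece & 7;                      // strip the colour bit
        }
    }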

So I guess this factor pretty much makes the previous discussions on Eden meaningless, or at least any conclusions that anyone would like to draw from them.

And yet the question remains: Why does my approach still seem to work? Who is willing to test my hypothesis that each version of Eden is stronger than the preceding one, even under the conditions you propose to be necessary?


I would also like to make one thing clear: I am very well aware of statistical issues such as random fluctuations (not solely because I play Poker sometimes), especially the fact that the human mind by default seems to be unable to deal with them.

I'm normally the guy that says in that joke I'm sure you've heard: "no, you can't say that all sheep in Scotland are black, not even that at least one sheep is, but the only thing you can say is that at least one sheep in Scotland is black on at least one side"
I'm always the guy that points out that e.g. salesperson competitions are more or less meaningless because the random fluctuations completely dominate any skill there might be.
I always shrug off "amazing" events that I easily determine to be very possibly caused simply by randomness.

I even seem to be too far on the side of randomness, being very skeptical even of scientific articles that claim to have found this or that correlation and/or even causation.

I take an interest in Statistical Process Control even to the extent of owning (although not yet having worked through) Shewhart, W A (1939) Statistical Method from the Viewpoint of Quality Control, plus Deming and lots of other stuff that precedes Watts Humphrey's work on Software Engineering Processes.


So...


It feels weird when I'm being placed "on the other side" :-)
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by hgm »

bob wrote:A simple example: You play against program X (nearly equal) with your old version and score +50/-50 (ignoring draws completely for now). Then you play the same test with your new version and score +51/-49. Then you play against Rybka with the original and score +1/-99, and the new version scores +4/-96. That second result is far more significant than your first test, because even though you are getting killed by Rybka, you made a clear improvement. The problem with choosing nearly equal opponents is that it now takes 10x (or more) the number of games to detect an improvement that it does with much weaker and much stronger opponents added into the mix.
This is a very debatable statement / misleading example.

Yes, going from 1-99 to 4-96 is statistically somewhat more significant than going from 50-50 to 51-49. But not orders of magnitude. The standard deviation of the number of wins (ignoring draws) in 100 games at 1% win probability is sqrt(100*0.01*0.99) ≈ 1, while at 50% it is sqrt(100*0.5*0.5) = 5.

But what you say will never happen. To get a 3% better score around the 2% level requires an enormously larger rating increase than to get a 3% increase around 50%. The details depend on the rating model that one uses, but in the popular 1/(1+10^(rating_diff/400)) the slope of the score-vs-rating curve is ~12.5 times steeper. So an improvement that would raise your score from 1% to 4% against Rybka would most likely give an improvement of 37% against the formerly-equal opponent. So you would not win by 51-49, but more something like 85-15. And that is more significant, even in the face of the larger statistical error.

Believing that a 1% improvement against an equal opponent would measurably improve your score against Rybka is just day-dreaming...
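
(For the record, a small illustrative sketch that reproduces the figures in the argument above: the binomial standard deviations and the logistic Elo model. The exact numbers depend on where on the curve one evaluates.)

Code:

    // Binomial standard deviations and the slope of the score-vs-rating curve.
    public final class ScoreModelSketch {

        // Expected score for a rating advantage d in the logistic Elo model.
        static double expectedScore(double d) {
            return 1.0 / (1.0 + Math.pow(10, -d / 400.0));
        }

        public static void main(String[] args) {
            // Standard deviation of the number of wins in 100 games.
            System.out.printf("sd at  1%%: %.2f wins%n", Math.sqrt(100 * 0.01 * 0.99)); // ~1
            System.out.printf("sd at 50%%: %.2f wins%n", Math.sqrt(100 * 0.50 * 0.50)); // 5

            // The slope of the score curve is proportional to p*(1-p), so a fixed
            // rating gain moves the score far more near 50% than near 2%.
            double slopeRatio = (0.5 * 0.5) / (0.02 * 0.98);
            System.out.printf("slope ratio, 50%% vs 2%%: ~%.1f%n", slopeRatio);         // ~12.8

            // Rating gain needed to go from a 1% to a 4% score, and the score that
            // same gain gives against a formerly equal opponent.
            double gain = 400 * (Math.log10(0.04 / 0.96) - Math.log10(0.01 / 0.99));
            System.out.printf("rating gain for 1%% -> 4%%: %.0f Elo%n", gain);          // ~246
            System.out.printf("score vs formerly equal opponent: %.0f%%%n",
                    100 * expectedScore(gain));                                         // ~80%, same ballpark as the 85-15 above
        }
    }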
nczempin

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by nczempin »

hgm wrote:
bob wrote:A simple example: You play against program X (nearly equal) with your old version and score +50/-50 (ignoring draws completely for now). Then you play the same test with your new version and score +51/-49. Then you play against Rybka with the original and score +1/-99, and the new version scores +4/-96. That second result is far more significant than your first test, because even though you are getting killed by Rybka, you made a clear improvement. The problem with choosing nearly equal opponents is that it now takes 10x (or more) the number of games to detect an improvement that it does with much weaker and much stronger opponents added into the mix.
This is a very debatable statement / misleading example.

Yes, going from 1-99 to 4-96 is statistically somewhat more significant than going from 50-50 to 51-49. But not orders of magnitude. The standard deviation of the number of wins (ignoring draws) in 100 games at 1% win probability is sqrt(100*0.01*0.99) ≈ 1, while at 50% it is sqrt(100*0.5*0.5) = 5.

But what you say will never happen. To get a 3% better score around the 2% level requires an enormously larger rating increase than to get a 3% increase around 50%. The details depend on the rating model that one uses, but in the popular 1/(1+10^(rating_diff/400)) the slope of the score-vs-rating curve is ~12.5 times steeper. So an improvement that would raise your score from 1% to 4% against Rybka would most likely give an improvement of 37% against the formerly-equal opponent. So you would not win by 51-49, but more something like 85-15. And that is more significant, even in the face of the larger statistical error.

Believing that a 1% improvement against an equal opponent would measurably improve your score against Rybka is just day-dreaming...
I don't draw any conclusions whatsoever from matches against only one opponent. I certainly wouldn't judge an increase from 50 -> 51 to be significant.

I will publish my actual tests next time around; unfortunately I lost all the previous ones because Vista kept all Arena pgns and .ats in some kind of shadow directory that was not accessible via the normal Explorer, but only within Arena. Once I changed Arena's settings to run in Admin mode, everything was gone.

So for version 0.0.14, I'll include my tests so we no longer have to talk in such generic terms; inevitably I'm leaving out significant pieces of information, because the process has become so second nature to me.