nondeterministic testing

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

I have tested with both types of time control (game/120, game/60, 60+60, etc.), and there is no difference that I can measure in the way the results vary.

If you don't time out in the middle of an iteration, I can see how that will hurt performance, because sometimes you start an iteration you really should not have started. And if you change your mind several times during that iteration, the search blows up badly. And the iteration takes _way_ longer than planned, hurting future searches. If you do interrupt in the middle, even though you try to avoid it by not starting something you can't finish, the variability should show up I would think. Just as I see it.
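As an illustration of the two policies being contrasted here, a minimal sketch in C (hypothetical names, not taken from Crafty or any other engine): the clock can either be polled inside the search so an iteration may be aborted, or consulted only between iterations to decide whether another one may start.
[code]
/* Hypothetical time-management sketch, not any real engine's code.
   Contrast: (a) abort mid-iteration when the clock runs out, versus
   (b) only decide between iterations whether another one may start. */
#include <time.h>

typedef struct {
    clock_t start;          /* search start time              */
    double  budget;         /* seconds allotted for this move */
} TimeControl;

static double elapsed(const TimeControl *tc) {
    return (double)(clock() - tc->start) / CLOCKS_PER_SEC;
}

/* (a) polled from inside the search, e.g. every few thousand nodes */
int out_of_time(const TimeControl *tc) {
    return elapsed(tc) >= tc->budget;
}

/* (b) consulted only at iteration boundaries: start the next iteration
   only if it is likely to finish, using the common heuristic that each
   new ply costs several times as much as the previous one. */
int may_start_next_iteration(const TimeControl *tc, double last_iter_time) {
    const double branching_estimate = 4.0;   /* assumed growth factor */
    return elapsed(tc) + branching_estimate * last_iter_time <= tc->budget;
}
[/code]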
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: nondeterministic testing

Post by Michael Sherwin »

bob wrote:
Michael Sherwin wrote:
bob wrote:You are having far better luck than I am. I am using 40 positions, 2 games each, with 4 standard opponents. A single run has a wild variance when compared with a second run or a third run...
I guess the way that I am testing now can be looked at like a type of genetic evolution. A small change and a hundred-game match to test it. If it gains a new record it lives, and if not, it dies. Every time a new record is gained, some old ideas or values that were close in results to the old record are retried. This style of testing has resulted in a slow but steady increase in results. Just today a new record of 68.5 out of 100 was set by RomiChess against Hamsters 0.2, which is a full point ahead of the old record. Never before in the hundred or so matches played so far has RomiChess scored as much as 68.5 out of 100 points against Hamsters 0.2, so I am rather optimistic that this latest result represents an improvement. But I am not going to try to verify it. I am just going to try to get a new record!
The problem is that your "result" is way more random than you suspect.

Run the _same_ match twice. Post the results of each match. I did this a while back and showed that one match had A better, one had B better, while a long match showed they were nearly equal. 100 games is not quite as bad as flipping a coin to keep/discard a change, but it is not a _lot_ better...

But you have to test this hypothesis to fully appreciate it. Just run the same N positions against the same opponent, with the same time control, twice, and look at the results. It is eye-opening.
Just saw your reply, don't know how I missed it.

Thanks for the suggestion, I will run the test 2 more times just to see how variable it can be and post the results. It will take a couple of days.

I doubt that it will cause me to change my testing strategy though as I try many small changes in quick succession and this quick testing method is getting results for me.

After I get a new record I then cycle back through most of the old ideas/values to see if something was overlooked due to randomness. I only abandon an idea for good if it repeatedly gives worse results. So in the end most ideas/values get their thousands of games, and nothing (I hope) is overlooked along the way. Sometimes I find that old ideas/values that did not work before are good now that something else has been added, removed or changed.
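To put a number on the variance bob mentions above (and on how much a 100-game record can move on its own), here is a rough sketch of the 2-sigma swing of an N-game match between two roughly equal engines. The 30% draw rate is an assumption for illustration, not a measured figure.
[code]
/* Rough sketch: expected score and 2-sigma swing of an N-game match
   between two equal engines.  The draw rate is an assumption. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double p_draw = 0.30;                  /* assumed draw rate          */
    double p_win  = (1.0 - p_draw) / 2.0;  /* equal opponents            */
    double mean   = p_win + 0.5 * p_draw;  /* per-game expectation = 0.5 */
    double var    = (p_win + 0.25 * p_draw) - mean * mean;

    for (int n = 100; n <= 10000; n *= 10) {
        double sd = sqrt(var * n);         /* SD of the match total      */
        printf("%5d games: expected %.1f, 2-sigma swing about +/- %.1f points\n",
               n, mean * n, 2.0 * sd);
    }
    return 0;
}
[/code]
Under those assumptions a 100-game match can swing by roughly 8 points either way with no change to either engine, so a record that beats the old one by a single point is well within the noise.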
Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: nondeterministic testing

Post by Uri Blass »

The Rybka team uses games at a very fast time control (less than one second) to test changes.

They also play Rybka vs. Rybka games.

I wrote some code to enable this kind of testing in movei, but I do not know how to write code that lets me test movei against other programs under these conditions.

Note that I need to clear the hash after every move (otherwise movei may remember wrong information in the pawn hash or eval hash), but it is not a serious problem because I use a very small hash in these very fast games.
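What Uri describes, wiping the per-game caches before every search so that stale entries cannot carry over, might look like the minimal sketch below. The table names and sizes are illustrative, not movei's actual code.
[code]
/* Hypothetical sketch: clear the pawn hash and eval hash before each
   search so no stale information survives from the previous move.
   With small tables the memset is cheap even in sub-second games. */
#include <string.h>

typedef struct { unsigned long long key; int score; } CacheEntry;

#define PAWN_HASH_SIZE (1 << 12)
#define EVAL_HASH_SIZE (1 << 14)

static CacheEntry pawn_hash[PAWN_HASH_SIZE];
static CacheEntry eval_hash[EVAL_HASH_SIZE];

void clear_caches_before_search(void) {
    memset(pawn_hash, 0, sizeof pawn_hash);
    memset(eval_hash, 0, sizeof eval_hash);
}
[/code]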

Uri
hgm
Posts: 27822
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

bob wrote:I have tested with both types of time control (game/120, game/60, 60+60, etc.), and there is no difference that I can measure in the way the results vary.
Of course not. You would expect _enhancement_ of the randomness near the end of the time slice. So in engines where you already have randomness during most of the game, you won't see any difference.

The question was whether, in engines that normally play reproducibly, one might observe an effect.
If you don't time out in the middle of an iteration, I can see how that will hurt performance, because sometimes you start an iteration you really should not have started. And if you change your mind several times during that iteration, the search blows up badly. And the iteration takes _way_ longer than planned, hurting future searches. If you do interrupt in the middle, even though you try to avoid it by not starting something you can't finish, the variability should show up I would think. Just as I see it.
Indeed. But the performance is not hurt that much. uMax occasionally loses on time, especially if it starts searching from a position that was on the board before (so that a search to full nominal depth is already present in the hash, and the first iteration it starts is deeper). But this happens only once every 100 games or so. This means it suffers a loss on time where a draw was likely, i.e. 0.5% less performance = 3.5 Elo. Given uMax's design goal, that makes fixing it counterproductive, as I have not found a fix of fewer than 3 characters. I just minimize the effect by taking a somewhat larger safety margin (which doesn't take any extra characters).
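For reference, the 0.5% = 3.5 Elo step follows from the standard logistic Elo relation between score fraction s and rating difference D:

D = 400 \log_{10}\frac{s}{1-s}, \qquad D(0.495) = 400 \log_{10}\frac{0.495}{0.505} \approx -3.5\ \text{Elo},

i.e. giving up half a percent of the score against an equal opponent costs about 3.5 rating points.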

My suspicion is that the impact of this on performance is extremely small.
nczempin

Re: nondeterministic testing

Post by nczempin »

bob wrote:
nczempin wrote:
bob wrote:
You said that you don't finish an iteration, you cut the search off when time runs out. That alone should make it impossible to choose the same move every time, unless your eval is _very_ simple. Because there should be plenty of cases where you change your mind if you finish an iteration, but if time runs out early, you don't get to finish the new best move search and play the first (worse) move.
You are right in saying that this is undesirable. There are some cases where from one iteration to the next it is found that the current-best-move has a lower eval than it had on the previous, full iteration.

It would be prudent to let the iteration finish in that case.

However, it is not guaranteed that finishing the iteration will find a better move. I have not examined how many times it happens and how often the old first move would have to be changed after all. At the kinds of immature (or weak, as we called them before) levels that Eden is playing, the difference, even if it were significant, is likely to be less significant than most of the "low-hanging fruit" changes I have been able to make so far. So for the moment I'm willing to accept that this area makes Eden weaker than necessary, just like it is with many other things.

Incidentally, you seem to be contradicting yourself; I'm not sure if I understand you correctly: In another post you said that I'm probably doing it the right way, now you're saying it's wrong. What have I misunderstood?
I meant that not synchronizing at iteration boundaries is the "right way". But I am still trying to figure out why you are not seeing much non-deterministic behavior if you are doing it that way (essentially the same way I am doing it).
Over the weekend I've had my subconscious working on this question.

So far it has come up with this explanation:
My statement about non-determinism in Eden testing mostly reflects the current situation, which is that I am only removing inefficiencies, i.e. increasing nps and reducing time-to-depth. Such changes are much more likely to achieve significant results with fewer tests than changes that affect many things at once (e.g. some weight in the eval) would.
I suspect (well, actually I'm pretty sure) that Crafty (or any other "mature" engine) already includes all (or at least a very strong majority of) such changes that are possible (perhaps even by definition of "mature"), while Eden still has a long way to go in this area.
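A sketch of why such speed-only changes need far fewer game tests: a pure speedup can be measured directly as time-to-depth over a fixed position set, with no game noise at all. The function names below are hypothetical stand-ins, not Eden's or Crafty's API.
[code]
/* Hypothetical time-to-depth benchmark: search a fixed set of positions
   to a fixed depth and time the whole run.  A pure efficiency change
   shows up as a deterministic speed ratio between two builds. */
#include <stdio.h>
#include <time.h>

extern void set_position(const char *fen);   /* stand-in for the engine API */
extern void search_to_depth(int depth);

double benchmark(const char **fens, int count, int depth) {
    clock_t start = clock();
    for (int i = 0; i < count; i++) {
        set_position(fens[i]);
        search_to_depth(depth);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
[/code]
Only changes that alter the search tree or the evaluation itself really need game testing; a benchmark like this settles an nps/time-to-depth change in minutes.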

I am pretty sure that if/when I introduce changes that I am not so confident about, I will require far higher standards for a result to be significant.

I.e. if a change involves only a bug fix that, let's say, removes a superfluous printout in the middle of the inner loop, I'd be confident that placing just two places higher in the tournament would be sufficient for a new release.
When I introduce changes that I'm not so sure about, I intuitively ask for a much better relative performance.

There are two variables that you can look at when asking for significance:
a) when the difference is small, you need more results
b) when the difference is large, you can do with fewer.

With Eden, which is almost 1000 points weaker than Crafty, I still have lots of opportunities to find improvements easily.

Perhaps my previous statements have not emphasised this point enough, and therefore perhaps misrepresented my confidence in the theory (which I haven't actually tested, but have only a "feel"/circumstantial evidence for) that Eden is 100% deterministic. Just the fact that GC is running is a strong clue against that theory. But I can definitely argue that there is more determinism at the Eden level than at the Crafty level.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote:
bob wrote:
You said that you don't finish an iteration, you cut the search off when time runs out. That alone should make it impossible to choose the same move every time, unless your eval is _very_ simple. Because there should be plenty of cases where you change your mind if you finish an iteration, but if time runs out early, you don't get to finish the new best move search and play the first (worse) move.
You are right in saying that this is undesirable. There are some cases where from one iteration to the next it is found that the current-best-move has a lower eval than it had on the previous, full iteration.

It would be prudent to let the iteration finish in that case.

However, it is not guaranteed that finishing the iteration will find a better move. I have not examined how many times it happens and how often the old first move would have to be changed after all. At the kinds of immature (or weak, as we called them before) levels that Eden is playing, the difference, even if it were significant, is likely to be less significant than most of the "low-hanging fruit" changes I have been able to make so far. So for the moment I'm willing to accept that this area makes Eden weaker than necessary, just like it is with many other things.

Incidentally, you seem to be contradicting yourself; I'm not sure if I understand you correctly: In another post you said that I'm probably doing it the right way, now you're saying it's wrong. What have I misunderstood?
I meant that not synchronizing at iteration boundaries is the "right way". But I am still trying to figure out why you are not seeing much non-deterministic behavior if you are doing it that way (essentially the same way I am doing it).
Over the weekend I've had my subconscious working on this question.

So far it has come up with this explanation:
My statement about non-determinism in Eden testing mostly reflects the current situation, which is that I am only removing inefficiencies, i.e. increasing nps and reducing time-to-depth. Such changes are much more likely to achieve significant results with fewer tests than changes that affect many things at once (e.g. some weight in the eval) would.
I suspect (well, actually I'm pretty sure) that Crafty (or any other "mature" engine) already includes all (or at least a very strong majority of) such changes that are possible (perhaps even by definition of "mature"), while Eden still has a long way to go in this area.

I am pretty sure that if/when I introduce changes that I am not so confident about, I will require far higher standards for a result to be significant.

I.e. if a change involves only a bug fix that, let's say, removes a superfluous printout in the middle of the inner loop, I'd be confident that placing just two places higher in the tournament would be sufficient for a new release.
When I introduce changes that I'm not so sure about, I intuitively ask for a much better relative performance.

There are two variables that you can look at when asking for significance:
a) when the difference is small, you need more results
b) when the difference is large, you can do with fewer.

With Eden, which is almost 1000 points weaker than Crafty, I still have lots of opportunities to find improvements easily.

Perhaps my previous statements have not emphasised this point enough, and therefore perhaps misrepresented my confidence in the theory (which I haven't actually tested, but have only a "feel"/circumstantial evidence for) that Eden is 100% deterministic. Just the fact that GC is running is a strong clue against that theory. But I can definitely argue that there is more determinism at the Eden level than at the Crafty level.
OK, if you are testing against other programs and they are much stronger than yours, then I agree that the results will be more deterministic, particularly with respect to the game results, although there should be plenty of different moves being played. Against a stronger opponent, different moves will likely just result in a different loss. But against more equal opponents, the non-determinism should be more apparent, because rather than different moves with the same result, you should see more different moves with different results...

if you are seeing the same exact moves, that would be an interesting issue...
nczempin

Re: nondeterministic testing

Post by nczempin »

bob wrote:
OK, if you are testing against other programs and they are much stronger than yours, then I agree that the results will be more deterministic, particularly with respect to the game results, although there should be plenty of different moves being played. Against a stronger opponent, different moves will likely just result in a different loss. But against more equal opponents, the non-determinism should be more apparent, because rather than different moves with the same result, you should see more different moves with different results...

if you are seeing the same exact moves, that would be an interesting issue...
No, quite the opposite. I test against programs that are not much stronger, but roughly the same strength. And my decision to release is based on the tournament results (and not any individual results).

This means there may very well be different results (some of the weak engines at Eden's level have quite substantial books, so that by itself is a factor).

But I'm judging the result of the overall tournament, where my hope is that some of the non-deterministic results even out. And in the end, it seems that my approach has been successful. But, again, I think Eden's situation is not representative.

I may run a test of Eden against another engine which is likely deterministic, and publish the results (the easiest would be Eden against itself, but perhaps that's not so meaningful).

Another interesting experiment would be to find out what would be needed to get Crafty to play the same game against itself over and over again. Which configuration changes would be needed, and which code changes?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

nczempin wrote:
bob wrote:
OK, if you are testing against other programs and they are much stronger than yours, then I agree that the results will be more deterministic, particularly with respect to the game results, although there should be plenty of different moves being played. Against a stronger opponent, different moves will likely just result in a different loss. But against more equal opponents, the non-determinism should be more apparent, because rather than different moves with the same result, you should see more different moves with different results...

if you are seeing the same exact moves, that would be an interesting issue...
No, quite the opposite. I test against programs that are not much stronger, but roughly the same strength. And my decision to release is based on the tournament results (and not any individual results).

This means there may very well be different results (some of the weak engines at Eden's level have quite substantial books, so that by itself is a factor).

But I'm judging the result of the overall tournament, where my hope is that some of the non-deterministic results even out. And in the end, it seems that my approach has been successful. But, again, I think Eden's situation is not representative.

I may run a test of Eden against another engine which is likely deterministic, and publish the results (the easiest would be Eden against itself, but perhaps that's not so meaningful).

Another interesting experiment would be to find out what would be needed to get Crafty to play the same game against itself over and over again. Which configuration changes would be needed, and which code changes?
OK, it's becoming clearer. If you are testing with books, you are not testing _program improvements_ at all. This means that your test results are not a reliable indicator of whether or not your recent changes are good. Because random book lines will produce different results.

Again, if you want to test _program_ improvements, you have to eliminate as many random contributors as possible. Starting with books, any kind of learning, even pondering, parallel search, etc. There will still be a random component left caused by timing differences, and since that can't be eliminated, you just have to play enough games to produce repeatable results.

If you can run the same test twice and get different results, the test is not very useful in deciding "better/worse/same". The closer your opponents are to your program's strength, the _MORE_ games you need to play to produce reliable results. And we are talking thousands, not dozens....
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

Michael Sherwin wrote:
jswaff wrote:
Michael Sherwin wrote:
bob wrote:You are having far better luck than I am. I am using 40 positions, 2 games each, with 4 standard opponents. A single run has a wild variance when compared with a second run or a third run...
I guess the way that I am testing now can be looked at like a type of genetic evolution. A small change and a hundred-game match to test it. If it gains a new record it lives, and if not, it dies. Every time a new record is gained, some old ideas or values that were close in results to the old record are retried. This style of testing has resulted in a slow but steady increase in results. Just today a new record of 68.5 out of 100 was set by RomiChess against Hamsters 0.2, which is a full point ahead of the old record. Never before in the hundred or so matches played so far has RomiChess scored as much as 68.5 out of 100 points against Hamsters 0.2, so I am rather optimistic that this latest result represents an improvement. But I am not going to try to verify it. I am just going to try to get a new record!
First, congrats on your new record. I'm sure Romi is improving.

That said, I don't think your methodology is a good one. If you just keep playing Romi v. Hamsters in a continuous loop, without making any code changes, eventually you'll find a 100 game "slice" where Romi wins by more than 68.5. And, you'll find many where it performs worse than 68.5. What does that tell you?

--
James
Well, since Romi has never scored 68.5 points against Hamsters 0.2 before in the 100+ matches that they have played together, it tells me the changes are worth keeping. In Swaminathan's tournaments Romi has gone from the mid 2300's to almost 2600. Romi's rating has always trended upwards when a new record is set against Hamsters. Which also tells me that I should try to set all the new records that I can! :D
It could also be that you just happened to get one of those "2 standard deviations away" results because of the significant randomness in your testing results that is caused by playing far too few games.

I realize it is difficult to test as I have described, and that compromises must be made. But those compromises have a huge cost in terms of bad decisions that are made based on those results.

There is _NO_ shortcut to reliable testing, except for less reliable / unreliable testing. Hundreds of games are less reliable; fewer than 100 games are completely unreliable.
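To make those numbers concrete, here is a rough sketch (same assumed 30% draw rate as above, with the Elo conversion linearized around a 50% score) of the 2-sigma Elo error bar as the number of games grows.
[code]
/* Rough sketch: 2-sigma Elo error bar versus number of games for two
   near-equal engines.  The 30% draw rate is an assumption; the Elo
   conversion uses the slope of the logistic curve at a 50% score. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double p_draw        = 0.30;
    double var           = (1.0 - p_draw) / 4.0;  /* per-game score variance    */
    double elo_per_score = 1600.0 / log(10.0);    /* ~695 Elo per unit of score */

    for (int n = 100; n <= 10000; n *= 10) {
        double sd_score = sqrt(var / n);          /* SD of the score fraction   */
        printf("%5d games: 2-sigma error about +/- %.0f Elo\n",
               n, 2.0 * elo_per_score * sd_score);
    }
    return 0;
}
[/code]
Under those assumptions 100 games cannot resolve anything much smaller than about 50-60 Elo, a few hundred games get you to roughly 30 Elo, and distinguishing a 10-Elo change really does take thousands of games.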
nczempin

Re: nondeterministic testing

Post by nczempin »

bob wrote:
nczempin wrote:
bob wrote:
OK, if you are testing against other programs and they are much stronger than yours, then I agree that the results will be more deterministic, particularly with respect to the game results, although there should be plenty of different moves being played. Against a stronger opponent, different moves will likely just result in a different loss. But against more equal opponents, the non-determinism should be more apparent, because rather than different moves with the same result, you should see more different moves with different results...

if you are seeing the same exact moves, that would be an interesting issue...
No, quite the opposite. I test against programs that are not much stronger, but roughly the same strength. And my decision to release is based on the tournament results (and not any individual results).

This means there may very well be different results (some of the weak engines at Eden's level have quite substantial books, so that by itself is a factor).

But I'm judging the result of the overall tournament, where my hope is that some of the non-deterministic results even out. And in the end, it seems that my approach has been successful. But, again, I think Eden's situation is not representative.

I may run a test of Eden against another engine which is likely deterministic, and publish the results (the easiest would be Eden against itself, but perhaps that's not so meaningful).

Another interesting experiment would be to find out what would be needed to get Crafty to play the same game against itself over and over again. Which configuration changes would be needed, and which code changes?
OK, it's becoming clearer. If you are testing with books, you are not testing _program improvements_ at all. This means that your test results are not a reliable indicator of whether or not your recent changes are good. Because random book lines will produce different results.

Again, if you want to test _program_ improvements, you have to eliminate as many random contributors as possible. Starting with books, any kind of learning, even pondering, parallel search, etc. There will still be a random component left caused by timing differences, and since that can't be eliminated, you just have to play enough games to produce repeatable results.

If you can run the same test twice and get different results, the test is not very useful in deciding "better/worse/same". The closer your opponents are to your program's strength, the _MORE_ games you need to play to produce reliable results. And we are talking thousands, not dozens....
Hmm. So how do you explain that my method of testing so far has produced increasingly better results in independent testing (for example the UCI engines ligue; they test with a common book, incidentally)? Coincidence? Really? For all the versions from 0.0.6 to 0.0.13?