nondeterministic testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nczempin

Re: nondeterministic testing

Post by nczempin »

bob wrote:
nczempin wrote:
bob wrote:
OK, if you are testing against other programs, and they are much stronger than yours, then I agree that the results will be more deterministic, particularly with respect to the game results, although there should be plenty of different moves being played. Against a stronger opponent, different moves will likely just result in a different loss. But against more equal opponents, the non-determinism should be more apparent: rather than different moves with the same result, you should see different moves leading to different results...

if you are seeing the same exact moves, that would be an interesting issue...
No, quite the opposite. I test against programs that are not much stronger, but roughly the same strength. And my decision to release is based on the tournament results (and not any individual results).

This means there may very well be different results (some of the weak engines at Eden's level have quite substantial books, so that by itself is a factor).

But I'm judging the result of the overall tournament, where my hope is that some of the non-deterministic results even out. And in the end, it seems that my approach has been successful. But, again, I think Eden's situation is not representative.

I may run a test of Eden against another engine which is likely deterministic, and publish the results (the easiest would be Eden against itself, but perhaps that's not so meaningful).

Another interesting experiment would be to find out what would be needed to get Crafty to play the same game against itself over and over again. Which configuration changes would be needed, and which code changes?
OK, it's becoming clearer. If you are testing with books, you are not testing _program improvements_ at all. This means that your test results are not a reliable indicator of whether or not your recent changes are good, because random book lines will produce different results.

Again, if you want to test _program_ improvements, you have to eliminate as many random contributors as possible. Starting with books, any kind of learning, even pondering, parallel search, etc. There will still be a random component left caused by timing differences, and since that can't be eliminated, you just have to play enough games to produce repeatable results.

If you can run the same test twice and get different results, the test is not very useful in deciding "better/worse/same". The closer your opponents are to your program's strength, the _MORE_ games you need to play to produce reliable results. And we are talking thousands, not dozens....
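(For scale on "thousands, not dozens": over n independent games the standard error of the score fraction is sqrt(p(1-p)/n), which near p = 0.5 is about 0.5/sqrt(n). A 100-game match thus pins the score down only to about ±5%, roughly ±35 Elo, while 1000 games give about ±1.6%, roughly ±11 Elo.)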
Why do I sometimes get the feeling that you're reading only half of what I wrote?

Don't get me wrong, I feel flattered that you would even enter this discussion with me, but I feel that I have to repeat myself.

(Edit: I think I should work on my communication if people are not getting what I'm trying to say. Please keep in mind that English is not my native language)

I'm very well aware that using a book which has random factors will dramatically lower the statistical significance of any single game. Indeed, I have criticised (or, let's say, commented on) the UCI engines ligue for using a bundle of strong opponents to get my considerably weaker engine's "initial rating". As a result, it has happened numerous times that a drawish line was entered, and Eden managed to hold on against opposition > 500 points stronger.

At the level Eden is playing, only a small minority of engines use their own book. In fact, those are primarily the ones that introduce increased randomness. I still like to include them, in part because I like to improve my book and find good lines against their extensive books. _Again_, I change my book only _after_ I have determined that my new version is significantly improved, so that I am sure that it's not merely the new book that's better.

Of those factors you've mentioned, books are the only one actually present in my tournaments. Engines at this level don't ponder (or I switch pondering off), they don't have learning and they don't do any parallel search. I would agree that these would significantly add to the randomness. However, you'll have to give me better ones than those to convince me that I have to change my testing methods any time soon. Remember, my engine is currently rated around 1650!
hgm
Posts: 27825
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

Well, seeing is believing (I hope :? ).

First two games I tried between Eden 0.0.13 and micro-Max 1.6:
[Event "Computer chess game"]
[Site "Pentium IV 2.8GHz"]
[Date "2007.09.11"]
[Round "-"]
[White "Eden 0.0.13"]
[Black "umax1_6w"]
[Result "1/2-1/2"]
[TimeControl "40/60"]

1. e4 Nf6 2. e5 Ng8 3. Bb5 c6 4. Be2 d5 5. Nf3 Bf5 6. O-O Qc7 7. Nd4 Qxe5
8. Nxf5 Qxf5 9. Bg4 Qe4 10. d4 Nf6 11. Bc8 a5 12. Nc3 Qh4 13. g3 Qh5 14.
Qxh5 Nxh5 15. g4 Kd8 16. Bf5 g6 17. Bd3 Nf6 18. g5 Nh5 19. Na4 Nd7 20. Be3
Kc7 21. Kh1 f5 22. Rfe1 f4 23. Bd2 Bg7 24. Rxe7 Bxd4 25. f3 Kd6 26. Ree1
Ne5 27. Be2 Nf7 28. c4 Nxg5 29. Rad1 Ne6 30. cxd5 cxd5 31. b4 axb4 32.
Bxb4+ Ke5 33. Bb5+ Be3 34. Nb6 Nf6 35. Nxa8 Rxa8 36. Re2 Rc8 37. Ba3 Ra8
38. Be7 Ng8 39. Bb4 Nh6 40. Rd3 Nf5 41. Ra3 Rxa3 42. Bxa3 Ned4 43. Bb2 Ke6
44. Bxd4 Nxd4 45. Rb2 Nxf3 46. Be2 Nh4 47. Rxb7 h5 48. a4 Bd4 49. Rb3 Be3
50. a5 Ke5 51. a6 Bd4 52. Rh3 Nf5 53. Bd3 Ne7 54. Kg2 Kf6 55. Kf3 Be3 56.
Bb5 Nf5 57. Kg2 Bc5 58. Rc3 Bd4 59. Rc6+ Ke5 60. Rc8 Nd6 61. Rb8 Ba7 62.
Rb7 Nxb7 63. axb7 Kf5 64. Be8 Kf6 65. Kf3 Bb8 66. h4 Bd6 67. Bb5 Be5 68.
Bc6 Ke6 69. Be8 Kf6 70. Bc6 Ke6 71. Be8 Kf6 72. Bc6
{Draw by repetition} 1/2-1/2

[Event "Computer chess game"]
[Site "Pentium IV 2.8GHz"]
[Date "2007.09.11"]
[Round "-"]
[White "Eden 0.0.13"]
[Black "umax1_6w"]
[Result "1/2-1/2"]
[TimeControl "40/60"]

1. e4 Nf6 2. e5 Ng8 3. Bb5 c6 4. Be2 d5 5. Nf3 Bf5 6. O-O Qc7 7. Nd4 Qxe5
8. Nxf5 Qxf5 9. Bg4 Qe4 10. d4 Nf6 11. Bc8 a5 12. Nc3 Qh4 13. g3 Qh5 14.
Qxh5 Nxh5 15. g4 Kd8 16. Bf5 g6 17. Bd3 Nf6 18. g5 Nh5 19. Na4 Nd7 20. Be3
Kc7 21. Kh1 f5 22. Rfe1 f4 23. Bd2 Bg7 24. Rxe7 Bxd4 25. f3 Kd6 26. Ree1
Ne5 27. Be2 Nf7 28. c4 Nxg5 29. Rad1 Ne6 30. cxd5 cxd5 31. b4 axb4 32.
Bxb4+ Ke5 33. Bb5+ Be3 34. Nb6 Nf6 35. Nxa8 Rxa8 36. Re2 Rc8 37. Ba3 Ra8
38. Be7 Ng8 39. Bb4 Nh6 40. Rd3 Nf5 41. Ra3 Rxa3 42. Bxa3 Ned4 43. Bb2 Ke6
44. Bxd4 Nxd4 45. Rb2 Nxf3 46. Be2 Nh4 47. Rxb7 h5 48. a4 Bd4 49. Rb3 Be3
50. a5 Ke5 51. a6 Bd4 52. Rh3 Nf5 53. Bd3 Ne7 54. Kg2 Kf6 55. Kf3 Be3 56.
Bb5 Nf5 57. Kg2 Bc5 58. Rc3 Bd4 59. Rc6+ Ke5 60. Rc8 Nd6 61. Rb8 Ba7 62.
Rb7 Nxb7 63. axb7 Kf5 64. Be8 Kf6 65. Kf3 Bb8 66. h4 Bd6 67. Bb5 Be5 68.
Bc6 Ke6 69. Be8 Kf6 70. Bc6 Ke6 71. Be8 Kf6 72. Bc6
{Draw by repetition} 1/2-1/2
They are move-for-move identical, not just in result, even at this rather short time control. (I was careful not to touch the machine during the games. The games were played under Winboard; Eden was playing through Polyglot 1.4.) No cheating of any kind, and I did not select games: just load the engines under Winboard, play 'two machines', reset, play again. First time bull's eye.

To avoid such duplicates, I normally test from the Nunn positions. Then at least I can have 20 games between Eden and uMax and be sure that they are different.

(uMax 1.6 is the smallest (1433-character) version of micro-Max, and does without almost everything; in particular, it has no hash table. No member of the uMax family has killer moves or history, or indeed any move ordering at all, other than that the best move of the previous iteration goes first.)
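For illustration, a minimal sketch (hypothetical names, not uMax's actual code) of what "the best move of the previous iteration goes first" amounts to:

    /* Move the previous iteration's best move to the front of the
       root move list; everything else stays in generation order. */
    typedef int Move;

    void put_best_first(Move *moves, int n, Move prev_best)
    {
        for (int i = 1; i < n; i++) {
            if (moves[i] == prev_best) {   /* found last iteration's choice */
                Move tmp = moves[0];
                moves[0] = moves[i];       /* swap it to position 0 */
                moves[i] = tmp;
                break;
            }
        }
    }

Each pass of the iterative-deepening loop would call this once on the root move list before searching one ply deeper.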
hgm
Posts: 27825
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

Note that it also can be different:
[Event "Computer chess game"]
[Site "Pentium IV 2.8GHz"]
[Date "2007.09.11"]
[Round "-"]
[White "Eden 0.0.13"]
[Black "umax1_6w"]
[Result "1/2-1/2"]
[TimeControl "40/300"]

1. e4 Nf6 2. e5 Nd5 3. c4 Nb6 4. d4 d5 5. c5 N6d7 6. Bd3 Nc6 7. Ne2 e6 8.
Bb5 Be7 9. O-O O-O 10. Bf4 f6 11. exf6 Nxf6 12. Nbc3 Bd7 13. Rb1 a6 14. Ba4
Nh5 15. Be3 Kf7 16. g4 Nf6 17. g5 Ng4 18. Qb3 Bxg5 19. Qxb7 Ra7 20. Bxg5
Qxg5 21. Qb3 Ne3+ 22. Ng3 Nxf1 23. Rxf1 Nxd4 24. Qd1 Bxa4 25. Qxd4 Bb5 26.
Re1 Qf6 27. Qg4 Bd3 28. c6 Rb8 29. Nxd5 exd5 30. Qd7+ Kf8 31. Qxd5 Qd6 32.
Qf3+ Kg8 33. Rd1 Rd8 34. Nf5 Qg6+ 35. Kh1 Bxf5 36. Qb3+ Kf8 37. Rxd8+ Ke7
38. Rg8 Qxc6+ 39. Kg1 Qg6+ 40. Kh1 Bd3 41. Qb4+ Ke6 42. Qb3+ Ke7 43. Qb4+
Ke6 44. Qb3+ Ke7
{Draw by repetition} 1/2-1/2

[Event "Computer chess game"]
[Site "Pentium IV 2.8GHz"]
[Date "2007.09.11"]
[Round "-"]
[White "Eden 0.0.13"]
[Black "umax1_6w"]
[Result "0-1"]
[TimeControl "40/300"]

1. e4 Nf6 2. e5 Nd5 3. c4 Nb6 4. d4 d5 5. c5 N6d7 6. Bd3 Nc6 7. Ne2 e6 8.
Bb5 Be7 9. O-O O-O 10. Bf4 f6 11. exf6 Nxf6 12. Nbc3 Bd7 13. Rb1 a6 14. Ba4
Nh5 15. Be3 Kf7 16. g4 Nf6 17. g5 Ng4 18. Qb3 Bxg5 19. Qxb7 Ra7 20. Bxg5
Qxg5 21. Qb3 Ne3+ 22. Ng3 Nxf1 23. Rxf1 Nxd4 24. Qd1 Bxa4 25. Qxd4 Bb5 26.
Re1 Qf6 27. Qg4 Bd3 28. c6 Bb5 29. Nxb5 axb5 30. Qb4 Rxa2 31. Re2 Qf3 32.
Rxe6 Kxe6 33. Qd4 Ra1+ 34. Nf1 Qh3 35. Qe3+ Qxe3 36. fxe3 Raxf1+ 37. Kg2
R8f3 38. b4 Kd6 39. h4 Kxc6 40. h5 R1f2+ 41. Kg1 Rb2 42. h6 gxh6 43. e4
dxe4 44. Kh1 Rf1#
{Black mates} 0-1
In this next pair of games (played at 40/5 min) it is finally micro-Max that deviates, on move 28, to convert its advantage into a win. Note that uMax has no book, and has to start thinking from move 1. (In the native versions of uMax this could not happen, as they play by a fixed number of nodes, but for the Winboard version I made it read the clock to implement the time command, as many testers get upset by engines that lose on time too often.)
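The two stop conditions hgm contrasts look roughly like this (a sketch with hypothetical names, not uMax's actual code):

    #include <time.h>

    /* Node-count budget: the same decision on every run, hence fully
       deterministic play (how the native uMax versions operate). */
    int keep_searching_nodes(long nodes_done, long node_budget)
    {
        return nodes_done < node_budget;
    }

    /* Clock-based budget: whether one more iteration fits depends on
       machine load, cache state, etc., so it can vary between runs.
       (clock() measures CPU time on POSIX, wall time on Windows.) */
    int keep_searching_clock(clock_t start, double seconds_allotted)
    {
        return (double)(clock() - start) / CLOCKS_PER_SEC < seconds_allotted;
    }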

So variability does occur (as expected), just not frequently enough to prevent many copies of the same game in a longer match. When I play them a third time at 40/5', they actually repeat the second game, because uMax again decides on 28. ..., Bb5 (smart, because that is the one it won! :lol: ).

But the point is that even when there is some limited variability after move 28, the game is already decided by then (uMax leading by the exchange plus a pawn). So playing many games from this opening position would unjustly make you think uMax almost always beats Eden, while there are other positions where it is just the reverse.
Last edited by hgm on Tue Sep 11, 2007 12:39 pm, edited 1 time in total.
Uri Blass
Posts: 10320
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: nondeterministic testing

Post by Uri Blass »

I think that at the playing strength of your program you need almost no games to know that there is an improvement.

The only reason you may need games at all is to check that there is no significant new bug.

Note that Bob also does not use pondering in his games, so pondering is not relevant here; his engine is nevertheless less deterministic, and that has nothing to do with playing strength.

Movei is not weaker than Crafty, but unlike Crafty it has a deterministic search, meaning it does not remember information from previous searches. (It does something equivalent to clearing the hash: it increases a generation number in the hash and ignores entries from previous searches; the only difference is that it does not waste time on actually clearing the hash.)

I know that remembering information from previous searches can be productive for playing strength, but I think it is not very important at Movei's modest level (only 108 Elo above Crafty 20.14 32-bit on the CCRL list). I prefer reproducible behaviour, because it is more productive for future development: it helps in detecting bugs, and I do not like a situation where Movei blunders and I cannot reproduce the blunder.
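A minimal sketch of the generation-number scheme Uri describes (hypothetical names, not Movei's actual code):

    #include <stddef.h>

    typedef struct {
        unsigned long long key;   /* full position key */
        int generation;           /* which search stored this entry */
        int depth, score;
    } TTEntry;

    static int current_generation;

    /* Starting a new search "clears" the table in O(1): */
    void begin_search(void) { current_generation++; }

    /* A probe only trusts entries stamped with the current generation: */
    TTEntry *probe(TTEntry *table, unsigned long long mask,
                   unsigned long long key)
    {
        TTEntry *e = &table[key & mask];
        if (e->key == key && e->generation == current_generation)
            return e;
        return NULL;   /* stale or empty: treated as a miss */
    }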

Uri
hgm
Posts: 27825
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

Uri Blass wrote:... (It does something equivalent to clearing the hash: it increases a generation number in the hash and ignores entries from previous searches; the only difference is that it does not waste time on actually clearing the hash.)
Perhaps a bit off topic: I used an elegant solution for exactly that in uMax 4.0. Instead of clearing the hash (which uMax 3.2 did), I just incremented the hash key in the root by 1 before every search. This means it can no longer recognize any of the old entries, as their hash keys are all off by 1. (If you probe 4 consecutive entries and do not store the full key, you would of course have to increment by at least 4.) This solves the problem with a single instruction per search: no extra space in the hash entry required to store a generation number, no overhead to update it, and (relevant for uMax) only two '+' characters required.
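In code the whole trick is something like this (a sketch with hypothetical names; uMax itself is far more compressed, and updates its keys additively, which is why the offset propagates to every node of the tree):

    unsigned long long root_key;   /* hash key of the root position */

    void begin_search(void)
    {
        root_key += 1;  /* every key derived from it is now off by 1, so
                           no entry stored by a previous search can match;
                           use += 4 or more when probing 4 consecutive
                           slots without storing the full key */
    }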
nczempin

Re: nondeterministic testing

Post by nczempin »

Hmm. I just did a test that I was going to use as a baseline, a 10-game match between Eden 0.0.13 and Philemon C. I was sure that it would show the 5 games with each colour to be identical, move by move.

Here are the ply counts:
Eden - Philemon: 140, 86, 119, 66, 78
Philemon - Eden: 97, 93, 94, 91, 94.

So I guess I stand corrected; I couldn't think of a better engine to substantiate my theory.

The result was 9.5-0.5 for Eden, BTW.

The fact remains that so far my judgements of when a version is strong enough have been pretty good. It may have to do with what Uri said, or with other factors, such as my intuitively weighing what should be straightforward improvements against changes that I can't prove to actually increase the strength.
hgm
Posts: 27825
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

Well, it would be interesting to see at which move they first deviate, and whether the game was already decided by then. (At what time control were you playing?)

As you can see from my tests, Eden is quite steadfast in its moves, and it was uMax that caused the variation.

But the problem with this kind of variability (i.e. deciding whether or not to make one more iteration) is that, even if you can calculate that there is a 5% probability per move that this will happen, in practice, when you play from a fixed initial position, it will always be the _same_ 95% of the moves where the iteration finishes so far from the timeout that they cannot be affected. And on the 5% of the moves that can be affected, there are only 2 choices: it either starts the extra iteration (resulting in another move) or it doesn't (sticking to the old move). Which moves are subject to this 1-out-of-2 variability is in itself fully deterministic.

So even with 5% of the moves deviating, if games last on average 100 moves, you would have only ~5 decision points, and could play only 32 different games. This means that in a match of 10 games you might imagine there is no problem, but if you play 1000 games, every game would appear about 31 times. This makes playing thousands of games totally useless. The exact probability that it chooses one of the 32 possible continuations over another is of little interest.
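A toy simulation of this argument (assuming, purely for illustration, an even coin flip at each of the 5 decision points, which hgm notes is not the interesting part):

    #include <stdio.h>
    #include <stdlib.h>

    #define GAMES     1000
    #define DECISIONS 5               /* ~5% of ~100 moves are timing-sensitive */

    int main(void)
    {
        int copies[1 << DECISIONS] = {0};   /* 2^5 = 32 possible games */
        int distinct = 0;

        srand(1);                           /* fixed seed: reproducible demo */
        for (int g = 0; g < GAMES; g++) {
            int game = rand() & ((1 << DECISIONS) - 1);   /* 5 coin flips */
            if (copies[game]++ == 0)
                distinct++;
        }
        printf("%d games, %d distinct, ~%.0f copies of each\n",
               GAMES, distinct, (double)GAMES / distinct);
        return 0;
    }

A typical run prints "1000 games, 32 distinct, ~31 copies of each": playing more games only piles up more duplicates, not more information.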

You have 2 games of 94 ply. Are these move-for-move the same?
Uri Blass
Posts: 10320
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: nondeterministic testing

Post by Uri Blass »

hgm wrote:
Uri Blass wrote:... (It does something equivalent to clearing the hash: it increases a generation number in the hash and ignores entries from previous searches; the only difference is that it does not waste time on actually clearing the hash.)
Perhaps a bit off topic: I used an elegant solution for exactly that in uMax 4.0. Instead of clearing the hash (which uMax 3.2 did), I just incremented the hash key in the root by 1 before every search. This means it can no longer recognize any of the old entries, as their hash keys are all off by 1. (If you probe 4 consecutive entries and do not store the full key, you would of course have to increment by at least 4.) This solves the problem with a single instruction per search: no extra space in the hash entry required to store a generation number, no overhead to update it, and (relevant for uMax) only two '+' characters required.

I do not understand how it solves the problem. It seems to me that it can only cause problems.

Problem 1: you will not be able to store new positions in the hash (because the slots are still occupied by information from previous searches).

This is probably solved by a replace-always strategy.

Problem 2: you make wrong pruning decisions or wrong move ordering based on information from earlier searches (in other words, a hash collision with information from a previous search).

I see no solution to that problem without clearing the hash or increasing the generation number.

You can claim that problem 2 does not happen often, but if collisions between different searches are too rare to be important, you could also clear the hash and use fewer bits for the hash key, so I do not see the saving in bits.

I only see possible non-deterministic behaviour, because you may get different search output in case of a hash collision with an earlier search, so I dislike your solution.

Uri
hgm
Posts: 27825
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

OK, indeed uMax uses a replace-always scheme.

The other probability I would consider small enough to simply ignore. Storing a search ID does not guard you against key collisions within the same search anyway, so you will need a certain minimum number of key bits to keep that number acceptable. Even with only 32 bits of the key stored, if you make one probe examining 4 entries at 1M nps, you would only have one accidental match every ~1000 seconds. So most searches would not have a collision at all. Collisions with entries from other searches are even rarer if your searches last longer than a few seconds, as after the first few seconds almost all entries from the previous search will have been overwritten.
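(The arithmetic: 1M probes/sec × 4 entries per probe = 4·10^6 key comparisons per second, each a false match with probability 2^-32 ≈ 2.3·10^-10, giving ~9·10^-4 accidental matches per second, i.e. one per ~1000 seconds.)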

Furthermore, if you ever saw strange behavior that called for debugging and found that it did not reproduce, you could simply ascribe it to a key collision with an entry from an old search. Then there would be no reason to debug at all, as this is a calculated risk of the hashing strategy. It would only be a nuisance if it frequently happened in combination with another error that you did want to debug.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

Are you playing without any opening book? Otherwise you can land in a lost or won position before the engine itself has to think. That is then not a test of the engine, but more a test of the book.

That's why many of us factor the book out when evaluating changes, and play matches from known starting positions that are pretty equal, and then we play both sides to be sure there is no advantage.
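That scheme, in sketch form (hypothetical position labels; any balanced, book-free set such as the Nunn positions would do):

    #include <stdio.h>

    int main(void)
    {
        /* stand-ins for a set of balanced starting positions */
        const char *positions[] = { "pos01", "pos02", "pos03" };
        int n = sizeof positions / sizeof positions[0];

        for (int i = 0; i < n; i++) {
            /* play each position twice with colors reversed, so an
               unbalanced position cannot favor either engine overall */
            printf("game %d: A (White) vs B from %s\n", 2 * i + 1, positions[i]);
            printf("game %d: B (White) vs A from %s\n", 2 * i + 2, positions[i]);
        }
        return 0;
    }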