nondeterministic testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

Rein Halbersma wrote:
bob wrote:
Now, let me explain what I do and why.

[snip]

I've been using this for at least 10+ years in Crafty, and for at least 15 years prior to that in Cray Blitz. I have tweaked things from time to time, but I have never looked at a game and thought "dang, if I had just searched another 2 seconds, I would have found the move that would have won, or that would have avoided losing..."

The only thing I don't do in Crafty, that I did do in Cray Blitz, is to try to figure out what to do with the "extra" time I accumulate. If you are supposed to average 180 seconds per move, and when you get to move 20 you have 90 minutes left, you are 30 minutes ahead on the clock. Do you search all remaining moves a little longer, do you save that extra for those annoying fail-low cases, or do you save part of it to carry over to the next time control or use it all in the current one?
Do you also use the trick used by the Chinook team where they restart the entire search (i.e. starting the iterative deepening again with d=1) after a fail-low at the root? They claim that it will find a better move within a couple of seconds that way because of the information present in the transposition table.
No. I tried it on several different occasions, but it never worked for me... John Stanback originally proposed that idea (to the best of my knowledge) years ago on r.g.c.c (before Chinook, so far as I know). But I have never gotten it to be better than what I do now...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

hgm wrote:
bob wrote:Now, let me explain what I do and why.
Yes, that sounds like a very sensible way to do things. It is in fact very close to what I do in Joker:

I also have an absolute time limit in case of emergencies, to interrupt cases where a single move would already cause a loss on time. (This sometimes happens with KeepHash, if you are in a repeat position, and a very deep iteration is already satisfied in zero time from the hash, so that it embarks on a way-too-ambitious iteration. And if you are unlucky, this iteration might dramatically fail low, in a position that is still won.) I never use more than 1/3 of the remaining time for this.

In all other cases I only stop the search after finishing a root move. I have two time-out intervals for that, the first set to half the nominal time, the other to 1.5 times the nominal time. If the first time-out interval has expired, I don't start a new iteration. This means that in a stable search an iteration, once started, will almost always have time to nearly complete before the second time-out occurs. (Well, my branching factor is not really 3 yet, so in an unlucky case I might miss a very late move.)

After the second time-out, I don't start the search of a new root move in the current iteration if the score of the best move I have so far is not more than 20 centipawns below the score of the previous iteration. Otherwise the search continues until such a move is found, or the emergency time-out occurs. I set that to 3x the nominal time (so I can only overrun if the next iteration suddenly takes more than 6 times longer than the previous one).

This might be a little short, and I don't have the refinement of allotting still more time on a material loss. Doing so might indeed bring a measurable improvement, as when such a time-out occurs, it now often plays a move that immediately loses the game. But apart from this, the scheme is nearly identical to yours.

I have added one refinement you don't seem to have, though: if the time left falls below 80% of what the opponent has left, all time intervals are shortened by 20%. This is to prevent a situation from developing where the opponent can outsearch us by a large factor in the last few moves before the time control.
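For concreteness, the decision logic described above can be sketched like this. This is a hypothetical reconstruction from the post, not Joker's actual code; the 0.5x/1.5x/3x fractions and the 20-centipawn margin are the numbers given above, everything else is illustrative:

```python
# Sketch of the Joker-style time-out scheme described above. All decisions
# are made at the root, between root moves and between iterations.
NOMINAL = 1.0            # nominal time per move (arbitrary units)
SOFT = 0.5 * NOMINAL     # first time-out: don't start a new iteration
HARD = 1.5 * NOMINAL     # second time-out: only continue on a score drop
EMERGENCY = 3.0 * NOMINAL
MARGIN = 20              # centipawns below the previous iteration's score

def may_start_iteration(elapsed):
    """A new iteration is started only before the first time-out."""
    return elapsed < SOFT

def may_start_root_move(elapsed, best_so_far, prev_score):
    """Whether to search the next root move of the current iteration."""
    if elapsed < HARD:
        return True
    # After the second time-out, keep searching only while the best move
    # found so far scores more than MARGIN below the previous iteration...
    if best_so_far < prev_score - MARGIN:
        return elapsed < EMERGENCY   # ...and the emergency abort is not hit
    return False
```

So a stable search stops cleanly around the second time-out, and only a fail-low at the root can stretch the move out to the emergency limit.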
However, there is one potential form for timing randomness you have not recognized.
I can't understand why you say that, as this is what I have been saying from the very beginning! :shock:

But I also pointed out that in a scheme that finishes every iteration, or even in your scheme, one is really quite insensitive to this timing, especially at longer time controls. You have only a handful of decision points, all at the root, namely the end of an iteration, and after moves within an iteration once the score exceeds S-delta. These decision points are spaced by seconds. A time-out falling within the millisecond (or so) timing uncertainty you can expect on a quiet machine would cause a difference only once every 1000 moves or so. (Once it happens, the game diverges, of course. Even if the move actually chosen was the same, all tables might now be different, with future consequences.)

So as long as the emergency abort never occurs, schemes like yours lead to highly deterministic play. And my guess is that setting the parameters such that the emergency abort is very unlikely will improve playing strength, as the emergency abort is very detrimental, and a lot of time is wasted when it occurs. In Joker I even try to avoid the second type of time-out, by being rather conservative when to start a new iteration.

But the $64-question is still unanswered: how many Elo points does your engine gain by using this elaborate scheme compared to simply always finishing the iteration?
I have never run that test. This timing algorithm was developed from years of playing in human and computer tournaments, and watching what goes on. I remember when I first came up with the idea of extending on a fail low, at one of the ACM events where chess 4.x was playing and Atkin was bouncing in his chair wondering if his program would change its mind from a bad move to a better move. Back then we all just printed the new score and best move after an iteration completed, not in the middle.

I immediately modified my program to report PV changes at the root whenever they happened, and then the idea of extending on a drop in score was an obvious addition. So this came about from watching others wonder "can we find it" and turning that into "If I can see it is bad, I will find something better if it exists..."

So all of the above ideas were developed to address specific bad behaviors I saw in live games, rather than being developed experimentally and tested with matches...

But let's talk about non-determinism. I can run the same position on Crafty, over and over, and it will find the same move within the same exact number of nodes each time. But time varies a bit. You are thinking of a single millisecond. You are off by at least an order of magnitude; it is really more like 10-100ms:

log.001: time=6.47 mat=0 n=12083521 fh=94% nps=1.9M
log.002: time=6.49 mat=0 n=12083521 fh=94% nps=1.9M
log.003: time=6.47 mat=0 n=12083521 fh=94% nps=1.9M
log.004: time=6.40 mat=0 n=12083521 fh=94% nps=1.9M

Or, for longer searches:
log.001: time=18.13 mat=0 n=36425309 fh=93% nps=2.0M
log.002: time=18.00 mat=0 n=36425309 fh=93% nps=2.0M
log.003: time=18.11 mat=0 n=36425309 fh=93% nps=2.0M
log.004: time=17.95 mat=0 n=36425309 fh=93% nps=2.0M

That will introduce plenty of non-determinism whether you interrupt an iteration in progress, or wait until the end. And I don't see how you are not seeing that problem when you test. I find it _very_ difficult to produce the same exact game in 100 tries from the starting position, choosing two opponents from the set {Crafty, Fruit2, Glaurung1, Glaurung2, Arasan9, Arasan10, GnuchessX}.

I won't claim to have tried every possibility, but I initially thought all the non-deterministic behavior I was seeing must be an artifact of Crafty, so I tested without Crafty involved and saw the same problem with the others mentioned.

Hence my original claim that a hundred games is meaningless when trying to measure an improvement, unless it is a very big one.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

nczempin wrote:
bob wrote:
You said that you don't finish an iteration, you cut the search off when time runs out. That alone should make it impossible to choose the same move every time, unless your eval is _very_ simple. Because there should be plenty of cases where you change your mind if you finish an iteration, but if time runs out early, you don't get to finish the new best move search and play the first (worse) move.
You are right in saying that this is undesirable. There are some cases where from one iteration to the next it is found that the current-best-move has a lower eval than it had on the previous, full iteration.

It would be prudent to let the iteration finish in that case.

However, it is not guaranteed that finishing the iteration will find a better move. I have not examined how many times it happens, nor how often the old first move would have had to be changed after all. At the kinds of immature (or weak, as we called them before) levels that Eden is playing at, the difference, even if measurable, is likely to be smaller than most of the "low-hanging fruit" changes I have been able to make so far. So for the moment I'm willing to accept that this area makes Eden weaker than necessary, just as with many other things.

Incidentally, you seem to be contradicting yourself; I'm not sure if I understand you correctly: In another post you said that I'm probably doing it the right way, now you're saying it's wrong. What have I misunderstood?
I meant that not synchronizing at iteration boundaries is the "right way". But I am still trying to figure out why you are not seeing much non-deterministic behavior if you are doing it that way (essentially the same way I am doing it).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

MartinBryant wrote:
Alessandro Scotti wrote:
MartinBryant wrote:Also I test each change individually from some baseline version, not incrementally building up the changes on top of each other.
Hi Martin, can you tell me more about this? I'm currently always testing from the current best version, but I do that only because I like to see the program do some progress, no science behind! :-)

P.S. And the latest Hamsters version would be doing great in the current test if it wasn't being excessively spanked by Colossus! :wink:
I'm not sure there's a great deal of science behind it, it just seems more sensible to test each change in isolation first, before testing them in combination (easier to draw conclusions perhaps?).

For instance, I've currently tested 21 individual changes on top of the baseline Colossus 2007c, 9 of which have proved successful. Eventually these 9 will (probably) be put together and re-tested to become 2007d.

If you incrementally built them up, you might perhaps miss something good (or adopt something bad), as normal non-deterministic fluctuation in the scores because of the earlier changes would mask the real effect of the latest change that you're really interested in.

BTW, I think everybody's been very impressed with the recent improvements you've made in Hamsters! Well done!!
I think that you need to test A (the original) against A' (the original plus one change). Then the next cycle is A' against A''. If time were not an issue, it would maybe be even better to test A, then A+1 (a plus 1st change), then A+2 (A plus second change) and then finally A+1+2 (A + both changes). But that quickly becomes impossibly long in terms of time required. I simply do A, then A', then A'', then A''' and so forth, and any time something looks worse, I stop and figure out why.
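A quick back-of-the-envelope on why testing every combination "quickly becomes impossibly long" (my own arithmetic, not from the post): k independent changes have 2^k possible combinations, while the incremental A, A', A'' chain only ever tests k+1 versions:

```python
# Number of engine versions to test: full factorial vs. the incremental
# A -> A' -> A'' chain described above.
def versions_factorial(k):
    return 2 ** k        # one version for every subset of the k changes

def versions_incremental(k):
    return k + 1         # the baseline plus one version per change

for k in (2, 5, 10):
    print(k, versions_factorial(k), versions_incremental(k))
# k=10 already means 1024 versions to test exhaustively, versus 11.
```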
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

Michael Sherwin wrote:
bob wrote:You are having far better luck than I am. I am using 40 positions, 2 games each, with 4 standard opponents. A single run has a wild variance when compared with a second run or a third run...
I guess the way that I am testing now can be looked at like a type of genetic evolution. A small change and a hundred game match to test it. If it gains a new record it lives and if not, it dies. Every time a new record is gained some old ideas or values that were close in results to the old record are retried. This style of testing has resulted in a slow but steady increase in results. Just today a new record was set by RomiChess against Hamsters 0.2 of 68.5 of 100 which is a full point ahead of the old record. Never before in a hundred or so matches played so far has RomiChess scored as much as 68.5 of 100 points against Hamsters 0.2, so I am rather optimistic that this latest result represents an improvement. But, I am not going to try to verify it. I am just going to try to get a new record!
The problem is that your "result" is way more random than you suspect.

Run the _same_ match twice. Post the results of each match. I did this a while back and showed that one match had A better, one had B better, while a long match showed they were nearly equal. 100 games is not quite as bad as flipping a coin to keep/discard a change, but it is not a _lot_ better...

But you have to test this hypothesis to fully appreciate it. Just run the same N positions against the same opponent, with the same time control, twice, and look at the results. It is eye-opening.
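The point is easy to demonstrate without any engines at all. In this hypothetical simulation, two engines of exactly equal strength (made-up per-game probabilities: 40% win, 20% draw) play 100-game matches; the run-to-run spread dwarfs the small improvements one usually tries to measure:

```python
import random
import statistics

def match_score(rng, games=100, p_win=0.4, p_draw=0.2):
    """Total score of one match between two truly equal engines."""
    score = 0.0
    for _ in range(games):
        r = rng.random()
        if r < p_win:
            score += 1.0      # win
        elif r < p_win + p_draw:
            score += 0.5      # draw
    return score

rng = random.Random(2007)
runs = [match_score(rng) for _ in range(1000)]
print(f"mean={statistics.mean(runs):.1f} stdev={statistics.pstdev(runs):.1f}")
# Theoretical per-match standard deviation: sqrt(100 * 0.2) ~ 4.5 points,
# so two runs of the very same 100-game match routinely differ by 5+ points.
```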
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: nondeterministic testing

Post by Michael Sherwin »

jswaff wrote:
Michael Sherwin wrote:
bob wrote:You are having far better luck than I am. I am using 40 positions, 2 games each, with 4 standard opponents. A single run has a wild variance when compared with a second run or a third run...
I guess the way that I am testing now can be looked at like a type of genetic evolution. A small change and a hundred game match to test it. If it gains a new record it lives and if not, it dies. Every time a new record is gained some old ideas or values that were close in results to the old record are retried. This style of testing has resulted in a slow but steady increase in results. Just today a new record was set by RomiChess against Hamsters 0.2 of 68.5 of 100 which is a full point ahead of the old record. Never before in a hundred or so matches played so far has RomiChess scored as much as 68.5 of 100 points against Hamsters 0.2, so I am rather optomistic that this latest result reppresents an improvement. But, I am not going to try to verify it. I am just going to try to get a new record!
First, congrats on your new record. I'm sure Romi is improving.

That said, I don't think your methodology is a good one. If you just keep playing Romi v. Hamsters in a continuous loop, without making any code changes, eventually you'll find a 100 game "slice" where Romi wins by more than 68.5. And, you'll find many where it performs worse than 68.5. What does that tell you?

--
James
Well, since Romi has never scored 68.5 points against Hamsters 0.2 before in the 100+ matches that they have played together, it tells me the changes are worth keeping. In Swaminathan's tournaments Romi has gone from the mid 2300's to almost 2600. Romi's rating has always trended upwards when a new record is set against Hamsters. Which also tells me that I should try to set all the new records that I can! :D
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

bob wrote:log.001: time=6.47 mat=0 n=12083521 fh=94% nps=1.9M
log.002: time=6.49 mat=0 n=12083521 fh=94% nps=1.9M
log.003: time=6.47 mat=0 n=12083521 fh=94% nps=1.9M
log.004: time=6.40 mat=0 n=12083521 fh=94% nps=1.9M

Or, for longer searches:
log.001: time=18.13 mat=0 n=36425309 fh=93% nps=2.0M
log.002: time=18.00 mat=0 n=36425309 fh=93% nps=2.0M
log.003: time=18.11 mat=0 n=36425309 fh=93% nps=2.0M
log.004: time=17.95 mat=0 n=36425309 fh=93% nps=2.0M
Well, that is indeed a bit more than I see on my computer (except on Friday nights, when it starts running the virus check... :wink: ). I guess it depends a little on what else you are running on the machine.

You seem to have a timing variability of about 1%. That means about once every 100 moves your iteration would end in the critical time interval around the nominal time-out. The probability that for two tries the actual time-out would be on different sides of the iteration end is 1 in 3. So once every 300 moves you would do an extra iteration. That iteration _might_ then give you another move. If it doesn't, and the next move is an obvious one, the introduced differences in hash and other tables might be erased before they affect the move choice.

As most games last fewer than 300 moves, most games would actually play out the same with the timing variability you show, in engines where the only decision point is at the end of an iteration.
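The 1-in-3 figure can be checked with a tiny Monte Carlo (mine, not from the post): if the time-out and the two realizations of the iteration's end all land independently and uniformly inside the same critical window, the two runs end up on opposite sides of the time-out exactly when the time-out is the middle value of the three, which for i.i.d. draws happens with probability 1/3:

```python
import random

rng = random.Random(42)
n = 200_000
# t = the time-out; a, b = the iteration's end time in two independent runs.
# The runs diverge when t falls strictly between a and b.
hits = 0
for _ in range(n):
    a, b, t = rng.random(), rng.random(), rng.random()
    if min(a, b) < t < max(a, b):
        hits += 1
print(hits / n)   # close to 1/3
```

Multiplying by the 1-in-100 chance of landing in the critical window at all reproduces the once-per-300-moves estimate above.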
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

hgm wrote:
bob wrote:log.001: time=6.47 mat=0 n=12083521 fh=94% nps=1.9M
log.002: time=6.49 mat=0 n=12083521 fh=94% nps=1.9M
log.003: time=6.47 mat=0 n=12083521 fh=94% nps=1.9M
log.004: time=6.40 mat=0 n=12083521 fh=94% nps=1.9M

Or, for longer searches:
log.001: time=18.13 mat=0 n=36425309 fh=93% nps=2.0M
log.002: time=18.00 mat=0 n=36425309 fh=93% nps=2.0M
log.003: time=18.11 mat=0 n=36425309 fh=93% nps=2.0M
log.004: time=17.95 mat=0 n=36425309 fh=93% nps=2.0M
Well, that is indeed a bit more than I see on my computer (except on Friday nights, when it starts running the virus check... :wink: ). I guess it depends a little on what else you are running on the machine.

You seem to have a timing variability of about 1%. That means about once every 100 moves your iteration would end in the critical time interval around the nominal time-out. The probability that for two tries the actual time-out would be on different sides of the iteration end is 1 in 3. So once every 300 moves you would do an extra iteration. That iteration _might_ then give you another move. If it doesn't, and the next move is an obvious one, the introduced differences in hash and other tables might be erased before they affect the move choice.

As most games last fewer than 300 moves, most games would actually play out the same with the timing variability you show, in engines where the only decision point is at the end of an iteration.
That was run on my suse 10.2 linux box. I just booted windows (32 bit unfortunately) and ran a windows executable and had even greater variability there. I am running no other applications of any kind, other than the normal linux daemons that are not really doing anything. I had no networking up, no nothing...

There are other issues. Since I interrupt in the middle of an iteration, at a time that has some standard error of measurement, every time I run the search takes a little longer or a little less time than the last time. And that influences what is stored in the hash table to start off the next search, which then influences the results of that search. Note all my testing is single-cpu, ponder=off, no book, no tablebases, just raw computing, to eliminate as many variables as I can. But I can't eliminate that timing variability.

Don't forget, timing changes are cumulative. If you save 50ms this search, the next search, and the one after that, pretty soon you begin to have a different target time. Cumulative quantization errors in the time measurements can probably add up quicker than your above estimation allows...
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

Well, if I understood your scheme right, most of the interruptions during an iteration are still at well-defined moments in the root. So you have more decision points, but not orders of magnitude more. It remains a 'quantized' system, not really a chaotic one where an infinitesimally small difference caused by just searching a couple of extra nodes (as could happen on your emergency abort) seeds an avalanche of change that in the end affects the move choice.

But indeed, if you have 30 times as many decision points (one for every root move), you would see a different decision every 10 moves instead of every 300 moves. And then the chances for two identical games would become very small.

So I think the difference is well explained. With the timing variability you have, one would expect engines that always finish an iteration to play mostly the same games, and engines that can stop after any root move to play mainly different games. That explains the behavior of uMax and Crafty. I am not sure if Eden even has a hash table. If not, the only effective decision points, if an abort can occur at any time, would be the times where it changes its move at the root. That is not much more frequent than starting a new iteration, and that would explain Eden. Joker is not dependent on timing variability for non-determinism; it adds a time-seeded random term to its evaluation.

The accumulation of the variability is not so much a factor to worry about, I think. In the first place it only accumulates with sqrt(N) statistics. So if the variability for 1 move is 1%, for 39 moves it would have risen to 6.3% of the time of a single move. Now at normal time controls it would depend on how much time you keep as a safety margin. Typically you leave the nominal time for 3-4 moves, to be able to handle a fail-low on the 40th move. So that would then only result in a 1.5-2 times larger timing uncertainty than normal. So you would have a little extra probability on the 40th move to deviate, but not dramatically. (And the game might have been decided before that anyway.) In uMax I even divide the RemainingTime by RemainingMoves+7, as it doesn't have an emergency abort, and I want to be sure to have no time losses. That makes the effect of accumulated timing variability almost negligible.
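The sqrt(N) arithmetic here is easy to verify (a quick check of my own, not from the post): 39 independent errors of 1% of a move's time accumulate to sqrt(39) x 1%, roughly 6.2% of a single move's time, not 39%:

```python
import math
import random
import statistics

per_move_sd = 0.01   # each move's timing error: 1% of one move's time
moves = 39

print(math.sqrt(moves) * per_move_sd)   # analytic: ~0.062 (6.2% of a move)

# Monte Carlo: sum 39 independent Gaussian timing errors, many times over.
rng = random.Random(7)
totals = [sum(rng.gauss(0, per_move_sd) for _ in range(moves))
          for _ in range(20_000)]
print(statistics.pstdev(totals))        # agrees: ~0.062, nowhere near 0.39
```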

At incremental time controls with a very small increment, the situation in the time-squeeze phase might be different. I have done little testing with that, so my claims of deterministic play don't extend to that regime.
Wardy

Re: nondeterministic testing

Post by Wardy »

Always a pleasure Martin :D

There's also some merit in testing engines on different hardware setups. Certainly I can think of at least one hash bug early on with Colossus where it wouldn't use more than 512MB hash regardless of the GUI setting, purely because Martin's machine at the time couldn't test beyond that level, whereas I could.

For sure there's a lot of processing power and genuine help available to any programmer should they only ask the question of the community.