nondeterministic testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Alessandro Scotti

Re: nondeterministic testing

Post by Alessandro Scotti »

MartinBryant wrote:Also I test each change individually from some baseline version, not incrementally building up the changes on top of each other.
Hi Martin, can you tell me more about this? I'm currently always testing from the current best version, but I do that only because I like to see the program make some progress; no science behind it! :-)

P.S. And last Hamsters version would be doing great in the current test if it wasn't being excessively spanked by Colossus! :wink:
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: nondeterministic testing

Post by Rein Halbersma »

bob wrote:[
Now, let me explain what I do and why.

[snip]

I've been using this for at least 10+ years in Crafty, and for at least 15 years prior to that in Cray Blitz. I have tweaked things from time to time, but I have never looked at a game and thought "dang, If I had just searched another 2 seconds, I would have found the move that would have won, or that would have avoided losing..."

The only thing I don't do in Crafty, that I did do in Cray Blitz, is to try to figure out what to do with the "extra" time I accumulate. If you are supposed to average 180 seconds per move, and when you get to move 20 you have 90 minutes left, you are 30 minutes ahead on the clock. Do you search all remaining moves a little longer, do you save that extra for those annoying fail-low cases, or do you save part of it to carry over to the next time control or use it all in the current one?
Do you also use the trick used by the Chinook team, where they restart the entire search (i.e. start iterative deepening again from d=1) after a fail-low at the root? They claim that it finds a better move within a couple of seconds that way, because of the information already present in the transposition table.
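The restart trick can be sketched in a few lines. This is my own toy illustration, not Chinook's actual code: the `restart_on_fail_low` driver and its `search` callback are hypothetical, and the transposition table (which is what makes the restarted shallow iterations nearly free in practice) is left implicit.

```python
# Toy illustration (not Chinook's code) of the restart: when the root
# search fails low, iterative deepening begins again from depth 1.  In a
# real engine the transposition table survives the restart, so the
# shallow iterations cost almost nothing; here the TT is left implicit.

def restart_on_fail_low(search, max_depth, window=25):
    """search(depth, alpha, beta) -> (move, score); fail low means score <= alpha."""
    best_move, score = None, 0
    depth, restarts = 1, 0
    while depth <= max_depth:
        move, new_score = search(depth, score - window, score + window)
        if new_score <= score - window and depth > 1 and restarts < 1:
            restarts += 1      # cap the restarts so the loop must terminate
            depth = 1          # start deepening again from scratch
            continue
        best_move, score = move, new_score
        depth += 1
    return best_move, score
```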
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

bob wrote:Now, let me explain what I do and why.
Yes, that sounds like a very sensible way to do things. It is in fact very close to what I do in Joker:

I also have an absolute time limit in case of emergencies, to interrupt the search of a single move that would otherwise already cause a loss on time. (This sometimes happens with KeepHash: if you are in a repeat position, a very deep iteration is already satisfied in zero time from the hash, so the engine embarks on a way-too-ambitious iteration. And if you are unlucky, this iteration might dramatically fail low, in a position that is still won.) I never use more than 1/3 of the remaining time for this.

In all other cases I only stop the search after finishing a root move. I have two time-out intervals for that: the first set to half the nominal time, the other to 1.5× the nominal time. If the first time-out interval has expired, I don't start a new iteration. This means that in a stable search an iteration, once started, will almost always have time to nearly complete before the second time-out occurs. (Well, my branching factor is not really 3 yet, so in an unlucky case I might miss a few very late moves.)

After the second time-out, I don't start the search of a new root move in the current iteration unless the score of the best move I have so far is more than 20 centipawns below the score of the previous iteration. In that case the search continues until such a move is found, or until the emergency time-out occurs, which I set to 3× the nominal time (so I can only overrun if the next iteration suddenly takes more than 6 times longer than the previous one).

This might be a little short, and I don't have the refinement of allotting still more time after a material loss. Doing so might indeed bring a measurable improvement: when such a time-out occurs now, the engine often plays a move that immediately loses the game. But apart from this, the scheme is nearly identical to yours.

I have added one refinement you don't seem to have, though: if my remaining time falls below 80% of what the opponent has left, all time intervals are shortened by 20%. This prevents the development of a situation where the opponent can out-search us by a large factor in the last few moves before the time control.
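The rules described above (time-out intervals of roughly 0.5×, 1.5×, and 3× the nominal time per move, the 20-centipawn fail-low margin, and the 80%/20% clock-balance refinement) can be collected into a small sketch. All function and parameter names here are mine for illustration, not Joker's:

```python
# Sketch of the decision rules described above (my names, not Joker's).
# `nominal` is the time you can afford on an average move, in seconds.

def time_budgets(nominal, my_time_left, opp_time_left):
    # When more than 20% behind the opponent on the clock, shorten
    # every interval by 20% to stop the gap from growing.
    scale = 0.8 if my_time_left < 0.8 * opp_time_left else 1.0
    return {
        "start_iteration": 0.5 * nominal * scale,  # no new iteration after this
        "start_root_move": 1.5 * nominal * scale,  # after this, continue only on fail low
        "emergency_abort": min(3.0 * nominal * scale, my_time_left / 3.0),
    }

def may_start_root_move(elapsed, budgets, score, prev_iter_score, margin=20):
    """After the second time-out, keep searching new root moves only while
    the best score so far is more than `margin` centipawns below the
    previous iteration's score (i.e. the iteration has failed low)."""
    if elapsed < budgets["start_root_move"]:
        return True
    return score < prev_iter_score - margin
```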
However, there is one potential form for timing randomness you have not recognized.
I can't understand why you say that, as this is what I have been saying from the very beginning! :shock:

But I also pointed out that in a scheme that finishes every iteration, or even in your scheme, one is really quite insensitive to this timing, especially at longer time controls. You have only a handful of decision points, all at the root, namely the end of an iteration, and the end of each move within an iteration once the score exceeds S-delta. These decision points are spaced seconds apart. A time-out falls within the millisecond (or so) of timing uncertainty you can expect on a quiet machine only about once every 1000 moves, so only then can it cause a difference. (Once it happens, the game diverges, of course. Even if the move actually chosen is the same, all tables might now be different, with future consequences.)

So as long as the emergency abort never occurs, schemes like yours lead to highly deterministic play. And my guess is that setting the parameters such that the emergency abort is very unlikely will improve playing strength, as the emergency abort is very detrimental, and a lot of time is wasted when it occurs. In Joker I even try to avoid the second type of time-out, by being rather conservative when to start a new iteration.

But the $64-question is still unanswered: how many Elo points does your engine gain by using this elaborate scheme compared to simply always finishing the iteration?
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: nondeterministic testing

Post by Michael Sherwin »

bob wrote:You are having far better luck than I am. I am using 40 positions, 2 games each, with 4 standard opponents. A single run has a wild variance when compared with a second run or a third run...
I guess the way that I am testing now can be looked at as a type of genetic evolution: a small change, and a hundred-game match to test it. If it sets a new record it lives; if not, it dies. Every time a new record is reached, some old ideas or values whose results were close to the old record are retried. This style of testing has resulted in a slow but steady increase in results. Just today a new record was set by RomiChess against Hamsters 0.2 of 68.5 out of 100, which is a full point ahead of the old record. Never before in the hundred or so matches played so far has RomiChess scored as much as 68.5 out of 100 points against Hamsters 0.2, so I am rather optimistic that this latest result represents an improvement. But I am not going to try to verify it. I am just going to try to get a new record!
nczempin

Re: nondeterministic testing

Post by nczempin »

bob wrote:
You said that you don't finish an iteration, you cut the search off when time runs out. That alone should make it impossible to choose the same move every time, unless your eval is _very_ simple. Because there should be plenty of cases where you change your mind if you finish an iteration, but if time runs out early, you don't get to finish the new best move search and play the first (worse) move.
You are right in saying that this is undesirable. There are some cases where from one iteration to the next it is found that the current-best-move has a lower eval than it had on the previous, full iteration.

It would be prudent to let the iteration finish in that case.

However, it is not guaranteed that finishing the iteration will find a better move. I have not examined how often this happens, nor how often the old first move would have to be changed after all. At the immature (or weak, as we used to call them) level that Eden is playing at, the difference, even if it were measurable, is likely to be smaller than most of the "low-hanging fruit" changes I have been able to make so far. So for the moment I'm willing to accept that this area makes Eden weaker than necessary, just as with many other things.
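A minimal sketch of the fallback rule being discussed, purely illustrative and not from Eden's code: on time-out mid-iteration, play the best move of the last completed iteration, unless the partial iteration has already shown that move to fail low. The 20-centipawn margin echoes the threshold hgm mentions earlier in the thread; it is not Eden's.

```python
# Minimal sketch (illustrative, not from Eden): on a mid-iteration
# time-out, fall back to the best move of the last *completed* iteration,
# unless the partial iteration has already shown that move to score
# markedly lower, in which case keep searching (signalled by None).

def move_on_timeout(prev_best, prev_score, curr_scores, margin=20):
    """curr_scores: partial {move: score} results of the unfinished iteration."""
    if prev_best in curr_scores and curr_scores[prev_best] < prev_score - margin:
        return None                      # previous best failed low: unsafe to stop
    merged = dict(curr_scores)
    merged.setdefault(prev_best, prev_score)
    return max(merged, key=merged.get)   # best information we have so far
```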

Incidentally, you seem to be contradicting yourself, or perhaps I don't understand you correctly: in another post you said that I'm probably doing it the right way, but now you're saying it's wrong. What have I misunderstood?
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

So how does it work in Eden? Do you test at the root, after each move, whether the pre-set time for this move has expired, and then play the best move so far?

Why would that be any simpler than just doing that same test after every iteration?
nczempin

Re: nondeterministic testing

Post by nczempin »

hgm wrote:So how does it work in Eden? You test in the root after each move if the pre-set time for the move has expired, and then play the best move so far?

Why would that be any simpler than just doing that same test after every iteration?
I can't recall the details (though I can look them up if you insist), but in principle, yes. Perhaps I even test more frequently; I haven't looked at that part of the code in a while now.


It is not simpler per se.
But if I were to test only after each iteration, I could easily lose on time, and solving that problem requires a more sophisticated mechanism.

So yes, in that sense it is simpler.
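The Eden-style scheme, as I understand it from this exchange, might be sketched as follows (names are illustrative, not Eden's actual code): the clock is polled after every root move, so the overrun is bounded by the cost of a single root move rather than a whole iteration.

```python
# Illustrative sketch (not Eden's actual code) of the simple scheme:
# poll the clock after every root move and bail out with the best move
# so far once the pre-set budget for this move has expired.

def search_root(moves, budget, clock, evaluate):
    best, best_score = None, float("-inf")
    for move in moves:
        score = evaluate(move)             # full-depth search of this root move
        if score > best_score:
            best, best_score = move, score
        if clock() >= budget:              # checked after *every* root move,
            break                          # so we can never overrun by much
    return best
```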
PK-4

Re: nondeterministic testing

Post by PK-4 »

Prof. Hyatt wrote:It seems to me that there are two issues. If you are trying to prove program A is better than program B, the longer the time control, the better. But if you are comparing A to B and A' to B (where A' is a slightly modified A) then you really don't care so much about the final result, as you do about the difference in the results for A and A'. And there, faster is better from a development point of view.
This statement of Professor Hyatt's seems to be the solution to a testing problem I have experienced: program A may be better than program B at a short time control, but the result is reversed at a long time control. My guess was that A had the better nps and B the better branching factor. The higher speed of A would then enable a deeper search in a short time, but B would search deeper given a long time. This would in general be true for programs from different authors, and even for versions of the same engine sufficiently far apart. The prescription of comparing two close versions A and A' by pitting them against the same program B is exciting, since the testing time needed would be drastically reduced. I wonder if it is possible to prove the time-independence of this prescription logically.
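The nps-versus-branching-factor conjecture can be made concrete with a rough model: the depth reachable in t seconds is about log(nps·t)/log(EBF). With deliberately exaggerated numbers of my own choosing (a 20× speed difference), the fast-but-wide engine A out-searches the slow-but-narrow engine B at one second per move, yet falls behind at ten minutes per move:

```python
# Rough model (my illustration): tree size ~ EBF ** depth, so the depth
# reachable in a given time is about log(nps * seconds) / log(EBF).
import math

def reachable_depth(nps, seconds, ebf):
    """Depth d such that nps * seconds ~ ebf ** d (a crude tree-size model)."""
    return math.log(nps * seconds) / math.log(ebf)

# Engine A: fast (1,000,000 nps) but wider tree (EBF 3.0).
# Engine B: slow (50,000 nps) but narrower tree (EBF 2.5).
blitz_a = reachable_depth(1e6, 1, 3.0)    # ~12.6 plies
blitz_b = reachable_depth(5e4, 1, 2.5)    # ~11.8 plies: A is ahead
long_a = reachable_depth(1e6, 600, 3.0)   # ~18.4 plies
long_b = reachable_depth(5e4, 600, 2.5)   # ~18.8 plies: B is ahead
```

The crossover point depends entirely on the chosen numbers; the point is only that the logarithmic depth gain from speed is eventually dominated by the better branching factor.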

P.K.Mukhopadhyay
jswaff

Re: nondeterministic testing

Post by jswaff »

Michael Sherwin wrote:
bob wrote:You are having far better luck than I am. I am using 40 positions, 2 games each, with 4 standard opponents. A single run has a wild variance when compared with a second run or a third run...
I guess the way that I am testing now can be looked at like a type of genetic evolution. A small change and a hundred game match to test it. [snip] But, I am not going to try to verify it. I am just going to try to get a new record!
First, congrats on your new record. I'm sure Romi is improving.

That said, I don't think your methodology is a good one. If you just keep playing Romi vs. Hamsters in a continuous loop, without making any code changes, you'll eventually find a 100-game "slice" where Romi wins by more than 68.5, and many where it performs worse than 68.5. What does that tell you?
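James's point is easy to check numerically. In this sketch (my own, with made-up per-game probabilities, not measured data), an engine of fixed strength plays 1000 independent 100-game matches; the best and worst slices differ by far more than the one point separating the old and new records:

```python
# Monte-Carlo sketch (made-up probabilities) of the variance of a
# 100-game match between engines of *fixed* strength: per game the
# stronger side wins 60%, draws 15%, loses 25% -> expected 67.5/100.
import random

def match_score(rng, n_games=100, p_win=0.60, p_draw=0.15):
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        score += 1.0 if r < p_win else (0.5 if r < p_win + p_draw else 0.0)
    return score

rng = random.Random(42)
scores = [match_score(rng) for _ in range(1000)]
spread = max(scores) - min(scores)   # best minus worst 100-game slice
```

With these numbers the standard deviation of a single 100-game match is a bit over 4 points, so slices spanning 25 points or more are routine, and a new record one point above the old best is well inside the noise.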

--
James
MartinBryant

Re: nondeterministic testing

Post by MartinBryant »

Alessandro Scotti wrote:
MartinBryant wrote:Also I test each change individually from some baseline version, not incrementally building up the changes on top of each other.
Hi Martin, can you tell me more about this? I'm currently always testing from the current best version, but I do that only because I like to see the program do some progress, no science behind! :-)

P.S. And last Hamsters version would be doing great in the current test if it wasn't being excessively spanked by Colossus! :wink:
I'm not sure there's a great deal of science behind it; it just seems more sensible to test each change in isolation first, before testing them in combination (easier to draw conclusions, perhaps?).

For instance, I've currently tested 21 individual changes on top of the baseline Colossus 2007c, 9 of which have proved successful. Eventually these 9 will (probably) be put together and re-tested to become 2007d.

If you built them up incrementally, you might miss something good (or adopt something bad), as the normal non-deterministic fluctuation in the scores caused by the earlier changes would mask the real effect of the latest change, which is the one you're actually interested in.

BTW, I think everybody's been very impressed with the recent improvements you've made in Hamsters! Well done!!