Nondeterministic Testing in Weak engines such as Eden 0.0.13

Uri Blass · Post by **Uri Blass** » Wed Sep 12, 2007 8:48 pm

bob wrote:
Uri Blass wrote:
bob wrote:
Jan Brouwer wrote:
Hi Bob,

I understand that you have considerable hardware resources available for testing.
Can you give any general advice on how you would test on a single PC, let's say a quad-core processor with a time limit of about 20 hours (to allow for daily iterations)?
What time control (maybe several different ones?), how many different opponents, etc.

So far I have done most testing at 20 seconds + 1 second/move against about 6 opponents using Nunn starting positions, just to get a reasonable number of games in a few hours.
bob wrote:
Sorry, but I can't give you an answer. To get reliable/stable results, you need thousands of games, not tens or hundreds. My goal is to be able to accurately say whether A or A' is better with a very high level of accuracy. I tried to use one PC for years and never found something workable...
Uri Blass wrote:
I disagree.
There are things that you clearly can do:
1) You can use common sense to decide if a change is good.
If the program has some weakness, and you see that a change fixes that weakness and the result in games is also good, then the change probably works.

bob wrote:
Sorry, but that's no good. I can't count the number of "obviously good ideas" we have implemented this past year, but testing shows they are worse than the original. If you rely on intuition, you are going to make a _lot_ of wrong steps. Objective measurement is the key...

Uri Blass wrote:
2) You can use test suites when you make changes in your search.
If you make your program faster with the same output, you can be practically sure that you made an improvement, and you may need games only to verify that you have no bugs that appear only when you run more than one search.
bob wrote:
Again, wrong in my opinion. To solve test suites faster, just increase your check extensions, etc. But that won't make your program play better in real games. It will likely slow it down enough that it will actually play significantly worse. Chest is a good example. Optimized for finding mates. Would make a horrible game player...

Uri Blass wrote:
A change in the search is not always exactly a speed improvement, but even in that case you can use test suites.

Note that I allow checks in the first 2 plies of the qsearch in movei, but I have no special move generator that generates only captures and checks; I simply generate all moves and later pick the checks out of them.

I am thinking of adding a special move generator that generates only captures and checks.

This generator will not give me a pure speed improvement, because the order of the generated moves may differ from that of the normal move generator, but I think it will be possible to see whether there is an improvement based on test suites, while I play games only to verify that there is no serious bug.

3) You may play games at super fast time control for some of the changes that you try.
bob wrote:
Possibly. But "super-fast" games make tactical programs look better than they actually are, or they make positional programs look worse, because the relative difference in the average search depth increases as the games speed up. Again, you can draw the wrong conclusion.

Uri Blass wrote:
As far as I know, part of the testing of Rybka is done simply with very fast games (game in less than 1 second).

Uri
bob wrote:
If you only do an eval change, you can often get by with fast games. But if you don't run long games occasionally, you get surprised...

1) My definition of obviously good ideas is probably different from yours.
I do not see it as relevant for the middlegame; I think I can say it mainly about knowledge of specific endgames that movei still does not have (like knowing that some endgames are drawn or won, in which case the only thing I am afraid of is bugs).

One example:
Movei still gives a big advantage score to positions like KRB vs KRP, and I am sure it should be smaller and closer to a draw.

Movei is already a slow searcher, so I am not afraid of a possible small loss in speed from adding knowledge.

2) I agree that there are changes that are productive in test suites but counterproductive in games. I am not saying that I can use test suites for every search change; whether to use test suites depends on the change that you make.

If the change is close to being equivalent to a speed improvement, you can use test suites.

It does not have to be a direct speed improvement, because the order of the moves that you generate may be different, but I am talking about changes that let you reach the same depth faster without a change in the search algorithm.

3) I plan to test the program against itself at 1000 nodes per second, which means less than 1 second per game, and I believe that this can be productive in detecting bugs when adding endgame knowledge (if the new program does not win a match that it is supposed to win).

In this case, most of the 2-game matches from the first opening position are supposed to end 1:1, and I may look at games that finished with a different result to see what is wrong, or whether the result other than 1:1 is thanks to correct knowledge.

Uri
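
The pair check described in point 3 could be automated along these lines. This is only a minimal sketch: the function name `unbalanced_pairs`, the `(opening_id, points)` input format, and the idea of grouping games by an opening identifier are assumptions made for illustration, not part of movei or any existing tool. Points are counted from the new version's point of view (1, 0.5, or 0).

```python
from collections import defaultdict

def unbalanced_pairs(games):
    """Return {opening_id: total_points} for 2-game pairs that did not end 1:1.

    `games` is an iterable of (opening_id, points) tuples (hypothetical format),
    where points is 1, 0.5, or 0 from the new version's point of view.
    """
    by_opening = defaultdict(list)
    for opening_id, points in games:
        by_opening[opening_id].append(points)

    flagged = {}
    for opening_id, pts in by_opening.items():
        # Only complete pairs (one game with each color) are considered.
        if len(pts) == 2 and sum(pts) != 1.0:
            flagged[opening_id] = sum(pts)
    return flagged

games = [
    ("op1", 1), ("op1", 0),      # 1:1, as expected between near-identical versions
    ("op2", 1), ("op2", 0.5),    # 1.5:0.5, worth a manual look
    ("op3", 0.5), ("op3", 0.5),  # 1:1
]
print(unbalanced_pairs(games))
```

Only the flagged openings would then need to be replayed by hand to decide whether the deviation comes from a bug or from the added knowledge.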

bob · Post by **bob** » Wed Sep 12, 2007 9:31 pm

There is a flaw with node count testing. Here's how to see it. Run a match of some number of games where you use something like 10,000 nodes. Then run the same match again, but now use 11,000 nodes (a 10% difference). Look at the results. Why is this important? Because by choosing a specific node count, you are restricting yourself to a small part of the total population of games. You make a tiny eval change, which simply makes the 1000 nodes that A' searches different from the 1000 nodes that A searches, and that small sample might well say A' is better when it is not, or vice versa. Yes, fixed nodes produces deterministic results, IF AND ONLY IF nothing is changed to change which nodes are actually searched. Otherwise you still need a very large number of games.

Personally I would not trust 1K node games for anything. 1M, maybe just a bit.
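
To make the "very large number of games" point concrete, here is a rough sketch of how wide the uncertainty on a match result is. It assumes independent games, a normal approximation for the mean score, and the usual logistic Elo model; the function names `elo_from_score` and `elo_interval` are made up for this illustration and do not come from any tool mentioned in the thread.

```python
import math

def elo_from_score(score):
    """Convert a mean score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    """Approximate confidence interval (in Elo) for a match result.

    Assumes independent games and a normal approximation; z=1.96 gives
    roughly a 95% interval.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, which is 1, 0.5, or 0.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    eps = 1e-9  # keep the bounds strictly inside (0, 1) for the log
    lo = min(max(score - z * se, eps), 1.0 - eps)
    hi = min(max(score + z * se, eps), 1.0 - eps)
    return elo_from_score(lo), elo_from_score(hi)

# Same 55% score, very different certainty:
print(elo_interval(55, 0, 45))      # interval includes 0 Elo
print(elo_interval(5500, 0, 4500))  # interval is entirely above 0 Elo
```

Under these assumptions, a 55:45 result over 100 games is still compatible with the two versions being equal, while the same ratio over 10,000 games is not.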

tvrzsky · Post by **tvrzsky** » Thu Sep 13, 2007 2:25 am

Mr. Hyatt,
what do you think about the FIXED TIME PER MOVE condition? I use it for now for several reasons, mostly because my engine does not have any time management yet. Seriously, I am simply not interested in that at the moment, since at the present stage I care solely about search and evaluation improvements and not about overall playing strength at all. I assume that this way I am able to take this important but somewhat unrelated factor (i.e. time management) out of play. At the same time I think that fixed time is a more realistic condition (closer to a real game) than fixed depth or even fixed node count. So I use 3 - 6 seconds per move, which on my hardware usually translates to a few million nodes and search depths of about 9 - 12 in an ordinary middlegame.
With all your experience, do you see any clear weak points in this approach that I am not aware of?
Thanks

bob · Post by **bob** » Thu Sep 13, 2007 5:19 am

fixed time per move is reasonable. Of course, once you add some reasonable time management, you will want to test in a way that allows that code to influence the game as well.
