nondeterministic testing


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

Uri Blass wrote:
bob wrote:
pedrox wrote:I know that if you repeat the Nunn test twice you obtain different values, and I also know that more games and more opponents are better; on that we agree. But if you have a single computer, playing 2500 games at 15+10 is impossible.

When your engine is weak, I do not believe you need that many games to improve it; it is possible that 50-100 blitz games are enough to show that you are progressing.

On many occasions we can see, for example in WBEC, a league where an engine has played 80 games and we already know that it has progressed; a second tournament often gives the verification. The 200 games that CCRL uses to assign a fixed rating suggest that they also consider about 200 games to be more or less sufficient.

When your engine has no new techniques or ideas left to add, then often you are just trying to tune values, and then, yes, you need many more games to verify that you are progressing (this is not aimed at Crafty; I admire Crafty, and with good hardware, 64 bits and several processors, Crafty is strong, and I know that you continue trying thousands of ideas nowadays).
I'm sorry, but that is simply not going to work. If your engine is weak, and you make a change you want to evaluate, that "change" is going to produce at best a small change in the program's skill level. It takes way more than 100 games to determine if it is really better or not. And I do mean "way".

Again, to see why, just run your test 4 times and look at the different results you will get. If you only run it once, you might get the run that says the change is better when it is not, or that the change is worse when it is not, or that it is better when it is, or that it is worse when it is. If you are going to improve, you must get that right. 100 games is not going to do it.
If the engine is weak, then there may be changes for which 100 games are enough to be sure that there is an improvement.

If you get an improvement of 100 Elo from a single change, then 100 games can clearly be enough to see that you got an improvement.

Uri
Agreed. If you have a broken null-move search vs a correct one, 100 games would show that. But there are very few such changes one can make in a program. Most are very minor improvements and it takes a _ton_ of games to measure that tiny improvement with any sort of accuracy at all...
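
To put rough numbers on that: treating each game as an independent result, the error bar on a match score shrinks only with the square root of the number of games, and near a 50% score one percentage point is worth about 7 Elo. A back-of-the-envelope sketch in Python (the 30% draw rate and the two-sigma bars are assumptions, purely for illustration):

Code:

import math

def elo_error_bar(games, draw_rate=0.3, z=2.0):
    # Rough z-sigma error bar (in Elo) on a match result near a 50% score.
    # Assumes independent games; the draw rate is an assumed typical value.
    p_win = (1.0 - draw_rate) / 2.0     # per-game win probability at equal strength
    var = 0.5 * p_win                   # variance of a single game score (0, 0.5 or 1)
    se_score = math.sqrt(var / games)   # standard error of the average score
    elo_per_score = 400.0 / math.log(10.0) * 4.0   # ~695 Elo per unit of score near 50%
    return z * se_score * elo_per_score

for n in (100, 256, 1000, 25000):
    print(n, round(elo_error_bar(n), 1))

With those assumptions the two-sigma bar is roughly +/- 58 Elo at 100 games, +/- 18 at 1000, and only drops below +/- 4 somewhere around 25,000 games, so a 100 Elo jump like Uri's example is visible quickly, while a 10-20 Elo tweak stays inside the noise until the game count runs into the thousands.
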
Marc Lacrosse
Posts: 511
Joined: Wed Mar 08, 2006 10:05 pm

Re: nondeterministic testing

Post by Marc Lacrosse »

jswaff wrote:I am in the process of developing a better testing strategy (...)
What time controls do others use? Is there a "shortcut" I'm missing for testing eval changes? Seems to be as much art as science. (...)
I don't think it's art.

First of all, if you decide not to go for the perfect situation (a very, very large number of games at a slow time control), you have to accept some compromise.

I am in the process of trying to find the best compromise for a very fast preliminary test.

You have a new idea: is it roughly better than the former parameters, or is it completely stupid?

One year ago I published some results with a testing scheme based on 512 ultra-rapid blitz games run in a little less than two days. The results correlated very well with ranking lists established at much slower time controls by a large number of testers (see my site).

From this experience (and other unpublished ones) I am pretty sure that the number of games is much more important than the timing.

At the moment I am experimenting with a faster testing scheme: 256 games (32 positions x (W+B) x four standard opponents).
This scheme is run in less than 24 hours on a single PC.

For each tested engine (or engine version) I get an Elo estimate with an error bar of roughly +/- 19 Elo points. So I am able to reliably detect a difference of more than 30-40 Elo points between two versions in less than 24 hours on a single PC.
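
For comparison, turning a match result into an Elo figure with a rough one-sigma band can be done along these lines; the win/draw/loss counts below are invented purely to illustrate a 256-game gauntlet, and the band comes out close to the +/- 19 quoted above:

Code:

import math

def elo_estimate(wins, losses, draws):
    # Elo difference implied by a match score, with a rough one-sigma band.
    # The win/loss/draw counts used below are made-up example numbers.
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    # one-sigma band from the sample standard deviation of the per-game scores
    var = (wins * (1 - score) ** 2 + draws * (0.5 - score) ** 2 + losses * score ** 2) / games
    se = math.sqrt(var / games)
    lo = -400.0 * math.log10(1.0 / (score - se) - 1.0)
    hi = -400.0 * math.log10(1.0 / (score + se) - 1.0)
    return elo, lo, hi

print(elo_estimate(100, 86, 70))   # roughly +19 Elo with a band of about +/- 18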

Whenever something interesting emerges with this testing I go for longer testing with a larger number of games at slower timing.

Marc
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

You are having far better luck than I am. I am using 40 positions, 2 games each, with 4 standard opponents. A single run has a wild variance when compared with a second run or a third run...
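
Just to illustrate how large that variance is expected to be even in theory, here is a crude simulation of 20 repeats of a 320-game gauntlet between two literally identical engines (the 30% draw rate is an assumption, and real games are not independent coin flips):

Code:

import random

def gauntlet(games=320, draw_rate=0.3):
    # One simulated gauntlet between two equal engines; returns the score in percent.
    score = 0.0
    p_win = (1.0 - draw_rate) / 2.0
    for _ in range(games):
        r = random.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + draw_rate:
            score += 0.5
    return 100.0 * score / games

random.seed(1)
runs = sorted(gauntlet() for _ in range(20))
print(runs[0], runs[-1])    # worst and best of 20 runs of the identical match

Even with nothing changed at all, the best and worst of those 20 runs typically end up something like 8-10 percentage points apart, which is why two or three gauntlets can easily rank two versions in opposite orders.
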
hgm
Posts: 28357
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

For testing very small modifications, that lead to a different move choice only once every 10 moves or so, the tree-comparison method I proposed might solve the problem.
jswaff

Re: nondeterministic testing

Post by jswaff »

I did some testing of my own and have come to the same conclusion - the number of games seems to be far more important than the length of the games.

Currently I'm running some Nunn matches to find suitable opponents for Prophet. It's a bit challenging to find Linux engines at Prophet's level, but there are a few out there. GNU 5.05 turned out to be a good one. EXChess is a bit stronger, but keeps dying on queenside castles. When I find some time I will try to debug it and submit a patch to Dan Homan. (Pretty sure it's a parsing error.)

--
James
nczempin

Re: nondeterministic testing

Post by nczempin »

A little perspective on weak engines; I think the situation is different once the strength exceeds about 2000 (in any semi-realistic scale).

My engine is pretty weak, and at this level the variability depends on a few factors; I'll list the ones that come immediately to mind.

1. Engines that have no book, or that (like mine) always choose the same line given the same previous moves, will play exactly the same game each time when playing each other. So the variability will be zero if you play two games with colours swapped.

2. Most engines at this weak level are comparatively stronger in calculation than in evaluation. They regularly play 1. Nh3 or just make more or less random pawn moves as long as there is no material to be gained. If your engine (or, let's say, my engine :-) plays those engines (which also have no book), I can easily "prepare" my book against them. My engine will seem stronger than it really is if it can get into positions where their superior tactics are no longer sufficient.

3. At this weak level, improvements are likely fairly significant. We are usually concerned with huge bugs, plus techniques that have been proven.


So for my engine, where I explicitly have the goal that the new version should be _slightly_ stronger than the previous one, but not exceedingly so, this is the technique I use (and so far it has been pretty consistent):

I find a group of engines that are reasonably close in strength (this selection gets better with time) and start playing a double-round tournament with them with my old version plus the new version that I consider to be worthy of testing for sufficient strength.

Usually I then decide that my version is sufficiently strong if it has at least one other engine placed between it and the old version. Sometimes it places relatively much higher.
hgm
Posts: 28357
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

An off-topic question: do you think your engine is weak because of bugs, or just because it doesn't implement many features (e.g. very simple evaluation, no pruning/extensions, no hash table, etc.)?

I noticed that over a reasonable range of time controls Eden 0.0.11 seemed about the same level as micro-Max 1.6, currently my smallest version, where I removed about everything except the alpha-beta search and material evaluation, leaving only a recapture QS, IID to find the best move (and try it first in the next iteration), and a small drive towards the center for K,B,N,P. But I am pretty sure there are no bugs (it is just too small for bugs to hide... :lol: )
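
For anyone who hasn't seen an engine stripped down that far, here is a rough Python sketch of a search of that shape: plain alpha-beta over a material-only evaluation, with a quiescence that considers nothing but recaptures on the square of the last capture. It uses the python-chess library just for move generation, leaves out the IID part, and is of course not micro-Max's actual code:

Code:

import chess

VALUES = {chess.PAWN: 100, chess.KNIGHT: 300, chess.BISHOP: 300,
          chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def material(board):
    # Material balance from the side to move's point of view.
    score = 0
    for piece in board.piece_map().values():
        value = VALUES[piece.piece_type]
        score += value if piece.color == board.turn else -value
    return score

def search(board, depth, alpha, beta, recapture_sq=None):
    if depth == 0:
        # Recapture-only QS: stand pat, then try only captures on the last capture square.
        best = material(board)
        if best >= beta or recapture_sq is None:
            return best
        moves = [m for m in board.legal_moves
                 if board.is_capture(m) and m.to_square == recapture_sq]
    else:
        moves = list(board.legal_moves)
        if not moves:
            return -30000 if board.is_check() else 0   # checkmated or stalemated
        best = -31000
    for move in moves:
        target = move.to_square if board.is_capture(move) else None
        board.push(move)
        score = -search(board, depth - 1 if depth else 0, -beta, -max(alpha, best), target)
        board.pop()
        best = max(best, score)
        if best >= beta:
            break
    return best

print(search(chess.Board(), 3, -32000, 32000))   # 3-ply search from the start position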

Is Eden similar to that?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

nczempin wrote:A little perspective on weak engines; I think the situation is different once the strength exceeds about 2000 (in any semi-realistic scale).

My engine is pretty weak, and at this level the variability depends on a few factors; I'll list the ones that come immediately to mind.

1. Engines that have no book, or that (like mine) always choose the same line given the same previous moves, will play exactly the same game each time when playing each other. So the variability will be zero if you play two games with colours swapped.
This is wrong. I can produce thousands of games played that start from Albert Silver's 40 starting positions. I can play the same position over and over and not get the same results every time. In fact, I can show you games played from the same starting position where I win all 4 the first time and lose 2 of them the second time around. You should test this and you will see what I mean.

If you play a weak engine against a strong engine, what you say might actually happen, but then the weaker engine is going to lose no matter what. And there is no way to measure improvement if you can't win some games.

But your basic premise about an opening book being needed to avoid repeated games is simply wrong. If you want data I can send you data from a million games or so run on my cluster here. The variability (non-deterministic behavior) is absolutely remarkable.


2. Most engines at this weak level are comparatively stronger in calculation than in evaluation. They regularly play 1. Nh3 or just make more or less random pawn moves as long as there is no material to be gained. If your engine (or, let's say, my engine :-) plays those engines (which also have no book), I can easily "prepare" my book against them. My engine will seem stronger than it really is if it can get into positions where their superior tactics are no longer sufficient.
A good reason for not using a book. It introduces yet another random aspect into your games that is bad for testing/evaluation purposes. Yes, you need a good book for tournaments. No, you don't need any kind of book to evaluate changes. Just start the games at the same point and see if your changes produce better results. A book adds too much random noise, and when you factor in learning, it becomes useless for evaluating changes to the program itself.
3. At this weak level, improvements are likely fairly significant. We are usually concerned with huge bugs, plus techniques that have been proven.


So for my engine, where I explicitly have the goal that the new version should be _slightly_ stronger than the previous one, but not exceedingly so, this is the technique I use (and so far it has been pretty consistent):

I find a group of engines that are reasonably close in strength (this selection gets better with time) and start playing a double-round tournament with them with my old version plus the new version that I consider to be worthy of testing for sufficient strength.

Usually I then decide that my version is sufficiently strong if it has at least one other engine placed between it and the old version. Sometimes it places relatively much higher.
And if you don't play thousands of games, sometimes the results are going to be exactly the opposite of what is really going on...
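
To see how often a small round-robin can get it backwards, one can simulate matches in which the new version really is better by a fixed amount and count how often it still scores no better than 50%. The 10 Elo gain, the 30% draw rate and the trial count below are all assumed purely for illustration:

Code:

import random

def reversal_rate(games, elo_gain, draw_rate=0.3, trials=2000):
    # Fraction of simulated matches in which the genuinely better version
    # still scores 50% or less against the old one.
    expected = 1.0 / (1.0 + 10.0 ** (-elo_gain / 400.0))   # true expected score
    p_win = expected - draw_rate / 2.0                      # per-game win probability
    reversals = 0
    for _ in range(trials):
        score = 0.0
        for _ in range(games):
            r = random.random()
            if r < p_win:
                score += 1.0
            elif r < p_win + draw_rate:
                score += 0.5
        if score <= games / 2.0:
            reversals += 1
    return reversals / trials

random.seed(1)
for n in (20, 100, 1000):
    print(n, reversal_rate(n, elo_gain=10))

With an assumed true gain of only 10 Elo, a 100-game match still shows the better version at or below 50% roughly a third of the time, and even 1000 games leave it wrong in a noticeable fraction of runs.
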
nczempin

Re: nondeterministic testing

Post by nczempin »

hgm wrote:An off-topic question: do you think your engine is weak because of bugs, or just because it doesn't implement many features (e.g. very simple evaluation, no pruning/extensions, no hash table, etc.)?

I noticed that over a reasonable range of time controls Eden 0.0.11 seemed about the same level as micro-Max 1.6, currently my smallest version, where I removed about everything except the alpha-beta search and material evaluation, leaving only a recapture QS, IID to find the best move (and try it first in the next iteration), and a small drive towards the center for K,B,N,P. But I am pretty sure there are no bugs (it is just too small for bugs to hide... :lol: )

Is Eden similar to that?
It is weak for the following reasons:
1. It is programmed in Java. This by itself is not sufficient reason, because there are much stronger Java engines (Flux 2.2 seems to be the strongest at the moment).
2. I started out preferring simplicity of implementation over performance. I had horrendous inefficiencies such as allocating an object for every piece on the board, for every position (I still generate a new position each time I go down a ply, rather than implementing just one board with a takeback function). I also still have a "move" object which gets put into a Java SortedSet for move ordering. There are still many inefficiencies of this kind; I am slowly moving towards more efficiency, at the cost of readability.
3. It definitely has bugs. In fact, the more I'm saying goodbye to the hardline OO approach the more bugs seem to come up. The current version (0.0.13) even seems to have a problem of not realizing that bishops and queens can be used to prevent mates by moving in between the (rook-type) threatening piece and the king :-)
4. There are a lot of proven techniques not yet implemented. Null-move is probably the biggest missing factor so far.
5. For the opening book I try not to use the objectively best variations (found by statistically analysing all those GM games), but to prepare a repertoire in a similar way to how a human player at this level would: find positions that I like, try to steer towards them, and don't vary the openings. It doesn't make any sense to memorize more than one variation at each branch, because your opponents won't usually be able to prepare the way they do at the GM level. So far I haven't seen any books specifically prepared to beat Eden :-) (I have seen a situation where an engine was specifically enhanced to beat it, though, in a way unknown to me; check the story here.)



I did have a major version change planned last year, which was very promising.

Unfortunately that version is lost forever (although the ideas obviously aren't), and for now I'm just back to my one-step-at-a-time approach, still getting (sometimes surprising to me) good improvements.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

hgm wrote:For testing very small modifications, that lead to a different move choice only once every 10 moves or so, the tree-comparison method I proposed might solve the problem.
The problem is in the timing. A small change on this move might not make much difference, but it loads the hash table, history counters, killer moves, etc. with values that will have a cumulative effect as the game progresses. And there is no point in searching a fixed number of nodes to test changes either, since the old and new programs will then search the same number of nodes, but not the _same_ nodes. That kind of shortcut just reduces the sample size you collect, not the sample size required to properly evaluate the changes.

Fixed depth is better if you only change evaluation terms. But even then it is not a fair comparison, because new evaluation terms slow a program down. Fixed depth does not penalize the slower program, so the measured improvement will be artificially inflated since the computational cost is ignored, whereas in a real game that cost affects the depth reached and the move played.
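
A rough way to quantify what fixed depth hides: a slower evaluation costs a fraction of a ply in a fixed-time game, roughly log(slowdown)/log(EBF) for an effective branching factor EBF. The EBF of 2 below is an assumption and real trees are far less uniform, but it gives a feel for the size of the effect:

Code:

import math

def depth_cost(slowdown, branching_factor=2.0):
    # Approximate depth lost in a fixed-time game when the search runs 'slowdown'
    # times slower (e.g. 1.10 for a 10% slower search), assuming a uniform
    # effective branching factor.
    return math.log(slowdown) / math.log(branching_factor)

for pct in (5, 10, 25, 50):
    print(str(pct) + "% slower:", round(depth_cost(1.0 + pct / 100.0), 2), "plies")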

The bottom line is that the only practical solution is a _bunch_ of games. Anything else leads to reduced (significantly reduced) accuracy. I'm working on a paper for the JICGA that really illustrates how this variability is created, and the effect it has on games. It will convince anyone testing that 100 games is absolutely worthless unless the change you make is major, as maybe in adding null-move vs no null move or something drastic like that. Reasonable changes (hopefully improvements) require thousands of games to verify. That seems to be sadly overlooked in most of the posts here.