Testing on time control versus nodes | ply

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rebel
Posts: 6995
Joined: Thu Aug 18, 2011 12:04 pm

Re: Testing on time control versus nodes | ply

Post by Rebel »

Well, since I am the OP I have some responsibility to defend my case, so I will. On the other hand, I might expect my partners in the discussion to have read the page and understood what it is trying to say.

I am not saying I am right and everybody else is wrong; the subject is too complicated for that. The weapon against randomness is volume: provided a program change is any good, it will produce a positive result in the end because of the sheer number of games.

And here lies part of the problem. I agree with Bob that a reasonably accurate test environment needs 30K games, and not many have the luxury of a cluster in the living room; my wife would instantly divorce me, and given my former PC addiction I wouldn't blame her. So I am stuck with a simple quad for the job, and 30K games at a regular time control isn't an option.

So here is my philosophy: I want a system where the only randomness is the 16,000 opening lines that produce the 32K games. If I am (for instance) testing a bishop-pair change, then every start position that does not contain a bishop pair should end in a 1-1 result, uninfluenced by the randomness introduced by "clock()" or "GetTickCount()". That is simply unnecessary noise that increases the number of games one needs to play to get a reliable result.

So in self-play (using nodes), whenever the engine plays a different move because of the change you made, that move has to be better in >50% of all cases, and if it is, you will end up with a positive score.

The only drawback I see in nodes testing is the sudden-death effect on the search: after (say) [Nodes=100,000] the search is simply terminated. This can be solved by HGM's suggestion of introducing a (say) "40/40M" level (40 million nodes per 40 moves), which, by a simple conversion from nodes to milliseconds, can run inside your existing time-control code.
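
In code the conversion could look something like this (a minimal sketch; VIRTUAL_NPS and the function names are made up, not taken from any particular engine):

Code:

/* Map a node budget onto an existing millisecond-based time allocator by
   assuming a fixed "virtual" search speed. A "40/40M" level (40M nodes per
   40 moves) then becomes 40,000 virtual milliseconds per 40 moves. */
#define VIRTUAL_NPS 1000000LL              /* pretended speed: 1M nodes/sec */

long long nodes_to_virtual_ms(long long nodes)
{
    return nodes * 1000LL / VIRTUAL_NPS;   /* 40M nodes -> 40,000 "ms" */
}

/* Inside the search, read the clock from the node counter instead of
   clock()/GetTickCount(), so identical positions always get identical
   searches while the allocator still spreads the budget over the moves. */
long long virtual_elapsed_ms(long long nodes_searched)
{
    return nodes_searched * 1000LL / VIRTUAL_NPS;
}

The existing allocator then divides the virtual budget over the moves exactly as it divides real time, which removes the sudden-death cutoff.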

Again, 16,000 games in just 8 hours as a first indication is a huge win over the 52 hours I need to play them at the 40/15 level, with less noise as well.
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing on time control versus nodes | ply

Post by hgm »

It would be even faster not to play games that would give you a guaranteed 1-1 at all. And if you don't play them, the need for guaranteeing the 1-1 in such cases would also disappear.
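
For a bishop-pair test that filter is cheap. A rough sketch (hypothetical code, not from any real test framework, and of course only a coarse first cut, since a pair can still arise or vanish during the game):

Code:

#include <stdbool.h>

/* Count bishops of one color ('B' or 'b') in the piece-placement field of
   a FEN string (everything before the first space). */
static int count_bishops(const char *fen, char bishop)
{
    int n = 0;
    for (const char *p = fen; *p && *p != ' '; p++)
        if (*p == bishop)
            n++;
    return n;
}

/* Keep a start position only if at least one side holds the bishop pair,
   i.e. only if a bishop-pair evaluation change can affect it at all. */
bool bishop_pair_relevant(const char *fen)
{
    return count_bishops(fen, 'B') >= 2 || count_bishops(fen, 'b') >= 2;
}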
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Testing on time control versus nodes | ply

Post by Sven »

I'd suggest using your N cores to play N single-core games in parallel instead of one N-core game. That should reduce the randomness of your tests significantly. Of course it also reduces the quality of the games, just as if you played at a different time control. So 3600 4-core games at 40/15 might be replaced by 14400 1-core games at 40/5 (if your parallel speedup with 4 cores were around 3.33). If that is acceptable to you, give it a try. I would always prefer single-core games over multi-core games when aiming for reproducible results.

Regarding the question about the reversed games, the key problem is that you play games "A vs. A" (i.e. two identical engine versions) and want to see a 50.0% result. As others have stated, this can only be achieved by either
a) carefully removing all unbalanced positions (which costs a lot of effort and still leaves some uncertainty about whether you are really done), or
b) repeating the games with reversed colors, as you did (which leads to dependent games, with the usual statistical problems arising from that), while removing all other randomness factors (like parallel search).

Therefore I suggest that you skip the "50.0%" as a goal when playing "A vs. A", and then drop the reversed-colors repetitions. Essentially this is also what HGM has proposed.

When playing "A1 vs. A2", which I consider to be the more realistic use case, you would not expect 50.0% in general anyway. But in this case you could also keep the reversed-color repetitions since they would not lead to dependent games in my opinion (A1 with White would differ from A2 with White).

Sven
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing on time control versus nodes | ply

Post by hgm »

I think you have it reversed from what you mean: one 40/5 4-CPU game is equivalent to a 40/15 1-CPU game, but because you can play 4 at the same time in the latter case, you will finish the same number of games 25% faster.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Testing on time control versus nodes | ply

Post by Sven »

hgm wrote:I think you have it reversed from what you mean: one 40/5 4-CPU game is equivalent to a 40/15 1-CPU game, but because you can play 4 at the same time in the latter case, you will finish the same number of games 25% faster.
Yes, "40/5" was wrong, I meant "40/50" ... So 3600 4-core games 40/15 are roughly equivalent to 1080 1-core games 40/50, and by using all 4 cores in parallel you can finish 3600 games in (3600 / (4*1080)) = 5/6 of the time = 16.67% faster. Or you play 20% more games in the same time (4320 instead of 3600).

Thanks for correcting this.

Sven
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing on time control versus nodes | ply

Post by bob »

Sven Schüle wrote:I'd suggest using your N cores to play N single-core games in parallel instead of one N-core game. That should reduce the randomness of your tests significantly. Of course it also reduces the quality of the games, just as if you played at a different time control. So 3600 4-core games at 40/15 might be replaced by 14400 1-core games at 40/5 (if your parallel speedup with 4 cores were around 3.33). If that is acceptable to you, give it a try. I would always prefer single-core games over multi-core games when aiming for reproducible results.

Regarding the question about the reversed games, the key problem is that you play games "A vs. A" (i.e. two identical engine versions) and want to see a 50.0% result. As others have stated, this can only be achieved by either
a) carefully removing all unbalanced positions (which costs a lot of effort and still leaves some uncertainty about whether you are really done), or
b) repeating the games with reversed colors, as you did (which leads to dependent games, with the usual statistical problems arising from that), while removing all other randomness factors (like parallel search).

Therefore I suggest that you skip the "50.0%" as a goal when playing "A vs. A", and then drop the reversed-colors repetitions. Essentially this is also what HGM has proposed.

When playing "A1 vs. A2", which I consider to be the more realistic use case, you would not expect 50.0% in general anyway. But in this case you could also keep the reversed-color repetitions since they would not lead to dependent games in my opinion (A1 with White would differ from A2 with White).

Sven
The two games still have the same underlying characteristics, i.e. a king-side attack, or a pawn-majority endgame, or whatever. One can even argue, to a lesser degree, that using the same opponent multiple times is not exactly an independent set of trials. How serious this actually is, I don't know, but a simple thought experiment would tell you that playing 30K games against the same opponent, same starting position, would NOT give you a +/-4 Elo error bar... BayesElo would claim that, because it would not understand that the games are anything but independent.
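
For reference, the +/-4 figure comes straight from the independence assumption. A quick sketch of that computation (ignoring draws for simplicity, which only widens the bar slightly):

Code:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 30000.0;                         /* assumed-independent games */
    double p = 0.5;                             /* expected score near equality */
    double se_score  = sqrt(p * (1.0 - p) / n); /* std. error of the mean score */
    double elo_per_p = 400.0 / log(10.0) / (p * (1.0 - p)); /* dElo/dscore at p */

    printf("95%% error bar: +/- %.1f Elo\n", 1.96 * se_score * elo_per_p); /* ~3.9 */
    return 0;
}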
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing on time control versus nodes | ply

Post by hgm »

Well, it is easy to check by looking at the results against individual engines. If you have tested two versions of Crafty (C1 and C2) against many different engines, each test set against one opponent (say C1-A and C2-A) gives you a measure of the Elo difference between C1 and C2. There will be a statistical error on that, which you can make as small as you like by playing enough games from independent positions.

The question now is whether the derived difference C2-C1 measured against opponent A is always the same as that measured against B (within the statistical error bars). If it is (and also for opponents C, D, etc.), then apparently any single opponent would be as good as any other one, or any combination. (Although with a single opponent you might have a bit more difficulty getting enough independent start positions.)

But most likely the results won't be the same, and some opponents will find a significantly different Elo difference between C1 and C2. You can then take the set of all these rating differences from single opponents and calculate their standard deviation. This tells you how far your result from a single opponent will on average deviate from the (sought) result against an 'average opponent'.

Testing against M opponents will then lead to an error due to opponent choice that is a factor sqrt(M) smaller (if all opponents are independent, i.e. no close derivatives among them). This error can only be reduced by increasing the number of opponents, not by increasing the number of games.
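
In code that bookkeeping is trivial. A sketch with invented per-opponent numbers (and assuming the per-opponent statistical error has already been made small by playing enough games):

Code:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Elo(C2) - Elo(C1) as measured against each single opponent;
       the values here are invented placeholders. */
    double diff[] = { 12.0, 5.0, 9.0, -2.0, 7.0 };
    int m = sizeof diff / sizeof diff[0];

    double mean = 0.0;
    for (int i = 0; i < m; i++) mean += diff[i];
    mean /= m;

    double var = 0.0;
    for (int i = 0; i < m; i++) var += (diff[i] - mean) * (diff[i] - mean);
    double sd = sqrt(var / (m - 1));   /* spread due to opponent choice */

    printf("single-opponent deviation: %.1f Elo\n", sd);
    printf("error of the %d-opponent average: %.1f Elo\n", m, sd / sqrt((double)m));
    return 0;
}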
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing on time control versus nodes | ply

Post by bob »

hgm wrote:Well, it is easy to check by looking at the results against individual engines. If you have tested two versions of Crafty (C1 and C2) against many different engines, each test set against one opponent (say C1-A and C2-A) gives you a measure of the Elo difference between C1 and C2. There will be a statistical error on that, which you can make as small as you like by playing enough games from independent positions.

The question now is whether the derived difference C2-C1 measured against opponent A is always the same as that measured against B (within the statistical error bars). If it is (and also for opponents C, D, etc.), then apparently any single opponent would be as good as any other one, or any combination. (Although with a single opponent you might have a bit more difficulty getting enough independent start positions.)

But most likely the results won't be the same, and some opponents will find a significantly different Elo difference between C1 and C2. You can then take the set of all these rating differences from single opponents and calculate their standard deviation. This tells you how far your result from a single opponent will on average deviate from the (sought) result against an 'average opponent'.

Testing against M opponents will then lead to an error due to opponent choice that is a factor sqrt(M) smaller (if all opponents are independent, i.e. no close derivatives among them). This error can only be reduced by increasing the number of opponents, not by increasing the number of games.
My logical ranking, from worst to best, would look like this:

1. one opponent, one starting position, 30K (or whatever) games.

2. one opponent, 15K starting positions, 30K games, alternating colors and playing each position twice.

3. one opponent, 30K starting positions, 30K games, alternating colors on every other position to play 15K as White and 15K as Black.

4. N opponents, 30K/2N positions, 30K games, every position played twice with alternating colors.

5. N opponents, 30K/N positions, 30K games, each position played once, alternating colors on every other position.

6. 30K opponents, 1 position, 30K games, alternating colors every other game.

7. 30K opponents, 30K positions, 30K games, again alternating colors on every other position.

The last two are clearly impossible since there are not that many opponents available. I have been toying with option 5, however. BayesElo would not see any difference when the number of positions is varied, but it would reflect the number of opponents.
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing on time control versus nodes | ply

Post by hgm »

That sounds reasonable. But my point was that this could be quantified. You can extract from the data how large your standard error is due to the limited number of opponents. That will also tell you at which point it becomes pointless to play more games: once the standard error starts to be dominated by opponent selection, adding more games only helps to train the engine to exploit the weaknesses of those particular opponents, but does nothing to improve the result against the hypothetical 'average opponent'.