I subjected my testing methodology to scrutiny and found that it is not reliable. Basically I am using a 40/15 time control, playing 3600 games. When I pitted 2 equal versions against each other I expected that after some time the percentage would get very close to 50.0%. That did not happen; see the report at: http://www.top-5000.nl/tuning2.htm
Looking for feedback on whether this is normal.
Testing on time control versus nodes | ply
Moderators: hgm, Rebel, chrisw
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing on time control versus nodes | ply
Rebel wrote: I subjected my testing methodology to scrutiny and found that it is not reliable. Basically I am using a 40/15 time control, playing 3600 games. When I pitted 2 equal versions against each other I expected that after some time the percentage would get very close to 50.0%. That did not happen; see the report at: http://www.top-5000.nl/tuning2.htm
Looking for feedback on whether this is normal.
Multiple things.
1. 3600 is a minimal number of games. How close to 50% did you get? This is a good place for statistics rather than logic. If you flipped a coin 3600 times would you really expect 1800 heads and 1800 tails? I'd be utterly amazed if I got that and would suspect some sort of biased test. I'll run a 3600 game test on my laptop, identical versions playing, and see what kind of number I get the first time around...
2. An error bar of +/- 4 Elo requires about 30,000 games. And +/- 4 is a pretty wide error bar.
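For anyone who wants to check these numbers, here is a minimal sketch (my own, not from the thread) of the usual back-of-the-envelope error-bar calculation, assuming independent games, equal opponents, and the standard logistic Elo curve; the "+/- 4 Elo at 30,000 games" figure comes out for a draw rate of zero:

```python
import math

def elo_error_bar(games, draw_rate=0.0, confidence_sd=2):
    """95% (2 SD) error bar in Elo for a match of `games` independent games
    between equal opponents. Score variance per game is (1 - draw_rate)/4;
    the slope of the logistic score/Elo curve at 50% is ln(10)/1600 per Elo."""
    var_per_game = (1 - draw_rate) / 4
    sd_score = math.sqrt(var_per_game / games)
    elo_per_score = 1600 / math.log(10)  # ~695 Elo per unit of score
    return confidence_sd * sd_score * elo_per_score

print(round(elo_error_bar(3600), 1))                  # -> 11.6
print(round(elo_error_bar(30000), 1))                 # -> 4.0
print(round(elo_error_bar(3600, draw_rate=0.5), 1))   # -> 8.2
```

Note that a realistic draw rate shrinks the bar: at a 50% draw rate, 3600 games give roughly +/- 8 Elo, which matches the figure Kai quotes later in the thread.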
3. I dislike nodes to measure improvement. It hides too much. For example, it equalizes every node whether or not they are actually equal in terms of speed, and importance. It allows an engine to force the game into an area where it is less efficient, without any penalty, since the time difference is not counted.
The test is running now, playing very fast games with ponder=on on a dual-core box.
-
- Posts: 536
- Joined: Thu Mar 09, 2006 3:01 pm
Re: Testing on time control versus nodes | ply
Could you please clarify the "games" v "pairs".
Might you only be playing half the number of games?
Either way the uncertainty is not unreasonable.
How many ply were used for the fixed ply test?
For pure evaluation testing I am experimenting with fixed ply also.
On my quads, I am able to run 8-16 threads reliably with Winboard's tournament manager. There seems to be more overhead in logging the pgns than in CPU search time. 10,000 games only take a few minutes, BTW.
Winboard reads the .trn file for each game to know where it left off, so I reduce the number of threads by 2 for about every 4,000 games.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Testing on time control versus nodes | ply
Rebel wrote: I subjected my testing methodology to scrutiny and found that it is not reliable. Basically I am using a 40/15 time control, playing 3600 games. When I pitted 2 equal versions against each other I expected that after some time the percentage would get very close to 50.0%. That did not happen; see the report at: http://www.top-5000.nl/tuning2.htm
Looking for feedback on whether this is normal.
A match of 3600 independent games with a 50% draw rate has a 2 SD of 1.2%, i.e. 8-9 Elo points, so your result at 40/15 is normal. If the fraction of color-reversed games that are truly independent of each other decreases, the statistical error gets smaller, going to 0 if all white/black games are identical.
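That 1.2% figure is easy to reproduce with a toy Monte Carlo (a sketch of my own, assuming each game is independent: a draw with probability 0.5, otherwise a fair coin-flip win/loss):

```python
import random
import statistics

def match_score(games=3600, draw_rate=0.5):
    """Fraction of points scored in one simulated match between equal engines."""
    total = 0.0
    for _ in range(games):
        if random.random() < draw_rate:
            total += 0.5          # draw
        elif random.random() < 0.5:
            total += 1.0          # win (a loss adds nothing)
    return total / games

random.seed(1)
scores = [match_score() for _ in range(1000)]
two_sd = 2 * statistics.pstdev(scores)
print(f"2 SD of the match percentage: {100 * two_sd:.2f}%")  # close to 1.2%
```

The analytic value is 2 * sqrt(0.125 / 3600) = 1.18%, and the simulated spread lands right on it; a single 3600-game match routinely finishing at 49% or 51% is therefore entirely normal.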
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing on time control versus nodes | ply
SMP engines involve some intrinsic randomness, due to the scheduling of the threads by the OS. So in general they won't repeat the same game when started from the same position. They would not even repeat the same game when you let them play by a total number of nodes. Only 1-CPU engines would do that.
Rather than just looking at the percentage of long reversed-color matches, you could look at the individual game pairs to see how 'bad' the problem really is. How many of the color-reversed games are truly identical, and how many radically different? They might all be different while you are still close to 50%, because games that are different can still have the same result, and when they have opposite results this is masked in the total: sometimes a win turns into a loss, other times a loss turns into a win, and the total stays the same.
I once tried a match with the single-CPU engines micro-Max 1.6 and Eden 0.0.13, which are completely deterministic and always finish a root iteration. So the only non-determinism comes from the decision to start a new iteration, when the previous one ends within the timing jitter of the moment the engine uses to decide this. About half of the 80 games were different, i.e. they started the same, but at some point one of the engines played a different move.
The clearest indicator of the 'problem' is how many moves on average you can play before the identical engines play a different move.
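Measuring that divergence point is easy to script. A minimal sketch of my own (not a real PGN parser; games are just lists of move strings, and the example moves are hypothetical):

```python
def first_divergence(game_a, game_b):
    """0-based index of the first move where two games differ, or the length
    of the shorter game if one is a prefix of the other. Averaging this over
    many repeated games measures how long identical engines stay in lockstep."""
    for i, (move_a, move_b) in enumerate(zip(game_a, game_b)):
        if move_a != move_b:
            return i
    return min(len(game_a), len(game_b))

# Two games identical for four moves, then one engine deviates:
g1 = ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5"]
g2 = ["e2e4", "e7e5", "g1f3", "b8c6", "f1c4"]
print(first_divergence(g1, g2))  # -> 4
```

Averaging `first_divergence` over all repeats of a position gives exactly the indicator described above: how many moves, on average, you get before the first different move.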
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing on time control versus nodes | ply
Another remark after reading your web page on this:
You would not have these problems if you used the WinBoard 'nps' feature, (later discussed here by Don Dailey under the header 'Nodes = Time'), which would allow the engine to play by nodes, rather than time, but would still allow it to freely allocate the nodes in its budget from one move to the other, just as it does in classical or incremental TC. E.g. you just play 400M nodes/40 moves, and the engine would decide if it allocates 10M nodes to each move, or whether some moves need 50M nodes while others can do with 1M nodes to make up for that.
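As an illustration of how such a nodes-as-time budget behaves like a classical clock, here is a toy allocator (my own sketch; the real allocation logic lives inside each engine):

```python
def nodes_for_move(nodes_left, moves_left, safety_pct=90):
    """Toy node-budget allocator mirroring a 40-moves/session clock: spend
    roughly an equal share of the remaining budget on each move, with a
    safety margin so one overshooting move cannot exhaust the whole budget."""
    return nodes_left * safety_pct // (100 * max(moves_left, 1))

# 400M nodes for 40 moves: the first move gets ~9M nodes. A move whose
# search stops early leaves its unused nodes for the remaining moves,
# exactly as unused seconds carry over on a real clock.
budget = 400_000_000
print(nodes_for_move(budget, 40))  # -> 9000000
```

The point is that, unlike a fixed 100,000-node-per-move cap, nothing here forces the engine to abandon a fail-low resolution: it can overspend on one move and recover on later ones, within the session budget.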
The real problem with playing by nodes is that you won't be detecting how expensive the changes you test are in terms of slowdown. So you would always have to measure the total time used by each version for the number of nodes you gave it, and afterwards compensate by hand for the difference with some empirical Elo-vs-TimeUse formula. Or play 'nodes-odds' games, (which WinBoard can also do, btw), where you carefully tune the node budget of each engine to equalize their time use.
Rebel wrote: When faced with a fail-low there will often not be enough time to find a better move. After 100,000 nodes it's just boom and the engine will play the bad move.
The same applies when the engine is in the process of finding a better move and the search is suddenly terminated due to the 100,000-node limit, so the better move is not played.
While these kinds of problems don't exist using regular time-control or fixed-depth testing, it's not unreasonable to assume these 2 disadvantages will be equally divided during a match of 4000 | 8000 | 16,000 games.
Note that the disadvantages you mention are not related to the use of nodes per se, but are characteristic of 'fixed maximum per move' depth limiters. They would also occur using a fixed maximum time per move (WB 'st' mode, UCI 'movetime').
-
- Posts: 6993
- Joined: Thu Aug 18, 2011 12:04 pm
Re: Testing on time control versus nodes | ply
bob wrote: Multiple things.
1. 3600 is a minimal number of games. How close to 50% did you get?
Read the link.
bob wrote: This is a good place for statistics rather than logic. If you flipped a coin 3600 times would you really expect 1800 heads and 1800 tails? I'd be utterly amazed if I got that and would suspect some sort of biased test.
I only partly agree, and here is why. First of all there are 3600 flip_a_coin moments, not 1800. Secondly, an average game contains 60-80 moves, of which several will be quite decisive to the outcome of the game, especially at these low depths; these (in my view) are flip_a_coin moments as well.
If you look at the last section of the page (Emulate) I think this is demonstrated.
bob wrote: I'll run a 3600 game test on my laptop, identical versions playing, and see what kind of number I get the first time around...
Will be interesting.
bob wrote: 3. I dislike nodes to measure improvement. It hides too much. For example, it equalizes every node whether or not they are actually equal in terms of speed and importance. It allows an engine to force the game into an area where it is less efficient, without any penalty, since the time difference is not counted.
I am aware of the disadvantages; I mention them myself. Nevertheless it kills the time-control whimsicality completely, just look at the sudden robustness of equal moves. With simple hardware (like a quad), nodes testing could be more effective than regular time control during the development period.
-
- Posts: 6993
- Joined: Thu Aug 18, 2011 12:04 pm
Re: Testing on time control versus nodes | ply
brianr wrote: Could you please clarify the "games" v "pairs".
Might you only be playing half the number of games?
A game and its reverse game I call a pair.
brianr wrote: How many ply were used for the fixed ply test?
I am using the ProDeo approach; I believe the base depth was 8 or 9.
-
- Posts: 6993
- Joined: Thu Aug 18, 2011 12:04 pm
Re: Testing on time control versus nodes | ply
hgm wrote: E.g. you just play 400M nodes/40 moves, and the engine would decide if it allocates 10M nodes to each move, or whether some moves need 50M nodes while others can do with 1M nodes to make up for that.
That's brilliant!
hgm wrote: The real problem with playing by nodes is that you won't be detecting how expensive the changes you test are in terms of slowdown. So you would always have to measure the total time used by each version for the number of nodes you gave it, and afterwards compensate by hand for the difference with some empirical Elo-vs-TimeUse formula. Or play 'nodes-odds' games, (which WinBoard can also do, btw), where you carefully tune the node budget of each engine to equalize their time use.
I keep statistics to cover that.
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing on time control versus nodes | ply
Well, WinBoard/XBoard has been supporting that for a long time, through the 'nps' command of WB protocol. When the engine receives 'nps 100000' it should convert its node count to seconds by dividing it by 100000 (e.g. in the routine that normally reads the clock), and otherwise everything can stay the same. The GUI will still send it time and otim commands in centiseconds, but these will be virtual centiseconds, and the engine can directly compare them to this virtual time it computes, as it always does.
WinBoard would use the node counts in the engine's thinking output to decrement the clock, (to make sure the clock measures virtual seconds), doing the same computation on it. So it would be important that the engine reliably reports its node count, and in particular doesn't fail to always print thinking output when it stops thinking. (Also when the thinking was interrupted by a time-out!) But that is really all that is needed.
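The conversion itself is trivial. A sketch of my own of the bookkeeping, assuming the convention described above: 'nps N' means N nodes count as one virtual second, and the WB clock is kept in centiseconds:

```python
def virtual_centiseconds(node_count, nps):
    """Convert a searched-node count into the virtual centiseconds that
    WinBoard's 'nps' mode works in: at 'nps 100000', 100000 nodes equal
    one virtual second, i.e. 100 virtual centiseconds."""
    return node_count * 100 // nps

# An engine told 'nps 100000' that has searched 250,000 nodes has used
# 2.5 virtual seconds (250 centiseconds) of its clock.
print(virtual_centiseconds(250_000, 100_000))  # -> 250
```

Both sides of the protocol do the same arithmetic: the engine compares this value against the virtual 'time'/'otim' it receives, and the GUI decrements the clock by the same amount based on the node counts in the thinking output.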