I subjected my testing methodology to scrutiny and found that it is not reliable. Basically I am using a 40/15 time control, playing 3600 games. When I pitted 2 equal versions against each other I expected that after some time the percentage would get very close to 50.0%. That did not happen; see the report at: http://www.top-5000.nl/tuning2.htm
Looking for feedback on whether this is normal.
Testing on time control versus nodes | ply
Moderators: hgm, Rebel, chrisw
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing on time control versus nodes | ply
Rebel wrote: I subjected my testing methodology to scrutiny and found that it is not reliable. Basically I am using a 40/15 time control, playing 3600 games. When I pitted 2 equal versions against each other I expected that after some time the percentage would get very close to 50.0%. That did not happen; see the report at: http://www.top-5000.nl/tuning2.htm
Looking for feedback on whether this is normal.
Multiple things.
1. 3600 is a minimal number of games. How close to 50% did you get? This is a good place for statistics rather than logic. If you flipped a coin 3600 times would you really expect 1800 heads and 1800 tails? I'd be utterly amazed if I got that and would suspect some sort of biased test. I'll run a 3600 game test on my laptop, identical versions playing, and see what kind of number I get the first time around...
2. An error bar of +/- 4 Elo requires about 30,000 games. And +/- 4 is a pretty wide error bar.
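For anyone who wants to check these numbers, here is a minimal sketch (my own, not from the thread) of the usual back-of-the-envelope error-bar calculation, assuming independent games, equal opponents, and the standard logistic Elo curve; the "+/- 4 Elo at 30,000 games" figure comes out for a draw rate of zero:

```python
import math

def elo_error_bar(games, draw_rate=0.0, confidence_sd=2):
    """95% (2 SD) error bar in Elo for a match of `games` independent games
    between equal opponents. Score variance per game is (1 - draw_rate)/4;
    the slope of the logistic score/Elo curve at 50% is ln(10)/1600 per Elo."""
    var_per_game = (1 - draw_rate) / 4
    sd_score = math.sqrt(var_per_game / games)
    elo_per_score = 1600 / math.log(10)  # ~695 Elo per unit of score
    return confidence_sd * sd_score * elo_per_score

print(round(elo_error_bar(3600), 1))                  # -> 11.6
print(round(elo_error_bar(30000), 1))                 # -> 4.0
print(round(elo_error_bar(3600, draw_rate=0.5), 1))   # -> 8.2
```

Note that a realistic draw rate shrinks the bar: at a 50% draw rate, 3600 games give roughly +/- 8 Elo, which matches the figure Kai quotes later in the thread.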
3. I dislike nodes to measure improvement. It hides too much. For example, it equalizes every node whether or not they are actually equal in terms of speed, and importance. It allows an engine to force the game into an area where it is less efficient, without any penalty, since the time difference is not counted.
The test is running now, playing very fast games with ponder=on on a dual-core box.
-
- Posts: 536
- Joined: Thu Mar 09, 2006 3:01 pm
Re: Testing on time control versus nodes | ply
Could you please clarify the "games" v "pairs".
Might you only be playing half the number of games?
Either way the uncertainty is not unreasonable.
How many ply were used for the fixed ply test?
For pure evaluation testing I am experimenting with fixed ply also.
On my quads, I am able to run 8-16 threads reliably with Winboard's tournament manager. There seems to be more overhead in logging the pgns than in CPU search time. 10,000 games only take a few minutes, BTW.
Winboard reads the .trn file for each game to know where it left off, so I reduce the number of threads by 2 for about every 4,000 games.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Testing on time control versus nodes | ply
Rebel wrote: I subjected my testing methodology to scrutiny and found that it is not reliable. Basically I am using a 40/15 time control, playing 3600 games. When I pitted 2 equal versions against each other I expected that after some time the percentage would get very close to 50.0%. That did not happen; see the report at: http://www.top-5000.nl/tuning2.htm
Looking for feedback on whether this is normal.
A match of 3600 independent games with a 50% draw rate has a 2 SD of 1.2%, i.e. 8-9 Elo points, so your result at 40/15 is normal. If the fraction of color-reversed games that are truly independent of each other decreases, the statistical error gets smaller, going to 0 if all white/black games are identical.
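That 1.2% figure is easy to reproduce with a toy Monte Carlo (a sketch of my own, assuming each game is independent: a draw with probability 0.5, otherwise a fair coin-flip win/loss):

```python
import random
import statistics

def match_score(games=3600, draw_rate=0.5):
    """Fraction of points scored in one simulated match between equal engines."""
    total = 0.0
    for _ in range(games):
        if random.random() < draw_rate:
            total += 0.5          # draw
        elif random.random() < 0.5:
            total += 1.0          # win (a loss adds nothing)
    return total / games

random.seed(1)
scores = [match_score() for _ in range(1000)]
two_sd = 2 * statistics.pstdev(scores)
print(f"2 SD of the match percentage: {100 * two_sd:.2f}%")  # close to 1.2%
```

The analytic value is 2 * sqrt(0.125 / 3600) = 1.18%, and the simulated spread lands right on it; a single 3600-game match routinely finishing at 49% or 51% is therefore entirely normal.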
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing on time control versus nodes | ply
SMP engines involve some intrinsic randomness, due to the scheduling of the threads by the OS. So in general they won't repeat the same game when started from the same position. They would not even repeat the same game when you let them play by a total number of nodes. Only 1-CPU engines would do that.
Rather than just looking at the percentage of long reversed-color matches, you could look at the individual game pairs to see how 'bad' the problem really is. How many of the color-reversed games are truly identical, and how many radically different? They might all be different while you are still close to 50%, because games that are different can still have the same result, and when they have opposite results this is masked in the total: sometimes a win turns into a loss, other times a loss turns into a win, and the total stays the same.
I once tried a match with the single-CPU engines micro-Max 1.6 and Eden 0.0.13, which are completely deterministic and always finish a root iteration. So the only non-determinism comes from the decision to start a new iteration, when the previous one ends within the timing jitter of the moment the engine uses to decide this. About half of the 80 games were different, i.e. they started the same, but at some point one of the engines played a different move.
The clearest indicator of the 'problem' is how many moves on average you can play before the identical engines play a different move.
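Measuring that divergence point is easy to script. A minimal sketch of my own (not a real PGN parser; games are just lists of move strings, and the example moves are hypothetical):

```python
def first_divergence(game_a, game_b):
    """0-based index of the first move where two games differ, or the length
    of the shorter game if one is a prefix of the other. Averaging this over
    many repeated games measures how long identical engines stay in lockstep."""
    for i, (move_a, move_b) in enumerate(zip(game_a, game_b)):
        if move_a != move_b:
            return i
    return min(len(game_a), len(game_b))

# Two games identical for four moves, then one engine deviates:
g1 = ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5"]
g2 = ["e2e4", "e7e5", "g1f3", "b8c6", "f1c4"]
print(first_divergence(g1, g2))  # -> 4
```

Averaging `first_divergence` over all repeats of a position gives exactly the indicator described above: how many moves, on average, you get before the first different move.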
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing on time control versus nodes | ply
Another remark after reading your web page on this:
You would not have these problems if you used the WinBoard 'nps' feature, (later discussed here by Don Dailey under the header 'Nodes = Time'), which would allow the engine to play by nodes, rather than time, but would still allow it to freely allocate the nodes in its budget from one move to the other, just as it does in classical or incremental TC. E.g. you just play 400M nodes/40 moves, and the engine would decide if it allocates 10M nodes to each move, or whether some moves need 50M nodes while others can do with 1M nodes to make up for that.
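As an illustration of how such a nodes-as-time budget behaves like a classical clock, here is a toy allocator (my own sketch; the real allocation logic lives inside each engine):

```python
def nodes_for_move(nodes_left, moves_left, safety_pct=90):
    """Toy node-budget allocator mirroring a 40-moves/session clock: spend
    roughly an equal share of the remaining budget on each move, with a
    safety margin so one overshooting move cannot exhaust the whole budget."""
    return nodes_left * safety_pct // (100 * max(moves_left, 1))

# 400M nodes for 40 moves: the first move gets ~9M nodes. A move whose
# search stops early leaves its unused nodes for the remaining moves,
# exactly as unused seconds carry over on a real clock.
budget = 400_000_000
print(nodes_for_move(budget, 40))  # -> 9000000
```

The point is that, unlike a fixed 100,000-node-per-move cap, nothing here forces the engine to abandon a fail-low resolution: it can overspend on one move and recover on later ones, within the session budget.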
The real problem with playing by nodes is that you won't be detecting how expensive the changes you test are in terms of slowdown. So you would always have to measure the total time used by each version for the number of nodes you gave it, and afterwards compensate by hand for the difference with some empirical Elo-vs-TimeUse formula. Or play 'nodes-odds' games, (which WinBoard can also do, btw), where you carefully tune the node budget of each engine to equalize their time use.
Rebel wrote: When faced with a fail-low there will often not be enough time to find a better move. After 100,000 nodes it's just boom and the engine will play the bad move.
The same applies when the engine is in the process of finding a better move and the search is suddenly terminated due to the 100,000-node limit, so the better move is not played.
While these kinds of problems don't exist using regular time-control or fixed-depth testing, it's not unreasonable to assume these 2 disadvantages will be equally divided during a match of 4000 | 8000 | 16,000 games.
Note that the disadvantages you mention are not related to the use of nodes per se, but are characteristic of 'fixed maximum per move' depth limiters. They would also occur using a fixed maximum time per move (WB 'st' mode, UCI 'movetime').
-
- Posts: 6993
- Joined: Thu Aug 18, 2011 12:04 pm
Re: Testing on time control versus nodes | ply
bob wrote: Multiple things.
1. 3600 is a minimal number of games. How close to 50% did you get?
Read the link.
bob wrote: This is a good place for statistics rather than logic. If you flipped a coin 3600 times would you really expect 1800 heads and 1800 tails? I'd be utterly amazed if I got that and would suspect some sort of biased test.
I only partly agree, and here is why. First of all there are 3600 flip_a_coin moments, not 1800. Secondly, an average game contains 60-80 moves, of which several will be quite decisive to the outcome of the game, especially at these low depths; these (in my view) are flip_a_coin moments as well.
If you look at the last section of the page (Emulate) I think this is demonstrated.
bob wrote: I'll run a 3600 game test on my laptop, identical versions playing, and see what kind of number I get the first time around...
Will be interesting.
bob wrote: 3. I dislike nodes to measure improvement. It hides too much. For example, it equalizes every node whether or not they are actually equal in terms of speed and importance. It allows an engine to force the game into an area where it is less efficient, without any penalty, since the time difference is not counted.
I am aware of the disadvantages; I mention them myself. Nevertheless it kills the time-control whimsicality completely, just look at the sudden robustness of equal moves. With simple hardware (like a quad), nodes testing could be more effective than regular time control during the development period.
-
- Posts: 6993
- Joined: Thu Aug 18, 2011 12:04 pm
Re: Testing on time control versus nodes | ply
brianr wrote: Could you please clarify the "games" v "pairs".
Might you only be playing half the number of games?
A game and its reverse game I call a pair.
brianr wrote: How many ply were used for the fixed ply test?
I am using the ProDeo approach; I believe the base depth was 8 or 9.
-
- Posts: 6993
- Joined: Thu Aug 18, 2011 12:04 pm
Re: Testing on time control versus nodes | ply
hgm wrote: E.g. you just play 400M nodes/40 moves, and the engine would decide if it allocates 10M nodes to each move, or whether some moves need 50M nodes while others can do with 1M nodes to make up for that.
That's brilliant!
hgm wrote: The real problem with playing by nodes is that you won't be detecting how expensive the changes you test are in terms of slowdown. So you would always have to measure the total time used by each version for the number of nodes you gave it, and afterwards compensate by hand for the difference with some empirical Elo-vs-TimeUse formula. Or play 'nodes-odds' games, (which WinBoard can also do, btw), where you carefully tune the node budget of each engine to equalize their time use.
I keep statistics to cover that.
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Testing on time control versus nodes | ply
Well, WinBoard/XBoard has been supporting that for a long time, through the 'nps' command of WB protocol. When the engine receives 'nps 100000' it should convert its node count to seconds by dividing it by 100000 (e.g. in the routine that normally reads the clock), and otherwise everything can stay the same. The GUI will still send it time and otim commands in centiseconds, but these will be virtual centiseconds, and the engine can directly compare them to this virtual time it computes, as it always does.
WinBoard would use the node counts in the engine's thinking output to decrement the clock, (to make sure the clock measures virtual seconds), doing the same computation on it. So it would be important that the engine reliably reports its node count, and in particular doesn't fail to always print thinking output when it stops thinking. (Also when the thinking was interrupted by a time-out!) But that is really all that is needed.
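The conversion itself is trivial. A sketch of my own of the bookkeeping, assuming the convention described above: 'nps N' means N nodes count as one virtual second, and the WB clock is kept in centiseconds:

```python
def virtual_centiseconds(node_count, nps):
    """Convert a searched-node count into the virtual centiseconds that
    WinBoard's 'nps' mode works in: at 'nps 100000', 100000 nodes equal
    one virtual second, i.e. 100 virtual centiseconds."""
    return node_count * 100 // nps

# An engine told 'nps 100000' that has searched 250,000 nodes has used
# 2.5 virtual seconds (250 centiseconds) of its clock.
print(virtual_centiseconds(250_000, 100_000))  # -> 250
```

Both sides of the protocol do the same arithmetic: the engine compares this value against the virtual 'time'/'otim' it receives, and the GUI decrements the clock by the same amount based on the node counts in the thinking output.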