Testing on time control versus nodes | ply

Rebel · Post by **Rebel** » Tue Dec 17, 2013 1:49 pm

It seems "Node" testing is beneficial for me.

From: http://www.top-5000.nl/tuning2.htm

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After collecting a number of improvements via the Nodes testing method we pitched the new (1.87) version against the old (1.86) version and the scaling at increasing time controls is excellent.


 1.87 vs 1.86

 Level                      Games     Score    Run Time
 Nodes=100.000             16.000      54.0%    8 hours
 40 moves in 15 seconds     8.555      53.3%   28 hours
 40 moves in 30 seconds     3.018      53.1%   21 hours
 40 moves in 60 seconds     2.034      53.8%   27 hours

It seems we have created a more reliable testing method: with 16.000 games (or 32.000 for that matter) we can now measure 1-2 Elo improvements in a much faster setting.
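To put the table in perspective, the scores can be converted to Elo differences with rough error bars. This is only a sketch under stated assumptions: it reads "16.000" as 16,000 games, uses the standard logistic Elo formula, and takes a worst-case per-game standard deviation of 0.5 (no draws), because the draw rate is not given in the post. The function names are made up for illustration.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction (0 < score < 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

def score_margin_95(games, per_game_sd=0.5):
    """Approximate 95% half-width of the score fraction after `games` games.

    per_game_sd=0.5 is the worst case (no draws); this is an assumption,
    since the draw rate is not reported. Draws make the margin smaller.
    """
    return 1.96 * per_game_sd / math.sqrt(games)

# (level, games, score) copied from the table above
rows = [
    ("Nodes=100.000", 16000, 0.540),
    ("40 moves in 15 seconds", 8555, 0.533),
    ("40 moves in 30 seconds", 3018, 0.531),
    ("40 moves in 60 seconds", 2034, 0.538),
]

for level, games, score in rows:
    margin = score_margin_95(games)
    low = elo_from_score(score - margin)
    high = elo_from_score(score + margin)
    print(f"{level:24s} {elo_from_score(score):6.1f} Elo "
          f"(95% range {low:.1f} to {high:.1f})")
```

Because the margin shrinks only with the square root of the number of games, the rows with fewer games carry visibly wider ranges than the 16.000-game row.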

lucasart · Post by **lucasart** » Tue Dec 17, 2013 2:52 pm

Rebel wrote:It seems "Node" testing is beneficial for me.

From: http://www.top-5000.nl/tuning2.htm

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After collecting a number of improvements via the Nodes testing method we pitched the new (1.87) version against the old (1.86) version and the scaling at increasing time controls is excellent.
 1.87 vs 1.86

 Level                      Games     Score    Run Time
 Nodes=100.000             16.000      54.0%    8 hours
 40 moves in 15 seconds     8.555      53.3%   28 hours
 40 moves in 30 seconds     3.018      53.1%   21 hours
 40 moves in 60 seconds     2.034      53.8%   27 hours
It seems we have created a more reliable testing method: with 16.000 games (or 32.000 for that matter) we can now measure 1-2 Elo improvements in a much faster setting.

How does that demonstrate that node testing is better for you? The question you should really ask yourself is this: for a given testing throughput (number of games per minute), which method produces the highest-quality games? This is not a trivial question to answer, because you need to calibrate time, nodes, and plies so that they are comparable as precisely as possible. Then you can compare the draw rate in self-play (higher is better, as it indicates better-quality games).

I remember Don Dailey explained that he did such an experiment and found that node testing was significantly worse than depth testing.

Anyway, I think all methods have advantages, and disadvantages, depending on what you are testing. A few examples:
* you can generally test eval changes at fixed depth or fixed nodes
* it's out of the question to test a change affecting search extensions or reductions at fixed depth (obviously depth=4 with no LMR beats the crap out of depth=4 with LMR)
* nodes puts more emphasis on the endgame, compared to depth (you reach high depths in the endgame but your NPS should be in the same ballpark).

I generally use time, with a fixed ratio time=120*inc, like 6"+0.05" or 2.4"+0.02". I sometimes use depth and/or nodes testing to do some quick pre-filtering, but I never use depth only. I prefer to combine depth+nodes, example depth=8 and nodes=64000, meaning the search stops when any of the two conditions is met (max depth or max nodes). You never know, even at depth=8 you could fall into a search explosion position and get stuck on it for a while...
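The two rules in this post (base time = 120 x increment, and stopping on whichever of the depth or node limits is hit first) can be sketched as below. The function and parameter names are hypothetical and not taken from any engine or testing tool; only the numbers 120, depth=8 and nodes=64000 come from the post.

```python
def base_time_for_increment(increment_s, ratio=120):
    """Base time in seconds for a given increment, using time = ratio * inc."""
    return ratio * increment_s

def should_stop(depth_completed, nodes_searched, max_depth=8, max_nodes=64000):
    """Stop the search as soon as either limit is reached (depth OR nodes)."""
    return depth_completed >= max_depth or nodes_searched >= max_nodes

# The two time controls quoted in the post follow the same ratio.
for inc in (0.05, 0.02):
    print(f'{base_time_for_increment(inc):.1f}"+{inc}"')

# A search explosion at low depth is still cut off by the node cap.
print(should_stop(depth_completed=5, nodes_searched=64000))
```

The node cap acts as a safety net: a position that explodes before depth 8 still terminates after 64000 nodes.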

rbarreira · Post by **rbarreira** » Tue Dec 17, 2013 3:45 pm

I completely disagree with your premise that "repeatable testing is good testing".

You want as much randomness as possible in any AB test. Then you select a number of games (in advance), run the test and draw any conclusions only when all the games have run. Anything else is lying to yourself with statistics.

I think node-based testing can be OK for tuning evaluation parameters, but even then it's not perfect. It can favour entering areas where your engine is slower as Bob said, which would fail in time-based testing as the engine would be punished for it.
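The point about choosing the number of games in advance can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not in the post: two exactly equal engines, no draws, and a plain z-test at roughly the 95% level. It compares looking at the result once at the end with checking at regular intervals and stopping at the first "significant" reading.

```python
import math
import random

def false_positive_rates(trials, games, peek_every, rng):
    """Return (fixed_rate, peeking_rate) for two equal engines.

    fixed_rate:   fraction of matches that look significant at the end.
    peeking_rate: fraction that look significant at ANY intermediate check.
    `games` should be a multiple of `peek_every` so the last check is the end.
    """
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(trials):
        wins = 0
        significant_at_some_peek = False
        z = 0.0
        for n in range(1, games + 1):
            if rng.random() < 0.5:      # equal engines, no draws (assumption)
                wins += 1
            if n % peek_every == 0:
                z = (wins - n / 2) / (0.5 * math.sqrt(n))
                if abs(z) > 1.96:
                    significant_at_some_peek = True
        if abs(z) > 1.96:               # z now refers to the final check
            fixed_hits += 1
        if significant_at_some_peek:
            peeking_hits += 1
    return fixed_hits / trials, peeking_hits / trials

fixed, peeking = false_positive_rates(
    trials=1000, games=1000, peek_every=100, rng=random.Random(2013))
print(f"fixed length: {fixed:.3f}   stop at first significant peek: {peeking:.3f}")
```

Since the final check is also one of the peeks, the peeking rate can never be lower than the fixed-length rate; the simulation shows how much higher it gets.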

Rebel · Post by **Rebel** » Tue Dec 17, 2013 8:18 pm

rbarreira wrote:I completely disagree with your premise that "repeatable testing is good testing".

You want as much randomness as possible in any AB test. Then you select a number of games (in advance), run the test and draw any conclusions only when all the games have run. Anything else is lying to yourself with statistics.

By your logic we should stop playing reverse opening lines and instead play randomly from an opening book to maximize randomness.

I don't think so.

hgm · Post by **hgm** » Tue Dec 17, 2013 9:08 pm

Well, indeed, playing reversed lines hardly has any advantage whatsoever if most lines in the book are not terribly unbalanced. If the choice of line contributes only 10% extra winning probability for one side or the other (for every line in the book!), the contribution of line choice to the standard deviation of the match result (N games) will be only 10%/sqrt(N). The standard deviation due to the game result (with 1/3 W, D, and L) is about 40%/sqrt(N), i.e. 4 times larger. As the two are independent, they add quadratically, so a completely random choice of lines has only a sqrt(1 + 1/16) ~ 1.03 times larger standard deviation than reversed lines (or a book with perfectly balanced lines, where line choice is completely irrelevant).
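The arithmetic in this post can be checked in a few lines. The sketch below scores a win as 1, a draw as 0.5 and a loss as 0, and treats the 10% line effect as a per-game standard deviation, as the post does; the function names are only illustrative.

```python
import math

def result_sd(p_win=1/3, p_draw=1/3, p_loss=1/3):
    """Per-game standard deviation of the score (win=1, draw=0.5, loss=0)."""
    mean = p_win * 1.0 + p_draw * 0.5
    variance = (p_win * (1.0 - mean) ** 2
                + p_draw * (0.5 - mean) ** 2
                + p_loss * (0.0 - mean) ** 2)
    return math.sqrt(variance)

def sd_inflation(line_sd=0.10, game_sd=0.40):
    """Factor by which the total SD grows when an independent line effect
    is added quadratically to the game-result noise."""
    return math.sqrt(1.0 + (line_sd / game_sd) ** 2)

print(f"per-game SD with 1/3 W, D, L: {result_sd():.3f}")
print(f"SD inflation from random line choice: {sd_inflation():.4f}")
```

The 1/sqrt(N) factor is common to both contributions, so it cancels in the ratio and the inflation factor does not depend on the number of games.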

bob · Post by **bob** » Tue Dec 17, 2013 9:54 pm

Rebel wrote:
rbarreira wrote:I completely disagree with your premise that "repeatable testing is good testing".

You want as much randomness as possible in any AB test. Then you select a number of games (in advance), run the test and draw any conclusions only when all the games have run. Anything else is lying to yourself with statistics.
By your logic we should stop playing reverse opening lines and instead play randomly from an opening book to maximize randomness.

I don't think so.

He has a valid point. The Elo calculation assumes "independent trials". Clearly two games from the same position are not exactly independent. Most of us do the reverse-colors trick out of laziness: we don't want to take the time to cull unbalanced starting positions that a weaker program will win against a stronger program.

I've done some tinkering with this by having two programs play a set of positions with the weaker (about 400 Elo weaker) starting off as white. Any games it wins are suspect and I look at them. I then have it play black and do the same. I want to exclude positions that are going to produce the same result no matter what...

This would certainly give a better test set when you play each position once, but it will take some time to eliminate the bad ones. I have a bunch of positions, and I have run two tests...

(1) N positions played 2x with reversed colors.

(2) 2N positions played just once per position, but still alternating colors so that any program plays white and black equally.

After running those I have two sets of Elo numbers. The number from (2) is usually not the same as what I get from (1). But the DIFFERENCES are extremely close. And it is the difference in ratings that matters, not the absolute numbers.

More on this when I get back to finishing everything up...
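The two test designs described above, (1) and (2), can be sketched as game schedules. The engine labels "A" and "B" and the position names are placeholders; nothing here reflects the actual test harness used in the post.

```python
def reversed_color_schedule(positions):
    """Design (1): every position is played twice with colors swapped."""
    games = []
    for pos in positions:
        games.append((pos, "A", "B"))   # (position, white, black)
        games.append((pos, "B", "A"))
    return games

def single_play_schedule(positions):
    """Design (2): every position is played once, alternating colors so
    that each engine gets white and black equally often."""
    games = []
    for i, pos in enumerate(positions):
        white, black = ("A", "B") if i % 2 == 0 else ("B", "A")
        games.append((pos, white, black))
    return games

n = 4
design_1 = reversed_color_schedule([f"pos{i}" for i in range(n)])
design_2 = single_play_schedule([f"pos{i}" for i in range(2 * n)])
print(len(design_1), len(design_2))   # same number of games in both designs
```

Both designs yield the same number of games and the same color balance; they differ only in whether a position is reused.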

michiguel · Post by **michiguel** » Tue Dec 17, 2013 10:22 pm

rbarreira wrote:I completely disagree with your premise that "repeatable testing is good testing".

You want as much randomness as possible in any AB test. Then you select a number of games (in advance), run the test and draw any conclusions only when all the games have run. Anything else is lying to yourself with statistics.

I think node-based testing can be OK for tuning evaluation parameters, but even then it's not perfect. It can favour entering areas where your engine is slower as Bob said, which would fail in time-based testing as the engine would be punished for it.

You do not need randomness, you need pseudo-randomness. So, being able to repeat the whole set is not a problem. It is an advantage in case things need to be checked for artifacts. Not to mention that the tests are independent of the machine they are run on and can be combined/compared without any problem.

In addition, to answer Bob, a test based on nodes is not designed to "measure" progress, it is designed to "screen" it, which is not the same. As long as a confirmatory test is done later, there is no issue regarding how things were screened.

Miguel

rbarreira · Post by **rbarreira** » Tue Dec 17, 2013 11:54 pm

Rebel wrote:
rbarreira wrote:I completely disagree with your premise that "repeatable testing is good testing".

You want as much randomness as possible in any AB test. Then you select a number of games (in advance), run the test and draw any conclusions only when all the games have run. Anything else is lying to yourself with statistics.
By your logic we should stop playing reverse opening lines and instead play randomly from an opening book to maximize randomness.

I don't think so.

That depends on what your objective with the testing is. You might even want to have just one starting position, if that's what you want to optimize for. The games should be as independent as possible, as that's what the statistical models assume.

My point was that you should decide what to fix (like game rules and what engines are playing) and everything else should be as random as possible.

rbarreira · Post by **rbarreira** » Tue Dec 17, 2013 11:59 pm

michiguel wrote:
rbarreira wrote:I completely disagree with your premise that "repeatable testing is good testing".

You want as much randomness as possible in any AB test. Then you select a number of games (in advance), run the test and draw any conclusions only when all the games have run. Anything else is lying to yourself with statistics.

I think node-based testing can be OK for tuning evaluation parameters, but even then it's not perfect. It can favour entering areas where your engine is slower as Bob said, which would fail in time-based testing as the engine would be punished for it.
You do not need randomness, you need pseudo-randomness. So, being able to repeat the whole set is not a problem. It is an advantage in case things need to be checked for artifacts. Not to mention that the tests are independent of the machine they are run on and can be combined/compared without any problem.

In addition, to answer Bob, a test based on nodes is not designed to "measure" progress, it is designed to "screen" it, which is not the same. As long as a confirmatory test is done later, there is no issue regarding how things were screened.

Miguel

Pseudo-randomness is an attempt to approximate randomness. So sure, it can be good enough in specific cases.

bob · Post by **bob** » Wed Dec 18, 2013 5:39 am

michiguel wrote:
rbarreira wrote:I completely disagree with your premise that "repeatable testing is good testing".

You want as much randomness as possible in any AB test. Then you select a number of games (in advance), run the test and draw any conclusions only when all the games have run. Anything else is lying to yourself with statistics.

I think node-based testing can be OK for tuning evaluation parameters, but even then it's not perfect. It can favour entering areas where your engine is slower as Bob said, which would fail in time-based testing as the engine would be punished for it.
You do not need randomness, you need pseudo-randomness. So, being able to repeat the whole set is not a problem. It is an advantage in case things need to be checked for artifacts. Not to mention that the tests are independent of the machine they are run on and can be combined/compared without any problem.

In addition, to answer Bob, a test based on nodes is not designed to "measure" progress, it is designed to "screen" it, which is not the same. As long as a confirmatory test is done later, there is no issue regarding how things were screened.

Miguel

Nodes do give bad answers, for reasons I have repeated in the past.

If you play against another program, it is likely that both vary in terms of NPS over the course of the game. If you slow down in the endgame, you might try to steer the game toward endgames because you don't see the time penalty and get wrong indications. If your opponent speeds up, you might still go for endgames since he can't speed up in a pure node-limited search.

So you can end up steering the game in ways that SEEM to be better for you, when in reality they are not. I tried node testing quite a few years ago, but ran into all sorts of difficulties. Perhaps self-test might work there since both programs would then slow down or speed up identically (same program). But I am not a big fan of self-testing either, for other reasons.

Just my $.02...

BTW would you REALLY want to screen with a "bad filter"? You screen out good changes that look bad. If you let bad changes through, a normal test later would catch those, but not what you had already incorrectly screened out and discarded.
