fast testing NIT algorithm

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

fast testing NIT algorithm

Post by Don »

From time to time we will do fixed depth testing - which has a number of advantages as well as a number of disadvantages. The disadvantages make it not very practical for most of our testing but it's still sometimes useful.

The primary advantage is that if you pick a low depth you can execute many games per second, and it often returns a relevant result that is compatible with actual time control games. But not always. We sometimes use it as a very quick filter to determine whether we should continue. When used like this we don't require a new version to "win", but we want to see whether it loses badly at fixed depth, or whether it is unusually slow or fast - which our testing infrastructure instruments for us (we get average time-per-move information).

One disadvantage is that on a per-game basis it's not fair to one side - for example, in a position where one side is more cramped than the other, a given depth does not take the same amount of time to complete for the white player as for the black player. And in the endgame, a fixed depth test does not react to the much reduced branching factor.

Our custom tester is written in Java and is very efficient, so we can test at levels as fast as 2 seconds plus a small fraction of a second of Fischer increment - but as you go faster, internal overheads start to dominate the performance. For example, if a program takes a few milliseconds to start searching due to table initialization or other overheads, it is seriously handicapped at these ridiculously fast time controls. Houdini 3 cannot even run on our tester at time controls this fast and will lose games on time.

We also discovered that when running multiple matches on the same machine, it matters which binaries are running. If you are running a round robin between 3 programs and the tester is configured to keep all 4 cores busy and 2 of the programs use the same binary (with different configurations) these 2 programs have a huge advantage. So to make this test fair we are forced to make copies of binaries with different names only because we want to set the parameters differently.

The NIT method

So we are experimenting with a new variation of fixed node testing I call NIT, which stands for "Nodes Is Time". The way it works is that Komodo now supports an option where you can direct it to ignore its internal clock and instead consider 1 million nodes as 1 second in all its reporting and time accounting.

The tester also has to support this mode. The mode is used for Fischer or sudden death time controls, and as such the tester uses the program's reported node counts as the "time clock": if the program makes a move and consumes 1.5 million nodes, the tester views this as 1.5 seconds spent on the search and charges this against the program.
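To make that bookkeeping concrete, here is a minimal sketch of how a tester could do the node-clock accounting - in Java, since the tester is written in Java, but with illustrative names, not the actual tester code:

```java
// Minimal sketch of tester-side NIT bookkeeping (illustrative names, not
// the actual tester code). Assumes the engine reports a total node count
// for every search it completes.
final class NitClock {
    static final double NODES_PER_SECOND = 1_000_000.0; // 1M nodes == 1 "second"

    private double remainingSec;       // virtual seconds left on the clock
    private final double incrementSec; // Fischer increment, in virtual seconds

    NitClock(double baseSec, double incrementSec) {
        this.remainingSec = baseSec;
        this.incrementSec = incrementSec;
    }

    // Charge one move; nodesSearched is the node count the engine reported.
    // Returns false if the engine ran out of virtual time (lost on time).
    boolean chargeMove(long nodesSearched) {
        remainingSec -= nodesSearched / NODES_PER_SECOND; // 1.5M nodes -> 1.5s
        if (remainingSec < 0) return false;
        remainingSec += incrementSec; // Fischer increment after a legal move
        return true;
    }
}
```

A 2 second + 0.05 second Fischer game would be new NitClock(2.0, 0.05); a move that reports 1.5 million nodes is charged 1.5 virtual seconds, regardless of wall-clock time.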

In fixed node testing we get the exact same score after an even number of games if you test identical programs, and now with this system if you play 1000 games the score will be 50/50 - the same number of wins and draws for each side. It doesn't matter what is running in the background or even what computer it is running on!

So this has all the advantages of fixed depth testing, but without many of the disadvantages. The time control algorithm is fully active, so the games are played at a much more natural pace and we can trust the results a lot more than we would trust fixed depth results.

It's not perfect, however - it still cannot be 100% correlated with time control games. For one thing, if a change impacts the nodes per second, the result is not guaranteed to come out the same - so we have to monitor that just as we would with fixed depth games. Also, the nodes per second varies in different phases of the game, so this is not perfectly compatible with normal time control games.

So this is highly superior to fixed depth games, a compromise between time games and fixed depth games. We will probably use this in cases where we might normally do fixed depth games, and it will enable us to run matches such as sudden death games at 1/10 of a second (where 1/10 second is really 100,000 nodes), meaning the whole game must be completed before consuming 100,000 nodes.

The primary use case for this is to test similar binaries against each other where some small change is being tested - a change that has very little impact on nodes per second. Even if there is some impact, we do monitor the difference and would be alerted to it.

So one intriguing possibility is that this might enable us to run on different operating systems and different hardware with impunity - kind of a holy grail for us, since we have a number of machines of different sorts and sizes that we test on. To make this work, a single REFERENCE machine would be used to establish a handicap to be used in all the other tests. Let's say that we add "space" to our evaluation function and find that it slows down the program by 4% in nodes per second on our reference hardware - the one we are optimizing for. We can measure that very quickly using fixed depth games and have a very precise number in a few minutes, or it could be done with a large number of game-representative positions, which could be calculated in seconds or minutes.
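A sketch of how that reference-machine handicap could be applied (this is one plausible mechanism read off the paragraph above, not a confirmed Komodo procedure): the measured NPS ratio scales the virtual node rate, so a 4% slower build also pays 4% in NIT games.

```java
// Hypothetical handicap application (one plausible mechanism, not a
// confirmed Komodo procedure): scale the virtual node rate by the NPS
// ratio measured on the REFERENCE machine.
final class NitHandicap {
    static double handicappedNodeRate(double baseNodeRate,
                                      double referenceNps,
                                      double candidateNps) {
        return baseNodeRate * (candidateNps / referenceNps);
    }

    public static void main(String[] args) {
        // Example: reference build at 2.5M NPS, "space" build 4% slower at 2.4M NPS.
        double rate = handicappedNodeRate(1_000_000.0, 2_500_000.0, 2_400_000.0);
        System.out.printf("candidate node rate: %.0f nodes per virtual second%n", rate);
        // Prints 960000 - the slower build gets fewer nodes per virtual second,
        // mirroring the real-time penalty it would pay on a wall clock.
    }
}
```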
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: fast testing NIT algorithm

Post by Michel »

If you are running a round robin between 3 programs and the tester is configured to keep all 4 cores busy and 2 of the programs use the same binary (with different configurations) these 2 programs have a huge advantage. So to make this test fair we are forced to make copies of binaries with different names only because we want to set the parameters differently.
I discovered exactly the same thing.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: fast testing NIT algorithm

Post by Don »

Michel wrote:
If you are running a round robin between 3 programs and the tester is configured to keep all 4 cores busy and 2 of the programs use the same binary (with different configurations) these 2 programs have a huge advantage. So to make this test fair we are forced to make copies of binaries with different names only because we want to set the parameters differently.
I discovered exactly the same thing.
I don't know if it's that way with Windows; probably it is. But I know that Linux is very fussy about using resources wisely, which is why it performs so well. I'm sure the OS here is just being smart and keeping only one copy of a running binary in memory. I assume Windows would do the same in this case.

It's actually fairly difficult to get a perfectly fair test working. Sometimes you get different testers reporting completely different results and I don't quite trust any result that I didn't run myself for several reasons:

1. Most testers are at least somewhat biased. They report the results that please them the most and view the other results as anomalies that go unreported.

2. If the testing doesn't come out their way, they tinker with the conditions and introduce bias that way, even if unwittingly.

3. Sometimes the results are due to an unfair test even without the tester purposely making it that way. That has happened with us too (such as in that round robin case you noticed).

4. Occasionally a tester will take several versions of one very strong program and test them against 1 or more unique programs - not realizing how ridiculously biased that is. For example, if I want to see if Stockfish has passed Houdini but I'm not sure which recent dev version is best, I could run a gauntlet of Houdini against each of the past 5 versions. ONE of these Stockfish versions is likely to win, because it's equivalent to running 5 separate matches against Houdini and picking the result you like. Even if Houdini wins all of them you can report the BEST result and declare that to be the "true" picture. You might not even mention that the other 4 tests were run - i.e. "I ran version XYZ against Houdini and it was within 5 ELO!!!" (The sketch below gives a feel for the size of this effect.)
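For a rough feel of that size, here is a quick Monte Carlo sketch (in Java, with made-up numbers, not an actual test run): five exactly equal versions each play a 100-game match against the same opponent, and only the best score is kept.

```java
import java.util.Random;

// Quick Monte Carlo sketch of best-of-5 selection bias (made-up numbers,
// not an actual test run). All five "versions" are truly equal, with an
// expected score of exactly 50%, yet reporting only the best of the five
// 100-game matches systematically overstates their strength.
final class SelectionBias {
    public static void main(String[] args) {
        Random rng = new Random(42);
        int versions = 5, games = 100, trials = 100_000;
        double pWin = 0.25, pDraw = 0.50; // the rest are losses
        double sumBest = 0;
        for (int t = 0; t < trials; t++) {
            double best = 0;
            for (int v = 0; v < versions; v++) {
                double score = 0;
                for (int g = 0; g < games; g++) {
                    double r = rng.nextDouble();
                    score += r < pWin ? 1.0 : (r < pWin + pDraw ? 0.5 : 0.0);
                }
                best = Math.max(best, score / games);
            }
            sumBest += best;
        }
        // Comes out around 0.54 - roughly 30 ELO of pure selection bias.
        System.out.printf("mean best-of-5 score: %.3f%n", sumBest / trials);
    }
}
```

The exact number depends on the draw rate and match length, but the bias is always in the reporter's favor.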

For our internal testing it really doesn't matter because we use identical conditions so that even if there is bias in favor of a given program we are measuring progress, not absolutes.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: fast testing NIT algorithm

Post by hgm »

Don wrote:So we are experimenting with a new variation of fixed node testing I call NIT, which stands for "Nodes Is Time". The way it works is that Komodo now supports an option where you can direct it to ignore its internal clock and instead consider 1 million nodes as 1 second in all its reporting and time accounting.

The tester also has to support this mode. The mode is used for Fischer or sudden death time controls, and as such the tester uses the program's reported node counts as the "time clock": if the program makes a move and consumes 1.5 million nodes, the tester views this as 1.5 seconds spent on the search and charges this against the program.
Wow, you re-invented XBoard's nps mode!
XBoard protocol wrote:nps NODE_RATE

The engine should not use wall-clock time to make its timing decisions, but an own internal time measure based on the number of nodes it has searched (and will report as "thinking output", see section 10), converted to seconds through dividing by the given NODE_RATE. Example: after receiving the commands "st 8" and "nps 10000", the engine should never use more than 80,000 nodes in the search for any move. In this mode, the engine should report user CPU time used (in its thinking output), rather than wall-clock time. This even holds if NODE_RATE is given as 0, but in that case it should also use the user CPU time for its timing decisions. The effect of an "nps" command should persist until the next "new" command.
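For concreteness, a minimal sketch (in Java, with illustrative names - a real engine would wire this into its command parser and search loop) of how an engine might honor this pair of commands:

```java
// Minimal sketch of an engine honoring XBoard's "st" and "nps" commands
// (illustrative names; a real engine wires this into its parser and search).
final class NpsMode {
    private long nodeRate = -1;    // from "nps NODE_RATE"; -1 means not set
    private double maxMoveSec = 0; // from "st SECONDS"
    private long nodesSearched;    // incremented by the search

    void onCommand(String line) {
        if (line.startsWith("nps ")) {
            nodeRate = Long.parseLong(line.substring(4).trim());
        } else if (line.startsWith("st ")) {
            maxMoveSec = Double.parseDouble(line.substring(3).trim());
        }
    }

    // Polled from the search loop in place of a wall-clock check.
    boolean outOfTime() {
        if (nodeRate > 0) {
            // "st 8" + "nps 10000" -> never exceed 80,000 nodes on a move.
            return nodesSearched >= (long) (maxMoveSec * nodeRate);
        }
        return false; // nps 0 or unset: fall back to CPU/wall-clock checks
    }
}
```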
Modern Times
Posts: 3546
Joined: Thu Jun 07, 2012 11:02 pm

Re: fast testing NIT algorithm

Post by Modern Times »

Don wrote: I don't know if it's that way with Windows; probably it is. But I know that Linux is very fussy about using resources wisely, which is why it performs so well. I'm sure the OS here is just being smart and keeping only one copy of a running binary in memory. I assume Windows would do the same in this case.
No, I think you'll find Windows keeps 2 copies.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: fast testing NIT algorithm

Post by Don »

hgm wrote:
Don wrote:So we are experimenting with a new variation of fixed node testing I call NIT, which stands for "Nodes Is Time". The way it works is that Komodo now supports an option where you can direct it to ignore its internal clock and instead consider 1 million nodes as 1 second in all its reporting and time accounting.

The tester also has to support this mode. The mode is used for Fischer or sudden death time controls, and as such the tester uses the program's reported node counts as the "time clock": if the program makes a move and consumes 1.5 million nodes, the tester views this as 1.5 seconds spent on the search and charges this against the program.
Wow, you re-invented XBoard's nps mode!
This is not a new thing, so I don't take credit for inventing it - it has been done before. The idea came from a discussion I had over 20 years ago at one of the ICCA tournaments.

But in either case, UCI does not support "nps NODE_RATE", and I cannot easily use XBoard for testing anyway. I do like XBoard as a user interface - it's my GUI of choice because it's responsive and snappy - but it's horrible as an auto-player and very few people use it for that.

We have our own custom tester so that we are not dependent on what someone else might happen to build for us.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: fast testing NIT algorithm

Post by bob »

Most of us have tried this. But it still fails for the reasons given in the past; the most serious is that nodes are absolutely != time, since programs vary in NPS over the course of the game.

Take A vs B. A is a constant NPS program from opening to endgame. B gets faster in endgames, at least 2x faster, often more.

If you are testing A, any change that ends a game quickly looks better, because you avoid B's strength of faster speed. Any change that causes you to trade pieces looks bad, because you play into B's advantage. Even if the change was actually good.

This is an effect I have seen multiple times. It makes a good stress test, but it can often lead to invalid comparisons, unless you only test against programs that remain similar in NPS over the course of the game...

BTW why do you say "huge advantage"? All this does is reduce the memory footprint. One might see a small advantage since executing A1 also pre-loads A2 into the same cache (assuming A1 and A2 are the same program in different processes).
Last edited by bob on Thu Aug 22, 2013 10:54 pm, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: fast testing NIT algorithm

Post by bob »

Modern Times wrote:
Don wrote: I don't know if it's that way with Windows; probably it is. But I know that Linux is very fussy about using resources wisely, which is why it performs so well. I'm sure the OS here is just being smart and keeping only one copy of a running binary in memory. I assume Windows would do the same in this case.
No, I think you'll find Windows keeps 2 copies.
I'd bet both do the same: share the executable pages and the read-only pages, but obviously duplicate the read/write data pages. I know a couple of folks in the kernel group; when I get a chance I will poke 'em to see if they can answer (they can't answer all questions - Micro$oft considers some things secret).

I'm not sure why it would be a big performance boost with chess; it would just result in less memory used overall...
jundery
Posts: 18
Joined: Thu Mar 14, 2013 5:57 am

Re: fast testing NIT algorithm

Post by jundery »

bob wrote: I'd bet both do the same: share the executable pages and the read-only pages, but obviously duplicate the read/write data pages. I know a couple of folks in the kernel group; when I get a chance I will poke 'em to see if they can answer (they can't answer all questions - Micro$oft considers some things secret).

I'm not sure why it would be a big performance boost with chess; it would just result in less memory used overall...
The simple answer is: yes, pages are shared (when possible). This also helps with cache, as multiple processes can share space for instructions, reducing pressure.

You'll find a more detailed explanation in "Windows Internals" by Mark Russinovich, et al.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: fast testing NIT algorithm

Post by lucasart »

Don wrote:From time to time we will do fixed depth testing - which has a number of advantages as well as a number of disadvantages. The disadvantages make it not very practical for most of our testing but it's still sometimes useful.

The primary advantage is that if you pick a low depth you can execute many games per second, and it often returns a relevant result that is compatible with actual time control games. But not always. We sometimes use it as a very quick filter to determine whether we should continue. When used like this we don't require a new version to "win", but we want to see whether it loses badly at fixed depth, or whether it is unusually slow or fast - which our testing infrastructure instruments for us (we get average time-per-move information).

One disadvantage is that on a per-game basis it's not fair to one side - for example, in a position where one side is more cramped than the other, a given depth does not take the same amount of time to complete for the white player as for the black player. And in the endgame, a fixed depth test does not react to the much reduced branching factor.
I have never used fixed depth testing, and I can't think of any good use case for it. It has many problems, such as:
* branching factor
* even/odd depth asymmetry

For a quick test, to reject very bad patches, the only thing I've ever used is fixed nodes testing. That being said, it's still not reliable as a commit criterion (regardless of the statistical side of things, even a patch that passes all statistical tests at 40,000 nodes may fail them all at 10"+0.1", etc.)
Our custom tester is written in Java and is very efficient, so we can test at levels as fast as 2 seconds plus a small fraction of a second of Fischer increment - but as you go faster, internal overheads start to dominate the performance. For example, if a program takes a few milliseconds to start searching due to table initialization or other overheads, it is seriously handicapped at these ridiculously fast time controls. Houdini 3 cannot even run on our tester at time controls this fast and will lose games on time.
At extreme time controls, you start to have pipe I/O, task switching, and internal overheads that become significant. The only reliable way I can think of is to tell your tester to never flag the engine, and code the clock management inside the engine (not counting your initialisation phase). Only possible for self-testing, obviously.
We also discovered that when running multiple matches on the same machine, it matters which binaries are running. If you are running a round robin between 3 programs and the tester is configured to keep all 4 cores busy and 2 of the programs use the same binary (with different configurations) these 2 programs have a huge advantage. So to make this test fair we are forced to make copies of binaries with different names only because we want to set the parameters differently.
Never "keep all the cores busy". This is a huge mistake. I have 8 cores, and I always run 7 concurrent games, never 8. When I compare the results of 7 concurrent games to 8 concurrent games, for example the draw rate is much lower with 8 concurrent games, indicating a lot of interference in the results (lower quality games with more "random blunders" due to interference). Never use more than N-1 cores, even if the machine is "unused" (even when you do nothing, there is so much going on behind the scene, daemons, task switching, etc.)
The NIT method
The problem with "nodes is time" is that some patches represent a compromise between NPS and accuracy - typically when you add or change evaluation terms and slow down the program as a result. In fixed nodes (or NIT) testing, you do not account for the speed penalty, and only look at the accuracy gain.

You may argue that speed gains have diminishing returns, but maybe your patch also has diminishing returns (even with time measured in nodes). So the asymptotic behaviour is risky to predict.

That's why I prefer to stay with time.

I do some pre-selective testing in 3 steps, with aggressive early stopping. Here's the recipe I follow (except when I cut corners, which I often regret afterwards, when I have to bisect for the regression):

1/ 4000 nodes/move
* SPRT elo0=-5 elo1=0
* H0 accepted -> reject patch

2/ 5"+0.05"
* SPRT elo0=-5 elo1=0
* H0 accepted -> reject patch

3/ 10"+0.05"
* simplification patch: SPRT elo0=-5 elo1=0
* patch that adds code/complexity: SPRT elo0=0 elo1=5
* neutral patch (eg. changing a number in a table): SPRT elo0=-2.5 elo1=2.5
* In all cases, commit if, and only if, H1 is accepted.

This is really easy to automate with cutechess-cli and a little shell script around it (the LLR computation itself is sketched below).

This testing procedure handles the various aspects of testing:
* statistics
* scaling
* efficient use of resources
* risk/reward bias depending on patch type: in the long run, simplification patches are good, and complication patches are toxic -> less is more.

PS: alpha=beta=5% in all SPRT.
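For reference, here is a sketch of the LLR bookkeeping behind these stopping rules, using the common logistic approximation over W/D/L counts (a standard formulation, not the exact script I run):

```java
// Sketch of SPRT stopping, using the common logistic approximation over
// W/D/L counts (a standard formulation, not the exact script used here).
final class Sprt {
    static double logisticScore(double elo) {
        return 1.0 / (1.0 + Math.pow(10.0, -elo / 400.0));
    }

    // Approximate log-likelihood ratio of H1 (elo1) vs H0 (elo0).
    static double llr(long wins, long draws, long losses, double elo0, double elo1) {
        double n = wins + draws + losses;
        if (n == 0) return 0.0;
        double s = (wins + 0.5 * draws) / n;    // observed score
        double m2 = (wins + 0.25 * draws) / n;  // second moment (win=1, draw=0.5)
        double varOfMean = (m2 - s * s) / n;    // variance of the mean score
        double s0 = logisticScore(elo0), s1 = logisticScore(elo1);
        return (s1 - s0) * (2.0 * s - s0 - s1) / (2.0 * varOfMean);
    }

    public static void main(String[] args) {
        double alpha = 0.05, beta = 0.05;
        double upper = Math.log((1.0 - beta) / alpha); // ~ +2.94: accept H1, commit
        double lower = Math.log(beta / (1.0 - alpha)); // ~ -2.94: accept H0, reject
        double v = llr(1281, 2566, 1153, 0.0, 5.0);    // made-up counts
        System.out.printf("LLR=%.2f, reject if <= %.2f, commit if >= %.2f%n",
                          v, lower, upper);
    }
}
```

With alpha = beta = 5%, the match simply keeps running until the LLR crosses one of those two bounds.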

PPS: I always output the games PGN to /dev/null. Saves space, and a bit of overhead. But more importantly, watching games is a huge distraction, and causes people to make arbitrary and biased decisions (to stop early and accept, in a statistically unsound way, because they "like what they see", etc.)
Last edited by lucasart on Fri Aug 23, 2013 6:26 am, edited 1 time in total.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.