fast testing NIT algorithm

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: fast testing NIT algorithm

Post by bob »

jundery wrote:
bob wrote: I'd bet both do the same. Which is to share the executable pages, and the read-only pages, but obviously duplicate the read/write data pages. I know a couple of folks in the kernel group, when I get a chance I will poke 'em to see if they can answer (they can't answer all questions, Micro$oft considers some things secret).

I'm not sure why it would be a big performance boost with chess; it would just result in less memory used overall...
The simple answer is yes, pages are shared (when possible); this also helps with cache, as multiple processes can share space for instructions, reducing pressure.

You'll find a more detailed explanation in "Windows Internals" by Mark Russinovich, et al.
Exactly what I'd expect. But I don't quite follow the "huge advantage" this somehow provides. It is a pretty small advantage in reality. The main benefit is avoiding having duplicate copies of data that can't possibly be different, occupying pages of memory that could be used for something else.
Modern Times
Posts: 3548
Joined: Thu Jun 07, 2012 11:02 pm

Re: fast testing NIT algorithm

Post by Modern Times »

bob wrote:
jundery wrote:
bob wrote: I'd bet both do the same. Which is to share the executable pages, and the read-only pages, but obviously duplicate the read/write data pages. I know a couple of folks in the kernel group, when I get a chance I will poke 'em to see if they can answer (they can't answer all questions, Micro$oft considers some things secret).

I'm not sure why it would be a big performance boost with chess; it would just result in less memory used overall...
The simple answer is yes, pages are shared (when possible); this also helps with cache, as multiple processes can share space for instructions, reducing pressure.

You'll find a more detailed explanation in "Windows Internals" by Mark Russinovich, et al.
Exactly what I'd expect. But I don't quite follow the "huge advantage" this somehow provides. It is a pretty small advantage in reality. The main benefit is avoiding having duplicate copies of data that can't possibly be different, occupying pages of memory that could be used for something else.
Thanks for the clarification. For the IPON list, Ingo runs three instances of the same Shredder GUI on the same machine, so a single exe for the engine being tested is invoked three times, with all three running in memory at the same time. My simplistic assumption was that three instances in task manager meant three totally separate unique processes with no overlap. That seems to be wrong from what you have said, but I can't comment on how much it affects engine performance. I think this practice is widespread among testers. I have a similar thing in my chess960 testing, but in that case it is separate but identically named exes that are invoked by different GUIs.
jundery
Posts: 18
Joined: Thu Mar 14, 2013 5:57 am

Re: fast testing NIT algorithm

Post by jundery »

bob wrote:
jundery wrote:
bob wrote: I'd bet both do the same. Which is to share the executable pages, and the read-only pages, but obviously duplicate the read/write data pages. I know a couple of folks in the kernel group, when I get a chance I will poke 'em to see if they can answer (they can't answer all questions, Micro$oft considers some things secret).

I'm not sure why it would be a big performance boost with chess; it would just result in less memory used overall...
The simple answer is yes, pages are shared (when possible); this also helps with cache, as multiple processes can share space for instructions, reducing pressure.

You'll find a more detailed explanation in "Windows Internals" by Mark Russinovich, et al.
Exactly what I'd expect. But I don't quite follow the "huge advantage" this somehow provides. It is a pretty small advantage in reality. The main benefit is avoiding having duplicate copies of data that can't possibly be different, occupying pages of memory that could be used for something else.
I can't defend the phrase "huge advantage", as "huge" is vague, but I can defend some advantage.

If you have 3 programs being tested in a round robin, with 2 using the same binary (different configuration), then ignoring the test harness and OS, the shared binary will get approximately 2/3 of the execution time (we are running A vs B, A vs B' and B vs B'). So it isn't unrealistic that the shared binary would have more instructions cached, reducing processor stalls and giving it an advantage. That said, I agree that the advantage is small, but if you are only trying to test for small ELO improvements it may be significant.
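To spell out the 2/3 figure (a back-of-envelope count, assuming both sides of a game get equal CPU time): the shared binary plays one side in A vs B, one side in A vs B', and both sides in B vs B', so

    shared-binary sides / total sides = (1 + 1 + 2) / (2 + 2 + 2) = 4/6 = 2/3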
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: fast testing NIT algorithm

Post by Don »

bob wrote:
jundery wrote:
bob wrote: I'd bet both do the same. Which is to share the executable pages, and the read-only pages, but obviously duplicate the read/write data pages. I know a couple of folks in the kernel group, when I get a chance I will poke 'em to see if they can answer (they can't answer all questions, Micro$oft considers some things secret).

I'm not sure why it would be a big performance boost with chess; it would just result in less memory used overall...
The simple answer is yes, pages are shared (when possible); this also helps with cache, as multiple processes can share space for instructions, reducing pressure.

You'll find a more detailed explanation in "Windows Internals" by Mark Russinovich, et al.
Exactly what I'd expect. But I don't quite follow the "huge advantage" this somehow provides. It is a pretty small advantage in reality. The main benefit is avoiding having duplicate copies of data that can't possibly be different, occupying pages of memory that could be used for something else.
I think I'm the one who said "huge advantage". Even though you don't think it should happen, it does. The canonical example is round robin testing with 3 players, where 2 of them are the same program (with different option settings.) The 2 programs are going to appear faster than the single program, and the effect is non-trivial. Maybe saying "huge" is an overstatement, because it remains to be defined what "huge" is, but I'll just say it's several ELO and easy to measure - enough to leave no doubt.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: fast testing NIT algorithm

Post by Don »

I would like to add that on a 4 core machine we test using 4 instances, not the 3 instances you feel are better. It was not an arbitrary decision either, but is based on something we observed long ago.

If you set the tester to 3 instances, one program will tend to be more crippled than the others - and I'm not sure why. One would expect all 3 to be evenly allocated resources, but that wasn't the case for us. If you run 4 instances they seem to be evenly allocated.

The problem likely has to do with processor affinity. With 3 instances running you have 6 programs running on 4 processors. When the OS initially assigns a processor to a program, it probably tries to pick the least utilized processor at the time. Even though the OS manages this dynamically, it tends to be hesitant to move a process from one processor to another.
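If affinity is the problem, the tester can take the guesswork away from the scheduler by pinning each engine process to its own core at launch. A minimal sketch, assuming Linux and Python's os.sched_setaffinity (the engine paths and core assignments here are made up; on Windows you'd use SetProcessAffinityMask instead):

import os
import subprocess

# Hypothetical engine commands - substitute the real ones.
engine_cmds = ["./engine_a", "./engine_b", "./engine_c", "./engine_d"]

procs = []
for core, cmd in enumerate(engine_cmds):
    proc = subprocess.Popen([cmd], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # Pin this engine to a single core so the scheduler never has to
    # decide whether migrating it is worth losing its warm cache.
    os.sched_setaffinity(proc.pid, {core})
    procs.append(proc)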

Maybe that has changed in recent years. I made this observation many years ago and processors and operating systems have improved a lot since then.

Still, we are reluctant to simply waste a processor. The machines we use are dedicated to running test games (I rarely use my desktop workstation for this), the background processes consume less than 1% of the resources, and I turn off most of the useless background daemons that I never use anyway.

Our tester also supports playing as many as 256 players in a round robin (and it's just a constant I could increase if I needed to), and one thing we DO NOT DO is try to keep a program running - we always kill the program between games, and I strongly believe that is the only right way to do this. It makes any advantage one program may get random. If you are testing 256 players on a 16 core system and never restarting the programs, you have to keep as many as one running copy of each player per core - that is 256 * 16 programs open for I/O - because any of the players could be playing a game on any core.

For running just 2 programs it's workable, but I still don't recommend it. Only if you are running at a level where you can play 10-50 games per second is the slowdown from restarting processes noticeable - at least for Komodo. On a quad we can still open the processes, play a game, close them, and get 20 games per second at a very fast fixed depth level such as 3 ply, so there is just no good reason to try to manage 256 * 16 = 4096 open connections or restrict the capabilities of the tester.
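As a sketch of what "kill the program between games" means in practice (my own illustration, not our tester's actual code; the UCI move-relaying loop is elided):

import subprocess

def play_one_game(white_cmd, black_cmd):
    # Launch both engines fresh so nothing (hash contents, cache warmth,
    # scheduler placement) carries over from a previous game.
    white = subprocess.Popen([white_cmd], stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, text=True)
    black = subprocess.Popen([black_cmd], stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, text=True)
    try:
        result = relay_moves(white, black)  # the usual UCI game loop
    finally:
        # Unconditionally kill both processes; the next game starts clean.
        white.kill()
        black.kill()
    return result

def relay_moves(white, black):
    ...  # elided: position/go/bestmove exchange, clock bookkeeping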
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: fast testing NIT algorithm

Post by Don »

lucasart wrote:
Don wrote: From time to time we will do fixed depth testing - which has a number of advantages as well as a number of disadvantages. The disadvantages make it not very practical for most of our testing, but it's still sometimes useful.

The primary advantage is that if you pick a low depth you can execute many games per second, and it often returns a relevant result that is compatible with actual time control games. But not always. We sometimes use it as a very quick filter to determine quickly whether we should continue. When used like this we don't require a new version to "win", but we want to see if it loses badly at a fixed depth, or whether it is unusually slow or fast - which our testing infrastructure instruments for us (we get average time per move information.)

One disadvantage is that on a per game basis it's not fair to one side - for example, in a position where one side is more cramped than the other, a given depth is not completed in the same amount of time for the white player as for the black player. And in the endgame, a fixed depth test does not react to the much reduced branching factor.
I have never used fixed depth testing, and I can't think of any good use case for it. It has many problems, such as:
* branching factor
* even/odd depth asymmetry

For a quick test, to reject very bad patches, the only thing I've ever used is fixed nodes testing. That being said, it's still not reliable as a commit criterion (regardless of the statistical side of things, even a patch that passes all statistical tests at 40000 nodes may fail them all at 10"+0.1", etc.)
For us it has proven to be very workable as a quick test. One advantage is that we almost immediately get a result - say if we are running 8 or 9 ply, for example, because those games go incredibly fast. Also, we immediately get performance numbers which show if the change affected the speed (nodes per second.) If the new version loses big, there is either a bug or the change is a really bad idea, and we have saved a LOT of time. If it's close we don't assume anything.

So the main point is that this test doesn't take much time and returns useful information. About 80% of the time it actually tells us which program is best, although we don't assume that (we always proceed to timed games.)

We also did our own study with fixed node testing and found that the performance is absolutely terrible. For the same exact amount of CPU time you lose something like 150 ELO or more. We believe that the quality of the test, no matter how fast it is run, is very important. If we are going to run 20 games per second we still want them to be the highest quality games we can generate.

But fixed node testing, which you seem to advocate, has the same problems you claim NIT (nodes is time) has, so why do you think it is superior to NIT? That is not a consistent viewpoint. We are experimenting with NIT because we get most of the advantages of fixed depth, fixed nodes AND regular time control games. Even though we cannot work around all the disadvantages, we have solved most of them.

Anyway, we view fixed nodes as the largest possible waste of resources and rejected it long ago. For example, running out of nodes in the middle of searching the first move happens most of the time, which seriously reduces the quality of the games, and the results are very random because of that mid-search cutoff - it's like randomly doing 7 or 8 ply, and it introduces a lot of noise.

A workaround we tried is to use fixed nodes but be a little smarter about when to stop. In other words, we might return a move early if our time is almost up at the end of an iteration, and delay returning a move if we are in the middle of an iteration. But why go to the trouble if you can do NIT (nodes is time) and get results much more consistent with time control games, from a test that also exercises all your time control rules? Of all the possible "quick and dirty" tests this seems to be the best.
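For concreteness, here is how I understand the NIT conversion - my own sketch, not Komodo's code, with NODES_PER_SECOND an assumed calibration constant:

NODES_PER_SECOND = 1_000_000  # assumed calibration for a given engine/machine

def ms_to_nodes(ms):
    # A millisecond of "time" is a fixed number of nodes, so a game is
    # deterministic yet still flows through the normal time-control math.
    return int(ms * NODES_PER_SECOND / 1000)

def allocate_budget_nodes(remaining_ms, increment_ms, moves_to_go=30):
    # The engine's ordinary time-allocation rule, answered in nodes.
    # The search tests nodes_searched >= budget exactly where it would
    # normally poll the wall clock, and after the move the game clock is
    # charged nodes_searched / NODES_PER_SECOND seconds.
    budget_ms = remaining_ms / moves_to_go + increment_ms
    return ms_to_nodes(budget_ms)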
Our custom tester is written in Java and is very efficient, so we can test at levels as fast as 2 seconds + some small fraction of a second Fischer time control - but as you go faster, internal overheads start to dominate the performance. For example, if a program takes a few milliseconds to start searching due to table initialization or other overheads, it is seriously handicapped at these ridiculously fast time controls. Houdini 3 cannot even run on our tester at time controls this fast and will lose games on time.
At extreme time controls, you start to have pipe I/O, task switching, and internal overhead that become significant. The only reliable way I can think of is to tell your tester to never flag the engine, and code the clock management inside the engine (not counting your initialisation phase). Only possible for self-testing, obviously.
We also discovered that when running multiple matches on the same machine, it matters which binaries are running. If you are running a round robin between 3 programs and the tester is configured to keep all 4 cores busy and 2 of the programs use the same binary (with different configurations) these 2 programs have a huge advantage. So to make this test fair we are forced to make copies of binaries with different names only because we want to set the parameters differently.
Never "keep all the cores busy". This is a huge mistake. I have 8 cores, and I always run 7 concurrent games, never 8. When I compare the results of 7 concurrent games to 8 concurrent games, for example the draw rate is much lower with 8 concurrent games, indicating a lot of interference in the results (lower quality games with more "random blunders" due to interference). Never use more than N-1 cores, even if the machine is "unused" (even when you do nothing, there is so much going on behind the scene, daemons, task switching, etc.)
The NIT method
Problem with "node is time" is that some patches represent a compromise between NPS and accuracy. Typically when you add or change evaluation terms, and you slow down the program as a result. In fixed nodes (or NIT) testing, you do not account for the speed penalty, and only look at the accuracy gain.

You may argue that speed gains have diminishing returns, but maybe your patch also has diminishing returns (even with time measured in nodes). So the asymptotic behaviour is hard to predict.

That's why I prefer to stay with time.

I do some pre-selective testing in 3 steps, with aggressive early stopping. Here's the recipe I follow (except when I cut corners, which I often regret afterwards, when I have to bisect for the regression):

1/ 4000 nodes/move
* SPRT elo0=-5 elo1=0
* H0 accepted -> reject patch

2/ 5"+0.05"
* SPRT elo0=-5 elo1=0
* H0 accepted -> reject patch

3/ 10"+0.05"
* simplification patch: SPRT elo0=-5 elo1=0
* patch that adds code/complexity: SPRT elo0=0 elo1=5
* neutral patch (e.g. changing a number in a table): SPRT elo0=-2.5 elo1=2.5
* In all cases, commit if, and only if, H1 is accepted.

This is really easy to automate with cutechess-cli and a little shell script around it (see the sketch below for the SPRT bookkeeping).

This testing procedure handles the various aspects of testing:
* statistics
* scaling
* efficient use of resources
* risk/reward bias depending on patch type: in the long run, simplification patches are good, and complication patches are toxic -> less is more.

PS: alpha=beta=5% in all SPRT.

PPS: I always output the games PGN to /dev/null. Saves space, and a bit of overhead. But more importantly, watching games is a huge distraction, and causes people to make arbitrary and biased decisions (to stop early and accept, in a statistically unsound way, because they "like what they see", etc.)
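For anyone who wants to reproduce the recipe, the SPRT bookkeeping itself is only a few lines. Here is a sketch of the usual trinomial log-likelihood-ratio test (the BayesElo-style model Fishtest popularized; I assume a fixed draw_elo here, where a careful implementation estimates it from the data):

from math import log

def bayeselo_probs(elo, draw_elo=200.0):
    # Win/draw/loss probabilities under the BayesElo model.
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + draw_elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + draw_elo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt(wins, draws, losses, elo0, elo1, alpha=0.05, beta=0.05):
    # Wald's sequential test: returns "H0", "H1" or "continue".
    w0, d0, l0 = bayeselo_probs(elo0)
    w1, d1, l1 = bayeselo_probs(elo1)
    llr = (wins * log(w1 / w0) +
           draws * log(d1 / d0) +
           losses * log(l1 / l0))
    if llr >= log((1.0 - beta) / alpha):
        return "H1"
    if llr <= log(beta / (1.0 - alpha)):
        return "H0"
    return "continue"

# Step 1 of the recipe: after each game, reject the patch as soon as
# sprt(w, d, l, elo0=-5, elo1=0) returns "H0".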
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: fast testing NIT algorithm

Post by bob »

Don wrote:
bob wrote:
jundery wrote:
bob wrote: I'd bet both do the same. Which is to share the executable pages, and the read-only pages, but obviously duplicate the read/write data pages. I know a couple of folks in the kernel group, when I get a chance I will poke 'em to see if they can answer (they can't answer all questions, Micro$oft considers some things secret).

I'm not sure why it would be a big performance boost with chess; it would just result in less memory used overall...
The simple answer is yes, pages are shared (when possible); this also helps with cache, as multiple processes can share space for instructions, reducing pressure.

You'll find a more detailed explanation in "Windows Internals" by Mark Russinovich, et al.
Exactly what I'd expect. But I don't quite follow the "huge advantage" this somehow provides. It is a pretty small advantage in reality. The main benefit is avoiding having duplicate copies of data that can't possibly be different, occupying pages of memory that could be used for something else.
I think I'm the one who said "huge advantage". Even though you don't think it should happen, it does. The canonical example is round robin testing with 3 players, where 2 of them are the same program (with different option settings.) The 2 programs are going to appear faster than the single program, and the effect is non-trivial. Maybe saying "huge" is an overstatement, because it remains to be defined what "huge" is, but I'll just say it's several ELO and easy to measure - enough to leave no doubt.
Let's define the circumstances.

1. How many total players in the event (this includes EVERY player, even if it is duplicated).

2. How many cpu cores?

3. Cache configuration (some have separate L1 and L2 for each core, some have separate L1 but shared L2, etc.)

For clarity, if you play 3 games, with A vs B, A vs C and A vs D, I call that 6 players, with one being run 3x in separate processes.

If there are more cores than games in progress, one can get a pretty good handle on what, if any, advantage there is. I'd assume that is the way anyone would test, obviously.

For speculation, let's take an example of A vs B and A vs C on my MacBook, which is a dual-core i7 with the usual Intel layout: private L1 instruction and data caches and 256K of L2 per core, and 4MB of shared L3.

Playing 2 games at once will almost certainly put AB on one core and AC on the other. The only real advantage I see is that A will likely have more cache hits than B or C, because A is being referenced twice as often and the L3 is shared. I suspect that is a measurable difference, but not a very big one.

I'd test it, but it would be a pain with just my Mac. I might be able to figure out a way to test on our cluster, but the problem there is that those boxes are all dual-chip boxes, which means the cache (none of the cluster nodes I use are i7s) is not shared between the two CPU chips. Harder to measure.
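If someone does want to measure it, the simplest experiment I can think of: run two fixed-depth searches concurrently from the same binary, then from two separate copies of it, and compare the NPS each engine reports. A rough sketch, assuming UCI engines that print "nps" in their info lines (the paths are made up):

import subprocess
import threading

def bench_nps(path, out, idx, depth=14):
    # Run one fixed-depth search and record the last "nps" value printed.
    p = subprocess.Popen([path], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, text=True)
    p.stdin.write("uci\nucinewgame\nposition startpos\ngo depth %d\n" % depth)
    p.stdin.flush()
    for line in p.stdout:
        tokens = line.split()
        if "nps" in tokens:
            out[idx] = int(tokens[tokens.index("nps") + 1])
        if line.startswith("bestmove"):
            break
    p.kill()

def concurrent_nps(path_a, path_b):
    # Search in both processes at the same time, as a real tester would.
    out = [0, 0]
    threads = [threading.Thread(target=bench_nps, args=(path_a, out, 0)),
               threading.Thread(target=bench_nps, args=(path_b, out, 1))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

# Same file (pages shared by the OS) vs. a byte-for-byte copy (not shared):
# print(concurrent_nps("./engine", "./engine"))
# print(concurrent_nps("./engine", "./engine_copy"))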