Nondeterministic Testing in Weak engines such as Eden 0.0.13

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by bob »

nczempin wrote:
bob wrote:...
You made some very valid comments.


I guess for now I just have to say: My engine is severely nps-challenged and time-to-depth-challenged compared to other engines at a similar level. The changes I am making are 99% just optimizations that should to a large extent make the engine stronger (except in those rare cases where looking deeper causes you to dismiss the better move that you would have found by looking deeper still).

So under this condition it is mainly a question of: Have I optimized enough to gain a whole ply on average (yes, my engine still has a lot of potential in that area)? And when that is the case, my tests are there to find out whether that one ply was actually significant (which it doesn't have to be).

I am not changing the eval, the move ordering, or introducing any known techniques such as null-move. All I'm doing is finding bottlenecks, changing Java objects into primitives, representing "blackness" with a bit instead of a value < 0, etc.

So I guess this factor pretty much makes the previous discussions on Eden meaningless, or at least any conclusions that anyone would like to draw from them.

And yet the question remains: Why does my approach still seem to work? Who is willing to test my hypothesis that each version of Eden is stronger than the preceding one, even under the conditions you propose to be necessary?


I would also like to make one thing clear: I am very well aware of statistical issues such as random fluctuations (not solely because I play Poker sometimes), especially the fact that the human mind by default seems to be unable to deal with them.

I'm normally the guy in that joke I'm sure you've heard who says: "No, you can't say that all sheep in Scotland are black, nor even that at least one sheep is; the only thing you can say is that at least one sheep in Scotland is black on at least one side."
I'm always the guy who points out that e.g. salesperson competitions are more or less meaningless, because the random fluctuations completely dominate any skill there might be.
I always shrug off "amazing" events that I can easily determine to be very possibly caused simply by randomness.

I even seem to be too far on the side of randomness, being very skeptical even of scientific articles that claim to have found this or that correlation and/or even causation.

I take an interest in Statistical Process Control even to the extent of owning (although not yet having worked through) Shewhart, W A (1939) Statistical Method from the Viewpoint of Quality Control, plus Deming and lots of other stuff that precedes Watts Humphrey's work on Software Engineering Processes.


So...


It feels weird when I'm being placed "on the other side" :-)
If all you are doing is dealing with performance issues, then no testing is necessary. Any increase in NPS will either make no difference, or will improve the program. Can't possibly hurt assuming you introduced no bugs.

But for eval/search/etc changes, much more care is needed or you will be throwing away good changes, keeping bad changes, all based on random results.
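How quickly random results swamp a small eval/search change can be made concrete with a quick binomial estimate. This is an illustrative sketch, not anything from the thread: `games_needed` is a hypothetical helper, and it assumes the simple win/loss model (draws ignored, independent games):

```python
import math

def games_needed(score_delta):
    """Games required so that a two-sigma error bar on the match score
    is smaller than score_delta (worst case p = 0.5, draws ignored)."""
    # standard error of the mean score: sqrt(p*(1-p)/n) <= sqrt(0.25/n)
    # require 2*sqrt(0.25/n) < score_delta  =>  n > 1/score_delta**2
    return math.ceil(1 / score_delta ** 2)

print(games_needed(0.05))  # resolve a 5% score change: 400 games
print(games_needed(0.02))  # resolve a 2% score change: 2500 games
print(games_needed(0.01))  # resolve a 1% score change: 10000 games
```

So a 1-2% improvement, the size of a typical single eval tweak, is simply invisible in a run of 100 games.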
nczempin

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by nczempin »

bob wrote:
If all you are doing is dealing with performance issues, then no testing is necessary. Any increase in NPS will either make no difference, or will improve the program. Can't possibly hurt assuming you introduced no bugs.

But for eval/search/etc changes, much more care is needed or you will be throwing away good changes, keeping bad changes, all based on random results.
Exactly. Any increase in nps will either make no difference or improve it. And at the moment this is basically all I'm doing. And my tests are there to test if I have done enough for the changes to have an effect at the time settings I've chosen against the engines I have selected.

I totally agree with your second paragraph, and I always have done so.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by bob »

hgm wrote:
bob wrote:A simple example: You play against program X (nearly equal) with your old version and score +50/-50 (ignoring draws completely for now). Then you play the same test with your new version and score +51/-49. Then you play against Rybka with the original and score +1/-99, and the new version scores +4/-96. That second result is far more significant than your first test, because even though you are getting killed by Rybka, you made a clear improvement. The problem with choosing nearly equal opponents is that it now takes 10x (or more) the number of games to detect an improvement that it does with much weaker and much stronger opponents added into the mix.
This is a very debatable statement / misleading example.

Yes, going from 1/99 to 4/96 is statistically somewhat more significant than going from 50-50 to 51-49. But not by orders of magnitude. The standard deviation of the number of wins (ignoring draws) in 100 games at a 1% win probability is sqrt(100*0.01*0.99) ≈ 1, while at 50% it is sqrt(100*0.5*0.5) = 5.

But what you say will never happen. To get a 3% better score around the 2% level requires an enormously larger rating increase than to get a 3% increase around 50%. The details depend on the rating model one uses, but in the popular 1/(1+10^(rating_diff/400)) model the slope of the score-vs-rating curve is ~12.5 times steeper near 50%. So an improvement that would raise your score from 1% to 4% against Rybka would most likely give an improvement of 37% against the formerly equal opponent. So you would not win by 51-49, but more something like 85-15. And that is more significant, even in the face of the larger statistical error.

Believing that a 1% improvement against an equal opponent would measurably improve your score against Rybka is just day-dreaming...
No it isn't. It comes from actual experience. If one program possesses a key bit of knowledge another does not, that key bit of knowledge exaggerates the difference between the two programs. This has been discussed many times. When I added outside passed pawns to Cray Blitz, it started drubbing commercial programs on ICC. It continued with Crafty. Commercial programs added the feature and the drubbing stopped. A simple piece of code made a big difference. If you add something that an equal opponent does not have, it might or might not help, depending on whether your eval/search is capable of causing you to reach positions where that feature is critical. But against an opponent that already has that key feature, again assuming it knows how to use it, it will use it to roll you until you add it, and the difference is more pronounced.

I'm not going to get into any more arguments about testing. I simply offer suggestions that are _not_ based on witchcraft and superstition. I claim to have played more test games than any single person on planet earth or beyond. A typical trial for me is over 20,000 games. I have run 8 of those this past week. And in doing that I have learned a lot of things I didn't know before having that ability.

If you think your current testing methodology works, more power to you. I no longer have to guess when I talk about testing, as I can reach the "n0" point quickly enough to see what the real truth is, rather than playing < n0 games and having to guess.

But, so far, nothing I have seen here is supported by the millions of games I have personally played over the past year or so, while everything I have said is clearly supported. You can easily ask the other people on team Crafty (Mike, Tracy and Peter), as they have been reviewing the results as I have run them, and we did lots of analysis on the randomness we saw and on what was required to be able to draw accurate conclusions.

This discussion has not reached that point of maturity yet. And since I am the only one with the hardware necessary to support this kind of testing, I doubt many will agree. Until they run the test.

As far as your math goes, it is simply 100% faulty. You are trying to use statistics against something that is not statistical in nature, namely the way changes modify program results. Computers are not human in any form or fashion. Elo is flawed with respect to computers. Discussing standard deviation in the above context is also flawed, because it assumes that results from A vs B and A vs C have some sort of transitivity property that is not there. I have added _many_ eval terms that had no real effect against A, but which worked great against B. One simple one was the bishop trapped at a2/h2/a7/h7. I could not begin to count the number of commercial-program victories Crafty stacked up until everyone started to evaluate that mistake. Yet Crafty without it did not play that much better against some opponents, as the way they played simply avoided the situations where it was important.

There is oh so much more to testing and evaluating than is being discussed in most of these posts... The commercial guys know about it. It is a big edge to test and evaluate _properly_ rather than relying on intuition / guesswork...
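hgm's arithmetic earlier in this exchange is easy to check mechanically. Here is a small sketch; the binomial SD and the logistic-slope formula are standard, and the specific percentages come from his example (note the slope ratio is proportional to score*(1-score) whichever base the logistic model uses, so it does not depend on the exp-vs-10^ choice):

```python
import math

def sd_wins(n, p):
    """Standard deviation of the number of wins in n independent
    games with win probability p (binomial model, draws ignored)."""
    return math.sqrt(n * p * (1 - p))

def slope(score):
    """Relative slope of the score-vs-rating-difference curve at a
    given expected score, for a logistic rating model."""
    return score * (1 - score)

# SD of the win count over 100 games at the two win rates discussed
print(round(sd_wins(100, 0.01), 2))   # ≈ 0.99, i.e. about 1
print(round(sd_wins(100, 0.50), 2))   # 5.0

# how much steeper the curve is near a 50% score than near 2%
print(round(slope(0.50) / slope(0.02), 1))   # ≈ 12.8, hgm's "~12.5"
```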
User avatar
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by hgm »

Well, there is no guess-work involved in the fact that when two Eden versions play each other 5 times, they play the same game all 5 times, move for move...

My experience is that when I play an engine that is riddled with holes against a much better opponent, plugging one of the holes has virtually no effect on the score. The opponent will simply exploit one of the other holes to win. Even when it was obvious to the spectator that before, it always won by exploiting the same hole (e.g. Rook on the 6th, or advancing passers). It then turns out that this was only because that was usually the first hole it encountered, and when you repair it, equality lasts a bit longer, until the hole second in line presents itself.

By testing against many different opponents (usually 24) I can usually avoid improvements having no effect, as not all opponents will have a blind spot for the hole I plugged. But the opponents that now find their progress blocked by this hole are unlikely to be equipped to exploit all my other holes instead (or they would not have been equal opponents).

I guess life in the lower to medium part of the rating scale is just unimaginably different from what you are used to.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by bob »

hgm wrote:Well, there is no guess-work involved in the fact that when two Eden versions play each other 5 times, they play the same game all 5 times, move for move...

My experience is that when I play an engine that is riddled with holes against a much better opponent, plugging one of the holes has virtually no effect on the score. The opponent will simply exploit one of the other holes to win. Even when it was obvious to the spectator that before, it always won by exploiting the same hole (e.g. Rook on the 6th, or advancing passers). It then turns out that this was only because that was usually the first hole it encountered, and when you repair it, equality lasts a bit longer, until the hole second in line presents itself.

By testing against many different opponents (usually 24) I can usually avoid improvements having no effect, as not all opponents will have a blind spot for the hole I plugged. But the opponents that now find their progress blocked by this hole are unlikely to be equipped to exploit all my other holes instead (or they would not have been equal opponents).

I guess life in the lower to medium part of the rating scale is just unimaginably different from what you are used to.
I don't think so. I've been in that part of the rating scale for a long time, but it doesn't mean I plan on staying down there forever. I at least now have a reliable way of answering "Is this change good or bad?" which means I will make some progress even if it is just based on luck, hitting the right eval value every now and again.

The worst thing that can happen is that you make a lucky guess on an eval term, but then get an unlucky test result and throw the idea out. I don't have that problem any longer.
Jan Brouwer
Posts: 201
Joined: Thu Mar 22, 2007 7:12 pm
Location: Netherlands

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by Jan Brouwer »

Hi Bob,

I understand that you have considerable hardware resources available for testing.
Can you give any general advice on how you would test on a single PC, let's say a quad-core processor with a time limit of about 20 hours (to allow for daily iterations)?
What time control (maybe several different ones?), how many different opponents, etc.?

So far I have done most of my testing at 20 seconds + 1 second/move against about 6 opponents using the Nunn starting positions, just to get a reasonable number of games in a few hours.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by bob »

Jan Brouwer wrote:Hi Bob,

I understand that you have considerable hardware resources available for testing.
Can you give any general advice on how you would test on a single PC, let's say a quad-core processor with a time limit of about 20 hours (to allow for daily iterations)?
What time control (maybe several different ones?), how many different opponents, etc.?

So far I have done most of my testing at 20 seconds + 1 second/move against about 6 opponents using the Nunn starting positions, just to get a reasonable number of games in a few hours.
Sorry, but I can't give you an answer. To get reliable/stable results, you need thousands of games, not tens or hundreds. My goal is to be able to say whether A or A' is better with a very high level of accuracy. I tried to use one PC for years and never found something workable...
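Why tens or hundreds of games are not enough can be illustrated by how slowly the two-sigma error bar on a match score shrinks with the number of games. This is a rough sketch under the usual binomial assumptions; draws and correlated opening effects in real matches would widen the bars further:

```python
import math

def two_sigma_error(n, p=0.5):
    """Two-sigma error bar on the observed match score after n games,
    for a true score p (binomial model, independent games)."""
    return 2 * math.sqrt(p * (1 - p) / n)

for n in (100, 1000, 10000, 25000):
    print(n, round(100 * two_sigma_error(n), 2))  # error bar in percent
```

At 100 games the bar is ±10%, so a 1-2% improvement drowns in noise; it only drops below ±1% in the tens of thousands of games, which is the scale of the 20,000-game runs mentioned earlier in the thread.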
Uri Blass
Posts: 10309
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by Uri Blass »

bob wrote:
Jan Brouwer wrote:Hi Bob,

I understand that you have considerable hardware resources available for testing.
Can you give any general advice on how you would test on a single PC, let's say a quad-core processor with a time limit of about 20 hours (to allow for daily iterations)?
What time control (maybe several different ones?), how many different opponents, etc.?

So far I have done most of my testing at 20 seconds + 1 second/move against about 6 opponents using the Nunn starting positions, just to get a reasonable number of games in a few hours.
Sorry, but I can't give you an answer. To get reliable/stable results, you need thousands of games, not tens or hundreds. My goal is to be able to say whether A or A' is better with a very high level of accuracy. I tried to use one PC for years and never found something workable...
I disagree.
There are things that you clearly can do:
1) You can use common sense to decide whether a change is good.
If the program has some weakness, you see that a change fixes that weakness, and the result in games is also good, then the change probably works.

2) You can use test suites in case you make changes to your search.
If you make your program faster with the same output, you can be practically sure that you have made an improvement, and you may need games only to verify that you have no bugs that happen only when you make more than one search.

The change to the search is not always exactly a speed improvement, but even in that case you can use test suites.

Note that I allow checks in the first 2 plies of the qsearch in Movei, but I have no special move generator that generates only captures and checks; I simply generate all moves and later find the checks among them.

I am thinking of adding a special move generator that generates only captures and checks.

This generator will not give me a pure speed improvement, because the order of the generated moves may be different relative to the normal move generator, but I think it will be possible to see whether there is an improvement based on test suites, where I may play games only to verify that there is no serious bug.

3) You may play games at a super fast time control for part of the changes that you try.

Based on my knowledge, part of the testing of Rybka is done simply with very fast games (game in less than 1 second).

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by bob »

Uri Blass wrote:
bob wrote:
Jan Brouwer wrote:Hi Bob,

I understand that you have considerable hardware resources available for testing.
Can you give any general advice on how you would test on a single PC, let's say a quad-core processor with a time limit of about 20 hours (to allow for daily iterations)?
What time control (maybe several different ones?), how many different opponents, etc.?

So far I have done most of my testing at 20 seconds + 1 second/move against about 6 opponents using the Nunn starting positions, just to get a reasonable number of games in a few hours.
Sorry, but I can't give you an answer. To get reliable/stable results, you need thousands of games, not tens or hundreds. My goal is to be able to say whether A or A' is better with a very high level of accuracy. I tried to use one PC for years and never found something workable...
I disagree.
There are things that you clearly can do:
1) You can use common sense to decide whether a change is good.
If the program has some weakness, you see that a change fixes that weakness, and the result in games is also good, then the change probably works.

Sorry, but that's no good. I can't count the number of "obviously good ideas" we have implemented this past year, but testing shows they are worse than the original. If you rely on intuition, you are going to make a _lot_ of wrong steps. Objective measurement is the key...


2) You can use test suites in case you make changes to your search.
If you make your program faster with the same output, you can be practically sure that you have made an improvement, and you may need games only to verify that you have no bugs that happen only when you make more than one search.
Again, wrong in my opinion. To solve test suites faster, just increase your check extensions, etc. But that won't make your program play better in real games. It will likely slow it down enough that it actually plays significantly worse. Chest is a good example: optimized for finding mates, it would make a horrible game player...


The change to the search is not always exactly a speed improvement, but even in that case you can use test suites.

Note that I allow checks in the first 2 plies of the qsearch in Movei, but I have no special move generator that generates only captures and checks; I simply generate all moves and later find the checks among them.

I am thinking of adding a special move generator that generates only captures and checks.

This generator will not give me a pure speed improvement, because the order of the generated moves may be different relative to the normal move generator, but I think it will be possible to see whether there is an improvement based on test suites, where I may play games only to verify that there is no serious bug.

3) You may play games at a super fast time control for part of the changes that you try.
Possibly. But "super-fast" games make tactical programs look better than they actually are, and they make positional programs look worse, because the relative difference in average search depth increases as the games speed up. Again, you can draw the wrong conclusion.

Based on my knowledge, part of the testing of Rybka is done simply with very fast games (game in less than 1 second).

Uri
If you only do an eval change, you can often get by with fast games. But if you don't run long games occasionally, you get surprised...
MartinBryant

Re: Nondeterministic Testing in Weak engines such as Eden 0.

Post by MartinBryant »

bob wrote:Sorry, but that's no good. I can't count the number of "obviously good ideas" we have implemented this past year, but testing shows they are worse than the original. If you rely on intuition, you are going to make a _lot_ of wrong steps. Objective measurement is the key...
I absolutely agree with Bob on this one.
I too have been surprised (and disappointed) when some 'obviously good' change actually turned out not to be so after a long objective test run.
It's actually a great eye-opener, because you then start to challenge your original assumptions to try to understand WHY it didn't work, until hopefully you get to that AHA! moment where you understand your own program better.