obvious/easy move

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: obvious/easy move - final results

Post by bob »

Don wrote:
AlvaroBegue wrote:I would be careful when using self-tests for anything having to do with time control and pondering. The number of ponder hits is likely much higher in self-tests than playing against a different opponent, and that will distort the results. Even with pondering off, the amount of useful information left in the transposition tables is probably also higher than it should be.

This also applies to Bob's test, of course.

Do you guys test against some reference opponents as well?
The test I just did was without pondering, but we do in fact test against reference opponents quite a bit, although this was a quick and dirty test.

I'm going to go into a bit of a rant here so please forgive me.

For almost 30 years in computer chess I have been getting warnings from people about avoiding self-testing, and although I consider it a well-meaning warning, nobody has ever offered any evidence other than their own superstition. I'm very close to putting it in the category of "myth" or "conventional wisdom", which by definition is not "exceptional" - it is usually untested and believed out of blind credulity or gut instinct, which is notoriously unreliable.

Nevertheless, due to my own superstitions, over those 30 years I have done a lot of mixed tests, simply because I know that program versions are not 100 percent transitive. And yet I have never seen intransitivity that cannot be explained by the error margins.

I'm always looking for an edge, so if it were a factor I would drop all self testing faster than a ton of bricks. I have 30 years of considering this issue and running tests.

What I DO see from time to time is that self testing can distort the value of a change, but even then it's not by much. That happens often enough that I consider it a real thing. But that is actually a good thing since the majority of the time I only want to know if a change is beneficial. I'm happy to have a magnifying glass.

I can understand why you bring it up because with pondering it is likely to be more than just superstition or speculation. I would expect a much bigger benefit with self-play for this particular thing.

Time control in general is a messy thing even without pondering - it is important to use your time to the best advantage without actually knowing how much time you have, since you do not know how long the game will last. There are several very important principles to consider: the importance of "front loading" your time (spend a lot more time on early moves), trying hard to finish an iteration you started, and others.
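The time-management principles described above (front-loading, budgeting against an unknown game length) can be sketched as a toy allocator. This is purely illustrative; the function, its constants, and its taper schedule are hypothetical and do not come from Crafty, Komodo, or any real engine:

```python
# Toy "front-loaded" time allocator illustrating the principles above.
# All names and constants here are hypothetical, not from any real engine.

def allocate_time(remaining_ms, moves_played, moves_to_go=None):
    """Budget milliseconds for the next move, spending more early on."""
    if moves_to_go is None:
        # Sudden death: we never know how long the game will last, so we
        # guess how many moves remain, shrinking the guess as the game ages.
        moves_to_go = max(20, 50 - moves_played // 2)
    base = remaining_ms / moves_to_go
    # Front-load: spend up to 2x the average on early moves, tapering to 1x.
    factor = 2.0 if moves_played < 10 else max(1.0, 2.0 - moves_played / 40.0)
    return base * factor

print(allocate_time(60000, 0))   # early move from a 60s clock: 2400.0 ms
print(allocate_time(60000, 60))  # late move: 3000.0 ms of what then remains
```

A real scheme would also reserve extra time to finish an iteration already in progress, which this sketch omits.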

So it's probably natural to imagine that there could be a strong self-test interaction (even without pondering) but I imagine this about almost every change we try. As it turns out, the way we do time control tests has been mostly against foreign programs due to our own superstitions in this regard, which always turn out to be unfounded.

Here is something that I happen to know about Rybka. Remember Rybka 3, which took the world by storm in a big way? ALL their testing was done with "incestuous" self-testing, primarily because they had no worthy opponents. It did not seem to have much of an impact on their progress.

I will say this. I know that you have some background in computer Go, and I also flirted a bit with computer Go, although not to the extent that you did. Computer Go may be more prone to the effects of intransitivity because, even combined with Monte Carlo Tree Search, Go programs are fairly pattern intensive, which means they could be more susceptible to opponents that target their weaknesses and exploit them without necessarily being superior in any other sense. But even that is just a theory on my part; I don't know to what extent it is true.

Just to be clear, I am not saying there is no intransitivity in computer chess - I am quite sure there is some. I just don't think it's much of a practical concern 99% of the time, and it's CLEARLY over-hyped, and not for any rational reason.

To succeed at computer chess you have to sort out the nonsense from the reality and do it with a fair amount of objectivity, otherwise it's like stepping on the brakes. If you are too "anal retentive" you become a cripple. If you are not careful enough you could be spinning your tires so you have to find the balance. You are a good engineer yourself so you know what I am talking about.
This is not quite "urban legend". I ran such a test years ago, and what I saw was an "overstated gain" when using self-play. If I saw +20 in self-play, it was ALWAYS much less than that against other opponents.

I can actually run this test again, and will do so. But it will take a bit of time as I don't have any cluster testing set up to run crafty vs crafty. But I can do something like Crafty-23.4 vs Crafty-23.5 and then each against the normal gauntlet. Will report back...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: obvious/easy move - final results

Post by bob »

Don wrote:
Rebel wrote:
Don wrote: I'm going to go into a bit of a rant here so please forgive me.
:lol:
For almost 30 years in computer chess I have been getting warnings from people about avoiding self-testing, and although I consider it a well-meaning warning, nobody has ever offered any evidence other than their own superstition. I'm very close to putting it in the category of "myth" or "conventional wisdom", which by definition is not "exceptional" - it is usually untested and believed out of blind credulity or gut instinct, which is notoriously unreliable.
Excellent rant.
There is something that I happen to know about Rybka. Remember Rybka 3, which took the world by storm in a big way? ALL their testing was done with "incestuous" self-testing, primarily because they had no worthy opponents. It did not seem to have much of an impact on their progress.
Given the lack of competitive opponents, incestuous testing was more or less forced.
Actually, you can get perfectly good results against any opponent within a couple of hundred ELO, and you can also give time odds, so this wasn't forced on them. If it were a real problem instead of an imaginary problem, they would have been forced the OTHER way, to test only against foreign opponents despite the obvious inefficiencies.
Just to be clear, I am not saying there is no intransitivity in computer chess - I am quite sure there is some. I just don't think it's much of a practical concern 99% of the time, and it's CLEARLY over-hyped, and not for any rational reason.
Please rant more :wink:
Ok :-)

Seriously, self-testing works beautifully and we supplement it with foreign testing on occasion but in general this is a sanity test. Of course we do want to know how we stand against other programs so I don't consider it too much of a waste when done judiciously. So far it has never surprised us.

The biggest argument in favor of self-testing is that it is a far more efficient use of CPU resources. Our bottleneck, and probably everyone else's, is the testing procedure itself. By testing incorrectly we effectively throw out half of our hardware! Would you trade in your quad for a 2-core machine? I don't think you would willingly do such a thing.

Imagine that you want to know if program B is an improvement over program A. If you forbid self-testing you have to run program A against 1 or more foreign opponents, then you have to run program B against those same opponents and compare the results. You have to run twice as many games to do this because you are not just testing your own program, but somebody else's program too. That stinks! You also have 2 sets of error margins to deal with. The rating program A achieved has error and so does program B. Comparing them is indirect. If you want to see which stick is the longest you should hold them side by side to get a more accurate answer. You could also use a tape measure on each separately, but the result will be less reliable, especially if the difference is small.
I don't follow your "more efficient use of resources."

In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.

In testing against others, the cpu time spent running those programs is wasted.

To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.

So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.

There is another risk that I have seen, and written about in the past. At one point, I re-vamped a couple of basic triggers in the king safety evaluation to make Crafty more aggressive. Quick self-testing showed this to be an improvement. The gauntlet showed it to be BAD. I think the problem was related to Crafty vs Crafty', where the "prime" version was more aggressive, and that was enough to overwhelm the older version. But other programs use different king safety concepts, and the extra aggression on the first attempt really backfired.

That turned me off to self-play for anything other than debugging.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

bob wrote: This is not quite "urban legend". I ran such a test years ago, and what I saw was an "overstated gain" when using self-play. If I saw +20 in self-play, it was ALWAYS much less than that against other opponents.
So what is the problem with the results being overstated? We don't care if the result is overstated, as we rarely run automated tests to measure how much improvement there is; we only use them in an attempt to show that there is some improvement.

If self-testing were a major intransitivity issue, then testing computer vs computer would be highly flawed too. Any two programs are 99% similar to each other compared to how similar they are to humans.

I can actually run this test again, and will do so. But it will take a bit of time as I don't have any cluster testing set up to run crafty vs crafty. But I can do something like Crafty-23.4 vs Crafty-23.5 and then each against the normal gauntlet. Will report back...
You should run this test again. I think you will find that if you improve Crafty in self testing it will always translate to an improvement against other opponents if you run this test to the point that it is statistically convincing.

Every time we thought we saw this intransitivity, it turned out that we were just looking at statistical noise, and running more games corrected the situation.

I already said I believe that intransitivity exists, so if you really wanted to show intransitivity you could if you worked very hard at it, perhaps rigging the test, searching for a very idiosyncratic way to take advantage of a Crafty weakness (but one which doesn't work against other programs), or some other means. But it's not the kind of thing I lose any sleep over, and it has not prevented rapid progress in Komodo.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

bob wrote: I don't follow your "more efficient use of resources."

In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.

In testing against others, the cpu time spent running those programs is wasted.

To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.

So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.

But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
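Don's two-error-margins point can be put in numbers. Here is a back-of-envelope sketch, assuming (as a rough rule of thumb, not an exact figure) a per-game score standard deviation of about 0.4 and ignoring draws:

```python
# Error bar of a measured match score after n games, assuming a per-game
# standard deviation of ~0.4 (a rough rule of thumb, not an exact figure).

def score_se(games, sigma=0.4):
    return sigma / games ** 0.5

x = 2000
head_to_head = score_se(x)                     # one error bar, measured directly
gauntlet_diff = (2 * score_se(x) ** 2) ** 0.5  # two error bars add in quadrature

print(head_to_head, gauntlet_diff)
# The indirect comparison is sqrt(2), i.e. ~41%, noisier for the same number
# of games per measurement.
```

This is the "two sticks" argument in formula form: subtracting two independently measured scores combines their errors in quadrature, while a head-to-head match measures the difference in one step.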
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: obvious/easy move - final results

Post by Adam Hair »

Don wrote:
bob wrote: I don't follow your "more efficient use of resources."

In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.

In testing against others, the cpu time spent running those programs is wasted.

To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.

So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.

But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
If you need x games to get +/- y Elo in self-testing, then you need x*√2 games to have the same error (+/- y Elo) when comparing gauntlet results.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

Adam Hair wrote:
Don wrote:
bob wrote: I don't follow your "more efficient use of resources."

In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.

In testing against others, the cpu time spent running those programs is wasted.

To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.

So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.

But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
If you need x games to get +/- y Elo in self-testing, then you need x*√2 games to have the same error (+/- y Elo) when comparing gauntlet results.
So basically, I am not willing to throw away 30% of my testing resources just to be anal retentive over this.

I had an older friend who passed a few years ago and he had a new but inexpensive vehicle that he took very good care of. He used to buy the most expensive gasoline for it, the highest grade, and he figured it was "worth it" for his cheap car. Over the years he was just pouring out money for this. It made him feel good though.

Gauntlet testing like this is like throwing away money. We cannot get enough CPU power to do the testing we want to do and I won't just throw away 30% of what I have just to be unduly and stupidly cautious like this.

I also had a friend who rebooted his Windows machine before he did anything new. I don't know if there are any studies on this, but rebooting is hard on a computer. His machines ALWAYS ended up having boot problems. I am convinced it was because he booted 30 or 40 times a day - just to be "safe." That cost him money. Bill Gates' legacy is that he was able to convince the masses that rebooting several times a day is not just normal but a good thing to do. My computers usually stay up for a few months at a time.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: obvious/easy move - final results

Post by AlvaroBegue »

It's a bit unfortunate that that rant was directed at me, because I agree with the vast majority of what you said. I generally think self-test is fine for most tests: You might get a larger measured benefit than what you would get against other opponents, but that just makes it easier to do statistics and detect the sign of the change, which is primarily what you care about.

My concern was specific to time-control issues. In particular, if you tune your time consumption when using pondering through self-tests, you are likely to overestimate the future time savings available through ponder hits, and then people might complain that your program is using time too aggressively (as you mentioned happens with Komodo). Perhaps your tuning is correct, and consuming more time than other programs is a perfectly fine thing to do, but I was trying to make you think about the issue, in case you hadn't. I see now that you have thought about it, so never mind.
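The bias Alvaro describes can be made concrete with a toy budget function. Everything here is hypothetical - the function name, the numbers, and the accounting are for illustration only, not Komodo's actual time-control logic:

```python
# Toy illustration of the bias described above: a per-move budget tuned with a
# self-play ponder-hit rate overspends when the real hit rate is lower.
# The function and its numbers are hypothetical, not any real engine's logic.

def budget_per_move(remaining_ms, moves_to_go, ponder_hit_rate):
    """Spend more per move when ponder hits are expected to refund think time."""
    base = remaining_ms / moves_to_go
    return base / (1.0 - ponder_hit_rate)

tuned_on_self_play = budget_per_move(60000, 30, 0.60)  # hit rate vs itself
safe_vs_foreign = budget_per_move(60000, 30, 0.40)     # hit rate vs others

print(tuned_on_self_play, safe_vs_foreign)
# 5000.0 vs ~3333.3 ms: the self-play-tuned budget overshoots by 50%.
```

Since an engine pondering against a copy of itself guesses the opponent's move far more often, a hit-rate parameter tuned in self-play is an overestimate against foreign opponents, and the budget comes out too aggressive.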
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

AlvaroBegue wrote:It's a bit unfortunate that that rant was directed at me, because I agree with the vast majority of what you said.
No it wasn't. It was just a good opportunity for me to make this point. In the post I mentioned that you probably understand what I am talking about.

Actually, I respect you as a program author and you have many times said insightful things on this forum and the computer-go list. It's a refreshing thing.

So if you think I was implying that you were stupid about this, it isn't how I feel.


I generally think self-test is fine for most tests: You might get a larger measured benefit than what you would get against other opponents, but that just makes it easier to do statistics and detect the sign of the change, which is primarily what you care about.

My concern was specific to time-control issues. In particular, if you tune your time consumption when using pondering through self-tests, you are likely to overestimate the future time savings available through ponder hits, and then people might complain that your program is using time too aggressively (as you mentioned happens with Komodo). Perhaps your tuning is correct, and consuming more time than other programs is a perfectly fine thing to do, but I was trying to make you think about the issue, in case you hadn't. I see now that you have thought about it, so never mind.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: obvious/easy move - final results

Post by hgm »

Adam Hair wrote:If you need x games to get +/- y Elo in self-testing, then you need x*√2 games to have the same error (+/- y Elo) when comparing gauntlet results.
Actually the factor is 2² = 4 rather than √2: each of the two gauntlets needs 2x games for the difference of the two results to reach the same error bar, so 4x games in total.
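The factor of 4 can be checked with a quick Monte Carlo sketch. This is simplified (win/loss only, fixed win probabilities, no draws or Elo conversion), so it only illustrates the games-needed arithmetic:

```python
import random

# Monte Carlo sketch of the games-needed arithmetic: a head-to-head match of
# x games gives roughly the same error bar as comparing two gauntlets of 2x
# games each (4x total). Win/loss only; draws and Elo conversion are ignored.

def match_score(p_win, games, rng):
    """Average score over `games` independent games won with probability p_win."""
    return sum(rng.random() < p_win for _ in range(games)) / games

def stdev(values):
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

rng = random.Random(42)
trials, x = 1000, 400

# Direct: measure the A-vs-B score in one x-game match.
direct = [match_score(0.55, x, rng) for _ in range(trials)]

# Indirect: measure A and B separately against a reference (2x games each),
# then take the difference of the two measured scores.
indirect = [match_score(0.60, 2 * x, rng) - match_score(0.50, 2 * x, rng)
            for _ in range(trials)]

print(stdev(direct), stdev(indirect))  # roughly equal error bars
```

With only 2x games per gauntlet (4x total) the two spreads come out nearly identical, confirming that matching the head-to-head error bar costs four times the games, not √2 times.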
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Post by Don »

AlvaroBegue wrote:It's a bit unfortunate that that rant was directed at me, because I agree with the vast majority of what you said. I generally think self-test is fine for most tests: You might get a larger measured benefit than what you would get against other opponents, but that just makes it easier to do statistics and detect the sign of the change, which is primarily what you care about.

My concern was specific to time-control issues. In particular, if you tune your time consumption when using pondering through self-tests, you are likely to overestimate the future time savings available through ponder hits, and then people might complain that your program is using time too aggressively (as you mentioned happens with Komodo). Perhaps your tuning is correct, and consuming more time than other programs is a perfectly fine thing to do, but I was trying to make you think about the issue, in case you hadn't. I see now that you have thought about it, so never mind.
When we work on time control we are particularly careful about how we test. It's not just testing against foreign programs but testing at different time controls too. If you are not careful, you can optimize for a specific time control to the detriment of other time controls.

My basic observation here is that I doubt it would have come out any differently had we used self-testing for this. Every time I think I have a candidate that absolutely requires foreign testing, it turns out not to be the case. But I agree with you completely that if anything seems a natural candidate for foreign testing, this would be one of them.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.