michiguel wrote: bob wrote: lkaufman wrote: Dann Corbit wrote: To me, it seems logical that self-testing is the best way to make your program improve against earlier versions of itself, and foreign testing is the best way to make your program improve against the programs you tested it against.
I remember Quark self tests that showed a big improvement, and then Anmon would (once again) give Quark a bloody nose, since it was a nemesis beyond the numerical difference for some reason.
I suspect that both types of testing have value and will produce different kinds of improvement.
Consider:
Program 'A' has bad king safety. We improve the pawn structure understanding of program 'A', and now 'A-prime' can beat the pants off of program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', B attacks our weak king, and so the improvement we see against this opponent will be much smaller.
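For a sense of scale, a 100 Elo gap corresponds to roughly a 64% expected score under the usual logistic Elo model. A minimal sketch of that standard conversion (the function name and numbers are just illustrative, not from the post above):

[code]
def expected_score(elo_diff):
    """Expected score (0..1) of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(expected_score(100))  # ~0.64: A-prime scores about 64% against A
[/code]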
Your answer implies that if the goal is to improve against foreign opponents, we should just do foreign testing. However, your example merely implies that rating improvements from self-testing exaggerate the "real" gains, which I know to be true. It does not suggest that gains from self-testing could be worthless or harmful against foreign opponents. My further question would therefore be: has anyone ever seen a program improvement based on self-play that turned out to be harmful against other opponents, based on a statistically significant sample in each case?
Yes. I reported this curious effect a year or two ago. I don't remember the specific eval term, but the idea was that I added a new term, so that A' had the term while A did not. And A' won with very high confidence that it was better. I then dropped it into our cluster testing approach, and it was worse.
We have seen several such cases where A vs A' suggests that a change is better, but then testing against a suite of opponents shows that the new term is worse more often than it is better.
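To put a number on "very high confidence": a self-play match is usually summarized by an Elo estimate and a likelihood of superiority (LOS), both computed from that single pairing only. A minimal sketch using the standard formulas (the win/loss/draw counts below are made up for illustration, not Bob's actual results):

[code]
import math

def elo_and_los(wins, losses, draws):
    """Elo difference and likelihood of superiority estimated from one match."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    # LOS: probability the true strength difference is positive,
    # using the normal approximation over decisive games.
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return elo, los

# Hypothetical A' vs A result that looks convincing on its own:
print(elo_and_los(wins=620, losses=520, draws=860))  # ~ +17 Elo, LOS ~ 99.8%
[/code]

Even a 99%+ LOS only says that A' beats A; it says nothing about how the new term fares against a different pool of opponents, which is exactly the trap described above.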
I believe this is the exception rather than the rule.
Miguel
The only time I test A vs A' any longer is when I do a "stress test" to make sure that everything works correctly at impossibly fast time controls, to detect errors that cause unexpected results. And I rarely do that kind of testing unless I make a major change in something (say parallel search) where I want to make sure it doesn't break something seriously. Game in 1 second (or less) plays a ton of games and usually exposes serious bugs quickly.
I did not intend to imply that self testing did not have value.
The point I attempted to make (and failed to get across) was that testing against yourself only shows whether the new program can beat the old one. Clearly, on average, a modified program that can beat its predecessor will probably be stronger.
However, the only evidence you have is against itself, and so you cannot mathematically project against other opponents. It is like testing a program on the SSDF list and then trying to project how that program would fare against humans. Probably the strongest programs will do better against humans, but we cannot know it for sure because we did not test it.
Similarly, if I have 8 opponents I want to beat, and I play 100,000 games against them, then I will know if my change can beat them at the exact conditions of testing (time control, memory, CPU, pondering, etc.).
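As a rough sanity check on what 100,000 games buys you statistically, here is a sketch using the normal approximation; the 50% draw rate and the even split across 8 opponents are assumptions for illustration:

[code]
import math

def score_error_bar(games, draw_rate=0.5):
    """Approximate 95% error bar on the measured score fraction after `games` games."""
    # Per-game score is 1 (win), 0.5 (draw) or 0 (loss); assume a 50% mean score.
    p_win = p_loss = (1.0 - draw_rate) / 2.0
    var = p_win * 0.25 + p_loss * 0.25  # (score - 0.5)^2 weighted; draws contribute 0
    stderr = math.sqrt(var / games)
    return 1.96 * stderr

total = 100_000
per_opponent = total // 8             # 12,500 games against each of the 8 opponents
print(score_error_bar(total))         # ~0.0022 -> roughly +/- 1.5 Elo overall
print(score_error_bar(per_opponent))  # ~0.0062 -> roughly +/- 4 Elo per opponent
[/code]

In other words, the overall score is pinned down to within a couple of Elo, but the per-opponent estimates are several times looser, and all of it applies only to the exact conditions tested.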
So, if you want to beat the top ten programs in the world at 40 moves in two hours and be sure about it, you will have to buy a few hundred high-end computers and let them play against each other around the clock.
Fortunately, nobody has the money to do that, since it would be boring if we knew the outcome of a contest before it happened.