testing question

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

lkaufman
Posts: 3722
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA
Contact:

testing question

Post by lkaufman » Wed Jun 01, 2011 9:37 pm

There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing, as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
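As a rough illustration of the games-versus-confidence trade-off Larry mentions (a back-of-the-envelope sketch under the standard logistic Elo model, not anything from Komodo's actual methodology): the number of games needed to resolve an edge shrinks roughly with the square of the Elo difference, which is why an exaggerated self-test difference is statistically convenient.

```python
import math

def elo_to_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, z=1.96, draw_rate=0.5):
    """Very rough count of games for the 95% error bar (z = 1.96) on the
    measured score to be no larger than the expected edge over 50%.
    Draws shrink the per-game variance, so they are factored in crudely."""
    p = elo_to_score(elo_diff)
    var = p * (1.0 - p) * (1.0 - draw_rate)  # crude per-game score variance
    margin = p - 0.5
    return math.ceil(z * z * var / (margin * margin))

# Doubling the apparent Elo difference cuts the needed games roughly 4x:
print(games_needed(20), games_needed(40))
```

The `draw_rate` parameter is an assumption for illustration; real testing frameworks estimate variance from the actual win/draw/loss counts.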

Dann Corbit
Posts: 10101
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: testing question

Post by Dann Corbit » Wed Jun 01, 2011 9:47 pm

lkaufman wrote:There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing, as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
To me, it seems that self-testing is the best way to make your program improve against earlier versions of itself, and foreign testing is the best way to make it improve against the programs you test it against.

I remember Quark self-tests that showed a big improvement, and then Anmon would (once again) give Quark a bloody nose; for some reason it was a nemesis far beyond what the numbers predicted.

I suspect that both types of testing have value and will produce different kinds of improvement.

Consider:
Program 'A' has bad king safety. We improve the pawn-structure understanding of program 'A', and now 'A-prime' can beat the pants off of program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', it attacks our king, and so the improvement we see against this program will be much less.

F. Bluemers
Posts: 860
Joined: Thu Mar 09, 2006 10:21 pm
Location: Nederland
Contact:

Re: testing question

Post by F. Bluemers » Wed Jun 01, 2011 9:51 pm

lkaufman wrote:There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing, as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
One of the problems with self-testing is that in most, if not all, tournaments you are not playing against a previous version.

michiguel
Posts: 6388
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

Re: testing question

Post by michiguel » Wed Jun 01, 2011 9:57 pm

lkaufman wrote:There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing, as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
If the limit is 1000 games: 400 games of self-testing, 600 foreign with the most varied set of engines possible. You do not have to choose one or the other.

I found that certain changes are much better against one particular engine than against another.

In my last complete experiment, I had +40 Elo in self-testing (80k games, super fast = limited to 30k nodes/move), and also +40 Elo in self-testing over 10k games at 40 moves/20 sec. When I did foreign testing with 7 engines (1000 games each = 7000 games total) I got only +20 Elo. Against one engine it was only +4, and against some it was +30. All of them lower than self-testing.

I use self-testing to follow progress because it is more sensitive, and foreign testing to "confirm" it once I have accumulated a difference I am confident I can measure.

Miguel
PS: Given my resources, this is the most reasonable setup I could find.
PS2: This is the typical problem you face in experimental science when you have two methods: one is more sensitive, the other is more reliable. You will end up doing both.
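For reference, the Elo-with-error-bar arithmetic behind numbers like Miguel's +40/+20 can be sketched as follows (a generic sketch of the usual logistic-Elo conversion, not his actual tooling):

```python
import math

def score_to_elo(score):
    """Invert the logistic Elo model: score = 1 / (1 + 10^(-d/400))."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_estimate(wins, losses, draws, z=1.96):
    """Elo difference and a ~95% interval from raw match results."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n  # observed score fraction
    # sample variance of the per-game score (win=1, draw=0.5, loss=0)
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    se = math.sqrt(var / n)
    clamp = lambda x: min(max(x, 0.001), 0.999)
    return (score_to_elo(s),
            score_to_elo(clamp(s - z * se)),
            score_to_elo(clamp(s + z * se)))

# e.g. a 560-440 result over 1000 games comes out near +42 Elo
print(elo_estimate(560, 440, 0))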

lkaufman
Posts: 3722
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA
Contact:

Re: testing question

Post by lkaufman » Thu Jun 02, 2011 3:03 am

Dann Corbit wrote:To me, it seems that self-testing is the best way to make your program improve against earlier versions of itself, and foreign testing is the best way to make it improve against the programs you test it against.

I remember Quark self-tests that showed a big improvement, and then Anmon would (once again) give Quark a bloody nose; for some reason it was a nemesis far beyond what the numbers predicted.

I suspect that both types of testing have value and will produce different kinds of improvement.

Consider:
Program 'A' has bad king safety. We improve the pawn-structure understanding of program 'A', and now 'A-prime' can beat the pants off of program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', it attacks our king, and so the improvement we see against this program will be much less.
Your answer implies that if the goal is to improve against foreign opponents, we should just do foreign testing. However, your example merely implies that rating improvements from self-testing exaggerate the "real" gains, which I know to be true. It does not suggest that gains from self-testing could be worthless or harmful against foreign opponents. My further question would therefore be: has anyone ever experienced a program improvement based on self-play which turned out to be harmful against other opponents, based on a statistically significant sample in each case?

lkaufman
Posts: 3722
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA
Contact:

Re: testing question

Post by lkaufman » Thu Jun 02, 2011 3:09 am

Yes, gains against foreign programs are almost always less than predicted by self-testing. Furthermore, we already do what you say, a mix of the two types of testing. So I guess what I really want to know is: which of the two types of testing should we rely on more? It is unlikely that an exact 50-50 split between the two is optimal. When I worked on Rybka we relied 99% on self-testing, and obviously it worked well, but that was primarily because there were then no other programs close to Rybka's level. Now this is no longer the case, so the answer is not at all obvious.

bob
Posts: 20549
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: testing question

Post by bob » Thu Jun 02, 2011 5:07 am

lkaufman wrote:There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing, as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
I am not sure how testing A against A' requires fewer games than testing A vs. B, C, D, E and F. I've not seen evidence that the opponents matter, just the total number of games.

In fact, the opposite might be the case, because when testing A vs A', the difference is typically very small, which requires many more games to reach a reasonable error margin.
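Bob's point and Larry's are compatible: the Elo error bar for a given number of games is indeed roughly opponent-independent, but if self-play exaggerates the underlying difference, the same error bar resolves it sooner. A hedged sketch of the arithmetic (textbook binomial error-propagation, not anyone's actual testing framework; the draw rate is an assumed parameter):

```python
import math

def elo_error_bar(games, z=1.96, draw_rate=0.5):
    """Approximate +/- Elo resolution after `games` games, assuming a
    near-even match; depends only on the game count and draw rate."""
    se = z * math.sqrt(0.25 * (1.0 - draw_rate) / games)  # score-fraction error
    return -400.0 * math.log10(1.0 / (0.5 + se) - 1.0)

# After 1000 games the bar is around +/-15 Elo: a 10 Elo "real" gain is
# inside the noise, but the same gain doubled by self-play exaggeration
# is not.
print(elo_error_bar(1000))
```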

bob
Posts: 20549
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: testing question

Post by bob » Thu Jun 02, 2011 5:12 am

lkaufman wrote:
Dann Corbit wrote:To me, it seems that self-testing is the best way to make your program improve against earlier versions of itself, and foreign testing is the best way to make it improve against the programs you test it against.

I remember Quark self-tests that showed a big improvement, and then Anmon would (once again) give Quark a bloody nose; for some reason it was a nemesis far beyond what the numbers predicted.

I suspect that both types of testing have value and will produce different kinds of improvement.

Consider:
Program 'A' has bad king safety. We improve the pawn-structure understanding of program 'A', and now 'A-prime' can beat the pants off of program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', it attacks our king, and so the improvement we see against this program will be much less.
Your answer implies that if the goal is to improve against foreign opponents, we should just do foreign testing. However, your example merely implies that rating improvements from self-testing exaggerate the "real" gains, which I know to be true. It does not suggest that gains from self-testing could be worthless or harmful against foreign opponents. My further question would therefore be: has anyone ever experienced a program improvement based on self-play which turned out to be harmful against other opponents, based on a statistically significant sample in each case?
Yes. I reported this curious effect a year or two ago. I don't remember the specific eval term, but the idea was that I added a new term, so that A' had the term while A did not. And A' won with a very high confidence that it was better. I then dropped it into our cluster-testing approach, and it was worse.

We have seen several such cases where A vs A' suggests that a change is better, but then testing against a suite of opponents shows that the new term is worse more often than it is better.

The only time I test A vs A' any longer is when I do a "stress test" to make sure that everything works correctly at impossibly fast time controls, to detect errors that cause unexpected results. And I rarely do that kind of testing unless I make a major change in something (say parallel search) where I want to make sure it doesn't break something seriously. Game in 1 second (or less) plays a ton of games and usually exposes serious bugs quickly.

michiguel
Posts: 6388
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

Re: testing question

Post by michiguel » Thu Jun 02, 2011 5:43 am

bob wrote:
lkaufman wrote:
Dann Corbit wrote:To me, it seems that self-testing is the best way to make your program improve against earlier versions of itself, and foreign testing is the best way to make it improve against the programs you test it against.

I remember Quark self-tests that showed a big improvement, and then Anmon would (once again) give Quark a bloody nose; for some reason it was a nemesis far beyond what the numbers predicted.

I suspect that both types of testing have value and will produce different kinds of improvement.

Consider:
Program 'A' has bad king safety. We improve the pawn-structure understanding of program 'A', and now 'A-prime' can beat the pants off of program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', it attacks our king, and so the improvement we see against this program will be much less.
Your answer implies that if the goal is to improve against foreign opponents, we should just do foreign testing. However, your example merely implies that rating improvements from self-testing exaggerate the "real" gains, which I know to be true. It does not suggest that gains from self-testing could be worthless or harmful against foreign opponents. My further question would therefore be: has anyone ever experienced a program improvement based on self-play which turned out to be harmful against other opponents, based on a statistically significant sample in each case?
Yes. I reported this curious effect a year or two ago. I don't remember the specific eval term, but the idea was that I added a new term, so that A' had the term while A did not. And A' won with a very high confidence that it was better. I then dropped it into our cluster-testing approach, and it was worse.

We have seen several such cases where A vs A' suggests that a change is better, but then testing against a suite of opponents shows that the new term is worse more often than it is better.
I believe this is the exception rather than the rule.

Miguel

The only time I test A vs A' any longer is when I do a "stress test" to make sure that everything works correctly at impossibly fast time controls, to detect errors that cause unexpected results. And I rarely do that kind of testing unless I make a major change in something (say parallel search) where I want to make sure it doesn't break something seriously. Game in 1 second (or less) plays a ton of games and usually exposes serious bugs quickly.

Ferdy
Posts: 4109
Joined: Sun Aug 10, 2008 1:15 pm
Location: Philippines

Re: testing question

Post by Ferdy » Thu Jun 02, 2011 5:44 am

lkaufman wrote:There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing, as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
I prefer foreign testing for the simple reason that in most (if not all) tournaments I play against foreign engines. There is no point in beating my own previous version if I lose overall to foreign engines.

I also experienced self-testing showing an improvement, but when I tried foreign testing the supposed improvement simply vanished, so I abandoned self-testing.
