testing question


hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: testing question

Post by hgm »

Dann Corbit wrote:Consider:
Program 'A' has bad king safety. We improve the pawn-structure understanding of program 'A', and now 'A-prime' can beat the pants off program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', it attacks our weak king safety, and so the improvement we see against this program will be much smaller.
This is actually a more general problem, one that continues to exist even when testing against a mix of opponents: a chess engine is somewhat like a boat. As long as there are holes in the hull below the water line, it will sink. Plugging one hole has virtually no effect on the outcome as long as there are still other holes: you will still sink; it just takes longer to go under. But average game duration is not something you measure in testing, only the final result.

I noticed this very much when developing micro-Max. Initially it had a number of 'holes' that made it lose against TSCP: a tendency towards bad trades (N vs 3P, etc.), not recognizing the danger of an advancing passer, a tendency to take the King for a stroll, not preserving a Pawn shield in front of the King. Fixing each of those individually hardly had any effect on the test results. But when I plugged the last hole, the result shot up, and TSCP didn't stand a chance anymore.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: testing question

Post by michiguel »

lkaufman wrote:Yes, gains against foreign programs are almost always less than predicted by self-testing. Furthermore, we already do what you say, a mix of the two types of testing. So I guess what I really want to know is: which of the two types of testing should we mostly do?
I do enough self-testing, accumulating small improvements, until I reach a point at which I can measure a difference with confidence in foreign testing. For instance, if you can reliably measure a 5-point increase with foreign testing, you would first try to increase the strength of your program by at least 10 points with self-testing.

Miguel

It is unlikely that exactly a 50-50 distribution between the two types of tests is optimum. When I worked on Rybka we relied 99% on self-testing, and obviously it worked well, but this was primarily because there were then no other programs close to Rybka's level. Now that is no longer the case, so the answer is not at all obvious.
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: testing question

Post by hgm »

For small differences, there are potentially enormous gains to be had by focusing on the differences. One problem in testing, when you compare A and A' by playing them against B, C, D, etc., is that the opponents are not strictly deterministic. This makes a large fraction of the games independent through early deviations by the opponents, even when A and A' are identical and deterministic. And that is not what you want to measure at all; what you are interested in is the cases where A and A' move differently, and whether, on average, A or A' makes the better moves.

A smart way to approach this is to let A play games against B-Z, then let A' think on each position of these games (measuring the time it spends, making sure there is no systematic difference from the time A took, and if there is, correcting the measured Elo for it), to filter out the positions where A and A' move differently. If A and A' are very similar, that should be only a small fraction of the positions. All the other positions were just contributing noise to the total result, by offering opportunities for B-Z to deviate and change the outcome.
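A minimal sketch of such a filter, assuming the python-chess library as the UCI interface; the engine paths and the fixed think time are placeholders, and for simplicity both versions are queried on every position rather than only on A's own turns:

# Sketch: collect positions from A's games where A' would choose a different
# move. Assumes the python-chess library; the engine commands and the fixed
# think time are placeholders, not anyone's actual setup.
import chess
import chess.engine
import chess.pgn

def differing_positions(pgn_path, engine_a_cmd, engine_aprime_cmd, think_time=1.0):
    positions = []
    with chess.engine.SimpleEngine.popen_uci(engine_a_cmd) as eng_a, \
         chess.engine.SimpleEngine.popen_uci(engine_aprime_cmd) as eng_ap, \
         open(pgn_path) as pgn:
        while (game := chess.pgn.read_game(pgn)) is not None:
            board = game.board()
            for played in game.mainline_moves():
                limit = chess.engine.Limit(time=think_time)
                # For simplicity both versions are asked here; in practice you
                # would only query A' on the positions where it was A's turn,
                # and compare its choice with the move A actually played.
                move_a = eng_a.play(board, limit).move
                move_ap = eng_ap.play(board, limit).move
                if move_a != move_ap:
                    positions.append(board.fen())
                board.push(played)
    return positions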

Then the only thing left to do is determine whether the A or the A' moves were better. You can do that by playing games from those positions. If A and A' differed on a move in a certain position from a game against B, you can also let them play that position against C-Z. You don't need a large number of games for this at all:

If the A moves score on average 5% better than the A' moves, you only need 400 games to establish this with 95% confidence (for all positions together!). If the positions with different moves came from only 5% of the games, the 5% better score in those positions would translate to only a 0.25% better score in total. To measure such an improvement through brute force, you would need 160,000 games.
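The arithmetic behind those numbers, as a small sketch; it assumes the worst-case per-game standard deviation of 0.5 points and a two-sigma (roughly 95%) criterion:

# Sketch of the sample-size arithmetic above. Assumes the worst-case
# per-game standard deviation of 0.5 points and a two-sigma (~95%) criterion.
def games_needed(score_edge, sigma_per_game=0.5, z=2.0):
    # Smallest n for which z * sigma / sqrt(n) <= score_edge.
    return (z * sigma_per_game / score_edge) ** 2

print(games_needed(0.05))    # 5% edge on the filtered positions -> 400 games
print(games_needed(0.0025))  # diluted to 0.25% over all games -> 160,000 games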
Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

Re: testing question

Post by Kempelen »

I play 'foreign'-type tourneys against 25 opponents, 160 games each, 4000 games in total. One of the engines is the previous version of my engine, so within the tournament I have a mini-tournament of 160 self-play games, which lets me know whether I am improving over the previous version. 160 games are not a lot, but enough to see differences across repeated tournaments.
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: testing question

Post by lkaufman »

[quote="bob"I am not sure how testing A against A' requires fewer games than when testing A vs B, C, D, E and F. I've not seen where the opponents matter, just the total number of games.

In fact, the opposite might be the case, because when testing A vs A', the difference is typically very small, which requires many more games to reach a reasonable error margin.[/quote]

I don't understand this comment. If A' is one Elo better than A, it will take a huge number of games to prove this regardless of whether they play each other or foreign programs. My testing has always shown that it takes more games to prove an improvement with foreign testing than with self-testing. The argument for foreign testing must be that self-testing just doesn't correlate highly enough with it, as you suggest in another response here.
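To put a rough number on "huge": under the usual logistic Elo model a 1 Elo edge is only about a 0.14% score advantage, so a two-sigma result takes on the order of half a million games. A sketch of that conversion, assuming a worst-case per-game standard deviation of 0.5 points:

# Sketch: games needed to confirm a given Elo edge at roughly 95% confidence.
# Assumes the standard logistic Elo model, a worst-case per-game standard
# deviation of 0.5 points, and a two-sigma criterion.
def expected_score(elo_diff):
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_for_elo_edge(elo_diff, sigma=0.5, z=2.0):
    edge = expected_score(elo_diff) - 0.5
    return (z * sigma / edge) ** 2

print(round(games_for_elo_edge(1)))   # about 480,000 games for a 1 Elo edge
print(round(games_for_elo_edge(10)))  # about 4,800 games for a 10 Elo edge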
uaf
Posts: 98
Joined: Sat Jul 31, 2010 8:48 pm
Full name: Ubaldo Andrea Farina

Re: testing question

Post by uaf »

I use self-testing when I want to test changes to the search function; otherwise I use "foreign" testing (usually 16k games against 4 engines at 10"+0.1").
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: testing question

Post by mhull »

hgm wrote:For small differences, there are potentially enormous gains to be had by focusing on the differences. One problem in testing, when you compare A and A' by playing them against B, C, D, etc., is that the opponents are not strictly deterministic. This makes a large fraction of the games independent through early deviations by the opponents, even when A and A' are identical and deterministic. And that is not what you want to measure at all; what you are interested in is the cases where A and A' move differently, and whether, on average, A or A' makes the better moves.

A smart way to approach this is to let A play games against B-Z, then let A' think on each position of these games (measuring the time it spends, making sure there is no systematic difference from the time A took, and if there is, correcting the measured Elo for it), to filter out the positions where A and A' move differently. If A and A' are very similar, that should be only a small fraction of the positions. All the other positions were just contributing noise to the total result, by offering opportunities for B-Z to deviate and change the outcome.

Then the only thing left to do is determine whether the A or the A' moves were better. You can do that by playing games from those positions. If A and A' differed on a move in a certain position from a game against B, you can also let them play that position against C-Z. You don't need a large number of games for this at all:

If the A moves score on average 5% better than the A' moves, you only need 400 games to establish this with 95% confidence (for all positions together!). If the positions with different moves came from only 5% of the games, the 5% better score in those positions would translate to only a 0.25% better score in total. To measure such an improvement through brute force, you would need 160,000 games.
This gets me thinking about other automated statistics that could be collected as games are played, recorded either in log files or gleaned from a PGN analyzer. For whatever number of games: what was the ponder-hit percentage against each opponent, as White and as Black; all the stats we can think of per ECO opening code; what percentage of wins/losses involved unbalanced trades (e.g. two rooks for a queen), classified endings, middle-game mates, etc.

One would need to test which kinds of stats are stable. Maybe consistent patterns would be common to both large and small game samples within the win and loss columns individually. If so, progress might be measured safely with smaller samples of games and less emphasis on win/loss. Wins and losses might be given a weighting factor according to whether the game was close or a wipe-out, a short mate or a long endgame, material even throughout or uneven, pawns up or down and at what move or what fraction of the game. A lot of possibilities for measurement and weighting.

There is a lot of data in PGNs and log files that is perhaps going to waste and just waiting to be mined.
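A minimal sketch of that kind of PGN mining, assuming the python-chess library; the particular statistics collected here (results per ECO code, a two-rooks-versus-queen imbalance flag) are only illustrations:

# Sketch: mine a PGN file for per-ECO results and a simple material-imbalance
# flag. Assumes the python-chess library; the statistics chosen here are
# only examples of what could be gathered.
import collections
import chess
import chess.pgn

def mine_stats(pgn_path):
    results_per_eco = collections.Counter()
    imbalance_games = 0
    with open(pgn_path) as f:
        while (game := chess.pgn.read_game(f)) is not None:
            eco = game.headers.get("ECO", "???")
            result = game.headers.get("Result", "*")
            results_per_eco[(eco, result)] += 1

            board = game.board()
            seen_imbalance = False
            for move in game.mainline_moves():
                board.push(move)
                for side in (chess.WHITE, chess.BLACK):
                    # One side has two rooks and no queen, the other a queen
                    # and no rooks: a classic unbalanced trade.
                    if (len(board.pieces(chess.ROOK, side)) >= 2
                            and not board.pieces(chess.QUEEN, side)
                            and board.pieces(chess.QUEEN, not side)
                            and not board.pieces(chess.ROOK, not side)):
                        seen_imbalance = True
            if seen_imbalance:
                imbalance_games += 1
    return results_per_eco, imbalance_games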
Matthew Hull
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: testing question

Post by marcelk »

F. Bluemers wrote:One of the problems with self-testing is that in most, if not all, tournaments you are not playing against a previous version.
Hypothetically it is possible that your version 4.1 is playing in the same rating list as your previous version 3. And if your version 3 was king of the hill for a while and available to the public, it is possible that somebody has tuned his own engine so as to mimic your evaluation and search as closely as possible. (Hypothetically, of course.)

I verify changes in a small pool of different players. Only when it is not clear whether one version is better than the other do I pit them directly against each other.
Antonio Torrecillas
Posts: 90
Joined: Sun Nov 02, 2008 4:43 pm
Location: Barcelona

Re: testing question

Post by Antonio Torrecillas »

I use self-testing when I develop a new feature.
First I program a naive version of the feature.
When it passes the evaluation-symmetry test and a few others,
I make two versions which differ in the eval parameter: +V and -V.
I run a self-test. In this usefulness test I see whether the new feature really adds knowledge
(whether it helps to win some games, disregarding the performance cost).
If the feature is useful, I make an informed guess for the parameter
(this "seek the feature" versus "avoid the feature" test looks like a Monte Carlo evaluation of the feature parameter).
Then I do an integration test: I try to determine how it behaves with the rest of the engine.
This time I do a self-test of the previous reference engine versus the new one.
It's optimization time: studying synergy/redundancy with the other evaluation terms.
I try to answer: does this new knowledge compensate for the computing effort?
If the new engine is stronger (>90% probability; see the sketch below), it becomes the new development reference.

Now it is time for foreign testing. If it is stronger by 100 Elo, it becomes a new alpha release.
Then it is time for fine tuning...

There is no single test that answers all questions.
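A sketch of how the ">90% probability" criterion can be checked from a self-test result, using the usual likelihood-of-superiority (LOS) normal approximation over decisive games; the example win/loss counts are made up:

# Sketch: likelihood of superiority (LOS) of the new version from a self-test
# result, using the usual normal approximation over decisive games (draws
# cancel out). The example win/loss counts are made up.
import math

def los(wins, losses):
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(los(55, 40))  # ~0.94 -> passes a >90% threshold
print(los(30, 25))  # ~0.75 -> not convincing yet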
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: testing question

Post by Dann Corbit »

michiguel wrote:
bob wrote:
lkaufman wrote:
Dann Corbit wrote:To me, it seems that self-testing is the most logical way to make your program improve against earlier versions of itself, and foreign testing is the best way to make your program improve against the programs you tested it against.

I remember Quark self tests that showed a big improvement, and then Anmon would (once again) give Quark a bloody nose, since it was a nemesis beyond the numerical difference for some reason.

I suspect that both types of testing have value and will produce different kinds of improvement.

Consider:
Program 'A' has bad king safety. We improve the pawn-structure understanding of program 'A', and now 'A-prime' can beat the pants off program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', it attacks our weak king safety, and so the improvement we see against this program will be much smaller.
Your answer implies that if the goal is to improve against foreign opponents, we should just do foreign testing. However, your example merely implies that rating improvements from self-testing exaggerate the "real" gains, which I know to be true. It does not suggest that gains from self-testing could be worthless or harmful against foreign opponents. My further question would therefore be: has anyone ever experienced a program improvement based on self-play which turned out to be harmful against other opponents, based on a statistically significant sample in each case?
Yes. I reported this curious effect a year or two ago. I don't remember the specific eval term, but the idea was that I added a new term, so that A' had the term while A did not. And A' won, with very high confidence that it was better. I then dropped it into our cluster-testing setup, and it was worse.

We have seen several such cases where A vs A' suggests that a change is better, but then testing against a suite of opponents shows that the new term is worse more often than it is better.
I believe this is an exception rather than the rule.

Miguel

The only time I test A vs A' any longer is when I do a "stress test" to make sure that everything works correctly at impossibly fast time controls, to detect errors that cause unexpected results. And I rarely do that kind of testing unless I make a major change in something (say parallel search) where I want to make sure it doesn't break something seriously. Game in 1 second (or less) plays a ton of games and usually exposes serious bugs quickly.
I did not intend to imply that self testing did not have value.
The point I attempted to make (and evidently failed to) was that testing against yourself only shows whether the new program can beat its predecessor or not. Clearly, on average, a program that can beat its previous version when modified will probably be stronger.

However, the only evidence you have is against itself and so you cannot mathematically project against other opponents. It is like SSDF testing of a program and then trying to project that program against humans. Probably, the strongest programs will do better against humans. But we cannot know it for sure because we did not test it.

Similarly, if I have 8 opponents I want to beat, and I play 100,000 games against them, then I will know if my change can beat them at the exact conditions of testing (time control, memory, CPU, pondering, etc.).

So, if you want to beat the top ten programs in the world at 40 moves in two hours and be sure about it, you will have to buy a few hundred high-end computers and let them play against each other around the clock.

Fortunately, nobody has the money to do that, since it would be boring if we knew the outcome of a contest before it happened.