Using too few openings?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

MartinBryant

Using too few openings?

Post by MartinBryant »

Just to throw another factor into the entertaining testing 'debates' herein, what do people think of this...

We hear much talk of the 40 de facto standard test positions (Nunn, Silver, whatever), but by repeatedly testing with so few starting positions, is there a danger of tuning your engine to this set only?

I know that these are meant to be 'representative', but is there really any such thing? There are surely thousands of perfectly playable opening lines which will inevitably come up, and although your score might improve on 80 of them, it could get worse on the other 920, and you'll never know unless you test those too.

The real reason I'm concerned about this is an observation from my own testing, viz...
I sometimes run a 1000-game match against a single opponent using a fixed set of 500 GM-extracted openings.
I have observed that you often get 'lucky streaks' where one engine completely dominates the other over a long period, such that if I were to cut the 1000 games into 80-game slices, the results from each slice would be hugely different (even though the engines might actually be very similar in strength).
Even more disturbingly, if I make even the slightest change (or possibly even NO change, if Bob is correct, which I suspect he is), the lucky streaks move about all over the place!
Thus, if I base my made-it-better/made-it-worse decision only on the first slice, I could easily draw the wrong conclusion.

(I have noticed these streaks simply because I graph the rating difference as the match progresses, rather than doing any stats analysis on it.)
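To make the slicing concrete, here is a minimal Python sketch of that kind of breakdown. The per-game scores are random placeholder data so the script runs on its own; in a real run they would come from the match PGN.

```python
# Minimal sketch: slice a 1000-game match into 80-game windows and print each
# window's score. 'scores' holds engine A's per-game result
# (1.0 = win, 0.5 = draw, 0.0 = loss); here it is filled with random
# placeholder data so the script is self-contained.
import random

random.seed(42)
scores = [random.choice([1.0, 0.5, 0.0]) for _ in range(1000)]  # placeholder results

SLICE = 80
for start in range(0, len(scores), SLICE):
    window = scores[start:start + SLICE]
    pts = sum(window)
    print(f"games {start + 1:4d}-{start + len(window):4d}: "
          f"{pts:5.1f}/{len(window)}  ({100.0 * pts / len(window):5.1f}%)")

total = sum(scores)
print(f"overall: {total:.1f}/{len(scores)}  ({100.0 * total / len(scores):.1f}%)")
```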
Alessandro Scotti

Re: Using too few openings?

Post by Alessandro Scotti »

I have noticed the same behavior, but playing 100-game matches against multiple opponents. However, if I get "lucky" in one match then I get "unlucky" in another match with a different opponent. So I tend to ignore the result of single matches (unless I can spot some pattern, like say "I always lose with at least a 5% gap") and only look at the overall result. Over 1000 games, lucky and unlucky streaks tend to somewhat compensate.
MartinBryant

Re: Using too few openings?

Post by MartinBryant »

Oh yes. Absolutely. Agreed.
After 1000 games the overall difference in the result may be tiny.

But is it actually safe to just use the same 40 positions, or could this lead to 'tuning'?
(A bit like people often say that you can 'tune' your engine to tactical testsets.)
Alessandro Scotti

Re: Using too few openings?

Post by Alessandro Scotti »

If the opening set is well built (like the one Albert described, for example) I think it's not easy to tune to the whole set, but in theory it seems very much possible, as a kind of learning where the programmer is unwittingly the database! :-)
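For illustration only (this is not necessarily how Albert builds his suites), here is a rough python-chess sketch of one way to extract a deduplicated opening set from a PGN of strong games. The file names and the 16-ply cutoff are made up.

```python
# Sketch: take the first N plies of every game in a PGN, drop duplicate
# positions, and write the survivors out as EPD strings. Purely illustrative;
# the input/output file names and the ply depth are arbitrary choices.
import chess.pgn

OPENING_PLIES = 16            # 8 moves per side
seen = set()

with open("gm_games.pgn") as pgn, open("openings.epd", "w") as out:
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        board = game.board()
        plies = 0
        for move in game.mainline_moves():
            if plies >= OPENING_PLIES:
                break
            board.push(move)
            plies += 1
        if plies < OPENING_PLIES:
            continue              # game too short to yield a full opening line
        epd = board.epd()         # FEN without move counters, handy for dedup
        if epd not in seen:
            seen.add(epd)
            out.write(epd + "\n")

print(f"kept {len(seen)} unique opening positions")
```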
Uri Blass
Posts: 10306
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Using too few openings?

Post by Uri Blass »

MartinBryant wrote:Just to throw another factor into the entertaining testing 'debates' herein, what do people think of this...

We hear much talk of the 40 de facto standard test positions (Nunn, Silver, whatever), but by repeatedly testing with so few starting positions, is there a danger of tuning your engine to this set only?

I know that these are meant to be 'representative', but is there really any such thing? There are surely thousands of perfectly playable opening lines which will inevitably come up, and although your score might improve on 80 of them, it could get worse on the other 920, and you'll never know unless you test those too.

The real reason I'm concerned about this is an observation from my own testing, viz...
I sometimes run a 1000-game match against a single opponent using a fixed set of 500 GM-extracted openings.
I have observed that you often get 'lucky streaks' where one engine completely dominates the other over a long period, such that if I were to cut the 1000 games into 80-game slices, the results from each slice would be hugely different (even though the engines might actually be very similar in strength).
Even more disturbingly, if I make even the slightest change (or possibly even NO change, if Bob is correct, which I suspect he is), the lucky streaks move about all over the place!
Thus, if I base my made-it-better/made-it-worse decision only on the first slice, I could easily draw the wrong conclusion.

(I have noticed these streaks simply because I graph the rating difference as the match progresses, rather than doing any stats analysis on it.)
I believe that a match of 1000 games is better.
Unfortunately I have not built a tool to take a PGN of many games, cut them, and remove the doubles.

I wonder if there is a tool that does that, because there is no reason to reinvent the wheel.

Uri
Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: Using too few openings?

Post by Albert Silver »

I don't recommend just one set, or even two sets, of 100 games with any suite. When I test, I have at least 3 opponents, possibly more. As mentioned, one suite against one opponent may be no more than 'booking' that particular opponent. In my experience, testing against 4-5 different opponents helps avoid this phenomenon. You may gain an extra 5 games against one opponent, but lose 1-2 against the others, etc.

Albert
Uri Blass
Posts: 10306
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Using too few openings?

Post by Uri Blass »

I doubt that booking against a specific opponent is practically possible.

I looked at the FRC results by Ray, who uses 50 random positions, and I found that I never see cases where A gets a significantly better result against B, B gets a significantly better result against C, and C gets a significantly better result against A.

http://www.computerchess.org.uk/ccrl/404FRC/

I find that performance against a single opponent never differs from the rating of the program by a big margin (the differences can be explained by statistical error).
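As a rough illustration of how big that statistical error can be, here is a back-of-the-envelope Python sketch. It treats every game as an independent win/loss trial and ignores draws, so it somewhat overstates the margin, and it is not how the CCRL ratings are actually computed.

```python
# Rough 2-sigma error margin on a head-to-head score, converted to Elo.
# Assumes independent games and ignores draws, so this is only an upper-bound
# style estimate, not a proper rating calculation.
import math

def elo_from_score(p):
    # Logistic model: score fraction p -> Elo difference.
    return -400.0 * math.log10(1.0 / p - 1.0)

def approx_error(n, p=0.5):
    # ~2 standard deviations of the score fraction, expressed in Elo around p.
    se = math.sqrt(p * (1.0 - p) / n)
    return elo_from_score(p + 2 * se) - elo_from_score(p)

for n in (50, 100, 500, 1000):
    print(f"{n:5d} games: roughly +/- {approx_error(n):.0f} Elo")
```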

Uri
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Using too few openings?

Post by Dann Corbit »

There are lots of tools to remove doubles. Free ones:
SCID
PGN-Extract

Commercial tools like Chess Assistant also do it.

I actually use all three.

pgn input -> scid -> pgn-extract -> chess assistant -> pgn output

Sometimes I run more than one cycle. In SCID, I run the cleaner, which makes duplicate detection more likely. Each tool finds duplicates that the others miss.
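For anyone who prefers a scriptable (if much cruder) pass, here is a python-chess sketch that drops games with identical move sequences. It will not catch the transposed or truncated duplicates that the tools above find; the file names are illustrative.

```python
# Crude duplicate removal: keep only the first game seen for each exact move
# sequence. Real dedup tools also catch transpositions, differing headers,
# truncated copies, etc., which this sketch does not.
import chess.pgn

seen = set()
kept = dropped = 0

with open("input.pgn") as src, open("deduped.pgn", "w") as dst:
    while True:
        game = chess.pgn.read_game(src)
        if game is None:
            break
        key = " ".join(m.uci() for m in game.mainline_moves())
        if key in seen:
            dropped += 1
            continue
        seen.add(key)
        kept += 1
        print(game, file=dst, end="\n\n")

print(f"kept {kept} games, dropped {dropped} duplicates")
```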
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Using too few openings?

Post by bob »

MartinBryant wrote:Just to throw another factor into the entertaining testing 'debates' herein, what do people think of this...

We hear much talk of the 40 de facto standard test positions (Nunn, Silver, whatever), but by repeatedly testing with so few starting positions, is there a danger of tuning your engine to this set only?

I know that these are meant to be 'representative', but is there really any such thing? There are surely thousands of perfectly playable opening lines which will inevitably come up, and although your score might improve on 80 of them, it could get worse on the other 920, and you'll never know unless you test those too.

The real reason I'm concerned about this is an observation from my own testing, viz...
I sometimes run a 1000-game match against a single opponent using a fixed set of 500 GM-extracted openings.
I have observed that you often get 'lucky streaks' where one engine completely dominates the other over a long period, such that if I were to cut the 1000 games into 80-game slices, the results from each slice would be hugely different (even though the engines might actually be very similar in strength).
Even more disturbingly, if I make even the slightest change (or possibly even NO change, if Bob is correct, which I suspect he is), the lucky streaks move about all over the place!
Thus, if I base my made-it-better/made-it-worse decision only on the first slice, I could easily draw the wrong conclusion.

(I have noticed these streaks simply because I graph the rating difference as the match progresses, rather than doing any stats analysis on it.)
This is really a catch-22. Here's my take...

If you use a book, you introduce significant randomness into the testing, requiring a really large number of games to figure out what your changes are doing. So that is a non-starter if you want to run many tests...

So you have to resort to some common starting positions instead, and the main issue here is how many. Obviously you need to play two games per position to avoid black/white bias. If you pick 100 positions, that is 200 games for a single match, and from my randomness observations, you will need several such matches. So it is easy to go "too big" and make the required testing computationally intractable.
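The arithmetic in a tiny Python sketch, using the numbers from this post (the position names and the count of repeated matches are just placeholders):

```python
# Each position is played twice with colours swapped, and the whole set is
# repeated for several matches. 100 positions -> 200 games per match.
positions = [f"pos{i:03d}" for i in range(100)]   # stand-ins for 100 FEN strings
matches = 4                                       # "several such matches" (arbitrary)

schedule = []
for m in range(matches):
    for fen in positions:
        schedule.append((m, fen, "A-white"))      # engine A plays White
        schedule.append((m, fen, "B-white"))      # colours reversed

print(len(positions), "positions ->", 2 * len(positions), "games per match,",
      len(schedule), "games total")
```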

One could argue that if you double the number of starting positions, you could halve the number of games per position and not hurt anything. But who wants to figure out which 100 (or 200 or 500) positions are "the ones" to use? I've looked at each of Silver's positions and didn't find anything that made me think "that's got to go".

If you "tune for the positions" and the positions are really pretty representative, you are probably doing a good thing...