Extensions, anyone?


Greg Strong
Posts: 388
Joined: Sun Dec 21, 2008 6:57 pm
Location: Washington, DC

Re: Extensions, anyone?

Post by Greg Strong »

Here are the results for the pawn-to-7th-rank extension:

Code:


                             Wins - Losses - Draws vs. Opponent 

                   Crafty          Fruit        Scorpio      Stockfish        Total 

Pawn to 7th 0/0   14-2-4 (80%)   13-4-3 (73%)  15-2-3 (83%)   6-6-8 (50%)   48-14-18 (71.25%) 

Pawn to 7th 1/0   11-4-5 (68%)   12-6-2 (65%)  15-1-4 (85%)   9-3-8 (65%)   47-14-19 (70.625%) 

Pawn to 7th 1/1   12-4-4 (70%)   13-3-4 (75%)  17-1-2 (90%)   7-5-8 (55%)   49-13-18 (72.5%) 

Pawn to 7th 2/0   17-1-2 (90%)   11-7-2 (60%)  19-0-1 (98%)   6-7-7 (48%)   53-15-12 (73.75%) 

Pawn to 7th 2/1   12-1-7 (78%)   12-6-2 (65%)  17-2-1 (88%)  6-2-12 (60%)   47-11-22 (72.5%) 

Pawn to 7th 2/2   12-5-3 (68%)   16-2-2 (85%)  13-3-4 (75%)   6-9-5 (43%)   47-19-14 (67.5%) 
All over the place. I threw in 2/0 this time just for the hell of it (full extension at PV nodes and none otherwise) and it came out with the highest percentage! Since I doubt that's actually best, it is probably just random error. Perhaps I'm not going to learn anything useful from this exercise...
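
(For anyone wondering what the n/m settings mean in code terms: they are the PV / non-PV extension amounts in half-plies, roughly as in the sketch below. The names are made up for illustration; this is not the actual code from my engine.)

Code:

#include <algorithm>

// Sketch only: names and the half-ply convention are illustrative,
// not taken from any particular engine.
enum { WHITE, BLACK };
enum { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

constexpr int ONE_PLY = 2;          // search depth counted in half-plies
int PawnTo7thExtPV    = 2;          // the "2" in a "2/1" setting (full ply)
int PawnTo7thExtNonPV = 1;          // the "1" in a "2/1" setting (half ply)

// Called after a move is made: how much to extend the search for it.
// 'to' is the destination square, 0..63 with a1 = 0.
int pawnTo7thExtension(bool isPV, int movedPiece, int to, int sideToMove)
{
    int ext = 0;
    int relRank = (sideToMove == WHITE) ? to / 8 : 7 - to / 8;  // rank from the mover's view

    if (movedPiece == PAWN && relRank == 6)   // the pawn has just reached the 7th rank
        ext += isPV ? PawnTo7thExtPV : PawnTo7thExtNonPV;

    return std::min(ext, ONE_PLY);            // cap at one full ply
}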
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Extensions, anyone?

Post by zamar »

Take a quick look, for example, at the CCRL rating list. When an engine has played 60 games, the error bar for its Elo rating is around 80 Elo points (and that is huge!).

Some time ago I played a bit with the bayeselo program (which is used by CCRL to calculate Elo ratings and error bars). To get the error bar down to 4-5 Elo points (which is an absolute necessity for your experiment to succeed, and I'm afraid even that is not enough), you need about 4000 games for each modified engine.

That is why top guys like Vas and Bob run 16k-50k ultra-fast blitz games for each engine change.
Joona Kiiski
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

Greg Strong wrote:Here are the results for the pawn-to-7th-rank extension:

Code:


                             Wins - Losses - Draws vs. Opponent 

                   Crafty          Fruit        Scorpio      Stockfish        Total 

Pawn to 7th 0/0   14-2-4 (80%)   13-4-3 (73%)  15-2-3 (83%)   6-6-8 (50%)   48-14-18 (71.25%) 

Pawn to 7th 1/0   11-4-5 (68%)   12-6-2 (65%)  15-1-4 (85%)   9-3-8 (65%)   47-14-19 (70.625%) 

Pawn to 7th 1/1   12-4-4 (70%)   13-3-4 (75%)  17-1-2 (90%)   7-5-8 (55%)   49-13-18 (72.5%) 

Pawn to 7th 2/0   17-1-2 (90%)   11-7-2 (60%)  19-0-1 (98%)   6-7-7 (48%)   53-15-12 (73.75%) 

Pawn to 7th 2/1   12-1-7 (78%)   12-6-2 (65%)  17-2-1 (88%)  6-2-12 (60%)   47-11-22 (72.5%) 

Pawn to 7th 2/2   12-5-3 (68%)   16-2-2 (85%)  13-3-4 (75%)   6-9-5 (43%)   47-19-14 (67.5%) 
All over the place. I threw in 2/0 this time just for the hell of it (full extension at PV nodes and none otherwise) and it came out with the highest percentage! Since I doubt that's actually best, it is probably just random error. Perhaps I'm not going to learn anything useful from this exercise...
You should run the _same_ test again. Then you will begin to understand the "all over the place" results. :)

I'd bet that the "best setting" will not be the same the second time around.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

zamar wrote:Take a quick look, for example, at the CCRL rating list. When an engine has played 60 games, the error bar for its Elo rating is around 80 Elo points (and that is huge!).

Some time ago I played a bit with the bayeselo program (which is used by CCRL to calculate Elo ratings and error bars). To get the error bar down to 4-5 Elo points (which is an absolute necessity for your experiment to succeed, and I'm afraid even that is not enough), you need about 4000 games for each modified engine.

That is why top guys like Vas and Bob run 16k-50k ultra-fast blitz games for each engine change.
I agree. 32,000 games gives a +/-4 Elo error bar, which means your "change" has to gain or lose more than that before you can measure it accurately enough to decide good/bad/indifferent...
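
If you want to check the arithmetic, a crude estimate that treats each game as an independent coin flip near a 50% score and ignores draws gives roughly the same figure (it will not match bayeselo's model exactly):

Code:

#include <cmath>
#include <cstdio>

int main()
{
    const double p = 0.5;                                      // score fraction near 50%
    const double eloPerScore = 400.0 / (std::log(10.0) * p * (1.0 - p)); // slope of Elo(p) at 50%

    for (int n : {60, 16000, 32000}) {
        double halfWidth = 1.96 * std::sqrt(p * (1.0 - p) / n);   // ~95% CI in score units
        std::printf("%6d games -> about +/- %.1f Elo\n", n, halfWidth * eloPerScore);
    }
}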
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Extensions, anyone?

Post by mcostalba »

Greg Strong wrote:Not a silly question - I should have made the table clearer. The three numbers are wins-losses-draws for Glaurung, and the percentage is the percentage of points Glaurung scored, with draws counting half a point. Or:

Percentage = (Glaurung wins + (Glaurung draws / 2)) / total games.
Ok. In this case the mate threat extension pushed to a full ply both in PV and in non-PV nodes shows a Rybka-like result!!! (12 wins vs 3 losses against almost all the opponents)

Are you sure the results are reliable?

Normally I would play at least 1000 games to have something trustworthy, especially when the results are so strange (extending on a mate threat at non-PV nodes doesn't seem like a good idea).
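
To be explicit about what I mean by mate threat extension, it is the usual null-move based trigger, roughly as in this sketch (illustrative names and values in half-plies, not Glaurung's actual code):

Code:

// Half-ply units: 2 = one full ply, 1 = half a ply.
constexpr int VALUE_MATE  = 30000;
constexpr int MATE_IN_MAX = VALUE_MATE - 100;   // "mated within the search horizon"

int MateThreatExtPV    = 2;   // full ply on PV nodes
int MateThreatExtNonPV = 2;   // the "pushed to a full ply in non-PV too" case

// nullValue is the score returned by the reduced-depth search after a null move,
// from the point of view of the side to move.
int mateThreatExtension(bool isPV, int nullValue)
{
    // If passing the move gets us mated, the opponent has a mate threat:
    // extend so the threat is resolved instead of being pushed over the horizon.
    if (nullValue < -MATE_IN_MAX)
        return isPV ? MateThreatExtPV : MateThreatExtNonPV;
    return 0;
}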

Anyhow thanks for your tests :-)

Marco
Greg Strong
Posts: 388
Joined: Sun Dec 21, 2008 6:57 pm
Location: Washington, DC

Re: Extensions, anyone?

Post by Greg Strong »

mcostalba wrote:Ok. In this case the mate threat extension pushed to a full ply both in PV and in non-PV nodes shows a Rybka-like result!!! (12 wins vs 3 losses against almost all the opponents)

Are you sure the results are reliable?
Nope. In fact, the prevailing opinion is that there are not nearly enough games in the test set to say anything meaningful. Add to that the fact that the results of the pawn-push-to-7th test seem really erratic.

But I'm not giving up - I'm going to retry with a larger set of positions and more opponent programs, but it's going to take a lot longer to get results...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

Greg Strong wrote:
mcostalba wrote:Ok. In this case the mate threat extension pushed to a full ply both in PV and in non-PV nodes shows a Rybka-like result!!! (12 wins vs 3 losses against almost all the opponents)

Are you sure the results are reliable?
Nope. In fact, the prevailing opinion is that there are not nearly enough games in the test set to say anything meaningful. Add to that the fact that the results of the pawn-push-to-7th test seem really erratic.

But I'm not giving up - I'm going to retry with a larger set of positions and more opponent programs, but it's going to take a lot longer to get results...
You are almost where I was when I started this cluster-testing stuff. I was testing history threshold values for the history-based LMR (which I no longer use of course). And as I modified the threshold value between 0% and 100%, the results were all over the place. I'll try to dig up a recent test I ran for a paper I am writing on this subject, where we got a very nice sort of normal-curve result with a peak for the best value, and where values on either side of this "best" dropped off in terms of Elo. With the history test I mentioned, 0 might be bad, 10 might be better, 20 might be worse, 30 might be good, 40 might be bad, 50 might be good, 60 might be bad... All caused by the inherent randomness any program exhibits when using a time-based search control...

A large number of games, using a large number of starting positions (playing each twice with alternating colors to eliminate bias from unbalanced positions) and with a significant number of different opponents is the way to go. You need more than one or two opponents as I have made changes that helped against A, but hurt against B, C and D. If you only play A you can reach the wrong conclusion and make a tuning mistake that will hurt in a real tournament.
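
The schedule itself is the easy part; roughly the sketch below (made-up names, and the real cluster scripts also handle the referee, the time control and result collection):

Code:

#include <cstdio>
#include <string>
#include <vector>

// Every starting position is played twice against every opponent, once with
// each color, so that unbalanced positions and color bias cancel out.
struct Game { std::string opponent, fen; bool testEngineIsWhite; };

std::vector<Game> buildSchedule(const std::vector<std::string>& opponents,
                                const std::vector<std::string>& startPositions)
{
    std::vector<Game> games;
    for (const auto& opp : opponents)
        for (const auto& fen : startPositions) {
            games.push_back({opp, fen, true});    // test engine plays White
            games.push_back({opp, fen, false});   // same position, colors reversed
        }
    return games;
}

int main()
{
    std::vector<std::string> opponents = {"Crafty", "Fruit", "Scorpio", "Stockfish"};
    std::vector<std::string> positions(40, "<fen string>");   // e.g. 40 balanced openings
    std::printf("%zu games per version\n", buildSchedule(opponents, positions).size());
}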
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Extensions, anyone?

Post by Tord Romstad »

bob wrote:A large number of games, using a large number of starting positions (playing each twice with alternating colors to eliminate bias from unbalanced positions) and with a significant number of different opponents is the way to go.
It would be, in principle, but sometimes it seems like you don't realize that this just isn't possible for the vast majority of us. I use my iMac for all sorts of other things besides computer chess, and I can't run computer chess matches all the time. When I have a new experimental version I want to test, I rarely have enough CPU time for more than 100--200 games.

Fortunately, being able to verify tiny improvements isn't really necessary. Intuition is an imperfect, but adequate replacement for statistically significant tests. Without a doubt, it happens that some of the "improvements" in my program make it play worse, but as long as I am right more often than wrong, the net result in the long term will be that the program keeps improving.

Tord
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Extensions, anyone?

Post by zamar »

bob wrote: A large number of games, using a large number of starting positions (playing each twice with alternating colors to eliminate bias from unbalanced positions) and with a significant number of different opponents is the way to go. You need more than one or two opponents as I have made changes that helped against A, but hurt against B, C and D. If you only play A you can reach the wrong conclusion and make a tuning mistake that will hurt in a real tournament.
Ouch. I'm currently testing an automated tuning system, where the basic idea is to test whether a randomly modified engine can beat the original one. But if what you say is true, then I'm doomed to fail, because the changes might make the engine play worse against all the other engines. Is that kind of behaviour common? I guess you must have tested many variables for your paper.
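
The loop I have in mind is roughly the sketch below -- made-up names, and the match runner replaced by a placeholder that pretends both versions are equally strong, which already shows how often plain noise can clear a small acceptance margin:

Code:

#include <cstddef>
#include <random>
#include <vector>

struct Params { std::vector<double> v; };

// Placeholder: in reality this runs 'games' engine-vs-engine games and returns
// the candidate's score fraction. Here both versions are exactly equal in strength.
double playMatch(const Params&, const Params&, int games)
{
    static std::mt19937 g(12345);
    std::binomial_distribution<int> wins(games, 0.5);
    return wins(g) / double(games);
}

Params tune(Params best, int iterations, int gamesPerMatch, std::mt19937& rng)
{
    std::normal_distribution<double> noise(0.0, 0.1);
    for (int i = 0; i < iterations; ++i) {
        Params candidate = best;
        std::uniform_int_distribution<std::size_t> pick(0, candidate.v.size() - 1);
        candidate.v[pick(rng)] += noise(rng);     // randomly tweak one parameter

        // Acceptance test: the margin has to be big enough to beat the noise in
        // gamesPerMatch games, and the reference should really be several opponents.
        if (playMatch(candidate, best, gamesPerMatch) > 0.53)
            best = candidate;
    }
    return best;
}

int main()
{
    std::mt19937 rng(2009);
    Params start{std::vector<double>(10, 0.0)};
    tune(start, 100, 200, rng);   // 200-game matches: expect a fair number of false accepts
}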
Joona Kiiski
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

Tord Romstad wrote:
bob wrote:A large number of games, using a large number of starting positions (playing each twice with alternating colors to eliminate bias from unbalanced positions) and with a significant number of different opponents is the way to go.
It would be, in principle, but sometimes it seems like you don't realize that this just isn't possible for the vast majority of us. I use my iMac for all sorts of other things besides computer chess, and I can't run computer chess matches all the time. When I have a new experimental version I want to test, I rarely have enough CPU time for more than 100--200 games.

Fortunately, being able to verify tiny improvements isn't really necessary. Intuition is an imperfect, but adequate replacement for statistically significant tests. Without a doubt, it happens that some of the "improvements" in my program make it play worse, but as long as I am right more often than wrong, the net result in the long term will be that the program keeps improving.

Tord
There's a place where we simply disagree significantly. I can't count the number of ideas that Tracy and I have tried. A recent example: the rook on the 7th rank. Older versions used to require that the king be on the 8th rank for the rook on the 7th to count, as well as testing for pawns on the 7th. Both of these ideas are mentioned in more than one chess book. At some point I simplified a lot of things just to get the latest version running, and then we started to go back to "correct" the oversights. Adding either the pawn test or the king-on-the-8th test _weakened_ the program. Not by a lot, but by enough that when you add up these small changes, they become a big change.

The only place intuition works well is this: all else being equal, faster is better. A 5% speed improvement is always a plus, assuming you don't give up something (an eval term removed, for example) to get that speed. But everywhere else, I find intuition is about as accurate as flipping a coin and using that to make decisions. Another example: I have been working on passed pawns and decided to try a Fruit-like approach for passer mobility. The overhead to determine whether a pawn can safely advance (when it is not blockaded) is not very much. Yet adding this made the results slightly worse. And yes, I tried testing the square in front of the pawn, or the two squares, etc., and each played slightly worse than the normal version that just scores passers based on their rank of advance and whether the square in front is empty or occupied.
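
The kind of test I am talking about is roughly the following (a bitboard sketch with made-up names, not Crafty's actual code, and "safe" simplified to a single attacked/defended check):

Code:

#include <cstdint>

// Bitboard layout: a1 = bit 0, a2 = bit 8, and so on.
using Bitboard = std::uint64_t;

bool passerCanAdvanceSafely(int pawnSq, bool pawnIsWhite,
                            Bitboard occupied,        // all pieces
                            Bitboard attackedByUs,    // squares our pieces attack
                            Bitboard attackedByThem)  // squares the opponent attacks
{
    int stopSq = pawnIsWhite ? pawnSq + 8 : pawnSq - 8;   // square in front of the pawn
    Bitboard stop = Bitboard(1) << stopSq;

    if (occupied & stop)          // blockaded: cannot advance at all
        return false;

    // "Safe" simplified here to: not attacked, or attacked but also defended.
    return !(attackedByThem & stop) || (attackedByUs & stop);
}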

I'm very skeptical of "intuition" since no evaluation term is free. I would bet that 75% of our changes the past year, each of which sounded perfectly reasonable in discussions, turned out to be either "no change" or "slightly weaker" in real testing.

I realize playing large numbers of games is difficult if not impossible for most. But the alternative is to accept bad changes and reject good changes based purely on a coin flip, which is all 200 games is. You only have to run a 200 game match between A and B twice and compare the results to see what I mean. No changes to A or B between the two matches.

I've previously posted a ton of 80-game matches between two identical opponents to show the randomness in such a small sample. It really is there...
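
If you don't want to burn the CPU time to see it, a crude simulation makes the same point: treat each game between two identical engines as an independent win/draw/loss trial (real games are not quite independent, but the spread is what matters) and look at how much ten 200-game matches vary:

Code:

#include <cstdio>
#include <random>

int main()
{
    std::mt19937 rng(1978);
    std::discrete_distribution<int> game({0.35, 0.30, 0.35});   // win, draw, loss

    const int games = 200, matches = 10;
    for (int m = 0; m < matches; ++m) {
        double score = 0.0;
        for (int g = 0; g < games; ++g) {
            int r = game(rng);
            score += (r == 0) ? 1.0 : (r == 1) ? 0.5 : 0.0;
        }
        std::printf("match %2d: %5.1f / %d  (%.1f%%)\n",
                    m + 1, score, games, 100.0 * score / games);
    }
}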