testing, again. Glaurung 2 change
Re: error in testing...
2.1 came with a lot more options, and several default values changed. For sp, the biggest changes were probably the "single reply extensions"
-
- Posts: 10460
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: testing, again. Glaurung 2 change
bob wrote: I just realized what you are saying. You do realize that each of these sets of games is just 4,000 games long? And looking at the results against one program is falling right back into the random result issue as in the past. Look at the error bars for each program (except for crafty.) Nothing looks unusual at all. I was simply pointing out that in the first two of these runs I made, G2.1 was doing somewhat worse.

I disagree that nothing looks unusual.
The worse performance of Glaurung 2.1 is certainly unusual, because 2.1 is better based on many games from CCRL or CEGT.
In addition, the difference between Fruit 2.1's ratings seems unusual:
Rank Name          Elo    +    - games score oppo. draws
   2 Fruit 2.1      68    9    9  3894   64%   -34   24%  (+102 Elo relative to Crafty)
   5 Crafty-22.2   -34    5    5 19470   44%     7   23%
Rank Name          Elo    +    - games score oppo. draws
   2 Fruit 2.1      52   11   11  2267   60%   -22   24%  (+74 Elo relative to Crafty)
   5 Crafty-22.2   -22    5    6 11344   46%     4   24%
If we ignore Crafty's games against non-Fruit programs, then the effective difference in Fruit's performance against Crafty is 28 Elo (102 - 74), and that certainly seems unusual.
Uri
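As a rough cross-check of the "+102" and "+74" figures above, the usual logistic Elo formula turns a score fraction into a rating difference. This is only a sketch: it assumes Fruit's 64% and 60% were scored entirely against Crafty (the oppo. column matches Crafty's rating in both runs), and the function name `elo_from_score` is made up for illustration. The rating tool's own numbers differ slightly, presumably because it fits all games jointly rather than one pairing at a time.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))


# Fruit 2.1's score in the two runs (assumed to be against Crafty only).
run_a = elo_from_score(0.64)
run_b = elo_from_score(0.60)

print(f"run A: {run_a:.1f} Elo, run B: {run_b:.1f} Elo, gap: {run_a - run_b:.1f}")
```

Under that assumption the gap between the two runs comes out close to the 28 Elo quoted above.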
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: testing, again. Glaurung 2 change
Uri Blass wrote:
bob wrote: I just realized what you are saying. You do realize that each of these sets of games is just 4,000 games long? And looking at the results against one program is falling right back into the random result issue as in the past. Look at the error bars for each program (except for crafty.) Nothing looks unusual at all. I was simply pointing out that in the first two of these runs I made, G2.1 was doing somewhat worse.
I disagree that nothing looks unusual.
The worse performance of Glaurung 2.1 is certainly unusual, because 2.1 is better based on many games from CCRL or CEGT.
In addition, the difference between Fruit 2.1's ratings seems unusual.

Read my other post. I had not noticed that Tord had significantly changed the polyglot.ini file, and I was running the new version of Glaurung with the old 2.0 e5 file. I am re-running the tests now.
As to the difference in Fruit's rating, are we now going to flip sides, where you say that even if the new rating is inside the error bar for the old one, it is "wrong"?
I have only been pointing out the cases where the two ratings +/- the error do not overlap at all. Here they do.

Uri Blass wrote:
Rank Name          Elo    +    - games score oppo. draws
   2 Fruit 2.1      68    9    9  3894   64%   -34   24%  (+102 Elo relative to Crafty)
   5 Crafty-22.2   -34    5    5 19470   44%     7   23%
Rank Name          Elo    +    - games score oppo. draws
   2 Fruit 2.1      52   11   11  2267   60%   -22   24%  (+74 Elo relative to Crafty)
   5 Crafty-22.2   -22    5    6 11344   46%     4   24%
If we ignore Crafty's games against non-Fruit programs, then the effective difference in Fruit's performance against Crafty is 28 Elo, and that certainly seems unusual.
Uri

It doesn't to me, considering that there are 2,000 positions and roughly 4,000 games. 68 - 9 and 52 + 11 are certainly within expectation.
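The overlap test described in this post can be written down directly. A minimal sketch, assuming the "+" and "-" columns are the upper and lower error margins on the Elo column; the helper names `interval` and `overlaps` are hypothetical.

```python
def interval(elo, plus, minus):
    """Return (low, high) for a rating with asymmetric error margins."""
    return (elo - minus, elo + plus)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Fruit 2.1 in the two runs: Elo, +, -
first = interval(68, 9, 9)
second = interval(52, 11, 11)

print(first, second, overlaps(first, second))
```

The lower end of the first run (68 - 9) sits below the upper end of the second (52 + 11), which is the overlap being pointed at.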
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: error in testing...
Dirt wrote:
bob wrote: I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1. I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.
What difference is there in the polyglot.ini files that would cause a substantial change?

There are some scoring parameters. For example, old:
Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130
new:
Mobility (Middle Game) = 100
Mobility (Endgame) = 100
Aggressiveness = 100
Cowardice = 100
Pawn Structure (Middle Game) = 100
Pawn Structure (Endgame) = 100
Passed Pawns (Middle Game) = 100
Passed Pawns (Endgame) = 100
I always try to run engines with "default settings". I probably should just remove all of those completely, but they are "as distributed". I have Crafty set up so that the default settings are optimal, so that setting hash size and max threads is enough...
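To see which of the listed settings actually differ, the two lists can be compared mechanically. A small sketch using only the values quoted in this post; matching option names case-insensitively is an assumption made here for the comparison, not a statement about how polyglot or Glaurung treats them.

```python
old = {
    "Aggressiveness": 150, "Cowardice": 100, "Passed pawns": 140,
    "Pawn structure": 150, "Mobility (middle game)": 130,
    "Mobility (endgame)": 110, "Space": 100, "Development": 130,
}
new = {
    "Mobility (Middle Game)": 100, "Mobility (Endgame)": 100,
    "Aggressiveness": 100, "Cowardice": 100,
    "Pawn Structure (Middle Game)": 100, "Pawn Structure (Endgame)": 100,
    "Passed Pawns (Middle Game)": 100, "Passed Pawns (Endgame)": 100,
}


def normalize(settings):
    """Lower-case option names so differently capitalized names compare equal."""
    return {name.lower(): value for name, value in settings.items()}


old_n, new_n = normalize(old), normalize(new)

# Options present in both files whose values differ.
changed = {k: (old_n[k], new_n[k])
           for k in old_n.keys() & new_n.keys() if old_n[k] != new_n[k]}
# Options present in only one of the two files.
only_old = sorted(old_n.keys() - new_n.keys())
only_new = sorted(new_n.keys() - old_n.keys())

for name, (before, after) in sorted(changed.items()):
    print(f"{name}: {before} -> {after}")
print("only in old:", only_old)
print("only in new:", only_new)
```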
-
- Posts: 10460
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: testing, again. Glaurung 2 change
bob wrote:
Uri Blass wrote:
bob wrote: I just realized what you are saying. You do realize that each of these sets of games is just 4,000 games long? And looking at the results against one program is falling right back into the random result issue as in the past. Look at the error bars for each program (except for crafty.) Nothing looks unusual at all. I was simply pointing out that in the first two of these runs I made, G2.1 was doing somewhat worse.
I disagree that nothing looks unusual.
The worse performance of Glaurung 2.1 is certainly unusual, because 2.1 is better based on many games from CCRL or CEGT.
In addition, the difference between Fruit 2.1's ratings seems unusual.
Read my other post. I had not noticed that Tord had significantly changed the polyglot.ini file, and I was running the new version of Glaurung with the old 2.0 e5 file. I am re-running the tests now.
As to the difference in Fruit's rating, are we now going to flip sides, where you say that even if the new rating is inside the error bar for the old one, it is "wrong"?
I have only been pointing out the cases where the two ratings +/- the error do not overlap at all. Here they do.
Uri Blass wrote:
Rank Name          Elo    +    - games score oppo. draws
   2 Fruit 2.1      68    9    9  3894   64%   -34   24%  (+102 Elo relative to Crafty)
   5 Crafty-22.2   -34    5    5 19470   44%     7   23%
Rank Name          Elo    +    - games score oppo. draws
   2 Fruit 2.1      52   11   11  2267   60%   -22   24%  (+74 Elo relative to Crafty)
   5 Crafty-22.2   -22    5    6 11344   46%     4   24%
If we ignore Crafty's games against non-Fruit programs, then the effective difference in Fruit's performance against Crafty is 28 Elo, and that certainly seems unusual.
Uri
It doesn't to me, considering that there are 2,000 positions and roughly 4,000 games. 68 - 9 and 52 + 11 are certainly within expectation.

I do not say it is impossible, but this is not a result that I would expect to happen often, and the difference is in practice bigger if we give Crafty the same rating in both lists and ignore the games of other programs against Crafty.
If, in calculating the rating for Fruit, you simply ignore Crafty's games against non-Fruit programs and give Crafty a constant rating of 0, you will probably get 102 +- 9 and 74 +- 11 for Fruit, which is clearly not within expectation.
Statistical noise is not the only possible explanation: it is possible that Fruit simply does not play relatively well in the subset of positions that you tested when you got the +-11.
I do not claim that there has to be an error in your games, only that it is a possibility that I think is worth checking based on the data.
Uri
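The arithmetic behind this argument, as a sketch. It assumes, as suggested in the post, that the error margins stay at 9 and 11 once Crafty is pinned to 0, and that the two runs are independent. Combining the margins in quadrature is a common rule of thumb added here for comparison; it is not taken from the rating tool.

```python
import math

# Fruit's rating relative to Crafty (Crafty fixed at 0) and its error margin.
fruit_a, err_a = 102, 9
fruit_b, err_b = 74, 11

gap = fruit_a - fruit_b                       # difference between the runs
low_a = fruit_a - err_a                       # lower end of the first run
high_b = fruit_b + err_b                      # upper end of the second run
combined = math.sqrt(err_a**2 + err_b**2)     # margins added in quadrature

print(f"gap = {gap}, intervals overlap: {low_a <= high_b}")
print(f"combined margin = {combined:.1f}, gap / combined = {gap / combined:.2f}")
```

With Crafty held fixed the two intervals no longer touch, which is the point being made; whether the margins really carry over unchanged is exactly the assumption in question.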
-
- Posts: 1808
- Joined: Wed Mar 08, 2006 9:19 pm
- Location: Oslo, Norway
Re: testing, again. Glaurung 2 change
bob wrote:
Tord:
A while back you mentioned that I should move from the older 2.0 epsilon whatever to the most recent. I didn't change at the time because I didn't want to alter a constant opponent that was represented in a lot of old data.
With the new testing approach, I am in the process of now re-evaluating the opponents, and perhaps adding a few more opponents (to do a few less games per opponent to keep things close computationally).
One oddity I found is this:
crafty-22.2R5
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115    9    9  3894   70%   -34   20%
   2 Fruit 2.1               68    9    9  3894   64%   -34   24%
   3 opponent-21.7           20    8    8  3894   58%   -34   34%
   4 Glaurung 1.1 SMP        14    9    9  3894   57%   -34   20%
   5 Crafty-22.2            -34    5    5 19470   44%     7   23%
   6 Arasan 10.0           -184    9    9  3894   30%   -34   20%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2.1            95   11   11  2271   65%   -22   18%
   2 Fruit 2.1               52   11   11  2267   60%   -22   24%
   3 Glaurung 1.1 SMP        27   11   11  2263   57%   -22   21%
   4 opponent-21.7           16   11   10  2269   56%   -22   35%
   5 Crafty-22.2            -22    5    6 11344   46%     4   24%
   6 Arasan 10.0           -169   11   11  2274   30%   -22   20%
While the current test has not completed, I did run one complete run but threw it out because I accidentally replaced the wrong glaurung with the newest. But the thing I noticed is that at least for Crafty, the new glaurung is not doing quite as well as the previous version (old was 70% vs crafty, new is 65%). I will post the complete run when it finishes, but I thought it interesting. Whether it suggests that some change was not so good, or just not so good against Crafty I am not sure.

All rating lists agree that 2.1 is stronger than 2 - ε/5, so I'm quite sure the latter of your two explanations is right. This isn't a big surprise: Glaurung 2.1 has a strange and very speculative evaluation function, which often returns huge scores even in materially equal positions. You may have noticed that Glaurung 2.1 often fails to win even when it reaches scores of +3 or +4. Making the program less speculative would almost certainly make it stronger, but also less fun, which is of course more important.
Highly speculative play works better against some engines than against others. I suppose Crafty has a very sound and solid style, and excels at refuting risky play.
Tord
-
- Posts: 1808
- Joined: Wed Mar 08, 2006 9:19 pm
- Location: Oslo, Norway
Re: error in testing...
bob wrote:
Dirt wrote:
bob wrote: I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1. I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.
What difference is there in the polyglot.ini files that would cause a substantial change?
There are some scoring parameters. For example, old:
Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130

Actually, these parameters seem to be from the polyglot.ini file of Glaurung 1. Several of the above parameters don't even exist in Glaurung 2.
I don't think using the old polyglot.ini file will hurt the strength significantly, though. I think the biggest part of the explanation for Glaurung 2.1's poor results compared to 2 - ε/5 is simply that it has problems against Crafty.

bob wrote: I have Crafty set so that default settings are optimal so that setting hash size and max threads is enough....

I do exactly the same in Glaurung: the default settings are identical to what is found in the polyglot.ini file supplied with Glaurung, apart from the hash size and the number of threads.
Tord
-
- Posts: 4557
- Joined: Tue Jul 03, 2007 4:30 am
Re: testing, again. Glaurung 2 change
Tord Romstad wrote: Making the program less speculative would almost certainly make it stronger, but also less fun, which is of course more important.

Your way of thinking is appreciated.
-
- Posts: 10460
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: error in testing...
Tord Romstad wrote:
bob wrote:
Dirt wrote:
bob wrote: I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1. I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.
What difference is there in the polyglot.ini files that would cause a substantial change?
There are some scoring parameters. For example, old:
Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130
Actually, these parameters seem to be from the polyglot.ini file of Glaurung 1. Several of the above parameters don't even exist in Glaurung 2.
I don't think using the old polyglot.ini file will hurt the strength significantly, though. I think the biggest part of the explanation for Glaurung 2.1's poor results compared to 2 - ε/5 is simply that it has problems against Crafty.
bob wrote: I have Crafty set so that default settings are optimal so that setting hash size and max threads is enough....
I do exactly the same in Glaurung: the default settings are identical to what is found in the polyglot.ini file supplied with Glaurung, apart from the hash size and the number of threads.
Tord

The results of Glaurung 2.1 in another thread suggest that Glaurung 2.1 has no special problems against Crafty.
I think that usually, when A is significantly stronger than B, every opponent is going to perform worse against A than against B.
You can verify it with the CCRL FRC results:
http://computerchess.org.uk/ccrl/404FRC ... t_all.html
For example, Glaurung 2.1 scored better than 2.01 against common weaker opponents, with only 100 games in each match:
Movei 00.8.438 70.5 − 29.5 (2.01: 67 − 33)
Pharaon 3.5.1 76.5 − 23.5 (2.01: 70.5 − 29.5)
Hamsters 0.6 2595 76.5 − 23.5 (2.01: 75 − 25)
Ufim 8.02 2590 82 − 18 (2.01: 79 − 21)
Uri
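The four match results above can be tallied as follows. This is just bookkeeping on the quoted scores (points out of 100 games for the Glaurung side); the dictionary layout is made up for illustration, and no statistical claim is attached.

```python
# Opponent -> (Glaurung 2.1 points, Glaurung 2.01 points), each out of 100 games.
results = {
    "Movei 00.8.438": (70.5, 67.0),
    "Pharaon 3.5.1": (76.5, 70.5),
    "Hamsters 0.6": (76.5, 75.0),
    "Ufim 8.02": (82.0, 79.0),
}

# Per-opponent gain of 2.1 over 2.01, and totals over all 400 games.
gains = {name: new - old for name, (new, old) in results.items()}
total_new = sum(new for new, _ in results.values())
total_old = sum(old for _, old in results.values())

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
print(f"total: {total_new} vs {total_old} out of 400")
```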
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: error in testing...
Tord Romstad wrote:
bob wrote:
Dirt wrote:
bob wrote: I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1. I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.
What difference is there in the polyglot.ini files that would cause a substantial change?
There are some scoring parameters. For example, old:
Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130
Actually, these parameters seem to be from the polyglot.ini file of Glaurung 1. Several of the above parameters don't even exist in Glaurung 2.
I don't think using the old polyglot.ini file will hurt the strength significantly, though. I think the biggest part of the explanation for Glaurung 2.1's poor results compared to 2 - ε/5 is simply that it has problems against Crafty.
bob wrote: I have Crafty set so that default settings are optimal so that setting hash size and max threads is enough....
I do exactly the same in Glaurung: the default settings are identical to what is found in the polyglot.ini file supplied with Glaurung, apart from the hash size and the number of threads.
Tord

Actually, the new one is doing much better, if you have seen the data I published in the q-search checks thread. Something in the old polyglot file was not so good for the new one; I am not sure what...