Sloppy experiment, results after 1 cycle

Discussion of computer chess matches and engine tournaments.

Tony Thomas

Sloppy experiment, results after 1 cycle

Post by Tony Thomas »

Ilari hated Sloppy's results in my tournament; his own experiments showed Sloppy to be far superior. So I decided to do an experiment and run the same version against a set of stronger opponents. After 1 cycle Sloppy is doing much better than I expected. At its current performance Sloppy is heading for rank #16 in my best-versions list, whereas the test of the same version against weaker opponents suggested it was only good enough for rank #44. It could all well be statistical noise; we shall see how it pans out. The rating was calculated using an offset of 2300.

Code: Select all

35 Sloppy 0.2.0 X                2696  107  109    33   45%  2701   12% 
Here is the detailed elostat output.

Code: Select all

38 Sloppy 0.2.0 X            : XXXX  33 (+ 13,=  4,- 16), 45.5 %

Rybka v1.0 Beta.w32           :   1 (+  0,=  0,-  1),  0.0 %
WildCat 7.0                   :   1 (+  1,=  0,-  0), 100.0 %
Spike 1.2 Turin               :   1 (+  0,=  0,-  1),  0.0 %
Smarthink 1.00                :   1 (+  1,=  0,-  0), 100.0 %
Prodeo 1.2                    :   1 (+  1,=  0,-  0), 100.0 %
Trace 1.37a                   :   1 (+  0,=  0,-  1),  0.0 %
Gandalf 6.01                  :   1 (+  0,=  1,-  0), 50.0 %
Ktulu 8.0                     :   1 (+  0,=  0,-  1),  0.0 %
Thinker 4.7a                  :   1 (+  0,=  0,-  1),  0.0 %
Pharaon 3.5.1                 :   1 (+  1,=  0,-  0), 100.0 %
SOS 5.1                       :   1 (+  1,=  0,-  0), 100.0 %
Ruffian 1.0.5                 :   1 (+  1,=  0,-  0), 100.0 %
SlowChess Blitz WV 2.1        :   1 (+  1,=  0,-  0), 100.0 %
Aristarch 4.50                :   1 (+  1,=  0,-  0), 100.0 %
CM10th D2Alos                 :   1 (+  1,=  0,-  0), 100.0 %
ChessTiger2007.1 UCI          :   1 (+  0,=  0,-  1),  0.0 %
Fruit 2.3                     :   1 (+  0,=  0,-  1),  0.0 %
DeepSjeng27                   :   1 (+  0,=  0,-  1),  0.0 %
Delfi 5.2                     :   1 (+  0,=  0,-  1),  0.0 %
Movei00_8_438                 :   1 (+  0,=  0,-  1),  0.0 %
BugChess2_V1_5_2              :   1 (+  1,=  0,-  0), 100.0 %
Shredder11UCI                 :   1 (+  1,=  0,-  0), 100.0 %
Crafty 21.6 JA                :   1 (+  0,=  0,-  1),  0.0 %
Scorpio 2.0                   :   1 (+  0,=  1,-  0), 50.0 %
AlaricWB707                   :   1 (+  0,=  0,-  1),  0.0 %
Glaurung 2.0.1 JA             :   1 (+  1,=  0,-  0), 100.0 %
Hiarcs11.2SPUCI               :   1 (+  0,=  0,-  1),  0.0 %
Bright-0.2c                   :   1 (+  0,=  0,-  1),  0.0 %
Frenzee Dec 07                :   1 (+  0,=  1,-  0), 50.0 %
Zappa Mexico II               :   1 (+  0,=  0,-  1),  0.0 %
TogaII 1.4 beta 5c            :   1 (+  0,=  0,-  1),  0.0 %
Naum 3                        :   1 (+  0,=  1,-  0), 50.0 %
Crafty 22.0 JA                :   1 (+  1,=  0,-  0), 100.0 %
hgm

Re: Sloppy experiment, results after 1 cycle

Post by hgm »

Yes, you see how dangerous it is to do only one-sided testing. It is really important to have both weaker and stronger opponents. Some engines have a much wider distribution of performance than the standard Elo curve assumes (Joker also seems to suffer from this). Such engines look poor if you test them against weaker opponents, and very good when you test them against stronger opponents.

My theory is that this is because they are buggy (randomly giving away games regardless of opponent strength), or that they rely purely on tactics for their strength, with poor strategic insight (as the opportunity for a winning tactical shot will only occur in a fraction of the games).
Tony Thomas

Re: Sloppy experiment, results after 1 cycle

Post by Tony Thomas »

My weaker opponents are only relatively weaker; they are technically very strong compared to most other engines. Sloppy does tend to give away material and gets draws against opponents even when the evaluation of both programs shows a 4-pawn advantage for Sloppy. By the way, you never answered my question about BayesElo: is it possible to calculate the rating of engines with a minimum number of games? Also, prior 0 didn't work for me at all; after a few iterations BayesElo stopped calculating.

Here is the list of previous opponents for Sloppy.

Code: Select all

25 Sloppy 0.2.0 JA           : xxxx  124 (+ 50,= 25,- 49), 50.4 %

Amyan 1.597                   :   4 (+  1,=  1,-  2), 37.5 %
Alaric 061011                 :   4 (+  2,=  1,-  1), 62.5 %
Arasan 9.4                    :   4 (+  3,=  1,-  0), 87.5 %
Comet_B68                     :   4 (+  3,=  0,-  1), 75.0 %
Green Light Chess 3.01.2.2    :   4 (+  2,=  1,-  1), 62.5 %
Patzer 3.80                   :   4 (+  4,=  0,-  0), 100.0 %
Nejmet_307                    :   4 (+  3,=  0,-  1), 75.0 %
Trace 1.36 NC3                :   4 (+  0,=  2,-  2), 25.0 %
Quark 2.35                    :   4 (+  3,=  1,-  0), 87.5 %
Zarkov 4.89                   :   4 (+  0,=  1,-  3), 12.5 %
AnMon 5.60                    :   4 (+  2,=  1,-  1), 62.5 %
Petir 3.99d                   :   4 (+  3,=  0,-  1), 75.0 %
Yace.099.56                   :   4 (+  1,=  0,-  3), 25.0 %
Pepit 1.59                    :   4 (+  2,=  2,-  0), 75.0 %
Zappa 1.1                     :   4 (+  2,=  1,-  1), 62.5 %
Ufim 8.02                     :   4 (+  2,=  0,-  2), 50.0 %
Frenzee 3.0                   :   4 (+  1,=  0,-  3), 25.0 %
Tao 5.6 Blitz Specialist      :   4 (+  2,=  1,-  1), 62.5 %
WildCat 6.0                   :   4 (+  1,=  1,-  2), 37.5 %
Pseudo07c                     :   4 (+  1,=  1,-  2), 37.5 %
Chiron v0.8.1b10.3            :   4 (+  1,=  0,-  3), 25.0 %
Sjeng 1.6                     :   4 (+  0,=  2,-  2), 25.0 %
Bright 0.1                    :   4 (+  1,=  2,-  1), 50.0 %
Baron 1.8.1                   :   4 (+  0,=  4,-  0), 50.0 %
Movei00_8_403 10 10 10        :   4 (+  2,=  0,-  2), 50.0 %
Colossus2007d                 :   4 (+  1,=  0,-  3), 25.0 %
BugChess2_V1_5_2              :   4 (+  3,=  0,-  1), 75.0 %
Sloppy-0.1.1 JA               :   4 (+  2,=  0,-  2), 50.0 %
E.T.Chess 13.01.08            :   4 (+  0,=  0,-  4),  0.0 %
Booot 4.14.0                  :   4 (+  0,=  2,-  2), 25.0 %
Petir 4.999999                :   4 (+  2,=  0,-  2), 50.0 %
hgm

Re: Sloppy experiment, results after 1 cycle

Post by hgm »

I probably missed that question, and I must admit I am not sure I understand what exactly you are asking now.

Do you mean the minimum number of games an engine should have before you can calculate its rating? I think this is 1 game. Whether the calculated rating means anything is another matter, but you will see that from the reported error bars.

With BayesElo it is never harmful to have engines with a low number of games in your data set. BayesElo will either ignore them as far as the ratings of the other engines are concerned, or derive from them what little information it can to improve the accuracy of the other engines' ratings.

For example, suppose you play engine A against 1000 different engines, 1 game each, and then play engine B against the same 1000 engines, 1 game each. If A scored 60% and B scored 50%, you have pretty accurately determined the rating difference between A and B to be ~110 Elo points, despite the fact that 1000 of your 1002 engines have played only 2 games. If you removed all the engines with two games from your data set, you would have no games left and would know nothing about the rating difference between A and B. Even worse, if A and B had also played 10 games against each other, won by B 6-4, you would think that B is the stronger one, while in fact it was just lucky, as beating a 100-point stronger engine 6-4 is quite normal. The results against the engines with few games contribute as much to the accuracy of the ratings of A and B as 250 games directly between them would (without the systematic errors you get by playing the same engines too often), so they are far more informative than anything you could learn from the 10 mutual games. BayesElo handles all this correctly.
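As an aside, here is a minimal back-of-the-envelope sketch of that comparison in Python. It uses the plain logistic Elo formula rather than BayesElo's actual draw-aware model, so the absolute point values come out somewhat different from the ~110 quoted above; the game counts are the ones from the example, and all names and formulas are purely illustrative. The point it makes is about the error bars.

Code: Select all

import math

def elo_from_score(score):
    """Rating advantage implied by a score fraction, plain logistic Elo model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_std_error(score, games):
    """Binomial standard error of a score fraction measured over `games` games."""
    return math.sqrt(score * (1.0 - score) / games)

def elo_std_error(score, games):
    """Propagate the score error through the local slope of the logistic curve."""
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return slope * score_std_error(score, games)

# A scores 60% and B scores 50%, each over 1000 single games against a common pool.
pool_diff = elo_from_score(0.60) - elo_from_score(0.50)
pool_err = math.hypot(elo_std_error(0.60, 1000), elo_std_error(0.50, 1000))

# A loses a 10-game head-to-head match against B by 4-6.
direct_diff = elo_from_score(0.40)
direct_err = elo_std_error(0.40, 10)

print("pooled estimate: %+4.0f Elo (std error %3.0f)" % (pool_diff, pool_err))
print("direct estimate: %+4.0f Elo (std error %3.0f)" % (direct_diff, direct_err))

Under these assumptions the pooled comparison pins the A-B difference down to an error bar of roughly 15 Elo, while the 10 mutual games alone leave an uncertainty of well over 100 Elo, which is exactly why the 6-4 result proves nothing.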

Now prior=0 is tricky if there are engines in your set with 0% or 100% scores. Such engines could be infinitely weak or infinitely strong, respectively, and only prior knowledge of their strength would prevent you from concluding that, so it might be that BayesElo chokes on such infinities without a prior. The solution would be either to remove those engines from the data set, or to use a very small prior (I think it accepts fractional priors, like 0.001). The 100% engines will then settle at extremely high ratings (e.g. +10,000), which makes it easy to cut them off the list after the rating calculation. Again, the fact that these engines were in the calculation will not have affected the ratings of the others, as everyone is expected to have a 100% guaranteed loss against a 10,000-Elo opponent, which indeed they all did.
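For reference, a bayeselo session along those lines might look roughly like the following; games.pgn is a placeholder file name, and the exact command set can differ between bayeselo versions, so treat this as a sketch rather than a recipe. Here prior 0.001 stands in for the tiny fractional prior suggested above.

Code: Select all

readpgn games.pgn
elo
prior 0.001
mm
ratings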
Tony Thomas

Re: Sloppy experiment, results after 1 cycle

Post by Tony Thomas »

I have too many engines in my list with 2-20 games, mostly due to incompatibilities. It would be better if I could define the minimum number of games an engine must have for me to get its rating. I do not want them to be excluded from the rating calculation, but I don't want them on the final list. If such an option exists, it would be much easier for me to edit my best-versions list. I prefer to get at least 100-120 games per engine, but I have lowered my standards since it isn't always possible to get so many games with around 400 engines. My magic number currently is 60.

Thanks for the reply

Tony
ilari

Re: Sloppy experiment, results after 1 cycle

Post by ilari »

Again, thank you for testing Sloppy. Seems that the results are now more in line with my estimation. That doesn't of course mean that they're more correct, as my test conditions are far from ideal. I run hundreds of games after each change, but only against opponents that are very close in strength, and only on one time control.

Tony Thomas wrote: Sloppy does tend to give away material and gets draws against opponents even when the evaluation of both programs shows a 4-pawn advantage for Sloppy.
True. Sloppy doesn't currently have any special endgame knowledge, so it needs a big lead to secure a win. That's why many won positions end up as draws even against weaker opponents. The middle game on the other hand is pretty good, which may explain Sloppy's ability to score a lot of draws (and even some wins) against much stronger opponents.

For the next version I'm going to focus a lot more on the endgame evaluation. In the meantime one can boost Sloppy's endgame by letting it use bitbases.
Tony Thomas

Re: Sloppy experiment, results after 1 cycle

Post by Tony Thomas »

Sloppy wasn't able to hold on to the promising results of the first cycle. After finishing all 4 cycles, Sloppy slipped down the rankings once again. Its final rating is 2572, 10 points better than the rating it got against the weaker engines. Its final ranking is 42, 2 ranks better than its rank against the weaker engines. Only one thing is for sure: everything happened within the error margin. I still don't understand why Sloppy has such a low rating. It has a higher score than BugChess and a few other engines in my tournament, but Sloppy is ranked lower. Maybe BugChess managed to thrash previous versions of some strong engines before new versions replaced them, who knows. Note that BugChess did play against a wider variety of opponents: it started in my 3rd division, won it, and was promoted to the 2nd division, then to the 1st, and then to the Premier division, so it has a total of 450+ games. Maybe I should let Sloppy walk the same path.

Code: Select all

Sloppy 0.2.0 X                2572   56   58   129   33%  2705   12%
Sloppy 0.2.0 JA               2562   53   53   124   50%  2559   20% 
Rémi Coulom

Re: Sloppy experiment, results after 1 cycle

Post by Rémi Coulom »

Tony Thomas wrote: I have too many engines in my list with 2-20 games, mostly due to incompatibilities. It would be better if I could define the minimum number of games an engine must have for me to get its rating. I do not want them to be excluded from the rating calculation, but I don't want them on the final list. If such an option exists, it would be much easier for me to edit my best-versions list. I prefer to get at least 100-120 games per engine, but I have lowered my standards since it isn't always possible to get so many games with around 400 engines. My magic number currently is 60.

Thanks for the reply

Tony
The "ratings" command in bayeselo takes a parameter that does just this. That is to say, if you run "ratings 60", it will only display ratings of programs with at least 60 games (but will use all the games for the computation).

Rémi
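For example, combined with the usual loading and fitting steps (games.pgn again being just a placeholder file name), such a session might look roughly like this:

Code: Select all

readpgn games.pgn
elo
mm
ratings 60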
Tony Thomas

Re: Sloppy experiment, results after 1 cycle

Post by Tony Thomas »

Rémi Coulom wrote:
The "ratings" command in bayeselo takes a parameter that does just this. That is to say, if you run "ratings 60", it will only display ratings of programs with at least 60 games (but will use all the games for the computation).
Thanks so much, Rémi... Was that instruction on the website? I don't remember seeing it.