H4 or S5 !?

Ajedrecista · Post by **Ajedrecista** » Mon Jun 02, 2014 9:52 pm

Hello:

michiguel wrote:

IWB wrote: Hello all,

This is quite interesting:

The official method for the IPON is Bayeselo with mm 0 1, draw rate consideration. The pure TOP 16 one on one looks like this:

   1 Houdini 4           3111    9    9  3300   75%  2921   31% 
   2 Stockfish 5         3106    9    8  3300   75%  2921   39% 
   3 Komodo 7a           3088    9    9  3300   72%  2922   37% 
   4 Gull 3              3057    8    8  3300   68%  2924   41% 
   5 Critter 1.4a        2980    8    8  3300   57%  2930   46% 
   6 Equinox 2.02        2975    8    8  3300   56%  2930   47% 
   7 Deep Rybka 4.1      2959    8    8  3300   54%  2931   45% 
   8 Deep Fritz 14       2894    8    8  3300   44%  2935   45% 
   9 Chiron 2            2889    8    8  3300   44%  2936   45% 
  10 Protector 1.6.0     2870    8    8  3300   41%  2937   44% 
  11 Hannibal 1.4b       2870    8    8  3300   41%  2937   43% 
  12 Naum 4.2            2838    8    9  3300   36%  2939   41% 
  13 Texel 1.04          2838    8    8  3300   37%  2939   38% 
  14 Senpai 1.0          2838    8    8  3300   36%  2939   41% 
  15 HIARCS 14 WCSC 32b  2812    9    9  3300   33%  2941   37% 
  16 Jonny 6.00          2798    9    9  3300   31%  2942   36%

The same set of data with Bayes default:

   1 Houdini 4           3111   11   11  3300   75%  2931   31% 
   2 Stockfish 5         3105   10   10  3300   75%  2931   39% 
   3 Komodo 7a           3088   10   10  3300   72%  2932   37% 
   4 Gull 3              3057   10   10  3300   68%  2934   41% 
   5 Critter 1.4a        2984   10    9  3300   57%  2939   46% 
   6 Equinox 2.02        2980    9   10  3300   56%  2939   47% 
   7 Deep Rybka 4.1      2964   10   10  3300   54%  2940   45% 
   8 Deep Fritz 14       2905    9   10  3300   44%  2944   45% 
   9 Chiron 2            2900   10   10  3300   44%  2945   45% 
  10 Protector 1.6.0     2883   10   10  3300   41%  2946   44% 
  11 Hannibal 1.4b       2883   10   10  3300   41%  2946   43% 
  12 Naum 4.2            2854   10   10  3300   36%  2948   41% 
  13 Texel 1.04          2854   10   10  3300   37%  2948   38% 
  14 Senpai 1.0          2853   10   10  3300   36%  2948   41% 
  15 HIARCS 14 WCSC 32b  2830   10   10  3300   33%  2949   37% 
  16 Jonny 6.00          2816   10   10  3300   31%  2950   36%

Now with Elostat:

  1 Stockfish 5                    : 3115   10  10  3300    74.9 %   2924   38.6 %
  2 Houdini 4                      : 3111   11  10  3300    74.5 %   2925   30.7 %
  3 Komodo 7a                      : 3091   10  10  3300    72.1 %   2926   37.0 %
  4 Gull 3                         : 3059    9   9  3300    68.0 %   2928   41.0 %
  5 Critter 1.4a                   : 2982    9   9  3300    57.0 %   2933   46.1 %
  6 Equinox 2.02                   : 2978    9   9  3300    56.3 %   2933   46.9 %
  7 Deep Rybka 4.1                 : 2962    9   9  3300    53.9 %   2935   45.2 %
  8 Deep Fritz 14                  : 2899    9   9  3300    44.4 %   2939   44.9 %
  9 Chiron 2                       : 2894    9   9  3300    43.5 %   2939   45.1 %
 10 Protector 1.6.0                : 2877    9   9  3300    40.9 %   2940   44.1 %
 11 Hannibal 1.4b                  : 2875    9   9  3300    40.7 %   2940   42.6 %
 12 Texel 1.04                     : 2846    9   9  3300    36.5 %   2942   38.5 %
 13 Naum 4.2                       : 2845    9   9  3300    36.4 %   2942   40.9 %
 14 Senpai 1.0                     : 2845    9   9  3300    36.3 %   2942   40.7 %
 15 HIARCS 14 WCSC 32b             : 2822   10  10  3300    33.2 %   2944   37.5 %
 16 Jonny 6.00                     : 2808   10  10  3300    31.2 %   2945   35.7 %

and finally with Ordo:

   # PLAYER                : RATING    POINTS  PLAYED    (%)
   1 Stockfish 5           : 3115.1    2473.0    3300   74.9%
   2 Houdini 4             : 3111.0    2458.5    3300   74.5%
   3 Komodo 7a             : 3089.3    2379.0    3300   72.1%
   4 Gull 3                : 3054.9    2245.5    3300   68.0%
   5 Critter 1.4a          : 2968.9    1882.0    3300   57.0%
   6 Equinox 2.02          : 2963.8    1859.5    3300   56.3%
   7 Deep Rybka 4.1        : 2945.6    1778.5    3300   53.9%
   8 Deep Fritz 14         : 2875.7    1464.5    3300   44.4%
   9 Chiron 2              : 2869.4    1436.5    3300   43.5%
  10 Protector 1.6.0       : 2850.1    1351.0    3300   40.9%
  11 Hannibal 1.4b         : 2848.3    1343.0    3300   40.7%
  12 Texel 1.04            : 2816.4    1204.5    3300   36.5%
  13 Naum 4.2              : 2815.5    1200.5    3300   36.4%
  14 Senpai 1.0            : 2814.9    1198.0    3300   36.3%
  15 HIARCS 14 WCSC 32b    : 2790.6    1096.0    3300   33.2%
  16 Jonny 6.00            : 2774.4    1030.0    3300   31.2%

That is very good, as everyone can take the list he likes.

Regards
Ingo

There is only one correct answer, and that is that SF5 should be #1 (by a very small margin, though). Why? This is a round robin, so everybody played everybody else under the same conditions, and the program that scores the most points overall should be #1. This is one of the cases in which there is no doubt about the relative order. As a reference, the Ordo output shows the actual points (the others give only percentages). Whatever program you use, the relative order should follow the number of points. Basically, SF won this gigantic RR tournament and should be #1.

1 Stockfish 5 : 3115.1 2473.0 3300 74.9%
2 Houdini 4 : 3111.0 2458.5 3300 74.5%

Miguel

I agree with Miguel here: if SF won the RR (more points than its opponents), then it should have the top rating. Of course the differences are very narrow and well inside the error bars, as both Ingo and Miguel wrote. A similar thing happens with Naum and Texel.
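For anyone who wants to check the ordering argument numerically, here is a minimal Python sketch. It takes the points from the Ordo list quoted above and the listing order of the Bayeselo (mm 0 1) table, then reports every pair where an engine is listed above another one that scored more points. The function name `inversions` is made up for this illustration; note that Bayeselo prints rounded ratings, so the listing order is used rather than the printed numbers.

```python
# Points copied from the Ordo list quoted above.
points = {
    "Stockfish 5": 2473.0, "Houdini 4": 2458.5, "Komodo 7a": 2379.0,
    "Gull 3": 2245.5, "Critter 1.4a": 1882.0, "Equinox 2.02": 1859.5,
    "Deep Rybka 4.1": 1778.5, "Deep Fritz 14": 1464.5, "Chiron 2": 1436.5,
    "Protector 1.6.0": 1351.0, "Hannibal 1.4b": 1343.0, "Texel 1.04": 1204.5,
    "Naum 4.2": 1200.5, "Senpai 1.0": 1198.0, "HIARCS 14 WCSC 32b": 1096.0,
    "Jonny 6.00": 1030.0,
}

# Listing order of the Bayeselo (mm 0 1) table quoted above.
bayeselo_order = [
    "Houdini 4", "Stockfish 5", "Komodo 7a", "Gull 3",
    "Critter 1.4a", "Equinox 2.02", "Deep Rybka 4.1", "Deep Fritz 14",
    "Chiron 2", "Protector 1.6.0", "Hannibal 1.4b", "Naum 4.2",
    "Texel 1.04", "Senpai 1.0", "HIARCS 14 WCSC 32b", "Jonny 6.00",
]


def inversions(order, pts):
    """Return pairs (a, b) where a is listed above b but scored fewer points."""
    found = []
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if pts[a] < pts[b]:
                found.append((a, b))
    return found


# Sanity check: every game hands out exactly one point in total,
# so the points must add up to 16 * 3300 / 2 games.
total_points = sum(points.values())
pairs = inversions(bayeselo_order, points)

print("total points:", total_points)
for a, b in pairs:
    print(f"{a} is listed above {b} but scored fewer points")
```

With the data above, the only pairs reported are Houdini 4 / Stockfish 5 and Naum 4.2 / Texel 1.04, which are the two cases discussed in this thread.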

I ran my own clumsy rating programme (which is very similar to EloSTAT in the results). Engine 01, engine 02, etc. correspond to the engine in that position of EloSTAT and Ordo lists:

Round Robin with 16 engines and   3300 games per engine.
Total number of games:     26400 games.
 
 Engines:     Performance:     Score:
 
Engine 01:      3114.77       74.94 %
Engine 02:      3111.00       74.50 %
Engine 03:      3091.03       72.09 %
Engine 04:      3059.72       68.05 %
Engine 05:      2983.07       57.03 %
Engine 06:      2978.56       56.35 %
Engine 07:      2962.47       53.89 %
Engine 08:      2900.56       44.38 %
Engine 09:      2894.97       43.53 %
Engine 10:      2877.75       40.94 %
Engine 11:      2876.12       40.70 %
Engine 12:      2847.39       36.50 %
Engine 13:      2846.54       36.38 %
Engine 14:      2846.01       36.30 %
Engine 15:      2823.90       33.21 %
Engine 16:      2809.04       31.21 %
 
Mean of ratings:  2938.93 Elo.
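As a rough cross-check of performance figures like these, the score can be converted to a rating difference with the usual logistic Elo relation. This is only a sketch under that assumption: EloSTAT and the tool used here may differ in details such as iteration and rounding. The helper name `performance` is hypothetical, and the opponent averages (2924 and 2925) are read from the second-to-last rating column of the EloSTAT list quoted above.

```python
import math


def performance(score, avg_opponent):
    """Performance rating from a score fraction (0 < score < 1),
    assuming the logistic Elo curve: diff = 400 * log10(p / (1 - p))."""
    diff = 400.0 * math.log10(score / (1.0 - score))
    return avg_opponent + diff


# Score fractions from the Ordo points (points / games played).
sf5 = performance(2473.0 / 3300, 2924)
h4 = performance(2458.5 / 3300, 2925)

print(f"Stockfish 5: {sf5:.1f}")
print(f"Houdini 4:   {h4:.1f}")
```

Both values land within about one Elo point of the EloSTAT ratings (3115 and 3111), which suggests that the small differences between the lists come from how each program treats draws and averages the opposition, not from the raw scores.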

I fixed Houdini 4's rating to 3111, just as Ingo did. Then I compared the ratings in a way suggested by Peter Österlund some time ago: by their z-score = (rating - average)/(sample standard deviation).

Bayeselo    Bayeselo    EloSTAT     Ordo        My tool
(mm 0 1)    (default)

  313         295         307         340.7       305.73   Max.(list) - min.(list)

2932.69     2941.69     2938.06     2918.99     2938.93   Average of ratings.
 108.63      101.75      106.67      118.76      106.17   Sample standard deviation of ratings.

 1.6415      1.6641      1.6212      1.6168      1.6207   Houdini 4
 1.5955      1.6051      1.6587      1.6513      1.6562   Stockfish 5
 1.4298      1.4380      1.4337      1.4341      1.4326   Komodo 7a
 1.1444      1.1333      1.1337      1.1444      1.1377   Gull 3
 0.4356      0.4159      0.4119      0.4202      0.4157   Critter 1.4a
 0.3895      0.3765      0.3744      0.3773      0.3733   Equinox 2.02
 0.2422      0.2193      0.2244      0.2240      0.2217   Deep Rybka 4.1
-0.3561     -0.3606     -0.3662     -0.3646     -0.3614   Deep Fritz 14
-0.4022     -0.4097     -0.4131     -0.4176     -0.4141   Chiron 2
-0.5771     -0.5768     -0.5724     -0.5801     -0.5763   Protector 1.6.0
-0.5771     -0.5768     -0.5912     -0.5953     -0.5916   Hannibal 1.4b
-0.8717     -0.8618     -0.8724     -0.8715     -0.8702   Naum 4.2
-0.8717     -0.8618     -0.8630     -0.8639     -0.8622   Texel 1.04
-0.8717     -0.8717     -0.8724     -0.8765     -0.8752   Senpai 1.0
-1.1110     -1.0977     -1.0880     -1.0812     -1.0835   HIARCS 14 WCSC 32b
-1.2399     -1.2353     -1.2193     -1.2176     -1.2235   Jonny 6.00

I hope there are no typos. The z-scores are more or less similar in general (even in the Texel-Naum case) but very different for SF and H (Bayeselo versus the rest). Naum won fewer points than Texel but with a higher draw ratio, so I am starting to think that the draw ratio is not the most important factor... just a wild guess.
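The z-score comparison is easy to reproduce. Here is a minimal Python sketch using only the standard library and the EloSTAT ratings from the list quoted above; the variable names are just for illustration. `statistics.stdev` computes the sample standard deviation (n - 1 in the denominator), which is what the table uses.

```python
from statistics import mean, stdev

# Ratings copied from the EloSTAT list quoted above.
elostat = {
    "Stockfish 5": 3115, "Houdini 4": 3111, "Komodo 7a": 3091,
    "Gull 3": 3059, "Critter 1.4a": 2982, "Equinox 2.02": 2978,
    "Deep Rybka 4.1": 2962, "Deep Fritz 14": 2899, "Chiron 2": 2894,
    "Protector 1.6.0": 2877, "Hannibal 1.4b": 2875, "Texel 1.04": 2846,
    "Naum 4.2": 2845, "Senpai 1.0": 2845, "HIARCS 14 WCSC 32b": 2822,
    "Jonny 6.00": 2808,
}

ratings = list(elostat.values())
avg = mean(ratings)   # average of ratings
sd = stdev(ratings)   # sample standard deviation

# z-score = (rating - average) / (sample standard deviation)
z = {name: (r - avg) / sd for name, r in elostat.items()}

print(f"average = {avg:.2f}, sample SD = {sd:.2f}")
for name, value in z.items():
    print(f"{value:8.4f}   {name}")
```

Swapping in the ratings of another list (Bayeselo, Ordo) gives the other columns. Because the z-score removes both the offset and the scale of each list, it allows a comparison between programs that compress or stretch the rating range differently.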

Regards from Spain.

Ajedrecista.

michiguel · Post by **michiguel** » Mon Jun 02, 2014 9:53 pm

IWB wrote:
michiguel wrote: I do not understand what you mean. What argument have I had before? I am confused about those numbers 6,7 3,4 13 and 14. What are those?

That was just meant as an example: what is now obvious for No. 1 and 2 might have been the case in the past for the engines ranked 6 and 7, or 3 and 4, or whatever pair you like. Those were cases where nobody cared ... and now it is suddenly important? (Because of 5 Elo, which are fully within one SD ... No! It is because of the number in front, whether it is a one or a two.)

My problem is that people usually do not mind the conditions but just the rankings! Worse, they look for No. 1, 2 and maybe 3. That's it!

At least we agree that there is very little difference between the top engines.

Bye
Ingo

Since you mention it, there is also a difference between Texel and Naum relative positions. I did not bring that up because my answer was general.

Miguel
EDIT: Jesus mentioned it in an almost simultaneous message.

Uri Blass · Post by **Uri Blass** » Mon Jun 02, 2014 11:56 pm

IWB wrote:

Even if I can follow your argumentation, here are 3 arguments which are valid as well:

1. Your argument has been there for years for No. 6 and 7, or 3 and 4, or 13 and 14, but nobody cared; on the contrary, the draw consideration was an important argument ... and now it is wrong?
2. Humans tend to value a decisive game more than a tie, hence a small reward for 8% more decided games is not that bad ...

Bye
Ingo

I disagree that humans value a decisive game more than a tie when the result is equal.

I feel that it is the opposite.

If A beats B 40-0 with 960 draws, then it shows that A is clearly the better player.

If A beats B 520-480 with no draws, then I am not even sure that A is stronger than B. So my feeling is that if two programs have the same score, the program that got more draws should be number 1.
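One way to put numbers on this intuition is to ask how many standard errors each result lies above a 50% score. The Python sketch below does that with a plain normal approximation, treating each game as an independent draw from the observed win/draw/loss frequencies. That model and the helper name `score_z` are assumptions made for this illustration, not something taken from any of the rating programs discussed here.

```python
import math


def score_z(wins, draws, losses):
    """Return (score, standard_error, z) for a match result, where z is the
    distance of the score from 50% in units of its standard error."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    return score, se, (score - 0.5) / se


score_a, se_a, z_a = score_z(40, 960, 0)    # 40-0 with 960 draws
score_b, se_b, z_b = score_z(520, 0, 480)   # 520-480 with no draws

print(f"40-0 (+960 draws): score {score_a:.3f}, z = {z_a:.2f}")
print(f"520-480 (no draws): score {score_b:.3f}, z = {z_b:.2f}")
```

Both matches end with the same 52% score, but the drawish one is roughly 6.5 standard errors above 50% while the decisive one is only about 1.3, which supports the feeling that the 40-0 result is much stronger evidence that A is the better player.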

lucasart · Post by **lucasart** » Tue Jun 03, 2014 12:53 am

IWB wrote:

   1 Houdini 4           3111    9    9  3300   75%  2921   31% 
   2 Stockfish 5         3106    9    8  3300   75%  2921   39%

SF gets far too many draws compared to H. It's a shame Marco does not want to hear about it, but SF really needs a non-zero contempt by default.

Also, SF is playing without Syzygy. That alone wins at least 10 Elo, and would get it past Houdini here.

I'm pretty sure that SF+syzygy+contempt=10 would be #1 in this list.

carldaman · Post by **carldaman** » Tue Jun 03, 2014 1:13 am

lucasart wrote:
IWB wrote:
   1 Houdini 4           3111    9    9  3300   75%  2921   31% 
   2 Stockfish 5         3106    9    8  3300   75%  2921   39%
SF gets far too many draws compared to H. It's a shame Marco does not want to hear about it, but SF really needs a non-zero contempt by default.

Also, SF is playing without Syzygy. That alone wins at least 10 Elo, and would get it past Houdini here.

I'm pretty sure that SF+syzygy+contempt=10 would be #1 in this list.

Isn't H4 also playing without syzygy? Regardless, conditions should be the same for the purpose of fairness and accuracy - either H4, SF5 and K7 all use syzygy, or they all don't.

Regards,
CL

carldaman · Post by **carldaman** » Tue Jun 03, 2014 1:19 am

michiguel wrote:

If you had exactly the same opposition and got more points, how can you not have higher rating?
Miguel

I'm with Miguel on this one. Any rating system that gives the top scorer a 2nd best rating, in a multi-RR that all ratings are based upon, is not adequate.

CL

John_F · Post by **John_F** » Tue Jun 03, 2014 1:21 am

SF gets far too many draws compared to H. It's a shame Marco does not want to hear about it, but SF really needs a non-zero contempt by default.

That is what I was inclined to think too, but Stefan Pohl actually tested the idea. It seems the contempt setting doesn't improve results for Stockfish, for whatever reason.

I did some LS tests with Stockfish with Contempt=15 and Contempt=30 (see Fishcooking). The draw rate dropped a little bit, but the score didn't get better.

Stefan

IWB · Post by **IWB** » Tue Jun 03, 2014 6:39 am

carldaman wrote:
Isn't H4 also playing without syzygy? Regardless, conditions should be the same for the purpose of fairness and accuracy - either H4, SF5 and K7 all use syzygy, or they all don't.

CL

H4 was playing with Nalimov (or Gaviota, I have to check).
I disagree here. I offer 4pc TBs to everyone. If an engine can't (or doesn't want to) use them, it is not a matter of fairness but the decision of the program's author.

Anyhow, as I am curious, I will run the S5 match again with 4pc Syzygy bases. I personally have some doubts that it will improve the rating, but if it does so significantly I will replace the original SF5 with this one.

Bye
Ingo

PS: I'll try to start this evening

IWB · Post by **IWB** » Tue Jun 03, 2014 6:42 am

michiguel wrote:
Since you mention it, there is also a difference between Texel and Naum relative positions. I did not bring that up because my answer was general.
...

Ahh thx, I did not see that myself, but I know that I had similar "wrong" rankings in the past.

That's Bayes, I can't help it.

Bye
Ingo

IWB · Post by **IWB** » Tue Jun 03, 2014 6:46 am

Ajedrecista wrote:
... Naum won fewer points than Texel but with a higher draw ratio, so I start to think that the draw ratio is not the most important factor...

Ahh, interesting. If it is not the draw rate, why is the ranking in that order with Bayes? A question for anyone who can explain it in simple words.

Thx for this
Ingo
