Don wrote:That list is completely broken, you cannot use it for making a point about anything if you want to be taken seriously.

Houdini wrote:A recent illustration of what I was saying, Ahmed Kamal's Top 10 Rating List at the Chess2U forum:

Houdini wrote:There's so much more in Houdini 2 than its 25 Elo increase (when you look at more test results than just the IPON list you'll find that this is about the average gain that is found).
http://www.chess2u.com/t3992-top10-rati ... ember-2011
Houdini 2.0 (1530 games) +54 ELO improvement over Houdini 1.5a.
It's amazing how much the results from the various rating lists differ...
Robert
Shredder is rated 2734 on that list - so to compare it to Ingo's list you must add 66 ELO which makes Houdini 1.5 3017 + 66 = 3083 and Houdini 2.0 would be 3137. Even WITHOUT the 66 ELO adjustment Houdini 1.5 comes out stronger on that list - so it's a ridiculous list.
Critter is rated 52 ELO above Komodo on this list, but on Ingo's list Komodo is 12 ELO higher. That is a discrepancy of 64 ELO.
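The anchoring arithmetic above can be written out explicitly. A minimal Python sketch using the numbers quoted in the thread (the 2800 IPON value for Shredder 12 is implied by the +66 offset, and is stated later in the thread):

```python
# Align the Chess2U list with Ingo's (IPON) list using Deep Shredder 12
# as the shared anchor engine. All ratings are the ones quoted above.
shredder_chess2u = 2734
shredder_ipon = 2800          # implied by the +66 offset

offset = shredder_ipon - shredder_chess2u
print(offset)                 # 66

houdini2_chess2u = 3071
print(houdini2_chess2u + offset)  # 3137 -- Houdini 2.0 on the IPON scale
```

The same one-anchor trick is how the 118 Elo discrepancy later in the thread is computed: 3137 on the IPON scale versus Houdini 2.0's actual IPON result.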
Some possible explanations:
1. The tester does not know how to do scientific testing.
2. The tester has introduced his own biases (perhaps inadvertently)
3. The test is done at a very fast time control.
If you test at really fast time controls, these results might make more sense, but they still seem pretty far off. One thing we have discovered is that all of the Ippo based programs get a fast start - their ratings are off the charts at fast time controls (a minute or less) but decline with increasing depth. We are not sure whether this represents a permanent scalability bug or whether it's localized to levels below 5 or 10 minutes.
Code: Select all
-----------------------------------------------------------------------------
Program Elo + - Games Score Av.Op. Draws
-----------------------------------------------------------------------------
1 Houdini 2.0 x64----------------: 3071 16 16 1530 75.3 % 2878 27.3 %
2 Critter 1.2 x64----------------: 2964 7 7 5820 61.3 % 2884 38.9 %
3 Ivanhoe B47cB x64--------------: 2957 8 8 3820 56.2 % 2914 46.6 %
4 Fire 2.2 xTreme x64------------: 2956 8 8 3670 56.2 % 2913 45.8 %
5 Deep Rybka 4.1 x64-------------: 2917 8 8 3820 49.6 % 2920 43.1 %
6 komodo 3 x64-------------------: 2907 14 14 1520 49.5 % 2910 34.1 %
7 Stockfish 2.1.1 JA x64---------: 2895 9 9 3820 45.9 % 2923 39.1 %
8 Gull 1.2 x64-------------------: 2786 9 9 3820 34.6 % 2896 30.3 %
9 Naum 4.2 x64-------------------: 2785 9 9 3820 34.4 % 2896 30.8 %
10 Deep Shredder 12 x64----------: 2734 10 10 3820 27.2 % 2905 27.3 %
Houdini 2.0 running for the IPON
Moderator: Ras
Re: Houdini 2.0 running for the IPON
Does anyone know what the testing conditions and time control are for this ridiculous test?
Re: Houdini 2.0 running for the IPON
Different lists have different settings and different reference points.

Don wrote:That list is completely broken, you cannot use it for making a point about anything if you want to be taken seriously.
Shredder is rated 2734 on that list - so to compare it to Ingo's list you must add 66 ELO which makes Houdini 1.5 3017 + 66 = 3083 and Houdini 2.0 would be 3137. Even WITHOUT the 66 ELO adjustment Houdini 1.5 comes out stronger on that list - so it's a ridiculous list.
Critter is rated 52 ELO above Komodo on this list, but on Ingo's list Komodo is 12 ELO higher. That is a discrepancy of 64 ELO.
The author of the list finds a +54 Elo improvement with the new Houdini, with more than 1500 games played.
It's just another data point...
Robert
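A rough sanity check on what ~1500 games can resolve: the 2-sigma Elo error bar follows from the trinomial variance of the game score. A minimal Python sketch (an approximation, not EloStat's or Bayeselo's exact computation), plugging in Houdini 2.0's 75.3 % score, 27.3 % draws, and 1530 games from the Chess2U table:

```python
import math

def elo_error_2sigma(p, d, n):
    """Approximate 2-sigma Elo error bar from n games with mean score
    fraction p and draw fraction d (win = 1, draw = 0.5, loss = 0)."""
    win = p - d / 2                                # win fraction
    var = win + 0.25 * d - p * p                   # per-game score variance
    se_p = math.sqrt(var / n)                      # std. error of mean score
    delo_dp = 400 / (math.log(10) * p * (1 - p))   # slope of Elo w.r.t. score
    return 2 * se_p * delo_dp

print(round(elo_error_2sigma(0.753, 0.273, 1530)))  # 16, matching the +/-16 in the table
```

By this estimate the list's quoted +/-16 error bar for Houdini 2.0 is at least internally consistent, so the argument in the thread is about systematic differences between lists, not sampling noise.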
Re: Houdini 2.0 running for the IPON
If you saw a list that showed Houdini 118 ELO weaker than on the IPON list, I think you would call the results into question. You would not consider it just another data point. And you would be right to do so.

Houdini wrote:Different lists have different settings and different reference points.

Don wrote:That list is completely broken, you cannot use it for making a point about anything if you want to be taken seriously.
Shredder is rated 2734 on that list - so to compare it to Ingo's list you must add 66 ELO which makes Houdini 1.5 3017 + 66 = 3083 and Houdini 2.0 would be 3137. Even WITHOUT the 66 ELO adjustment Houdini 1.5 comes out stronger on that list - so it's a ridiculous list.
Critter is rated 52 ELO above Komodo on this list, but on Ingo's list Komodo is 12 ELO higher. That is a discrepancy of 64 ELO.
The author of the list finds a +54 Elo improvement with the new Houdini, with more than 1500 games played.
It's just another data point...
Robert
In this case, however, it shows Houdini 2.0 at a whopping 3137 ELO, 118 ELO more than the IPON list. That calls into serious question its status as just another data point, ESPECIALLY since, as you point out, there are "more than 1500 games played."
This is not just another data point, it's a broken list.
Re: Houdini 2.0 running for the IPON
I do agree here.... The list is broken....

Don wrote:If you saw a list that showed Houdini 118 ELO weaker than on the IPON list, I think you would call the results into question. You would not consider it just another data point. And you would be right to do so.

Houdini wrote:Different lists have different settings and different reference points.

Don wrote:That list is completely broken, you cannot use it for making a point about anything if you want to be taken seriously.
Shredder is rated 2734 on that list - so to compare it to Ingo's list you must add 66 ELO which makes Houdini 1.5 3017 + 66 = 3083 and Houdini 2.0 would be 3137. Even WITHOUT the 66 ELO adjustment Houdini 1.5 comes out stronger on that list - so it's a ridiculous list.
Critter is rated 52 ELO above Komodo on this list, but on Ingo's list Komodo is 12 ELO higher. That is a discrepancy of 64 ELO.
The author of the list finds a +54 Elo improvement with the new Houdini, with more than 1500 games played.
It's just another data point...
Robert
In this case, however, it shows Houdini 2.0 at a whopping 3137 ELO, 118 ELO more than the IPON list. That calls into serious question its status as just another data point, ESPECIALLY since, as you point out, there are "more than 1500 games played."
This is not just another data point, it's a broken list.
A +118 Elo increase on that list shows a severe misunderstanding of computer chess engine testing in general.
Dr.D
No one can hit as hard as life. But it ain’t about how hard you can hit. It’s about how hard you can get hit and keep moving forward. How much you can take and keep moving forward….
Re: Houdini 2.0 running for the IPON
Could you or Remi help explain how the drawelo/drawdist parameter helps with refining the rating, and how it is used?
Re: Houdini 2.0 running for the IPON
I think I know where you are going with this. Is your theory that he set up bayeselo with a different drawelo or other parameter? That's a good theory.

gaard wrote:Could you or Remi help explain how the drawelo/drawdist parameter helps with refining the rating, and how it is used?
I'm not sure what time control was used, but my theory is that these games are played really fast. I have noticed that Houdini and all Ippo clones do not scale as well as other top programs - if you run a game in 5 seconds (for example) they will perform incredibly, but as you test at longer and longer time controls the Elo will drop relative to other programs. So my working theory is that this list is based on wicked fast time controls.
Re: Houdini 2.0 running for the IPON
I'm sure you're correct re the time control without even looking at the stats. My question was more re rating predictions in general. I have an alternate rating calculator I use to remove the compression I see in BayesElo ratings, and I am always looking for ways to refine it.

Don wrote:I think I know where you are going with this. Is your theory that he set up bayeselo with a different drawelo or other parameter? That's a good theory.

gaard wrote:Could you or Remi help explain how the drawelo/drawdist parameter helps with refining the rating, and how it is used?
I'm not sure what time control was used, but my theory is that these games are played really fast. I have noticed that Houdini and all Ippo clones do not scale as well as other top programs - if you run game in 5 seconds (for example) they will perform incredibly, but as you test at longer and longer time controls the ELO will drop relative to other programs. So my working theory on this is that this list is based on wicked fast time controls.
Re: Houdini 2.0 running for the IPON
I was assuming it was super fast, but I guess not (I consider 1 minute on that hardware fast, but not super fast).
Here is what it says:
- Time control : 1 min/game
- Ponder : OFF
- GUI : Shredder Classic
- Opening book : The default book (up to 12 moves)
- Learning : OFF
- EGTB : None
- Operating system : Windows 7 ultimate 64-bit
- Processors : Intel Core2 Quad Q6700
- Cores : 1 core for each engine
- Games : 1.350 Per Engine Minimal
- Database : 21980 games
- Last update : 6th September 2011
Here is how he says he does rating:
- The Rating is calculated by EloStat 1.3 (Start ELO = 2900)
Top 10 list:
-----------------------------------------------------------------------------
Program Elo + - Games Score Av.Op. Draws
-----------------------------------------------------------------------------
1 Houdini 2.0 x64----------------: 3071 16 16 1530 75.3 % 2878 27.3 %
2 Critter 1.2 x64----------------: 2964 7 7 5820 61.3 % 2884 38.9 %
3 Ivanhoe B47cB x64--------------: 2957 8 8 3820 56.2 % 2914 46.6 %
4 Fire 2.2 xTreme x64------------: 2956 8 8 3670 56.2 % 2913 45.8 %
5 Deep Rybka 4.1 x64-------------: 2917 8 8 3820 49.6 % 2920 43.1 %
6 komodo 3 x64-------------------: 2907 14 14 1520 49.5 % 2910 34.1 %
7 Stockfish 2.1.1 JA x64---------: 2895 9 9 3820 45.9 % 2923 39.1 %
8 Gull 1.2 x64-------------------: 2786 9 9 3820 34.6 % 2896 30.3 %
9 Naum 4.2 x64-------------------: 2785 9 9 3820 34.4 % 2896 30.8 %
10 Deep Shredder 12 x64----------: 2734 10 10 3820 27.2 % 2905 27.3 %
The absolute rating numbers are of course not relevant. The thing worth judging is the rating differences. I have not bothered to check whether these are consistent, but they do not look completely ridiculous to me at this time control with no increment. I doubt these are faked results; it's just someone having fun testing. No need to be too harsh. The main conclusion I would draw is that Houdini has probably improved more at fast games than at slow games.
-Sam
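The EloStat numbers in the table can be cross-checked against the standard logistic performance-rating formula. A minimal Python sketch (EloStat's real calculation is iterative over all engines, so this only approximates it):

```python
import math

def performance_elo(score, avg_opp):
    """Performance rating from a score fraction and the average opponent
    rating, via the standard logistic Elo relation."""
    return avg_opp + 400 * math.log10(score / (1 - score))

# Houdini 2.0's line from the table: 75.3 % against an average opponent of 2878
print(round(performance_elo(0.753, 2878)))  # 3072, close to the listed 3071
```

The small difference comes from rounding in the published Score and Av.Op. columns.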
Re: Houdini 2.0 running for the IPON
Thanks Larry, I fixed the download now!

lkaufman wrote:
I notice that you made a comment about the huge difference of the top engine on Elostat vs. Bayeselo. I think you made a huge mistake. The Bayeselo list is based on Shredder 12 = 2800, but the Elostat list is based on Shredder 12 = 3000! So all the ratings are about 200 higher! Once you correct this, Houdini 2.0 is five elo lower on Elostat than on Bayeselo, no big deal.
The 'huge' referred to the 5 Elo, but reading it from someone else puts it back into perspective. 5 Elo, phhh

Thanks again
Ingo
Re: Houdini 2.0 running for the IPON
drawdist computes (I believe) a maximum likelihood estimate for the drawelo parameter. I do know that the drawelo estimate increases when the percentage of draws increases.
drawelo is used in the following equations, which are Bayeselo's adjustments to the Elo equations:
P(White) = 1/(1+10^((Black Elo - White Elo - Advantage + Drawelo)/400))
P(Black)= 1/(1+10^((White Elo - Black Elo + Advantage + Drawelo)/400))
These lead to a third equation:
P(Draws) = 1 - P(White) - P(Black)
All in all, it provides additional information for the Elo estimate. For a given number of White wins and Black wins, more draws lead to a smaller estimated Elo difference.
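The three equations above are easy to wrap in a small function. A minimal Python sketch; the default advantage and drawElo values below are the commonly quoted Bayeselo defaults and should be treated as an assumption here, not values taken from this thread:

```python
def bayeselo_probs(white_elo, black_elo, advantage=32.8, draw_elo=97.3):
    """Win/draw/loss probabilities under Bayeselo's model.

    advantage is White's first-move advantage in Elo; draw_elo widens
    the draw zone. Defaults are commonly quoted Bayeselo values
    (assumption).
    """
    p_white = 1 / (1 + 10 ** ((black_elo - white_elo - advantage + draw_elo) / 400))
    p_black = 1 / (1 + 10 ** ((white_elo - black_elo + advantage + draw_elo) / 400))
    p_draw = 1 - p_white - p_black
    return p_white, p_draw, p_black

# Two equally rated engines: White still scores better than Black because
# of the first-move advantage, and a sizeable share of games are drawn.
print(bayeselo_probs(2900, 2900))
```

Raising draw_elo shrinks both p_white and p_black and grows p_draw, which is exactly why, for fixed win/loss counts, more draws imply a smaller estimated Elo difference.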