Houdini 2.0 running for the IPON

Don · Post by **Don** » Wed Sep 07, 2011 3:10 pm

Does anyone know what the testing conditions and time control are for this ridiculous test?

Don wrote:
Houdini wrote:
Houdini wrote:There's so much more in Houdini 2 than its 25 Elo increase (when you look at more test results than just the IPON list you'll find that this is about the average gain that is found).
A recent illustration of what I was saying, Ahmed Kamal's Top 10 Rating List at the Chess2U forum:
http://www.chess2u.com/t3992-top10-rati ... ember-2011

Houdini 2.0 (1530 games) +54 ELO improvement over Houdini 1.5a.
It's amazing how much the results from the various rating lists differ...

Robert
That list is completely broken, you cannot use it for making a point about anything if you want to be taken seriously.

Shredder is rated 2734 on that list - so to compare it to Ingo's list you must add 66 ELO which makes Houdini 1.5 3017 + 66 = 3083 and Houdini 2.0 would be 3137. Even WITHOUT the 66 ELO adjustment Houdini 1.5 comes out stronger on that list - so it's a ridiculous list.

Critter is rated 52 ELO above Komodo on this list, but on Ingo's list Komodo is 12 ELO higher. That is a discrepancy of 64 ELO.
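For anyone who wants to retrace the arithmetic above, here is a minimal Python sketch. It assumes Shredder is anchored at 2800 on Ingo's list, which is what the 66 ELO adjustment implies. The variable names are only illustrative, and all figures are the ones quoted in this thread.

```python
# Figures as quoted in this thread. SHREDDER_IPON = 2800 is an assumption
# implied by the 66 Elo adjustment (2734 + 66).
SHREDDER_CHESS2U = 2734
SHREDDER_IPON = 2800
HOUDINI_2_CHESS2U = 3071
HOUDINI_GAIN = 54  # Houdini 2.0 over Houdini 1.5a on the Chess2U list

# Shift needed to put both lists on the same Shredder anchor.
offset = SHREDDER_IPON - SHREDDER_CHESS2U

houdini_15_raw = HOUDINI_2_CHESS2U - HOUDINI_GAIN
houdini_15_rescaled = houdini_15_raw + offset
houdini_2_rescaled = HOUDINI_2_CHESS2U + offset

# Critter vs Komodo: +52 on this list, -12 on Ingo's list.
critter_komodo_discrepancy = 52 - (-12)

print("offset:", offset)
print("Houdini 1.5 raw / rescaled:", houdini_15_raw, houdini_15_rescaled)
print("Houdini 2.0 rescaled:", houdini_2_rescaled)
print("Critter-Komodo discrepancy:", critter_komodo_discrepancy)
```

The point is only that a constant offset moves every rating on a list by the same amount; it cannot explain a gap that differs from engine to engine.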

Some possible explanations:

1. The tester does not know how to do scientific testing.
2. The tester has introduced his own biases (perhaps inadvertently)
3. The test is done at a very fast level.

If you test at really fast time controls, these results might make more sense but still seem pretty far off. One thing we have discovered is that all of the Ippo based programs get a fast start - their ratings are off the charts at fast time controls (a minute or less) but decline with increasing depth. We are not sure that this represents a permanent scalability bug or whether it's localized to levels below 5 or 10 minutes.
    -----------------------------------------------------------------------------
      Program                          Elo    +  -  Games    Score  Av.Op. Draws 
    -----------------------------------------------------------------------------
    1 Houdini 2.0 x64----------------: 3071  16  16  1530    75.3 %  2878  27.3 %
    2 Critter 1.2 x64----------------: 2964    7  7  5820    61.3 %  2884  38.9 %
    3 Ivanhoe B47cB x64--------------: 2957    8  8  3820    56.2 %  2914  46.6 %
    4 Fire 2.2 xTreme x64------------: 2956    8  8  3670    56.2 %  2913  45.8 %
    5 Deep Rybka 4.1 x64-------------: 2917    8  8  3820    49.6 %  2920  43.1 %
    6 komodo 3 x64-------------------: 2907  14  14  1520    49.5 %  2910  34.1 %
    7 Stockfish 2.1.1 JA x64---------: 2895    9  9  3820    45.9 %  2923  39.1 %
    8 Gull 1.2 x64-------------------: 2786    9  9  3820    34.6 %  2896  30.3 %
    9 Naum 4.2 x64-------------------: 2785    9  9  3820    34.4 %  2896  30.8 %
    10 Deep Shredder 12 x64----------: 2734  10  10  3820    27.2 %  2905  27.3 %

Houdini · Post by **Houdini** » Wed Sep 07, 2011 3:23 pm

Don wrote:That list is completely broken, you cannot use it for making a point about anything if you want to be taken seriously.

Shredder is rated 2734 on that list - so to compare it to Ingo's list you must add 66 ELO which makes Houdini 1.5 3017 + 66 = 3083 and Houdini 2.0 would be 3137. Even WITHOUT the 66 ELO adjustment Houdini 1.5 comes out stronger on that list - so it's a ridiculous list.

Critter is rated 52 ELO above Komodo on this list, but on Ingo's list Komodo is 12 ELO higher. That is a discrepancy of 64 ELO.

Different lists have different settings and different reference points.
The author of the list finds a +54 Elo improvement with the new Houdini, with more than 1500 games played.
It's just another data point...

Robert

Don · Post by **Don** » Wed Sep 07, 2011 4:34 pm

Houdini wrote:
Don wrote:That list is completely broken, you cannot use it for making a point about anything if you want to be taken seriously.

Shredder is rated 2734 on that list - so to compare it to Ingo's list you must add 66 ELO which makes Houdini 1.5 3017 + 66 = 3083 and Houdini 2.0 would be 3137. Even WITHOUT the 66 ELO adjustment Houdini 1.5 comes out stronger on that list - so it's a ridiculous list.

Critter is rated 52 ELO above Komodo on this list, but on Ingo's list Komodo is 12 ELO higher. That is a discrepancy of 64 ELO.
Different lists have different settings and different reference points.
The author of the list finds a +54 Elo improvement with the new Houdini, with more than 1500 games played.
It's just another data point...

Robert

If you saw a list that showed Houdini 118 ELO weaker than on the IPON list, I think you would call the results into question. You would not consider it just another data point. And you would be right to do so.

In this case, however, it shows Houdini 2.0 at a whopping 3137 ELO, 118 ELO more than on the IPON list. That calls into serious question its status as just another data point, ESPECIALLY since as you point out there are "more than 1500 games played."

This is not just another data point, it's a broken list.

Dr.Wael Deeb · Post by **Dr.Wael Deeb** » Wed Sep 07, 2011 5:22 pm

Don wrote:
Houdini wrote:
Don wrote:That list is completely broken, you cannot use it for making a point about anything if you want to be taken seriously.

Shredder is rated 2734 on that list - so to compare it to Ingo's list you must add 66 ELO which makes Houdini 1.5 3017 + 66 = 3083 and Houdini 2.0 would be 3137. Even WITHOUT the 66 ELO adjustment Houdini 1.5 comes out stronger on that list - so it's a ridiculous list.

Critter is rated 52 ELO above Komodo on this list, but on Ingo's list Komodo is 12 ELO higher. That is a discrepancy of 64 ELO.
Different lists have different settings and different reference points.
The author of the list finds a +54 Elo improvement with the new Houdini, with more than 1500 games played.
It's just another data point...

Robert
If you saw a list that showed Houdini 118 ELO weaker than on the IPON list, I think you would call the results into question. You would not consider it just another data point. And you would be right to do so.

In this case, however, it shows Houdini 2.0 at a whopping 3137 ELO, 118 ELO more than on the IPON list. That calls into serious question its status as just another data point, ESPECIALLY since as you point out there are "more than 1500 games played."

This is not just another data point, it's a broken list.

I do agree here....The list is broken....
A +118 Elo difference on that list shows a severe misunderstanding of how computer chess engines should be tested in general.
Dr.D

gaard · Post by **gaard** » Wed Sep 07, 2011 7:29 pm

Could you or Remi help explain how the drawelo/drawdist parameter helps with refining the rating, and how it is used?

Don · Post by **Don** » Wed Sep 07, 2011 8:02 pm

gaard wrote:Could you or Remi help explain how the drawelo/drawdist parameter helps with refining the rating, and how it is used?

I think I know where you are going with this. Is your theory that he set up bayeselo with a different drawelo or other parameter? That's a good theory.

I'm not sure what time control was used, but my theory is that these games are played really fast. I have noticed that Houdini and all Ippo clones do not scale as well as other top programs - if you run games at 5 seconds (for example) they will perform incredibly, but as you test at longer and longer time controls the ELO will drop relative to other programs. So my working theory on this is that this list is based on wicked fast time controls.

gaard · Post by **gaard** » Wed Sep 07, 2011 8:18 pm

Don wrote:
gaard wrote:Could you or Remi help explain how the drawelo/drawdist parameter helps with refining the rating, and how it is used?
I think I know where you are going with this. Is your theory that he set up bayeselo with a different drawelo or other parameter? That's a good theory.

I'm not sure what time control was used, but my theory is that these games are played really fast. I have noticed that Houdini and all Ippo clones do not scale as well as other top programs - if you run games at 5 seconds (for example) they will perform incredibly, but as you test at longer and longer time controls the ELO will drop relative to other programs. So my working theory on this is that this list is based on wicked fast time controls.

I'm sure you're correct re the time control without even looking at the stats. My question was more re rating predictions in general. I have an alternate rating calculator I use to remove the compression I see in BayesElo ratings, and I am always looking for ways to refine it.

BubbaTough · Post by **BubbaTough** » Wed Sep 07, 2011 8:35 pm

I was assuming it was super fast, but I guess not (I consider 1 minute on that hardware fast, but not super fast).

Here is what it says:

- Time control : 1 min/game
- Ponder : OFF
- GUI : Shredder Classic
- Opening book : The default book (up to 12 moves)
- Learning : OFF
- EGTB : None
- Operating system : Windows 7 ultimate 64-bit
- Processors : Intel Core2 Quad Q6700
- Cores : 1 core for each engine
- Games : 1.350 Per Engine Minimal
- Database : 21980 games
- Last update : 6th September 2011

Here is how he says he does rating:

- The Rating is calculated by EloStat 1.3 (Start ELO = 2900)

Top 10 list:
    -----------------------------------------------------------------------------
      Program                          Elo    +  -  Games    Score  Av.Op. Draws 
    -----------------------------------------------------------------------------
    1 Houdini 2.0 x64----------------: 3071  16  16  1530    75.3 %  2878  27.3 %
    2 Critter 1.2 x64----------------: 2964    7  7  5820    61.3 %  2884  38.9 %
    3 Ivanhoe B47cB x64--------------: 2957    8  8  3820    56.2 %  2914  46.6 %
    4 Fire 2.2 xTreme x64------------: 2956    8  8  3670    56.2 %  2913  45.8 %
    5 Deep Rybka 4.1 x64-------------: 2917    8  8  3820    49.6 %  2920  43.1 %
    6 komodo 3 x64-------------------: 2907  14  14  1520    49.5 %  2910  34.1 %
    7 Stockfish 2.1.1 JA x64---------: 2895    9  9  3820    45.9 %  2923  39.1 %
    8 Gull 1.2 x64-------------------: 2786    9  9  3820    34.6 %  2896  30.3 %
    9 Naum 4.2 x64-------------------: 2785    9  9  3820    34.4 %  2896  30.8 %
    10 Deep Shredder 12 x64----------: 2734  10  10  3820    27.2 %  2905  27.3 %

The absolute rating numbers are of course not relevant. The thing worth judging is the rating differences. I have not bothered to check whether these are consistent or not, but they do not look completely ridiculous to me at this time control with no increment. I doubt these are faked results, just someone having fun testing. No need to be too harsh. The main conclusion I would draw is that Houdini has probably improved more at fast games than at slow games.
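As a rough sanity check on the table, the Score and Av.Op. columns can be turned back into a rating with the standard logistic Elo formula. This is only a sketch: it is not claimed to be exactly what EloStat 1.3 does internally, and the function name is made up for illustration. For the three rows used here it should land within a point or so of the listed Elo.

```python
import math

def elo_from_score(score, avg_opponent):
    """Rating implied by a score fraction (0 < score < 1) against an
    average opponent rating, using the logistic Elo curve."""
    return avg_opponent + 400 * math.log10(score / (1 - score))

# (name, score fraction, average opponent, Elo as listed in the table)
rows = [
    ("Houdini 2.0 x64", 0.753, 2878, 3071),
    ("Critter 1.2 x64", 0.613, 2884, 2964),
    ("Deep Shredder 12 x64", 0.272, 2905, 2734),
]

for name, score, avg_op, listed in rows:
    implied = elo_from_score(score, avg_op)
    print(f"{name}: implied {implied:.1f}, listed {listed}")
```

Since every rating is expressed relative to the average opponent, shifting the start Elo moves the whole list and leaves the differences untouched.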

-Sam

IWB · Post by **IWB** » Wed Sep 07, 2011 9:50 pm

lkaufman wrote:
I notice that you made a comment about the huge difference of the top engine on Elostat vs. Bayeselo. I think you made a huge mistake. The Bayeselo list is based on Shredder 12 = 2800, but the Elostat list is based on Shredder 12 = 3000! So all the ratings are about 200 higher! Once you correct this, Houdini 2.0 is five elo lower on Elostat than on Bayeselo, no big deal.

Thanks Larry, I fixed the download now!

The 'huge' referred to the 5 Elo, but reading it from someone else puts it back into proportion. 5 Elo, phhh

Thanks again
Ingo

Adam Hair · Post by **Adam Hair** » Thu Sep 08, 2011 2:49 am

drawdist computes (I believe) a maximum likelihood estimate for the drawelo parameter. I do know that the drawelo estimate increases when the percentage of draws increases.

drawelo is used in the following equations, which are Bayeselo's adjustment to the Elo equations:

P(White) = 1/(1+10^((Black Elo - White Elo - Advantage + Drawelo)/400))
P(Black) = 1/(1+10^((White Elo - Black Elo + Advantage + Drawelo)/400))

These lead to a third equation:
P(Draws) = 1 - P(White) - P(Black)

All in all, it provides additional information for the Elo estimate. For a given number of White wins and Black wins, more draws lead to a smaller estimated Elo difference.
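The three equations above can be written as a short Python sketch. The function name and the default values for advantage and drawelo are assumptions picked for illustration, not settings taken from any particular rating list.

```python
def bayeselo_probs(white_elo, black_elo, advantage=32.8, drawelo=97.3):
    """Win/draw/loss probabilities from the equations above.
    The default advantage and drawelo are illustrative assumptions."""
    p_white = 1 / (1 + 10 ** ((black_elo - white_elo - advantage + drawelo) / 400))
    p_black = 1 / (1 + 10 ** ((white_elo - black_elo + advantage + drawelo) / 400))
    p_draw = 1 - p_white - p_black
    return p_white, p_black, p_draw

# Two equally rated engines: a larger drawelo predicts more draws.
for drawelo in (50, 100, 200):
    p_w, p_b, p_d = bayeselo_probs(2900, 2900, drawelo=drawelo)
    print(f"drawelo={drawelo}: P(White)={p_w:.3f} P(Black)={p_b:.3f} P(Draw)={p_d:.3f}")
```

With drawelo set to 0 the draw probability vanishes and the two win probabilities reduce to the ordinary Elo expectation shifted by the white advantage.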
