program style, risk aversion

Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

My numeric method for determine draw trends of each engine.

Post by Ajedrecista »

Hello Kai:

I took a look at your original post suggesting D/[µ*(1 - µ)] when you posted it, and I must admit that I liked the idea.

My numeric method works very badly if µ is too close to 0 or 1. My correction factor µ*(1 - µ) was chosen arbitrarily while I was writing my original post, so it is to be expected that this factor (and also k) is a source of errors... I do not know whether a more adequate factor could be used instead of µ*(1 - µ); surely yes, but which one? Please let me explain a little more with an extreme case where my model breaks down:

(+99 =1 -0): µ = 0.995; D = 0.01.

k*µ*(1 - µ) = 1*0.995*0.005 = 0.004975 (µ*(1 - µ) tends to 0 if µ tends to 0 or 1).

------------

(+99 =0 -1): µ = 0.99; D = 0.

k*µ*(1 - µ) = 0*0.99*0.01 = 0.
In this extreme case, your model also has two very different results because of D:

(+99 =1 -0): µ = 0.995; D = 0.01.

D/[µ*(1 - µ)] = 0.01/(0.995*0.005) ~ 2.01005.

------------

(+99 =0 -1): µ = 0.99; D = 0.

D/[µ*(1 - µ)] = 0/(0.99*0.01) = 0.
Both of us fail to give a number if µ = {0, 1}, as expected.
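
The two extreme cases above can be reproduced with a small sketch. It assumes k = D/D_max with D_max = 2*min(µ, 1 - µ), as used elsewhere in this thread; the function name `draw_indices` is only illustrative, and it expects at least one game. When µ is exactly 0 or 1, D_max is 0 and neither number is defined, so the sketch returns None for both.

```python
from typing import Optional, Tuple


def draw_indices(wins: int, draws: int, losses: int) -> Tuple[float, float, Optional[float], Optional[float]]:
    """Return (mu, D, k*mu*(1 - mu), D/(mu*(1 - mu))) for a W/D/L record."""
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n        # score fraction
    d = draws / n                        # draw ratio
    d_max = 2.0 * min(mu, 1.0 - mu)      # largest draw ratio compatible with mu
    if d_max == 0.0:
        # mu is exactly 0 or 1: both indices are undefined
        return mu, d, None, None
    k = d / d_max                        # assumed definition: k = D / D_max
    product = mu * (1.0 - mu)
    return mu, d, k * product, d / product


for record in [(99, 1, 0), (99, 0, 1)]:
    print(record, draw_indices(*record))
```

A single draw more or less moves both indices from a positive value to zero, which is the instability described above.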

At least I posted that my parameter k*µ*(1 - µ) could be useful in sensible Round Robin tournaments where there are no huge Elo differences. A score of 90% means a difference of many Elo points for the winning engine (I suppose that 90% is enough to win unless there are more strong engines, which would make it a senseless 'GM vs. novices' Round Robin tournament), and of course I cannot say anything about such unbalanced matches.

------------------------
Laskos wrote: I didn't quite get this k factor, and I use only score*(1-score), or µ*(1-µ) in the notation of Jesús, when comparing to draw ratio. The problem with the assumption that k*score*(1-score) is somehow constant is evidenced by this:

Score = 0.5, Draw Ratio = d = 0.40
Then k=0.4
k*s*(1-s)=0.1 assumed to be a constant for an engine

Same engine:
Score = 0.9
k is smaller than 1 by definition
Then k*s*(1-s) is smaller than 0.09
There is no k to match the old k*s*(1-s) of 0.1, and even the maximum k=1 is unrealistic, as there would be lots of wins and draws, but no losses.


Good point, but I will try to defend myself a little: I think that the 'draw aversion' depends not only on the engine but also on the opponents (so, IMHO, it is not constant for every engine in every tournament). For example, taking SF in a tournament where µ_SF = 50% (and, for example, D_SF = 40%), it should mean that the rest of the engines were strong enough to hold against SF; OTOH, if µ_SF = 90% with enough games, I understand that the rest of the engines were clearly weaker than SF, and of course SF will avoid more draws because the other engines are not as strong as SF and will blunder more often, which SF will take advantage of.
Laskos wrote: On the other hand, if the factor is d / (s*(1-s)) then

Score = 0.5, Draw Ratio = d = 0.40
d / (s*(1-s)) = 1.6

Same engine:
Score = 0.9
The prediction for the same 1.6 is that d / (0.9*0.1) must equal 1.6
Then d=0.09*1.6=0.144
So it predicts a result of 82.8% wins, 14.4% draws, and 2.8% losses, which is pretty realistic.

Therefore I think that d / (s*(1-s)) is a more useful quantity than k*s*(1-s).
Good try, but your method can give errors when you estimate the draw ratio in this way. Please take a look at this slightly modified version of your example:

µ = 0.5, D = 0.6 (perfectly possible: +20% =60% -20%).

D/[µ*(1 - µ)] = 0.6/(0.5)² = 2.4

------------

µ' = 0.9; here you suppose D/[µ*(1 - µ)] = D'/[(µ')*(1 - µ')] = 2.4 = constant.

Prediction: D' = 2.4*(µ')*(1 - µ') = 2.4*0.9*0.1 = 0.216 = 21.6%.

But (D')_max = 2*min(µ', 1 - µ') = 2*0.1 = 0.2 < D'.
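
The feasibility check in the example above can be sketched as follows. It assumes, as Kai's prediction does, that D/[µ*(1 - µ)] stays constant when the score changes; the function name `predict_draw_ratio` is only illustrative.

```python
def predict_draw_ratio(index: float, mu_new: float):
    """Carry a constant D/(mu*(1 - mu)) index over to a new score mu_new.

    Returns (predicted D', D'_max, feasible), where feasible is True
    when the predicted draw ratio does not exceed 2*min(mu', 1 - mu').
    """
    d_pred = index * mu_new * (1.0 - mu_new)
    d_max = 2.0 * min(mu_new, 1.0 - mu_new)
    return d_pred, d_max, d_pred <= d_max


# 1.6 is the index from Kai's example, 2.4 the one from the modified example.
for index in (1.6, 2.4):
    print(index, predict_draw_ratio(index, 0.9))
```

Since µ*(1 - µ) = min(µ, 1 - µ)*max(µ, 1 - µ), the prediction is feasible exactly when the index is at most 2/max(µ', 1 - µ'), so any index of 2 or less never produces an impossible draw ratio.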
I was wicked in the example, but it is certainly possible. I recommend again more or less balanced matches (I mean, not SF against a random move generator). But what should one understand by a 'not unbalanced match'? Maybe all the engines of a Round Robin must be in the range [0.15, 0.85] for µ? This interval could be different, of course! By the way, I did not write [0.15, 0.85] by chance:

w = wins; d = draws; l = losses; n = games = w + d + l.

(Rating difference) = 400*log{[1 + (w - l)/n]/[1 - (w - l)/n]} by definition (just toy with variables).

(Rating performance): win ---> rating + 400; draw ---> rating; loss ---> rating - 400.
(Average rating performance) = <rating> + 400*(w - l)/n; (average opponent's rating: <rating>).
(Rating difference) = (average rating performance) - <rating> = 400*(w - l)/n.
If you call x = (w - l)/n and then plot 400*log[(1 + x)/(1 - x)] and 400x (you can even get rid of the constant 400, which is the same in both cases), you will see that they are similar in the range -2/3 < x < 2/3, more or less. It means the following:

µ_min. ===> w = 0, l = 2n/3, d = n/3: µ = d/(2n) = 1/6.
µ_max. ===> w = 2n/3, l = 0, d = n/3: µ = (w + d/2)/n = 5/6.

[µ_min., µ_max.] = [1/6, 5/6]; for not being so strict: [0.15, 0.85].
It is only an example, so it is not a rigorous definition of a 'not unbalanced match'.
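
For anyone who prefers numbers to a plot, here is a minimal sketch comparing the two rating expressions above over a few values of x = (w - l)/n. It assumes the logarithm is base 10, as is usual for Elo; the function names are only illustrative.

```python
import math


def elo_log(x: float) -> float:
    """Rating difference from the logistic formula, 400*log10((1 + x)/(1 - x))."""
    return 400.0 * math.log10((1.0 + x) / (1.0 - x))


def elo_linear(x: float) -> float:
    """Rating difference from the linear performance rule, 400*x."""
    return 400.0 * x


def mu_from_x(x: float) -> float:
    """Score fraction for a given x, since mu = (w + d/2)/n = (1 + x)/2."""
    return (1.0 + x) / 2.0


for x in (0.0, 0.2, 0.4, 2.0 / 3.0, 0.8, 0.9):
    print(f"x = {x:.3f}  mu = {mu_from_x(x):.3f}  "
          f"log: {elo_log(x):7.1f}  linear: {elo_linear(x):7.1f}")
```

The gap between the two columns grows quickly once x goes beyond 2/3, because the logarithm diverges as x approaches 1 while the linear rule is capped at 400.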

------------------------
Laskos wrote: Anyway, I ran another test with strength-adjusted engines (not perfectly adjusted):

    Program                            Score     %     Elo     Draws

  1 Stockfish 2.3.1                : 520.5/834  62.4   3073     32.5 %
  2 Komodo 5                       : 421.5/842  50.1   3000     35.0 %
  3 Rybka 4.1                      : 404.5/819  49.4   2996     34.8 %
  4 Hiarcs 14                      : 415.0/845  49.1   2995     29.1 %
  5 Houdini 3                      : 394.0/840  46.9   2982     33.6 %
  6 Junior 13                      : 353.5/838  42.2   2954     26.1 %

And the draw averseness (smaller - more averse) is:

       Engine            d / (s*(1-s))

   Junior 13               1.07
   Hiarcs 14               1.16
   Houdini 3               1.35
   Stockfish 2.3.1         1.39
   Rybka 4.1               1.39
   Komodo 5                1.40

Again, "older style" engines seem more draw-averse.

Kai
I computed my dubious number k*µ*(1 - µ), rounded up to 0.0001, just for comparison (a higher k*µ*(1 - µ) means less 'draw aversion'). I hope there are no typos:

       Engine            d / (s*(1-s))     k*µ*(1 - µ)

   Junior 13               1.07            0.1791
   Hiarcs 14               1.16            0.1508
   Houdini 3               1.35            0.19
   Stockfish 2.3.1         1.39            0.2697
   Rybka 4.1               1.39            0.1783
   Komodo 5                1.40            0.1756
As you see, the results are completely different, although I have to say that the engines did not all play the same number of games, FWIW.

Both models (yours and mine) have some drawbacks with non-trivial solutions. I encourage people to find or suggest these solutions, because I am sure that I will not get anything better even if I thought more about it.

It would be nice to compare our results with those of Adam, who has worked a lot on this. 'Draw aversion' is an interesting topic to investigate, but it is difficult to reach a plausible criterion!

Regards from Spain.

Ajedrecista.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: My numeric method for determine draw trends of each engi

Post by Adam Hair »

Here are the IPON stats with Jesús' k*µ*(1-µ), Kai's D/(µ*(1-µ)), and my regression estimates, sorted from most draw averse to least draw averse (as determined from regression):

Name                     Score      D       D_max     k       k*u*(1-u)  D/(u*(1-u))  Draw deviation
Deep Junior 13.3         39.00%   34.00%   78.00%    0.4359   0.1037      1.429      -5.21%
Houdini 3 STD            82.00%   24.00%   36.00%    0.6667   0.0984      1.626      -2.05%
Gull 1.2                 45.00%   39.00%   90.00%    0.4333   0.1073      1.576      -1.93%
Quazar 0.4               36.00%   37.00%   72.00%    0.5139   0.1184      1.606      -1.07%
HIARCS 14 WCSC 32b       48.00%   40.00%   96.00%    0.4167   0.1040      1.603      -1.06%
Komodo 5                 73.00%   34.00%   54.00%    0.6296   0.1241      1.725      -0.19%
Protector 1.4.0          39.00%   39.00%   78.00%    0.5      0.1190      1.639      -0.16%
Deep Shredder 12         45.00%   40.00%   90.00%    0.4444   0.1100      1.616      -0.08%
Deep Fritz 13 32b        51.00%   40.00%   98.00%    0.4082   0.1020      1.601      -0.07%
Hannibal 1.2             45.00%   40.00%   90.00%    0.4444   0.1100      1.616      -0.04%
spark-1.0                41.00%   39.00%   82.00%    0.4756   0.1151      1.612      -0.03%
Zappa Mexico II          32.00%   35.00%   64.00%    0.5469   0.1190      1.608       0.00%
Critter 1.4a             71.00%   37.00%   58.00%    0.6379   0.1314      1.797       0.10%
Spike 1.4 32b            42.00%   40.00%   84.00%    0.4762   0.1160      1.642       0.24%
Deep Sjeng c't 2010 32b  43.00%   41.00%   86.00%    0.4767   0.1169      1.673       0.45%
Naum 4.2                 50.00%   42.00%   100.00%   0.42     0.1050      1.680       0.63%
MinkoChess 1.3           31.00%   36.00%   62.00%    0.5806   0.1242      1.683       0.69%
Chiron 1.5               52.00%   42.00%   96.00%    0.4375   0.1092      1.683       0.87%
Stockfish 2.2.2 JA       69.00%   40.00%   62.00%    0.6452   0.1380      1.870       1.68%
Deep Rybka 4.1           68.00%   40.00%   64.00%    0.625    0.1360      1.838       2.41%
If we assume the regression estimate gives the best indicator of draw aversion, then it appears to me that Kai's method does better, at least with the IPON data. Both methods have some trouble when an engine's score is much lower or higher than 50% (as Jesús noted). So some correction is needed for engines with high or low scores.

I hope we can find a correction factor, for the process of preparing the information for regression requires the pgn and some time (though I think I have this sort of thing down to a science now :) ). It would be much easier to have the score, draw rate, and a formula to determine how drawish an engine is.
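
One possible way to put a number on how well each index tracks the regression estimate is a rank correlation against the draw deviation column. The sketch below uses Spearman's rank correlation, which is a choice made here for illustration and not something used in the posts above; the data are copied from the IPON table and the helper names are hypothetical.

```python
import math

# (k*u*(1-u), D/(u*(1-u)), draw deviation in %) for the 20 engines, in table order.
ROWS = [
    (0.1037, 1.429, -5.21), (0.0984, 1.626, -2.05), (0.1073, 1.576, -1.93),
    (0.1184, 1.606, -1.07), (0.1040, 1.603, -1.06), (0.1241, 1.725, -0.19),
    (0.1190, 1.639, -0.16), (0.1100, 1.616, -0.08), (0.1020, 1.601, -0.07),
    (0.1100, 1.616, -0.04), (0.1151, 1.612, -0.03), (0.1190, 1.608, 0.00),
    (0.1314, 1.797, 0.10), (0.1160, 1.642, 0.24), (0.1169, 1.673, 0.45),
    (0.1050, 1.680, 0.63), (0.1242, 1.683, 0.69), (0.1092, 1.683, 0.87),
    (0.1380, 1.870, 1.68), (0.1360, 1.838, 2.41),
]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


deviation = [row[2] for row in ROWS]
print("k*u*(1-u) vs regression:  ", spearman([row[0] for row in ROWS], deviation))
print("D/(u*(1-u)) vs regression:", spearman([row[1] for row in ROWS], deviation))
```

A value closer to 1 means the index orders the engines more like the regression does. With only 20 engines and several near-ties, small differences between the two values should not be over-interpreted.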
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: My numeric method for determine draw trends of each engi

Post by Don »

Adam Hair wrote: Here are the IPON stats with Jesús' k*µ*(1-µ), Kai's D/(µ*(1-µ)), and my regression estimates, sorted from most draw averse to least draw averse (as determined from regression):

Name                     Score      D       D_max     k       k*u*(1-u)  D/(u*(1-u))  Draw deviation
Deep Junior 13.3         39.00%   34.00%   78.00%    0.4359   0.1037      1.429      -5.21%
Houdini 3 STD            82.00%   24.00%   36.00%    0.6667   0.0984      1.626      -2.05%
Gull 1.2                 45.00%   39.00%   90.00%    0.4333   0.1073      1.576      -1.93%
Quazar 0.4               36.00%   37.00%   72.00%    0.5139   0.1184      1.606      -1.07%
HIARCS 14 WCSC 32b       48.00%   40.00%   96.00%    0.4167   0.1040      1.603      -1.06%
Komodo 5                 73.00%   34.00%   54.00%    0.6296   0.1241      1.725      -0.19%
Protector 1.4.0          39.00%   39.00%   78.00%    0.5      0.1190      1.639      -0.16%
Deep Shredder 12         45.00%   40.00%   90.00%    0.4444   0.1100      1.616      -0.08%
Deep Fritz 13 32b        51.00%   40.00%   98.00%    0.4082   0.1020      1.601      -0.07%
Hannibal 1.2             45.00%   40.00%   90.00%    0.4444   0.1100      1.616      -0.04%
spark-1.0                41.00%   39.00%   82.00%    0.4756   0.1151      1.612      -0.03%
Zappa Mexico II          32.00%   35.00%   64.00%    0.5469   0.1190      1.608       0.00%
Critter 1.4a             71.00%   37.00%   58.00%    0.6379   0.1314      1.797       0.10%
Spike 1.4 32b            42.00%   40.00%   84.00%    0.4762   0.1160      1.642       0.24%
Deep Sjeng c't 2010 32b  43.00%   41.00%   86.00%    0.4767   0.1169      1.673       0.45%
Naum 4.2                 50.00%   42.00%   100.00%   0.42     0.1050      1.680       0.63%
MinkoChess 1.3           31.00%   36.00%   62.00%    0.5806   0.1242      1.683       0.69%
Chiron 1.5               52.00%   42.00%   96.00%    0.4375   0.1092      1.683       0.87%
Stockfish 2.2.2 JA       69.00%   40.00%   62.00%    0.6452   0.1380      1.870       1.68%
Deep Rybka 4.1           68.00%   40.00%   64.00%    0.625    0.1360      1.838       2.41%
If we assume the regression estimate gives the best indicator of draw aversion, then it appears to me that Kai's method does better, at least with the IPON data. Both methods have some trouble when an engine's score is much lower or higher than 50% (as Jesús noted). So some correction is needed for engines with high or low scores.

I hope we can find a correction factor, for the process of preparing the information for regression requires the pgn and some time (though I think I have this sort of thing down to a science now :) ). It would be much easier to have the score, draw rate, and a formula to determine how drawish an engine is.
It remains to be seen if we can come up with something that is actually meaningful. Does it really say something about style? I think the only way to calibrate this and know that it's right is to time-adjust these same engines off-line and run long matches. That would establish a baseline, and then we could experiment with finding the right formula. I don't think it's necessary to run them at the long time controls IPON uses, since we are trying to measure style, not Elo, so it need not take a huge amount of time.

It would be a challenge getting them Elo-adjusted to within 5 or 10 Elo, which I think you need to do to call it a baseline. It's not too difficult with just 3 or 4 engines, however.

I also believe there is minimal error if you adjust the times as the test proceeds, as long as you end up with a score that is almost even (assuming you don't go too crazy with those adjustments, of course). I was afraid to try that, but I think it would be valid.

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: My numeric method for determine draw trends of each engi

Post by Adam Hair »

When my quad becomes free again in a few days, I will take a stab at your suggestion. I have played around with time odds matches enough that I may be able to dial in the appropriate time adjustments without too much trouble.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: My numeric method for determine draw trends of each engi

Post by Don »

Adam Hair wrote:When my quad becomes free again in a few days, I will take a stab at your suggestion. I have played around with time odds matches enough that I may be able to dial in the appropriate time adjustments without too much trouble.
Sounds like a good plan. Do you have the same programs used on IPON for reference? You probably don't need all of them, just a few.

Don
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: My numeric method for determine draw trends of each engi

Post by Adam Hair »

Don wrote:
Adam Hair wrote:When my quad becomes free again in a few days, I will take a stab at your suggestion. I have played around with time odds matches enough that I may be able to dial in the appropriate time adjustments without too much trouble.
Sounds like a good plan. Do you have the same programs used on IPON for reference? You probably don't need all of them, just a few.

Don
The engines I lack are current versions of Hiarcs, Junior, Shredder, and Sjeng.
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

My numeric method for determine draw trends of each engine.

Post by Ajedrecista »

Hello again:
Ajedrecista wrote: I computed my dubious number k*µ*(1 - µ), rounded up to 0.0001, just for comparison (a higher k*µ*(1 - µ) means less 'draw aversion'). I hope there are no typos:

       Engine            d / (s*(1-s))     k*µ*(1 - µ)

   Junior 13               1.07            0.1791
   Hiarcs 14               1.16            0.1508
   Houdini 3               1.35            0.19
   Stockfish 2.3.1         1.39            0.2697
   Rybka 4.1               1.39            0.1783
   Komodo 5                1.40            0.1756
As you see, the results are completely different, although I have to say that the engines did not all play the same number of games, FWIW.

Both models (yours and mine) have some drawbacks with non-trivial solutions. I encourage people to find or suggest these solutions, because I am sure that I will not get anything better even if I thought more about it.

It would be nice to compare our results with those of Adam, who has worked a lot on this. 'Draw aversion' is an interesting topic to investigate, but it is difficult to reach a plausible criterion!

Regards from Spain.

Ajedrecista.
I did not calculate k*µ*(1 - µ) correctly with this data: I was tired and in a hurry (sigh), and I repeated the same mistake as earlier in this topic: k*µ*(1 - µ) = D*max.(µ, 1 - µ)/2, but I calculated D*max.(µ, 1 - µ)/[2*min.(µ, 1 - µ)]. I expect that this is the correct table, rounding up to 0.0001 again:

       Engine            d / (s*(1-s))     k*µ*(1 - µ)

   Junior 13               1.07            0.0755
   Hiarcs 14               1.16            0.074
   Houdini 3               1.35            0.0891
   Stockfish 2.3.1         1.39            0.1014
   Rybka 4.1               1.39            0.0881
   Komodo 5                1.40            0.0877
Sorry for my mistake (I was puzzled by my previous numbers, but these ones make total sense to me now). We agree that both Junior and HIARCS have more draw aversion than the others. It would be good if Adam ran his regression with this data (I guess that he needs the PGN file) for comparison purposes.
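
As a cross-check of the corrected table, both numbers can be recomputed from the score and draw percentages in Kai's tournament table. This is only a sketch: it uses the rounded percentages rather than the exact game counts, so the last digit can differ slightly from the table, and the function names are illustrative.

```python
# (engine, score fraction, draw fraction), copied from Kai's rounded table above.
RESULTS = [
    ("Stockfish 2.3.1", 0.624, 0.325),
    ("Komodo 5",        0.501, 0.350),
    ("Rybka 4.1",       0.494, 0.348),
    ("Hiarcs 14",       0.491, 0.291),
    ("Houdini 3",       0.469, 0.336),
    ("Junior 13",       0.422, 0.261),
]


def kai_index(mu: float, d: float) -> float:
    """Kai's quantity D/(mu*(1 - mu))."""
    return d / (mu * (1.0 - mu))


def jesus_index(mu: float, d: float) -> float:
    """k*mu*(1 - mu) with k = D/D_max and D_max = 2*min(mu, 1 - mu).

    Because mu*(1 - mu) = min*max, this simplifies to D*max(mu, 1 - mu)/2.
    """
    return d * max(mu, 1.0 - mu) / 2.0


# Sort from most to least draw averse according to Kai's index.
for name, mu, d in sorted(RESULTS, key=lambda r: kai_index(r[1], r[2])):
    print(f"{name:<18} {kai_index(mu, d):6.2f} {jesus_index(mu, d):8.4f}")
```

Dividing by 2*min(µ, 1 - µ) instead of multiplying by max(µ, 1 - µ)/2 reproduces the mistaken first table, which is why those values were roughly twice as large.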

------------------------

Adam said that, taking his regression as the best estimator, then Kai's method is better than mine... at least I tried it! I am happy because I started this subtopic inside the main topic and some people came with their work... I do not know if Kai already used D/[µ*(1 - µ)] before the start of this thread, but I am sure that Adam thought about his regression model in these few days! Congratulations to both of you.

------------------------

@Don: I think you should paste the link to this whole topic in the Open Chess Forum to allow people to also follow Kai's and Adam's methods, which by the way seem better than mine. Thanks again for your interest!

Regards from Spain.

Ajedrecista.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: My numeric method for determine draw trends of each engi

Post by Don »

Adam Hair wrote:
Don wrote:
Adam Hair wrote:When my quad becomes free again in a few days, I will take a stab at your suggestion. I have played around with time odds matches enough that I may be able to dial in the appropriate time adjustments without too much trouble.
Sounds like a good plan. Do you have the same programs used on IPON for reference? You probably don't need all of them, just a few.

Don
The engines I lack are current versions of Hiarcs, Junior, Shredder, and Sjeng.
It's possible that you could appeal to the authors to provide you with these for this study. I don't think you have to have all the programs but I would hate to see some of these missing as they are all original and different. Many top 10 lists have only 2 or 3 original programs on them so we desperately need variety at the top. If you are using the Ipon list as your reference the situation is much better of course.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: My numeric method for determine draw trends of each engi

Post by Adam Hair »

Ajedrecista wrote: Adam said that, taking his regression as the best estimator, then Kai's method is better than mine... at least I tried it! I am happy because I started this subtopic inside the main topic and some people came with their work... I do not know if Kai already used D/[µ*(1 - µ)] before the start of this thread, but I am sure that Adam thought about his regression model in these few days! Congratulations to both of you.

Regards from Spain.

Ajedrecista.
We can credit Kirill Kryukov for applying regression methods to determining the draw rate characteristics of engines. What I have done is simply a continuation of his ideas.

It is not certain that your method does not work better for other data. I do suspect some modification of your formula or Kai's formula is needed.

Thanks for starting this, Jesús (and Don too).
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: My numeric method for determine draw trends of each engi

Post by Don »

Adam Hair wrote:
Ajedrecista wrote: Adam said that, taking his regression as the best estimator, then Kai's method is better than mine... at least I tried it! I am happy because I started this subtopic inside the main topic and some people came with their work... I do not know if Kai already used D/[µ*(1 - µ)] before the start of this thread, but I am sure that Adam thought about his regression model in these few days! Congratulations to both of you.

Regards from Spain.

Ajedrecista.
We can credit Kirill Kryukov for applying regression methods to determining the draw rate characteristics of engines. What I have done is simply a continuation of his ideas.

It is not certain that your method does not work better for other data. I do suspect some modification of your formula or Kai's formula is needed.

Thanks for starting this, Jesús (and Don too).
Are you going to run that match? I would love to see it. Probably take you a couple of days to get the programs adjusted.

There is so much difference in strengths even in the IPON top 10 that you have to give enormous time advantages to some programs. So you might have to consider running the top programs at a pretty fast time control to have a test that does not take weeks.