Thank you! Looks worse when using a longer time control, as expected.

cdani wrote: Two more tests, with not many cores but with more time and more games:
www.andscacs.com/stockfish/stockfish_la ... ames_2.zip

Code:
rst_master_lsmpv2_100.1_6t_I75820K.pgn (100 seconds + 0.1 added per move)
1 st_master         +2   +37/=279/-35  50.28%  176.5/351
2 st_lsmpv2         -2   +35/=279/-37  49.72%  174.5/351

rst_normal_lzmpv2_200.1_7t_FX8350.pgn (200 seconds + 0.1 added per move)
1 st_modern         +11  +53/=400/-37  51.63%  253.0/490
2 st_modern_lzmpv2  -11  +37/=400/-53  48.37%  237.0/490
lazy smp questions
Re: lazy smp questions
Re: lazy smp questions
Yes, not unexpected.

mbootsector wrote: Thank you! Looks worse when using a longer time control, as expected.
However, thanks for your great work.
Currently I'm running a version in a local test which seems to do a bit better with 3 threads than before.
Maybe I will submit a test.
Jörg Oster
Re: lazy smp questions
Unfortunately the above analysis isn't quite correct. BayesElo computes error bars in a different way from what I thought; this dawned on me when I was initially unable to duplicate the BayesElo results with my own rating program in Python. BayesElo does not take into account the variation in elo of the other players: technically, it takes the inverse square root of the diagonal of the Hessian, multiplied by 1.96. There are very good reasons why it does this (for one thing, there is no canonical way of taking the variation of the other elo's into account), but it complicates the analysis.

Robert Hyatt wrote: You have to look at all the data. For example, look at the average opponent rating for cheng 4cpu vs cheng 1cpu. 1cpu played against opponents averaging about 50 Elo stronger than cheng 4cpu's. What would you expect that to cause? Make cheng 1cpu look weaker? You can't compare Elo numbers between partially or fully disjoint sets of opponents...

Michel wrote: Why not? As long as the graph is connected the comparison is fine. If A plays B and B plays C and C plays D, you can still compare A and D. The comparison via intermediate engines just blows up the error bars.

bob wrote: That was my point. If the average rating of player A's opponents is X, and the average rating of player B's opponents is X+50, it is going to be VERY difficult to compare their ratings with any accuracy and use the resulting Elo numbers to predict the outcome between the two versions. The two versions of the original program are different, the average opponents are different; WHICH is responsible for the Elo gain or loss? So in that specific CEGT comparison, the error bars are not +/-26. They are more like +/-75... In this case, saying A is +130 better than B is quite inaccurate. It is most likely better, to be sure. But how much better is much harder to determine without more data points.

Michel wrote:
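The connectivity argument — comparisons through intermediate engines remain valid but inflate the error bars — can be sketched numerically. Assuming independent pairwise measurements (the Elo error bars below are hypothetical), the uncertainties add in quadrature:

```python
import math

# Error bar on an A-vs-D Elo comparison made through the chain A-B, B-C, C-D.
# Pairwise error bars are hypothetical; independence is assumed.
pairwise_error = [10.0, 10.0, 10.0]   # +/- Elo for A-B, B-C, C-D

# Variances of independent measurements add, so error bars add in quadrature.
chained_error = math.sqrt(sum(e * e for e in pairwise_error))

print(round(chained_error, 1))  # -> 17.3, much larger than any single link
```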
Ok I had a look to see if you have any point.
You object to the use of the Pythagorean formula to estimate the error bars on the elo difference between two engines.
I am going to simplify the analysis a bit by assuming that we are dealing with a large database, so that varying the elo of two engines has little effect on the average elo. A more precise analysis gives the same result.
Furthermore I am going to take a Bayesian viewpoint (elo's are random variables) since this is easier to explain.
The formula for the variance of the difference of two random variables is

Var(X - Y) = Var(X) + Var(Y) - 2 * rho(X,Y) * sqrt(Var(X) * Var(Y))

where rho(X,Y) is the correlation coefficient. So negative rho's are bad, since they make us underestimate the error bars.
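As a sanity check, this formula can be verified numerically by building two correlated Gaussians from independent ones (the standard deviations below are hypothetical; rho = -0.17 is the most negative correlation mentioned later in the post):

```python
import math
import random

# Numerical check of Var(X - Y) = Var(X) + Var(Y) - 2*rho*sd(X)*sd(Y),
# using two correlated Gaussians built from independent ones.
# Illustrative sketch only; the error-bar values are hypothetical.
random.seed(42)

rho = -0.17            # correlation between the two engines' elo estimates
sd_x, sd_y = 5.0, 5.0  # hypothetical per-engine error standard deviations
n = 200_000

diffs = []
for _ in range(n):
    z1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)
    x = sd_x * z1
    y = sd_y * (rho * z1 + math.sqrt(1 - rho * rho) * z2)  # corr(x, y) = rho
    diffs.append(x - y)

mean = sum(diffs) / n
var_emp = sum((d - mean) ** 2 for d in diffs) / (n - 1)
var_formula = sd_x ** 2 + sd_y ** 2 - 2 * rho * sd_x * sd_y

print(round(var_emp, 1), round(var_formula, 1))
```

The empirical variance lands close to the formula's 58.5, confirming that a negative rho enlarges the variance of the difference.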
To see how negative the rho's can be, I hacked up a BayesElo program in Python and applied it to the CCRL404 database (12,000,000 games).
It seems the most negative correlation coefficient is -0.17, which makes us underestimate the error bars by at most 8%.
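The 8% figure follows directly from the variance formula: with equal error bars and rho = -0.17, the true combined error bar exceeds the Pythagorean one by sqrt(1.17) - 1, or about 8%:

```python
import math

rho = -0.17  # most negative correlation found in the CCRL404 data

# Ratio of the true combined error bar to the naive Pythagorean one,
# assuming both engines have equal error bars (unit variance).
naive = math.sqrt(1 + 1)            # Pythagorean formula
exact = math.sqrt(1 + 1 - 2 * rho)  # with the correlation term included

print(round(exact / naive - 1, 3))  # -> 0.082, i.e. about 8%
```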
So I believe that your objection has been refuted and that we can safely use the Pythagorean formula.
If I computed correctly, then under the assumption that the two players have approximately the same error bars, the figure of 8% given above is still approximately correct. In general, more care should be taken in combining BayesElo error bars than I thought.
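The Hessian-based error bar mentioned above can be sketched for the simplest possible case: a single Elo-difference parameter in a two-player, no-draw logistic model. The game counts are made up, and real BayesElo handles draws and many players; this only illustrates the 1.96 / sqrt(diagonal of the Hessian) recipe.

```python
import math

# Sketch of a BayesElo-style error bar for one Elo-difference parameter:
# 1.96 / sqrt(observed information), where the observed information is the
# (here 1x1) Hessian of the negative log-likelihood at the maximum.
s = math.log(10) / 400        # logistic scale of the Elo curve
wins, games = 60, 100         # hypothetical head-to-head result, no draws
p_hat = wins / games          # MLE of the win probability

# MLE of the Elo difference: invert p = 1 / (1 + 10**(-d/400)).
d_hat = -math.log(1 / p_hat - 1) / s

# Negative second derivative of the binomial log-likelihood w.r.t. d.
info = games * s * s * p_hat * (1 - p_hat)
error_bar = 1.96 / math.sqrt(info)

print(round(d_hat, 1), round(error_bar, 1))  # -> 70.4 69.5
```

Note how wide the 95% interval still is after 100 games: roughly +/-70 Elo, which is why the thread's small test runs above can only detect large differences.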
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.