TalkChess.com

Posted: **Sat Dec 29, 2018 9:59 am**

I was going to experiment with this on fishtest but I am currently a bit busy (moreover it is a tradition, started by Marco Costalba, to forcibly deride _any_ kind of research on fishtest even for those who contribute more games than they use, so it is always an uphill battle).

I wonder if it has been tried to combine self testing, the pentanomial model and fixed node games. In the case of self testing of closely related engines with fixed nodes the reversed color return games will (supposedly) be heavily correlated to the original games. This will compress the score towards 50% but the pentanomial variance will also be heavily reduced. So in the end it could be a win in normalized elo which is all that matters for the efficiency of engine testing. Here is a note on normalized elo: http://hardy.uhasselt.be/Toga/normalized_elo.pdf (the pentanomial model is discussed in section 4). It seems not possible to predict theoretically if this would be a win or not since elo models do not take into account correlation.

A similar idea could be applied when testing against a third party engine. It is common wisdom that in that case one needs 4 times the number of games to attain the same resolution. However this is only true if the games against the third engine are uncorrelated. If we use fixed nodes and the same openings then presumably we create correlation and we should again use the pentanomial model (this time for the score differences) to get a more accurate variance for the score differences.
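To make the comparison concrete, here is a minimal sketch of the two error estimates on the same set of game pairs. The function name `score_errors` and the sample data are made up for illustration. The trinomial estimate treats all games as independent draws, while the pentanomial estimate treats each game pair (original plus reversed-color return game) as one draw.

```python
from math import sqrt

def score_errors(pairs):
    """pairs: list of (s1, s2), the two game scores (0, 0.5 or 1) of one
    engine in a game pair. Returns (mean score, trinomial standard error,
    pentanomial standard error)."""
    n_pairs = len(pairs)
    games = [s for pair in pairs for s in pair]
    mean = sum(games) / len(games)

    # Trinomial: every game is treated as an independent draw.
    var_game = sum((s - mean) ** 2 for s in games) / len(games)
    se_tri = sqrt(var_game / len(games))

    # Pentanomial: every pair average (0, 0.25, 0.5, 0.75, 1) is one draw,
    # so any dependence inside a pair is absorbed into the variance.
    pair_means = [(a + b) / 2 for a, b in pairs]
    var_pair = sum((m - mean) ** 2 for m in pair_means) / n_pairs
    se_penta = sqrt(var_pair / n_pairs)
    return mean, se_tri, se_penta

# Hypothetical data: many pairs where the same color wins both games.
example_pairs = ([(1.0, 0.0)] * 30 + [(0.5, 0.5)] * 50
                 + [(1.0, 0.5)] * 12 + [(0.5, 0.0)] * 8)
mean, se_tri, se_penta = score_errors(example_pairs)
print(f"score={mean:.3f} trinomial SE={se_tri:.4f} pentanomial SE={se_penta:.4f}")
```

When the two games of a pair tend to cancel each other (one engine wins with one color and loses with the other), the pair averages cluster around 0.5 and the pentanomial error comes out smaller than the trinomial one.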

Posted: **Sun Dec 30, 2018 3:28 pm**

Michel wrote: ↑Sat Dec 29, 2018 9:59 am I was going to experiment with this on fishtest but I am currently a bit busy (moreover it is a tradition, started by Marco Costalba, to forcibly deride _any_ kind of research on fishtest even for those who contribute more games than they use, so it is always an uphill battle).

I wonder if it has been tried to combine self testing, the pentanomial model and fixed node games. In the case of self testing of closely related engines with fixed nodes the reversed color return games will (supposedly) be heavily correlated to the original games. This will compress the score towards 50% but the pentanomial variance will also be heavily reduced. So in the end it could be a win in normalized elo which is all that matters for the efficiency of engine testing. Here is a note on normalized elo: http://hardy.uhasselt.be/Toga/normalized_elo.pdf (the pentanomial model is discussed in section 4). It seems not possible to predict theoretically if this would be a win or not since elo models do not take into account correlation.

A similar idea could be applied when testing against a third party engine. It is common wisdom that in that case one needs 4 times the number of games to attain the same resolution. However this is only true if the games against the third engine are uncorrelated. If we use fixed nodes and the same openings then presumably we create correlation and we should again use the pentanomial model (this time for the score differences) to get a more accurate variance for the score differences.

I tried SF_dev versus the recent SF10, maybe a bit too distant a relative. First, fixed nodes results do not reflect the fixed time (strength) numbers; one has to be careful to use fixed nodes only for patches that do not affect NPS. SF_dev is about 8 +/- 3 Elo points stronger than SF10 at fixed time, but at 10,000 nodes per move it came out at -5 +/- 4 Elo points, i.e. weaker.

Second, the correlation does not seem to be significantly higher with fixed nodes than with fixed time (at approximately the same time used per move). The correlation in both cases increases with the number of nodes per move or the time control. The correlation also depends strongly on the openings used. I used their 2moves_v1.pgn, and these openings do not give a very high correlation.

Pentanomial error (square root of the variance) versus trinomial one:

10,000 nodes per move: 1% smaller
100,000 nodes per move: 4% smaller
300,000 nodes per move: 6% smaller

The last one is pretty equivalent on my PC as time consumed to:
6s +0.1s : 6% smaller

There is no significant difference, so sticking to fixed time is probably better. The tests were on one thread, but SF resolution in nodes is about 1000 nodes or so IIRC.
Their STC is 10''+0.1'', and probably 7-8% smaller pentanomial error margins can be obtained, or ~15% fewer games.
For LTC 60''+0.6'' the difference in errors might reach 10-15% with pentanomial variance and 2moves_v1, or 20-30% fewer games.

They seem to have reduced the draw rate probably due to Contempt. The use of unbalanced openings with high correlations becomes efficient with the pentanomial variance at draw rates above 60%, becoming very efficient towards 80%+ draw rates (in the past they were approaching these draw rates).
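The step from "x% smaller error" to "y% fewer games" above follows from the error margin scaling as 1/sqrt(N), so the number of games needed for the same confidence scales with the variance, i.e. with the square of the error. A small sketch (the function name is just for illustration):

```python
def fewer_games_fraction(error_reduction):
    """Fraction of games saved at equal confidence when the error margin
    shrinks by `error_reduction` (e.g. 0.08 for 8% smaller errors).
    Games needed scale with the variance, which is the error squared."""
    return 1.0 - (1.0 - error_reduction) ** 2

for reduction in (0.075, 0.10, 0.15):
    saved = fewer_games_fraction(reduction)
    print(f"{reduction:.1%} smaller error -> {saved:.1%} fewer games")
```

This gives roughly 14%, 19% and 28% for the three cases, in line with the ~15% and 20-30% figures quoted above.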

Posted: **Sun Dec 30, 2018 5:11 pm**

Laskos wrote: ↑Sun Dec 30, 2018 3:28 pm

I tried SF_dev versus the recent SF10, maybe a bit too distant a relative. First, fixed nodes results do not reflect the fixed time (strength) numbers; one has to be careful to use fixed nodes only for patches that do not affect NPS. SF_dev is about 8 +/- 3 Elo points stronger than SF10 at fixed time, but at 10,000 nodes per move it came out at -5 +/- 4 Elo points, i.e. weaker.

Second, the correlation does not seem to be significantly higher with fixed nodes than with fixed time (at approximately the same time used per move). The correlation in both cases increases with the number of nodes per move or the time control. The correlation also depends strongly on the openings used. I used their 2moves_v1.pgn, and these openings do not give a very high correlation.

Pentanomial error (square root of the variance) versus trinomial one:

10,000 nodes per move: 1% smaller
100,000 nodes per move: 4% smaller
300,000 nodes per move: 6% smaller

The last one is pretty equivalent on my PC as time consumed to:
6s +0.1s : 6% smaller

There is no significant difference, so sticking to fixed time is probably better. The tests were on one thread, but SF resolution in nodes is about 1000 nodes or so IIRC.
Their STC is 10''+0.1'', and probably 7-8% smaller pentanomial error margins can be obtained, or ~15% fewer games.
For LTC 60''+0.6'' the difference in errors might reach 10-15% with pentanomial variance and 2moves_v1, or 20-30% fewer games.

They seem to have reduced the draw rate probably due to Contempt. The use of unbalanced openings with high correlations becomes efficient with the pentanomial variance at draw rates above 60%, becoming very efficient towards 80%+ draw rates (in the past they were approaching these draw rates).

Thanks a lot for testing! Needless to say I am a bit disappointed since fixed node testing completely eliminates the noise associated with TC. On the other hand it might be comforting to know that correlation is at most a minor issue even if one tries very hard to create it, since correlation is not covered by elo models and so is messy to deal with theoretically.

Posted: **Sun Dec 30, 2018 5:34 pm**

Michel wrote: ↑Sun Dec 30, 2018 5:11 pm
Thanks a lot for testing! Needless to say I am a bit disappointed since fixed node testing completely eliminates the noise associated with TC. On the other hand it might be comforting to know that correlation is at most a minor issue even if one tries very hard to create it, since correlation is not covered by elo models and so is messy to deal with theoretically.

Yes, it is small, but at ultra-bullet. At bullet it's already probably above 20% in variance even from 2moves_v1. High correlation can be obtained with unbalanced openings, but they are efficient in resolution (Normalized Elo) at high draw rate, say higher than 60%.

I might try 3 million nodes per move with 2moves_v1, but in only 400 games or so. It should give a clearer picture of variance in bullet to blitz games. I was also suspecting that fixed nodes should give higher correlation, but was not sure how to use that, as fixed nodes is a bit tricky for precise measurement of strength, one has to be careful.

Posted: **Sun Dec 30, 2018 11:57 pm**

Laskos wrote: ↑Sun Dec 30, 2018 5:34 pm
Yes, it is small, but at ultra-bullet. At bullet it's already probably above 20% in variance even from 2moves_v1. High correlation can be obtained with unbalanced openings, but they are efficient in resolution (Normalized Elo) at high draw rate, say higher than 60%.

I might try 3 million nodes per move with 2moves_v1, but in only 400 games or so. It should give a clearer picture of variance in bullet to blitz games. I was also suspecting that fixed nodes should give higher correlation, but was not sure how to use that, as fixed nodes is a bit tricky for precise measurement of strength, one has to be careful.

A bit of nitpicking (sorry...). Unbalanced openings should not create more correlation since the return game is still independent from the original game. In other words the outcome (win, draw, loss) of the original game does not influence the probabilities of the return game (the probabilities should be solely determined by the bias and the elo difference). Of course with closely related engines there may be some correlation due to the same opening being used so that the resulting games might be similar, but if there is such an effect then I would expect it to be also present for balanced openings (and the idea of using fixed nodes testing was precisely to strengthen the correlation in this case).

Posted: **Mon Dec 31, 2018 12:13 am**

Michel wrote: ↑Sun Dec 30, 2018 11:57 pm
A bit of nitpicking (sorry...). Unbalanced openings should not create more correlation since the return game is still independent from the original game. In other words the outcome (win, draw, loss) of the original game does not influence the probabilities of the return game (the probabilities should be solely determined by the bias and the elo difference). Of course with closely related engines there may be some correlation due to the same opening being used so that the resulting games might be similar, but if there is such an effect then I would expect it to be also present for balanced openings (and the idea of using fixed nodes testing was precisely to strengthen the correlation in this case).

You mean that by introducing some eloBias in some model for unbalanced openings, what I call "correlation" should be reduced to that of balanced openings? I think we meant two slightly different things: you meant some "intrinsic" correlation, unaccounted for in any Elo model, while I meant the brute correlation, which also includes the eloBias or model-related correlation (besides the "intrinsic" correlation). Yes, that "intrinsic" correlation is pretty small (but increasing with TC), and probably even smaller than that shown by 2moves_v1, which contains some slightly unbalanced openings. I will maybe do some experiments tomorrow with extremely balanced openings (according to Stockfish) selected from 2moves_v1 to see this hidden "intrinsic correlation" even more clearly. Also, maybe I should play 2 almost identical versions of Stockfish against each other, as SF10 and SF_dev are quite different already.

The 3 million nodes per move match of 400 games shows only 8% decrease with pentanomial errors, lower than I expected. 17% in variance or number of games to same confidence.

Posted: **Mon Dec 31, 2018 12:41 am**

Laskos wrote: ↑Mon Dec 31, 2018 12:13 am
You mean that by introducing some eloBias in some model for unbalanced openings, what I call "correlation" should be reduced to that of balanced openings? I think we meant two slightly different things: you meant some "intrinsic" correlation, unaccounted for in any Elo model, while I meant the brute correlation, which also includes the eloBias or model-related correlation (besides the "intrinsic" correlation). Yes, that "intrinsic" correlation is pretty small (but increasing with TC), and probably even smaller than that shown by 2moves_v1, which contains some slightly unbalanced openings. I will maybe do some experiments tomorrow with extremely balanced openings (according to Stockfish) selected from 2moves_v1 to see this hidden "intrinsic correlation" even more clearly. Also, maybe I should play 2 identical copies of Stockfish against each other, as SF10 and SF_dev are quite different already.

The 3 million nodes per move match of 400 games shows only 8% decrease with pentanomial errors, lower than I expected. 17% in variance or number of games to same confidence.

Well 17% is not nothing...

I don't want to make a big deal out of this, but nonetheless let me repeat. What you call correlation in the case of unbalanced opening is not what is technically called correlation. Correlation between two random variables means that the covariance is not zero. https://en.wikipedia.org/wiki/Covarianc ... orrelation. Independent random variables are always uncorrelated. In a game pair the two games are independent and hence their outcomes are uncorrelated (but the distributions of the outcomes of the two games in the pair are different trinomial distributions).

The reason why we need the pentanomial model in the case of unbalanced openings is that if X,Y are independent random variables with different trinomial distributions (indexed by 0,1,2) then the distribution of X+Y is not the same as the distribution of Z1+Z2 with Z1,Z2 having independent identical trinomial distributions.
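A quick numerical check of this point, as a sketch under made-up probabilities. It uses game scores 0, 0.5, 1 rather than the indices 0,1,2, which only rescales things. X and Y are the two games of a pair with different win/draw/loss probabilities, and Z1,Z2 are two independent games drawn from the pooled (averaged) distribution, which is what a trinomial analysis implicitly assumes.

```python
SCORES = (0.0, 0.5, 1.0)  # loss, draw, win

def mean_var(dist):
    """Mean and variance of a game score with dist = (p_loss, p_draw, p_win)."""
    mean = sum(p * s for p, s in zip(dist, SCORES))
    var = sum(p * (s - mean) ** 2 for p, s in zip(dist, SCORES))
    return mean, var

def pair_variances(dist_x, dist_y):
    """Return (Var(X+Y), Var(Z1+Z2)) for independent X, Y and for
    independent Z1, Z2 that both follow the averaged distribution."""
    _, var_x = mean_var(dist_x)
    _, var_y = mean_var(dist_y)
    pooled = tuple((a + b) / 2 for a, b in zip(dist_x, dist_y))
    _, var_z = mean_var(pooled)
    return var_x + var_y, 2 * var_z

# Hypothetical unbalanced opening: the engine does well with White
# and badly with Black.
as_white = (0.05, 0.35, 0.60)
as_black = (0.55, 0.40, 0.05)
var_xy, var_zz = pair_variances(as_white, as_black)
print(f"Var(X+Y)={var_xy:.6f}  Var(Z1+Z2)={var_zz:.6f}")
```

The two variances differ by (E[X]-E[Y])^2/2, so the pooled model overstates the variance of the pair score whenever the two games have different expected scores, even though X and Y are independent.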

Posted: **Mon Dec 31, 2018 9:33 am**

Michel wrote: ↑Mon Dec 31, 2018 12:41 am
Well 17% is not nothing...

I don't want to make a big deal out of this, but nonetheless let me repeat. What you call correlation in the case of unbalanced opening is not what is technically called correlation. Correlation between two random variables means that the covariance is not zero. https://en.wikipedia.org/wiki/Covarianc ... orrelation. Independent random variables are always uncorrelated. In a game pair the two games are independent and hence their outcomes are uncorrelated (but the distributions of the outcomes of the two games in the pair are different trinomial distributions).

The reason why we need the pentanomial model in the case of unbalanced openings is that if X,Y are independent random variables with different trinomial distributions (indexed by 0,1,2) then the distribution of X+Y is not the same as the distribution of Z1+Z2 with Z1,Z2 having independent identical trinomial distributions.

Ah, sorry, for unbalanced openings I was thinking more in terms of "means" and "standard deviations" when talking of correlation. Correlation might play no role in compressing the variance with unbalanced openings, right? I built a toy model with binomial outcomes in two arrays (side-reversed) having correlation 0, but with the trinomial variance coming out much smaller than the binomial one. I somehow forgot that covariance is computed ordered element to ordered element.

I got from ultra-balanced 2-movers at 300,000 nodes per move 4.5% compression in sigma (was about 6% from 2moves_v1) with pentanomial, so some (small) effect on compression of variance was due to unbalancedness of some openings in 2moves_v1, not correlations. At higher node count this compression of 9-10% in variance increases.

Posted: **Mon Dec 31, 2018 4:41 pm**

Laskos wrote: ↑Mon Dec 31, 2018 9:33 am
Ah, sorry, for unbalanced openings I was thinking more in terms of "means" and "standard deviations" when talking of correlation. Correlation might play no role in compressing the variance with unbalanced openings, right? I built a toy model with binomial outcomes in two arrays (side-reversed) having correlation 0, but with the trinomial variance coming out much smaller than the binomial one. I somehow forgot that covariance is computed ordered element to ordered element.

I got from ultra-balanced 2-movers at 300,000 nodes per move 4.5% compression in sigma (was about 6% from 2moves_v1) with pentanomial, so some (small) effect on compression of variance was due to unbalancedness of some openings in 2moves_v1, not correlations. At higher node count this compression of 9-10% in variance increases.

Thanks!! But the real question is if (score-0.5)/sigma (normalized elo) actually goes up? Sadly it may not be possible to measure this in reasonable time if high node count is required since correlation probably only exists for closely related engines and in that case the score will be very close to 0.5 and the standard deviation on normalized elo is 1/sqrt(N).

Posted: **Mon Jan 07, 2019 9:41 am**

Michel wrote:I don't want to make a big deal out of this, but nonetheless let me repeat. What you call correlation in the case of unbalanced opening is not what is technically called correlation. Correlation between two random variables means that the covariance is not zero. https://en.wikipedia.org/wiki/Covarianc ... orrelation. Independent random variables are always uncorrelated. In a game pair the two games are independent and hence their outcomes are uncorrelated.

Actually I was wrong about this. If the bias varies then there will be correlation since the outcome of the first game in a game pair will give information on the bias which will influence the prediction for the outcome of the second game (consider the case where the book only contains forced mate positions, sometimes for white and sometimes for black). Luckily the pentanomial model seamlessly takes care of all these effects.
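The forced-mate example can be checked with a small sketch. The book is modelled as a mixture of opening types; given the opening, the two games of a pair are independent, but across the whole book their scores are correlated. The function name and the tuple layout are assumptions made for this illustration.

```python
def pair_covariance(book):
    """Covariance between the scores of game 1 and game 2 of a pair.

    book: list of (weight, mean_score_game1, mean_score_game2), one entry
    per opening type, for the same engine (White in game 1, Black in
    game 2). Given the opening the two games are assumed independent,
    so E[XY | opening] is the product of the two means.
    """
    total = sum(w for w, _, _ in book)
    ex = sum(w * m1 for w, m1, _ in book) / total
    ey = sum(w * m2 for w, _, m2 in book) / total
    exy = sum(w * m1 * m2 for w, m1, m2 in book) / total
    return exy - ex * ey

# Book with only forced mates: half for White, half for Black.
forced_mates = [(0.5, 1.0, 0.0),   # White mates: engine wins game 1, loses game 2
                (0.5, 0.0, 1.0)]   # Black mates: engine loses game 1, wins game 2
print(pair_covariance(forced_mates))

# Book with a single constant bias: no correlation between the games.
constant_bias = [(1.0, 0.8, 0.2)]
print(pair_covariance(constant_bias))
```

With the forced-mate book the covariance is -0.25, the most negative value possible for scores in [0, 1], and every pair scores exactly 1 point, so the variance of the pair score is zero. With a constant bias the covariance vanishes, which is the independent case discussed earlier in the thread.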

Fixed nodes games and the pentanomial model.
