## LOS calculation: Does the same result is always the same?

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

### LOS calculation: Does the same result is always the same?

I am really clueless in this field of statistical calculations.

Nevertheless I noted some things when looking at a matches between engine A and B, that I'd like experts to comment on:

1) The current algorithm to calculate ELO, LOS, etc use the "final" result, the numbers of wins, lost, draws

2) Experiencing with testing at different TC I noted that at fast TC results are more "volatile" than at longer TC. At fast TC a +10 ELO after 1000 games could easily be reverted in the next 10K games, instead at long TC already after 300 games the current winner has a big potential to be the final winner even after 10K games.

3) In all the ELO calculations the probability for the stronger engine to win a match is assumed to be independent from the TC used. This IMHO is a flaw.

My understanding is that in any match there is a noise level that alters the natural result, i.e. the stronger wins. The reason why we need a lot of games is to average out this noise that is assumed with zero median. But I also think that this noise is not independent from the TC used.

When playing a match we have much more information that the final result. We have all the single games results series. My understanding is that analyzing the sequence of single results a "variance" or noise level estimation could be calculated. The level of noise could be then used to reach the final LOS: my guess is that at long TC less games are needed than at very fast TC (note that total time could be the same or even longer for long TC) to reach a given LOS with a given accuracy.

So my final question is, does anybody ever considered to model the single game result noise using the games series and then use it to calculate the LOS ?

Ajedrecista
Posts: 1405
Joined: Wed Jul 13, 2011 7:04 pm
Contact:

### LOS calculation: does the same result is always the same?

Hello Marco:
mcostalba wrote:I am really clueless in this field of statistical calculations.

Nevertheless I noted some things when looking at a matches between engine A and B, that I'd like experts to comment on:

1) The current algorithm to calculate ELO, LOS, etc use the "final" result, the numbers of wins, lost, draws

2) Experiencing with testing at different TC I noted that at fast TC results are more "volatile" than at longer TC. At fast TC a +10 ELO after 1000 games could easily be reverted in the next 10K games, instead at long TC already after 300 games the current winner has a big potential to be the final winner even after 10K games.

3) In all the ELO calculations the probability for the stronger engine to win a match is assumed to be independent from the TC used. This IMHO is a flaw.

My understanding is that in any match there is a noise level that alters the natural result, i.e. the stronger wins. The reason why we need a lot of games is to average out this noise that is assumed with zero median. But I also think that this noise is not independent from the TC used.

When playing a match we have much more information that the final result. We have all the single games results series. My understanding is that analyzing the sequence of single results a "variance" or noise level estimation could be calculated. The level of noise could be then used to reach the final LOS: my guess is that at long TC less games are needed than at very fast TC (note that total time could be the same or even longer for long TC) to reach a given LOS with a given accuracy.

So my final question is, does anybody ever considered to model the single game result noise using the games series and then use it to calculate the LOS ?
You can try Protools 1.3 by Ed Schröder: if you read the bottom of that web, you will see that this tool can plot the evolution of the score of a match. I suggest you making tests with STC and LTC to see the evolution of scores and you can write a script or whatever small programme for calculate LOS (I assume a direct match of only two engines, otherwise I do not know how to calculate LOS) using the number of wins and loses or using a normal distribution with mean (indeed the score of one of the engines) and standard deviation (with or without the draw ratio). This programme can calculate LOS after each game is finished and you can see the evolution of LOS with each TC.

Sorry for the little help I can provide you... good luck! Thank you very much for the recent SF release.

Regards from Spain.

Ajedrecista.

Eelco de Groot
Posts: 4186
Joined: Sun Mar 12, 2006 1:40 am
Location: Groningen

### Re: LOS calculation: Does the same result is always the same

mcostalba wrote:I am really clueless in this field of statistical calculations.

Nevertheless I noted some things when looking at a matches between engine A and B, that I'd like experts to comment on:

1) The current algorithm to calculate ELO, LOS, etc use the "final" result, the numbers of wins, lost, draws

2) Experiencing with testing at different TC I noted that at fast TC results are more "volatile" than at longer TC. At fast TC a +10 ELO after 1000 games could easily be reverted in the next 10K games, instead at long TC already after 300 games the current winner has a big potential to be the final winner even after 10K games.

3) In all the ELO calculations the probability for the stronger engine to win a match is assumed to be independent from the TC used. This IMHO is a flaw.

My understanding is that in any match there is a noise level that alters the natural result, i.e. the stronger wins. The reason why we need a lot of games is to average out this noise that is assumed with zero median. But I also think that this noise is not independent from the TC used.

When playing a match we have much more information that the final result. We have all the single games results series. My understanding is that analyzing the sequence of single results a "variance" or noise level estimation could be calculated. The level of noise could be then used to reach the final LOS: my guess is that at long TC less games are needed than at very fast TC (note that total time could be the same or even longer for long TC) to reach a given LOS with a given accuracy.

So my final question is, does anybody ever considered to model the single game result noise using the games series and then use it to calculate the LOS ?
I also had that impression that there is a "noise" in the results that is stronger at very fast timecontrols. But the fact that it diminishes with timecontrol, suggests that it is an overlaying noise, and therefore not related to the playing strength of the engines. As such it may be a Gaussian noise, or partly non Gaussian (as in experiments with systematic error), but the elo model does not apply to it (either way). Elo also does not take this kind of noise in consideration, as far as I know. At least I can't remember reading much about it in Arpad Elo's book but it was rather a long time ago that I read that.

I am not sure if there is much you can do by studying subsamples of the whole dataset used. (*). Maybe it could be used to estimate how strong this, still hypothetical I think, overlaying noise is. I am not sure if it would change the actual calculation of the elonumbers. It is just a hunch but maybe a more complex model with this noise taken into consideration, would only demand more games for a given confidence level or needed LOS. But I think Rémi Coulom or Harm Geert would be more capable to answer those questions.

Regards, Eelco

(*):There was an article in Scientific American, maybe ten years or more ago, that went into these kind of techniques without too much math. If I had a reference to a webbased article of it I would give it, but it was in a time when there was not so much to be found on the net. I vaguely remember the article also mentioned something like "bootstrapping" in this context, but I don't remember the specifics of it.
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

### Re: LOS calculation: Does the same result is always the same

Eelco de Groot wrote: I am not sure if there is much you can do by studying subsamples of the whole dataset used. (*). Maybe it could be used to estimate how strong this, still hypothetical I think, overlaying noise is. I am not sure if it would change the actual calculation of the elonumbers. It is just a hunch but maybe a more complex model with this noise taken into consideration, would only demand more games for a given confidence level or needed LOS. But I think Rémi Coulom or Harm Geert would be more capable to answer those questions.
Eelco, of course it changes the way we calculate LOS!

In an ideal condition, without noise you just need 1 game to understand who is the stronger engine out of 2!!!

In an extremely noisy environment 100K games can be not enough. I remember I was testing with little blitzer on my QUAD some time ago. The same 2 engines with the same TC with the same number of threads (2 each one), when tested in single match I got a result converging after about 10K games, when I repeated the test with two games in parallel (so that all the 4 cores where used and also with some overlap) I got nothing after 10K games and I needed a much longer (in number of games) test.

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

### Re: LOS calculation: Does the same result is always the same

mcostalba wrote: In an ideal condition, without noise you just need 1 game to understand who is the stronger engine out of 2!!!
The noise is intrinsic in the game of chess and cannot be removed.

It is due to the fact that in all the non trivial positions the search tree is only a small subset of the full tree. A better engine searches on a "better" sub-tree than a weak engine. Where "better" it means a sub-tree in which there is a higher probability to find a good move /avoid a mistake. So a sub-tree can be better than another one because is bigger (faster engine) or because avoids useless nodes (more selective engine).

Anyhow also the best engine cannot do more than picking up the best move of the searched sub-tree, but because this is only a part of the whole, there is always a possibility that the "best move" in the searched sub-tree turns out to be a bad one in the full tree. When this happens this is a "noise" event because the stronger engine will play a weak move and will lose the game, although to be the stronger.

At very fast TC the searched sub-tree is smaller and so increases the possibility that the best move in the sub-tree is the wrong one in the full game tree.

Vinvin
Posts: 4413
Joined: Thu Mar 09, 2006 8:40 am
Full name: Vincent Lejeune

### Re: LOS calculation: Does the same result is always the same

mcostalba wrote:...
2) Experiencing with testing at different TC I noted that at fast TC results are more "volatile" than at longer TC. At fast TC a +10 ELO after 1000 games could easily be reverted in the next 10K games, instead at long TC already after 300 games the current winner has a big potential to be the final winner even after 10K games.
You probably pointed an interesting thing : the number of computed nodes (for comparable engines) is as much important as number of games. Knowledge shows it's strength at each nodes but with nearby versions the probability the knowledge change the best move at this node is tiny (1 on 1000000 ?). So you've to examine a lot of nodes to see (non-)improvements ...

This was a diversion from my observation about human playing strength : "strength of human players is shown at each move" and players have to play good succeeding moves (and analyze a lot of moves) to win a game over a weaker opponent.

hgm
Posts: 23871
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

### Re: LOS calculation: Does the same result is always the same

The only ways that results for a given number of games could be more 'volatile' (i.e. have larger standard deviation) is if they had a significantly lower draw rate, or if the results of individual games are somehow dependent. (So that N games do not really count as N games, but, say, as N/2 pairs of games, where the result of game 1 from a pair would predict the result of game 2.)

The latter is hard to imagine, and points to a severe flaw in testing methodology. It means there should be some memory from one game to he next, so that games could influence each other. This could lead to 'winning streaks' and 'losing streaks'. A way to check that would be to Forier-transform the result sequence of a test, to see if the power spetcrum looks like white noise, or decays at high frequencies.

Eelco de Groot
Posts: 4186
Joined: Sun Mar 12, 2006 1:40 am
Location: Groningen

### Re: LOS calculation: Does the same result is always the same

hgm wrote:The only ways that results for a given number of games could be more 'volatile' (i.e. have larger standard deviation) is if they had a significantly lower draw rate, or if the results of individual games are somehow dependent. (So that N games do not really count as N games, but, say, as N/2 pairs of games, where the result of game 1 from a pair would predict the result of game 2.)

The latter is hard to imagine, and points to a severe flaw in testing methodology. It means there should be some memory from one game to he next, so that games could influence each other. This could lead to 'winning streaks' and 'losing streaks'. A way to check that would be to Forier-transform the result sequence of a test, to see if the power spetcrum looks like white noise, or decays at high frequencies.
But what if a certain percentage of the outcomes is not determined by the games and difference in chess-strength, but by totally unrelated factors. For instance a not so good operating system might have trouble equally dividing system time between two engines. You would naturally expect that effect much more at extremely short timecontrols. It would get worse if the operating system has to balance several threads, all running with chess engines. The problem would get worse if more background processes are running but the effect of ultra short timecontrols would dominate. Would this not increase volatility, make matches more unpredictable? I think this is roughly the kind of effects we are seeing (or imagining we are seeing) and then it would be totally unrelated to the elo difference.

Eelco
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan

hgm
Posts: 23871
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

### Re: LOS calculation: Does the same result is always the same

It would not have the slightest effect on the standard error, if it was independent between games. So if the OS would randomly kill an engine every five minutes, so it forfeits, it would not have no effect. Even if the OS would only kill engine A every 5 minutes it would have no effect on the standard error. (It would have effect on the score, of course.)

Only if the OS would decide to disadvantage engine A for several games in a row, and at other times engine B for several games in a row, it would induce a correlation between games needed to drive up the variance of the match result.

Uri Blass
Posts: 8642
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

### Re: LOS calculation: Does the same result is always the same

hgm wrote:The only ways that results for a given number of games could be more 'volatile' (i.e. have larger standard deviation) is if they had a significantly lower draw rate, or if the results of individual games are somehow dependent. (So that N games do not really count as N games, but, say, as N/2 pairs of games, where the result of game 1 from a pair would predict the result of game 2.)

The latter is hard to imagine, and points to a severe flaw in testing methodology. It means there should be some memory from one game to he next, so that games could influence each other. This could lead to 'winning streaks' and 'losing streaks'. A way to check that would be to Forier-transform the result sequence of a test, to see if the power spetcrum looks like white noise, or decays at high frequencies.
If you play a match from the same position with opposite colors then it is obvious that results of individual games are dependent and I expect the correlation to be higher at long time control.

I think that long time control has more draws so it makes sense that 10 elo difference at long time control with the same number of games is more
significant in the meaning that you can have more confidence that it is an improvement.

An extreme example

20-0 and 980 draws is clearly a significant result
510-490 with no draws is not a significant result.

Both results have the same estimate for the elo difference.