lkaufman wrote: Another issue is that the programming of engines, as well as the play of human grandmasters, is aimed to maximize score with draws counting as 1/2, rather than just number of wins (although wins are sometimes used as a tiebreak). WILO might be better mathematically, but it does not correspond to the actual scoring of tournaments. This is not a minor issue. Suppose Komodo (or even Carlsen) reaches a middlegame position with a half-pawn advantage or so. He has to decide between retaining queens with let's say a 60% winning chance, a 20% losing chance, and a 20% drawing chance. Or he can simplify to an endgame with a 24% winning chance, a 75% drawing chance, and a 1% losing chance (i.e. a gross blunder or flag fall). In any normal tournament or match, he should keep queens on (assuming a neutral tournament/match situation) to maximize his expected score. But to maximize WILO, he should trade queens. Komodo has code to try to avoid simplifying in such a situation (maybe not very effective, but that's irrelevant); if we wanted to maximize WILO we would have to make significant program changes. In my view, we would have to return to the old practice of replaying draws until someone wins to justify switching to WILO. Elimination tournaments with playoffs at faster time limits to break ties are a version of this, but then you are rating blitz games together with slow ones. This is also my objection to Bayes Elo; it also makes an assumption that does not correspond to normal match/tournament scoring.

Valid objection. It stems from the fact that in ELO you assume a given number of games N, and one has to optimise W-L for fixed N. In WILO you don't assume a fixed number of games, and one has to optimise W/L every time. If you assume a fixed N in WILO, the W-L and W/L descriptions are equivalent. So rating lists built from games played according to ELO, which optimize W-L for a given number of games, shouldn't be used as WILO rating lists with the Draws cleaned out, because optimizing ELO with fixed N and WILO with floating N' leads to different playing goals. Only games played to optimize WILO should be used, and there are none. A fixed N might be adopted for WILO (replaying drawn games, which would bring the goals back on track).

Your point is valid: it's a slightly different game, and both humans and engines will have to adjust to a maybe better rating system for Chess like WILO. Also, there are many cases when the game of Chess has different goals, depending on the tournament: Match, RR with 2 opponents, RR with 40 opponents, Swiss, Knock-Out, Tie-Breaks, ELO gap 200, ELO gap 1000, and so on. For each, the "evals" of both Humans and Engines have to adjust.
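The conflict in lkaufman's queen-trade example can be checked with a few lines of Python. This is only a sketch: it assumes WiLo is 400*log10(W/L), the definition quoted later in this thread, and the helper names `expected_score` and `wilo` are made up for illustration.

```python
from math import log10


def expected_score(win, draw, loss):
    """Expected tournament score with a draw counted as half a point."""
    return win + draw / 2


def wilo(win, loss):
    """WiLo of one decision, assuming WiLo = 400*log10(W/L); draws are ignored."""
    return 400 * log10(win / loss)


# Probabilities taken from the example: (win, draw, loss)
keep_queens = (0.60, 0.20, 0.20)
trade_queens = (0.24, 0.75, 0.01)

for name, (w, d, l) in (("keep queens", keep_queens), ("trade queens", trade_queens)):
    print(name, "score=%.3f" % expected_score(w, d, l), "wilo=%.1f" % wilo(w, l))
```

Keeping queens gives the higher expected score (0.700 against 0.615), while trading queens gives the far larger W/L ratio (24 against 3), so the two criteria point in opposite directions.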

## Wilo rating properties from FGRL rating lists

### Re: Wilo rating properties from FGRL rating lists

Laskos wrote: That LOS with a uniform prior is independent of draws? The math is not that complicated, probably best presented here:

http://www.talkchess.com/forum/viewtopi ... 05&t=30624

OK sure. But that is hardly justification for a rating system that ignores drawn games.

### Re: Wilo rating properties from FGRL rating lists

It seems likely that WiLo is an ideal model for predicting the winner of a head to head match. (In the case of dann's extreme example of a trillion draws and 1 loss it should still produce an error margin so large that the answer 'hell if i know' should be the same as the elo model).

It seems less ideal for a tournament setting where how well you bash apart the field is more important than how you do against your rival (giving away draws against weaker opponents is a problem).

I'd be interested in seeing a system optimised for 3-1-0 tournaments, where a win plus a loss is worth more than two draws (W+L > 2 * D); that should lead to more interesting play in the optimised case.

### Re: Wilo rating properties from FGRL rating lists

There was just an interesting test on fishtest.

SF8 and the latest dev version were pitted against each other at time controls 10+0.1, 60+0.6 and 180+1.8.

Despite the 18-fold increase in TC, the value of elo/sigma(elo)/sqrt(games) remained almost exactly the same (let's call it normalized elo).

So it seems that, at least in this case, elo/sigma(elo)/sqrt(games) is a good statistic to use.

The properties of elo/sigma(elo)/sqrt(games) are:

(1) Its expectation value is independent of the number of games.

(2) Its standard deviation is 1/sqrt(games), i.e. independent of the draw ratio.

Normalized elo is probably related to wilo (if wilo is indeed TC independent) but it has a more solid theoretical foundation. It is a measure of the number of games it takes to establish the superiority of one engine over another with a specified p-value (or LOS).

Here is some simple Python code that computes the normalized elo with an error estimate (for small elo values):

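```
from __future__ import division


def sens(W=None, D=None, L=None):
    """Return the 95% confidence interval of normalized elo (small elo values)."""
    N = W + D + L
    (w, d, l) = (W / N, D / N, L / N)
    s = w + d / 2  # mean score per game
    # per-game variance of the score
    var = w * (1 - s) ** 2 + d * (1 / 2 - s) ** 2 + l * (0 - s) ** 2
    sigma = var ** .5
    # normalized elo is (s-1/2)/sigma; its standard deviation is 1/sqrt(N)
    return ((s - 1 / 2) / sigma - 1.96 / N ** .5, (s - 1 / 2) / sigma + 1.96 / N ** .5)


if __name__ == '__main__':
    print('stc', sens(W=7460, L=5146, D=27394))
    print('ltc', sens(W=3777, L=2201, D=21308))
    print('vltc', sens(W=3919, L=2178, D=28614))
```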

Ideas=science. Simplification=engineering.

Without ideas there is nothing to simplify.

### Re: Wilo rating properties from FGRL rating lists

Michel wrote: There was just an interesting test on fishtest. [...]

Nice observation! For elo/sigma(elo) you can use the simplified expression for small Elo differences: (Wins-Losses)/sqrt(Wins+Losses). And we see that although this expression is independent of Draws, and is identical in ELO and WILO, the time control invariant for ELO, elo/sigma(elo)/sqrt(games), is dependent on draw rate (but is independent of the total number of games).

**Invariant ELO = (Wins-Losses)/sqrt(Wins+Losses)/sqrt(games)**

It seems to not depend much on time control. It is independent of the number of games. It is dependent on draw ratio.

The 95% confidence interval for "Invariant ELO" is

**+- 1.96/sqrt(games)**,

and is independent of the draw rate.

The "Invariant ELO" can also be interpreted by inverting it:

sqrt(Wins+Losses) * sqrt(games) / (Wins-Losses)

The square of this, (Wins+Losses) * games / (Wins-Losses)**2, is the number of games required to get 1 standard deviation, or LOS=84%, and is invariant with regard to time control according to your observation. It is also independent of the number of games played. It depends on the draw rate. To get X standard deviations (and the corresponding LOS), one needs X^2 times as many games.

WILO is identical if we keep Draws in the number of games. If WILO is applied to a "drawless Chess" (Draws do not occur), dealing only with Wins and Losses in "games", then the number of games needed for 1 SD in WILO should decrease with time control, so this quantity is no longer time control independent in WILO.
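As a rough sketch of the small-difference formulas in this post, the following Python computes the "Invariant ELO" and the number of games needed to reach a given number of standard deviations. The function names are illustrative only, and the input is the STC result from Michel's script.

```python
from math import sqrt


def invariant_elo(W, D, L):
    """Small-difference approximation: (W-L)/sqrt(W+L)/sqrt(games)."""
    games = W + D + L
    return (W - L) / sqrt(W + L) / sqrt(games)


def games_needed(W, D, L, sd=1.0):
    """Games needed to reach `sd` standard deviations: sd**2 * (W+L)*games/(W-L)**2."""
    games = W + D + L
    return sd ** 2 * (W + L) * games / (W - L) ** 2


if __name__ == '__main__':
    W, D, L = 7460, 27394, 5146  # STC result quoted in Michel's script
    half_width = 1.96 / sqrt(W + D + L)  # 95% interval, independent of the draw rate
    print('invariant elo: %.4f +/- %.4f' % (invariant_elo(W, D, L), half_width))
    print('games for 1 SD: %.1f' % games_needed(W, D, L))
    print('games for 2 SD: %.1f' % games_needed(W, D, L, sd=2))
```

Doubling the number of standard deviations multiplies the required games by four, as stated above.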

### Re: Wilo rating properties from FGRL rating lists

Thanks. Yes, to understand the difference between elo, wilo and normalized elo one can indeed look at small elo differences and do a Taylor series expansion. So we put

(w,d,l)=(a+eps,1-2*a,a-eps)

and we look at the dominant term in eps.

The result is as follows:

- elo is proportional to eps
- normalized elo is proportional to eps/sqrt(a)
- wilo is proportional to eps/a

In the fishtest experiment normalized elo seemed to remain constant across TC, so eps/sqrt(a) stayed constant. Since a goes down with TC, one expects:

- elo goes down with TC
- normalized elo stays constant (this was our hypothesis)
- wilo goes up with TC

This was indeed what was observed.

Of course this is only a single data point and it is dangerous to draw serious conclusions from it, but it points the way to a more systematic analysis of "scaling".
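The three proportionalities can be checked numerically. The sketch below is an illustration under stated assumptions: elo is taken as the usual logistic 400*log10(s/(1-s)), wilo as 400*log10(w/l), and normalized elo as (s-1/2)/sigma from the script earlier in the thread; the function names are not from the original posts.

```python
from math import log10, sqrt


def elo(w, d, l):
    """Logistic elo from the score s = w + d/2 (assumed definition)."""
    s = w + d / 2
    return 400 * log10(s / (1 - s))


def normalized_elo(w, d, l):
    """(s - 1/2) / sigma, with sigma the per-game standard deviation."""
    s = w + d / 2
    var = w * (1 - s) ** 2 + d * (0.5 - s) ** 2 + l * s ** 2
    return (s - 0.5) / sqrt(var)


def wilo(w, d, l):
    """WiLo = 400*log10(w/l); draws are ignored."""
    return 400 * log10(w / l)


eps = 0.001  # small advantage
for a in (0.30, 0.20, 0.10):  # a shrinks as the draw ratio 1-2a grows
    w, d, l = a + eps, 1 - 2 * a, a - eps
    # Each ratio should be nearly the same for every a if the scaling law holds.
    print(a,
          elo(w, d, l) / eps,
          normalized_elo(w, d, l) / (eps / sqrt(a)),
          wilo(w, d, l) / (eps / a))
```

For small eps the three printed ratios stay close to 1600/ln(10), sqrt(2) and 800/ln(10) respectively, whatever the value of a.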

### Re: Wilo rating properties from FGRL rating lists

Michel wrote: Thanks. Yes to understand the difference between elo,wilo and normalized elo one can indeed look at small elo differences and do taylor series development. [...]

Good, so Normalized ELO is somewhat middle of the road between ELO and WILO, and has a very nice interpretation, being possibly a time control invariant.

I used FGRL data to compute the Normalized ELO rating lists and scaling globally. I used your python script, as ELO differences are not that small.

```
60'' + 0.6''
 #  PLAYER            Norm ELO   Error (95%)
================================================
                                    0.041
 1. Stockfish 8          0.904
 2. Houdini 5            0.849
 3. Komodo 10.4          0.745
 4. Shredder 13          0.011
 5. Fire 5              -0.073
 6. Fizbo               -0.177
 7. Gull 3              -0.289
 8. Andscacs 0.90       -0.501
 9. Fritz 15            -0.509
10. Chiron 4            -0.536
```

```
60' + 15''
 #  PLAYER            Norm ELO   Error (95%)
================================================
                                    0.053
 1. Stockfish 8          0.760
 2. Komodo 10.4          0.700
 3. Houdini 5            0.605
 4. Shredder 13          0.056
 5. Fire 5              -0.026
 6. Gull 3              -0.242
 7. Fizbo 1.9           -0.263
 8. Andscacs 0.90       -0.264
 9. Chiron 4            -0.513
10. Fritz 15            -0.575
```

```
Scaling:
 #  PLAYER            Norm ELO   Error (95%)
================================================
                                    0.067
 1. Andscacs 0.90        0.237
 2. Fire 5               0.047
 3. Gull 3               0.047
 4. Shredder 13          0.045
 5. Chiron 4             0.023
 6. Komodo 10.4         -0.045
 7. Fritz 15            -0.066
 8. Fizbo 1.9           -0.086
 9. Stockfish 8         -0.144
10. Houdini 5           -0.244
```

Just visually, it seems that in scaling Normalized ELO, like ELO, favors weaker engines. WILO didn't exhibit this behavior.

I also played two matches of 2000 games each between Stockfish dev and Komodo 10.4 at 10''+0.1'' and 60''+0.6'':

```
Score of SF dev vs Komodo: 654 - 142 - 1204 [0.628] 2000
ELO difference: 90.97 +/- 9.38
Score of SF dev vs Komodo: 470 - 126 - 1404 [0.586] 2000
ELO difference: 60.36 +/- 8.12
```

```
STC: 0.444 +/- 0.044
LTC: 0.332 +/- 0.044
```
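The STC and LTC values are consistent with Michel's normalized elo formula applied to the two match scores. The sketch below recomputes them, assuming the score lines read wins - losses - draws (which matches the printed score fractions 0.628 and 0.586) and that the ELO difference is the usual logistic one; the function names are illustrative.

```python
from math import log10, sqrt


def normalized_elo(W, D, L):
    """Return (point estimate, 95% half-width) of normalized elo, as in Michel's sens()."""
    N = W + D + L
    w, d, l = W / N, D / N, L / N
    s = w + d / 2
    var = w * (1 - s) ** 2 + d * (0.5 - s) ** 2 + l * s ** 2
    return (s - 0.5) / sqrt(var), 1.96 / sqrt(N)


def logistic_elo(W, D, L):
    """ELO difference from the score fraction (assumed logistic formula)."""
    s = (W + D / 2) / (W + D + L)
    return 400 * log10(s / (1 - s))


# (W, D, L) for SF dev vs Komodo 10.4, taken from the two score lines
matches = {'STC': (654, 1204, 142), 'LTC': (470, 1404, 126)}
for tc, (W, D, L) in matches.items():
    ne, err = normalized_elo(W, D, L)
    print('%s: %.3f +/- %.3f (ELO %.2f)' % (tc, ne, err, logistic_elo(W, D, L)))
```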

### Re: Wilo rating properties from FGRL rating lists

I prefer to measure things in BayesElo. The point is that DrawElo absorbs the higher draw ratio between different tc. Constant BayesElo scaling is in fact the implicit assumption underpinning SF testing methodology.

So I would conclude that Stockfish has good scaling. Both BayesElo and Wilo are still increasing between LTC and VLTC. Scaling is even better from STC to LTC.

The BayesElo gain from LTC to VLTC is probably still within error bars. However, error bars are difficult to calculate for this model. Apart from running MC simulations, I can't see another way.

```
tc W L D DrawElo BayesElo Wilo
10+0.1 7460 5146 27394 294 38.2 64.5
60+0.6 5397 3150 30010 368 52.5 93.5
180+1.8 4318 2394 31584 414 56.0 102.5
```
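The columns of this table can be recomputed from W, L and D alone. The sketch below is hedged: Wilo uses 400*log10(W/L) as defined elsewhere in the thread, while the BayesElo and DrawElo values come from solving an assumed two-player model, P(win) = f(elo - drawelo) and P(loss) = f(-elo - drawelo) with f(x) = 1/(1 + 10**(-x/400)). The function names are hypothetical and this is not the BayesElo program itself.

```python
from math import log10


def wilo(W, D, L):
    """WiLo = 400*log10(W/L); draws are ignored."""
    return 400 * log10(W / L)


def bayeselo(W, D, L):
    """Return (BayesElo, DrawElo) for one pairing under the assumed model above."""
    N = W + D + L
    w, l = W / N, L / N
    logit_w = log10(w / (1 - w))  # equals (elo - drawelo) / 400
    logit_l = log10(l / (1 - l))  # equals (-elo - drawelo) / 400
    return 200 * (logit_w - logit_l), -200 * (logit_w + logit_l)


# (tc, W, L, D) rows copied from the table
rows = [('10+0.1', 7460, 5146, 27394),
        ('60+0.6', 5397, 3150, 30010),
        ('180+1.8', 4318, 2394, 31584)]
for tc, W, L, D in rows:
    elo, drawelo = bayeselo(W, D, L)
    print('%-8s DrawElo=%.0f BayesElo=%.1f Wilo=%.1f' % (tc, drawelo, elo, wilo(W, D, L)))
```

Under these assumptions the recomputed values agree with the table to within rounding.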

Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

### Re: WiLo rating properties from FGRL rating lists.

Hello:

Just to complete the picture:

Michel wrote: Thanks. Yes to understand the difference between elo,wilo and normalized elo one can indeed look at small elo differences and do taylor series development. So we put (w,d,l)=(a+eps,1-2*a,a-eps) and we look at the dominant term in eps. The result is as follows: elo is proportional to eps; normalized elo is proportional to eps/sqrt(a); wilo is proportional to eps/a.

lucasart wrote: Both BayesElo and Wilo are still increasing between LTC and VLTC. Scaling is even better from STC to LTC.

For small differences, using Bayeselo = 200*log10{(1 - L)*W/[(1 - W)*L]} and Michel's notation, I get Bayeselo ~ 2*eps/[a*(1 - a)] ~ eps/[a*(1 - a)], which is somewhat similar to WiLo's behaviour for reasonable values of a {please remember that 0 =< a =< 1/2, then just compare 1/a and 1/[a*(1 - a)]}.

In fact, not taking into account constants of 200 and 400:

```
WiLo = 400*log10(W/L)
Bayeselo = 200*log10{(1 - L)*W/[(1 - W)*L]}
~ means 'proportional to' here.
WiLo ~ ln(W/L) ~ 2*eps/a ~ eps/a
Bayeselo ~ ln[(1 - L)/(1 - W)] + ln(W/L) ~ ln[(1 - L)/(1 - W)] + WiLo
Bayeselo ~ 2*eps/(1 - a) + 2*eps/a ~ eps*[1/(1 - a) + 1/a] ~ eps/[a*(1 - a)]
```

So I expect some similarities for WiLo and Bayeselo for small differences (|eps| << 1), unless I got something wrong somewhere.

Corrections and other insights are welcome.

Regards from Spain.

Ajedrecista.
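A quick numerical check of the two first-order approximations in this post, written as a sketch with made-up function names. It uses W = a + eps and L = a - eps as in Michel's notation and drops the constants 400/ln(10) and 200/ln(10).

```python
from math import log


def wilo_log(a, eps):
    """ln(W/L) with W = a + eps, L = a - eps."""
    return log((a + eps) / (a - eps))


def bayeselo_log(a, eps):
    """ln{(1 - L)*W / [(1 - W)*L]} with W = a + eps, L = a - eps."""
    W, L = a + eps, a - eps
    return log((1 - L) * W / ((1 - W) * L))


eps = 1e-4
for a in (0.05, 0.15, 0.30, 0.45):
    # First-order predictions: 2*eps/a and 2*eps/[a*(1 - a)]
    print(a,
          wilo_log(a, eps) / (2 * eps / a),
          bayeselo_log(a, eps) / (2 * eps / (a * (1 - a))))
```

Both ratios stay very close to 1 over this range of a, so the two measures differ by the factor 1/(1 - a) to first order in eps.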

### Re: WiLo rating properties from FGRL rating lists.

lkaufman wrote: Another issue is [...] not a minor issue [...] to maximize WILO, he should trade queens

That's the main issue, IMO.

The logic is quite simple: I'm playing an interesting unclear game but draws don't matter -> I manage to draw this game and I'll try to win the next one (no matter if I only win after a trillion games).

Chess' goal has always been to try to win while avoiding draws, since draws ARE time-consuming and often laborious. Trying to avoid them by just ignoring them is a fruitless placebo, if it does not actually have the negative effect of wasting time (and games).

Laskos wrote: Also, there are many cases when the game of Chess has different goals [...]

OK, measuring engines' "strength" is already a different goal. In this case Wilo rating can be useful, as can an unbalanced set of openings (which would reduce the weight of draws as well, IMO in a more efficient way if the set is chosen accurately), but this strength should always have the main goal above as its reference.