Michel wrote:
If you test against a set of engines with known elo you can use the likelihood ratio test (also known as Wald test). This test is easy to use.
I'll look for that - do you have any references?
We don't really test head to head, but each version of Komodo tests against a set of opponents (not Komodo versions.)
The Wald test also works for non head to head tests. The key point is that the elo of the opponents needs to be known (which is the case for a head to head test since you may assume that the elo of the single opponent is zero).
The Wald test is for testing a single parameter (e.g. elo). It is implemented as a random walk where each outcome (win/draw/loss) changes a certain number (the likelihood ratio) in a precomputed way. Once the likelihood ratio falls outside a precomputed interval you accept either H0 or H1.
In practice you also have a truncation time and a threshold which determines whether you accept H0 or H1 at truncation.
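The random walk described above can be sketched in a few lines of Python. This is not the utility linked in this post, just a minimal illustration under stated assumptions: a head to head test with H0: elo = 0 and H1: elo = epsilon, a BayesElo-style win/draw/loss model with the draw parameter derived from the draw ratio, Wald's approximate stopping bounds, and a truncation threshold of 0. The names wdl_probs, sprt_parameters and run_test are hypothetical.

```python
import math

def wdl_probs(elo, draw_elo):
    """Win/draw/loss probabilities under a BayesElo-style model (an assumption)."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + draw_elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + draw_elo) / 400.0))
    return {"W": p_win, "D": 1.0 - p_win - p_loss, "L": p_loss}

def sprt_parameters(alpha, beta, epsilon, draw_ratio):
    """Return the per-outcome steps and the stopping interval of the walk."""
    # Choose draw_elo so that the draw probability at elo = 0 equals draw_ratio.
    draw_elo = 400.0 * math.log10((1.0 + draw_ratio) / (1.0 - draw_ratio))
    p0 = wdl_probs(0.0, draw_elo)      # H0: no elo difference
    p1 = wdl_probs(epsilon, draw_elo)  # H1: elo difference equal to epsilon
    # Each outcome adds the log of its likelihood ratio to the walk.
    steps = {o: math.log(p1[o] / p0[o]) for o in "WDL"}
    # Wald's approximate bounds for the log-likelihood ratio.
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return steps, lower, upper

def run_test(outcomes, steps, lower, upper, truncation, threshold=0.0):
    """Walk through outcomes ("W", "D", "L") and return (decision, games used)."""
    llr = 0.0
    n = 0
    for outcome in outcomes:
        if n >= truncation:
            break
        n += 1
        llr += steps[outcome]
        if llr <= lower:
            return "H0", n
        if llr >= upper:
            return "H1", n
    if n < truncation:
        return "undecided", n  # ran out of games before any stopping rule fired
    # At truncation, decide by comparing the walk with the threshold.
    return ("H1" if llr >= threshold else "H0"), n

steps, lower, upper = sprt_parameters(alpha=0.05, beta=0.05, epsilon=5.0, draw_ratio=0.4)
print(steps, lower, upper)
```

In a real test the truncation threshold would be chosen so that the error probabilities still hold after truncation; the value 0 here is only a placeholder.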
I wrote a Python utility that computes the parameters (alas, for now only for a head to head test; I could extend it to more opponents).
http://hardy.uhasselt.be/Toga/wald
Usage: wald [OPTION]...
Generate parameters for a truncated Wald test and optionally
produce an informative graph.

  --alpha       type I error probability when there is no elo difference
  --beta        type II error probability with elo difference equal to epsilon
  --epsilon     see --beta
  --truncation  truncate test after this many observations
  --graph       produce graph
  --draw_ratio  draw ratio when there is no elo difference
  --help        print this message
It will print the parameters of the random walk (i.e. what to add in case of W/D/L) as well as the stopping conditions.
It will also produce a graph, which looks as follows:
http://hardy.uhasselt.be/Toga/graph3.png
The Wald test is different from the likelihood-ratio test. For general hypothesis testing, if you have estimated a model for an alternative hypothesis H1 that gives you a log-likelihood function LogL1(theta) and a parameter theta1 that maximizes LogL1, the Wald test of whether theta1 is significantly different from its value theta0 under the null hypothesis is simply the statistic
(theta1-theta0)^2 / var(theta1)
which is distributed as a chi-squared variable with 1 degree of freedom in the case of a single rating. Note that you don't need to estimate a model of the null hypothesis. Here, var(theta1) is given by the negative inverse of the second derivative of the log-likelihood function at the maximum.
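As a small worked example, here is a sketch that computes (theta1-theta0)^2 / var(theta1) for a deliberately simplified model: theta is a plain win probability (draws ignored), theta0 = 0.5, and var(theta1) is taken as the negative inverse of a numerically estimated second derivative of the log-likelihood at its maximum. The model and the function names are assumptions made for illustration; it requires 0 < wins < games.

```python
import math

def log_likelihood(theta, wins, games):
    """Bernoulli log-likelihood of `wins` out of `games` (draws ignored)."""
    return wins * math.log(theta) + (games - wins) * math.log(1.0 - theta)

def wald_statistic(wins, games, theta0=0.5, h=1e-4):
    theta1 = wins / games  # maximum-likelihood estimate
    # Central finite difference for the second derivative at the maximum.
    second = (log_likelihood(theta1 + h, wins, games)
              - 2.0 * log_likelihood(theta1, wins, games)
              + log_likelihood(theta1 - h, wins, games)) / h ** 2
    var = -1.0 / second  # negative inverse of the second derivative
    return (theta1 - theta0) ** 2 / var

def chi2_1df_pvalue(x):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

w = wald_statistic(wins=60, games=100)
print(w, chi2_1df_pvalue(w))
```

The statistic is compared with the chi-squared critical value for 1 degree of freedom (about 3.84 at the 5% level).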
The likelihood-ratio test, on the other hand, needs the log-likelihood for two models: the null and the alternative hypothesis, say LogL0(theta0) and LogL1(theta1). The likelihood-ratio test is then the statistic
2(LogL1-LogL0)
which is also distributed as a chi-squared variable with 1 degree of freedom. Asymptotically (as the sample size goes to infinity) the two tests behave the same, but for finite samples, LR <= Wald. So with Wald tests, you will reject the null hypothesis more often than with the LR test (at least for small samples).
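To make the comparison concrete, the sketch below evaluates both statistics on a toy model: theta is a plain win probability (draws ignored), theta0 = 0.5, and the null model is the same likelihood with theta fixed at theta0. The model, the sample results and the function names are assumptions chosen for illustration. A few examples do not prove the inequality in general; they only show the gap between the two statistics on these inputs. It requires 0 < wins < games.

```python
import math

def log_likelihood(theta, wins, games):
    """Bernoulli log-likelihood of `wins` out of `games` (draws ignored)."""
    return wins * math.log(theta) + (games - wins) * math.log(1.0 - theta)

def lr_statistic(wins, games, theta0=0.5):
    theta1 = wins / games  # maximum-likelihood estimate under H1
    return 2.0 * (log_likelihood(theta1, wins, games)
                  - log_likelihood(theta0, wins, games))

def wald_statistic(wins, games, theta0=0.5):
    theta1 = wins / games
    # Closed form of the negative inverse second derivative for this model.
    var = theta1 * (1.0 - theta1) / games
    return (theta1 - theta0) ** 2 / var

# Hypothetical results: (wins, games)
for wins, games in [(14, 20), (60, 100), (530, 1000)]:
    lr = lr_statistic(wins, games)
    wald = wald_statistic(wins, games)
    print(f"{wins}/{games}: LR = {lr:.3f}, Wald = {wald:.3f}")
```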