elcabesa wrote:No one can afford 10000 games of 1 hour, so you have to choose. Usually this problem arises when you have to test a 5 Elo point patch. Some people test with a lot of short games to see whether the patch could be good, and then verify promising patches with a long time control.
stegemma wrote:Ok, maybe I could not explain myself clearly in English. So, let's say that you can play 10000 games of 10 seconds each (per side, so about 20 seconds per game). You need about 200000 seconds = about 55 hours. In the same time you can play about 330 games of 5 minutes each. Using the same 55 hours, what I say is that you can obtain the same (or better) results with the longer time control, because you call the evaluation function the same number of times but you play better games. Statistically, the "populations" of evaluation function calls are numerically equivalent... so why lose time playing a first test session at a short time control and then another one at a longer time control?
yurikvelo wrote:With a small number of games your test run will give a bogus result.
If you take any Fishtest run with a negative result, e.g.
http://tests.stockfishchess.org/tests/v ... 1c761f5f78
you will always find sub-runs with the opposite result.
This patch had +1323-1456=4403, but user ttruscott-3cores had +183-165=555.
Ok but both sets have the same amount of time used? That makes sense only if you have little time to spend on tuning and the noise of a few games obscures the equality of evaluation calls.
More games or more plies in testing engines?
Moderators: hgm, Rebel, chrisw
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: More games or more plies in testing engines?
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: More games or more plies in testing engines?
stegemma wrote:Ok but both sets have the same amount of time used?
I don't know, but it wouldn't be surprising. If a change is truly a no-op, the probability of obtaining a result as good as 183 victories to 165 losses (the draws don't really matter) is about 16.7%. So if you accept changes with this level of evidence, you are bound to accept some bad changes along the way. And this is with 903 games played, instead of the 330 in your example.
You should become very familiar with the statistics of binomial distributions, so you can make better decisions about what changes to accept.
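That tail probability can be checked directly from the binomial distribution. A minimal sketch using only the standard library (the post's ~16.7% presumably comes from a normal approximation; the exact one-sided tail lands in the same ballpark):

```python
from math import comb

# Under the null hypothesis that a change is a no-op, each decisive
# game is a fair coin flip; draws carry no information about the sign.
wins, losses = 183, 165
n = wins + losses  # 348 decisive games

# Exact one-sided tail: probability of at least `wins` wins by chance.
p_tail = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
print(f"P(at least {wins} wins out of {n} under a no-op) = {p_tail:.3f}")
```

A result this "good" happens by pure luck in roughly one run out of six, which is exactly why accepting patches on such evidence lets bad changes through.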
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: More games or more plies in testing engines?
AlvaroBegue wrote:I don't know, but it wouldn't be surprising. If a change is truly a no-op, the probability of obtaining a result as good as 183 victories to 165 losses (the draws don't really matter) is about 16.7%. So if you accept changes with this level of evidence, you are bound to accept some bad changes along the way. And this is with 903 games played, instead of the 330 in your example.
You should become very familiar with the statistics of binomial distributions, so you can make better decisions about what changes to accept.
Of course more games give more statistically valid results, but those "more games" are shorter than with longer time controls. I was thinking that the right parameter was the number of evaluation function calls (which depends on the total time, the branching factor and the nps of the engine) and not the number of games. Maybe I'm wrong, because in the end you compute the Elo difference from the final result of each game, and this is a sort of "bottleneck" that makes using a longer time control unproductive in the end.
From a statistical point of view, if you want to know anything about a product in a market, is it better to have 30 polls, each using a (different) population of 1000 people (out of a whole population of 1,000,000), or to have 300 polls of 100 people each? Any poll gives you one of the answers:
- good product
- bad product
This is the situation, if we consider evaluation function calls as the measure of goodness of our tuning session.
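For genuinely independent respondents the analogy does hold: the standard error of the pooled estimate depends only on the total number of people asked, not on how they are split into polls. A quick sketch with a hypothetical 50/50 preference:

```python
import math

p = 0.5  # hypothetical true fraction preferring the product

# 30 polls of 1000 people vs 300 polls of 100 people: both pool
# 30000 independent respondents, so the standard errors are identical.
se_30_polls = math.sqrt(p * (1 - p) / (30 * 1000))
se_300_polls = math.sqrt(p * (1 - p) / (300 * 100))
print(se_30_polls, se_300_polls)
```

The catch, raised in the reply that follows, is that evaluation calls within a single game are highly correlated, so they do not behave like independent respondents; the game result is the real data point.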
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: More games or more plies in testing engines?
I understand what you are trying to say, but it's just wrong. You are not collecting one data point per call to the evaluation function; only one per game. There will be certain aspects of the program that cannot properly be evaluated using 10-second games (anything that happens only at high depth, time control, parallelism...), but for many changes it's just fine.
It's probably best to have a testing procedure that involves lots of very fast games and some not-so-fast games. There will be changes for which you need to run other tests (tight memory limit for changes to hash-replacement policy, for instance).
But it all starts with really understanding the statistics involved.
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: More games or more plies in testing engines?
AlvaroBegue wrote:I understand what you are trying to say, but it's just wrong. You are not collecting one data point per call to the evaluation function; only one per game. There will be certain aspects of the program that cannot properly be evaluated using 10-second games (anything that happens only at high depth, time control, parallelism...), but for many changes it's just fine.
It's probably best to have a testing procedure that involves lots of very fast games and some not-so-fast games. There will be changes for which you need to run other tests (tight memory limit for changes to hash-replacement policy, for instance).
But it all starts with really understanding the statistics involved.
Ok, thanks. I must study the statistical basics to go further. Maybe that would help to improve my weak engine!
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
- Posts: 7216
- Joined: Mon May 27, 2013 10:31 am
Re: More games or more plies in testing engines?
If it loses the first five games in a row, then maybe the latest change is not a very promising improvement.
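For intuition, the chance of an equal-strength engine losing its first five games by bad luck alone is small. A rough sketch, assuming a hypothetical 40% draw rate (so each game is a loss with probability 0.3):

```python
# Per game for an equal-strength engine: win 0.3, draw 0.4, loss 0.3
# (the 40% draw rate is an assumption for illustration).
p_loss = 0.3
p_five_straight_losses = p_loss ** 5
print(f"{p_five_straight_losses:.5f}")  # 0.00243
```

So five straight losses are fairly strong, though not conclusive, evidence against the change.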
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: More games or more plies in testing engines?
It's a trade-off between bias and variance.
Imagine that given any time control and two engines, there's a true Elo difference you can find by playing an infinite number of games.
When you play fewer than an infinite number of games, the Elo difference you measure will be drawn from a probability distribution centered around the true Elo difference. The more games you play, the narrower that distribution is, so the more confident you can be that the Elo difference you get is close to the true difference.
The difference between measured Elo difference and true Elo difference is one source of error.
However, there is another source of error - the fact that the true Elo difference at short time control may not be the same as the true Elo difference at long time control.
Consider the 2 extreme cases -
1) Playing 0.000001s games. You can play millions of games, and get extremely close to the true difference between the two versions AT 0.000001s. But this is clearly a bad idea, because the true difference at this time control is probably very different from the true difference at reasonable time controls.
2) Playing 2 hour games. You can only play a few games in the same amount of time. The true difference here is very close to the true difference you care about, but your estimate of that true difference now has a very large error bar because you have only played very few games.
Neither of these cases is good.
In choosing a time control for testing, you have to find a balance between the two.
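The variance half of this trade-off is easy to simulate: the error bar on a measured Elo difference shrinks roughly as 1/sqrt(games), regardless of time control. A sketch assuming a hypothetical 55% true score, with draws ignored for simplicity:

```python
import math
import random

def score_to_elo(score):
    # Elo difference implied by an expected score in (0, 1)
    return -400 * math.log10(1 / score - 1)

def elo_estimate_sd(n_games, true_score=0.55, trials=1000, seed=1):
    # Spread of the measured Elo difference over many simulated test runs
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        wins = sum(rng.random() < true_score for _ in range(n_games))
        s = min(max(wins / n_games, 1e-9), 1 - 1e-9)  # avoid log of 0
        estimates.append(score_to_elo(s))
    mean = sum(estimates) / trials
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / trials)

print(elo_estimate_sd(330))    # roughly 19 Elo of noise
print(elo_estimate_sd(10000))  # roughly 3.5 Elo of noise
```

The bias side of the trade-off, the true difference shifting with time control, is exactly what a simulation like this cannot capture; only the variance side follows from the game count.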
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.