More games or more plies in testing engines?

stegemma · Post by **stegemma** » Tue Aug 09, 2016 10:52 pm

yurikvelo wrote:
stegemma wrote:
elcabesa wrote:No one can afford 10000 games of 1 hour each, so you have to choose. Usually this problem arises when you have to test a 5 Elo point patch. Some people test with a lot of short games to see whether the patch could be good, and then verify promising patches at a long time control.
OK, maybe I did not explain myself clearly in English. So, let's say that you can play 10000 games of 10 seconds each (per side). That takes about 200000 seconds, or about 55 hours. In the same time you can play about 330 games of 5 minutes each (per side). What I am saying is that, using the same 55 hours, you can obtain the same (or better) results with the longer time control, because you call the evaluation function the same number of times but you play better games. Statistically, the "populations" of evaluation function calls are numerically equivalent... so why lose time playing a first test session at a short time control and then another one at a longer time control?

That makes sense only if you have little time to spend on tuning and the "noise" of a small number of games obscures the equivalence of the evaluation calls.
With a small number of games your test run will give a bogus result.

If you take any Fishtest run with a negative result, e.g.
http://tests.stockfishchess.org/tests/v ... 1c761f5f78

you will always find sub-runs with the opposite result.

This patch had +1323-1456=4403, but user ttruscott-3cores had +183-165=555.
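
For reference, the two records quoted above can be converted to Elo differences with the usual logistic formula. This is only a rough sketch: the function name `elo_from_wld` is made up for the example and is not part of Fishtest, and no error bars are computed here.

```python
import math

def elo_from_wld(wins, losses, draws):
    """Elo difference implied by a win/loss/draw record (logistic model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games      # average points per game
    return -400.0 * math.log10(1.0 / score - 1.0)

full_run = elo_from_wld(1323, 1456, 4403)     # the whole test
sub_run = elo_from_wld(183, 165, 555)         # one contributor's games

print(f"full run: {full_run:+.1f} Elo")
print(f"sub-run:  {sub_run:+.1f} Elo")
```

Under this formula the full run comes out at roughly -6 Elo and the sub-run at roughly +7 Elo, so the 903-game subset points the opposite way from the 7182-game total.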

OK, but did both sets use the same amount of time?

AlvaroBegue · Post by **AlvaroBegue** » Wed Aug 10, 2016 3:24 am

stegemma wrote: OK, but did both sets use the same amount of time?

I don't know, but it wouldn't be surprising. If a change is truly a no-op, the probability of obtaining a result as good as 183 victories to 165 losses (the draws don't really matter) is about 16.7%. So if you accept changes with this level of evidence, you are bound to accept some bad changes along the way. And this is with 903 games played, instead of the 330 in your example.

You should become very familiar with the statistics of binomial distributions, so you can make better decisions about what changes to accept.
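
The 16.7% figure can be checked with a few lines of Python. The sketch below assumes, as the post does, that draws are ignored and that each of the 348 decisive games is a fair coin flip under the "no-op" hypothesis. The helper names `binomial_tail` and `normal_tail` are made up for this example.

```python
import math

def binomial_tail(successes, trials, p=0.5):
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p**k * (1.0 - p)**(trials - k)
        for k in range(successes, trials + 1)
    )

def normal_tail(successes, trials, p=0.5):
    """Normal approximation to the same tail, without continuity correction."""
    mean = trials * p
    sd = math.sqrt(trials * p * (1.0 - p))
    z = (successes - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

wins, losses = 183, 165
decisive = wins + losses                      # draws are left out

exact = binomial_tail(wins, decisive)
approx = normal_tail(wins, decisive)

print(f"exact binomial tail:  {exact:.3f}")
print(f"normal approximation: {approx:.3f}")
```

The normal approximation lands at about 0.167, which matches the figure quoted above; the exact binomial sum should come out slightly higher, around 18%. Either way, a result this good happens by pure chance roughly one time in six.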

stegemma · Post by **stegemma** » Wed Aug 10, 2016 9:12 am

AlvaroBegue wrote:
stegemma wrote: OK, but did both sets use the same amount of time?
I don't know, but it wouldn't be surprising. If a change is truly a no-op, the probability of obtaining a result as good as 183 victories to 165 losses (the draws don't really matter) is about 16.7%. So if you accept changes with this level of evidence, you are bound to accept some bad changes along the way. And this is with 903 games played, instead of the 330 in your example.

You should become very familiar with the statistics of binomial distributions, so you can make better decisions about what changes to accept.

Of course more games give more statistically valid results, but those extra games are shorter than games at longer time controls. I was thinking that the right parameter was the number of evaluation function calls (which depends on the total time, the branching factor and the nps of the engine) and not the number of games. Maybe I'm wrong, because in the end you compute the Elo difference from the final result of each game, and this is a sort of "bottleneck" that makes longer time controls unproductive.

From a statistical point of view, if you want to learn something about a product in a market, is it better to run 30 polls, each on a (different) sample of 1000 people (out of a total population of 1,000,000), or 300 polls of 100 people each? Each poll gives you one of these answers:

- good product
- bad product

This is the situation if we take the number of evaluation function calls as the measure of how good our tuning session is.

AlvaroBegue · Post by **AlvaroBegue** » Wed Aug 10, 2016 1:37 pm

I understand what you are trying to say, but it's just wrong. You are not collecting one data point per call to the evaluation function; only one per game. There will be certain aspects of the program that cannot properly be evaluated using 10-second games (anything that happens only at high depth, time control, parallelism...), but for many changes it's just fine.

It's probably best to have a testing procedure that involves lots of very fast games and some not-so-fast games. There will be changes for which you need to run other tests (tight memory limit for changes to hash-replacement policy, for instance).

But it all starts with really understanding the statistics involved.
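
To put numbers on "one data point per game", here is a rough sketch of the 95% error margin for the two schedules from earlier in the thread (10000 short games versus 330 long ones). It assumes two near-equal engines and a draw ratio of 0.6 at both time controls; that draw ratio is an assumption for illustration, not a measured value, and `elo_margin_95` is a made-up helper name.

```python
import math

def elo_margin_95(games, draw_ratio):
    """Approximate 95% error margin (in Elo) for a match between near-equal engines."""
    # Per-game score variance when wins and losses are equally likely.
    variance = (1.0 - draw_ratio) / 4.0
    std_error = math.sqrt(variance / games)          # standard error of the mean score
    elo_per_score = 400.0 / (math.log(10) * 0.25)    # slope of the Elo curve at a 50% score
    return 1.96 * std_error * elo_per_score

DRAW_RATIO = 0.6  # assumed value, not taken from the thread

for games in (10000, 330):
    print(f"{games:>6} games: +/- {elo_margin_95(games, DRAW_RATIO):.1f} Elo")
```

Under these assumptions the margin is a little over 4 Elo for 10000 games and well over 20 Elo for 330 games, so a 5 Elo patch cannot be resolved with 330 games no matter how many evaluation calls each game contains.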

stegemma · Post by **stegemma** » Wed Aug 10, 2016 3:35 pm

AlvaroBegue wrote:I understand what you are trying to say, but it's just wrong. You are not collecting one data point per call to the evaluation function; only one per game. There will be certain aspects of the program that cannot properly be evaluated using 10-second games (anything that happens only at high depth, time control, parallelism...), but for many changes it's just fine.

It's probably best to have a testing procedure that involves lots of very fast games and some not-so-fast games. There will be changes for which you need to run other tests (tight memory limit for changes to hash-replacement policy, for instance).

But it all starts with really understanding the statistics involved.

OK, thanks. I need to study the statistical basics before going further.

Maybe that would help to improve my weak engine!

Henk · Post by **Henk** » Wed Aug 10, 2016 5:00 pm

I only remember Student's t-distribution.

https://en.wikipedia.org/wiki/Student%2 ... stribution

Henk · Post by **Henk** » Thu Aug 11, 2016 12:03 pm

If it loses the first five games in a row, then maybe the latest change is not a very promising improvement.

matthewlai · Post by **matthewlai** » Sat Aug 13, 2016 2:18 pm

It's a trade-off between bias and variance.

Imagine that given any time control and two engines, there's a true Elo difference you can find by playing an infinite number of games.

When you play fewer than an infinite number of games, the Elo difference you get will be drawn from a probability distribution centered around the true Elo difference. The more games you play, the narrower the probability distribution is, so the more confident you are that the Elo difference you get is close to the true difference.

The difference between measured Elo difference and true Elo difference is one source of error.

However, there is another source of error - the fact that the true Elo difference at short time control may not be the same as the true Elo difference at long time control.

Consider the 2 extreme cases -
1) Playing 0.000001s games. You can play millions of games, and get extremely close to the true difference between the two versions AT 0.000001s. But this is clearly a bad idea, because the true difference at this time control is probably very different from the true difference at reasonable time controls.

2) Playing 2 hour games. You can only play a few games in the same amount of time. The true difference here is very close to the true difference you care about, but your estimate of that true difference now has a very large error bar because you have only played very few games.

Neither of these cases is good.

In choosing a time control for testing, you have to find a balance between the two.
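
The trade-off can be sketched numerically. Everything in the block below is a toy model: the fixed budget of 200000 seconds comes from the 55-hour example earlier in the thread, but the draw ratio, the `bias_elo` curve and the constant `BIAS_AT_ONE_SECOND` are purely hypothetical and are not measurements of any engine. The only point is the shape of the result.

```python
import math

BUDGET_SECONDS = 200_000      # total testing time (the 55-hour example)
DRAW_RATIO = 0.6              # assumed draw ratio, same at every time control
BIAS_AT_ONE_SECOND = 20.0     # hypothetical Elo bias of a 1-second game

def std_error_elo(games, draw_ratio=DRAW_RATIO):
    """Standard error of the measured Elo difference for near-equal engines."""
    variance = (1.0 - draw_ratio) / 4.0
    return math.sqrt(variance / games) * 400.0 / (math.log(10) * 0.25)

def bias_elo(seconds_per_game):
    """Hypothetical bias model: the gap to long-time-control Elo shrinks as games get longer."""
    return BIAS_AT_ONE_SECOND / math.sqrt(seconds_per_game)

def total_error_elo(seconds_per_game):
    """Combine bias and statistical error for a fixed time budget."""
    games = BUDGET_SECONDS / seconds_per_game
    return math.hypot(bias_elo(seconds_per_game), std_error_elo(games))

for seconds in (1, 20, 600, 7200):
    print(f"{seconds:>5} s/game: total error ~ {total_error_elo(seconds):.1f} Elo")
```

With this made-up bias curve the total error is large at both extremes and smallest somewhere in between. Where the minimum actually falls depends entirely on the real bias, which has to be measured for the engine and the kind of change being tested.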
