More games or more plies in testing engines?

stegemma · Post by **stegemma** » Tue Aug 09, 2016 2:38 pm

This is an old question, I know. Lately I've been thinking that more plies could be better than more games. My thought is that if you test with a short time control (let's say 10 seconds per game) you get a large number of games, but they don't reach the same depth as at longer time controls. If we are tuning the evaluation function, we know that it is called at every leaf, so at a shorter time control we call the Evaluate function fewer times per game. Someone could object that more games give the tuning more statistical precision, but are many "less analyzed" games really equivalent to fewer "well analyzed" games?

Maybe we should compare how many times the evaluation function has been called to decide which method is best. Surprisingly, the number of calls to the evaluation function doesn't depend on the time control but only on the total time of the tuning session. That's why a longer time control could be better:

- the number of calls to the evaluation function is the same as with shorter time controls
- the games are better analyzed
- the games suffer less from horizon effects

I don't mean that it is absolutely wrong to use a short time control, only that there is no real reason to prefer it over longer time controls.

Or is there?

Henk · Post by **Henk** » Tue Aug 09, 2016 3:30 pm

The engines that are tested at short time controls can get away with it because they play better games in one second than our engines do in 100 seconds.

stegemma · Post by **stegemma** » Tue Aug 09, 2016 3:53 pm

Henk wrote: The engines that are tested at short time controls can get away with it because they play better games in one second than our engines do in 100 seconds.

That's not a good reason: they still play better at longer time controls.

Henk · Post by **Henk** » Tue Aug 09, 2016 5:18 pm

For instance, they reach depth 15 while our engines are stuck at ply 3 in one-second games. So they have less trouble tuning depth-dependent reductions. How can you test whether reducing by 3 or more is better if ply 4 isn't even reached?

stegemma · Post by **stegemma** » Tue Aug 09, 2016 5:41 pm

Henk wrote: For instance, they reach depth 15 while our engines are stuck at ply 3 in one-second games. So they have less trouble tuning depth-dependent reductions. How can you test whether reducing by 3 or more is better if ply 4 isn't even reached?

OK, but you're just saying what I'm saying: "for slow engines you must use longer time controls". I'm only adding that for fast engines, too, it is better (or almost the same) to test with longer time controls.

elcabesa · Post by **elcabesa** » Tue Aug 09, 2016 7:25 pm

No one can afford 10000 games of 1 hour each, so you have to choose. Usually this problem arises when you have to test a 5 Elo patch. Some people test with a lot of short games to see whether the patch could be good, and then verify promising patches at a long time control.

stegemma · Post by **stegemma** » Tue Aug 09, 2016 7:54 pm

elcabesa wrote: No one can afford 10000 games of 1 hour each, so you have to choose. Usually this problem arises when you have to test a 5 Elo patch. Some people test with a lot of short games to see whether the patch could be good, and then verify promising patches at a long time control.

OK, maybe I didn't explain myself clearly in English. Let's say that you can play 10000 games of 10 seconds each. You need about 200000 seconds, or about 55 hours. In the same time you can play about 330 games of 5 minutes each. Using the same 55 hours, what I'm saying is that you can obtain the same (or better) results with the longer time control, because you call the evaluation function the same number of times but you play better games. Statistically, the "populations" of evaluation function calls are numerically equivalent... so why lose time playing a first test session at a short time control and then another one at a longer time control?

That makes sense only if you have little time to spend on tuning and the "noise" of few games obscures the equality of evaluation calls.
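The arithmetic in this post can be checked with a short sketch. It assumes the quoted times are per side, so that one game lasts about twice the time control (that reading is what makes 10000 games come to 200000 seconds). `EVALS_PER_SECOND` is a made-up engine speed used only for illustration, and the function names are hypothetical.

```python
# Split a fixed testing budget into games at two time controls.
# Assumption: the time control is per side, so one game takes ~2x that long.
# EVALS_PER_SECOND is a hypothetical engine speed, not a measured value.

EVALS_PER_SECOND = 1_000_000


def games_in_budget(budget_seconds: float, seconds_per_side: float) -> int:
    """Number of whole games that fit in the budget."""
    return int(budget_seconds // (2 * seconds_per_side))


def eval_calls(games: int, seconds_per_side: float) -> float:
    """Total evaluation calls over all games, counting both sides."""
    return games * 2 * seconds_per_side * EVALS_PER_SECOND


budget = 10_000 * 2 * 10  # 10000 games at 10 s per side = 200000 s
short_games = games_in_budget(budget, 10)
long_games = games_in_budget(budget, 300)

print(f"budget: {budget} s = {budget / 3600:.1f} h")
print(f"10 s games:  {short_games}, evals: {eval_calls(short_games, 10):.3g}")
print(f"5 min games: {long_games}, evals: {eval_calls(long_games, 300):.3g}")
```

Under these assumptions the two evaluation-call totals come out nearly equal, as claimed. What differs is the number of game results: 10000 versus about 330.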

AlvaroBegue · Post by **AlvaroBegue** » Tue Aug 09, 2016 8:04 pm

stegemma wrote:
elcabesa wrote: No one can afford 10000 games of 1 hour each, so you have to choose. Usually this problem arises when you have to test a 5 Elo patch. Some people test with a lot of short games to see whether the patch could be good, and then verify promising patches at a long time control.
OK, maybe I didn't explain myself clearly in English. Let's say that you can play 10000 games of 10 seconds each. You need about 200000 seconds, or about 55 hours. In the same time you can play about 330 games of 5 minutes each. Using the same 55 hours, what I'm saying is that you can obtain the same (or better) results with the longer time control, because you call the evaluation function the same number of times but you play better games. Statistically, the "populations" of evaluation function calls are numerically equivalent... so why lose time playing a first test session at a short time control and then another one at a longer time control?

That makes sense only if you have little time to spend on tuning and the "noise" of few games obscures the equality of evaluation calls.

When you are collecting statistics to evaluate if a change to the program is a good one, you only get one data point per game you play. Imagine that your change really is a 5 Elo improvement. 330 games will give you a very noisy measurement of performance and you won't be able to tell which version is better with any confidence.

EDIT: Assuming you get draws 50% of the time, your 330-game test will tell you that the old version is better about 42% of the time. With 10000 games I think that would happen only 15% of the time. (Is it that bad? Someone check my numbers, please.)
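One way to check these numbers is a normal approximation to the match score. The sketch below is only that: it assumes independent games, a fixed 50% draw rate, the usual logistic Elo formula, and a hypothetical helper name `prob_wrong_sign`. Under those assumptions it gives roughly 36% for 330 games and roughly 2% for 10000 games, somewhat lower than the figures quoted above, but pointing the same way.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def prob_wrong_sign(elo_diff: float, games: int, draw_rate: float) -> float:
    """Approximate chance that the better version scores below 50%.

    Uses a normal approximation; assumes independent games and a
    constant draw rate, which real engine matches only roughly satisfy.
    """
    mu = expected_score(elo_diff)
    win = mu - draw_rate / 2.0
    # Per-game variance of a result worth 1 (win), 0.5 (draw) or 0 (loss).
    variance = win + 0.25 * draw_rate - mu * mu
    std_error = math.sqrt(variance / games)
    z = (mu - 0.5) / std_error
    # P(Z < -z) for a standard normal variable.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_330 = prob_wrong_sign(5.0, 330, 0.5)
p_10000 = prob_wrong_sign(5.0, 10_000, 0.5)
print(f"330 games:   {p_330:.1%}")
print(f"10000 games: {p_10000:.1%}")
```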

AndrewGrant · Post by **AndrewGrant** » Tue Aug 09, 2016 9:20 pm

Actually, I think that the opposite is true.

My engine is around 2500, and just about every change I make is worth plus or minus 20+ Elo points. It is quite easy to determine whether a change is good or bad when the engine is quite poor.

yurikvelo · Post by **yurikvelo** » Tue Aug 09, 2016 10:15 pm

stegemma wrote:
elcabesa wrote: No one can afford 10000 games of 1 hour each, so you have to choose. Usually this problem arises when you have to test a 5 Elo patch. Some people test with a lot of short games to see whether the patch could be good, and then verify promising patches at a long time control.
OK, maybe I didn't explain myself clearly in English. Let's say that you can play 10000 games of 10 seconds each. You need about 200000 seconds, or about 55 hours. In the same time you can play about 330 games of 5 minutes each. Using the same 55 hours, what I'm saying is that you can obtain the same (or better) results with the longer time control, because you call the evaluation function the same number of times but you play better games. Statistically, the "populations" of evaluation function calls are numerically equivalent... so why lose time playing a first test session at a short time control and then another one at a longer time control?

That makes sense only if you have little time to spend on tuning and the "noise" of few games obscures the equality of evaluation calls.

With a small number of games, your test run will give bogus results.

If you take any Fishtest run with a negative result, e.g.
http://tests.stockfishchess.org/tests/v ... 1c761f5f78

you will always find sub-runs with the opposite result.

This patch had +1323-1456=4403, but user ttruscott-3cores had +183-165=555
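The two results quoted above can be turned into Elo estimates with a rough 95% interval. This is a sketch using a normal approximation and a hypothetical helper `elo_from_wld`; it is not how Fishtest itself decides whether a patch passes.

```python
import math


def score_to_elo(score: float) -> float:
    """Convert a match score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_from_wld(wins: int, losses: int, draws: int):
    """Return (elo, low, high): point estimate and an approximate 95% interval."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of a result worth 1 (win), 0.5 (draw) or 0 (loss).
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    margin = 1.96 * math.sqrt(variance / games)
    return (
        score_to_elo(score),
        score_to_elo(score - margin),
        score_to_elo(score + margin),
    )


full_run = elo_from_wld(1323, 1456, 4403)   # whole test
sub_run = elo_from_wld(183, 165, 555)       # one contributor's games

for name, (elo, low, high) in (("full run", full_run), ("sub-run", sub_run)):
    print(f"{name}: {elo:+.1f} Elo, 95% interval [{low:+.1f}, {high:+.1f}]")
```

With these inputs the full run comes out negative with an interval that stays below zero, while the sub-run's point estimate is positive but its interval is wide enough to include zero.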
