More games or more plies in testing engines?

stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

More games or more plies in testing engines?

Post by stegemma »

This is an old question, I know. Lately I've been thinking that more plies could be better than more games. My thought is that if you test with a short time control (let's say 10 seconds per game) you get a large number of games, but they don't reach the same depth as games at longer time controls. If we are tuning the evaluation function, we know that it is called at every leaf, so at a shorter time control the Evaluate function is called fewer times per game. Someone could object that more games give the tuning more statistical precision, but are many "less analyzed" games really the same as fewer "well analyzed" games?

Maybe we should compare how many times the evaluation function has been called to decide which method is best. Perhaps surprisingly, the number of calls to the evaluation function doesn't depend on the time control but only on the total duration of the tuning session. That's why a longer time control could be better:

- the number of calls to the evaluation function is the same as with shorter time controls
- the games are better analyzed
- the games suffer less from horizon effects

I don't mean that it is absolutely wrong to use a short time control, only that there is no real reason to prefer it over longer time controls.

Or is there? ;)
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: More games or more plies in testing engines?

Post by Henk »

Engines that are tested at short time controls can afford it because they play better games in one second than our engines do in 100 seconds.
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: More games or more plies in testing engines?

Post by stegemma »

Henk wrote:These engines that test on short time control can do that because they play better games in one second then our engines do in 100 seconds.
That's not a good reason: they still play better at longer time controls.
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: More games or more plies in testing engines?

Post by Henk »

For instance, they reach depth 15 while our engines are stuck at ply 3 in one-second games. So they have less trouble tuning depth-dependent reductions. How can you test whether reducing by 3 or more is better if ply 4 isn't even reached?
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: More games or more plies in testing engines?

Post by stegemma »

Henk wrote:For instance they reach depth 15 while our engines stick to ply 3 in one second games. So less trouble with tuning depth dependent reductions. For instance how can you test if reducing with 3 or more is better if ply 4 won't even be reached.
OK, but you're just saying what I'm saying: "for slow engines you must use longer time controls". I'm only adding that for fast engines, too, it is better (or almost the same) to test with longer time controls.
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: More games or more plies in testing engines?

Post by elcabesa »

No one can afford 10000 games of 1 hour each, so you have to choose. Usually this problem arises when you have to test a 5 Elo patch. Some people test with a lot of short games to see whether the patch could be good, and then verify promising patches at a long time control.
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: More games or more plies in testing engines?

Post by stegemma »

elcabesa wrote:No one can afford 10000 games of 1 hour, so you have to choose. Usually this problem arise when you have to test a 5 elo point patch. Someone test with a lot of short games to understand if the patch could be good and then verify promising patch with long time control.
OK, maybe I did not explain myself clearly in English. Let's say that you can play 10000 games at 10 seconds each (per side). You need about 200000 seconds, or about 55 hours. In the same time you can play about 330 games at 5 minutes each. Given the same 55 hours, what I'm saying is that you can obtain the same (or better) results with the longer time control, because you call the evaluation function the same number of times but you play better games. Statistically, the two "populations" of evaluation function calls are the same size... so why waste time on a first test session at a short time control and then another one at a longer time control?

That makes sense only if you have little time to spend on tuning and the noise from having few games outweighs the equal number of evaluation calls.
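The arithmetic in this post can be sketched as follows. This is only an illustration under stated assumptions: the time control is taken to be per side, every game is assumed to use both clocks in full, and EVALS_PER_SECOND is a made-up engine speed that does not come from the thread.

```python
def games_in_budget(budget_s: float, base_time_s: float) -> int:
    """Games that fit in budget_s if each game uses both clocks in full."""
    return int(budget_s // (2 * base_time_s))


# 10000 games at 10 s per side = 200000 s of wall-clock time (about 55.6 hours).
BUDGET_S = 10_000 * 2 * 10

# Hypothetical evaluation rate, assumed constant at every time control.
EVALS_PER_SECOND = 1_000_000

if __name__ == "__main__":
    for base_time_s in (10, 300):
        games = games_in_budget(BUDGET_S, base_time_s)
        # Total eval calls depend only on the budget, not on the time control.
        total_evals = EVALS_PER_SECOND * BUDGET_S
        print(f"{base_time_s:>3} s/side: {games:>5} games, "
              f"{total_evals:.1e} eval calls in {BUDGET_S / 3600:.1f} h")
```

Under these assumptions the count of evaluation calls is identical for both time controls, while the number of game results differs (10000 versus 333).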
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: More games or more plies in testing engines?

Post by AlvaroBegue »

stegemma wrote:
elcabesa wrote:No one can afford 10000 games of 1 hour, so you have to choose. Usually this problem arise when you have to test a 5 elo point patch. Someone test with a lot of short games to understand if the patch could be good and then verify promising patch with long time control.
Ok, maybe I could not explain myself clearly in english. So, let's say that you can play 10000 games of 10 second each. You need about 200000 seconds = about 55 hours. In the same time you can play about 330 games of 5 minutes each one. Using the same time amount of 55 hours, what I say is that you can obtains the same (or better) results with the longer time control, because you call the evaluation function the same number of times but you play better games. Statistically the "population" of evaluation function calls are numerically equivalent... so why lose time to play a first test session with short time and then another one with longer time?

That make sense only if you have little time to spend in tuning and the "rumor" of few games obscure the equality of evaluation calls.
When you are collecting statistics to evaluate if a change to the program is a good one, you only get one data point per game you play. Imagine that your change really is a 5 Elo improvement. 330 games will give you a very noisy measurement of performance and you won't be able to tell which version is better with any confidence.

EDIT: Assuming you get draws 50% of the time, your 330-game test will tell you that the old version is better about 42% of the time. With 10000 games I think that would happen only 15% of the time. (Is it that bad? Someone check my numbers, please.)
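One way to check these numbers is a normal approximation of the match score. The sketch below assumes independent games, the logistic Elo model, a 50% draw rate, and a true difference of +5 Elo; the function names are illustrative and not taken from any testing tool.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score per game under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def prob_wrong_sign(elo_diff: float, games: int, draw_rate: float = 0.5) -> float:
    """Approximate probability that the measured score falls below 50%
    even though the true difference is elo_diff (normal approximation)."""
    mu = expected_score(elo_diff)
    win = mu - draw_rate / 2.0                  # win probability implied by mu
    var = win + draw_rate * 0.25 - mu * mu      # per-game variance of the score
    se = math.sqrt(var / games)                 # standard error of the mean score
    z = (0.5 - mu) / se
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


if __name__ == "__main__":
    for n in (330, 10_000):
        print(n, round(prob_wrong_sign(5.0, n), 3))
```

Under these assumptions the approximation gives roughly 36% for 330 games and roughly 2% for 10000 games, so both figures come out lower than the estimates in the post. The exact values depend on the model, but the conclusion is the same: 330 games cannot reliably resolve a 5 Elo difference.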
AndrewGrant
Posts: 1752
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: More games or more plies in testing engines?

Post by AndrewGrant »

Actually, I think that the opposite is true.

My engine is rated around 2500 Elo, and just about every change I make is worth plus or minus 20+ Elo points. It is quite easy to determine whether a change is good or bad when the engine is quite poor.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
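To put rough numbers on why a 20 Elo change is so much easier to confirm than a 5 Elo change, here is a sketch of the sample size needed for the measured score to sit about 1.96 standard errors away from 50%. It assumes independent games, the logistic Elo model, and a 50% draw rate; none of these come from the posts above, and games_needed is an illustrative name.

```python
import math


def games_needed(elo_diff: float, draw_rate: float = 0.5, z: float = 1.96) -> int:
    """Approximate number of games so that the expected score margin
    over 50% equals z standard errors (normal approximation)."""
    mu = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))  # expected score per game
    win = mu - draw_rate / 2.0                      # implied win probability
    var = win + draw_rate * 0.25 - mu * mu          # per-game score variance
    return math.ceil((z * math.sqrt(var) / (mu - 0.5)) ** 2)


if __name__ == "__main__":
    for elo in (5, 20):
        print(f"{elo:>2} Elo: about {games_needed(elo)} games")
```

With these assumptions a 5 Elo difference needs on the order of nine thousand games, while a 20 Elo difference needs only a few hundred.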
yurikvelo
Posts: 710
Joined: Sat Dec 06, 2014 1:53 pm

Re: More games or more plies in testing engines?

Post by yurikvelo »

stegemma wrote:
elcabesa wrote:No one can afford 10000 games of 1 hour, so you have to choose. Usually this problem arise when you have to test a 5 elo point patch. Someone test with a lot of short games to understand if the patch could be good and then verify promising patch with long time control.
Ok, maybe I could not explain myself clearly in english. So, let's say that you can play 10000 games of 10 second each. You need about 200000 seconds = about 55 hours. In the same time you can play about 330 games of 5 minutes each one. Using the same time amount of 55 hours, what I say is that you can obtains the same (or better) results with the longer time control, because you call the evaluation function the same number of times but you play better games. Statistically the "population" of evaluation function calls are numerically equivalent... so why lose time to play a first test session with short time and then another one with longer time?

That make sense only if you have little time to spend in tuning and the "rumor" of few games obscure the equality of evaluation calls.
With a small number of games your test run will give a bogus result.

If you take any Fishtest run with negative result, e.g.
http://tests.stockfishchess.org/tests/v ... 1c761f5f78

you will always find sub-runs with the opposite result.

This patch had +1323-1456=4403, but user ttruscott-3cores had +183-165=555
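The two records above can be converted to Elo estimates with the usual logistic formula. This sketch assumes the +W-L=D notation means wins, losses, and draws from the patch's point of view; elo_from_wld is an illustrative helper, not a Fishtest function.

```python
import math


def elo_from_wld(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by a win/loss/draw record (logistic model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)


if __name__ == "__main__":
    whole_run = elo_from_wld(1323, 1456, 4403)   # all workers combined
    sub_run = elo_from_wld(183, 165, 555)        # the single-worker subset
    print(f"whole run: {whole_run:+.1f} Elo")
    print(f"sub-run:   {sub_run:+.1f} Elo")
```

The whole run comes out at about -6.4 Elo and the 903-game sub-run at about +6.9 Elo, so the subset points in the opposite direction from the full test.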