With all the massive user hardware currently being contributed, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20?
Just curious.
Stockfish scaling
Moderator: Ras
-
- Posts: 568
- Joined: Tue Dec 12, 2006 10:10 am
- Full name: Gary Linscott
Re: Stockfish scaling
Rebel wrote: With all the massive user hardware currently being contributed, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20? Just curious.

On the testing framework, the longest test that has been done recently is 60 seconds/game at 3 threads. Testing at even longer TCs is interesting, but it would tie machines up for a really long time!
In general, a fast test at 15 seconds/game is done, then if that succeeds, a test at 60 seconds/game is done. So, at least a basic scaling test is done.
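The two-stage gate described above can be sketched in a few lines. This is a toy illustration only, not Fishtest code: the pass/fail rule here is a made-up placeholder, whereas the real framework uses proper statistical tests.

```python
# Toy sketch of a two-stage patch gate: a patch must pass the cheap
# short-TC test before the more expensive long-TC test is run.

def passes(elo_gain):
    """Placeholder pass/fail rule (the real framework uses a statistical test)."""
    return elo_gain > 0

def two_stage_gate(short_tc_gain, long_tc_gain):
    """Accept a patch only if it passes at 15s/game first, then at 60s/game."""
    if not passes(short_tc_gain):
        return "rejected at 15s"
    if not passes(long_tc_gain):
        return "rejected at 60s"
    return "accepted"

print(two_stage_gate(2.1, 1.5))   # accepted
print(two_stage_gate(2.1, -0.5))  # rejected at 60s
```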
-
- Posts: 7286
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Stockfish scaling
Okay, thanks. So:
1. are the 15/all and 60/all results in sync?
2. is the bullet testing in sync with the 40/4, 40/20, 40/40 of the rating lists?
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: Stockfish scaling
Rebel wrote: Okay, thanks. So:
1. are the 15/all and 60/all results in sync?
2. is the bullet testing in sync with the 40/4, 40/20, 40/40 of the rating lists?

Ed, our tests are self-play. We don't sync (whatever that word means to you) with any rating list: our aim is to check whether a patch is good or bad, and this has almost nothing to do with rating lists. We have found many times that patches that succeed at 15 secs fail at 60 secs (although the pass conditions at 60 secs are stricter). There is a lot of information here and on our forum; if you are interested in this topic, I'd suggest spending 10 minutes browsing for it.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Stockfish scaling
Rebel wrote: With all the massive user hardware currently being contributed, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20? Just curious.

All the indications are that from ultra-bullet, where SF is weaker than both H3 and K6, up to blitz (40/5 or so), SF gains points with respect to Houdini 3 and even Komodo 6. As for long TC, several results show SF at least equal to K6 and H3, if not superior to them, so it gains some additional Elo points there. SF is one of the more scalable engines.
-
- Posts: 6213
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Stockfish scaling
mcostalba wrote: Ed, our tests are self-play. We don't sync (whatever that word means to you) with any rating list: our aim is to check whether a patch is good or bad, and this has almost nothing to do with rating lists. We have found many times that patches that succeed at 15 secs fail at 60 secs (although the pass conditions at 60 secs are stricter).

I noticed that your tests always use an extremely small increment like 0.05 seconds. With Komodo we always use an increment that is at least half a percent of the base time, to avoid games being decided by extremely fast play in the endgame. What is your reason for using such a tiny increment?
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: Stockfish scaling
lkaufman wrote: I noticed that your tests always use an extremely small increment like 0.05 seconds. With Komodo we always use an increment that is at least half a percent of the base time, to avoid games being decided by extremely fast play in the endgame. What is your reason for using such a tiny increment?

Actually our TC is essentially a fixed-time game; the 0.05 is only needed to avoid losses on time due to communication lags.
We don't test with increments because the average game length turns out to be much longer. For instance, even 0.5 secs of increment, i.e. 15+0.5, can bump the average game length from 25-30 secs to about 80-90 secs, a 3x factor. That means a throughput 3 times slower, and in the case of the long TC, 60+1 would take almost 5 minutes per game instead of about 2 minutes!
So, to answer your question: it is due to testing efficiency.
-
- Posts: 3241
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: Stockfish scaling
mcostalba wrote: Actually our TC is essentially a fixed-time game; the 0.05 is only needed to avoid losses on time due to communication lags. We don't test with increments because the average game length turns out to be much longer. So, to answer your question: it is due to testing efficiency.

Actually Larry's point is valid. I don't think we know (with empirical evidence) whether zero increment is better than a realistic increment. Comparing 15+0.05 and 15+0.5 is not meaningful. Let's assume the average game length is 60 moves: 15"+0.05" gives 36" of wall time, while 15"+0.5" gives 90". The equivalent at the small increment is 42"+0.05", which also gives 90". So the meaningful comparison is between 15"+0.5" and 42"+0.05". The choice is based on how much importance you want to give to the endgame.

Those are two extremes: 15+0.5 gives a lot of importance to the endgame, while 42+0.05 gives much more to the opening and middlegame. In my experience of looking at games from self-testing at super fast TC, the endgame is more important than you seem to imagine. Very often that is where the game is decided, by a blunder, because many depths are necessary to play the right move (passed pawn races, king penetration). So I don't know the answer, and I don't think we have any evidence going in either direction. My gut feeling, from looking at games, is that the endgame should not be neglected, so I use a ratio of 100 or 120 between base time and increment. But my gut has been wrong in the past.
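The arithmetic above is easy to check with a few lines, assuming (as above) an average game of 60 moves per side:

```python
# Wall time of a base+increment game: both sides together spend
# 2 * (base + moves * increment) seconds, for an average of `moves`
# moves per side.

def game_seconds(base, inc, moves=60):
    """Total wall time (both sides) of one game, in seconds."""
    return 2 * (base + moves * inc)

def equivalent_base(target_seconds, inc, moves=60):
    """Base time giving the same wall time at a different increment."""
    return target_seconds / 2 - moves * inc

print(game_seconds(15, 0.05))                        # 36.0
print(game_seconds(15, 0.5))                         # 90.0
print(equivalent_base(game_seconds(15, 0.5), 0.05))  # 42.0
```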

In order to answer this question based on evidence rather than gut feeling, I suggest the following experiment:
* self-test at 60"+0.05" (master against master) and measure the draw rate. Be sure to relax the adjudication rules so as not to adjudicate draws too early, as it's endgame blunders we need to measure.
* self-test at 40"+0.4" and measure the draw rate.
The TC that gives the higher draw rate produces games of higher quality (fewer blunders). Both TCs have the same testing "throughput", to use your terminology.
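Once both runs are finished, one way to judge whether the measured difference in draw rates is real rather than noise is a standard two-proportion z-test. A minimal sketch, with made-up placeholder counts (not real results):

```python
# Compare draw rates from two self-play runs with a two-proportion
# z-test; |z| > ~2 suggests the difference is unlikely to be noise.
import math

def draw_rate_z(draws_a, games_a, draws_b, games_b):
    """z-score for the difference between two observed draw rates."""
    pa, pb = draws_a / games_a, draws_b / games_b
    pooled = (draws_a + draws_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    return (pa - pb) / se

# Placeholder numbers: 6200/10000 draws at 60+0.05 vs 6500/10000 at 40+0.4.
z = draw_rate_z(6200, 10000, 6500, 10000)
print(round(z, 2))
```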
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
- Posts: 10770
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish scaling
I guess that 40+0.4 is going to give more draws, and my opinion is that it is better to use 10+0.1 for stage 1 and 40+0.4 for stage 2 instead of 15+0.05 for stage 1 and 60+0.05 for stage 2.
This also has the advantage of better testing of changes that may be relevant only in the endgame, like adding stalemate detection so Stockfish can finally see the draw in positions like the following, where Stockfish never sees the draw (except at very small depths) and evaluates the position as more than 6 pawns for White, because in the search it always finds a way to delay the stalemate until the remaining depth is too small to see it.
[d]7k/6p1/8/7P/8/8/8/K6B w - - 11 1
There is no problem of playing a wrong move here, but there may be a problem some moves earlier, if Stockfish prefers to go to this position rather than to a different winning position that it evaluates as only +4 or +5 pawns for White.
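The stalemate detection Uri mentions is cheap to sketch for simple endings. The following is a hypothetical, minimal illustration for lone-black-king vs king-and-pawns positions only (it is not Stockfish's code): the side to move is stalemated when it is not in check and has no safe square. The test position used is the classic stalemate with white Kf6 and Pf7 against black Kf8, black to move.

```python
# Minimal stalemate check for a lone black king against a white king
# and pawns. Squares are (file, rank) pairs, both in 0..7.

KING_STEPS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def white_attacks(white_king, white_pawns):
    """All squares attacked (or defended) by the white king and pawns."""
    attacked = set()
    for df, dr in KING_STEPS:
        f, r = white_king[0] + df, white_king[1] + dr
        if 0 <= f <= 7 and 0 <= r <= 7:
            attacked.add((f, r))
    for pf, pr in white_pawns:      # white pawns capture up the board
        for df in (-1, 1):
            if 0 <= pf + df <= 7 and pr + 1 <= 7:
                attacked.add((pf + df, pr + 1))
    return attacked

def black_king_stalemated(black_king, white_king, white_pawns):
    """True if the lone black king is not in check but has no safe move."""
    attacked = white_attacks(white_king, white_pawns)
    if black_king in attacked:      # in check, so not stalemate
        return False
    for df, dr in KING_STEPS:
        f, r = black_king[0] + df, black_king[1] + dr
        if 0 <= f <= 7 and 0 <= r <= 7 and (f, r) not in attacked:
            return False            # at least one safe square exists
    return True

# Black Kf8 vs white Kf6 + Pf7, black to move: the classic stalemate.
print(black_king_stalemated((5, 7), (5, 5), [(5, 6)]))  # True
```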
-
- Posts: 3241
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: Stockfish scaling
Uri Blass wrote: I guess that 40+0.4 is going to give more draws, and my opinion is that it is better to use 10+0.1 for stage 1 and 40+0.4 for stage 2 instead of 15+0.05 for stage 1 and 60+0.05 for stage 2.

That is also my guess. But instead of guessing, we need to measure.
Unfortunately the framework cannot be used for that, for two reasons:
1/ the broken logic it has of not rescaling the increment;
2/ the adjudication rules in the framework are far too aggressive, and in this case we ideally want no adjudication at all, to avoid polluting the measurement.
So it has to be done locally. I can't do it now, because my CPU resources are running full blast CLOP-ing DiscoCheck all over the place.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.