With all the massive user hardware currently being contributed, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20?
Just curious.
Stockfish scaling
Moderator: Ras
-
- Posts: 568
- Joined: Tue Dec 12, 2006 10:10 am
- Full name: Gary Linscott
Re: Stockfish scaling
Rebel wrote: With all the massive user hardware currently being contributed, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20? Just curious.

On the testing framework, the longest test that has been done recently is 60 seconds/game at 3 threads. Testing at even longer TCs is interesting, but it would tie machines up for a really long time!
In general, a fast test at 15 seconds/game is done, then if that succeeds, a test at 60 seconds/game is done. So, at least a basic scaling test is done.
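The two-stage gate described above can be sketched in a few lines. This is a toy illustration only, not Fishtest code: the pass/fail rule here is a made-up placeholder, whereas the real framework uses proper statistical tests.

```python
# Toy sketch of a two-stage patch gate: a patch must pass the cheap
# short-TC test before the more expensive long-TC test is run.

def passes(elo_gain):
    """Placeholder pass/fail rule (the real framework uses a statistical test)."""
    return elo_gain > 0

def two_stage_gate(short_tc_gain, long_tc_gain):
    """Accept a patch only if it passes at 15s/game first, then at 60s/game."""
    if not passes(short_tc_gain):
        return "rejected at 15s"
    if not passes(long_tc_gain):
        return "rejected at 60s"
    return "accepted"

print(two_stage_gate(2.1, 1.5))   # accepted
print(two_stage_gate(2.1, -0.5))  # rejected at 60s
```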
-
- Posts: 7286
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Stockfish scaling
Okay, thanks. So:
1. are the 15/all and 60/all results in sync?
2. is the bullet testing in sync with the 40/4, 40/20, 40/40 of the rating lists?
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: Stockfish scaling
Rebel wrote: Okay, thanks. So:
1. are the 15/all and 60/all results in sync?
2. is the bullet testing in sync with the 40/4, 40/20, 40/40 of the rating lists?

Ed, our tests are self-play. We don't sync (whatever that word means to you) with any rating list: our aim is to check whether a patch is good or bad, and this has almost nothing to do with rating lists. We have found many times that patches that succeed at 15 secs fail at 60 secs (although the pass conditions at 60 secs are stricter). There is a lot of information here and on our forum; if you are interested in this topic, I'd suggest spending 10 minutes browsing for it.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Stockfish scaling
Rebel wrote: With all the massive user hardware currently being contributed, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20? Just curious.

All the indications are that from ultra-bullet, where SF is weaker than both H3 and K6, up to blitz (40/5 or so), SF gains points with respect to Houdini 3 and even Komodo 6. As for long TC, several results show SF at least equal to K6 and H3, if not superior to them, so it gains some additional Elo points there. SF is one of the more scalable engines.
-
- Posts: 6213
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Stockfish scaling
mcostalba wrote: Ed, our tests are self-play. We don't sync (whatever that word means to you) with any rating list: our aim is to check whether a patch is good or bad, and this has almost nothing to do with rating lists. We have found many times that patches that succeed at 15 secs fail at 60 secs (although the pass conditions at 60 secs are stricter).

I noticed that your tests always use an extremely small increment like 0.05 seconds. With Komodo we always use an increment that is at least half a percent of the base time, to avoid games being decided by extremely fast play in the endgame. What is your reason for using such a tiny increment?
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: Stockfish scaling
lkaufman wrote: I noticed that your tests always use an extremely small increment like 0.05 seconds. With Komodo we always use an increment that is at least half a percent of the base time, to avoid games being decided by extremely fast play in the endgame. What is your reason for using such a tiny increment?

Actually our TC is essentially a fixed-time game; the 0.05 is only needed to avoid losses on time due to communication lags.
We don't test with increments because the average game length turns out to be much longer. For instance, even 0.5 secs of increment, i.e. 15+0.5, can bump the average game length from 25-30 secs to about 80-90 secs, a 3x factor. That means a throughput 3 times slower, and in the case of the long TC, 60+1 would take almost 5 minutes per game instead of about 2 minutes!
So, to answer your question: it is due to testing efficiency.
-
- Posts: 3241
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: Stockfish scaling
mcostalba wrote: Actually our TC is essentially a fixed-time game; the 0.05 is only needed to avoid losses on time due to communication lags. We don't test with increments because the average game length turns out to be much longer. So, to answer your question: it is due to testing efficiency.

Actually Larry's point is valid. I don't think we know (with empirical evidence) whether zero increment is better than a realistic increment. Comparing 15+0.05 and 15+0.5 is not meaningful. Let's assume the average game length is 60 moves: 15"+0.05" gives 36" of wall time, while 15"+0.5" gives 90". The equivalent at the small increment is 42"+0.05", which also gives 90". So the meaningful comparison is between 15"+0.5" and 42"+0.05". The choice is based on how much importance you want to give to the endgame.

Those are two extremes: 15+0.5 gives a lot of importance to the endgame, while 42+0.05 gives much more to the opening and middlegame. In my experience of looking at games from self-testing at super fast TC, the endgame is more important than you seem to imagine. Very often that is where the game is decided, by a blunder, because many depths are necessary to play the right move (passed pawn races, king penetration). So I don't know the answer, and I don't think we have any evidence going in either direction. My gut feeling, from looking at games, is that the endgame should not be neglected, so I use a ratio of 100 or 120 between base time and increment. But my gut has been wrong in the past.
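The arithmetic above is easy to check with a few lines, assuming (as above) an average game of 60 moves per side:

```python
# Wall time of a base+increment game: both sides together spend
# 2 * (base + moves * increment) seconds, for an average of `moves`
# moves per side.

def game_seconds(base, inc, moves=60):
    """Total wall time (both sides) of one game, in seconds."""
    return 2 * (base + moves * inc)

def equivalent_base(target_seconds, inc, moves=60):
    """Base time giving the same wall time at a different increment."""
    return target_seconds / 2 - moves * inc

print(game_seconds(15, 0.05))                        # 36.0
print(game_seconds(15, 0.5))                         # 90.0
print(equivalent_base(game_seconds(15, 0.5), 0.05))  # 42.0
```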

In order to answer this question based on evidence rather than gut feeling, I suggest the following experiment:
* self-test at 60"+0.05" (master against master) and measure the draw rate. Be sure to relax the adjudication rules so as not to adjudicate draws too early, as it's endgame blunders we need to measure.
* self-test at 40"+0.4" and measure the draw rate.
The TC that gives the higher draw rate produces games of higher quality (fewer blunders). Both TCs have the same testing "throughput", to use your terminology.
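Once both runs are finished, one way to judge whether the measured difference in draw rates is real rather than noise is a standard two-proportion z-test. A minimal sketch, with made-up placeholder counts (not real results):

```python
# Compare draw rates from two self-play runs with a two-proportion
# z-test; |z| > ~2 suggests the difference is unlikely to be noise.
import math

def draw_rate_z(draws_a, games_a, draws_b, games_b):
    """z-score for the difference between two observed draw rates."""
    pa, pb = draws_a / games_a, draws_b / games_b
    pooled = (draws_a + draws_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    return (pa - pb) / se

# Placeholder numbers: 6200/10000 draws at 60+0.05 vs 6500/10000 at 40+0.4.
z = draw_rate_z(6200, 10000, 6500, 10000)
print(round(z, 2))
```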
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
- Posts: 10770
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish scaling
I guess that 40+0.4 is going to give more draws, and my opinion is that it is better to use 10+0.1 for stage 1 and 40+0.4 for stage 2 instead of 15+0.05 for stage 1 and 60+0.05 for stage 2.
This also has the advantage of better testing of changes that may be relevant only in the endgame, like adding stalemate detection so Stockfish can finally see the draw in positions like the following, where Stockfish never sees the draw (except at very small depths) and evaluates the position as more than 6 pawns for White, because in the search it always finds a way to delay the stalemate until the remaining depth is too small to see it.
[d]7k/6p1/8/7P/8/8/8/K6B w - - 11 1
There is no problem of playing a wrong move here, but there may be a problem some moves earlier, if Stockfish prefers to go to this position rather than to a different winning position that it evaluates as only +4 or +5 pawns for White.
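The stalemate detection Uri mentions is cheap to sketch for simple endings. The following is a hypothetical, minimal illustration for lone-black-king vs king-and-pawns positions only (it is not Stockfish's code): the side to move is stalemated when it is not in check and has no safe square. The test position used is the classic stalemate with white Kf6 and Pf7 against black Kf8, black to move.

```python
# Minimal stalemate check for a lone black king against a white king
# and pawns. Squares are (file, rank) pairs, both in 0..7.

KING_STEPS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def white_attacks(white_king, white_pawns):
    """All squares attacked (or defended) by the white king and pawns."""
    attacked = set()
    for df, dr in KING_STEPS:
        f, r = white_king[0] + df, white_king[1] + dr
        if 0 <= f <= 7 and 0 <= r <= 7:
            attacked.add((f, r))
    for pf, pr in white_pawns:      # white pawns capture up the board
        for df in (-1, 1):
            if 0 <= pf + df <= 7 and pr + 1 <= 7:
                attacked.add((pf + df, pr + 1))
    return attacked

def black_king_stalemated(black_king, white_king, white_pawns):
    """True if the lone black king is not in check but has no safe move."""
    attacked = white_attacks(white_king, white_pawns)
    if black_king in attacked:      # in check, so not stalemate
        return False
    for df, dr in KING_STEPS:
        f, r = black_king[0] + df, black_king[1] + dr
        if 0 <= f <= 7 and 0 <= r <= 7 and (f, r) not in attacked:
            return False            # at least one safe square exists
    return True

# Black Kf8 vs white Kf6 + Pf7, black to move: the classic stalemate.
print(black_king_stalemated((5, 7), (5, 5), [(5, 6)]))  # True
```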
-
- Posts: 3241
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: Stockfish scaling
Uri Blass wrote: I guess that 40+0.4 is going to give more draws, and my opinion is that it is better to use 10+0.1 for stage 1 and 40+0.4 for stage 2 instead of 15+0.05 for stage 1 and 60+0.05 for stage 2.

That is also my guess. But instead of guessing, we need to measure.
Unfortunately the framework cannot be used for that, for two reasons:
1/ the broken logic it has of not rescaling the increment;
2/ the adjudication rules in the framework are far too aggressive, and in this case we ideally want no adjudication at all, to avoid polluting the measurement.
So it has to be done locally. I can't do it now, because my CPU resources are running full blast CLOP-ing DiscoCheck all over the place.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.