Stockfish scaling

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Rebel
Posts: 4790
Joined: Thu Aug 18, 2011 10:04 am

Stockfish scaling

Post by Rebel » Fri Nov 15, 2013 9:45 am

With all the current massive user hardware contributions, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20?

Just curious.

gladius
Posts: 538
Joined: Tue Dec 12, 2006 9:10 am

Re: Stockfish scaling

Post by gladius » Fri Nov 15, 2013 4:02 pm

Rebel wrote:With all the current massive user hardware contributions, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20?

Just curious.
On the testing framework, the longest test that's been done recently is 60 seconds/game at 3 threads. Testing at even longer TCs is interesting, but would tie machines up for a really long time!

In general, a fast test at 15 seconds/game is done, then if that succeeds, a test at 60 seconds/game is done. So, at least a basic scaling test is done.

Rebel
Posts: 4790
Joined: Thu Aug 18, 2011 10:04 am

Re: Stockfish scaling

Post by Rebel » Sat Nov 16, 2013 7:43 am

Okay thanks. So,

1. are the 15/all and 60/all results in sync?

2. is the bullet testing in sync with the 40/4, 40/20, 40/40 of the rating lists?

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Stockfish scaling

Post by mcostalba » Sat Nov 16, 2013 8:34 am

Rebel wrote:Okay thanks. So,

1. are the 15/all and 60/all results in sync?

2. is the bullet testing in sync with the 40/4, 40/20, 40/40 of the rating lists?
Ed, our tests are self-play. We don't sync (for whatever that word means to you) with any xxxx rating list: our aim is to check whether a patch is good or bad, which has almost nothing to do with rating lists. We have found many times that patches successful at 15 secs fail at 60 secs (although at 60 secs the pass conditions are stricter). There is a lot of info here and on our forum; if you are interested in this topic I'd suggest spending 10 minutes browsing for it.

Laskos
Posts: 9545
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Stockfish scaling

Post by Laskos » Sat Nov 16, 2013 9:35 am

Rebel wrote:With all the current massive user hardware contributions, has it been investigated how Stockfish scales at longer time controls, like the CCRL/CEGT time controls of 40/4 or even 40/20?

Just curious.
All the indications are that from ultra-bullet, where SF is weaker than both H3 and K6, up to blitz (40/5 or so), SF gains points relative to Houdini 3 and even Komodo 6. As for long TC, several results show SF at least equal to K6 and H3, if not superior, so it gains some additional Elo points there. SF is one of the more scalable engines.

lkaufman
Posts: 3772
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA

Re: Stockfish scaling

Post by lkaufman » Sat Nov 16, 2013 9:36 pm

mcostalba wrote:
Rebel wrote:Okay thanks. So,

1. are the 15/all and 60/all results in sync?

2. is the bullet testing in sync with the 40/4, 40/20, 40/40 of the rating lists?
Ed, our tests are self-play. We don't sync (for whatever that word means to you) with any xxxx rating list: our aim is to check whether a patch is good or bad, which has almost nothing to do with rating lists. We have found many times that patches successful at 15 secs fail at 60 secs (although at 60 secs the pass conditions are stricter). There is a lot of info here and on our forum; if you are interested in this topic I'd suggest spending 10 minutes browsing for it.
I noticed that your tests always use an extremely small increment like 0.05 seconds. With Komodo we always use an increment that is at least half a percent of the base time, to avoid games being decided by extremely fast play in the endgame. What is your reason for using such a tiny increment?

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Stockfish scaling

Post by mcostalba » Sun Nov 17, 2013 6:10 am

lkaufman wrote: I noticed that your tests always use an extremely small increment like 0.05 seconds. With Komodo we always use an increment that is at least half a percent of the base time, to avoid games being decided by extremely fast play in the endgame. What is your reason for using such a tiny increment?
Actually the TC is essentially a fixed-time game; the 0.05 is only needed to avoid losses on time due to communication lags.

We don't test with increments because the average game length turns out to be much longer. For instance, even 0.5 secs of increment (so 15+0.5) can bump the average game length from 25-30 secs to about 80-90 secs, a 3x factor. This means a throughput 3 times slower, and in the case of the long TC, 60+1 would take almost 5 minutes per game instead of about 2.

So to answer your question: it is for testing efficiency.
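As a rough sanity check of the arithmetic above, here is a back-of-the-envelope sketch (an assumption on my part, not fishtest code: both sides are taken to spend their full budget over a nominal 60 moves per side, so these figures are upper-bound estimates, a bit above the real averages quoted):

```python
def expected_game_seconds(base, inc, moves_per_side=60):
    """Rough wall-clock estimate for one game at a base+inc (per side) TC,
    assuming each side uses its entire budget over moves_per_side moves."""
    return 2 * (base + inc * moves_per_side)

for base, inc in [(15, 0.05), (15, 0.5), (60, 0.05), (60, 1.0)]:
    print(f"{base}+{inc}: ~{expected_game_seconds(base, inc):.0f} s per game")
```

The model reproduces the rough proportions in the post: adding a 0.5 s increment to the 15 s base multiplies the game length (and so divides testing throughput) by about three.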

lucasart
Posts: 3046
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart

Re: Stockfish scaling

Post by lucasart » Sun Nov 17, 2013 8:13 am

mcostalba wrote:
lkaufman wrote: I noticed that your tests always use an extremely small increment like 0.05 seconds. With Komodo we always use an increment that is at least half a percent of the base time, to avoid games being decided by extremely fast play in the endgame. What is your reason for using such a tiny increment?
Actually the TC is essentially a fixed-time game; the 0.05 is only needed to avoid losses on time due to communication lags.

We don't test with increments because the average game length turns out to be much longer. For instance, even 0.5 secs of increment (so 15+0.5) can bump the average game length from 25-30 secs to about 80-90 secs, a 3x factor. This means a throughput 3 times slower, and in the case of the long TC, 60+1 would take almost 5 minutes per game instead of about 2.

So to answer your question: it is for testing efficiency.
Actually Larry's point is valid. I don't think we know (from empirical evidence) whether zero increment is better than a realistic increment. Comparing 15+0.05 and 15+0.5 is not meaningful. Let's assume the average game length is 60 moves: 15"+0.05" -> 36" while 15"+0.5" -> 90". The equivalent is 42"+0.05" -> 90". So the meaningful comparison is between 15"+0.5" and 42"+0.05". The choice is based on how much importance you want to give to the endgame.
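The equivalence worked out above can be sketched in a few lines (a sketch under the same assumption of 60 moves per side; the helper names are mine, not from any testing framework):

```python
def total_time(base, inc, moves_per_side=60):
    """Total wall-clock seconds for one game, both sides using their full budget."""
    return 2 * (base + inc * moves_per_side)

def equivalent_base(target_total, inc, moves_per_side=60):
    """Base time that reproduces target_total seconds per game at a given increment."""
    return target_total / 2 - inc * moves_per_side

# 15+0.5 lasts ~90 s per game; the 0.05-increment TC with the same length:
print(equivalent_base(total_time(15, 0.5), 0.05))  # -> 42.0
```

So the fair comparison keeps the product of base and increment budget constant: 15"+0.5" against 42"+0.05", not 15"+0.5" against 15"+0.05".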

Those are two extremes: 15+0.5 gives lots of importance to the endgame, while 42+0.05 gives much more to the opening and middlegame. In my experience of looking at self-test games at super fast TC, the endgame is more important than you seem to imagine. Very often that's where the game is decided, due to a blunder, because many depths are necessary to play the right move (passed pawn races, king penetration). So I don't know the answer, and I don't think we have any evidence in either direction. My gut feeling, looking at games, is that the endgame should not be neglected, so I use a ratio of 100 or 120 between base time and increment. But my guts have been wrong in the past :lol:

In order to answer this question based on evidence, rather than gut feeling, I suggest the following experiment:
* self-test at 60"+0.05" (master against master) and measure the draw rate. Be sure to relax adjudication rules so as not to adjudicate draws too early, as it's endgame blunders we need to measure.
* self-test at 40"+0.4", and measure the draw rate.

The one which gives the higher draw rate means games of higher quality (fewer blunders). Both TCs have the same testing "throughput", to use your terminology.
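The bookkeeping for such an experiment is small; here is a minimal sketch (the result strings and the normal-approximation error bar are my assumptions, not fishtest code):

```python
import math

def draw_rate(results):
    """results: list of result strings, '1-0', '0-1' or '1/2-1/2'.
    Returns (draw fraction, standard error of that fraction)."""
    n = len(results)
    draws = sum(1 for r in results if r == "1/2-1/2")
    p = draws / n
    se = math.sqrt(p * (1 - p) / n)  # normal approximation; fine for large n
    return p, se

# After both matches are played, compare e.g.:
#   p1, se1 = draw_rate(results_60_005)   # 60"+0.05"
#   p2, se2 = draw_rate(results_40_04)    # 40"+0.4"
# A difference well beyond the combined error bars would be evidence.
```

The error bar matters: with a few thousand games per match, draw-rate differences of a couple of percent are measurable; smaller differences would need more games.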
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Uri Blass
Posts: 8611
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

Re: Stockfish scaling

Post by Uri Blass » Sun Nov 17, 2013 9:31 am

I guess that 40+0.4 is going to give more draws, and my opinion is that it is better to use 10+0.1 for stage 1 and 40+0.4 for stage 2 instead of 15+0.05 for stage 1 and 60+0.05 for stage 2.

This also has the advantage of better testing of changes that may be relevant only in the endgame, like adding stalemate detection, so Stockfish can finally see the draw in positions like the following one, where Stockfish never sees the draw (except at very small depths) and evaluates the position as more than 6 pawns for White, because in the search it always finds a way to delay the stalemate to positions where the remaining depth is too small to see it.

[D]7k/6p1/8/7P/8/8/8/K6B w - - 11 1

There is no problem of playing a wrong move here, but there may be a problem some moves earlier if Stockfish prefers to go to this position rather than to a different winning position that it evaluates at only +4 or +5 pawns for White.
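The failure mode described here is worth pinning down: a position with no legal moves and no check must score as a draw, not as whatever the static evaluation claims. A toy negamax over a hand-built game tree shows the idea (this is an illustration of the general principle, not Stockfish code; the node layout and names are mine):

```python
DRAW, MATE = 0, 100_000

def negamax(node, depth):
    """Scores `node` from the side to move's point of view."""
    moves = node.get("children", [])
    if not moves:
        # Terminal: no legal moves means checkmate if in check,
        # otherwise stalemate -- which must be scored as a draw.
        return -MATE if node.get("in_check") else DRAW
    if depth == 0:
        return node["eval"]  # static evaluation at the horizon
    return max(-negamax(child, depth - 1) for child in moves)

# White to move at the root. One move heads toward a position the static
# eval loves but which only leads to stalemate; the other keeps a real,
# smaller advantage.
stalemate = {"children": [], "in_check": False}      # Black to move, stuck
trap = {"children": [stalemate]}                     # Black to move
solid = {"children": [{"eval": 400, "children": [{}]}]}  # real +4 for White
root = {"children": [trap, solid]}

print(negamax(root, 2))  # -> 400: the search prefers the genuine +4
```

With the terminal check in place the stalemate line scores 0, so the search picks the modest but real advantage. Without it, the trap line would keep its inflated static score for as long as the stalemate stays beyond the horizon, which is exactly the behaviour described above.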

lucasart
Posts: 3046
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart

Re: Stockfish scaling

Post by lucasart » Sun Nov 17, 2013 1:01 pm

Uri Blass wrote:I guess that 40+0.4 is going to give more draws, and my opinion is that it is better to use 10+0.1 for stage 1 and 40+0.4 for stage 2 instead of 15+0.05 for stage 1 and 60+0.05 for stage 2.
That is also my guess. But instead of guessing, we need to measure.

Unfortunately the framework cannot be used for that, for two reasons:
1/ its broken logic of not rescaling the increment along with the base time
2/ the adjudication rules in the framework are far too aggressive, and in this case we ideally want no adjudication at all, to avoid polluting the measurement.

So it has to be done locally. I can't do it now, because my CPU resources are running full blast CLOP-ing DiscoCheck all over the place.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
