New Stockfish with Lazy_SMP, but what about the TC bug ?

syzygy · Post by **syzygy** » Fri Oct 30, 2015 12:47 am

Dann Corbit wrote:Let's suppose that you have 10 seconds to think at the start of your time slot.
Sleep for 5 seconds and check again.
Let's suppose you have 4.9 seconds left.
Sleep for 2.45 seconds and check again.

As soon as your time left to think is less than your error margin, stop thinking and fire the callback.

OK, this indeed makes sense. Except that it does not solve the problem that sleep() is unreliable....

bob · Post by **bob** » Fri Oct 30, 2015 1:08 am

syzygy wrote:
Sylwy wrote:
syzygy wrote:
Windows and Linux are indeed not real-time operating systems, so they should not be used to run a nuclear plant.

What about QNX ? That's for sure an RTOS. The BlackBerry smartphones use modified versions of the QNX operating system !

So now you know which phone to buy to control your private nuclear plant with.

http://www.itbusiness.ca/news/nuclear-p ... me-os/9084
http://www.qnx.com/company/
Over the past 35 years, QNX software has become a big part of everyday life. People encounter QNX-controlled systems whenever they drive, shop, watch TV, use the Internet, or even turn on a light. Its ultra-reliable nature means QNX software is the preferred choice for life-critical systems such as air traffic control systems, surgical equipment, and nuclear power plants.
http://www.financialpost.com/news/Guess ... story.html

Problem is, that when minor issues like fairness, load balancing, response times to user input and such occur, you might not be so happy.

I find all of this discussion to be a bit off-the-wall, however. Who in their right mind would expect an OS to time accurately down to the millisecond, when most OSes want at most 100 timer interrupts per second to limit overhead? Or, in the case of the PC, 18+ per second. And no normal operating system wants to context-switch like mad, due to efficiency issues, so they don't use a sub-millisecond scheduling quantum. 100ms is about as short as anyone is willing to go. I can't see why one would not reserve at LEAST a couple of seconds for overhead time to be safe... In my case, when the game starts, I know the base time control, which gives me a time per move. After 10 moves, it is likely I have used less than 10x the time per move, due to book moves and ponder hits, so I am ahead. Saving 2-3-4-5 seconds out of that somehow lowers Elo???

You don't need much, but certainly 100ms is always going to be suspect since the scheduling quantum is bigger than that.

Dann Corbit · Post by **Dann Corbit** » Fri Oct 30, 2015 1:30 am

Right, but you don't care that sleep is unreliable, because you stop calling sleep when you decide that the margin is too close.

Instead of one big slab of time, that you hope comes out accurate, you chop the time slices into smaller and smaller pieces and you can quit when you get close enough.

syzygy · Post by **syzygy** » Fri Oct 30, 2015 1:42 am

Dann Corbit wrote:Right, but you don't care that sleep is unreliable, because you stop calling sleep when you decide that the margin is too close.

And then what do you do?

Instead of one big slab of time, that you hope comes out accurate, you chop the time slices into smaller and smaller pieces and you can quit when you get close enough.

The idea is fine as it greatly reduces the number of sleep()s and therefore greatly reduces the number of times that the timer thread has to be woken up (so lower context switching overhead). But it does not help at all in getting the timing more reliable.

Whatever you do when the margin gets "too close" with your approach (e.g. simply stop the search), you can do with SF's current approach when the margin gets "too close". The whole problem is that now and then spikes occur which make "too close" not that close at all.

Instead of one big slab of time, that you hope comes out accurate, you chop the time slices into smaller and smaller pieces and you can quit when you get close enough.

Maybe you are assuming that SF currently is sleep()ing for one big slab of time. It is not. It is sleep()ing for 5ms at a time, or at least that's what it tries to do. Every time it wakes up, it checks the time.

The problem is not that waiting for one big slab of time is inaccurate. In fact, SF could most likely simply sleep for one big slab of time instead of for 5ms at a time without decreasing timing accuracy at all; the OS certainly is able to accurately keep track of milliseconds on a time scale of minutes.

The problem is that when the timing thread is supposed to wake up after the sleep() time has expired, it sometimes takes maybe up to 150-200ms before it actually wakes up. So you try to sleep for 5ms but wake up after 205ms. Or you try to sleep for one big slab of 100000ms but wake up after 100200ms. Either way, if the safety margin is less than 200ms, SF might lose on time.
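
The wake-up delay described here can be measured directly. The following is a minimal sketch in Python, not Stockfish's own C++ timer code; the function name `measure_oversleep` and the default sample count are assumptions made for illustration. It repeatedly requests a 5ms sleep and records how much later than requested each wakeup arrives.

```python
import time

def measure_oversleep(samples=100, request_s=0.005):
    """Return the oversleep of each sleep in seconds (actual minus requested)."""
    oversleeps = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(request_s)                      # ask the OS for a short sleep
        elapsed = time.perf_counter() - start      # how long we were really away
        oversleeps.append(elapsed - request_s)
    return oversleeps

if __name__ == "__main__":
    result = measure_oversleep()
    print(f"worst oversleep: {max(result) * 1000:.1f} ms")
    print(f"mean oversleep:  {sum(result) / len(result) * 1000:.1f} ms")
```

The spikes reported in this thread showed up only on some machines and under load, so a short run on an idle machine may show nothing unusual.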

Dann Corbit · Post by **Dann Corbit** » Fri Oct 30, 2015 1:55 am

syzygy wrote:
Dann Corbit wrote:Right, but you don't care that sleep is unreliable, because you stop calling sleep when you decide that the margin is too close.
And then what do you do?

Instead of one big slab of time, that you hope comes out accurate, you chop the time slices into smaller and smaller pieces and you can quit when you get close enough.
The idea is fine as it greatly reduces the number of sleep()s and therefore greatly reduces the number of times that the timer thread has to be woken up (so lower context switching overhead). But it does not help at all in getting the timing more reliable.

Whatever you do when the margin gets "too close" with your approach (e.g. simply stop the search), you can do with SF's current approach when the margin gets "too close". The whole problem is that now and then spikes occur which make "too close" not that close at all.

It won't make the time more accurate, but it will make it far less likely to use too much time, which is what you really care about.

syzygy · Post by **syzygy** » Fri Oct 30, 2015 1:57 am

Dann Corbit wrote:It won't make the time more accurate, but it will make it far less likely to use too much time, which is what you really care about.

It won't do that, so it won't do what you care about.

See what I added to my previous comment. It seems you are starting from an incorrect assumption and incorrect understanding of the problem ("one big slab of time"). The problem is the unpredictable but sometimes very large delay in waking up.

Dann Corbit · Post by **Dann Corbit** » Fri Oct 30, 2015 2:18 am

It seems that stockfish sleeps 5 ms between checks, so in a 5 minute search it performs 60,000 time checks.

I would perform log2(60,000) checks. There is considerable overhead to sleep and time stamp collection calls.

If Stockfish goes back to sleep because the time is not used up yet in its current search, it can go over by (max_oversleep - remaining_time).

IOW, 4 ms to go, so we sleep again, supposing max oversleep = 200 ms, we overslept in the worst case by 196 ms.

I would stop sleeping as soon as 1/2 remaining time is equal to the estimated max oversleep.
When the time left was 200 ms, I would not call sleep again[*].

[*]Caveats:
1) If there is base time left, and it is more than max oversleep, go ahead and call it again if you like.
2) If there is a fail low and no base time left, you could play chicken with the deadline by using 1/2 max oversleep if tests indicate that it is worthwhile to do that.
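
The figures in this post can be checked with a short calculation. The sketch below is Python, not Stockfish code; `halving_sleeps` is a hypothetical helper, and it takes one possible reading of the stopping rule above: keep sleeping for half of the remaining time only while that half is still larger than the assumed maximum oversleep. The clock is idealized (each simulated sleep lasts exactly as long as requested).

```python
import math

SEARCH_MS = 5 * 60 * 1000      # 5-minute search
POLL_MS = 5                    # fixed polling interval discussed in the thread
MAX_OVERSLEEP_MS = 200         # assumed worst-case wakeup delay

fixed_checks = SEARCH_MS // POLL_MS                     # checks with 5 ms polling
halving_checks = math.ceil(math.log2(fixed_checks))     # checks if the sleep halves each time
worst_overrun_ms = MAX_OVERSLEEP_MS - 4                 # sleeping again with 4 ms to go

def halving_sleeps(total_ms, max_oversleep_ms):
    """Return (list of requested sleeps, time left when sleeping stops)."""
    remaining = float(total_ms)
    sleeps = []
    while remaining / 2 > max_oversleep_ms:
        # Even if this sleep overshoots by max_oversleep_ms it still wakes in
        # time, because remaining / 2 + max_oversleep_ms < remaining.
        sleeps.append(remaining / 2)
        remaining /= 2           # idealized: the sleep takes exactly what was requested
    return sleeps, remaining

sleeps, left_ms = halving_sleeps(SEARCH_MS, MAX_OVERSLEEP_MS)
print(fixed_checks, halving_checks, worst_overrun_ms, len(sleeps), left_ms)
# prints: 60000 16 196 10 292.96875
```

Under this reading the timer thread wakes 10 times instead of 60,000 and stops sleeping with roughly 293ms left on the clock.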

Milos · Post by **Milos** » Fri Oct 30, 2015 5:21 am

lucasart wrote:Finally, the problem is understood, in such a way that we can reproduce and measure it:
https://groups.google.com/forum/?fromgr ... PNocZQkW-4

We are relying on the OS sleep function, which oversleeps on large machines with heavy system load. The oversleeping we managed to measure is beyond our imagination. If you ask the OS to sleep for 5ms, it can easily sleep for 500ms (on large machines under heavy system load).

The problem is not Windows specific, but instead specific to large machines (many cores):
1/ Elan produced some significant oversleeping on a 4 core Windows machine.
2/ Joost produced some even larger oversleeping on a 32 core Linux machine.
3/ On my 4 core Linux machine, no oversleeping at all.
=> your mileage may vary...

Technically there is no bug in SF, but in the OS/hardware. However, we need to code a workaround in SF to avoid this bad OS behaviour, as it is SF that gets penalized for it in the end...

PS: Well, that's at least ONE problem. Maybe once we fix it, we will discover another one hidden behind

Just read the whole discussion on node polling. Why the hell do you need to poll on all the threads???
Just poll on the main search thread (and use one specific node-counting variable just on that thread); you'll have the same polling interval on the same machine regardless of the number of threads you are running, and no congestion. Checking every 4096 nodes, i.e. nodes_searched_by_main_thread>>12, is more than fine.
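
A minimal sketch of polling the clock from the main search thread only, written in Python for brevity rather than in Stockfish's C++. The class `MainThreadPoller` and its fields are hypothetical names, and the clock is injectable so the behaviour can be checked without real waiting. The node counter belongs to the main thread alone, and the clock is read once every 4096 nodes.

```python
import time

POLL_MASK = (1 << 12) - 1   # 4095: the low 12 bits are zero once every 4096 nodes

class MainThreadPoller:
    """Time check driven by the main search thread's own node counter."""

    def __init__(self, deadline, clock=time.monotonic):
        self.deadline = deadline   # clock value at which the search must stop
        self.clock = clock         # injectable for testing
        self.nodes = 0             # nodes searched by the main thread only
        self.polls = 0             # how many times the clock was actually read
        self.stop = False

    def on_node(self):
        """Call once per node on the main thread; returns True when time is up."""
        self.nodes += 1
        if (self.nodes & POLL_MASK) == 0:
            self.polls += 1
            if self.clock() >= self.deadline:
                self.stop = True
        return self.stop
```

Helper threads would only read the `stop` flag, so the polling interval does not depend on how many of them are running. Note that this replaces the sleeping timer thread with a check inside the search itself, so the interval is measured in nodes rather than in milliseconds.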

lucasart · Post by **lucasart** » Fri Oct 30, 2015 11:47 am

bob wrote:
Jesse Gersenson wrote:Lucas, I had a hunch (edited my message as you were responding, asking whether the timer function was at issue).

Ok, so it's the sleeper function. How do programs in the financial sector solve this problem?
They don't, they don't depend on ms or usec process scheduling delays.

Ever heard of high-frequency algorithmic trading ?
Just like computer chess, banks have evolved in the last 30 years.

syzygy · Post by **syzygy** » Sat Oct 31, 2015 12:00 am

Dann Corbit wrote:It seems that stockfish sleeps 5 ms between checks, so in a 5 minute search it performs 60,000 time checks.

I would perform log2(60,000) checks. There is considerable overhead to sleep and time stamp collection calls.

Sure, but I already said your idea had merit in so far as it addresses that overhead. (But you could as well do a single sleep for 5 minutes. That should work just as fine and cause still less overhead.)

The point at issue is an entirely different one.

If Stockfish goes back to sleep because the time is not used up yet in its current search, it can go over by (max_oversleep - remaining_time).

IOW, 4 ms to go, so we sleep again, supposing max oversleep = 200 ms, we overslept in the worst case by 196 ms.

I would stop sleeping as soon as 1/2 remaining time is equal to the estimated max oversleep.
When the time left was 200 ms, I would not call sleep again[*].

So you try to leave at least 200ms on the clock, so in the worst case the last sleep() consumes most of that 200ms but still lets you move in time.

Note that this can be done no matter whether you sleep once for 5 minutes, 60,000 times for 5ms, or log2(60,000) times.
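
This can be illustrated with a small simulation. The sketch below is Python on an idealized integer-millisecond clock, not Stockfish code; `simulate_timer` and the three schedules are hypothetical, and the single oversleep spike is assumed to hit the final wakeup, which is the worst case for losing on time. Each schedule is also assumed never to request a sleep that runs past the planned stop point.

```python
def simulate_timer(budget_ms, margin_ms, spike_ms, next_sleep):
    """Simulate a timer thread that stops the search margin_ms before the flag falls.

    next_sleep(slack_ms) returns the sleep to request, between 1 and slack_ms.
    Returns (time at which the search is finally stopped, number of wakeups).
    """
    stop_at = budget_ms - margin_ms   # planned stop, leaving the margin unused
    now = 0
    wakeups = 0
    while now < stop_at:
        now += next_sleep(stop_at - now)
        wakeups += 1
        if now >= stop_at:
            now += spike_ms           # assumed worst case: the last wakeup is late
    return now, wakeups

SCHEDULES = {
    "fixed 5 ms": lambda slack: min(5, slack),
    "halving":    lambda slack: min(slack, max(slack // 2, 5)),
    "one sleep":  lambda slack: slack,
}

for name, schedule in SCHEDULES.items():
    finish, wakeups = simulate_timer(300_000, 200, 200, schedule)
    print(f"{name}: stopped at {finish} ms after {wakeups} wakeups")
```

Under these assumptions all three schedules stop at the same moment, 300000ms with a 200ms margin and a 200ms spike. Only the number of wakeups differs.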
