New Stockfish with Lazy_SMP, but what about the TC bug ?
Moderator: Ras
-
Sylwy
- Posts: 5181
- Joined: Fri Apr 21, 2006 4:19 pm
- Location: IAȘI - the historical capital of MOLDOVA
- Full name: Silvian Rucsandescu
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
syzygy wrote:
> Windows and Linux are indeed not real-time operating systems, so they should not be used to run a nuclear plant.
What about QNX? That's for sure an RTOS. The BlackBerry smartphones use modified versions of the QNX operating system!
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
Jesse Gersenson wrote:
> Dann Corbit wrote:
> > Seems likely that there is not a problem then, if the value 50 is changed to something large enough to prevent loss.
> > Windows has a real time timer accurate to 1 microsecond for Windows 8 or Windows Server 2012 and above. I imagine that timer could be used to prevent time loss (likely the machine in question has a modern operating system).
> > There is almost certainly a way to collect the same thing using Linux, I imagine. It's just querying the system hardware.
> disclaimer: I have no idea what I'm talking about...
> Can engines use GPS clocks? Would a very accurate clock solve whatever you're trying to solve, or is it more complicated than that?
> Some quick reading suggests:
> http://linux.die.net/man/3/clock_gettime
> #include <time.h>
> CLOCK_REALTIME
> System-wide realtime clock. Setting this clock requires appropriate privileges.
> CLOCK_MONOTONIC
> Clock that cannot be set and represents monotonic time since some unspecified starting point.
> Or
> http://www.cplusplus.com/reference/ctim ... S_PER_SEC/
You are thinking about this WRONG. It is not a problem of measuring time to the nearest pico-picosecond. It is about process scheduling once a process blocks or is preempted by the O/S. Getting control BACK within a few msecs or usecs is the problem, and that isn't a matter of how accurately the engine measures time, it is a function of the operating system process scheduler.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
Jesse Gersenson wrote:
> Lucas, I had a hunch (edited my message as you were responding, asking whether the timer function was at issue).
> Ok, so it's the sleeper function. How do programs in the financial sector solve this problem?
They don't, they don't depend on ms or usec process scheduling delays.
-
mar
- Posts: 2687
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
lucasart wrote:
> Here's the offending code:
> https://github.com/official-stockfish/S ... d.cpp#L109
Interesting, so the problem is Windows specific?
Maybe you could try to use timeBeginPeriod(1) and timeEndPeriod(1) at the beginning and end of the program.
It seems to have no impact here (Win8.1) but it might help on other OS versions.
When I measure sleep(1) here (i.e. 1 msec) I get an average of 1.5 msec per sleep call, so I don't see the default 16/20 msec granularity.
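For anyone who wants to repeat that kind of measurement on their own machine, here is a minimal sketch. It is an assumption on my part that a portable stand-in is acceptable: it uses std::this_thread::sleep_for and std::chrono::steady_clock rather than the Windows Sleep() call measured above, and the function name average_sleep_ms is made up for illustration.

```cpp
#include <chrono>
#include <thread>

// Average wall-clock duration, in milliseconds, of `iterations` sleeps that
// each request `requested`. Hypothetical helper, not taken from any engine.
double average_sleep_ms(std::chrono::milliseconds requested, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        std::this_thread::sleep_for(requested);  // may return later than asked
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}
```

Calling average_sleep_ms(std::chrono::milliseconds(1), 1000) on an idle machine and again with every core busy should show how far the result depends on load. The numbers will vary by OS and hardware, so none are quoted here.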
-
Dann Corbit
- Posts: 12870
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
I think that the problem boils down to the fact that neither Windows nor Linux is a real-time operating system.
So when you call Sleep(N) or usleep(n) or whatever, it might be a lot longer than what you asked for.
A solution would be to use a routine which performs a high resolution timer call sequence that does a binary search and fires the event when you get within your own specification.
-
kbhearn
- Posts: 411
- Joined: Thu Dec 30, 2010 4:48 am
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
lucasart wrote:
> Finally, the problem is understood, in such a way that we can reproduce and measure it:
> https://groups.google.com/forum/?fromgr ... PNocZQkW-4
> We are relying on the OS sleep function, which oversleeps on large machines with heavy system load. The oversleeping we managed to measure is beyond our imagination. If you ask the OS to sleep for 5ms, it can easily sleep for 500ms (on large machines under heavy system load) :shock:
> The problem is not Windows specific, but instead specific to large machines (many cores):
> 1/ Elan produced some significant oversleeping on a 4 core Windows machine.
> 2/ Joost produced some even larger oversleeping on a 32 core Linux machine.
> 3/ On my 4 core Linux machine, no oversleeping at all.
> => your mileage may vary...
> Technically there is no bug in SF, but in the OS/hardware. However, we need to code a workaround in SF to avoid this bad OS behaviour, as it is SF that gets penalized for it in the end...
> PS: Well, that's at least ONE problem. Maybe once we fix it, we will discover another one hidden behind :lol:
It is NOT a bug in the operating systems. It's a very desirable property of the thread scheduler that when you have threads that have work to do, they get given as large a timeslice as possible, minimising the amount of time lost to preempting them frequently (and possibly landing them on a different core with the wrong cache when they resume). As such, when you're heavily using every available logical core on the system, it should be expected that the N+1th thread might have trouble waking up when you want it to. Sleep functions were never intended to be used as timers and should be read as 'sleep at least this long'. Operating systems provide separate facilities for timers. The solution of retiring the timer thread is the right one.
-
Waschbaer
- Posts: 68
- Joined: Mon Dec 12, 2011 11:27 pm
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
Dann: +1
On a hot swapping system (Unix), not paging, it can take hours.
In the past, customers told us the system was hanging, but it wasn't; it was "only" a very lousy response time.
-
syzygy
- Posts: 5975
- Joined: Tue Feb 28, 2012 11:56 pm
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
Dann Corbit wrote:
> A solution would be to use a routine which performs a high resolution timer call sequence that does a binary search and fires the event when you get within your own specification.
???
A binary search into time? Back and forth? I am sorry, but do you have anything concrete in mind or are these just words that look nice?
The issue is pretty simple. The engine may search for a particular maximum amount of time. By setting the maximum to a value lower than what is left on the clock, you are sure to move in time PROVIDED that you can reliably move before that maximum amount of time is over. So you need a way to let the program set an internal abort flag once that maximum amount of time runs out. The search threads check that flag and return when it is set.
There are essentially two ways to check the time: let the engine regularly poll the clock, or let the OS somehow signal the process when time runs out.
On an OS like Windows or Linux the polling approach may fail once in a while, because threads and processes are not always running. If the engine is not running for a particular period of time, there is no way it can notice during that period of time that time has run out.
For similar reasons the signalling approach may fail. The OS would probably notice that time has run out, but it then must still wake up the engine. If for some reason it does not give priority to the engine and first runs other processes, a too small safety margin might prove insufficient.
However, given that other engines don't suffer from the same problem (and probably don't all use a safety margin that is much larger than SF's), it seems that SF's timing method is particularly prone to problems.
I might be wrong, but I am guessing that previous TCEC machines had HT enabled. With HT enabled, the timing thread does not have to wait for a (logical) core to become available and most likely wakes up much more reliably. With HT disabled, all cores are typically busy searching and the timing thread has to wait until it gets scheduled by the OS scheduler.
My private engine sets up a signal handler that is invoked by Linux when time runs out. That signal handler sets an abort flag that is checked by all search threads. Stockfish's sleep() approach seems a bit clumsy to me, but I would not immediately know a clean way to do it that also works on Windows. Probably the easiest way is to check the time after every 10k nodes or so by any search thread.
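To make the polling variant concrete, here is a minimal sketch of "check the time every 10k nodes and set an abort flag". Everything in it is hypothetical and written for illustration: TimeControl, should_stop and search_until_abort are invented names, not code from Stockfish or from my engine, and the loop body stands in for a real search.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Hypothetical shared state: a hard deadline plus the abort flag that all
// search threads check.
struct TimeControl {
    Clock::time_point deadline;
    std::atomic<bool> abort{false};
};

// Called by a search thread once per node. The clock is read only every
// kPollInterval nodes, so the polling overhead stays small.
inline bool should_stop(TimeControl& tc, std::uint64_t& nodes) {
    constexpr std::uint64_t kPollInterval = 10000;
    if (++nodes % kPollInterval == 0 && Clock::now() >= tc.deadline)
        tc.abort.store(true, std::memory_order_relaxed);
    return tc.abort.load(std::memory_order_relaxed);
}

// Stand-in for a search loop: counts nodes until the abort flag is set.
std::uint64_t search_until_abort(TimeControl& tc) {
    std::uint64_t nodes = 0;
    while (!should_stop(tc, nodes)) {
        // a real engine would search one node here
    }
    return nodes;
}
```

Because any running search thread can set the flag, no separate timer thread has to be woken up. It still cannot help during a stretch in which the OS runs none of the engine's threads, so a safety margin is needed either way.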
-
syzygy
- Posts: 5975
- Joined: Tue Feb 28, 2012 11:56 pm
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
Sylwy wrote:
> syzygy wrote:
> > Windows and Linux are indeed not real-time operating systems, so they should not be used to run a nuclear plant.
> What about QNX ? That's for sure an RTOS. The BlackBerry smartphones use modified versions of the QNX operating system !
So now you know which phone to buy to control your private nuclear plant with.
http://www.itbusiness.ca/news/nuclear-p ... me-os/9084
http://www.qnx.com/company/
> Over the past 35 years, QNX software has become a big part of everyday life. People encounter QNX-controlled systems whenever they drive, shop, watch TV, use the Internet, or even turn on a light. Its ultra-reliable nature means QNX software is the preferred choice for life-critical systems such as air traffic control systems, surgical equipment, and nuclear power plants.
http://www.financialpost.com/news/Guess ... story.html
-
Dann Corbit
- Posts: 12870
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: New Stockfish with Lazy_SMP, but what about the TC bug ?
Let's suppose that you have 10 seconds to think at the start of your time slot.
Sleep for 5 seconds and check again.
Let's suppose you have 4.9 seconds left.
Sleep for 2.45 seconds and check again.
As soon as your time left to think is less than your error margin, stop thinking and fire the callback.
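The steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: wait_by_halving and its parameters are invented names, and std::this_thread::sleep_for stands in for Sleep(N) or usleep(n).

```cpp
#include <chrono>
#include <functional>
#include <thread>

using Clock = std::chrono::steady_clock;

// Sleeps for half of the remaining time, rechecks the clock, and repeats
// until no more than `margin` remains; then fires `on_timeout`.
// Returns the number of sleeps performed. Hypothetical helper.
int wait_by_halving(Clock::time_point deadline,
                    Clock::duration margin,
                    const std::function<void()>& on_timeout) {
    int sleeps = 0;
    for (;;) {
        const auto remaining = deadline - Clock::now();
        if (remaining <= margin)
            break;                                   // within the error margin
        std::this_thread::sleep_for(remaining / 2);  // may oversleep
        ++sleeps;
    }
    on_timeout();
    return sleeps;
}
```

With a 10 second budget and a 50 msec margin this would take about eight sleeps if each one returns on time. Each wake-up still depends on the scheduler, so a single large oversleep can still carry the thread past the deadline.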