lucasart wrote:I think SF should go with the master branch, for the final. Lazy SMP is stronger than master, but in 3h games the difference should be small:
* We do not have statistically reliable data to know the Elo gain at 3h games, and we never will (because statistically reliable means tens of thousands of games, which we cannot do at this tc).
If the Lazy SMP version scales well, the Elo difference on many cores should be large enough that a reasonable number of games would show it.
It is not clear what "scales" means in this context. I have seen references to BOTH higher NPS, AND longer time-to-depth. If I had time, I'd run a few tests on it to see what it does on my 20 or 24 core boxes... But that's time away from working on my code, which is not exactly a good use of time.
"Scales well" means performing significantly better at longer time controls.
In this context, that is not what "scales" means.
What I mean is getting a good benefit in terms of Elo out of a higher number of cores.
Suppose a particular engine gains nothing from 20 cores compared to 4 cores. Then an improvement of that engine that makes it "scale well" to 20 cores should show a clear gain in strength even after a relatively low number of games.
Note that I did not say that SF does/did not gain anything from going from 4 to 20 cores. I am simply giving an example. But it seems to be accepted that the YBWC version does not scale particularly well to 20 cores. (I personally find it an annoying thought that tree-splitting approaches might be inferior to lazy smp on a large number of cores, but so be it, at least for now.)
If there is an advantage of at least 30 Elo at non-blitz time controls, then it is possible to show the Elo advantage with a test of a few hundred games; you do not need thousands of games.
Yes, that is my point. If lazy smp is really so much better on 20 cores it should not have any problem showing its superiority. This should not be a matter of 1 Elo.
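To put rough numbers on the "hundreds versus tens of thousands of games" argument, here is a minimal sketch using the standard logistic Elo formula and a normal approximation. The function names, the 60% draw ratio and the 95% confidence level are assumptions chosen for illustration, not figures from this thread.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_ratio=0.6, z=1.96):
    """Rough number of games before the score is z standard errors above 50%.

    draw_ratio and z are illustrative assumptions (60% draws, ~95% level).
    """
    p = expected_score(elo_diff)
    win = p - draw_ratio / 2.0          # win probability implied by p and draws
    if win < 0.0 or win + draw_ratio > 1.0:
        raise ValueError("draw_ratio is inconsistent with this Elo difference")
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    variance = win + draw_ratio / 4.0 - p * p
    margin = p - 0.5
    return math.ceil(variance * (z / margin) ** 2)

if __name__ == "__main__":
    for diff in (1, 5, 30):
        print(diff, "Elo ->", games_needed(diff), "games (approx.)")
```

Under these assumptions a 30 Elo gap needs on the order of a couple of hundred games, while a gap of a few Elo needs many thousands, which is consistent with both sides of the argument above.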
mcostalba wrote:The only time-related difference between the old version and the lazy one is the way the engine stops and waits for the slave threads to terminate the search before returning the best move.
So considering the above, my take is that stopping the threads in lazy smp requires more time on the particular TCEC hardware.
Without necessarily going into the technical details, do you have any idea why stopping threads in lazy smp takes longer than in the old implementation? I mean, does it make sense at all?
Since so far all attempts to reproduce the problem have failed, there seems to be a distinct possibility that the old version (with the same lag parameter setting) might have the same very rarely occurring problem, and that it is just a matter of coincidence that the problem has now occurred twice within a couple of games. It could be a case of winning the lottery... (or rather, getting struck by lightning).
Since the slave threads are each accessing tbs (the time forfeits occurred in the end game), they have to wait until the tbs are released. So I wonder if somehow there is a difference in tb access and release time between the 2 different smp versions. Just throwing it out there as a remote possibility.
Not knowing how they handle this "locking", that can certainly be a problem. If they use large memory without 1 GB pages, a lot of memory accesses are needed to map a virtual address to a physical page. Then the caches get into a giant snit doing all the forwarding/invalidating stuff needed to access a single cache-block-sized chunk of memory. Sort of hard imagining that turns into many milliseconds, but I suppose it could. And as the core count climbs, that will only get worse.
syzygy wrote:
Without necessarily going into the technical details, do you have any idea why stopping threads in lazy smp takes longer than in the old implementation? I mean, does it make sense at all?
In YBW the threads are started and stopped at split points (and split points are always created inside search), so that the remaining-time evaluation at root is done _after_ threads are stopped. In lazy smp, by contrast, the remaining-time evaluation is done _before_ threads are stopped, and the stopping time is not accounted for.
Anyhow, on my quad the threads stop in less than 1 msec (tested with 4, 8, 16 and 40 threads), so why the TCEC machine is slow in stopping threads is really a puzzle.
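For anyone who wants to measure this kind of stop latency on their own machine, here is a minimal sketch of the idea: the clock is read first, then the stop flag is raised and the threads are joined, so whatever the join costs comes on top of the time already counted. This is not Stockfish code; `measure_stop_latency` and the busy-loop worker are hypothetical stand-ins, and Python threads behave differently from native engine threads, so the absolute numbers are not comparable to the 1 msec figure above.

```python
import threading
import time

def measure_stop_latency(num_threads, search_seconds=0.05):
    """Return the milliseconds between raising the stop flag and all threads exiting."""
    stop = threading.Event()

    def worker():
        # Stand-in for a search thread: do a bit of work, then poll the stop flag.
        total = 0
        while not stop.is_set():
            for i in range(1000):
                total += i

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()

    time.sleep(search_seconds)       # pretend to search until the time check fires

    t0 = time.perf_counter()         # time is evaluated here, before stopping
    stop.set()
    for t in threads:
        t.join()                     # this wait is what is "not accounted for"
    return (time.perf_counter() - t0) * 1000.0

if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, "threads:", round(measure_stop_latency(n), 3), "ms to stop")
```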
mcostalba wrote:
Louis tried hard to reproduce the time loss on his big hardware machine, but he failed even under the extreme conditions he threw at SF.
Perhaps I didn't try hard enough. But there's no way I'm going to install Windows.
mcostalba wrote:The only time-related difference between the old version and the lazy one is the way the engine stops and waits for the slave threads to terminate the search before returning the best move.
So considering the above, my take is that stopping the threads in lazy smp requires more time on the particular TCEC hardware.
Without necessarily going into the technical details, do you have any idea why stopping threads in lazy smp takes longer than in the old implementation? I mean, does it make sense at all?
Since so far all attempts to reproduce the problem have failed, there seems to be a distinct possibility that the old version (with the same lag parameter setting) might have the same very rarely occurring problem, and that it is just a matter of coincidence that the problem has now occurred twice within a couple of games. It could be a case of winning the lottery... (or rather, getting struck by lightning).
Since the slave threads are each accessing tbs (the time forfeits occurred in the end game), they have to wait until the tbs are released. So I wonder if somehow there is a difference in tb access and release time between the 2 different smp versions. Just throwing it out there as a remote possibility.
I'm afraid you don't seem to know what you're talking about.
mcostalba wrote:
Louis tried hard to reproduce the time loss on his big hardware machine, but he failed even under the extreme conditions he threw at SF.
Perhaps I didn't try hard enough. But there's no way I'm going to install Windows.
LOL
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
syzygy wrote:
Without necessarily going into the technical details, do you have any idea why stopping threads in lazy smp takes longer than in the old implementation? I mean, does it make sense at all?
In YBW the threads are started and stopped at split points (and split points are always created inside search), so that the remaining-time evaluation at root is done _after_ threads are stopped. In lazy smp, by contrast, the remaining-time evaluation is done _before_ threads are stopped, and the stopping time is not accounted for.
But this (checking remaining time at root) refers to the case where SF manages to complete an iteration, if I understand right.
I suppose that in the games lost on time, SF did not manage to finish the iteration (maybe I am wrong?). If the iteration takes longer than expected and time runs out, I suppose SF somehow detects this and interrupts the iteration (I have not looked into the code yet). When this happens, it still needs to stop the search threads (both in YBW and in lazy smp).
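The supposition above (time runs out mid-iteration, the engine notices and interrupts the search) can be sketched as a loop that polls the clock every so many nodes. This is a hypothetical illustration, not how SF actually does it: `search_with_deadline`, `stop_reserve_ms` and `check_every` are made-up names. It shows why a reserve for stopping the threads would have to be subtracted from the allotted time if stopping is not free.

```python
import time

def search_with_deadline(allotted_ms, stop_reserve_ms=0.0, check_every=1024):
    """Run a dummy search loop until the deadline; return (nodes, elapsed_ms).

    stop_reserve_ms is a hypothetical safety margin for the time needed
    to stop the helper threads after the deadline is detected.
    """
    start = time.perf_counter()
    deadline_ms = allotted_ms - stop_reserve_ms
    nodes = 0
    while True:
        nodes += 1                           # stand-in for visiting one node
        if nodes % check_every == 0:         # only read the clock occasionally
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            if elapsed_ms >= deadline_ms:
                return nodes, elapsed_ms     # interrupt mid-iteration

if __name__ == "__main__":
    nodes, elapsed = search_with_deadline(50.0, stop_reserve_ms=10.0)
    print(nodes, "nodes searched, stopped after", round(elapsed, 2), "ms")
```

In this sketch the worst-case overshoot is the time between two clock checks plus however long the threads take to stop, which is the quantity under discussion for both YBW and lazy smp.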