question about speedup when starting a test position

kgburcham
Posts: 2016
Joined: Sun Feb 17, 2008 4:19 pm

Post by kgburcham »

When analyzing a test position, why does the kN/s speed up over time?
What is going on in the program to cause this?
MikeGL
Posts: 1010
Joined: Thu Sep 01, 2011 2:49 pm

Re: question about speedup when starting a test position

Post by MikeGL »

I noticed this too.

During engine startup the kN/s is low, and then it rises a little as
time goes on. This is probably due to the OS's CPU scheduling and maybe
prefetching.

I think it's similar to opening MSWord: the first time I open it, loading the
application takes 6 to 10 seconds, but if I close it and open
MSWord again, it opens in 1 second.

So I guess it has nothing to do with engines, but something to do with OS and latency in hardware access.
kgburcham
Posts: 2016
Joined: Sun Feb 17, 2008 4:19 pm

Re: question about speedup when starting a test position

Post by kgburcham »

It is kinda weird how the task manager goes to 100% on each thread as soon as I start analysis.
Cubeman
Posts: 644
Joined: Fri Feb 02, 2007 3:11 am
Location: New Zealand

Re: question about speedup when starting a test position

Post by Cubeman »

Maybe in order to search a "tree" it first needs time to create the tree.
Steve Maughan
Posts: 1221
Joined: Wed Mar 08, 2006 8:28 pm
Location: Florida, USA

Re: question about speedup when starting a test position

Post by Steve Maughan »

A couple of things could be happening:

1. The cache is empty to start with, so *every* memory access is slow. As the search progresses the cache fills up with useful information that can be accessed more quickly, thus increasing the nps.

2. The engine may do some pre-search analysis. As an extreme example, suppose the first second is used to create piece-value tables etc., so the nps for that first second is zero. Then suppose the engine searches 1 million nodes per second from the second second onwards. The average nps would be 500k after two seconds and about 667k after three seconds. After a "long time" the average would get close to 1 million nps.

I hope this helps,

Steve
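The arithmetic in point 2 can be checked with a few lines of Python. This is only a sketch of the hypothetical engine from that example: the function name `average_nps`, the 1-second setup time and the 1 million nps steady speed are illustrative assumptions, not measurements of any real engine.

```python
def average_nps(elapsed_s, setup_s=1.0, steady_nps=1_000_000):
    """Average nodes per second since the start of analysis.

    Assumes (hypothetically) that the engine searches nothing during
    the first `setup_s` seconds and then runs at a constant `steady_nps`.
    """
    searching_s = max(0.0, elapsed_s - setup_s)  # time actually spent searching
    return steady_nps * searching_s / elapsed_s


# The reported average creeps towards the steady speed but never quite reaches it.
for t in (1, 2, 3, 10, 100):
    print(f"after {t:>3} s: average {average_nps(t):>9.0f} nps")
```

The point is that an nps figure averaged since the start keeps rising for a long time, even though the engine's real speed is constant after the setup phase.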
kgburcham
Posts: 2016
Joined: Sun Feb 17, 2008 4:19 pm

Re: question about speedup when starting a test position

Post by kgburcham »

Cubeman wrote: Maybe in order to search a "tree" it first needs time to create the tree.
the creation of the tree is the result of the search.
then the tree grows as the search continues.
kgburcham
Terry McCracken
Posts: 16465
Joined: Wed Aug 01, 2007 4:16 am
Location: Canada

Re: question about speedup when starting a test position

Post by Terry McCracken »

kgburcham wrote: When analyzing a test position, why does the kN/s speed up over time?
What is going on in the program to cause this?
I had it happen as well. With Houdini I've reached 10 Mnps on my Kentsfield 65nm Q6600 Core2 Quad 2.4GHz and 20 Mnps on my Sandy Bridge 32nm 2600 2nd gen i7 3.4GHz, with Turbo Boost 2.0, 3.5GHz across all four cores and with HT enabled. Fairly fast.
Terry McCracken
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: question about speedup when starting a test position

Post by Houdini »

kgburcham wrote: When analyzing a test position, why does the kN/s speed up over time?
What is going on in the program to cause this?
I suppose you're talking about the speed-up when running multiple threads.

The main reason is that the alpha-beta algorithm is in essence a serial algorithm, which our SMP implementations try to transform into a parallel operation. At low search depths not all threads have something useful to do. The CPU is at 100% but the threads are actually idling and waiting for other threads to submit positions to be analyzed. The more threads you have, the more pronounced the effect. With 2 threads the full speed is nearly instantly there, with 8 threads you need to wait several seconds before most threads actually do something useful.

The Houdini 2.0 autotune command shows you the number of "idle" loops that threads have executed, waiting for something useful to do.
For example, see an autotune result obtained on a Core i5-750 running 4 threads:

info time 1000 nodes 4898859 nps 4898000 tbhits 0 cpuload 919 idle 267M
info time 1999 nodes 6324700 nps 5614000 tbhits 0 cpuload 972 idle 367M
info time 2999 nodes 7088952 nps 6106000 tbhits 0 cpuload 994 idle 404M
info time 3998 nodes 7277099 nps 6400000 tbhits 0 cpuload 956 idle 419M
info time 4998 nodes 6940856 nps 6508000 tbhits 0 cpuload 971 idle 455M
info time 5997 nodes 7310617 nps 6643000 tbhits 0 cpuload 979 idle 467M
info time 7000 nodes 7184213 nps 6717000 tbhits 0 cpuload 964 idle 502M
info time 8003 nodes 7245870 nps 6781000 tbhits 0 cpuload 987 idle 552M
info time 9004 nodes 7254087 nps 6833000 tbhits 0 cpuload 974 idle 590M
info time 10003 nodes 6892642 nps 6839000 tbhits 0 cpuload 960 idle 665M
info time 11004 nodes 7555546 nps 6904000 tbhits 0 cpuload 973 idle 672M
info time 12004 nodes 6982834 nps 6910000 tbhits 0 cpuload 971 idle 702M
info time 13003 nodes 7731080 nps 6974000 tbhits 0 cpuload 987 idle 704M
info time 14002 nodes 7761548 nps 7031000 tbhits 0 cpuload 963 idle 707M
info time 15005 nodes 7491024 nps 7060000 tbhits 0 cpuload 976 idle 726M
info time 16005 nodes 7480882 nps 7086000 tbhits 0 cpuload 927 idle 726M
info time 17006 nodes 7573811 nps 7114000 tbhits 0 cpuload 969 idle 741M
info time 18009 nodes 7475219 nps 7133000 tbhits 0 cpuload 964 idle 781M
info time 19008 nodes 7916744 nps 7175000 tbhits 0 cpuload 960 idle 782M
info time 20010 nodes 7699471 nps 7200000 tbhits 0 cpuload 956 idle 782M
info time 21012 nodes 7753687 nps 7226000 tbhits 0 cpuload 941 idle 785M
info time 22012 nodes 7883482 nps 7256000 tbhits 0 cpuload 995 idle 788M
info time 23014 nodes 7999014 nps 7287000 tbhits 0 cpuload 969 idle 788M
info time 24014 nodes 7901859 nps 7313000 tbhits 0 cpuload 943 idle 788M
info time 25013 nodes 7992600 nps 7340000 tbhits 0 cpuload 960 idle 788M
info time 26012 nodes 7929145 nps 7363000 tbhits 0 cpuload 972 idle 788M
info time 27011 nodes 8034412 nps 7388000 tbhits 0 cpuload 980 idle 788M
info time 28012 nodes 7816414 nps 7403000 tbhits 0 cpuload 926 idle 788M
info time 29014 nodes 7918916 nps 7421000 tbhits 0 cpuload 965 idle 788M
Each line shows the total time, the number of nodes analyzed in the last second, the average node speed, tablebase hits, cpuload, and the idle count.
The "idle" value at the end of each line is the accumulated idle loop count for all 4 threads. It increases very rapidly in the first few seconds, and remains more or less constant after that. This means that after a few seconds all 4 threads are running mostly useful analysis, and the engine is at full speed (about 7,500 to 8,000 kN/s).
But in the first second of the analysis only 4,900 kNodes have been searched, and there have been 267 million idle loops.

Robert
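To see the effect in the autotune lines above more directly, the per-interval speed and the growth of the idle counter can be recomputed from the log. This is a rough sketch: the helper names `parse_autotune` and `summarize` are made up for illustration, and the regular expression assumes exactly the field layout shown in the output above. Only the first three lines of the log are used as sample data.

```python
import re

# Field layout assumed from the autotune output quoted above.
INFO_RE = re.compile(
    r"info time (\d+) nodes (\d+) nps (\d+) tbhits (\d+) cpuload (\d+) idle (\d+)M"
)

SAMPLE = """\
info time 1000 nodes 4898859 nps 4898000 tbhits 0 cpuload 919 idle 267M
info time 1999 nodes 6324700 nps 5614000 tbhits 0 cpuload 972 idle 367M
info time 2999 nodes 7088952 nps 6106000 tbhits 0 cpuload 994 idle 404M
"""


def parse_autotune(text):
    """Extract time (ms), nodes in the interval, and idle count (millions)."""
    rows = []
    for line in text.splitlines():
        m = INFO_RE.search(line)
        if m:
            time_ms, nodes, _nps, _tbhits, _cpuload, idle_m = map(int, m.groups())
            rows.append({"time_ms": time_ms, "nodes": nodes, "idle_m": idle_m})
    return rows


def summarize(rows):
    """Per-interval speed, running average speed, and new idle loops per interval."""
    out = []
    total_nodes = 0
    prev_time_ms = 0
    prev_idle_m = 0
    for r in rows:
        total_nodes += r["nodes"]
        interval_s = (r["time_ms"] - prev_time_ms) / 1000
        out.append({
            "time_ms": r["time_ms"],
            "interval_knps": r["nodes"] / interval_s / 1000,
            "average_knps": total_nodes / (r["time_ms"] / 1000) / 1000,
            "new_idle_m": r["idle_m"] - prev_idle_m,
        })
        prev_time_ms, prev_idle_m = r["time_ms"], r["idle_m"]
    return out


for s in summarize(parse_autotune(SAMPLE)):
    print(f"{s['time_ms']:>6} ms  interval {s['interval_knps']:7.0f} kN/s  "
          f"average {s['average_knps']:7.0f} kN/s  new idle {s['new_idle_m']}M")
```

On these sample lines the recomputed running average agrees with the logged nps column, which confirms that nps is an average since the start, while the new idle loops per interval drop quickly after the first second.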
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: question about speedup when starting a test position

Post by hgm »

This is often a consequence of the program having to 'conquer' physical memory for its hash table. As soon as you write anything in a 4KB page, the OS has to bring that page into memory (possibly freeing some memory first, by copying data of other programs to disk). All subsequent hash hits to that page then find it already in memory. After some time every page of the hash table has been accessed, and the slowdown disappears.
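The first-touch effect described above can be probed with a small experiment. The sketch below is not how any engine allocates its hash table; it just maps an anonymous block of memory with Python's `mmap` module and times two passes that write one byte into every page. The names `touch_pages` and `first_touch_demo` and the 64 MB size are arbitrary choices for illustration. On most systems the first pass should be slower because the OS has to supply each page on first write, but the numbers depend heavily on the OS and on Python's own loop overhead, so treat them as indicative only.

```python
import mmap
import time


def touch_pages(buf, page_size):
    """Write one byte into every page of `buf`; return the elapsed seconds."""
    start = time.perf_counter()
    for offset in range(0, len(buf), page_size):
        buf[offset] = 1
    return time.perf_counter() - start


def first_touch_demo(size_bytes=64 * 1024 * 1024):
    """Time a first and a second pass over a freshly mapped block of memory."""
    page_size = mmap.PAGESIZE
    buf = mmap.mmap(-1, size_bytes)  # anonymous mapping, not backed by a file
    try:
        first = touch_pages(buf, page_size)   # pages are typically faulted in here
        second = touch_pages(buf, page_size)  # pages should already be resident
    finally:
        buf.close()
    return first, second, size_bytes // page_size


first, second, pages = first_touch_demo()
print(f"{pages} pages: first pass {first * 1000:.1f} ms, second pass {second * 1000:.1f} ms")
```

An engine with a multi-gigabyte hash table goes through the same warm-up, only spread over the first seconds of the search as random hash writes gradually reach every page.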
kgburcham
Posts: 2016
Joined: Sun Feb 17, 2008 4:19 pm

Re: question about speedup when starting a test position

Post by kgburcham »

The main reason is that the alpha-beta algorithm is in essence a serial algorithm, which our SMP implementations try to transform into a parallel operation. At low search depths not all threads have something useful to do. The CPU is at 100% but the threads are actually idling and waiting for other threads to submit positions to be analyzed. The more threads you have, the more pronounced the effect. With 2 threads the full speed is nearly instantly there, with 8 threads you need to wait several seconds before most threads actually do something useful.
It's weird how the CPU can be at 100% while the threads are idling.
So it seems the answer is that the demand is not there for 12 threads when the tree is small.
It also seems that, since it takes several seconds for 12 threads to be fully loaded, some kN/s is lost in fast games.
kgburcham