question about speedup when starting a test position

John Major · Post by **John Major** » Sun Dec 04, 2011 2:26 pm

Besides the caches and the hash table, branch prediction also needs to warm up. This is a problem with processor simulation.

zullil · Post by **zullil** » Sun Dec 04, 2011 2:37 pm

kgburcham wrote:
It's weird how the CPU can be at 100% yet the thread is idle.
So it seems the answer is that the demand is not there for 12 threads when the tree is small.
So it also seems that, since it takes several seconds for 12 threads to fully load, some kN/s is lost in fast games.
kgburcham

What setting do you use for minimum split depth? I seem to remember you choosing 14. If so, then it seems that no parallelization could even begin until the search reaches that iteration, if I understand correctly what that parameter means. As an experiment, try setting that parameter to something like 7.

kgburcham · Post by **kgburcham** » Sun Dec 04, 2011 2:54 pm

zullil wrote:
kgburcham wrote:
It's weird how the CPU can be at 100% yet the thread is idle.
So it seems the answer is that the demand is not there for 12 threads when the tree is small.
So it also seems that, since it takes several seconds for 12 threads to fully load, some kN/s is lost in fast games.
kgburcham
What setting do you use for minimum split depth? I seem to remember you choosing 14. If so, then it seems that no parallelization could even begin until the search reaches that iteration, if I understand correctly what that parameter means. As an experiment, try setting that parameter to something like 7.

Several owners of 12-thread systems worked with Robert before the first release. One thing that was worked on was finding the optimum minimum split depth setting, using a bench test Robert came up with. It was determined that 12 was the best setting. Maybe for fast games this could be lower; I'm not sure. Usually I set this to 12 split depth. Thanks for the reply Louis but I am not having any issues with the setting, just curious why the speedup increases with time. Good info in some of these posts.
kgburcham

Houdini · Post by **Houdini** » Sun Dec 04, 2011 3:14 pm

kgburcham wrote: It's weird how the CPU can be at 100% yet the thread is idle.

The "idle" threads are running a small loop, scanning continuously whether other threads have submitted a position to analyze. They are 100% busy doing nothing useful.
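To illustrate what such an idle loop might look like, here is a minimal Python sketch. It is not Houdini's code: the class name `SpinningWorker`, the single `job` slot, and the `idle_loops` counter are all hypothetical, and squaring a number merely stands in for searching a position.

```python
import threading
import time


class SpinningWorker(threading.Thread):
    """Hypothetical helper thread that busy-waits for work instead of sleeping."""

    def __init__(self):
        super().__init__(daemon=True)
        self.job = None              # slot that a "master" thread fills with work
        self.stop_requested = False
        self.idle_loops = 0          # iterations that found nothing to do
        self.results = []

    def run(self):
        while not self.stop_requested:
            job = self.job
            if job is None:
                self.idle_loops += 1         # CPU is busy, nothing useful happens
                continue
            self.results.append(job * job)   # stand-in for searching a position
            self.job = None                  # mark the slot as free again


def demo(jobs=(1, 2, 3)):
    worker = SpinningWorker()
    worker.start()
    while worker.idle_loops == 0:        # let the worker spin with no work offered
        time.sleep(0.001)
    for job in jobs:
        worker.job = job                 # "submit a position"
        while worker.job is not None:    # wait until the worker has finished it
            time.sleep(0.001)
    worker.stop_requested = True
    worker.join()
    return worker.results, worker.idle_loops


results, idle_loops = demo()
print(results, "idle loops:", idle_loops)
```

While the job slot is empty the worker never sleeps, so the operating system reports it as fully busy even though only the idle counter advances.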

Robert

zullil · Post by **zullil** » Sun Dec 04, 2011 6:28 pm

kgburcham wrote: Usually I set this to 12 split depth. Thanks for the reply Louis but I am not having any issues with the setting, just curious why the speedup increases with time.
kgburcham

I'm trying to suggest that the setting itself may partly explain the speedup. No splitting would be occurring at all before the depth 12 iteration, and then the amount of splitting would grow with each subsequent iteration (or so I think; perhaps some expert could explain this correctly).
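A toy model of this reasoning, under loud assumptions: the tree is uniform with a fixed branching factor, nothing is pruned, and a node may be split only when its remaining depth is at least the minimum split depth. The function name and the branching factor of 3 are made up for illustration; how the parameter behaves in a real engine may differ.

```python
def split_eligible_nodes(iteration_depth, min_split_depth, branching=3):
    """Count nodes of a uniform toy tree that are deep enough to be split.

    A node at ply p has remaining depth (iteration_depth - p); it is
    eligible when that remaining depth is >= min_split_depth.
    """
    deepest_ply = iteration_depth - min_split_depth
    if deepest_ply < 0:
        return 0  # no node in this iteration is deep enough to split
    return sum(branching ** ply for ply in range(deepest_ply + 1))


for depth in range(10, 16):
    print(depth,
          split_eligible_nodes(depth, 12),
          split_eligible_nodes(depth, 7))
```

In this model a minimum split depth of 12 yields no split candidates before iteration 12 and a growing number afterwards, while a value of 7 makes candidates available several iterations earlier.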

bob · Post by **bob** » Sun Dec 04, 2011 7:45 pm

kgburcham wrote: When analyzing a test position, why does the kN/s speed up with time?
What is going on in the program to cause this?

There are dozens of issues.

(1) cache fills up over time.

(2) hash table starts off empty and therefore useless, yet once filled it provides critical data (Crafty tries the hash move before generating moves; if that produces a cutoff, it is much faster than generating moves and doing a search).

(3) hash table has to be "faulted in" for the first access to each page. That can take time until every page has been touched at least once.

(4) parallel search works better as the search goes deeper, which makes the NPS climb.
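Point (3) can be illustrated with a small sketch that writes one byte to every page of an anonymous memory mapping, which is roughly what pre-faulting a hash table amounts to. The helper name `prefault` and the 64-page table size are assumptions for illustration; a real engine would do this in C on a far larger table.

```python
import mmap


def prefault(table):
    """Write one byte per page so the OS backs every page with real memory."""
    pages = 0
    for offset in range(0, len(table), mmap.PAGESIZE):
        table[offset] = 0   # first write to a page triggers the page fault
        pages += 1
    return pages


# Anonymous mapping standing in for a (tiny) hash table.
table = mmap.mmap(-1, 64 * mmap.PAGESIZE)
touched = prefault(table)
print("pages touched:", touched)
```

Until each page has been touched once like this, the first probe that lands on it pays the fault cost during the search itself.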

Houdini · Post by **Houdini** » Sun Dec 04, 2011 8:08 pm

Houdini wrote:The main reason is that the alpha-beta algorithm is in essence a serial algorithm, which our SMP implementations try to transform into a parallel operation. At low search depths not all threads have something useful to do. The CPU is at 100% but the threads are actually idling and waiting for other threads to submit positions to be analyzed. The more threads you have, the more pronounced the effect. With 2 threads the full speed is nearly instantly there, with 8 threads you need to wait several seconds before most threads actually do something useful.
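The serial nature can be seen in a toy negamax alpha-beta, sketched below. The tree, the function names, and the node counter are all hypothetical; the point is only that later root moves are cheaper when they can use the bound established by the first one, which is exactly the information that threads searching in isolation would lack.

```python
INF = float("inf")


def alphabeta(node, alpha, beta, counter):
    """Negamax alpha-beta over nested lists; integer leaves are scores."""
    counter[0] += 1
    if isinstance(node, int):
        return node
    best = -INF
    for child in node:
        score = -alphabeta(child, -beta, -alpha, counter)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break           # cutoff: remaining siblings need not be searched
    return best


def search_serial(root):
    """Search root moves one after another, passing the bound along."""
    counter = [0]
    return alphabeta(root, -INF, INF, counter), counter[0]


def search_children_independently(root):
    """Search each root move with a full window, as if no bound were shared."""
    counter = [1]           # count the root itself
    best = max(-alphabeta(child, -INF, INF, counter) for child in root)
    return best, counter[0]


TREE = [[3, 5], [2, 9], [1, 8]]
print(search_serial(TREE))
print(search_children_independently(TREE))
```

Both searches return the same score, but the independent version visits more nodes because the second and third root moves cannot be cut off early.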

The Houdini 2.0 autotune command shows you the number of "idle" loops that threads have executed, waiting for something useful to do.

By the way, you will see a direct relation between the amount of "idle" and the measured node speed.
If one makes a small table of the above-mentioned results for each elapsed second:

Code:

  Time   Nodes   Idle
(msec)  (kN/s)    (M)
======================
  1000    4898    267
  1999    6324    100
  2999    7088     37
  3998    7277     15
  4998    6940     36
  5997    7310     12
  7000    7184     35
  8003    7245     50
  9004    7254     38
 10003    6892     75
 15005    7491     19
 20010    7699      0
 25013    7992      0
 29014    7918      0
======================

For an "Idle" count of 100 million, the node speed reduction is about 1,000 kN/s.
This is a strong indication that this purely algorithmic effect is the dominant cause of the node speed reduction.
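As a rough check of that estimate, one can fit a least-squares line through the table values. The sketch below copies the figures from the table; the names `SAMPLES` and `least_squares_slope` are made up for illustration. The slope comes out at roughly -10 kN/s per million idle loops, in line with about 1,000 kN/s per 100 million.

```python
# (time in msec, node speed in kN/s, idle loops in millions), from the table
SAMPLES = [
    (1000, 4898, 267), (1999, 6324, 100), (2999, 7088, 37),
    (3998, 7277, 15), (4998, 6940, 36), (5997, 7310, 12),
    (7000, 7184, 35), (8003, 7245, 50), (9004, 7254, 38),
    (10003, 6892, 75), (15005, 7491, 19), (20010, 7699, 0),
    (25013, 7992, 0), (29014, 7918, 0),
]


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line y = a + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


idle = [sample[2] for sample in SAMPLES]
speed = [sample[1] for sample in SAMPLES]
slope = least_squares_slope(idle, speed)
print(f"{slope:.1f} kN/s per million idle loops")
```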

Robert
