I rewrote the main parts of LOOPs recursive search engine. Now LOOP is searching iterative in order to implement DTS. But the first tests have shown that the searchspeed decreased about 10% compared to the recursive version.
Both search engines are producing identical searchtrees, therefore I dont think to discover some bugs.
The following piece of code are the main parts of my first simple search without any heuristics or trans tables and so on...
How can I improve the pure search speed? This piece of code is definitely to slow compared to the recursive version
LoopList wrote:
I rewrote the main parts of LOOPs recursive search engine. Now LOOP is searching iterative in order to implement DTS. But the first tests have shown that the searchspeed decreased about 10% compared to the recursive version.
Both search engines are producing identical searchtrees, therefore I dont think to discover some bugs.
The following piece of code are the main parts of my first simple search without any heuristics or trans tables and so on...
How can I improve the pure search speed? This piece of code is definitely to slow compared to the recursive version
In the recursive implementation, a block of data for local variables are allocated on the stack, and all local variables for one ply are close together. Which means that when you reference one, you suck a lot of them into a cache line at the same time, which makes referencing those pre-fetched values extremely cheap. And the probability is that you will actually use those local values at about the same time.
In the iterative version, you most likely have separate arrays for each local variable, all indexed by a "ply" variable or something similar. This means that a reference to any single local variable value sucks in several words from that array, but they are each for a different ply and are not needed.
The way to correct this is to create a structure of your local data values, and then make an array of that structure type so that once again all local data for a single ply are close together in menory and will benefit from cache much better...
Along with the potential issues already suggested there are known bottlenecks to the 'giant switch statement' technique. What problems you face depends on what optimizations your compiler makes.
Does it pre-compute a lookup table?
Does it go through the switch statement or does optimize to a goto?
Would rewriting with conditionals get better branch prediction?
There is a lot of research available from online sources. Anton Ertl has a lot of papers on efficient VM interpreters.
I'm talking about the specific example code in the root post. It's a not exactly 'giant' switch but still a switch statement implementing a state machine.
Dan Andersson wrote:I'm talking about the specific example code in the root post. It's a not exactly 'giant' switch but still a switch statement implementing a state machine.
MvH Dan Andersson
OK. I didn't look at the code, it was too long to scroll through. I commented only on how I have seen "iterative searches" implemented in the past, which usually wrecks performance with respect to cache.
I dont really understand?! I created a structure of local data values which belongs together. I managed this in an array and reference to every new ply (stack level) via pointer. Whats wrong (cache unfriendly) with this construction?
This is OK, Bob simply did not read your code, so you should ignore his comments.
It could be that your code is slightly less efficient because every 'local' variable has to be accessed through the pointer. Even if the compiler keeps the latter in a register permanently, it still means you have one less register available for expression evaluation.
It could also be that you are just unlucky, and that the several independent stacks grow in such a way that at the critical depth thy collide in the cache, or collide wih important global data structures. This is especially a high risk in machines with a small number of cache ways, such as AMD CPUs. Better merge all your stacks into a single one, to avoid this, and make sure the system stack does not collide with the globals (e.g Zobrist table).