Don wrote:
I don't "stash away" any undo information, with the only exception being the hash signature, and the 50-move counter.
The Pentium Pro was not a bad machine. I had one with 4-way memory interleaving (a big server box), and getting rid of copy/make was a win on that machine as well. I suspect it is the same today: memory speeds have not changed much, yet processor speeds are way beyond that original 200MHz Pentium Pro. It is much faster to make/unmake data that is already in cache than to wait hundreds (or thousands) of cycles for something to come in from memory today.
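The copy/make alternative the two are arguing about looks roughly like this (again an illustrative sketch, not code from either program): every ply owns a slot in a small array, a child is made by copying the parent's ~100-byte state, and "unmake" is simply returning to the parent's slot.

[code]
#include <stdint.h>
#include <string.h>

#define MAX_PLY 64

typedef struct {
    uint64_t hash_key;
    int      fifty_counter;
    /* ... roughly 100 bytes of position state in total ... */
} Position;

typedef int Move;

void apply_move(Position *pos, Move m);  /* assumed, as above */

Position stack[MAX_PLY + 1];  /* ~6K total at ~100 bytes per ply */

void search_child(int ply, Move m) {
    memcpy(&stack[ply + 1], &stack[ply], sizeof(Position)); /* copy... */
    apply_move(&stack[ply + 1], m);                         /* ...make */
    /* recurse on stack[ply + 1]; returning to this frame is the
       "unmake" -- stack[ply] was never modified */
}
[/code]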
But I don't think that is happening. If it were as bad as you are saying, my program would be crawling, because I have hash tables.
A simple first approximation to the answer: what is your NPS compared to Crafty's on the same hardware? If you want, I can run your code on our 8-core box, and you can download Crafty's source and run it on your box. Crafty is not "tricky". It spends fully 50% of its total execution time in Evaluate() and its derivatives, so it is not a particularly "fast and simple" program at all. But it is a "fast program".
Due to the cache issue, I tried not storing hash entries in the quiescence search, and it had no impact on my nodes per second. This is random access, reads and writes, all over several megabytes of data.
Interesting, because Bruce Moreland, myself, and a couple of others ran this specific test several years ago. I found that hashing in the q-search slowed me down by about 10% in NPS, and sped me up by about 10% due to the reduced tree size. I elected to leave it out of the q-search anyway, since doing so puts less stress on the hash-table replacement policy and lets me get by with smaller tables than if I stored at least 4x as much data thanks to the q-search stores.
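One way to rerun that experiment is to gate the q-search probe/store behind a compile-time switch and compare NPS and tree size both ways. A sketch; the transposition-table interface below is made up for illustration:

[code]
#include <stdint.h>

#define HASH_QSEARCH 1  /* set to 0 to measure the no-qsearch-hashing case */

typedef struct Position Position;
typedef int Move;

/* hypothetical TT and engine interface, assumed defined elsewhere */
int      tt_probe(uint64_t key, int depth, int alpha, int beta, int *score);
void     tt_store(uint64_t key, int depth, int score, int bound);
uint64_t pos_key(const Position *pos);
int      evaluate(const Position *pos);

int quiesce(Position *pos, int alpha, int beta) {
#if HASH_QSEARCH
    int tt_score;
    if (tt_probe(pos_key(pos), 0, alpha, beta, &tt_score))
        return tt_score;           /* hit: skip stand-pat and captures */
#endif
    int best = evaluate(pos);      /* stand-pat score */
    /* ... generate captures, search them, raise best/alpha ... */
#if HASH_QSEARCH
    tt_store(pos_key(pos), 0, best, 0 /* bound type */);
#endif
    return best;
}
[/code]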
And yet you think a small stack of 30 or so commonly accessed position states is just trashing the whole cache of the machine. I really don't think you are thinking very clearly about this, because you seem to be picking and choosing your assumptions. You assume that the extra logic has no appreciable cost, but that copying 100 bytes is something modern processors just cannot handle.
I have no assumptions. Only concrete measurements. And not just from my program, but from a couple of others (including Ferret from Bruce).
Re-using the same state at every ply in the search is very cache-friendly. Using a different state at every ply is much less friendly and absolutely does increase cache fill rates. How much on today's machines I do not yet know, but I am going to find out.
You made the argument that the problem is so bad that they have to have level 1, level 2, level 3 cache, and so on. So I'll make the argument that this caching technology actually works, and that you can expect 6K of position state to be in the cache.
In _some_ cache. Which do you prefer, however: to access it in 4 clock cycles, 20 clock cycles, 100 clock cycles, or even (worst case) a couple of thousand clock cycles? L1 is not big.
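Those latency tiers are easy to see on any box with a classic pointer-chase loop: a randomly shuffled cycle defeats the prefetcher, so the average time per dependent load approximates the latency of whichever cache level the working set fits in. A self-contained sketch:

[code]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* working sets from 4KB (fits in L1) up to 64MB (main memory) */
    for (size_t bytes = 1u << 12; bytes <= 1u << 26; bytes <<= 2) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;
        for (size_t i = 0; i < n; i++) next[i] = i;
        /* Sattolo's algorithm: shuffle into one big random cycle
           (rand() is crude, but fine for a sketch) */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t q = 0, hops = 1u << 24;   /* ~16M dependent loads */
        clock_t start = clock();
        for (size_t i = 0; i < hops; i++) q = next[q];  /* chase */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%8zu KB: %6.2f ns/load (sink %zu)\n",
               bytes / 1024, secs * 1e9 / (double)hops, q);
        free(next);
    }
    return 0;
}
[/code]

Printing q as a "sink" keeps the compiler from optimizing the chase loop away, since every load depends on the previous one.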