re-inventing the SMP wheel

Discussion of chess software programming and technical issues.


hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

To get back on topic:

It still bothers me that at split points there would be multiple threads each generating moves for the same position. So I wonder if there would be some gain in having only one thread, the one that 'owns' the node and has opened the hash entry for it, generate moves, and pass those to the threads entering the node later as helpers. This way only one thread would be responsible for maintaining the window and updating the score. Threads that help it would be handed a move, search it, but do nothing with the score other than writing it in the hash-table entry for the daughter, where the owner of the current node would sooner or later stumble on it and integrate it into the score of this current node.

This way the owner of the node could exercise some control over the amount of parallelism allowed in the node. If it is working on a crucial move, like the hash move, or a re-search, (which I will refer to as 'alpha moves' as they are expected to raise alpha), it would hand all helpers that offer themselves that same move. If it is working on late moves (with null-window searches) it could hand them all kind of different moves, on which it is not yet working.

A single move could be advertised to the outside world in the Move field of the hash entry of open nodes. This hash entry would be read anyway by the helper threads (upon which they discover that the node is open, and that they are thus helpers), so we might as well hide as informative a message for them in it as possible.

To export more moves to helper threads, we could requisition the 'always-replace' entry in the same hash bucket. If the bucket contained 6 entries of 10 bytes, plus 4 spare bytes, we could store 7 moves in the spare bytes plus the always-replace entry. The owner of a node could fill those with the next 7 moves of its move list as soon as the node becomes eligible for parallelism (e.g. after its first move is searched). Other threads hitting the node could randomly pick one of those moves for search. When a helper is done, it reads the entry again (the owner might have changed it, and put another list of moves there), and picks another. This could go on until the owner has no more work for the helper. The helper could then return (without a score) to a lower level, to do something useful there, before coming back to fetch the score.
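For concreteness, such a bucket could be laid out roughly like this (just a sketch of the 6 x 10-byte + 4-spare-byte scheme above; the field names and widths are made up, not anyone's actual engine):

Code:

#include <stdint.h>

/* All fields are at most 2-byte aligned so sizeof(HashEntry) is exactly 10. */
typedef struct {
    uint16_t lock_lo, lock_hi;   /* 32-bit signature, split to avoid padding */
    uint16_t move;               /* best / advertised move                   */
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;              /* bound type, 'node is open' bit, ...      */
} HashEntry;                     /* 10 bytes */

typedef struct {
    HashEntry entry[6];          /* 60 bytes; entry[5] is the always-replace
                                    slot the owner can requisition: its 10
                                    bytes hold 5 extra 2-byte moves          */
    uint16_t  spare[2];          /* the 4 spare bytes: 2 more moves          */
} HashBucket;                    /* 64 bytes = one cache line; 5 + 2 = 7
                                    moves can be advertised to helpers       */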
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:
bob wrote:The math looks tough to me. 16K lines mapping into a set that only holds 16 lines is a 1024 to 1 mapping.
Not really. Because 1024 is a large number, randomly assigning the pages becomes the same as randomly deciding per page if it is used or not, with a certain probability. This leads to Poisson statistics for the number of used pages in each set of 1024.
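In formula form: if each of the 1024 pages that alias to a given set is resident with probability p, the number of resident pages per set is binomially distributed, which for 1024 trials is very well approximated by a Poisson distribution:

P(k) = \binom{1024}{k} p^k (1-p)^{1024-k} \approx \frac{\lambda^k e^{-\lambda}}{k!}, \quad \lambda = 1024\,p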
I would claim that _most_ hash probes are cache misses, ...
No disagreement there, but this is not the issue, and it is not relevant. The issue is how long the probed data will stay in L2, and if it is still around when the hash store will want to update it.
I doubt "real programs" are going to fit in L1. No doubt an intentionally small program can, but that won't predict anything about how the real programs are going to perform, which is where we started this discussion...
I cannot comment on that without knowing your definition of "real programs". Is that programs with a rating above 2500? Above 2700? Is Rybka the only "real program"?
My definition of a "real program" is at the very least, a program that has no chance of fitting its "kernel" into L1. And, in fact, most "real programs" tend to really stress L2 significantly, until we get to at least 4M and beyond. Threaded programs in particular are pretty violent in their cache usage because of the significant amount of replicated data structures and the number of them that are created. For example, in Crafty, on 8-way boxes, I have seen > 200 active split points, which just blows any L2 available out of the water.
That is again assuming too much. I probe in the basic search, I don't probe in the q-search. So between probes in the basic search, there is a _lot_ that goes on. From move generation, recursive q-search calls, evaluation, all of which dislodge stuff from L1 and L2 frequently. LRU is not going to keep that old hash entry around long with all the other stuff flowing in and out.
The beauty of it is that it does not assume anything! The point is that to flush something (that can be anywhere) out of a cache with 64K sets you will need about 64K replacements.

There's where we disagree. You are talking "average". I am talking about the bad (not worst, but just bad) cases. It then is not an issue of how many sets, but how many lines per set (ways) since in bad cases, you get lots of aliasing and replacements in a single set. If you have 4K sets (core 2) that gives you 4K opportunities to pile up on one set. The probability of overloading any one set is (I would agree) low. But with large programs, you have 4K chances to overload any one set, which moves the probability right on up into pretty common.
It doesn't matter what these replacements are (other hash probes, table reads, code fetches), and over how many nodes they are distributed, and if they are evenly distributed over those nodes.

After search of a tree large enough to cause 64K cache misses the probed hash entry is likely to be flushed out of cache.
For smaller trees, it is likely to be still in.
So only trees that cause more than 64K cache misses will flush it out, which then requires one extra memory access to get it back in. This cannot drive up the number of memory accesses by more than one part in 64K = 0.0016%.
Same problem with your math there as before, for me. 64K cache misses, running at 5 billion instructions per second, is not very significant. I suppose I could run Crafty for 3 minutes and report the number of L1 and L2 cache misses; it is not terribly difficult to add the inline asm to get that data. However, I really have a very accurate idea, because when we were working with AMD a couple of years back on an 8-way scaling problem, they helped me obtain this data to an incredible degree of accuracy with some internal tools they have. In Crafty, cache misses are _not_ the exception.


Do the measurement!
I'll see if I can find time to do that. In particular, I am going to modify my program to probe twice, once at the start of the search and once at the end, and capture the number of misses on the second probe.
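A self-contained way to approximate such a measurement (as opposed to reading the real miss counters via inline asm or vendor tools) would be to time the second probe with rdtscp and count probes whose latency looks like a trip to DRAM. Names and the 100-cycle threshold below are only placeholders:

Code:

#include <stdint.h>
#include <x86intrin.h>                  /* __rdtscp, _mm_lfence (GCC/Clang, x86) */

extern volatile uint64_t *hash_table;   /* the engine's transposition table      */
extern uint64_t hash_mask;

static uint64_t slow_probes;            /* probes that look like cache misses    */

static void timed_second_probe(uint64_t key)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    uint64_t entry = hash_table[key & hash_mask];   /* the second probe          */
    _mm_lfence();                       /* make sure the load has completed      */
    uint64_t t1 = __rdtscp(&aux);
    (void)entry;
    if (t1 - t0 > 100)                  /* ~DRAM latency; tune per machine       */
        slow_probes++;
}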

More data when I get it done.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:To get back on topic:

It still bothers me that at split points there would be multiple threads each generating moves for the same position. So I wonder if there would be some gain in having only one thread, the one that 'owns' the node and has opened the hash entry for it, generate moves, and pass those to the threads entering the node later as helpers. This way only one thread would be responsible for maintaining the window and updating the score. Threads that help it would be handed a move, search it, but do nothing with the score other than writing it in the hash-table entry for the daughter, where the owner of the current node would sooner or later stumble on it and integrate it into the score of this current node.
That is the usual way of doing this. You end up with a shared move list at the split point, then processors synchronize via a critical section to extract and search the moves one at a time to make sure no processor searches the same move a second time. That was what PVS, DTS and YBW are all about.

The first rule is to not split unless you are pretty sure you are not going to fail high. YBW does that by not splitting until the first branch is completed and value < beta. You can modify this (strengthen it) by not splitting until the first N moves have been searched. That hurts parallel utilization of course, as you have processors idle until that condition has been satisfied.
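In pseudo-C, the strengthened condition reads roughly like this (a sketch only, all names hypothetical; a threshold of 1 gives classic YBW):

Code:

typedef struct Position Position;
typedef int Move;
enum { NO_MOVE = 0, INF_SCORE = 32000 };
#define YBW_MIN_MOVES 1                  /* 1 = classic YBW; larger = stricter   */

extern Move next_move(Position *pos);
extern int  search_move(Position *pos, Move m, int alpha, int beta, int depth);
extern int  idle_threads_available(void);
extern void split(Position *pos, int alpha, int beta, int depth);

int search_node(Position *pos, int alpha, int beta, int depth)
{
    int best = -INF_SCORE, moves_done = 0, split_done = 0;
    for (Move m = next_move(pos); m != NO_MOVE; m = next_move(pos)) {
        int score = search_move(pos, m, alpha, beta, depth);
        if (score > best) best = score;
        if (best >= beta)
            return best;                 /* fail high: this was a CUT node       */
        if (best > alpha) alpha = best;
        moves_done++;

        /* Young Brothers Wait: only offer this node as a split point once the
           first N moves have been searched without a fail high. In a real
           engine, owner and helpers would then draw the remaining moves from
           a shared list protected by a lock.                                   */
        if (!split_done && moves_done >= YBW_MIN_MOVES && idle_threads_available()) {
            split(pos, alpha, beta, depth);
            split_done = 1;
        }
    }
    return best;
}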

DTS uses a similar approach, except that it doesn't just look at the current node; it looks at all active nodes to find whether there is a high-probability-non-CUT node to split at. So all parallel approaches (including the distributed NAG someone mentioned) are based on that idea; it is only the implementation of that idea that leaves some freedom in how it is actually done.

But now you are on the right path. Depending on the hash to control this is simply not viable in an SMP environment. Remember that as you store crap over here, someone has to read it over there, and that results in a ton of cache traffic, since not all of the cache is shared, except perhaps within a single chip like the Core 2's...


This way the owner of the node could exercise some control over the amount of parallelism allowed in the node. If it is working on a crucial move, like the hash move, or a re-search, (which I will refer to as 'alpha moves' as they are expected to raise alpha), it would hand all helpers that offer themselves that same move. If it is working on late moves (with null-window searches) it could hand them all kind of different moves, on which it is not yet working.
Handing them the "same move" is simply not going to work when you have more than a couple of processors. The replicated searches climb quickly and search overhead (extra nodes searched that the single-thread search does not traverse) increases so quickly that the size of the tree begins to increase proportionally to the number of CPUs being added. No point in using 4 processors to search a tree that becomes 4 times bigger than normal, there is no speedup.

hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

bob wrote:I doubt "real programs" are going to fit in L1. No doubt an intentionally small program can, but that won't predict anything about how the real programs are going to perform, which is where we started this discussion...
bob wrote: My definition of a "real program" is at the very least, a program that has no chance of fitting its "kernel" into L1. And, in fact, most "real programs" tend to really stress L2 significantly, until we get to at least 4M and beyond.
Well, with this definition, there is no reason to doubt at all... But if "real programs" can't keep up in performance with "non-real programs", who needs them?

Nevertheless, it still seems childishly easy to me to design an OS that would not suffer from the "unpredictable aliasing problem". Simply allocate memory in blocks of the cache-way size. (So on a Core 2 Duo, 256KB, aligned on 256KB boundaries.) You don't want to swap single 4KB pages anyway; the seek time and rotational latency would be way too expensive, and you can read/write clusters of 256KB to disk in almost the same time. Even a substandard computer with 512MB could run 2048 tasks without swapping, which seems more than enough. There is no reason to go as fine-grained as 4KB just because the hardware allows it. That might have been useful when typical memory sizes were 16MB, and L1 with its 4KB way was the only cache, but nowadays it should just be considered a legacy from those days. With 256KB blocks, allocation would be much faster than on a per-page basis, and the memory management would not touch the 18 low-order address bits, leaving the user in total control of the cache usage. In my book, not doing so is criminal negligence on the part of the OS designer.
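To make the "18 low-order bits" point concrete for the numbers used in this thread (4K sets of 64-byte lines, so a 256KB way), here is the set-index arithmetic as a small sketch:

Code:

#include <stdint.h>

/* L2 with 64-byte lines and 4096 sets (way size 4096 * 64 = 256KB):
   the set index is address bits 6..17, entirely inside the low 18 bits.
   If physical memory is handed out in 256KB-aligned chunks, those bits
   are identical in the virtual and the physical address, so the program
   controls which set its data lands in. */
unsigned l2_set(uintptr_t addr)
{
    return (unsigned)((addr >> 6) & (4096 - 1));
}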
bob wrote:
hgm wrote: The beauty of it is that it does not assume anything! The point is that to flush something (that can be anywhere) out of a cache with 64K sets you will need about 64K replacements.
There's where we disagree. You are talking "average". I am talking about the bad (not worst, but just bad) cases. It then is not an issue of how many sets, but how many lines per set (ways) since in bad cases, you get lots of aliasing and replacements in a single set. If you have 4K sets (core 2) that gives you 4K opportunities to pile up on one set. The probability of overloading any one set is (I would agree) low. But with large programs, you have 4K chances to overload any one set, which moves the probability right on up into pretty common.
I am talking about the average over hash probes. As on a typical search we will be doing millions of hash probes, the actual time used will not differ significantly from the expected average, no matter how much the variance of the time for a single probe is. The sqrt(N) law will see to that.

So in your 'bad case' some sets are heavily overloaded. Good! That means that in such a bad case the average hash probe will survive _longer_ in L2 than in the average case. What maps extra into the overloaded sets will not map into the other sets, and hash probes have just as much probability to go there. It is like firing a machine gun into a crowd. You will make more victims if you spread your fire as wide as possible. If you concentrate your fire, most of your bullets will hit people that are already dead. The only thing that would help you is if the crowd clusters, and you happened to blindly concentrate your fire on that point. But there is no room to cluster if the crowd stands 100 rows thick. The hash table is so big that it overloads the sets by a factor of 512. A 'bad case', like a 3-sigma fluctuation, means you have a 10% larger density of the mapping. That won't make a significant difference: the fact that your other cache replacements heavily concentrate on these sets means that they cause a huge miss rate amongst themselves, so the hash probe misses in these sets are only a small fraction of the total number of misses in those sets.
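(To put a number on such a fluctuation: the standard deviation of a Poisson count with mean 512 is \sqrt{512} \approx 23, so a 3\sigma excess is roughly 3 \times 23 \approx 68 extra lines on 512, i.e. an excess density on the order of 10-15%.)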
Same problem with your math there as before, for me. 64K cache misses, running at 5 billion instructions per second, is not very significant.
What is that supposed to mean? You need 64K misses to cause one new one. That is 0.0016% extra. What does the number of instructions per second have to do with that???
I'll see if I can find time to do that. In particular, I am going to modify my program to probe twice, once at the start of the search and once at the end, and capture the number of misses on the second probe.

More data when I get it done.
Great. Be careful, though: normally you would do a write to the cache line when you are done, and probing the line first makes sure that the write after it hits the cache. So I hope you can distinguish the write misses from the read misses...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:
bob wrote:I doubt "real programs" are going to fit in L1. No doubt an intentionally small program can, but that won't predict anything about how the real programs are going to perform, which is where we started this discussion...
bob wrote: My definition of a "real program" is at the very least, a program that has no chance of fitting its "kernel" into L1. And, in fact, most "real programs" tend to really stress L2 significantly, until we get to at least 4M and beyond.
Well, with this definition, there is no reason to doubt at all... But if "real programs" can't keep up in performance with "non-real programs", who needs them?
First let's try a match between your non-real program and my real-program, to see if there is any advantage to a program that is optimized to fit L1 rather than optimized to play the best possible game of chess. That's my point about "real programs". To play the game of chess well, you need a pretty complex search and evaluation, and then when you toss in multiple threads, the hope for fitting that into L1 is nil, and it is a challenge to get the important parts to fit into L2 in fact...

There has to be something to go along with the "speed" or it is not worthwhile.


Nevertheless, it still seems childishly easy to me to design an OS that would not suffer from the "unpredictable aliasing problem".
The solution is known, the name of the algorithm is called "page coloring" and the idea is that you set up a map of the cache sets, and each time you allocate a physical page of RAM, you "color" the block in the map that represents the set it maps to. And you will never allocate another physical page of ram that maps to any "colored" map segment, until all map segments have been colored in (in which case you start over) or until there is no remaining physical RAM available that maps into non-colored cache sets.

But it is expensive, and in an operating system, expensive is bad. That's why it isn't done in production systems.
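A toy sketch of that bookkeeping, just to illustrate the idea (nothing like real kernel code, and all names are made up):

Code:

#include <stdbool.h>
#include <string.h>

#define NCOLORS   64       /* (cache way size) / (page size), e.g. 256K / 4K    */
#define NPAGES  4096       /* physical pages managed (toy number)               */

static bool color_used[NCOLORS];   /* the "map" of colored cache sets           */
static bool page_free[NPAGES];
static int  colors_left = NCOLORS;

void init_pages(void)
{
    for (int p = 0; p < NPAGES; p++)
        page_free[p] = true;
}

static int color_of(int page) { return page % NCOLORS; }

static void reset_colors(void)
{
    memset(color_used, 0, sizeof color_used);
    colors_left = NCOLORS;
}

/* Allocate a physical page, preferring one whose color is not yet used.        */
int alloc_colored_page(void)
{
    for (int pass = 0; pass < 2; pass++) {
        for (int p = 0; p < NPAGES; p++) {
            if (page_free[p] && !color_used[color_of(p)]) {
                page_free[p] = false;
                color_used[color_of(p)] = true;
                if (--colors_left == 0)
                    reset_colors();        /* all colors used: start over        */
                return p;
            }
        }
        reset_colors();   /* only pages of already-used colors remain: start over */
    }
    return -1;            /* genuinely out of memory                             */
}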


hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

bob wrote:But now you are on the right path. Depending on the hash to control this is simply not viable in an SMP environment.
Well, I am not aware that I changed path, it seems to me this is what I proposed from the beginning. The only change was that I proposed to lean even more on the hash table for passing messages, and you don't seem to like that.

What I propose seems to be more about implementation than actual strategy. PVS, YBW and DTS differ in where they allow split points to occur. Assigning processors to split points seems an independent matter.

What I proposed was basically letting idle processors run up and down the lines that are currently being searched in order to find something to do. It knows the line from the hash; otherwise it isn't doing much. The 'open' (=active) nodes form a tree at any time during the parallel search, and as soon as a possible split point occurs somewhere (according to whatever strategy you like), the owner of that split point will indicate it in the hash entry of the open node, which was already in its cache. The CPUs looking for work will stumble on it there (causing some cache flushing, but a message had to be passed through memory to hand them work anyway, and this might well be that message). The tree of active nodes then gets 'loose ends', and if one of the CPUs hunting for work hits upon such a loose end, it will have found work without the need for further message passing.

If they happen to pick a move that another CPU also picked (a small probability if they can pick from 7 moves), they will just continue hunting for work along the line followed by that CPU, and they might very well find some. If not, they quickly return from that line, and try another move at the split point, or move to another split point. If they randomly decide what to do, they will diffuse out over the active part of the tree, which quickly makes the chance that any two of them will do the same thing negligibly small. During most of the search the active part of the tree will have many loose ends. A CPU that becomes idle because the search of a node it owned has finished will not have to travel very far along the tree to find a new loose end.
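A sketch of such a helper loop, with hypothetical names (the owner's side, which fills the advertised-move slots and later collects the daughter scores, is omitted):

Code:

#include <stdlib.h>

typedef struct Position   Position;      /* opaque engine types                   */
typedef struct HashBucket HashBucket;
typedef int Move;

extern int  read_advertised_moves(const HashBucket *b, Move out[7]);
extern void make_move(Position *pos, Move m);
extern void unmake_move(Position *pos, Move m);
extern int  search(Position *pos, int alpha, int beta, int depth);
                     /* search() stores its result in the daughter's hash entry   */

/* Helper thread arriving at an open node: keep picking advertised moves at
   random and searching them until the owner has no more work here.              */
void help_at(Position *pos, HashBucket *b, int alpha, int beta, int depth)
{
    Move moves[7];
    int n;
    while ((n = read_advertised_moves(b, moves)) > 0) {
        Move m = moves[rand() % n];        /* random pick: collisions are unlikely */
        make_move(pos, m);
        (void)search(pos, -beta, -alpha, depth - 1);
        unmake_move(pos, m);
        /* Re-read the bucket: the owner may have advertised fresh moves.         */
    }
    /* Nothing left here: wander back along the tree of open nodes for other
       loose ends, returning later so the owner can pick up the scores.           */
}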

As remarked by Zach, what is lacking is a way to abort other searches in case of cutoffs.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

bob wrote:First let's try a match between your non-real program and my real-program, to see if there is any advantage to a program that is optimized to fit L1 rather than optimized to play the best possible game of chess. That's my point about "real programs". To play the game of chess well, you need a pretty complex search and evaluation, and then when you toss in multiple threads, the hope for fitting that into L1 is nil, and it is a challenge to get the important parts to fit into L2 in fact...

There has to be something to go along with the "speed" or it is not worthwhile.
This is why I had expected a definition in terms of strength, rather than in terms of size.

I would be more than willing to take up your challenge if your program, irrespective of its size, were also only 9 months old. As it is, too many essential parts are still missing from Joker to make such a match meaningful, other than that you would learn that a program without evaluation would quickly lose to a program with a finely tuned evaluation including positional terms. Not very interesting.

I will let you know when I have written an evaluation.

The solution is known, ...

But it is expensive, and in an operating system, expensive is bad. That's why it isn't done in production systems.
Well, my solution was cheap, and just as effective.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:
bob wrote:First let's try a match between your non-real program and my real-program, to see if there is any advantage to a program that is optimized to fit L1 rather than optimized to play the best possible game of chess. That's my point about "real programs". To play the game of chess well, you need a pretty complex search and evaluation, and then when you toss in multiple threads, the hope for fitting that into L1 is nil, and it is a challenge to get the important parts to fit into L2 in fact...

There has to be something to go along with the "speed" or it is not worthwhile.
This is why I had expected a definition in terms of strength, rather than in terms of size.

I would be more than willing to take up your challenge if your program, irrespective of its size, were also only 9 months old. As it is, too many essential parts are still missing from Joker to make such a match meaningful, other than that you would learn that a program without evaluation would quickly lose to a program with a finely tuned evaluation including positional terms. Not very interesting.

I will let you know when I have written an evaluation.
That was my point. By the time you have "written an evaluation" you are not going to be talking about your main engine (kernel) fitting into L1 cache. It will begin to stress L2 in fact. And then things change with respect to longevity of hash entries in L2.


The solution is known, ...

But it is expensive, and in an operating system, expensive is bad. That's why it isn't done in production systems.
Well, my solution was cheap, and just as effective.
It was far from practical. Allocating 256K blocks of memory is a non-starter. It leads to massive memory wastage, and makes no sense with respect to paging, since we page in single pages, not huge blocks. Any good O/S book will discuss the page size dilemma. Today's page sizes of 2K-4K did not happen by random choice...

But one thing is for sure, when I try to malloc() a small object, I don't want 256K (or larger as L2 grows) chunks...
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Well, the eval I have in mind doesn't really use much memory at all. 32KB is huge. Adding a few Piece-Square Tables for each piece type (16 x 64 bytes = 1KB) would not kill me. But playing a match without them will just tell you next to nothing.
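(In C terms that is nothing more than, say:)

Code:

signed char pst[16][64];   /* 16 piece types x 64 squares x 1 byte = 1 KB */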

About paging:

When you allocate a small object, you are a mere user, and it will be allocated within your user address space with whatever granularity you like. When your total demand on memory exceeds a 256KB boundary through the allocation, the OS will just allocate a new 256KB chunk to you, which will be able to satisfy your demand for small objects for a long time.
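A toy sketch of that user-side picture (grab_chunk_256k() stands in for whatever call the hypothetical OS would provide; objects larger than a chunk are not handled):

Code:

#include <stddef.h>
#include <stdint.h>

#define CHUNK (256 * 1024)

extern void *grab_chunk_256k(void);      /* hypothetical OS interface            */

static uint8_t *chunk;                   /* current 256 KB chunk                 */
static size_t   used = CHUNK;            /* forces a chunk on the first call     */

/* Hand out small objects from the current chunk; only when the demand crosses
   a 256 KB boundary is the OS asked for a new aligned chunk.                    */
void *small_alloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;            /* 8-byte align the object              */
    if (used + n > CHUNK) {
        chunk = grab_chunk_256k();
        used = 0;
    }
    void *p = chunk + used;
    used += n;
    return p;
}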

The idea that allocating in 256KB chunks would 'waste' memory is seriously flawed thinking. Trying to prevent this 'wastage' by packing the used fraction densely and collecting all the unused memory into one contiguous block would still leave the memory unused, and thus just as wasted. That you _could_ run other programs in that space is worth zilch if you don't have other tasks that _want_ to run. You are just trying to 'save' memory in order to leave it idle...

If I look in my task manager, it turns out that almost any task uses about 2MB of memory, and there are only 5 (of 32) that use less than 1MB (the smallest using 168KB). It doesn't seem a significant fraction of memory would be wasted by rounding everything up to 256KB multiples.

So we don't have 4KB pages for nothing? Are you claiming that the optimal page size is independent of total memory size and hard-disk I/O characteristics? That a page size that is optimal for a 4MB memory is also optimal for a 4GB memory? And that you would like to swap in equal chunks to a disk of 40MB as to a disk of 320GB? This seems rather unlikely, and I don't believe it for a minute!
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Something entirely different:

If I am searching in parallel, and one of my active searchers follows a move that was not taken by any other thread, but hits on a position that is already being searched by other threads...

What do you do? Just 'duplicate' the work? The result might not really be the same, as the two searchers apparently came to the node through different paths, so rep-draw effects might be different. But you wouldn't worry about that if the other search had already finished and you learned its result through a hash hit. So why would you worry about it now?

You cannot wait for the other search to finish (i.e. find other work to do in the meantime), because that might cause deadlocks.

One way out could be through IID: never allow a node to be open for an N-ply search before the result of an (N-1)-ply search is known and stored in the hash table (where it will remain stored during the N-ply search). This would prevent loops of nodes waiting for each other (provided 4 extensions of a full move in a row are not possible).
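A sketch of that rule, with hypothetical names (the point is only the gate in front of the N-ply search):

Code:

#include <stddef.h>

typedef struct Position  Position;       /* opaque engine types                  */
typedef struct HashEntry HashEntry;

extern HashEntry *hash_probe(const Position *pos);
extern int  entry_depth(const HashEntry *e);
extern int  search(Position *pos, int alpha, int beta, int depth);

/* Never open a node for an N-ply search before an (N-1)-ply result is in the
   hash table; do the (N-1)-ply search ourselves (IID) if it is not there yet.   */
int search_with_iid_gate(Position *pos, int alpha, int beta, int depth)
{
    HashEntry *e = hash_probe(pos);
    if (depth > 1 && (e == NULL || entry_depth(e) < depth - 1))
        (void)search(pos, alpha, beta, depth - 1);   /* fills the hash entry     */
    return search(pos, alpha, beta, depth);          /* the real N-ply search    */
}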