Suggestion for hash tables and analysis.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by Zenmastur »

bob wrote:2mb pages don't require any shutdowns in the world of Linux. Nor does 1gb pages. Only problem is, currently, 1gb pages have to be established at boot time and they are forever 1gb pages. But no fragmentation issues that require restarts, still.

I am not convinced that a special table for just the last few plies is worthwhile.
bob wrote:If you have a heavily oversubscribed tt, I am also not sure that this is the answer, either. Seems to me that the best approach is always a good overwrite decision more than something simpler such as a separate tt.
Well, you said yourself that this is where most of the nodes are. The problem is that these nodes overwrite entries. You said that you always write data to the TT at every node, so by your own admission you will be overwriting, and therefore losing, large quantities of data in a heavily oversubscribed TT. There is no replacement policy that can prevent this from happening other than to NOT WRITE new nodes to the TT, and if you discard them they do you no good at all. You and several other posters seem to think that there is some policy that can remove the deleterious effects of this “forced” loss of data. And yet, no one has offered any suggestion as to what this policy might be. “Being careful” is not a usable policy. It would be a simple matter to post the policy criteria, or even sample code, if such a policy existed.

My personal opinion is that no one has offered this information because they don't know of any policy with these properties, and the reason no one knows of such a policy is that it doesn't exist.

I'm not saying that commonly used policies don't help, but rather that the help they provide is linear or multiplicative while the problem is exponential. Of course, there are linear solutions to exponential problems that work, but only if the problem size is small enough. So, while current policies work when the node count only modestly exceeds the TT entry count, as the over-subscription factor increases they become less and less effective at mitigating the effects of data loss. You may not be sure my approach is correct, but I'm positive that a policy change will not be a member of the solution set for the size of problems I'm interested in. But if you, or anyone else, have a miracle policy that you've been hiding all these years, please feel free to post it in this thread.
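For reference, the kind of “good overwrite decision” being discussed is usually a depth-preferred bucket scheme with an age counter, roughly like the sketch below. All names, sizes and the exact victim-cost formula are made up for illustration; this is not claimed to solve the oversubscription problem being argued about here.

#include <cstdint>

// Hypothetical 16-byte entry; four of them share one 64-byte cache line.
struct TTEntry {
    uint32_t key32;   // upper bits of the Zobrist key, for verification
    uint16_t move;
    int16_t  score;
    int16_t  depth;   // draft of the stored search; 0 = empty slot
    uint8_t  bound;   // exact / lower / upper
    uint8_t  gen;     // search generation (age)
    uint32_t pad;
};

struct TTBucket { TTEntry e[4]; };

// Choose which slot to overwrite: reuse an empty or matching slot if possible,
// otherwise evict the entry that is from an old search or has the smallest draft.
TTEntry* select_victim(TTBucket& b, uint32_t key32, uint8_t curGen) {
    TTEntry* victim = &b.e[0];
    for (TTEntry& e : b.e) {
        if (e.depth == 0 || e.key32 == key32)
            return &e;
        int vCost = (victim->gen == curGen) * 256 + victim->depth;
        int eCost = (e.gen == curGen) * 256 + e.depth;
        if (eCost < vCost)      // cheaper to lose: stale or shallow entries first
            victim = &e;
    }
    return victim;
}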
bob wrote:I can transpose near a tip, with a position near the root, just as easily as I can transpose between branches near the tip.
I'm not sure you can transpose just as easily from root to tip. Long transpositions, while clearly possible, seem harder. There are several different types of transpositions. (e.g. transpositions that are of different lengths, transpositions that simply reorder some or all the moves, transpositions in which few or none of the moves in one variation are found in other variations, etc.) Different types occur more frequently in different parts of the game. Some types are easier to construct than others, so I suppose they are more prevalent than other types. Certain types of moves restrict transpositions. Even the complexity of the position influences the occurrence of transpositions.

My problem is that I'm not familiar enough with graph theory and how it relates to chess trees to know or prove whether your statement is true or false. My intuitive sense is that it is true in some circumstances and false in others. In late middle-game and endgame positions I think your statement is probably true, AND many, many of these transpositions are actually playable. Since so many of them are sound, they will be searched and will appear in the TT in very large numbers. This is one of the reasons, although not the only reason, that the TT hit rate goes up so much during endgames. It's not that there are more of them in an endgame; it's that many become playable, and even important, in an endgame.

The opening and middle game are a completely different story. The types of moves that restrict transpositions seem more prevalent there. Although there are many transpositions here as well, they are much more constrained. I think one of the differences is that the complexity of the positions keeps most of the possible transpositions from being playable. If they're unsound, LMR (or other methods) will see that they never appear, or at least defer them until much deeper searches are being made. I think, but can't prove, that long transpositions in the later opening and the middle game have less influence because they're fewer in number, still fewer are sound and therefore playable, so fewer will be found in the TT.

The most important point:

The most important transposition parameter is not its length; it's the distance from the ply that completes the transposition to the end of the search. That is why I don't really care much about long transpositions.

So, let's take a 45-ply search from the most complicated position we can find, with the transposition starting at ply 1 and ending at ply 42. Ply 42 is the first ply that would be directed into the small part of a split TT that has the last four plies of the search directed to it. If this transposition is not found it will cost you, at most, 483 nodes of searching (see the last column of the table I posted earlier in the discussion; those node counts are from the most complicated “real” position I could find on short notice, taken from a correspondence game). 483 nodes is not a big deal. Now let's shorten it by one ply. The ending ply of this shorter transposition will be found in the L2 TT, and finding it will save about 2,173 nodes of searching.

Now, suppose either of these searches were conducted using a “normal” TT that is heavily oversubscribed, say 250 to 1. (I chose this number because any correspondence player who lets their computer run overnight or while they're at work will see this happen almost every day of the week; that's also why I chose 45 plies as the search depth.) What are the chances of either transposition being found? My suspicion is less than 1% for each of them, since these nodes are at the bottom of the search tree and entries there are overwritten as fast as you can process nodes. So (483 + 2,173) * 1% = ~27 nodes saved on average.

Now with a split TT the oversubscription is reduced, in this case by a factor of 2,173, so we go from a highly oversubscribed state to an undersubscribed one: we will be filling the L2 TT to about 11.5% of capacity. There will still be some overwriting, but only a fraction of a percent. I'm going to take your position and assume that ALL transpositions that cross the L1 TT boundary are missed. So (0 + 2,173) * 99% = ~2,151 nodes saved on average.

Now, if we start moving the end node of the transposition towards the root (i.e. shortening it one ply at a time), the number of nodes saved when the transposition is found grows per ply by a factor that is approximately the EBF of the engine. The hit rate of the L2 TT will stay about the same regardless, because it's undersubscribed. The “normal” TT's hit rate will go up as the end node approaches the root, because those nodes get preferential treatment due to search depth. At ply 3 the savings difference between the two will be almost zero. It's hard to say at which ply the largest savings will occur because it changes with search length, type of position, and complexity, but it will tend to be in the middle of the search; for this search it should be between ply 15 and ply 30. The smallest savings will be in the least complex positions, the largest in the most complicated positions, like this example.
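For anyone skimming, the back-of-the-envelope comparison above boils down to the following sketch; the node counts and the 1%/99% hit rates are the stated assumptions from the paragraph, not measurements.

#include <cstdio>

int main() {
    // Numbers copied from the 45-ply example above.
    double missAtPly42 = 483;    // nodes lost if the ply-42 transposition is missed
    double missAtPly41 = 2173;   // nodes lost if the ply-41 transposition is missed

    double normalTT = (missAtPly42 + missAtPly41) * 0.01;  // ~27 nodes saved
    double splitTT  = (0 + missAtPly41) * 0.99;            // ~2,151 nodes saved

    printf("normal TT: ~%.0f nodes saved, split TT: ~%.0f nodes saved\n",
           normalTT, splitTT);
    return 0;
}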
bob wrote:My response was primarily related directly to the question(s) asked of me, starting with NUMA. Which is a completely different issue from what gets overwritten or not...
Yeah... I should have given you a little intro before inviting you in. I was interested in your opinion on NUMA since I knew you had extensive knowledge of the subject. I figured if it works for a single-CPU system it would be killer on a NUMA system. We got a bit sidetracked on “will it work”.

In the end, I think it's complicated enough that without trying it there is no way to know for sure. I don't have any illusions about this working on the first attempt. If it's to work as I think it could, it will have to be well thought out. But you opened my eyes on a few details that I hadn't considered, so that's good. I have better ideas about certain aspects, things to look out for, and new things to consider. So, all in all, I think it was constructive. One of the reasons I asked you specifically was that I wanted an unbiased opinion from someone who had reason to know these types of things, and I think that's what I got. So I'm good with it.
syzygy wrote:
Zenmastur wrote:You only probe the L1 TT in the last part of the search because all the data from the last part of the search is written to this TT.
What do you mean by "last part of the search" ?

Do you mean the last N ply of the search?
Precisely!
syzygy wrote:That's an enormous amount of nodes. The L1 would cover only a uselessly small fraction of it. You'll have practically no hits.
Most of the nodes that will be searched are in the last N plies of the search. That is why I chose to split it where I did.
The only way a position from an earlier part of the search can match a position in the later part of the search, is if no pawn moves or captures have been made.
syzygy wrote:The search tree is a tree, not a single branch.
And your point is what exactly?
hgm wrote:Something accessed as infrequently as a hash-table entry would never survive in L1 anyway. There is so much data in a chess engine that is accessed more frequently than once per node, and it would just flush it out of L1. Even PST and Zobrist key tables will have multiple accesses per node (from- and to-square, victim) focused in a smaller area.
The number of accesses to ANY array is almost irrelevant. The only thing that matters is the number of cache-lines that will be dragged in from memory. This specifically excludes references to cache-lines already in the cache.

The only way to tell is to actually look at the code, which I haven't done yet. But I'm getting the feeling that you're making a mountain out of a molehill. If you can give me a specific example of an array or structure that you believe will be accessed enough to drag in sufficient cache lines, it would narrow the search when I go to see what's what.

Regards,

Forrest
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by hgm »

Zenmastur wrote:The number of accesses to ANY array is almost irrelevant.
Not really, because most caches use a pseudo-LRU replacement policy. This effectively protects frequently-accessed data from being replaced, as they will never be "least-recently used". This will focus all infrequently used data to one cache way. So what matters is the 'access density', i.e. the access frequency per table element.
But I'm getting the feeling that you're making a mountain out of a molehill. If you can give me a specific example of an array or structure that you believe will be accessed enough to drag in sufficient cache lines, it would narrow the search when I go to see what's what.
Zobrist keys (a 2 x 6 x 64 x 8 = 6KB table in Chess) are accessed in different locations once per node for the to-square of the various moves, even if we ignore the access for from-square and victim. That is a higher access density than one probe per node in a 32KB L1-TT, so the Zobrist table would flush the latter out of the cache.

If you use magic bitboards, move generation for a Rook would access tables for the attack boards with mostly 2K or 1K entries (i.e. 16KB or 8KB) for the Rook on its current location (because that location does not change nearly as fast as the occupancy). That is also a higher access density than you could expect for an L1 hash, and the tables are expected to be used for other purposes than move generation alone. And you could have two Rooks.
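For concreteness, the per-node access pattern being described is the standard incremental Zobrist update, roughly like this generic sketch (layout and names are illustrative, not any particular engine's code):

#include <cstdint>

// 2 colors x 6 piece types x 64 squares x 8 bytes = 6 KB, as above.
// Real engines add castling / en-passant / side-to-move keys as well.
uint64_t zobrist[2][6][64];

// One make/unmake of a capture XORs three scattered 8-byte entries of this
// table into the hash key: mover-from, mover-to and victim. That is the
// per-node access density being compared with a single TT probe.
uint64_t update_key(uint64_t key, int side, int piece,
                    int from, int to, int victim /* piece type, or -1 */) {
    key ^= zobrist[side][piece][from];
    key ^= zobrist[side][piece][to];
    if (victim >= 0)
        key ^= zobrist[side ^ 1][victim][to];
    return key;
}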
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by bob »

Zenmastur wrote: (full post quoted above)
You should test your idea with fine position # 70. That has zillions of transpositions from root to tips and everywhere in between...

As far as data goes, however, what about the following:

(1) bit board data.

(2) repetition list

(3) magic move generation tables (there are multiple tables accessed per individual sliding piece move generation)

(4) Zobrist random number hash signature values

(5) the giant stack caused by a recursive search which includes all the local data used by each recursive instantiation.

(6) history heuristic counters (1K x 32 bits)

(7) counter moves (4K x 32 bits)

(8) Etc... (many more smaller arrays and scalar values, which are used repeatedly to evaluate a single position).

A hash entry has no chance of surviving very long, nor should it since a specific hash entry is highly unlikely to be needed for the very next node searched.
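To make items (6) and (7) concrete, they look roughly like the sketch below and are touched on nearly every move-ordering decision; the sizes match the list above, while the indexing layout and the bonus scheme are only plausible assumptions, not taken from any specific engine.

#include <cstdint>

// 1K x 32-bit history counters (4 KB) and 4K x 32-bit counter moves (16 KB).
int32_t  history[16][64];        // e.g. [piece][to-square] = 1024 counters
uint32_t counterMove[64][64];    // e.g. [previous from][previous to] = 4096 slots

// Both tables are read during move ordering for nearly every generated move and
// written on every beta cutoff, so their hot regions are touched far more often
// per node than any single hash entry ever will be.
void on_beta_cutoff(int piece, int to, int prevFrom, int prevTo,
                    int depth, uint32_t move) {
    history[piece][to] += depth * depth;   // a common (hypothetical) bonus scheme
    counterMove[prevFrom][prevTo] = move;
}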
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by syzygy »

Zenmastur wrote:
syzygy wrote:
Zenmastur wrote:You only probe the L1 TT in the last part of the search because all the data from the last part of the search is written to this TT.
What do you mean by "last part of the search" ?

Do you mean the last N ply of the search?
Precisely!
syzygy wrote:That's an enormous amount of nodes. The L1 would cover only a uselessly small fraction of it. You'll have practically no hits.
Most of the nodes that will be searched are in the last N plies of the search. That is why I chose to split it where I did.
And so in 99.9% of the nodes you'll be probing a TINY TT table giving you practically no hits. You may as well disable TT probing altogether in those last N plies.
syzygy wrote:
The only way a position from an earlier part of the search can match a position in the later part of the search, is if no pawn moves or captures have been made.
The search tree is a tree, not a single branch.
And your point is what exactly?
It seems you are thinking that the "earlier part of the search" are the nodes from the current node being searched up to the root.

I guess I'm a bit at a loss as to what your mental image of the search process could possibly be.

Anyway, just implement your idea in SF and see if it works. Will take a lot less time than convincing anyone here...
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by Zenmastur »

bob wrote: You should test your idea with fine position # 70. That has zillions of transpositions from root to tips and everywhere in between...
I have considered what happens in the endgame, and specifically fine #70, since it's well known and exercises the cache like few other positions. I expect this cache to do slightly worse on endgame positions as a whole. It's very hard to oversubscribe a TT with fine #70 unless you have a very small TT, so I don't expect it to have stellar performance in this type of position, UNLESS the position can oversubscribe the TT by about 25 to 1. But who cares? These types of positions aren't a problem to start with.
bob wrote:As far as data goes, however, what about the following:

(1) bit board data.

(2) repetition list

(3) magic move generation tables (there are multiple tables accessed per individual sliding piece move generation)

(4) Zobrist random number hash signature values

(5) the giant stack caused by a recursive search which includes all the local data used by each recursive instantiation.

(6) history heuristic counters (1K x 32 bits)

(7) counter moves (4K x 32 bits)

(8) Etc... (many more smaller arrays and scalar values, which are used repeatedly to evaluate a single position).

A hash entry has no chance of surviving very long, nor should it since a specific hash entry is highly unlikely to be needed for the very next node searched.
I had considered 2, 3, and 4 at some length and some of the others to a lesser extent. The one I missed completely was 5. I don't use any recursive, re-entrant code, but this would be a problem for those that do, which probably means most engines.

So, let's assume for a moment that I'm wrong and none of the L1 TT is ever in any cache except right before it's used, and that it's overwritten before the data is needed again. That is exactly what happens to a large but otherwise “normal” TT. We haven't lost anything; we just didn't avoid the main-memory access time. We still gain quite a bit: the TLB will NOT be thrashing like a wild man who's been set on fire (the TLB page-table walk actually takes longer than the main-memory access itself), we save a lot of NUMA traffic when running on a NUMA machine, and we have taken a heavily oversubscribed TT and improved its performance. The last item is the only one I really care about. If it comes with some perks, all the better! If not, I don't really care, as long as I solve the overloaded TT problem.
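As a side note on the TLB point, backing the TT with 2 MB pages on Linux (the mechanism mentioned at the top of the thread) looks roughly like this sketch; error handling, alignment and NUMA placement are deliberately omitted, and the fallback simply asks the kernel for transparent huge pages.

#include <sys/mman.h>
#include <cstddef>

// Linux-only sketch: try explicit huge pages first, then best-effort THP.
void* alloc_tt(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;                          // explicit huge pages were available
    p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);      // ask for transparent huge pages
    return p;
}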

There are actually some advantages to not having to restrict the size of the L1 TT so that it “could” fit into the L1 or L2 cache, but I have no doubt that this would lead to yet another sidetrack. So, to avoid getting sidetracked on something that is much less important, I'm going to ignore them.

hgm wrote:
Zenmastur wrote:The number of accesses to ANY array is almost irrelevant.
Not really, because most caches use a pseudo-LRU replacement policy. This effectively protects frequently-accessed data from being replaced, as they will never be "least-recently used". This will focus all infrequently used data to one cache way. So what matters is the 'access density', i.e. the access frequency per table element.
But I'm getting the feeling that you're making a mountain out of a molehill. If you can give me a specific example of an array or structure that you believe will be accessed enough to drag in sufficient cache lines, it would narrow the search when I go to see what's what.
Zobrist keys (a 2 x 6 x 64 x 8 = 6KB table in Chess) are accessed in different locations once per node for the to-square of the various moves, even if we ignore the access for from-square and victim. That is a higher access density than one probe per node in a 32KB L1-TT, so the Zobrist table would flush the latter out of the cache.

If you use magic bitboards, move generation for a Rook would access tables for the attack boards with mostly 2K or 1K entries (i.e. 16KB or 8KB) for the Rook on its current location (because that location does not change nearly as fast as the occupancy). That is also a higher access density than you could expect for an L1 hash, and the tables are expected to be used for other purposes than move generation alone. And you could have two Rooks.
The caches are also 8-way set associative, and anything in the L1 or L2 cache is also in the L3 cache; anything evicted from L1 or L2 will still be in L3. Even a 4-core CPU like the E5-1630 has a 10MB L3 cache, and the 4-core E7-8893 has 45MB. I understand that the last-level cache on some Skylake CPUs is 128MB of eDRAM; while not as fast as static RAM, it's still faster than going to main memory. But in the end I don't really care, since it will have no effect on the oversubscription rate of the L2 TT, which is the sole purpose for doing this.

The Zobrist array is too small to dominate any cache. The magics are larger than L1 and L2 but easily fit in L3. They're not randomly accessed or modified, and they're shared between all cores, so they should always be in the L3 cache at worst. Their access pattern is restricted by the piece locations on the board, which will not change greatly during a 4-ply search. They will change more during the Q-search, but I don't think even that will cause them to dominate the L1 and/or the L2 cache.

I don't think it's worth the amount of effort required to argue the point one way or the other.

Regards,

Forrest
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by bob »

Zenmastur wrote: (full post quoted above)
Only thing I will add is that in Intel, what is in L1 is guaranteed to be in L2 and L3. NOT with AMD. It uses an exclusive cache model where nothing is replicated anywhere...

In my case, I also do not hash in the q-search at all, so I don't see nearly the memory traffic one would see if hash probing was done there...
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by Zenmastur »

bob wrote:Only thing I will add is that in Intel, what is in L1 is guaranteed to be in L2 and L3. NOT with AMD. It uses an exclusive cache model where nothing is replicated anywhere...
I think I've convinced myself that it makes no difference whether the L1 TT resides in a cache or not. After all, ¾ of the savings comes from avoiding TLB page-table walks. So even if you are running on an AMD CPU you're still likely to see significant savings. These savings are not critical to an L1 TT improving the performance of an oversubscribed TT; it's going to work regardless.
bob wrote:In my case, I also do not hash in the q-search at all, so I don't see nearly the memory traffic one would see if hash probing was done there...
This may be true, but consider this: Crafty 25.x processes nodes at a very high rate. So high, in fact, that if it were participating in stage 1x of TCEC with a 1-billion-entry TT (i.e. 16GB) it would be processing nodes at about 100 Mnps. That is sufficient to oversubscribe the TT by 10 to 1 in 100 seconds, which I believe is enough to see benefits from. So some benefit may be had at tournament time controls, and anything longer than this strongly favors this type of TT.
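The arithmetic behind that claim, assuming roughly one TT store per node:

#include <cstdio>

int main() {
    // Figures from the paragraph above: ~1 billion TT entries (16 GB) and an
    // assumed ~100 Mnps search speed.
    double entries = 1e9, nps = 100e6, factor = 10.0;
    printf("time to reach %.0fx oversubscription: %.0f s\n",
           factor, entries * factor / nps);   // = 100 seconds
}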
syzygy wrote:
Zenmastur wrote:
syzygy wrote:
Zenmastur wrote:You only probe the L1 TT in the last part of the search because all the data from the last part of the search is written to this TT.
What do you mean by "last part of the search" ?

Do you mean the last N ply of the search?
Precisely!
syzygy wrote:That's an enormous amount of nodes. The L1 would cover only a uselessly small fraction of it. You'll have practically no hits.
Most of the nodes that will be searched are in the last N plies of the search. That is why I chose to split it where I did.
And so in 99.9% of the nodes you'll be probing a TINY TT table giving you practically no hits. You may as well disable TT probing altogether in those last N plies.
I guess I'm at a bit of a loss as to what you're thinking. It appears as if you're grossly underestimating the power of small TTs. Before systems had gigabytes of RAM, small TTs were the norm, and their performance was highly disproportionate to their size. Or to put it another way, current TTs operate in the zone of highly diminished returns compared to the resources afforded them. In short, I think you're barking up the wrong tree.
syzygy wrote:
The only way a position from an earlier part of the search can match a position in the later part of the search, is if no pawn moves or captures have been made.
The search tree is a tree, not a single branch.
And your point is what exactly?
syzygy wrote:It seems you are thinking that the "earlier part of the search" are the nodes from the current node being searched up to the root.

I guess I'm a bit at a loss as to what your mental image of the search process could possibly be.
Hmmm...

Well, if you want to reword the bolded statement so a third party has a reasonable chance of deciphering what you mean I might be able to respond. It seems like you are trying to say that not all branches in the earlier part of the search will lead to the node currently being searched. If that's not what you meant then you need to clarify your thoughts. If that's what you're saying, then yes, I'm well aware of this fact. But this gives me no clue as to your point.

If you're confused by something I wrote, or I failed to explain some aspect with sufficient clarity for you to understand it, you can ask, I'll clarify my statements, and then you'll no longer have reason to be at a loss.
syzygy wrote:Anyway, just implement your idea in SF and see if it works. Will take a lot less time than convincing anyone here...
I'm not trying to convince anyone. I'm trying to do a mental walk-through in public. This way if I miss something important or obvious other people will, no doubt, point it out in no uncertain terms.

I figure if someone from the Stockfish development team thinks this is worth trying they have sufficient resources to do it without my assistance. They're also much more familiar with SF's code than I.

Their testing framework doesn't seem conducive to this type of patch. A very cursory analysis: a 10-second game with a 0.1-second increment doesn't lead to oversubscribed TTs, so they would be very unlikely to see any benefit from this. Even if they reduced the TT size to force oversubscription it wouldn't be the same. Four plies is ~10% of a 45-ply search. You need 3 plies minimum for a transposition, so your transposition space is ~38 plies; 38/45 = ~84% of the search depth. I haven't looked at how long the searches would be at 10+0.10, but let's assume 12 plies: 12 - 4 - 3 = 5 plies, or ~41% of the search depth. The number of nodes saved by transpositions is exponential with depth, so for SF, 1.6^5 / 1.6^38 = 0.000000183. The number of transpositions in the shorter search will, of course, be much smaller as well. So to have a reasonable test the games need much deeper searches, which seems unlikely under the current testing framework.
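Spelled out, that scaling estimate is just (numbers from the paragraph above):

#include <cstdio>
#include <cmath>

int main() {
    // With an EBF of 1.6, a ~12-ply test search (~5 usable plies of
    // transposition space) offers only a tiny fraction of the savings a
    // 45-ply search (~38 usable plies) does.
    double ebf = 1.6;
    printf("ratio = %.9f\n", pow(ebf, 5) / pow(ebf, 38));   // ~0.000000184
}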

Regards,

Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by syzygy »

Zenmastur wrote:If you're confused by something I wrote, or I failed to explain some aspect with sufficient clarity for you to understand it, you can ask, I'll clarify my statements, and then you'll no longer have reason to be at a loss.
Do I really need to repeat your statement?
The only way a position from an earlier part of the search can match a position in the later part of the search, is if no pawn moves or captures have been made.
That is only true if "earlier part" is "branch from root to current node".

A more sensible definition of "earlier part" for the purpose of TT is "the part of the tree searched before the current node". That part may contain lots of occurrences of the current position even through pawn moves/unmoves and captures/uncaptures.
I figure if someone from the Stockfish development team thinks this is worth trying they have sufficient resources to do it without my assistance.
Sure, they are always looking for vague ideas by people not willing or able to implement them and that nobody else is convinced by.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by hgm »

Zenmastur wrote:After all, ¾ of the savings comes from avoiding TLB page-table walks.
I am not so sure about that. The page tables can be cached too, like any memory; especially the higher levels could reside in L3 permanently. With a million pages the page table itself would be 8MB, which is still large. But the intermediate-level page-table directory would then be only 16KB, and would be accessed once per node to service the TLB miss from the hash probe. A once-per-node probe into 16KB should be enough to preserve it in L2.

So you would typically need two DRAM accesses per hash probe: one for the page table, and one for the data.

Fairly weak key focusing could be sufficient to confine the hash probes to a working set small enough for the associated page table to settle itself in a cache. 64KB of page table would correspond to 8K pages, i.e. 32MB of memory. And a once-per-node access to 64KB should be enough to preserve it in the L3 cache. 32MB is a sizable TT in itself, so you would not lose much performance from increased collision rates when the accesses are focused on it a little bit more than optimally.

E.g. for an 8GB table, you could zero 10 bits of the high index part of the Zobrist keys for the white pieces, and 10 other bits for all black pieces. That would confine sub-trees with only white moves (black null-moving) to 8MB, and you would leave the 32MB working set only after more than 4 black moves. Whether this is optimal depends on the ratio of black and white moves in a typical sub-tree.
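A sketch of what that key focusing could look like at initialization time; which key bits form the table index is engine-specific, so the masks below are purely illustrative.

#include <cstdint>

uint64_t zobrist[2][6][64];   // random keys, filled at startup (not shown)

// Clear 10 bits of the index part of every white-piece key, and 10 different
// bits of every black-piece key. Sub-trees in which only white pieces move can
// then never change those 10 index bits, confining them to a small slice of a
// huge TT, and similarly for black.
void focus_keys() {
    const uint64_t whiteMask = ~(0x3FFull << 20);  // 10 bits reserved for white
    const uint64_t blackMask = ~(0x3FFull << 10);  // 10 different bits for black
    for (int piece = 0; piece < 6; ++piece)
        for (int sq = 0; sq < 64; ++sq) {
            zobrist[0][piece][sq] &= whiteMask;
            zobrist[1][piece][sq] &= blackMask;
        }
}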
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by syzygy »

Let's consider what in the very best case could be won by all of this.

Crafty does not probe in the qsearch. If I remember right, Bob once mentioned that probing in the qsearch reduces nps by about 10% and reduces the size of the tree by about 10%. So probing in the qsearch would neither help nor hurt in the case of Crafty.

Stockfish does probe in the qsearch. I don't know if this has been tested extensively at different TT sizes, but it seems quite unlikely that probing in the qsearch significantly hurts Stockfish (most likely it helps).

If an engine probes in the qsearch, then almost all slowdown caused by probing the TT will be incurred in the qsearch. Apparently, disabling probing there only speeds up the engine by about 10%. So replacing the qsearch probing with something amazingly clever that works just as well in terms of tree size reduction could only speed up the engine by those 10%. That would be worth something, but not a lot.

In reality, there is no such amazingly clever trick. Limiting qsearch probing to a TT so tiny that it could fit in the L1 (if it would stay there at all) would at the same time limit the effectiveness of the TT.