Suggestion for hash tables and analysis.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Bob Hyatt

Post by Zenmastur »

bob wrote:I tried several options, but the basic one-level hash table has worked best for me. Of course, I don't hash in the q-search, which significantly reduces the total number of probes...

And then there are large and huge pages if you are concerned about the TLB..
So what I get from your comments are:

1. Storing the Q-search info in the TT is probably a bad idea.
2. You have tried split TT's.
3. There is no time advantage to be had by having an L1 TT fit inside the CPU's L1 or L2 cache, thereby avoiding most memory accesses, most TLB misses, and the TLB thrashing that always occurs even when using a modest-size TT (i.e. 256MB or 512MB).
4. Generating 2 to 3 orders of magnitude less NUMA traffic would have no effect on the scaling of chess engines on NUMA machines.

Are these correct?

Regards,

Forrest
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Bob Hyatt

Post by bob »

Zenmastur wrote:
bob wrote:I tried several options, but the basic one-level hash table has worked best for me. Of course, I don't hash in the q-search, which significantly reduces the total number of probes...

And then there are large and huge pages if you are concerned about the TLB..
So what I get from your comments are:

1. Storing the Q-search info in the TT is probably a bad idea.
2. You have tried split TT's.
3. There is no time advantage to be had by having an L1 TT fit inside the CPU's L1 or L2 cache, thereby avoiding most memory accesses, most TLB misses, and the TLB thrashing that always occurs even when using a modest-size TT (i.e. 256MB or 512MB).
4. Generating 2 to 3 orders of magnitude less NUMA traffic would have no effect on the scaling of chess engines on NUMA machines.

Are these correct?

Regards,

Forrest
I can't answer your last question, as the NUMA boxes I have access to are only 2 or 4 nodes total. On those, I see no problem. What happens on a 64-node or larger machine is unknown, however.

Using two TTs seems to always have a harmful effect.

(1) if you probe both, you will actually have more overhead when the first (small) fails.

(2) if you ONLY probe the local table, you miss a LOT of possible transpositions and that causes the tree to expand.

TLB misses will always be a problem. However, that being said, there is little you can do to get rid of them, other than reducing the size of the table. My experiences suggest that it is more important to make the table as useful as possible, rather than worrying about what it costs to access...

It is unlikely you can fit any sized TT in L1. I've not seen one processor with more than 64K of L1 cache. With minimal 8 byte entries, that is just 8K entries and you have a LOT of other data that has to be accessed as well. So that seems to be impossible. L2? IBM's power 8 has 512K of L2, which is about the biggest I recall. That is only 64K 8 byte ttable entries assuming you don't access any other data. Very small still. Not sure it is worth discussing trying to keep a hash table there. You have a much better chance of making this work with pawn hashing...
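
To make the "minimal 8 byte entries" arithmetic concrete, here is one hypothetical way such an entry could be packed. The field widths are illustrative assumptions of mine (similar in spirit to what several engines do), not Crafty's actual format; squeezing a best move into 8 bytes means giving up things like a separate static-eval field.

Code: Select all

// One hypothetical 8-byte packing: partial key for verification, best move,
// score, draft, and a combined generation/bound byte.  Widths are assumptions.
#include <cstdint>

struct TTEntry {
    uint16_t key16;     // upper 16 bits of the Zobrist key (verification only)
    uint16_t move16;    // best move, encoded in 16 bits
    int16_t  score;     // search score
    uint8_t  depth;     // remaining draft in plies
    uint8_t  genBound;  // 6-bit generation + 2-bit bound (exact/lower/upper)
};
static_assert(sizeof(TTEntry) == 8, "one entry per 8 bytes, eight per cache line");
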
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Bob Hyatt

Post by syzygy »

bob wrote:It is unlikely you can fit any sized TT in L1. I've not seen one processor with more than 64K of L1 cache. With minimal 8 byte entries, that is just 8K entries and you have a LOT of other data that has to be accessed as well.
Exactly.

And the L1 is anyway way way way too fast for something that is probed only once per node. It is much better to have instructions and the board representation in L1. Of course that is what happens anyway simply because of how CPU caches work.

Suppose the programmer could control what is in L1 (it wouldn't be a cache anymore but a scratchpad memory). The TT (i.e. an extremely tiny part of it) should be the last thing the programmer should think of placing there. It's like reserving some CPU registers to hold the TT.
So that seems to be impossible. L2? IBM's power 8 has 512K of L2, which is about the biggest I recall. That is only 64K 8 byte ttable entries assuming you don't access any other data. Very small still. Not sure it is worth discussing trying to keep a hash table there. You have a much better chance of making this work with pawn hashing...
Yep, and the beautiful thing is that the processor will figure this out all by itself.

Of course low-level knowledge of how caches work is still important for stuff like alignment, false sharing, etc. Also for memory access patterns, but I don't think that is very relevant for chess engines. (Well... if you are going to have two TTs that both have to be accessed, then at least make sure that you interleave them so as to have corresponding entries in the same cache line.)
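
A hedged sketch of the interleaving idea in that last parenthesis: if two logically separate tables are always probed for the same position, laying their corresponding entries side by side in one 64-byte line means the second probe is free once the first has pulled the line in. The entry contents, the two-way layout, and the power-of-two bucket count are placeholders of mine, not any engine's real format.

Code: Select all

// Two logical tables whose corresponding entries share one 64-byte cache line,
// so probing "both" costs a single memory access.
#include <cstdint>
#include <vector>

struct Entry { uint64_t key; uint64_t data; };   // 16 bytes, contents unspecified here

struct Bucket {                  // one cache line worth of data
    Entry tableA[2];             // two ways of the "first" table
    Entry tableB[2];             // two ways of the "second" table
};
static_assert(sizeof(Bucket) == 64, "bucket should fill exactly one cache line");

struct InterleavedTT {
    std::vector<Bucket> buckets;                  // n must be a power of two
    explicit InterleavedTT(size_t n) : buckets(n) {}
    Bucket& probe(uint64_t key) {                 // one line fill serves both tables
        return buckets[key & (buckets.size() - 1)];
    }
};
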
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Bob Hyatt

Post by Zenmastur »

bob wrote:
Zenmastur wrote:
bob wrote:I tried several options, but the basic one-level hash table has worked best for me. Of course, I don't hash in the q-search, which significantly reduces the total number of probes...

And then there are large and huge pages if you are concerned about the TLB..
So what I get from your comments are:

1. Storing the Q-search info in the TT is probably a bad idea.
2. You have tried split TT's.
3. There is no time advantage to be had by having an L1 TT fit inside the CPU's L1 or L2 cache, thereby avoiding most memory accesses, most TLB misses, and the TLB thrashing that always occurs even when using a modest-size TT (i.e. 256MB or 512MB).
4. Generating 2 to 3 orders of magnitude less NUMA traffic would have no effect on the scaling of chess engines on NUMA machines.

Are these correct?

Regards,

Forrest
I can't answer your last question, as the NUMA boxes I have access to are only 2 or 4 nodes total. On those, I see no problem. What happens on a 64-node or larger machine is unknown, however.

Using two TTs seems to always have a harmful effect.

(1) if you probe both, you will actually have more overhead when the first (small) fails.

(2) if you ONLY probe the local table, you miss a LOT of possible transpositions and that causes the tree to expand.
A 4 node NUMA machine (assuming the NUMA radius is 2) should see a sizable drop in NUMA traffic since the last N-plies of the search (N being 4 in the examples I gave) won't produce any NUMA traffic from TT probes.

In the last N-plies, only probes to the local L1 TT are allowed. So no double probes, and no excess overhead.

While I expect the tree to expand somewhat from some missed transpositions, I expect most transpositions to be found. I don't have the time to give all the details, but some of the more pertinent ones follow.

The L1 TT should have a high number of entries per bucket, even if this makes a bucket require two cache lines. Since these will be in the CPU's L1 or L2 cache, accessing two of them will still be an order of magnitude or more faster than a main memory access. If you don't like this arrangement, an alternative is to have the shallowest ply occupy a different area of the TT, which avoids buckets larger than a single cache line. The point is to reserve extra space in each bucket for the shallowest entries. These are the entries that will be transferred to the L2 TT before they are overwritten. They are also the entries that will catch many of the transpositions that start higher in the tree but have not yet transposed (i.e. more moves are required to complete the transposition). By giving the TT more entries than an N-ply search requires and allowing the shallowest plies more space than they would normally occupy, those entries will be preserved longer in the L1 TT and, when they are about to be replaced, will be entered in the L2 TT so that they are available there on the next, deeper search.
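
To pin the layout down, here is a rough sketch of such a small TT under assumptions of my own: 2,048 8-byte entries, 8-way buckets, and two ways per bucket reserved for the shallowest plies. It is meant only to make the description concrete, not as a tested implementation.

Code: Select all

// A tiny "L1 TT": 256 buckets x 8 ways x 8 bytes = 16 KB, with the first
// kReservedWays ways of each bucket reserved for the shallowest-ply entries.
#include <cstdint>

struct SmallEntry {
    uint32_t key32;   // upper 32 bits of the Zobrist key (verification)
    int16_t  score;
    uint8_t  depth;   // remaining draft in plies
    uint8_t  flags;
};

constexpr int kBuckets      = 256;
constexpr int kWays         = 8;   // a bucket may span two cache lines, as discussed
constexpr int kReservedWays = 2;   // ways kept for the shallowest plies

struct L1TT {
    SmallEntry slots[kBuckets][kWays] = {};

    SmallEntry* probe(uint64_t key) {
        SmallEntry* b = slots[key & (kBuckets - 1)];
        for (int i = 0; i < kWays; ++i)
            if (b[i].key32 == uint32_t(key >> 32)) return &b[i];
        return nullptr;
    }

    // Deep (near-leaf) entries may only replace ways [kReservedWays, kWays),
    // so shallow entries survive until they are promoted to the shared L2 TT.
    void store(uint64_t key, SmallEntry e, bool shallow) {
        SmallEntry* b = slots[key & (kBuckets - 1)];
        int first  = shallow ? 0 : kReservedWays;
        int victim = first;
        for (int i = first; i < kWays; ++i)          // evict the shallowest-draft way
            if (b[i].depth < b[victim].depth) victim = i;
        e.key32   = uint32_t(key >> 32);
        b[victim] = e;
    }
};
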

So, by allowing a larger L1 TT than necessary to hold an N-ply search and reserving extra space for the shallowest entries, which are in fact the fewest in number, I believe most transpositions will be caught. Those that are not caught may expand the tree, but that part of the tree will be entirely contained in the L1 TT, so few main memory accesses, TLB misses, or NUMA delays occur before it completes. Besides the difference in speed between main memory and cache, the limiting factor on the speed-up for this part of the search is the speed of the Q-search and the evaluation function. These are the slowest parts of the last N-plies of a search, so they will dominate the search times. The faster they are, the more the search will benefit from an unloaded main memory bus, a much smaller TLB miss fraction, and quicker access to NUMA data when needed.
bob wrote:TLB misses will always be a problem. However, that being said, there is little you can do to get rid of them, other than reducing the size of the table. My experiences suggest that it is more important to make the table as useful as possible, rather than worrying about what it costs to access...
TLB misses will be much less of a problem if main memory accesses are decreased by orders of magnitude, and more importantly if the TLB entries that are needed haven't been completely overwritten because more than 1,000 nodes have been processed since they were last used, which is what happens with current TTs.
bob wrote:It is unlikely you can fit any sized TT in L1. I've not seen one processor with more than 64K of L1 cache. With minimal 8 byte entries, that is just 8K entries and you have a LOT of other data that has to be accessed as well. So that seems to be impossible. L2? IBM's power 8 has 512K of L2, which is about the biggest I recall. That is only 64K 8 byte ttable entries assuming you don't access any other data. Very small still. Not sure it is worth discussing trying to keep a hash table there. You have a much better chance of making this work with pawn hashing...
Well, if you look at the number of nodes processed by Stockfish in an N-ply search, where N is small (see the table in my first post on split TT's), you can see that short searches process a relatively small number of nodes. 4-ply searches, even from massively complicated positions, DO NOT generate huge trees. The last column is for a very complicated middle-game position; a 4-ply search from that position visited only 2,173 nodes.

Even if the bucket size is 1, which is the worst size possible, and every probe into the TT is for a previously unseen position (i.e. the worst-case scenario) and will therefore replace any existing data, 2,048 writes into a TT with 2,048 entries will only overwrite ~37% of the data. If the bucket size is increased to 8, the amount of data overwritten drops precipitously. And if only 90% of the probes are for previously unseen positions (which still represents a low TT hit rate), the amount of data overwritten becomes a very small percentage of all data written. I conclude from this that small searches will fit into a small TT. You can, of course, easily test these statements if you like. In any case, with an 8-byte entry, a TT with 2,048 entries will fit in the L1 data cache. What percentage will be in the L1 data cache at any particular time depends on how much data not already in the L1 data cache is used between L1 TT probes.
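
These figures can be checked with a few lines of throwaway simulation under the stated worst case (single-entry buckets, every node a previously unseen position): with 2,048 random stores into 2,048 slots, roughly 1/e (~37%) of the stores land on a slot that already holds data from the search, and the collision rate falls several-fold with 8-way buckets. This is only a sanity check, not part of any engine:

Code: Select all

// Monte-Carlo check: how many stores clobber earlier data in a small table?
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int slots = 2048, nodes = 2048;       // table size and nodes in the small search
    std::mt19937_64 rng(1);
    for (int ways : {1, 8}) {                   // single-entry buckets vs. 8-way buckets
        const int buckets = slots / ways;
        std::uniform_int_distribution<int> pick(0, buckets - 1);
        std::vector<int> fill(buckets, 0);
        int overwrites = 0;
        for (int n = 0; n < nodes; ++n) {       // worst case: every node is a new position
            int b = pick(rng);
            if (fill[b] < ways) ++fill[b];      // a free way exists: nothing is lost
            else ++overwrites;                  // bucket already full: older data clobbered
        }
        std::printf("ways=%d: %d of %d stores overwrite earlier data (%.0f%%)\n",
                    ways, overwrites, nodes, 100.0 * overwrites / nodes);
    }
}
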


As far as the data in the L1 TT being pushed out of the CPU caches: given two arrays of equal size in all respects, one accessed with a spatial or temporal access pattern (or both) and the other accessed equally often in a pseudo-random pattern, the randomly accessed one will dominate the contents of a cache if the arrays are large enough to do so. So my question would be: what other large data structure is accessed significantly more often than an L1 TT and has neither a temporal nor a spatial access pattern? I can't think of one. Can you? If so, what is it?

So, I think you are wrong about it being impossible to keep a useful-sized TT inside the CPU's L1 data and L2 caches. Storing it in the CPU's L3 cache is undesirable because an L3 miss incurs at least the minimum memory latency and could incur many times that if a TLB miss occurs. So I suspect that a 64KB TT would be small enough to reside principally in the L2 cache, with maybe 5% to 10% in the L1 data cache. Some will undoubtedly be pushed into the L3 cache, and a small fraction of lines may even be evicted from L3, requiring access to main memory. I don't expect this to happen often enough to cause a huge slowdown, though.

syzygy wrote:
bob wrote:It is unlikely you can fit any sized TT in L1. I've not seen one processor with more than 64K of L1 cache. With minimal 8 byte entries, that is just 8K entries and you have a LOT of other data that has to be accessed as well.
Exactly.

And the L1 is anyway way way way too fast for something that is probed only once per node. It is much better to have instructions and the board representation in L1. Of course that is what happens anyway simply because of how CPU caches work.
On Intel desktop and server CPUs the L1 has two parts: the L1i, which holds instructions, and the L1d, which holds only data. Both are 32KB in size, so instructions will not be part of the equation when considering the L1 data cache. I wasn't aware that the board representation requires 32KB of data. :-)

I think there is more than enough room for the board to be represented in the L1 data cache without dominating the other data stored there. Other data, like the magic bitboard tables, may be larger than the small L1 TT I'm advocating, but they're not used often enough, and their access pattern isn't random. The Zobrist array isn't large enough and doesn't have a random access pattern, so it seems an unlikely candidate to dominate any cache even if every key is constructed from scratch. How much data is used in the Q-search?
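
For a sense of scale (exact key counts vary by engine), a conventional piece-square Zobrist table is only on the order of 6 KB, a fraction of a 32KB L1d:

Code: Select all

// Back-of-the-envelope size of a typical Zobrist key table:
// 12 piece types x 64 squares, plus castling/en-passant/side keys.
#include <cstdint>
#include <cstdio>

uint64_t pieceSquare[12][64];   // 12 * 64 * 8 bytes = 6,144 bytes
uint64_t castling[16];          // 128 bytes
uint64_t enPassantFile[8];      // 64 bytes
uint64_t sideToMove;            // 8 bytes

int main() {
    std::printf("Zobrist tables: %zu bytes\n",
                sizeof(pieceSquare) + sizeof(castling) + sizeof(enPassantFile) + sizeof(sideToMove));
}
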

Which leaves the evaluation function. How much data is accessed on each entry to this function, and how much of it is already in one of the caches? Have I forgotten something, or am I missing something here?
syzygy wrote:...(Well... if you are going to have two TTs that both have to be accessed, then at least make sure that you interleave them so as to have corresponding entries in the same cache line.)
I'm not. So this won't be an issue.

Regards

Forrest
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Ok, Maybe I didn't make it clear in previous posts...

Post by Zenmastur »

The main or L2 TT is accessed as normal for most of the depth of a search. The only time the L1 TT is accessed instead of the L2 TT is in the last N-plies of the search. So, at MOST plies the L2 or main TT is the only one being accessed. Since it IS "normal" size, it is used to resolve all transpositions except those in the last few plies of the search. Most transpositions will occur and be identified by this much larger TT. The L2 TT should actually be better at identifying transpositions than a "normal" TT of the same size, since far fewer of its entries will be overwritten by the virtually worthless (for identifying important transpositions) nodes at the end of the search. In fact, this is the whole point of doing it this way. The most important transpositions occur higher in the search tree, and using a split TT will stop or reduce the rate at which these important entries are overwritten by the worthless nodes found at the end of the search.

There is only one way a double probe can occur. It occurs when an entry that contains the shallowest possible ply in the L1 TT is going to be overwritten. In this ONE case, which does not happen often compared to other accesses, the L2 TT will be probed (to bring that entry into the cache) and then immediately updated with the data that is to be overwritten in the L1 TT. The more plies directed to the L1 TT, the less often a "double" probe will occur.
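
To restate that routing in code form: the sketch below uses plain std::unordered_map stand-ins for both tables purely for brevity (the real "L1" table would be the small fixed-size structure discussed earlier); the split depth N and the promote-on-replacement step are taken from the description above as I understand it, and everything else is an assumption.

Code: Select all

// Which table is probed/stored at which ply, plus the one "double probe" case.
#include <cstdint>
#include <unordered_map>

constexpr int N = 4;                           // plies at the tip routed to the small table

struct Entry { int16_t score; uint8_t depth; };

std::unordered_map<uint64_t, Entry> sharedTT;  // stand-in for the big shared ("L2") TT
std::unordered_map<uint64_t, Entry> localTT;   // stand-in for the tiny per-thread ("L1") TT

const Entry* probe(uint64_t key, int pliesToHorizon) {
    auto& tt = (pliesToHorizon <= N) ? localTT : sharedTT;   // one table per node, never both
    auto it = tt.find(key);
    return it == tt.end() ? nullptr : &it->second;
}

void store(uint64_t key, Entry e, int pliesToHorizon) {
    if (pliesToHorizon > N) { sharedTT[key] = e; return; }
    auto it = localTT.find(key);
    // The one "double probe" case: if the local entry being replaced holds the
    // shallowest ply of the subtree, copy it to the shared table first so the
    // next, deeper iteration can still find it there.
    if (it != localTT.end() && it->second.depth >= N)
        sharedTT[key] = it->second;
    localTT[key] = e;
}

int main() {
    store(0x1234u, {50, 10}, 10);   // deep in the tree: shared TT only
    store(0x5678u, {10, 3}, 3);     // near the horizon: local TT only
    return probe(0x5678u, 3) != nullptr ? 0 : 1;
}
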

I hope this reduces the apparent confusion about which TT is used at any point in the search and why I think tree expansion will not be a problem.

Regards,

Forrest
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by bob »

Zenmastur wrote:The main or L2 TT is accessed as normal for most of the depth of a search. The only time the L1 TT is accessed instead of the L2 TT is in the last N-plies of the search. So, at MOST plies the L2 or main TT is the only one being accessed. Since it IS "normal" size, it is used to resolve all transpositions except those in the last few plies of the search. Most transpositions will occur and be identified by this much larger TT. The L2 TT should actually be better at identifying transpositions than a "normal" TT of the same size, since far fewer of its entries will be overwritten by the virtually worthless (for identifying important transpositions) nodes at the end of the search. In fact, this is the whole point of doing it this way. The most important transpositions occur higher in the search tree, and using a split TT will stop or reduce the rate at which these important entries are overwritten by the worthless nodes found at the end of the search.

There is only one way a double probe can occur. It occurs when an entry that contains the shallowest possible ply in the L1 TT is going to be overwritten. In this ONE case, which does not happen often compared to other accesses, the L2 TT will be probed (to bring that entry into the cache) and then immediately updated with the data that is to be overwritten in the L1 TT. The more plies directed to the L1 TT, the less often a "double" probe will occur.

I hope this reduces the apparent confusion about which TT is used at any point in the search and why I think tree expansion will not be a problem.

Regards,

Forrest
That is what most refer to as a "thread local TT approach". It really doesn't work. If you only probe the small TT, you get very few hits, as it contains no entries from the early part of the search. This means that the last few plies of the search will miss out on the cutoffs, the exact scores, and more importantly, the best move to search first...

As I said, the idea has been tested with pawn hashing also, and there it does work a bit better since in many branches of the tree out in the q-search, you don't move a pawn so you get the pawn scoring information via frequent hits. But again, for my program at least, a single shared pawn hash table is still far more effective and provides the best performance.

I think that worrying about TLB problems is better handled via large and huge pages (i.e. on my 20-core box, 2MB pages are used completely automatically and transparently with no program changes; using 1GB pages requires a bit of programming effort via mmap()). But with such pages, you certainly aren't going to stress the TLB very much, since there are so few pages.
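
For what it's worth, the mmap() route looks roughly like this on Linux (a sketch, not Crafty's actual code): MAP_HUGETLB needs pages reserved by the administrator, adding MAP_HUGE_1GB from <linux/mman.h> would select 1GB pages explicitly, and the fallback keeps the engine working when no huge pages are available.

Code: Select all

// Ask for the hash table backing store in huge pages, with a normal-page fallback.
#include <sys/mman.h>
#include <cstddef>

void* alloc_hash(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {                       // no huge pages reserved: fall back
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED)
            madvise(p, bytes, MADV_HUGEPAGE);    // let THP supply 2MB pages transparently
    }
    return p == MAP_FAILED ? nullptr : p;
}
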

As far as your "much better due to not overwriting" goes, that depends on whether you try to be careful with what you overwrite. But separating transpositions within the first M plies from those within the last N plies simply doesn't seem to be the right idea, and it is absolutely going to increase tree size. The only possible compensation would be in terms of less TLB pressure, since that late-ply TT would be smaller. But it will never sit in L1, and probably not in L2 for most programs.

Another statement doesn't quite mesh with the tree search. Suppose you search to depth=30 and probe the small TT at the last 5 plies. That is by far the biggest part of the tree and that will be the majority of the total nodes searched / probed. The q-search alone can be 90% of the total nodes searched.

Finally, at least in my code, EVERY node that is searched causes a probe, AND a store when that node is completed. No matter what the outcome, it is stored.

Based on the exponential growth of the tree, I don't see how this is going to be of any benefit, OTHER than to reduce TLB pressure. And I don't see how TLB misses are going to cost anywhere near as much as the discontinuity between the two hash tables: information that is normally passed throughout the tree now does not cross that boundary very well.

I do not follow your double-probe explanation. Not sure why you would update both TTs, that seems to completely lose the benefit.
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by Zenmastur »

bob wrote:
Zenmastur wrote:The main or L2 TT is accessed as normal for most of the depth of a search. The only time the L1 TT is accessed instead of the L2 TT is in the last N-plies of the search. So, at MOST plies the L2 or main TT is the only one being accessed. Since it IS "normal" size, it is used to resolve all transpositions except those in the last few plies of the search. Most transpositions will occur and be identified by this much larger TT. The L2 TT should actually be better at identifying transpositions than a "normal" TT of the same size, since far fewer of its entries will be overwritten by the virtually worthless (for identifying important transpositions) nodes at the end of the search. In fact, this is the whole point of doing it this way. The most important transpositions occur higher in the search tree, and using a split TT will stop or reduce the rate at which these important entries are overwritten by the worthless nodes found at the end of the search.

There is only one way a double probe can occur. It occurs when an entry that contains the shallowest possible ply in the L1 TT is going to be overwritten. In this ONE case, which does not happen often compared to other accesses, the L2 TT will be probed (to bring that entry into the cache) and then immediately updated with the data that is to be overwritten in the L1 TT. The more plies directed to the L1 TT, the less often a "double" probe will occur.

I hope this reduces the apparent confusion about which TT is used at any point in the search and why I think tree expansion will not be a problem.

Regards,

Forrest
That is what most refer to as a "thread local TT approach". It really doesn't work. If you only probe the small TT, you get very few hits, as it contains no entries from the early part of the search.
You only probe the L1 TT in the last part of the search because all the data from the last part of the search is written to this TT. There is no way for a node from this part of the search to be written to the L2 or main TT, with the sole exception of when an entry containing the shallowest ply in the L1 TT is about to be overwritten, and even this could be deferred until after the search is done if you don't mind a little more complexity. The reason it's easy to defer these until later is that their number is at most equal to the number of legal moves in the position for which the N-ply search is being conducted. Since the L1 TT will be considerably larger than this number, they could be flagged as "do not overwrite". There are other ways to accomplish the same thing, of course. After the search is over, the entries flagged as "do not overwrite" can be written to the L2 TT.
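
The deferred variant could look something like this. The flag bit, the flush pass, and the stand-alone entry type are my own assumptions about one possible implementation, not a claim about how it must be done:

Code: Select all

// After the N-ply subtree completes, write the pinned ("do not overwrite")
// entries through to the shared L2 TT and release them.
#include <cstdint>
#include <vector>

struct SmallEntry { uint64_t key; int16_t score; uint8_t depth; uint8_t flags; };

constexpr uint8_t DO_NOT_OVERWRITE = 0x80;     // set on shallowest-ply entries

void flush_pinned(std::vector<SmallEntry>& localTT,
                  void (*storeShared)(const SmallEntry&)) {
    for (SmallEntry& e : localTT)
        if (e.flags & DO_NOT_OVERWRITE) {      // at most one per legal move at the subtree root
            storeShared(e);                    // write through to the shared TT
            e.flags &= uint8_t(~DO_NOT_OVERWRITE);
        }
}
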

The only way a position from an earlier part of the search can match a position in the later part of the search is if no pawn moves or captures have been made. This narrows down the possibilities quite a bit. The only way I can think of this happening is with a repetition of positions. Maybe there is some special case where this could happen without it being a repetition, but I can't think what it would be. If anyone has an example position, I would like to see it.
bob wrote:This means that the last few plies of the search will miss out on the cutoffs, the exact scores, and more importantly, the best move to search first...
I guess I'm going to need an example because I don't see any of that happening.
bob wrote:As I said, the idea has been tested with pawn hashing also, and there it does work a bit better since in many branches of the tree out in the q-search, you don't move a pawn so you get the pawn scoring information via frequent hits. But again, for my program at least, a single shared pawn hash table is still far more effective and provides the best performance.
That's kind of funny. I would never bother doing anything that complicated to a pawn hash because of the massive hit rate. It doesn't seem like it would be worth the effort.
bob wrote:I think that worrying about TLB problems is better handled via large and huge pages (i.e. on my 20-core box, 2MB pages are used completely automatically and transparently with no program changes; using 1GB pages requires a bit of programming effort via mmap()). But with such pages, you certainly aren't going to stress the TLB very much, since there are so few pages.
The main problem I'm trying to solve has nothing to do with TLB misses. It has to do with the entries in the TT being overwritten because, by definition, we are dealing with a heavily overloaded TT. I have my own ideas and definition for what a heavily overloaded TT means, which is probably not the same as yours. So to keep it simple, I'll just say that a very large fraction of the TT is overwritten before the next iteration begins. This is what I'm trying to avoid.

As far as 2MB pages are concerned, they are problematic for automated testing or any time memory becomes fragmented. If you have to schedule a shutdown a week in advance, this is not really an option. My systems get shut down maybe once a month for updates. The rest of the time they're busy, and shutting them down would usually lose several days of work. Huge pages are a non-starter for me.

Even though TLB misses are not the main problem, they are an issue. Having to wait 500 to 1,500 CPU cycles for a TLB miss to be resolved on every TT probe is a limiting factor.
bob wrote:As far as your "much better due to not overwriting" goes, that depends on whether you try to be careful with what you overwrite. But separating transpositions within the first M plies from those within the last N plies simply doesn't seem to be the right idea, and it is absolutely going to increase tree size. The only possible compensation would be in terms of less TLB pressure, since that late-ply TT would be smaller. But it will never sit in L1, and probably not in L2 for most programs.
Well, I've tried all of the top programs, and every one of them has the problem described by the OP. This includes all versions of Crafty that I've tried. As I already mentioned, we are dealing with a heavily overloaded TT by definition, so "being careful" with the replacement policy doesn't seem to be a solution. If it is, no one has yet discovered the proper replacement policy. Which is no great surprise, since with current TTs there is no direct way to measure the potential savings contained in the TT entries. I believe the primary reason for this is that insufficient information is stored in the individual entries to make this possible. This makes it very time-consuming to test different replacement policies.
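
For comparison, the kind of replacement rule commonly used today looks roughly like the sketch below: a generic depth-preferred-with-aging scheme, not any specific engine's code, and not a claim that it solves the overload problem being described here.

Code: Select all

// Among the ways of a bucket, evict the entry whose draft, discounted by how
// many searches ago it was written, is lowest (deeper and newer entries are kept).
#include <cstdint>

struct Way { uint8_t depth; uint8_t generation; };

int pick_victim(const Way* bucket, int ways, uint8_t currentGen) {
    int victim = 0, worst = 1 << 20;
    for (int i = 0; i < ways; ++i) {
        int age   = uint8_t(currentGen - bucket[i].generation);  // searches since stored
        int value = bucket[i].depth - 2 * age;                   // the weighting is arbitrary here
        if (value < worst) { worst = value; victim = i; }
    }
    return victim;
}
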
bob wrote:Another statement doesn't quite mesh with the tree search. Suppose you search to depth=30 and probe the small TT at the last 5 plies. That is by far the biggest part of the tree and that will be the majority of the total nodes searched / probed...


I agree completely. That is precisely why I chose to split the TT at that point. I wanted most of the nodes to be handled by the L1 TT and therefore not inundate the L2 with a bunch of nodes that are virtually worthless for finding important transpositions.
bob wrote:The q-search alone can be 90% of the total nodes searched.
My very first chess program had a separate cache for the Q-search. It performed very well even with the scant resources I could afford to give it at the time.
bob wrote:Finally, at least in my code, EVERY node that is searched causes a probe, AND a store when that node is completed. No matter what the outcome, it is stored.
Is there some other way to do it? :)

There doesn't seem to be much point to having a TT unless you are going to probe it, and it certainly won't do you much good to probe it if there is nothing stored in it!
bob wrote:Based on the exponential growth of the tree, I don't see how this is going to be of any benefit, OTHER than to reduce TLB pressure. And I don't see how TLB misses are going to cost anywhere near as much as the discontinuity between the two hash tables: information that is normally passed throughout the tree now does not cross that boundary very well.
Well, as I explained earlier, I don't expect much discontinuity between the two TTs. This is one of the points of transferring info from the L1 TT to the SHARED L2 TT. Even so, there will be some, since the L1 TT is local; but if, for instance, the last 6 to 8 plies of a search are handled by a single thread, the number of transpositions that cross threads (i.e. where the information needed to find the transposition isn't in the thread's own local L1 TT but IS found in another thread's local L1 TT) isn't likely to be a high percentage of all transpositions. These transpositions will be missed. Since we are overwriting the L2 TT more than two orders of magnitude less often, it's much more likely to catch transpositions that would otherwise be missed. Remember that if this TT were "normal" it would be highly overloaded by definition, so there is a good chance the needed entry has already been overwritten. Since the transpositions found in the L2 TT are at least one ply higher in the tree, and could be much higher, they are much more valuable for reducing search effort. If the EBF of the engine is 2, then finding a transposition N plies higher in the tree is 2^N times as effective at reducing search effort. Since we are dealing with a "normal" TT that is heavily overloaded versus a split TT that is ~2 orders of magnitude less overloaded, the search tree is very likely to contract.
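
The 2^N claim is just the usual subtree-size arithmetic. With an assumed EBF of b = 2 (as in the text), a hit with d plies of remaining depth prunes a subtree of roughly b^d nodes, so each extra ply of draft doubles what the hit saves:

Code: Select all

// Worked numbers for the "2^N times as effective" statement, assuming EBF = 2.
#include <cmath>
#include <cstdio>

int main() {
    const double b = 2.0;                               // assumed effective branching factor
    for (int d = 1; d <= 8; ++d)
        std::printf("hit with %d plies of draft prunes ~%6.0f nodes (%.0fx a 1-ply hit)\n",
                    d, std::pow(b, d), std::pow(b, d - 1));
}
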
bob wrote:I do not follow your double-probe explanation. Not sure why you would update both TTs, that seems to completely lose the benefit.
Well, I originally didn't intend to do this. I talked myself into it the other day because I thought there would be a one-ply gap between what was in the L1 and L2 TTs when the next deeper iteration came around. Of course, the entire contents wouldn't need to be written to the L2 TT, just the first ply or a fraction thereof. If these are written on the fly, the actual number would be about the square root of the number of legal moves in the position being searched (say 6 or 7). If the writes are deferred until the N-ply search is complete, a little more consideration can be given to which positions are transferred to the L2 TT (e.g. if the number of nodes in the sub-search is unusually small, it may be skipped; if a move generated an unusually high node count, it could be included even if it's not a particularly good move). I guess, in the end, it's at your discretion which nodes, if any, are transferred to the L2 TT. I think that transferring some of these nodes would be beneficial and transferring NONE of them would hurt performance. So yes, you would give up some of the gains made. If this is deemed unacceptable, you could simply increase the number of plies directed to the L1 TT by one. This would allow you to transfer the entire first ply of the L1 TT while still retaining the same advantage as the one-ply-smaller scheme in which nothing is transferred between the L1 and L2 TTs.

I'm trying to make these ideas as clear as possible, but sometimes my wording seems to hurt more than it helps. I guess I need a little more practice.

Regards,

Zen
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by bob »

2MB pages don't require any shutdowns in the world of Linux, nor do 1GB pages. The only problem is that, currently, 1GB pages have to be established at boot time and they remain 1GB pages forever. But still, no fragmentation issues that require restarts.

I am not convinced that a special table for just the last few plies is worthwhile. I can transpose near a tip with a position near the root just as easily as I can transpose between branches near the tip. If you have a heavily oversubscribed TT, I am also not sure that this is the answer. It seems to me that the best approach is always a good overwrite decision rather than something simpler such as a separate TT.

My response was primarily related directly to the question(s) asked of me, starting with NUMA, which is a completely different issue from what gets overwritten or not...
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by syzygy »

Zenmastur wrote:You only probe the L1 TT in the last part of the search because all the data from the last part of the search is written to this TT.
What do you mean by "last part of the search"?

Do you mean the last N ply of the search? That's an enormous amount of nodes. The L1 would cover only a uselessly small fraction of it. You'll have practically no hits.

Do you mean the last M visited nodes with M small enough for them to fit in the L1? That's a uselessly small number of nodes. You'll have practically no hits.
The only way a position from an earlier part of the search can match a position in the later part of the search is if no pawn moves or captures have been made.
The search tree is a tree, not a single branch.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ok, Maybe I didn't make it clear in previous posts...

Post by hgm »

Something accessed as infrequently as a hash-table entry would never survive in L1 anyway. There is so much data in a chess engine that is accessed more frequently than once per node, and that data would simply flush the table out of L1. Even the PST and Zobrist key tables see multiple accesses per node (from-square, to-square, victim), focused in a smaller area.
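
To illustrate hgm's point about access counts, here is a generic make-move fragment (not any particular engine's code): even the cheapest incremental update already touches several Zobrist and PST entries per node, all of which compete for the same L1 lines a tiny TT would need.

Code: Select all

// Incremental make-move update: two Zobrist and two PST accesses for the moving
// piece, plus one of each for the victim on a capture.
#include <cstdint>

uint64_t zobrist[12][64];   // piece-square hash keys (~6 KB)
int      pst[12][64];       // piece-square evaluation terms

void make_move_incremental(uint64_t& hash, int& psqScore,
                           int piece, int from, int to, int victim) {
    hash     ^= zobrist[piece][from] ^ zobrist[piece][to];
    psqScore += pst[piece][to] - pst[piece][from];
    if (victim >= 0) {                       // capture: the victim leaves the 'to' square
        hash     ^= zobrist[victim][to];
        psqScore -= pst[victim][to];
    }
}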