bob wrote:Zenmastur wrote:The main or L2 TT is accessed as normal for most of the depth of a search. The L1 TT is accessed instead of the L2 TT only in the last N plies of the search, so at MOST plies the L2 or main TT is the only one being accessed. Since it IS "normal" size, it is used to resolve all transpositions except those in the last few plies of the search. Most transpositions will occur in, and be identified by, this much larger TT. The L2 TT should actually be better at identifying transpositions than a "normal" TT of the same size, since far fewer of its entries will be overwritten by the virtually worthless (for identifying important transpositions) nodes at the end of the search. In fact, this is the whole point of doing it this way. The most important transpositions occur higher in the search tree, and using a split TT will reduce the rate at which these important entries are overwritten by the worthless nodes found at the end of the search.
There is only one way a double probe can occur: when an entry that contains the shallowest ply in the L1 TT is about to be overwritten. In this ONE case, which does not happen often compared to other accesses, the L2 TT will be probed (to bring that entry into the cache) and then immediately updated with the data that is about to be overwritten in the L1 TT. The more plies directed to the L1 TT, the less often a "double" probe will occur.
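A rough C sketch of the routing and of the one promotion-on-eviction case described above (the entry layout, table sizes, and split depth are purely illustrative, not taken from any engine in this thread):

```c
#include <stdint.h>

/* Illustrative entry layout -- real engines pack more into this. */
typedef struct {
    uint64_t key;    /* Zobrist signature */
    int16_t  score;
    int8_t   depth;  /* draft the entry was searched to */
    int8_t   ply;    /* distance from root when the entry was stored */
} TTEntry;

#define L1_BITS 16
#define L1_SIZE (1u << L1_BITS)
#define N_SPLIT 5                     /* last N plies go to the L1 TT */

static TTEntry l1[L1_SIZE];           /* small, thread-local table */
static TTEntry l2[1u << 18];          /* stand-in for the big shared table */
static int promotions = 0;            /* counts the rare "double probe" case */

static void l2_store(const TTEntry *e)
{
    l2[e->key & ((1u << 18) - 1)] = *e;
    promotions++;
}

/* Only the last N plies of the search probe/store the L1 table. */
int use_l1(int remaining_depth) { return remaining_depth <= N_SPLIT; }

/* Store into L1.  If the victim holds the shallowest ply the L1 table
   contains (its most valuable entry), promote it to the shared L2 table
   before it is lost -- the one case where a "double probe" occurs. */
void l1_store(const TTEntry *e, int shallowest_ply)
{
    TTEntry *slot = &l1[e->key & (L1_SIZE - 1)];
    if (slot->key != 0 && slot->ply <= shallowest_ply)
        l2_store(slot);               /* rare promotion on eviction */
    *slot = *e;                       /* otherwise just replace */
}
```

Nothing else ever writes to the shared table from the last N plies, which is why the double probe stays rare.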
I hope this reduces the apparent confusion about which TT is used at any point in the search and why I think tree expansion will not be a problem.
Regards,
Forrest
That is what most refer to as a "thread local TT approach". It really doesn't work. If you only probe the small TT, you get very few hits, as it contains no entries from the early part of the search.
You only probe the L1 TT in the last part of the search because all the data from the last part of the search is written to this TT. There is no way for a node from this part of the search to be written to the L2 or main TT, with the sole exception of when an entry containing the shallowest ply in the L1 TT is about to be overwritten. Even this could be deferred until after the search is done, if you don't mind a little more complexity. The reason it's easy to defer these writes is that their number is at most equal to the number of legal moves in the position for which the N-ply search is being conducted. Since the L1 TT will be considerably larger than this number, these entries could simply be flagged as "do not overwrite". There are other ways to accomplish the same thing, of course. After the search is over, the entries flagged as "do not overwrite" can be written to the L2 TT.
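The deferred variant could look something like this in C (the flag bit, table size, and sweep are just one way to do it):

```c
#include <stdint.h>

enum { TAB_SIZE = 1 << 16, TT_KEEP = 1 };

typedef struct {
    uint64_t key;
    uint8_t  flags;   /* TT_KEEP marks a "do not overwrite" entry */
} Entry;

static Entry tab[TAB_SIZE];   /* the thread-local L1 table */
static int   flushed = 0;     /* demo stand-in for counting L2 stores */

static void demo_l2_store(const Entry *e) { (void)e; flushed++; }

/* During the N-ply search, shallowest-ply entries are merely flagged;
   there are at most as many of them as there are legal root moves. */
void mark_keep(Entry *e) { e->flags |= TT_KEEP; }

/* The replacement policy honours the flag. */
int may_replace(const Entry *e) { return !(e->flags & TT_KEEP); }

/* After the N-ply search ends, one sweep hands the flagged entries to
   the shared L2 table and clears the flags.  Returns how many moved. */
int flush_kept(void (*l2_store)(const Entry *))
{
    int n = 0;
    for (int i = 0; i < TAB_SIZE; i++) {
        if (tab[i].flags & TT_KEEP) {
            l2_store(&tab[i]);
            tab[i].flags &= (uint8_t)~TT_KEEP;
            n++;
        }
    }
    return n;
}
```

Since only a handful of entries ever carry the flag, the flagged slots cost essentially nothing in replacement flexibility.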
The only way a position from an earlier part of the search can match a position in the later part of the search is if no pawn moves or captures have been made. This narrows down the possibilities quite a bit. The only way I can think of for this to happen is with a repetition of positions. Maybe there is some special case where it could happen without being a repetition, but I can't think what it would be. If anyone has an example position, I would like to see it.
bob wrote:This means that the last few plies of the search will miss out on the cutoffs, the exact scores, and more importantly, the best move to search first...
I guess I'm going to need an example because I don't see any of that happening.
bob wrote:As I said, the idea has been tested with pawn hashing also, and there it does work a bit better since in many branches of the tree out in the q-search, you don't move a pawn so you get the pawn scoring information via frequent hits. But again, for my program at least, a single shared pawn hash table is still far more effective and provides the best performance.
That's kind of funny. I would never bother doing anything that complicated to a pawn hash because of the massive hit rate. It doesn't seem like it would be worth the effort.
bob wrote:I think that worrying about TLB problems is better handled via large and huge pages (i.e. on my 20-core box, 2MB pages are used completely automatically and transparently with no program changes). Using 1GB pages requires a bit of programming effort via mmap(). But with such pages, you certainly aren't going to stress the TLB very much with so few pages.
The main problem I'm trying to solve has nothing to do with TLB misses. It has to do with entries in the TT being overwritten because, by definition, we are dealing with a heavily overloaded TT. I have my own ideas about, and definition of, what a heavily overloaded TT means, which is probably not the same as yours. So to keep it simple, I'll just say that a very large fraction of the TT is overwritten before the next iteration begins. This is what I'm trying to avoid.
As far as 2MB pages are concerned, they are problematic for automated testing, or any time memory becomes fragmented. If you have to schedule a shutdown a week in advance, this is not really an option. My systems get shut down maybe once a month for updates. The rest of the time they're busy, and shutting them down would usually lose several days of work. Huge pages are a non-starter for me.
Even though TLB misses are not the main problem, they are an issue. Having to wait 500 to 1,500 CPU cycles for a TLB miss to be resolved on every TT probe is a limiting factor.
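For what it's worth, the explicit-huge-page allocation bob mentions looks roughly like this on Linux (a sketch only; the fallback path is exactly what you hit once memory is fragmented and no huge pages were reserved at boot):

```c
/* Linux-specific sketch: try to back the TT with explicit huge pages
   via mmap(), falling back to normal 4KB pages when the huge-page pool
   is empty or the kernel doesn't support MAP_HUGETLB. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

void *tt_alloc(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)   /* huge-page pool empty or unsupported */
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

The silent fallback is the catch: on a long-running, fragmented box you quietly end up on 4KB pages again, which is why I don't consider this a reliable fix.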
bob wrote:As far as your "much better due to not overwriting" that depends on whether you try to be careful with what you overwrite. But separating transpositions within the first M plies and within the last N plies simply doesn't seem to be the right idea. And it is absolutely going to increase tree size. The only possible compensation would be in terms of less TLB pressure since that late-ply TT would be smaller. But it will never sit in L1. And probably not in L2 for most programs.
Well, I've tried all of the top programs, and every one of them has the problem described by the OP. This includes every version of Crafty that I've tried. As I already mentioned, we are dealing with a heavily overloaded TT by definition, so "being careful" with the replacement policy doesn't seem to be a solution. If it is, no one has yet discovered the proper replacement policy. That is no great surprise, since with current TTs there is no direct way to measure the potential savings contained in the TT entries. I believe the primary reason is that insufficient information is stored in the individual entries to make this possible. This makes it very time-consuming to test different replacement policies.
bob wrote:Another statement doesn't quite mesh with the tree search. Suppose you search to depth=30 and probe the small TT at the last 5 plies. That is by far the biggest part of the tree and that will be the majority of the total nodes searched / probed...
I agree completely. That is precisely why I chose to split the TT at that point. I wanted most of the nodes to be handled by the L1 TT and therefore not inundate the L2 TT with a bunch of nodes that are virtually worthless for finding important transpositions.
bob wrote:The q-search alone can be 90% of the total nodes searched.
My very first chess program had a separate cache for the Q-search. It performed very well even with the scant resources I could afford to give it at the time.
bob wrote:Finally, at least in my code, EVERY node that is searched causes a probe, AND a store when that node is completed. No matter what the outcome, it is stored.
Is there some other way to do it?
There doesn't seem to be much point to having a TT unless you are going to probe it, and it certainly won't do you much good to probe it if there is nothing stored in it!
bob wrote:Based on the exponential growth of the tree, I don't see how this is going to be of any benefit, OTHER than to reduce TLB pressure. And I don't see how TLB misses is going to cost anywhere near as much as the discontinuity between the two hash tables and the information that is normally passed throughout the tree, but now does not cross over that boundary very well.
Well, as I explained earlier, I don't expect much discontinuity between the two TTs. That is one of the points of transferring info from the L1 TT to the SHARED L2 TT. Even so, there will be some, since the L1 TT is thread-local. But if, for instance, the last 6 to 8 plies of a search are handled by a single thread, the number of transpositions that cross threads (i.e. the information needed to find the transposition isn't in a thread's own local L1 TT but IS found in another thread's local L1 TT) isn't likely to be a high percentage of all transpositions. These transpositions will be missed. Since we are overwriting the L2 TT more than two orders of magnitude less often, it's much more likely to catch transpositions that would otherwise be missed. Remember that if this TT were "normal" it would be highly overloaded by definition, so there is a good chance the needed entry would already have been overwritten. Since the transpositions found in the L2 TT are at least 1 ply higher in the tree, and could be much higher, they are much more valuable in terms of reducing search effort. If the EBF of the engine is 2, then finding a transposition N plies higher in the tree is 2^N times as effective at reducing search effort. Since we are comparing a "normal" TT that is heavily overloaded with a split TT that is ~2 orders of magnitude less overloaded, the search tree is very likely to contract.
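The 2^N claim is just the geometry of the tree; a toy function makes the arithmetic concrete:

```c
#include <stdint.h>

/* Toy illustration of the claim above: with an effective branching
   factor b, a TT hit N plies higher in the tree cuts off a subtree
   roughly b^N times larger, so it is roughly b^N times as effective
   at reducing search effort. */
uint64_t savings_ratio(uint64_t ebf, unsigned plies_higher)
{
    uint64_t r = 1;
    while (plies_higher--)
        r *= ebf;                  /* subtree grows by b per extra ply */
    return r;
}
```

So with an EBF of 2, a hit 6 plies higher is worth on the order of 64 end-of-search hits, which is the whole argument for protecting the high entries.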
bob wrote:I do not follow your double-probe explanation. Not sure why you would update both TTs, that seems to completely lose the benefit.
Well, I originally didn't intend to do this. I talked myself into it the other day because I thought there would be a one-ply gap between what was in the L1 and L2 TTs when the next, deeper iteration came around. Of course, the entire contents wouldn't need to be written to the L2 TT, just the first ply or a fraction thereof. If these are written on the fly, then the actual number would be about the square root of the number of legal moves in the position being searched (say 6 or 7). If the writes are deferred until the N-ply search is complete, a little more consideration can be given to which positions are transferred to the L2 TT. (E.g. if the number of nodes in a sub-search is unusually small, it may be skipped; if a move generated an unusually high node count, it could be included even if it's not a particularly good move.) I guess, in the end, it's at your discretion which, if any, nodes are transferred to the L2 TT. I think that transferring some of these nodes would be beneficial, and not transferring ANY of them would hurt performance. So yes, you would give up some of the gains made. If this is deemed unacceptable, then you could simply increase the number of plies directed to the L1 TT by one ply. This would allow you to transfer the entire first ply of the L1 TT while still maintaining the same advantage of a one-ply-smaller search in which nothing is transferred between the L1 and L2 TTs.
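The deferred selection rule I have in mind could be sketched like this (the 1/8 and 4x thresholds relative to the average subtree size are invented for illustration, as is the struct):

```c
#include <stdint.h>

/* Hypothetical bookkeeping for one first-ply entry of the finished
   N-ply search. */
typedef struct {
    uint64_t key;
    uint64_t subtree_nodes;   /* nodes spent below this root move */
    int      is_good_move;    /* e.g. score fell inside the window */
} RootEntry;

/* Decide whether this entry gets copied into the shared L2 TT. */
int should_transfer(const RootEntry *e, uint64_t avg_nodes)
{
    if (e->subtree_nodes < avg_nodes / 8)
        return 0;   /* unusually small sub-search: skip it */
    if (e->subtree_nodes > avg_nodes * 4)
        return 1;   /* unusually large: keep even if the move is poor */
    return e->is_good_move;
}
```

The point is only that, once the writes are deferred, the node counts are sitting there to be used, so the policy can be smarter than "transfer everything".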
I'm trying to make these ideas as clear as possible, but sometimes my wording seems to hurt more than it helps. I guess I need a little more practice.
Regards,
Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.