Suggestion for hash tables and analysis.

Zenmastur · Post by **Zenmastur** » Sun May 29, 2016 8:00 am

hgm wrote:Well, it is pretty stupid to use that for testing modifications in the hashing scheme, as you know that the hash-table size only weakly affects speed, and speed only weekly affects Elo.

I know that current TTs have had only the crudest attempts at optimizations. They use the crudest implementations possible and most have even cruder replacement algorithms. They are consequently resource hogs compared to the job(s) they perform.

An example is the common attitudes displayed about TT in the MANY DISCUSSIONS of TT size vs nps. This has been a problem for a while. And yet no one seems to know how to fix it or if they do they are doing a good job of keeping it a secret. Most people seem to just want to wait until the hardware release and hope that fixes the problem. Some weak attempts at trying to fix this problem revolve around increasing the page size to limit the number of TLB misses. This is a solution thought up by someone too lazy to actually think about the problem. The other solution that seems popular is to limit the TT size. This is simply an admission that the programmers either don't really care what the end users need/want or are too inept or too lazy to find a solution. i.e. it's easier for them to ignore the problem.

hgm wrote:A doubling of the Hash size typically produces only 6% speedup (the 12th root of 2), which corresponds to 6 Elo. A 10% more efficient use of the hash table would only cause 0.8 Elo. It would take >100,000 games to see that, while the speedup could already be measured accurately from a few hundred test positions (the equivalent of 10 games or so).

I've seen you post these same statements numerous times. I think these statements are misleading since we are talking about, to quote you, “when you really do a single search that hugely overloads the hash table (e.g. because you run it for days)”. There are probably true in many circumstance involving short time controls, with small TT's, and “average” positions. Under longer time control, doubling or quadrupling your TT size can be the difference between spending a day on a position and spending two or more weeks on it and still not finding the “right” move. It's not all about speed or ELO. In many cases it's about storing information in the TT long enough that you can move around in the line of play while analyzing and still have that information available multiple hours later when it's needed. None of the current TT implementations allow this to happen with out a HUGE TT. These TT's would need to be so large in fact, that most people can't afford a computer that is capable of handling that much memory much less affording the memory itself.

hgm wrote:Volkswagen also took a lot of risk when they had their cars tested for emission of pollutants, because when they would score badly on that count they would lose a lot of business on people that wanted to buy environmentally friendly cars and governments that offered subsidies for those. So they had the engine run in a different mode during tests as it would run in for the purpose of doing what people actually bought it for: driving on roads from one place to another.

So now they are widely known as frauds and cheaters, who swindled people out of their money with false promises, because that would have better 'economic consequences' for them.

What you propose is that the Komodo developers should likewise defraud their customers, by using methods to beef up their test performance by methods they know would fail badly in actual use. I would not recommend that method of doing business.

I don't know who taught you how to read but apparently reading comprehension isn't one of your strong points if you think what I wrote equates to the bold part of your statements. In short, I'm not impressed with your delusional analogy. My personal opinion is Larry and Mark pay closer attention to what their customers want than most other developers. I don't know where you get off trying to put words into my mouth or twist the meaning of what I wrote, but I assure you that will no be an easy feat.

Zenmastur wrote: You get Larry Kaufman to post in this thread that he could care less how Komodo does in TCEC then I might take this point seriously. The fact is winning tournaments sells Komodo better than any advertisement ever could. So, I don't think your going to convince me or many other people that “no cares” about tournament reults.

hgm wrote:I was talking about users....

So was I, since those are the people that buy Komodo and therefore support Larry & crew. Therefore Larry cares about the TCEC performance of Komodo because he knows that the more Komodo wins, the more likely previously hesitant customers are to buy Komodo. Now, that wasn't so hard. Was it?

hgm wrote: Of course a supplier often has interests opposite to customers...

Yeah...OK. I think every adult has an understanding of how the market place works. Thanks anyway.

hgm wrote:And what I propose is a simple, safe, fundamentally correct, and robust method to solve the analysis problem and improve tournament performance at the same time.

You make it sound SO PERFECT! You must have recently taken a propaganda class. These are sometimes referred to as Marketing classes at universities.

You and I obviously have a different idea about what the phrase “fundamentally correct” denotes.
So, explain why you believe a linear distribution of of entries by depth is the “fundamentally correct” way to represent a distribution the is exponential with depth. LOL

hgm wrote:As I said, there is little to gain from improving hash performance, so for me this doesn't have a very high priority even for my own engines...

I don't have the same beliefs as you and I don't recall asking you to make changes to any engine, yours included.

hgm wrote:What you believe or not is of course entirely your business...

I agree. It is.

hgm wrote: You could believe that the Earth is flat, for all I care...

You can believe you are omniscient for all I care, just don't expect me to believe it!

I think I've heard enough of your condescending diatribe. Thanks, but no thanks.

Regards,

Zen

Zenmastur · Post by **Zenmastur** » Mon May 30, 2016 10:41 am

jwes wrote:Many of us have seen this problem when analyzing. You analyze down one variation, back out, analyze down a second variation, back out again and the hash entries for the first variation are overwritten.
A simple way to solve this problem is: if in analysis mode, when deciding which hash entries to be overwritten, treat all entries with exact values as being current. This should fix the problem in virtually all cases. Another idea is to never overwrite exact entries with depths greater than some number in the range of, say, 15-30.

In the last day or so I have done a little work on this problem. I think I may have found a “real” solution.

The are several contributing factors to this problem.
1. Better engines can produce huge numbers of nodes in relatively short periods of time.
2. Most computers have insufficient memory to store these for more than a few minutes worth of computations. e.g. at 10,000knps Stockfish needs about 100 MB/sec to store these in a TT. So 32GB will allow about 5 or 6 minutes worth of storage before serious overwriting begins. 100% overwritting takes much longer, perhaps 2 hours.
3. There are exponentially more nodes deeper in the tree, so the last few plies of a search will dominate the contents of the TT if left unchecked. The nodes from the last few plies are, in general, the least valuable to someone doing deep analysis. The reason for this is they are the ones most likely to be the result of an error in the move sequence due to the shallow searches used to produce them.

So the solutions is to not allow these nodes to dominate the TT. The best way to do this is not, as you might expect, to limit them in some way. This might work but to do it with maximum affect you need a large statistical base of information that is specific to each particular program. So, every single developer would have to generate the statistics for their own program and even then there in no guarantee that the results will be optimal.

It's much better to not allow them to enter the TT in the first place. This also means keeping the SEE data out of the TT. All of the deepest and LEAST reliable data still being generated, stored, and accessed like before but it needs to be placed in a separate area.

An example of how this might work. First the TT is split or replicated if you will. The main TT which is what most people will think about when you mention transposition tables, will remain the same in all respects except for what goes into it. The same data structures, same everything. We can call this the L2 TT for Layer 2.

Now we need a new structure to handle just nodes from the deepest plies of the search. We'll call this the L1 TT. So we'll need a little more information about this part of the tree and about hashes in general. I used a development version of SF to generate a few numbers so I could find a good split point and then size the L1 TT properly. So a few numbers:

Code: Select all

Ply       SQRT&#40;Perft nodes&#41;           SF Nodes       Complex SF Nodes
 1                  5                      21                141        
 2                 20                      58                318
 3                 94                     143                483
 4                444                     232               2173
 5               2206                    1715               3626
 6              10912                    3851               4432
 7              56532                    7468               9288
 8             291546                   12079              50709

The first column is the depth of search. The second column is the sqrt(perft(ply)) for the standard starting position. The third column is the number of nodes a recent dev version of Stockfish produced from the root position. And the last column is the number of nodes dev SF needed from a very complicated middle-game position that was not blocked and had all pieces still on the board. The second and third columns are for reference purposes and the last I used to gauge in a rough way the number of nodes that would be stored in a TT based on search depth.

If the last N plies from a search are not added to the L2 TT, but are instead placed in and probed for in the L1 TT described below, we can expect more probes than the number given in the last column given above. The reason is the table above was generated with a very large TT so no entries were overwritten so the tree size would be a minimum for that version of SF.

I sized the L1 TT such that a search of N plies will just fill an empty TT. And then I took the raw number of buckets require and rounded up to the next larger power of 2.

To store a four ply search, with more than 2173 stores will statistically just fill a TT of 343 entries. With 10 entries per 64 byte cache line this requires 35 buckets. The next higher power of 2 gives 64 buckets for a total of 640 entries and statistically could be filled with as little as 4,455 unique stores. This TT only requires 4K bytes of ram. More importantly it will fit entirely inside (with plenty of room to spare) the L1 data cache of many modern desktop /server CPU's. e.g. Sandybridge, Ivybridge, Haswell, Skylake, Broadwell.

Now when using a large TT, one that has no chance of fitting inside the cache of even the most powerful CPU's, the wait time from the request of the data until it arrives can be 250-400 cpu cycles OR MORE. The reason for this is almost every request requires going to main memory to be satisfied. After being created the L1 TT is accessed so often and is so small a major fraction of it will be found in the CPU's L1 data cache. The “normal” latency of the L1 cache is about 5 cycles. An L2 cache hit has a latency of 12 cycles and on an L3 hit has a latency of 36 cycles. The L1 and L2 cache latency's are more than an order of magnitude less than those of main memory and therefore a probe into a multi-GB TT. This should provide a substantial increase in the number of cycles available to other functions like evaluations and SEE that occur at the bottom of the search tree. This will be the case even if major fractions of the L1 TT are pushed out of the L1 cache to either the L2 or L3 caches.

Because the part of the tree being searched is quite small it's very likely it will only be searched by one thread even if many threads are being used, therefore the data contained in the L1 TT will have little value to other threads. So I see little reason to share the L1 TT across multiple threads. Instead each thread will be assigned it's own L1 TT.

I chose a 4-ply search because it seemed large enough to have a major impact on performance, had a very small resource footprint in each core, and seemed least likely to require major changes in the search routine. The effects will vary somewhat from engine to engine and from one type of CPU to the next.

I could have easily made the L1 TT capable of handling a 6 or 8 ply search. It would of course be larger and probably would reside mostly in the 256kb L2 cache. There may be advantage to this even if only the data from the last 4 plies are stored there. The advantage is that the data isn't over written nearly as fast and so enjoy longer persistence. The only way to know what is best is to test various configuration.

The effects on the main TT is dramatic as far as retention of useful data is concerned. In complicated middle game positions the L2 or main TT will be overwritten about 3 orders of magnitude less often. So the 32GB TT table given in the example at the top of this post at a search rate of 10Mnps would be able to search for 10,000 to 12,000 minutes (160 to 200 hours, 6-8 days) in a complex position before serious overwritten begins. That's with the last four plies of the search(s) directed to a small secondary TT called the L1 TT. With an 8GB TT, which seems more common on desktops and laptops, and a search speed of 10Mnps the search could run a day or two before serious overwriting begins.

If that last 6 or 8 plies were directed at a secondary TT the numbers become quite impressive even for small main TT's.

Anyway that's the basics of the idea with out the menutiae needed to actually make it work.

The other possible advantage is, of course, all the extra CPU cycles that will be available by not accessing main memory on every/most TT probes.

Regards,

Zen

Cardoso · Post by **Cardoso** » Sat Jun 04, 2016 8:03 pm

Very interesting and refreshing ideas on TTs Forrest, have you tried implementing it?
If so do you have some data showing the improvements?
I tried HGM idea of equi-distributed-draft replacement about a month or so in my checkers engine.
It got always worse results, the node count was considerably higher in all test positions I tried.

regards,
Alvaro

Zenmastur · Post by **Zenmastur** » Sat Jun 04, 2016 10:05 pm

Cardoso wrote:Very interesting and refreshing ideas on TTs Forrest, have you tried implementing it?
If so do you have some data showing the improvements?
I tried HGM idea of equi-distributed-draft replacement about a month or so in my checkers engine.
It got always worse results, the node count was considerably higher in all test positions I tried.

regards,
Alvaro

I haven't. I only came up with the idea a few days ago and I'm still working on various aspects of it. The reason that I bothered working on it at all was I had similar experiences to the OP and HGM's responces. That's why there isn't many details given. I've since solved a few problems that aren't mentioned in my original post and have a few more thoughts I'll share when time permits.

Checkers will be a little different since it has a smaller branching factor. I think the number of plies directed to the L1 TT will need to be increased significantly. This might not be a problem. If it is, I think there will be easy fixes. So, I think it's worth a try.

So, stay tuned, I'll post other thoughts soon.

Regards,

Forrest

hgm · Post by **hgm** » Sat Jun 04, 2016 11:08 pm

Zenmastur wrote: Under longer time control, doubling or quadrupling your TT size can be the difference between spending a day on a position and spending two or more weeks on it and still not finding the “right” move. It's not all about speed or ELO. In many cases it's about storing information in the TT long enough that you can move around in the line of play while analyzing and still have that information available multiple hours later when it's needed.

Isn't that simply because they all use aging during analysis? At current engine speeds and memory sizes long searches will always strongly overload hash tables. So no practically achievable hash size can protect you from loss of entries that you declared to be useless.

Note that my statement of course does not apply to any conceivable implementation of a transposition table. Just to a good one, which keeps valuable entries as well as it can. Overwriting deep entries for positions to which you want to return during analysis does not fit that description.

I don't know who taught you how to read but apparently reading comprehension isn't one of your strong points if you think what I wrote equates to the bold part of your statements. In short, I'm not impressed with your delusional analogy. My personal opinion is Larry and Mark pay closer attention to what their customers want than most other developers. I don't know where you get off trying to put words into my mouth or twist the meaning of what I wrote, but I assure you that will no be an easy feat.

My statement that people would not care the slightest about TCEC results if they knew they were in no way related to the performance of the same engine in analysis, and that making them believe they are when you know they are not is fraud still stands, however...

You and I obviously have a different idea about what the phrase “fundamentally correct” denotes.
So, explain why you believe a linear distribution of of entries by depth is the “fundamentally correct” way to represent a distribution the is exponential with depth. LOL

Supposing that with 'linear' you mean 'flat' (i.e. homogeneous) here, that seems quite elementary. The number of leave nodes of the search tree is multiplied by the miss fraction at any level. To catch all transpositions of the last N moves of one side you will need to store the unique leave positions of the entire 2N+1 ply sub-tree. The size needed for this grows exponentially with N. The number of transpositions you catch will thus only increase logarithmically with the number of entries used for that tree level. E.g. in a tree with branching factor 30 and all independent moves you would need 450 entries to catch the transposition of the moves in the last two all-nodes (the cut-node do not branch, and there usually is nothing to transpose there). So for a move sequence A-B-C-D you would then catch A-B-D-C. But it would take 9,000 entries to also catch A-C-D-B for sure, as you would have to wait for the second move to change. And 202,500 entries to catch B-C-D-A. (Note that all others of the 24 permutations should have already been caught at an earlier level.)

So assigning more entries N only decreases the 'transmission factor' to the next level as 1/log(N), in a way that does not really depend very much on the level. It determines just how many levels you can look back towards the root to catch transpositions with moves there. The total transmission factor of the tree is proportional to 1/PRODUCT(log(N_i)) where N_i is the number of hash entries used for level i. For the sum of N_i fixed (namely the total hash-table size) this is minimized for all N_i equal. (log(a+h)*log(a-h) for infinitesimal h equals (log(a)+h/a-0.5*(h/a)^2)*(log(a)-h/a-0.5*(h/a)^2) = log(a)^2 - (log(a)+1)*(h/a)^2, so letting one of the N_i grow at the expense of the other will always increase the transmission factor compared to where they were equal, as the factor in front of the h^2 term is negative.)

So in short, the time you have to wait at any level for yet another transposition to occur increases exponentially, as it depends on progressively more removed levels to change their move. The number of extra entries needed to catch an extra transposition at a certain level will always be many times larger than what you would need to make another level that does not catch as many transpositions yet catch up. E.g. if one level catches 3 transpositions (only the first requiring search, the others hash cutoffs), another level 4, for a total transmission of 1/3*1/4, it would require some 30 times fewer extra entries to get it to 1/4*1/4 (by letting that first level also get 4 traspositions) than to get it to 1/3*1/5 (by letting the second catch 5). While 1/4*1/4 = 1/16 is in fact even smaller than 1/3*1/5 = 1/15.

I don't have the same beliefs as you and I don't recall asking you to make changes to any engine, yours included.

A small refresher then: actually you said something like: "If it is so easy, why don't you do it?"

Zenmastur · Post by **Zenmastur** » Tue Jun 07, 2016 8:40 am

I didn't spend much time on this so it's probably full of spelling, grammatical, factual, and logic errors. It's also based on using late model Intel server and desktop CPU's so the numbers given will likely change for other models or brands of CPU's.

A little more food for thought:

One of the advantage of splitting the TT into 2 parts, one large and one small is the speed advantage gained by not accessing main memory on every TT probe. If 4 kb pages are used a 8-gigabyte TT will occupy 2 million pages. These won't fit into a 1k entry L2 TLB. Since the probes into the TT will occur in pseudo-random locations almost all probes will miss the CPU caches and will require an access to main memory. The page lookup will miss the L1 DTLB and L2 TLB 99.9% of the time, which will require a page table walk. On a system with a four level page directory this will likely require 4 additional accesses to main memory before the virtual address can be translated to a physical address. Each of these is likely to take an additional 250-400 cpu cycles. On CPU's prior to Skylake there is only one page table miss handler available. There can be more than one outstanding page table miss. This means that there is no guarantee a TLB miss will be handled in a timely fashion and therefore a TT probe could wait several thousand cpu cycles before the required information is made available. This may not happen all the time but it must have been happening too often because Intel add a second page miss handler to Skylake CPU's plus increased the L2 TLB to 1.5K entries. In any case, decreasing the thrashing of the L2 TLB can only have a positive effect on engine performance regardless of the CPU used.

I have no clue why it isn't displaying this image and I don't really care, if you want to see I guess you'll have to go to the URL, sorry but I don't have the time to figure it out.

This graphic shows what happens to memory access times in multi-threaded applications access and modify an array elements in random order. This effect varies with array or working set size. When the array size exceeds the size of the CPU's caches access times go through the roof. So much so that they easily exceed the raw memory access time. This also shows the effect of TLB thrashing as the working set sizes causes the number of working pages to exceed the number that can be stored in TLB's . This graphic wasn't the one I had planned to use since the test was specifically design to show access times in single CPU multi-threaded applications based on limited memory bus speeds. But, I couldn't find the graph I was looking for and the working set sizes covered in this graph are large enough to exceed the capacity of most TLB (i.e. 512MB when using 4K pages requires 128K pages). So this graphic isn't perfect for what I'm using it for, but it does show that random access and modification to large arrays can take much longer than memory access times alone can explain. This graph should give a rough idea of what happens to access times of TT's as their size is increased.

By splitting the TT into a small cache-able part and a larger part the rate at which main memory needs to be accessed can be controlled to some extent. Even more importantly it seems, the TLB thrashing that takes place when using large TT's will be greatly mitigated. This will have a substantial impact on node processing.

There are some other considerations as well.

Giving each thread it's own L1 TT which IS NOT shared by other threads has several implications.

The first is that the cache entries “owned” by a a particular thread will follow that thread if it moves from one core to another. (i.e. it's interrupted so another process can run on its core, (hyper-threading should be turned off) when it's rescheduled that core may still be busy, in which case it will be run on a different core.) This could present a performance problem if this happen a lot. It's probably not a huge problem for a single CPU system because the data that's in the cores local L1 Data and L2 caches is also stored in the CPU's L3 cache. Further this data can be forwarded. So if the thread is moved to a different core, and not too much time has passed, the data that was in its L1 TT will likely still be found in the CPU's L3 cache and so won't require going to main memory when it's access from the new core. On a NUMA system the new core may be located on a different CPU. If the data is still in the original CPU's cache it can still be access but will create a lot of QPI traffic which is not good. If it's not in the L3 cache of the previous CPU it will still be available through main memory but access will be at main memory speed or worse until this info is cached on CPU the thread is executing on. So, it would be better to have the threads pinned to a particular core if possible.

A related issue is cache line state changes. A cache line on an a Haswell CPU has five states. They are Modified, Exclusive, Shared, Invalid and Forwarding (MESIF). If a cache line is Exclusive this means that no other cache has a copy of this data. Which means when you want to modify the data it can be done with out any delays. If it's shared a RFO to invalidate all other copies must be issued first. This effectively makes the cache-line Exclusive. An RFO must be broadcast on the CPU's local bus and on a NUMA system across the QPI links to other CPU's. This takes time and slows down the process. On a NUMA system the wait times can be excessive and the extra traffic between CPU's will not be appreciated. These buses are usually a major bottleneck and reducing traffic on them is highly desirable. So exclusive use of the L1 TT's saves additional time and helps keep the QPI links as lightly loaded as possible. The more CPU's in a NUMA system the more important it is to eliminate this un-needed traffic.

One of the slowest cache operations is the state change to Invalid. Issuing RFO's by the tens of millions per second is bad because every other cache in the system has to invalidate the line in question. If that cache line is needed again by any of the cores that had to invalidate it, it will have to be reloaded. This may create more QPI traffic in a NUMA system if the requesting and forwarding cores aren't on the same CPU. So a single RFO will likely create more QPI traffic than just that created by an RFO. This will have an impact on all NUMA machines. The more CPU's the system has the higher the impact.

Possible L1 TT sizes:

There are several reasons for wanting a larger L1 TT. One is to hold more entries so the TT is overwritten less often. A second reason is so the layout of the individual entries can be made as close as possible to the entries in the L2 TT. This allows more code re-use and less effort implementing it, if you are modifying an existing engine. And the last reason is so more, more accurate, and more useful information can be stored in the entries.

A natural questions to ask is how large can the L1 portion of a split TT be before it becomes a problem.

My answer is, I don't know for sure. It depends on the CPU(s) in use and the other code and data that will be accessed. From looking at the graphic above, I believe that, at a minimum, 64kb will still preserve a large fraction of its access time advantage compared to a large L2 TT and the number could be much higher, say 256kb. For numbers larger than this the advantages would rapidly begin to wain. The example I gave in the previous post where the L1 TT was only 4kb in size is probably a bit on the anemic side.

More and larger entries are possible. 2,048 or 4,096 entries of 16-bytes each seems easily doable. This would allow the L1 TT entries to closely mimic the data stored in the L2 TT in most engines. The larger number of entries would allow some information to be retained from a large number of previous 4-ply (or possibly larger) searches.

On a slightly different tack and this infor can be used on any TT not just a split TT:

The main purpose of a “tradition” TT is to save effort. How much effort is save depends on the size of the TT and it replacement policies. Most of the TT implementations I've seen uses depth of search and age as the metrics used in the replacement policy. I'm sure these are NOT the best metrics that can be found for this purpose. Someone, long ago, programming for a computer with more limited resources than most desktops have today, spent a little time to develop a TT replacement policy. It worked, and everyone copied it. Things have change since then, resource are more plentiful, but few have spent any time determining if the old replacement policies can be improved.

An example might help. When playing a game and the opponents last move was a pawn capture or promotion the number of pawns on the board is decrease. It's not possible to ever have more pawns on the board than the number from a previous ply. This means that any TT entry for a position that has more pawns than the current board position is completely worthless. It will never save any effort even if it remains in the TT until the end of the game. The depth of search (effort) used to produce the entry are irrelevant. Likewise, its age is irrelevant. If this information was stored explicitly in the TT entry, it would be an easy matter to determine if any entry is still useful. If it's not then it should be replaced before any other entry in that bucket. This way entries that are still useful will not be displaced in favor of those that have no possibility of ever being useful. Will this save effort? No doubt it will. The idea is sound, or as HGM would say “fundamentally correct”, but that doesn't mean it's useful. Will it be worth the 4 or 6 bits used to encode this information in the TT entries? I have no idea. I haven't tried it. It probably won't make a huge difference. It may not even be noticeable. The only way to know for sure is to test it.

Another example. Depth of search is needed to determine if the entry can be used in the current search. It's also often used as a metric in the replacement policy. The problem with this use is that not all n-ply search represent the same amount of effort. One 8-ply search might have required only 1,00 nodes to be searched while another may have required over 50,000 nodes. Current TT's don't contain enough information to make a distinction between the two. An entry that required searching 10,000 nodes that has been sitting unused is less valuable than an entry that took 3,000 nodes to produce but has produced 10 cut-offs. In order to make these distinctions requires more information than most TT entries contain. Further more, a node that took 1,000,000 nodes to produce and has cause 1,000 cut-offs can be discarded immediately if it has too many pawns.

There is one more aspect to placing more information in each entry and it's likely that it's more important. There is currently no quick and easy way to determine if one cache replacement policy is better than another. The current method could require 200,000 test games to determine which is better. This is a lot of work. With the proper information stored in each entry it should be easy enough to calculate the total number of nodes the TT represents and get an idea of how much effort has already been saved and therefore how effect the cache is likely to be in the near future. The procedure is quite simple.

1. Run the engine on a set of test positions
2. After the node count reaches 30 times the number of TT entries stop and do a node count and count of cut-off*Node count of all valid TT entries.
3. Change the replacement policy
4. Re-run the test positions with the new replacement policy.
5. Compare the results.

Depending on the size of the test suite the test can be run many times in less time than a set of test games can be played.

Increasing the size of each entry will affect the number of entries that can be stored in a TT of a given size. But it's easy enough to test if this is a worth while trade once you have a much more direct way to measure how the TT is performing. This testing method can even be used to evaluate the effects of the split TT I've been advocating.

Regards,

Forrest

Zenmastur · Post by **Zenmastur** » Tue Jun 07, 2016 9:46 pm

On a NUMA machine the huge reduction in memory access caused by L1 TT's can cut the NUMA traffic by the same factor or greater. Greater in machines where the radius is greater than 1. This should increase the scaling factor of high core machines compared to using a traditional TT.

The other thing I for got to mention is the L1 TT can populate the L2 TT with entries without using huge amounts of memory band width. Its not hard to do if you give it a little though and plan ahead.

Regards,

Forrest

Zenmastur · Post by **Zenmastur** » Tue Jun 07, 2016 10:40 pm

I would also be interested in hearing what Bob Hyatt has to say on the subject, if he would be so kind.

Regards,

Forrest

bob · Post by **bob** » Tue Jun 07, 2016 11:10 pm

Zenmastur wrote:I would also be interested in hearing what Bob Hyatt has to say on the subject, if he would be so kind.

Regards,

Forrest

I tried several options. but the basic one-level hash table has worked best for me. Of course, I don't hash in the q-search, which significantly reduces the total number of probes...

And then there are large and huge pages if you are concerned about the TLB..

hgm · Post by **hgm** » Tue Jun 07, 2016 11:12 pm

I brought up the idea of a 'TT cache' fitting in a cache level close to the CPU before. But it is not so easy to find a function for it that does not simply duplicates what the cache-controller hardware is already doing. One application would be to store a depth there that would not go into the main table. E.g. if with a single table you would not store QS nodes, you could send these nodes to a small second table, and probe there from QS to.

Another technique to reduce TLB misses could be 'key focusing'. This is a technique that can easily backfire, though. The idea is to make the key part from which the page number is derived less random, so that 'similar' positions (i.e. from the same search tree) tend to map to the same page. E.g. if the page to which the index maps would only depend on the white pieces (all black Zobrist keys would have zeros there), positions where white null-moves would stay in the same page. This doesn't help in partd sof the tree where black nul-moves, however. But you could also make the page only dependent on the Pawn structure, so that you stay in the same page as long as there are no Pawn moves. Or in the same group of pages, small enough to only need a fraction of the TLB cache.

Problem is that when you focus too much, you will be using only a small fraction of the TT. You want to focus so much that a typical search tree will cover the entire TT, but that many sub-trees in it stay confined to a small sub-section of it, so that there is a clear working set much smaller than the entire TT.

Suggestion for hash tables and analysis.

Re: Suggestion for hash tables and analysis.

Re: Suggestion for hash tables and analysis.

Re: Suggestion for hash tables and analysis.

Re: Suggestion for hash tables and analysis.

Re: Suggestion for hash tables and analysis.

Re: Suggestion for hash tables and analysis.

Re: A couple of things I forgot to mention...

Re: Bob Hyatt

Re: Bob Hyatt

Re: Suggestion for hash tables and analysis.