hgm wrote:It is quite normal that hash probes are slow: they need a DRAM access, which takes many hundreds of CPU clock cycles. Probing the same entry twice is not much more expensive, though, as the second time it will be in cache. So this is in accordance with what you see.
Whether your results are reasonable depends on too many things to say. For instance how elaborate your evaluation is. And how often you probe.
Non-alignment of a hash bucket with the cache line can cause extra DRAM accesses, if a single probe touches two cache lines (requiring two DRAM accesses to fetch, possibly triggering prefetch of a third). To check, print the start address of your hash table and see whether it is divisible by 64 (printf("hash start = %p\n", (void *)hashTable);).
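To make that check concrete, here is a minimal stand-alone sketch; the hashTable name is a placeholder, and %p is used instead of the post's %x because %x truncates pointers on 64-bit systems:

```c
#include <stdio.h>
#include <stdint.h>

/* 1 if p starts on a 64-byte cache-line boundary, 0 otherwise. */
static int isCacheAligned(const void *p) {
    return ((uintptr_t)p % 64) == 0;
}

/* Print the table's start address and whether the first bucket
   begins exactly on a cache-line boundary. */
static void reportTableAlignment(const void *hashTable) {
    printf("hash start = %p\n", hashTable);
    printf("64-byte aligned: %s\n",
           isCacheAligned(hashTable) ? "yes" : "no");
}
```

Calling reportTableAlignment() once right after allocating the table answers the question directly.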
bob wrote:
One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
No I didn't do that as I didn't know how to. But now I tried the following according to your explanation:
The situation didn't change, though. Maybe the compiler did align the address properly already.
...
There is a problem in your code. When you add 63 and then mask off the rightmost 6 bits, you are essentially rounding up. You will still use ttable_size bytes, which will now cause you to access memory just beyond the end of the newly-aligned ttable address. Up to 63 bytes beyond, depending on how far off the initial address is (it is supposedly guaranteed to be on an 8-byte boundary according to POSIX).
You need to fudge the size by +63 when you do the malloc().
I'm a bit confused. My code above allocates 64 bytes more than needed. I believe you took ttable->size for the number of bytes, but it is the number of entries, i.e. the power of two that fits the desired TT size in MB.
Given that information, my code should do the same as what was suggested below:
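The quoted code blocks did not survive in this copy of the thread, but the over-allocate-and-align idiom under discussion is presumably along these lines (the TTEntry layout and the TTable field names are hypothetical):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 16-byte entry, just to make the idiom concrete. */
typedef struct { uint64_t key; uint64_t data; } TTEntry;

typedef struct {
    void    *raw;     /* what malloc() returned; needed for free() */
    TTEntry *entries; /* 64-byte-aligned view of the same memory */
    size_t   size;    /* number of entries, a power of two */
} TTable;

int TTableInit(TTable *tt, size_t nEntries) {
    /* Over-allocate by 63 bytes so that rounding the address up
       cannot make the last entry run past the end of the block. */
    tt->raw = malloc(nEntries * sizeof(TTEntry) + 63);
    if (tt->raw == NULL) return -1;
    /* Round the address up to the next multiple of 64. */
    tt->entries = (TTEntry *)(((uintptr_t)tt->raw + 63) & ~(uintptr_t)63);
    tt->size = nEntries;
    return 0;
}
```

On POSIX systems posix_memalign(), or aligned_alloc() in C11, achieves the same thing without the manual pointer arithmetic.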
It would be wise to print out the original address, to see if you should expect an effect. Of course the code to align the address is a no-op if the original allocation was already aligned. Having the code there makes the engine more robust, though: it is very possible that the same executable which gives an aligned allocation on your machine gives an unaligned one when copied to and run on another machine, because a different DLL containing the malloc is installed there.
If your eval is light-weight, it is very likely that nothing is wrong and this is just it. Hash probing is expensive. Micro-Max 1.6 (a simple alpha-beta without hash) does 6Mnps, and its big brother micro-Max 4.8 (which does have a hash table) only does 1Mnps on the same machine. The only way to save on hash probing is then to do it less frequently (e.g. like Crafty, which does not probe in QS).
I'm a bit confused. My code above allocates 64 bytes more than needed. I believe you took ttable->size for the number of bytes, but it is the number of entries, i.e. the power of two that fits the desired TT size in MB.
I misinterpreted the "size" idea, so disregard my comment.
It is late now, but I guess I should go and profile another engine like say Crafty or Stockfish and see if the hash probe takes up that much time on my machine, too.
With regard to my eval (somebody asked about that), it is pretty lightweight and dumb.
Are you sure that sizeof(TTEntry) is a multiple of 64 or divisor of 64? If not, then aligning the base address of the table is not going to help at all.
E.g. if sizeof(TTEntry) were 16 bytes, then exactly 4 of them would fit in each 64-byte cache line. But if sizeof(TTEntry) were 20 bytes, then each cache line would hold 3.2 of them, and every time you probe one you'd have about a 25% chance (4 entries out of every 16) of touching two cache lines instead of one.
On the other hand, if sizeof(TTEntry) were 60 bytes, it would be even worse: 14 out of every 16 entries would cross a cache line boundary. Rounding it up to 64 bytes would fix the problem and only waste 6.25% of the memory. If it were 48 bytes then 2 out of every 4 entries would cross a cache line boundary. Etc.
It looks like this, which should be 4 x 16 = 64 bytes, unless the compiler messes it up.
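The struct itself was not quoted here, but a 16-byte entry of the kind being described might look like the following (the field names and widths are made up for illustration); the _Static_assert lines catch it at compile time if the compiler does add padding:

```c
#include <stdint.h>

/* Hypothetical field layout: 8 + 2 + 2 + 1 + 1 + 2 = 16 bytes,
   so exactly four entries pack into one 64-byte cache line. */
typedef struct {
    uint64_t key;    /* upper hash bits for verification */
    int16_t  score;
    uint16_t move;
    uint8_t  depth;
    uint8_t  flags;  /* bound type: exact / lower / upper */
    uint16_t age;
} TTEntry;

typedef struct { TTEntry slot[4]; } TTBucket; /* one cache line */

/* Fail the build if the compiler "messes it up" with padding. */
_Static_assert(sizeof(TTEntry) == 16, "TTEntry must stay 16 bytes");
_Static_assert(sizeof(TTBucket) == 64, "bucket must fill one cache line");
```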
hgm wrote:
If your eval is light-weight, it is very likely that nothing is wrong and this is just it. Hash probing is expensive. Micro-Max 1.6 (a simple alpha-beta without hash) does 6Mnps, and its big brother micro-Max 4.8 (which does have a hash table) only does 1Mnps on the same machine. The only way to save on hash probing is then to do it less frequently (e.g. like Crafty, which does not probe in QS).
That sounds like a plausible explanation. I might have some room for optimization inside TTableQuery(), but I will assume that I've got no general problem until I have a larger evaluation. That'll be the time to revisit this issue.
You might also consider pre-fetching to L1, and doing some register-intensive computations while "waiting" for the probe.
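On GCC and Clang that idea can be sketched with __builtin_prefetch: issue the prefetch as soon as the child position's hash key is known, then do register-only work (incremental eval update, move generation) while the line travels from DRAM toward L1. The table and mask here are toy placeholders:

```c
#include <stdint.h>

typedef struct { uint64_t key; uint64_t data; } TTEntry;

static TTEntry hashTable[1024];           /* toy table, power-of-two size */
static const uint64_t sizeMask = 1024 - 1;

/* Hint the CPU to start loading the probe's cache line early.
   Arguments: 0 = prefetch for read, 3 = keep in all cache levels. */
static inline void TTPrefetch(uint64_t hashKey) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(&hashTable[hashKey & sizeMask], 0, 3);
#endif
}
```

The prefetch is only a hint, so calling it is always safe; on compilers without the builtin it compiles to nothing.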