Is querying the hash tables such a huge bottleneck?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Is querying the hash tables such a huge bottleneck?

Post by hgm »

Well, actually you said a bit more. For instance that you were not wrong, while you obviously were...

Apparently "long" is 32 bit on 64-bit Windows. Because "64-bit Windows" is of course MicroSoft Windows + MicroSoft Visual C++. So what you can get on Windows by using ported software like gcc, or running Linux in a virtual box, etc., is really not relevant at all. I can use cross compilers on a 64-bit windows for whicl "long" is 16 bits, if I want to...
User avatar
Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

Re: Is querying the hash tables such a huge bottleneck?

Post by Kempelen »

What about using the CPU prefetch instruction when entering the node, a few instructions before the tt-probe call? Has anybody tested whether this saves time?
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Is querying the hash tables such a huge bottleneck?

Post by hgm »

The problem with prefetch is always: what would you want to do between the prefetch and the actual probe that needs to be done anyway if the probe is a hit that causes a cutoff? In all nodes you could issue prefetches for the _next_ move you want to search, so that you can search the current move (whose hash entry was already prefetched on the previous move).

Problem is that normal memory reads have priority over prefetches, and the number of outstanding prefetches is very limited. So if memory bandwidth is heavily used, the prefetches might never be executed.
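
(A minimal sketch of that next-move prefetch idea; the table, the hash_after() helper, and the key update are made-up toy stand-ins, not engine code. The point is only where __builtin_prefetch sits relative to the recursion.)

Code:

#include <stdio.h>
#include <stdint.h>

#define TT_SIZE (1u << 16)              /* entries; power of two (toy size) */

typedef struct { uint64_t key; int depth, score; } TTEntry;
static TTEntry tt[TT_SIZE];

/* hypothetical helper: hash key of the position reached after 'move' */
static uint64_t hash_after(uint64_t key, int move)
{
    return key ^ (0x9e3779b97f4a7c15ull * (uint64_t)(move + 1));
}

static void search_moves(uint64_t key, const int *moves, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        /* start fetching the entry the NEXT child will probe ... */
        if (i + 1 < n)
            __builtin_prefetch(&tt[hash_after(key, moves[i + 1]) & (TT_SIZE - 1)]);

        /* ... then make moves[i], recurse (which probes tt for that
           position), and unmake; the recursion hides the latency */
    }
}

int main(void)
{
    int moves[4] = { 11, 22, 33, 44 };
    search_moves(0x123456789abcdefull, moves, 4);
    puts("done");
    return 0;
}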
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is querying the hash tables such a huge bottleneck?

Post by bob »

hgm wrote:Well, actually you said a bit more. For instance that you were not wrong, while you obviously were...

Apparently "long" is 32 bit on 64-bit Windows. Because "64-bit Windows" is of course MicroSoft Windows + MicroSoft Visual C++. So what you can get on Windows by using ported software like gcc, or running Linux in a virtual box, etc., is really not relevant at all. I can use cross compilers on a 64-bit windows for whicl "long" is 16 bits, if I want to...
Again, who says "Windows = Windows + MSVC"??? Does that mean "Linux = Linux + gcc" only??? If so, why, and who made that "rule"?

I'm not using a cross-compiler version of gcc. Just a native compiler for X86_64. If there is _one_ example of a poor technical decision (long = 32 bits on a 64 bit architecture) that's ok. But it _is_ a poor decision. It would be just as poor as using int=16 bits on a 32 bit box.

But, for the record, again, I can find _no_ example of a 64 bit architecture + 64 bit operating system + 64 bit compiler, where long = 32 bits, with the sole qualification of not having tested any version of MSVC. If there is that one single exception, out of dozens of alternatives, that's fine. But _clearly_ long != 32 bits under _windows_. long might be equal to 32 bits under windows + MSVC. But I did not see any MSVC qualification in the original statement, and my quick test showed that on my vista-64 windows box at home, long = 64 bits as expected... And there are lots of folks here using various ports of gcc under windows, they pipe up regularly with questions.

Not sure what else I could add beyond the facts I gave...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is querying the hash tables such a huge bottleneck?

Post by bob »

hgm wrote:The problem with prefetch is always: what would you want to do between the prefetch and the actual probe that needs to be done anyway if the probe is a hit that causes a cutoff? In all nodes you could issue prefetches for the _next_ move you want to search, so that you can search the current move (whose hash entry was already prefetched on the previous move).

Problem is that normal memory reads have priority over prefetches, and the number of outstanding prefetches is very limited. So if memory bandwidth is heavily used, the prefetches might never be executed.
I think there is more to it than that.

(1) when you prefetch, you must, by definition, dislodge something already in cache. And depending on your program, you might not need the prefetched data. For example, in crafty, I do a repetition check first, for obvious reasons. If that gets a hit, I don't need to do a hash probe. If I did the pre-fetch as early as possible (first instruction upon entering Search()) then I dislodge something that was already in cache, and it was there because it was used previously, and replace it with something that might not be needed at all.

(2) there is _always_ a savings when doing a pre-fetch, when you actually need the data. You might not reduce the latency to zero, but you do reduce it some, which is a win. But you also lose if you don't need the data as previously mentioned.

(3) the _real_ savings is harder to obtain, where you would like to drive the latency to zero. Memory is so slow, you need to do the prefetch early enough to ensure it is already in cache by the time you need it. For hashing, this has a significant potential cost. Random access means many TLB misses. Which translates into up to 4 extra memory references to translate virtual to real. So you need to move that prefetch not just back to the previous ply, but much farther back. And then you have to hope that (a) your pre-fetch doesn't dislodge something you need and (b) that the pre-fetched data survives in cache long enough so that you get a hit when you need it. We had a similar facility on the older Crays, and it was not that uncommon to do a pre-fetch, then access something else that dislodges the pre-fetched data due to a cache miss, followed by the actual reference causing another cache miss. In short, two _extra_ cache misses, so that rather than eliminating one, you add one.

For timing analysis, Crafty, on my 2.5ghz core-2 searches roughly 4M nodes per second (one cpu). Or about 632 cycles per node (250 ns per node). About 90% of Crafty's nodes are in the q-search where there is no hashing, and where lots of extra work is done (evaluate() mainly). In the normal search, a node takes more like 25ns. Memory latency ends up in the 100-300ns range depending on tlb misses, which becomes a bit painful because you have to do the pre-fetch far enough back in the tree that there is no way to know what needs to be pre-fetched, and you would greatly exceed the max number of prefetches pending in any case since this is a tree...

Making pre-fetch work in any real application has proven to be quite difficult, overall. You certainly can't trust the compiler to pull this off except for toy programs.
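
(A self-contained sketch of that placement dilemma, with stub routines standing in for the real repetition check and hash probe; none of the names are Crafty's.)

Code:

#include <stdint.h>
#include <stdio.h>

#define TT_BITS    16
#define DRAW_SCORE 0

typedef struct { uint64_t key; int score; } TTEntry;
static TTEntry tt[1u << TT_BITS];

typedef struct { uint64_t hash_key; int ply; } Position;

/* stubs standing in for the real engine routines */
static int is_repetition(const Position *p) { return p->ply >= 100; }
static TTEntry *tt_entry_addr(uint64_t key) { return &tt[key & ((1u << TT_BITS) - 1)]; }

static int search(Position *pos, int alpha, int beta, int depth)
{
    TTEntry *e;

    /* Option A: prefetch at node entry. Hides the most latency, but if the
       repetition check below hits, a cache line was displaced for data
       that is never used. */
    __builtin_prefetch(tt_entry_addr(pos->hash_key));

    if (is_repetition(pos))
        return DRAW_SCORE;

    /* Option B: prefetch here instead. Safe, but there is almost nothing
       left to do before the probe itself, so almost no latency is hidden. */
    e = tt_entry_addr(pos->hash_key);
    if (e->key == pos->hash_key)
        return e->score;                  /* toy "hash hit" */

    (void)alpha; (void)beta; (void)depth; /* real code would generate moves and recurse */
    return 0;
}

int main(void)
{
    Position p = { 0xdeadbeefcafef00dull, 3 };
    printf("score = %d\n", search(&p, -32000, 32000, 8));
    return 0;
}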
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Is querying the hash tables such a huge bottleneck?

Post by Carey »

bob wrote:
Carey wrote:
bob wrote:Did you _read_ my last comment? Windows vista 64 bits + 64 bit gcc gives the expected sizeof(long) = 8. So I _did_ test windows. I don't use the microsoft C++ compiler so I have no idea what they do. But anyone that uses sizeof(long)=anything_other_than_8 on a 64 bit architecture is certainly brain-dead. But MS has never let that stop them in the past...

BTW the crafty code you quoted has been there since 1995 or so. More recent versions use long on HAS_64BITS architectures with Linux. I've never changed the windows stuff since __int64 has been around since 1995 as well...
Bob,

You might want to read the C standard documents sometime, especially the C99 one, which actually talks about 64-bit integers.
I actually have looked at this. Many times. And yes, I am aware that the standard does not say what "long" means, exactly, just that it means >= int in all cases.

However, for as long as I can remember, "long" has always been "the longest integer data type the hardware supports." Which is both logical and rational. Cray has always had int = long since it is a 64 bit architecture. With the quirk that pointers were always 32 bits because the a-registers (used for address computations) were 32 bits (eventually, early on they were 24 bits).
In the old days of 32 bit systems, that was fairly true. But it was more convention than requirement because C didn't mandate anything back then.

But when you go beyond 32 bits, the traditions break down and that's why C99 made some rules.
But for "long" on true 64 bit systems, I have yet to find a single case where long is < 64 bits, except, apparently some version of the microsoft C compiler (I have not verified which versions, if any, or if all, follow this rather odd practice). I did, long ago, verify that at least for linux, on any 64 bit platform I have tried, long = 64 bits. For X86_64, both gcc and icc follow this. And on my home windows vista 64 box, gcc does the same.

Not much more to say on the subject. Yes, the "standard" is lousy in many ways, such as why is an int always signed by default, while a char may be signed or unsigned at the whim of the compiler writer? "long long" was a non-standard extension that I used in 1995 in the first versions of Crafty when running on a 64 bit sun Sparc as well as on 32 bit X86/linux boxes.
Again, tradition. When the C89 standard was developed, their mandate was to "codify existing practice". They had to take what was already being done and make it law. And they had to break as little existing code as possible. They didn't have much room to be creative or even fix flaws in K&R C.

Signed vs. unsigned char again goes back to the existing implementations they had to deal with back then.

(Also, as a side point, *ONLY* unsigned char is guaranteed to copy all the unmodified bits of a byte. Signed char doesn't, because it can normalize +0 and -0 on hardware that doesn't know that zero doesn't actually have a sign. And only unsigned char is guaranteed to actually copy all the bits, whereas signed char can ignore some. As I said... C89 had to deal with a lot of weird existing implementations and hardware, and trying to do that resulted in some weird wording in the standard.)



short is at least 16 bits.

int is defined as the natural word size. It is *not* required to be 32 or 64 bits; it can even be 16 bits. It's up to whatever the compiler writer chooses. Implementation defined. On a 32-bit compiler it should be 32 bits, of course. On a 64-bit system, 64-bit ints would be the more appropriate size.

long is defined as at least 32 bits. On a 64 bit system, it can be either 32 or 64 bits.

long long is defined as at least 64 bits.


So it is entirely possible for 'int' to be 16 bits, 'long' 32 bits, and 'long long' 64 bits. Not likely, of course, but possible and standard conforming.


That's why the C99 standard set up the stdint.h header: to provide a nice, portable way for the programmer to be guaranteed the right size integers when they need them.

C89 doesn't even know about 64-bit integers; the compiler writer is on his own and the programmer should actually check to make sure. In that case, 'int' is often 32 bits and 'long' 64 bits. But that violates the C convention / requirement that 'int' is the machine word size.
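
(For example, a minimal sketch of the C99 approach just described, using <stdint.h> and <inttypes.h> so the widths no longer depend on what a given compiler decided 'int' or 'long' should be.)

Code:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    int32_t  score    = -42;                           /* exactly 32 bits everywhere */
    uint64_t bitboard = UINT64_C(0x00FF00000000FF00);  /* exactly 64 bits everywhere */

    printf("score    = %" PRId32 "\n", score);
    printf("bitboard = 0x%016" PRIX64 "\n", bitboard);
    printf("sizeof(long) here is %u bits, but the types above never change\n",
           (unsigned)(sizeof(long) * 8));
    return 0;
}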


Of course, we've had this discussion before, and I'm well aware you don't like the C99 standard because not all compilers (<cough> Microsoft <cough>) support it.



But don't go saying that 'long' *must* be 64 bits, because it doesn't have to be.

I do not believe I said "it must be 64 bits". I rather said that it has been 64 bits on _every_ 64 bit architecture and compiler combination I have tried. You can look at the Crafty Makefile to see just how comprehensive that list of architectures and O/S choices is. Apparently there is / was / could-be some MS compiler that is/was brain-dead...
Maybe you didn't say 'must', but it sounded like it. It certainly sounded like 'always', though.

But I will admit that integer sizes etc. are something that really annoys me. C says what they are, and it seems like people are still going around assuming they are what they were 20 years ago on a 16-bit PC.

People are still casting pointers to 'int' or 'long' even though C99 provides a guaranteed data type to hold it. It's stupid, but people do it. Just habit. (I guess that really says it. Just like when people moved from 16 bit DOS with 'huge' pointers to 32 bits, things were a bit painful. Now that everybody is comfortable with 32 bits, they are starting to take those habits to 64 bits.)

It's just such a habit to use 'short', 'int', and 'long' that they just don't bother to use data types that are actually guaranteed to be the size they need, even if there is a chance that a too small or big size could crash the program.
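
(A tiny sketch of the guaranteed type in question: uintptr_t from <stdint.h>, which is wide enough to round-trip a pointer, unlike int or long on some ABIs. It is technically optional in C99 but provided by every mainstream compiler.)

Code:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int x = 7;

    uintptr_t bits = (uintptr_t)&x;   /* guaranteed wide enough to hold a pointer */
    int *back = (int *)bits;          /* and to round-trip back to the same pointer */
    printf("%d\n", *back);

    /* long would silently truncate the address on an LLP64 system,
       where pointers are 64 bits but long is only 32. */
    return 0;
}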

(You've mentioned Cray... I remember reading a few articles from Cray's programmers complaining about the idiotic programming of 32-bit programmers. Such as cases where a 32-bit int with all bits set was equated with -1, which of course failed when Cray tried to port those programs to the 64-bit Cray systems. Or shifting bits around and expecting them to fall off the end, and then comparing them to a constant.)



And you won't get any argument from me about the arrogance of the MS compiler writers. They still refuse to support C99 even. Not can't, but just flatly refuse. I think that if they thought they could get away with it, they'd write a whole new proprietary language, call it C, and claim it was always that way.
And that would even violate tradition and break a large number of programs that expect it to be 32 bits. (Yes yes, I know programmers shouldn't write a program that expects 'int' or 'long' to be a certain size without first checking, but many programs have done so in the past 25 years. Very few people these days actually bother to check to make sure a var type is the right size. They just assume 'int' is 32 bits and 'long' is probably 32 bits too.)


I'm not saying you should change your code because it's wrong or ugly or doesn't work or anything else. I'm just saying that 'long' does *NOT* have to be 64 bits. The C standard is pretty clear about that.

If you want to say that 'long' is 64 bits on the compilers you use and so that's what you program for, then that's a perfectly fine statement.
Actually, I think that _is_ what I said. :)
Actually.... no. You didn't.

The problem is that by saying 'long' is 64 bits, you leave a gap at 32 bits.

C has always had the concept that 'int' is the natural word size of the program. Whether it's a 16 bit, 32 bit, 36 bit, 60 bit, 64 bit, etc. processor. As long as it's at least 16 bits.

If you make 'int' 32 bits on a 64 bit system, then you just violate 30 years of C tradition. It is allowed, but it's not.... 'expected' behavior. It's like making a compiler where 'int' is 27 bits on a 32 bit system. Sure you can do it, but you shouldn't.

'int' is supposed to be a floating size, determined by whatever is most appropriate for the hardware: the natural register word size that the cpu operates on most efficiently.

That's why C99 developed stdint.h. So the programmer can depend on the sizes of integer types.

64 bit C compilers started showing up (from Cray etc.) before C99 was ratified. So many compiler writers just came up with their own solution. It would have been nice if C99 was C90 and avoided a lot of that, but of course that didn't happen.

The C99 standard provides these things to guarantee sizes to the programmer, so that no matter what compiler or what hardware is being used, they know for sure exactly what they are getting. Why not use them???

Sure, you can go by convention (i.e.: I tried it on the systems I have, so it'll work everywhere!), or you can use the explicitly sized types provided by C99, guarantee it'll work, and at the same time make sure that readers understand that the var you just defined is supposed to be a certain size and not whatever the compiler writer chose.


As I've said before, this is just something that really irritates me. C gives you the tools to do it right, but most don't bother. Most don't even bother to explicitly indicate what bit size is needed for their vars. If a 'long' is good enough then they don't bother to document the bit size or do any checks to make sure it's not too small or too big.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is querying the hash tables such a huge bottleneck?

Post by bob »

Carey wrote:
bob wrote:
Carey wrote:
bob wrote:Did you _read_ my last comment? Windows vista 64 bits + 64 bit gcc gives the expected sizeof(long) = 8. So I _did_ test windows. I don't use the microsoft C++ compiler so I have no idea what they do. But anyone that uses sizeof(long)=anything_other_than_8 on a 64 bit architecture is certainly brain-dead. But MS has never let that stop them in the past...

BTW the crafty code you quoted has been there since 1995 or so. More recent versions use long on HAS_64BITS architectures with Linux. I've never changed the windows stuff since __int64 has been around since 1995 as well...
Bob,

You might want to read the C standard documents sometime, especially the C99 one, which actually talks about 64-bit integers.
I actually have looked at this. Many times. And yes, I am aware that the standard does not say what "long" means, exactly, just that it means >= int in all cases.

However, for as long as I can remember, "long" has always been "the longest integer data type the hardware supports." Which is both logical and rational. Cray has always had int = long since it is a 64 bit architecture. With the quirk that pointers were always 32 bits because the a-registers (used for address computations) were 32 bits (eventually, early on they were 24 bits).
In the old days of 32 bit systems, that was fairly true. But it was more convention than requirement because C didn't mandate anything back then.

But when you go beyond 32 bits, the traditions break down and that's why C99 made some rules.
But for "long" on true 64 bit systems, I have yet to find a single case where long is < 64 bits, except, apparently some version of the microsoft C compiler (I have not verified which versions, if any, or if all, follow this rather odd practice). I did, long ago, verify that at least for linux, on any 64 bit platform I have tried, long = 64 bits. For X86_64, both gcc and icc follow this. And on my home windows vista 64 box, gcc does the same.

Not much more to say on the subject. Yes, the "standard" is lousy in many ways, such as why is an int always signed by default, while a char may be signed or unsigned at the whim of the compiler writer? "long long" was a non-standard extension that I used in 1995 in the first versions of Crafty when running on a 64 bit sun Sparc as well as on 32 bit X86/linux boxes.
Again, tradition. When the C89 standard was developed, their mandate was to "codify existing practice". They had to take what was already being done and make it law. And they had to break as little existing code as possible. They didn't have much room to be creative or even fix flaws in K&R C.

Signed vs. unsigned char again goes back to the existing implementations they had to deal with back then.

(Also, as a side point, *ONLY* unsigned char is guaranteed to copy all the unmodified bits of a byte. Signed char doesn't, because it can normalize +0 and -0 on hardware that doesn't know that zero doesn't actually have a sign. And only unsigned char is guaranteed to actually copy all the bits, whereas signed char can ignore some. As I said... C89 had to deal with a lot of weird existing implementations and hardware, and trying to do that resulted in some weird wording in the standard.)
That is a new one on me. I can't think of a single platform I have worked on in the past 40 years that had issues with a -0, because that doesn't exist in 2's complement. In the days of packed decimal, yes. But not for an 8-bit char (and of course not all machines even had 8-bit chars; Univac/CDC used 6 bits, as an example).

There were machines that would "sign extend" when moving a character to a word, and some that could not do so without explicit programming steps, which was what I had assumed led to the great signed/unsigned debacle.



short is at least 16 bits.

int is defined as the natural word size. It is *not* required to be 32 or 64 bits; it can even be 16 bits. It's up to whatever the compiler writer chooses. Implementation defined. On a 32-bit compiler it should be 32 bits, of course. On a 64-bit system, 64-bit ints would be the more appropriate size.

long is defined as at least 32 bits. On a 64 bit system, it can be either 32 or 64 bits.

long long is defined as at least 64 bits.


So it is entirely possible for 'int' to be 16 bits, 'long' 32 bits, and 'long long' 64 bits. Not likely, of course, but possible and standard conforming.


That's why the C99 standard set up the stdint.h header: to provide a nice, portable way for the programmer to be guaranteed the right size integers when they need them.

C89 doesn't even know about 64-bit integers; the compiler writer is on his own and the programmer should actually check to make sure. In that case, 'int' is often 32 bits and 'long' 64 bits. But that violates the C convention / requirement that 'int' is the machine word size.


Of course, we've had this discussion before, and I'm well aware you don't like the C99 standard because not all compilers (<cough> Microsoft <cough>) support it.



But don't go saying that 'long' *must* be 64 bits, because it doesn't have to be.

I do not believe I said "it must be 64 bits". I rather said that it has been 64 bits on _every_ 64 bit architecture and compiler combination I have tried. You can look at the Crafty Makefile to see just how comprehensive that list of architectures and O/S choices is. Apparently there is / was / could-be some MS compiler that is/was brain-dead...
Maybe you didn't say 'must', but it sounded like it. It certainly sounded like 'always', though.

But I will admit that integer sizes etc. are something that really annoys me. C says what they are, and it seems like people are still going around assuming they are what they were 20 years ago on a 16-bit PC.

People are still casting pointers to 'int' or 'long' even though C99 provides a guaranteed data type to hold it. It's stupid, but people do it. Just habit. (I guess that really says it. Just like when people moved from 16 bit DOS with 'huge' pointers to 32 bits, things were a bit painful. Now that everybody is comfortable with 32 bits, they are starting to take those habits to 64 bits.)

It's just such a habit to use 'short', 'int', and 'long' that they just don't bother to use data types that are actually guaranteed to be the size they need, even if there is a chance that a too small or big size could crash the program.

(You've mentioned Cray... I remember reading a few articles from Cray's programmers complaining about the idiotic programming of 32-bit programmers. Such as cases where a 32-bit int with all bits set was equated with -1, which of course failed when Cray tried to port those programs to the 64-bit Cray systems. Or shifting bits around and expecting them to fall off the end, and then comparing them to a constant.)



And you won't get any argument from me about the arrogance of the MS compiler writers. They still refuse to support C99 even. Not can't, but just flatly refuse. I think that if they thought they could get away with it, they'd write a whole new proprietary language, call it C, and claim it was always that way.
Fortunately, I don't live in "that" world. :)
And that would even violate tradition and break a large number of programs that expect it to be 32 bits. (Yes yes, I know programmers shouldn't write a program that expects 'int' or 'long' to be a certain size without first checking, but many programs have done so in the past 25 years. Very few people these days actually bother to check to make sure a var type is the right size. They just assume 'int' is 32 bits and 'long' is probably 32 bits too.)


I'm not saying you should change your code because it's wrong or ugly or doesn't work or anything else. I'm just saying that 'long' does *NOT* have to be 64 bits. The C standard is pretty clear about that.

If you want to say that 'long' is 64 bits on the compilers you use and so that's what you program for, then that's a perfectly fine statement.
Actually, I think that _is_ what I said. :)
Actually.... no. You didn't.
Check this out:
bob wrote: I am using both Intel and GCC on my 64 bit linux boxes, and I can _guarantee_ you that a "long" is 64 bits:

Same code on windows produces the same results for 64 bit Windows Vista running gcc (my home box).. I don't have any 32 bit machines but did in the past, and there "long" was 32 bits as was "int"...
The original quote I responded to was this:
long is 32-bit even with a 64-bit version of Windows.
I certainly gave an example where that is false, because I tested a 64-bit version of Windows Vista with gcc and long = 64 bits, as expected.

The problem is that by saying 'long' is 64 bits, you leave a gap at 32 bits.

C has always had the concept that 'int' is the natural word size of the program. Whether it's a 16 bit, 32 bit, 36 bit, 60 bit, 64 bit, etc. processor. As long as it's at least 16 bits.

If you make 'int' 32 bits on a 64 bit system, then you just violate 30 years of C tradition. It is allowed, but it's not.... 'expected' behavior. It's like making a compiler where 'int' is 27 bits on a 32 bit system. Sure you can do it, but you shouldn't.
Unfortunately, this is a dead horse, since x86-64 already violates this: int is still 32 bits while long is 64 (excepting MSVC, apparently).


'int' is supposed to be a floating size, determined by whatever is most appropriate for the hardware: the natural register word size that the cpu operates on most efficiently.
Which would be 64 bits on x86-64 of course, but we don't have that sane approach anywhere. I have not tested Cray compilers on x86-64 architectures; it would be interesting to see what they did, since on the original Cray-1 architecture (and its successors through the T90) int = long = 64 bits.

That's why C99 developed stdint.h. So the programmer can depend on the sizes of integer types.

64 bit C compilers started showing up (from Cray etc.) before C99 was ratified. So many compiler writers just came up with their own solution. It would have been nice if C99 was C90 and avoided a lot of that, but of course that didn't happen.

The C99 standard provides these things to guarantee sizes to the programmer, so that no matter what compiler or what hardware is being used, they know for sure exactly what they are getting. Why not use them???

Sure, you can go by convention (i.e.: I tried it on the systems I have, so it'll work everywhere!), or you can use the explicitly sized types provided by C99, guarantee it'll work, and at the same time make sure that readers understand that the var you just defined is supposed to be a certain size and not whatever the compiler writer chose.


As I've said before, this is just something that really irritates me. C gives you the tools to do it right, but most don't bother. Most don't even bother to explicitly indicate what bit size is needed for their vars. If a 'long' is good enough then they don't bother to document the bit size or do any checks to make sure it's not too small or too big.
One reason for not using them is compatibility. There are a zillion compilers in existence today, and most are not C99 compatible, which would mean some ugly #ifdef stuff (Crafty already has enough just to deal with the MS C++ flaws), not to mention that different systems have different libraries needing different #includes.
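
(A short sketch of the kind of #ifdef being alluded to; the BITBOARD name and the macro tests are illustrative assumptions, not Crafty's actual source.)

Code:

#if defined(_MSC_VER)
typedef unsigned __int64 BITBOARD;     /* MSVC extension, available since the 1990s */
#elif defined(__GNUC__)
typedef unsigned long long BITBOARD;   /* gcc/icc: long long is 64 bits */
#else
#include <stdint.h>
typedef uint64_t BITBOARD;             /* C99 fallback */
#endif

#include <stdio.h>

int main(void)
{
    BITBOARD b = (BITBOARD)1 << 63;    /* needs all 64 bits */
    printf("sizeof(BITBOARD) = %u, top bit %s\n",
           (unsigned)sizeof(BITBOARD), b ? "set" : "lost");
    return 0;
}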
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Is querying the hash tables such a huge bottleneck?

Post by Gerd Isenberg »

bob wrote:
hgm wrote:The problem with prefetch is always: what would you want to do between the prefetch and the actual probe that needs to be done anyway if the probe is a hit that causes a cutoff? In all nodes you could issue prefetches for the _next_ move you want to search, so that you can search the current move (whose hash entry was already prefetched on the previous move).

Problem is that normal memory reads have priority over prefetches, and the number of outstanding prefetches is very limited. So if memory bandwidth is heavily used, the prefetches might never be executed.
I think there is more to it than that.

(1) when you prefetch, you must, by definition, dislodge something already in cache. And depending on your program, you might not need the prefetched data. For example, in crafty, I do a repetition check first, for obvious reasons. If that gets a hit, I don't need to do a hash probe. If I did the pre-fetch as early as possible (first instruction upon entering Search()) then I dislodge something that was already in cache, and it was there because it was used previously, and replace it with something that might not be needed at all.

(2) there is _always_ a savings when doing a pre-fetch, when you actually need the data. You might not reduce the latency to zero, but you do reduce it some, which is a win. But you also lose if you don't need the data as previously mentioned.

(3) the _real_ savings is harder to obtain, where you would like to drive the latency to zero. Memory is so slow, you need to do the prefetch early enough to ensure it is already in cache by the time you need it. For hashing, this has a significant potential cost. Random access means many TLB misses. Which translates into up to 4 extra memory references to translate virtual to real. So you need to move that prefetch not just back to the previous ply, but much farther back. And then you have to hope that (a) your pre-fetch doesn't dislodge something you need and (b) that the pre-fetched data survives in cache long enough so that you get a hit when you need it. We had a similar facility on the older Crays, and it was not that uncommon to do a pre-fetch, then access something else that dislodges the pre-fetched data due to a cache miss, followed by the actual reference causing another cache miss. In short, two _extra_ cache misses, so that rather than eliminating one, you add one.

For timing analysis, Crafty, on my 2.5ghz core-2 searches roughly 4M nodes per second (one cpu). Or about 632 cycles per node (250 ns per node). About 90% of Crafty's nodes are in the q-search where there is no hashing, and where lots of extra work is done (evaluate() mainly). In the normal search, a node takes more like 25ns. Memory latency ends up in the 100-300ns range depending on tlb misses, which becomes a bit painful because you have to do the pre-fetch far enough back in the tree that there is no way to know what needs to be pre-fetched, and you would greatly exceed the max number of prefetches pending in any case since this is a tree...

Making pre-fetch work in any real application has proven to be quite difficult, overall. You certainly can't trust the compiler to pull this off except for toy programs.
OK, when you don't hash in q-search, I agree. Otherwise, you can use the prefetch intrinsic even after the rep-check, if you have something else to do. I do attack generation and sets of legal move targets, and even some eval preparation in SSE2: register-intensive Kogge-Stone stuff, with up to 14 of the 128-bit xmm registers used via a C++ wrapper, no lookups, only a little stack traffic. Then, assuming you always hash, prefetch gives a boost.
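
(A structural sketch of that scheme: issue the prefetch, do register-only work, then probe. A plain 64-bit Kogge-Stone north fill stands in for the real SSE2 code, and the table and key are toy stand-ins.)

Code:

#include <stdint.h>
#include <stdio.h>

#define TT_MASK 0xFFFFu
typedef struct { uint64_t key; int score; } TTEntry;
static TTEntry tt[TT_MASK + 1];

/* stand-in for register-only work: a Kogge-Stone style north fill */
static uint64_t north_fill(uint64_t gen)
{
    gen |= gen << 8;
    gen |= gen << 16;
    gen |= gen << 32;
    return gen;
}

static int probe_with_overlap(uint64_t key, uint64_t pawns)
{
    uint64_t targets;
    TTEntry *e;

    __builtin_prefetch(&tt[key & TT_MASK]);   /* start the fetch early           */

    targets = north_fill(pawns);              /* no memory traffic while waiting */

    e = &tt[key & TT_MASK];                   /* by now the line should be warm  */
    return (e->key == key) ? e->score : (int)(targets & 0xFF);
}

int main(void)
{
    printf("%d\n", probe_with_overlap(0x1234567890abcdefull, 0xFF00ull));
    return 0;
}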
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is querying the hash tables such a huge bottleneck?

Post by bob »

Gerd Isenberg wrote:
bob wrote:
hgm wrote:The problem with prefetch is always: what would you want to do between the prefetch and the actual probe that needs to be done anyway if the probe is a hit that causes a cutoff? In all nodes you could issue prefetches for the _next_ move you want to search, so that you can search the current move (whose hash entry was already prefetched on the previous move).

Problem is that normal memory reads have priority over prefetches, and the number of outstanding prefetches is very limited. So if memory bandwidth is heavily used, the prefetches might never be executed.
I think there is more to it than that.

(1) when you prefetch, you must, by definition, dislodge something already in cache. And depending on your program, you might not need the prefetched data. For example, in crafty, I do a repetition check first, for obvious reasons. If that gets a hit, I don't need to do a hash probe. If I did the pre-fetch as early as possible (first instruction upon entering Search()) then I dislodge something that was already in cache, and it was there because it was used previously, and replace it with something that might not be needed at all.

(2) there is _always_ a savings when doing a pre-fetch, when you actually need the data. You might not reduce the latency to zero, but you do reduce it some, which is a win. But you also lose if you don't need the data as previously mentioned.

(3) the _real_ savings is harder to obtain, where you would like to drive the latency to zero. Memory is so slow, you need to do the prefetch early enough to ensure it is already in cache by the time you need it. For hashing, this has a significant potential cost. Random access means many TLB misses. Which translates into up to 4 extra memory references to translate virtual to real. So you need to move that prefetch not just back to the previous ply, but much farther back. And then you have to hope that (a) your pre-fetch doesn't dislodge something you need and (b) that the pre-fetched data survives in cache long enough so that you get a hit when you need it. We had a similar facility on the older Crays, and it was not that uncommon to do a pre-fetch, then access something else that dislodges the pre-fetched data due to a cache miss, followed by the actual reference causing another cache miss. In short, two _extra_ cache misses, so that rather than eliminating one, you add one.

For timing analysis, Crafty, on my 2.5ghz core-2 searches roughly 4M nodes per second (one cpu). Or about 632 cycles per node (250 ns per node). About 90% of Crafty's nodes are in the q-search where there is no hashing, and where lots of extra work is done (evaluate() mainly). In the normal search, a node takes more like 25ns. Memory latency ends up in the 100-300ns range depending on tlb misses, which becomes a bit painful because you have to do the pre-fetch far enough back in the tree that there is no way to know what needs to be pre-fetched, and you would greatly exceed the max number of prefetches pending in any case since this is a tree...

Making pre-fetch work in any real application has proven to be quite difficult, overall. You certainly can't trust the compiler to pull this off except for toy programs.
OK, when you don't hash in q-search, I agree. Otherwise, you can use the prefetch intrinsic even after the rep-check, if you have something else to do. I do attack generation and sets of legal move targets, and even some eval preparation in SSE2: register-intensive Kogge-Stone stuff, with up to 14 of the 128-bit xmm registers used via a C++ wrapper, no lookups, only a little stack traffic. Then, assuming you always hash, prefetch gives a boost.
That doesn't fit my "design" very well. For example, the next thing after the rep check is the hash probe. I don't want to do any move generation stuff prior to that, since if I get a hash move I search it before generating anything. I would agree that if you can move some stuff up, so that you do the prefetch, then the moved code, then the hash probe, the cost of the moved code becomes zero, of course. It makes the overall design a bit messier, also. Cray Blitz used this extensively and it made the code look ridiculous in places, although it was certainly fast as all hell. :)
CThinker
Posts: 388
Joined: Wed Mar 08, 2006 10:08 pm

Re: Is querying the hash tables such a huge bottleneck?

Post by CThinker »

bob wrote: Did you _read_ my last comment? Windows vista 64 bits + 64 bit gcc gives the expected sizeof(long) = 8. So I _did_ test windows. I don't use the microsoft C++ compiler so I have no idea what they do. But anyone that uses sizeof(long)=anything_other_than_8 on a 64 bit architecture is certainly brain-dead. But MS has never let that stop them in the past...
This cannot be correct. No compiler for Windows will be useful if it treats long as 64-bit, because there are plenty of Win32 APIs that have long parameters and expect them to be 32-bit. Whatever test you did is definitely flawed.

Here is the output of the 64-bit gcc (mingw64) and 64-bit LCC running on 64-bit Windows.

Code:

C:\Temp>type test.c
#include "stdio.h"
int main(){
#ifndef _WIN64
error: not 64-bit
#endif
printf("%i",sizeof(long));
}

C:\Temp>\etc\mingw64\bin\gcc.exe test.c

C:\Temp>a.exe
4
C:\Temp>\etc\lcc\bin\lcc64.exe -IC:\Etc\lcc\include64 test.c

C:\Temp>\etc\lcc\bin\lcclnk64.exe test.obj

C:\Temp>test.exe
4
C:\Temp>
And here it is for 64-bit Intel C:

Code:

C:\Temp>type test.c
#include "stdio.h"
int main(){
#ifndef _WIN64
error: not 64-bit
#endif
printf("%i",sizeof(long));
}

C:\Temp>icl test.c
Intel(R) C++ Intel(R) 64 Compiler Professional for applications running on Intel
(R) 64, Version 11.1    Build 20100806 Package ID: w_cproc_p_11.1.067
Copyright (C) 1985-2010 Intel Corporation.  All rights reserved.

test.c
Microsoft (R) Incremental Linker Version 9.00.21022.08
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:test.exe
test.obj

C:\Temp>test.exe
4
C:\Temp>
So there you go: 64-bit GCC, 64-bit LCC, and 64-bit Intel C all have a 32-bit long on Windows.

This should be very easy for anyone to validate. Intel-C is free for 30 days here:
http://software.intel.com/en-us/article ... valuation/

64-bit GCC for Windows is free:
http://mingw-w64-dgn.googlecode.com/fil ... 0101009.7z

64-bit LCC is free:
http://www.q-software-solutions.de/pub/lccwin64.exe