Mispredicted branch VS cache miss

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Mispredicted branch VS cache miss

Post by xmas79 »

Hello, question of the day: if you could pick only one, which would you pick for your engine?

A real problem is this: PSTs are usually arrays of [64] values. With a left/right-symmetrical eval (my engine was designed as such), they could be cut down to vectors of [32] values, lowering the pressure on the cache. However, a bit of bit twiddling is needed to mirror the square index, which (I suppose) introduces some hard-to-predict branches. I'm reworking the eval a bit in the hope of starting a tuning session, and I would like to know whether it is worth cutting the size of the PSTs in half. I know it's a balance... a cache miss on each PST access VS a mispredicted branch + a (hoped-for) cache hit on each PST access. In the worst case I would also have the cache miss, so this seems like a non-winner to me...
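To make the trade-off concrete, here is a minimal sketch of the folded lookup I have in mind (hypothetical names, 0-63 square numbering with file = sq & 7):

Code: Select all

/* Left/right-folded PST: 32 entries = 8 ranks x 4 files,
 * with files e..h mirrored onto d..a before the lookup. */
int pstScore32(const int PST32[32], int sq)  /* sq = 0..63 */
{
    int file = sq & 7;
    if (file > 3)            /* the hard-to-predict branch in question */
        file = 7 - file;     /* mirror the file */
    return PST32[(sq >> 3) * 4 + file];
}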



I probably already know all your answers: "use a profiler......", no?
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Mispredicted branch VS cache miss

Post by jdart »

The cache is much bigger than your PSTs, so you will probably find that most of your cache misses are not due to PST access.

And yes, you should use a profiler. If you are on Linux, I have had pretty good experience with oprofile (http://oprofile.sourceforge.net/news/). Intel VTune is also good if you are on the Intel architecture (it exists for Windows & Linux and is free for non-commercial use on Linux).

--Jon
xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Re: Mispredicted branch VS cache miss

Post by xmas79 »

Yes, the cache is much bigger than any PST, but we have a bunch of data that is accessed on every make/unmake/eval, so I don't think everything will fit in any level of cache. E.g. zobrist keys: on each make/unmake a random memory access is performed, so I was thinking that reducing cache misses on the PSTs (which is per se a good thing) could be beneficial somewhere else, for something else that can then stay in cache. Of course, once the accessed data is big enough, everything is guaranteed to be evicted from cache before being reused a second time, and then it becomes an obvious no-no... But I'm still at "it depends on how much data you 'usually' access", which is why I asked.
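What I mean by the random access, in a generic sketch (not my actual code; the table shape is hypothetical):

Code: Select all

#include <stdint.h>

static uint64_t zobrist[16][64];  /* [piece][square] keys, filled with
                                     random numbers at startup */

/* Incremental update on make/unmake: every touched (piece, square)
 * pair costs one read from an essentially random table slot. */
uint64_t updateHash(uint64_t hash, int piece, int from, int to)
{
    hash ^= zobrist[piece][from];  /* remove the piece from its old square */
    hash ^= zobrist[piece][to];    /* put it on its new square */
    return hash;
}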

I tried VTune but unfortunately I couldn't find a way to look at cache misses for my architecture (the option was grayed out). Unfortunately (again), NGN is still on the Windows side of the Force. However, I've already planned to work together with Obi-Wan to bring it back to the Linux side...
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Mispredicted branch VS cache miss

Post by hgm »

Mispredictions are very costly, like 12-14 clock cycles, during which the CPU could have done 3-4 instructions per cycle. For cache misses it depends on which cache you are talking about; L1 misses are still nearly free. Also, using more cache doesn't always mean more cache misses: if your working set is still smaller than the cache size, using extra memory costs nothing at all.

In Shokidoki I use overlapping zobrist keys, so that they take only 1 byte each instead of 8. (Holding 81 keys for each of 2 x 14 piece types starts to become really annoying.) I also use bytes for the PST, as you don't need separate end-game and opening scores when all material always stays on the board.
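The overlapping trick, in a minimal sketch (names hypothetical): keep one random byte per key and load 8 bytes starting at the key's index, so adjacent keys share 7 of their 8 bytes:

Code: Select all

#include <stdint.h>
#include <string.h>

#define NKEYS (30 * 81)                 /* hypothetical key count */
static uint8_t zobristBytes[NKEYS + 7]; /* random bytes, filled at startup */

static uint64_t zobristKey(int i)       /* keys i and i+1 share 7 bytes */
{
    uint64_t key;
    memcpy(&key, zobristBytes + i, sizeof key); /* unaligned-safe 8-byte load */
    return key;
}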

Note that it also isn't necessary to use branches for conditionally flipping the rank. E.g. if you use 0-63 numbering, you could write

PST[-(sqr > 31) & 070 ^ sqr]

or

PST[(sqr > 31)*070 ^ sqr]

to use tables with only 32 elements, neither of which would require any branches at all.
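Spelled out step by step as a sketch (in C, & binds tighter than ^, so no extra parentheses are needed):

Code: Select all

static int pstScore(const int PST[32], int sqr)  /* sqr = 0..63 */
{
    int flip = -(sqr > 31);         /* 0 for sqr <= 31, -1 (all ones) above */
    int idx  = (flip & 070) ^ sqr;  /* XOR with 070 flips the rank bits,
                                       folding ranks 4..7 onto 3..0 */
    return PST[idx];                /* idx always ends up in 0..31 */
}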
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Mispredicted branch VS cache miss

Post by bob »

xmas79 wrote:Hello, question of the day: if you could pick only one, which would you pick for your engine?

A real problem is this: PSTs are usually arrays of [64] values. With a left/right-symmetrical eval (my engine was designed as such), they could be cut down to vectors of [32] values, lowering the pressure on the cache. However, a bit of bit twiddling is needed to mirror the square index, which (I suppose) introduces some hard-to-predict branches. I'm reworking the eval a bit in the hope of starting a tuning session, and I would like to know whether it is worth cutting the size of the PSTs in half. I know it's a balance... a cache miss on each PST access VS a mispredicted branch + a (hoped-for) cache hit on each PST access. In the worst case I would also have the cache miss, so this seems like a non-winner to me...



I probably already know all your answers: "use a profiler......", no?
If you worry about cache, use 64 one-byte values. The instruction to access them is not as efficient (it has to do sign extension, assuming the numbers are signed and can be negative). Given the choice you mention, I'd stick with the normal 64-entry PST and eliminate the extra instructions. The size of the PSTs is not going to be a big deal. And you can, as I mentioned, go to bytes if you prefer to take it even easier on the cache, at the small cost of replacing mov by movsx.
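A minimal sketch of the byte-valued table (values hypothetical); with an int8_t element type the compiler emits the sign-extending load by itself:

Code: Select all

#include <stdint.h>

static const int8_t pawnPST[64] = {
    0 /* ... 64 centipawn scores, small enough to fit in a byte ... */
};

int pawnBonus(int sq)
{
    return pawnPST[sq];  /* int8_t load, sign-extended to int (movsx on x86) */
}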

If you want to go the 32-entry way, you can index without branches there as well. Several programs use this kind of trick to map a rank/file combination to its inverse for the other side...
xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Re: Mispredicted branch VS cache miss

Post by xmas79 »

hgm wrote:...
PST[-(sqr > 31) & 070 ^ sqr]

or

PST[(sqr > 31)*070 ^ sqr]
Ah yes, good catch! I already use the 0x38 XOR for color flipping, and went straight to the lazy "if" approach ((sq > 31) ? etc...) in my mind... I hadn't coded it yet!

The problem however is still open: how do I estimate the working set? Zobrist keys, magic tables, eval tables etc... how much cache do they eat up? I suppose in a real search nothing stays in (fast) cache, no? So I presume that reducing cache pressure is always a good thing...
bob wrote:If you worry about cache, use 64 one-byte values. The instruction to access them is not as efficient (it has to do sign extension, assuming the numbers are signed and can be negative). Given the choice you mention, I'd stick with the normal 64-entry PST and eliminate the extra instructions. The size of the PSTs is not going to be a big deal
Shouldn't I worry about cache thrashing? Or are you just saying that PSTs are not going to "fix" any cache problem in a chess engine? If I'm going to have a cache miss anyway, then a 32-square PST is probably just the same as a 64-square PST (I expect the few "neg"/"and"/"xor" instructions that flip the square to be almost free), except that I thrash the cache a bit less. No?

Thanks.
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Mispredicted branch VS cache miss

Post by hgm »

xmas79 wrote:Ah yes, good catch! I already use the 0x38 XOR for color flipping, and went straight to the lazy "if" approach ((sq > 31) ? etc...) in my mind... I hadn't coded it yet!
Yet another way to code this came to mind, which would even be branch-free on architectures that do not support a SETcc-type instruction:

Code: Select all

PST[(31-sqr)>>31 & 070 ^ sqr]
The fact that both constants are 31 here is entirely coincidental; the shift count would always be 31 (one less than the width of a 32-bit int), independent of the board size, and is intended to 'pull the sign curtain' over the value to create an all-ones or all-zeros mask.
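As a function, for clarity (a sketch; it assumes 32-bit int and the sign-propagating right shift of negative values that mainstream compilers implement):

Code: Select all

static int pstScore(const int PST[32], int sqr)  /* sqr = 0..63 */
{
    int mask = (31 - sqr) >> 31;   /* 0 for sqr <= 31, all ones for sqr > 31 */
    return PST[(mask & 070) ^ sqr];
}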
The problem however is still open: how do I estimate the working set? Zobrist keys, magic tables, eval tables etc... how much cache do they eat up?
Well, you know that, don't you? You declared the arrays yourself, with fixed sizes, and you know the size of the elementary data types they contain, like 8 bytes for an int64. With 13 board-sized tables for the int PST and the long int Zobrist keys (including one for the empty square!), these together eat 13 x 64 x (4+8) = 9,984 bytes ~ 10KB.
I suppose in a real search nothing stays in (fast) cache, no? So I presume that reducing cache pressure is always a good thing...
Well, the L1 data cache on Intel machines is 32KB, so 10KB easily fits, if that is the most space-consuming part of your engine's 'kernel data'. This would be different if you had 2 x 14 piece types and an 81-square board, which is why in Shokidoki I use the overlapping-Zobrist trick (shrinking those tables from 29 x 81 x 8 = 18,792 bytes to 30 x 81 x 1 = 2,430 bytes) and one-byte PST entries (shrinking those from 29 x 81 x 4 = 9,396 bytes to 29 x 81 x 1 = 2,349 bytes). As the PSTs also contain the piece base values, this forced me to use a granularity of 4 centipawns to make the value of the promoted Rook fit in an unsigned char, which was still bearable.

But I do like my engines to run entirely from L1, apart from the hash probe. As both Intel and AMD caches have (pseudo-)least-recently-used replacement, if your frequently accessed data stays limited to less than half of L1 (16KB on Intel, 32KB on AMD) it will always remain cached, and the other half of the cache will be used for infrequently accessed data like hash probes (in vain, as such data will never be accessed again before being flushed by other probes). On Intel machines, which nowadays have 8 cache ways, you can even use a larger fraction of the cache without spoiling this.

Beware that the frequently accessed data includes not only your global tables, but also the stack space of the recursive search routines, which might contain a move list of 40 or 50 times 4 bytes per frame; the top 5-10 levels of that stack should also be counted as frequently accessed data. You don't care if nodes close to the root cause L1 cache misses with everything they do; there are just too few of them. But near the leaves the tree is thin because of QS, and the ply level flashes up and down very rapidly.
Volker Annuss
Posts: 180
Joined: Mon Sep 03, 2007 9:15 am

Re: Mispredicted branch VS cache miss

Post by Volker Annuss »

xmas79 wrote:I probably already know all your answers: "use a profiler......", no?
The profiler you need for this kind of problem is valgrind. Valgrind has a tool, cachegrind, that does cache simulation and branch-prediction simulation.

If I remember correctly, it was Gian-Carlo Pascutto who wrote here that you get about one last-level cache miss per node searched.

Caches have grown since then, and my engine Arminius does not use the hash table in qsearch. Cachegrind reports about 0.39 L3 cache misses, 0.72 L2 cache misses and 5.5 L1 cache misses per node searched for Arminius running with a single thread on an i7-5820K.

I think that current processors' branch predictors are better than cachegrind's simulation. For details of the branch predictors (and caches) in various processors, see Agner Fog's optimization manual 3 (microarchitecture).