Why would you want to use in-register lookups? Fetching such simple things from a memory table should be very competitive, if not faster. I have noticed that these out-of-order CPUs often perform very poorly if you use only registers; there seems to be some internal bottleneck on how many registers you can access per clock. Having some of the operands come from memory is usually more successful in keeping the ALUs busy.
On AMD machines loads and stores are piggybacked on ALU macro-ops, so there is no decoder penalty for using instructions with memory operands. Micro-op (uOp) fusion on Intel machines now does the same (Pentium M and Core CPUs), while the Pentium 4 circumvented decoder limits by using a trace cache.
In addition, shifts are usually a bottleneck: most CPUs have only one ALU capable of doing shifts, which hurts if your code uses them heavily. So better reserve them for what you really can't do without. (Plus, the x86 machine language requires a variable shift count to be in one specific register, CL, leading to extra register-to-register moves to implement them.)
If you are worried about L1 footprint, you could pack several such tables into 64 bytes, e.g. bits 0 and 1 for the center distance, bit 2 for the square color, etc.
Code:
colorOfSquare = Data[sqr] & 4; // assuming it is used as a boolean, so 4=true
centerDistance = Data[sqr] & 3;
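For completeness, here is a rough sketch of how such a packed table could be filled. It assumes squares are numbered sqr = 8*rank + file with a1 = 0, that center distance means the larger of the file and rank distances to the four central squares (so it fits in 2 bits), and that a set bit 2 means a light square. The helper names (init_data, axis_distance, center_distance, is_light_square) are made up for illustration and are not taken from any engine.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 0-1 = center distance (0..3), bit 2 = square color. */
enum { DIST_MASK = 3, COLOR_BIT = 4 };

static uint8_t Data[64];   /* one byte per square, 64 bytes in total */

/* Distance of a file or rank index (0..7) from the central pair (3 and 4). */
static int axis_distance(int x) { return x < 4 ? 3 - x : x - 4; }

/* Fill the table once at startup. */
void init_data(void)
{
    for (int sqr = 0; sqr < 64; sqr++) {
        int file = sqr & 7, rank = sqr >> 3;   /* assumes sqr = 8*rank + file */
        int fd = axis_distance(file), rd = axis_distance(rank);
        int dist = fd > rd ? fd : rd;          /* ring number around the center */
        int light = (file + rank) & 1;         /* assumes a1 (sqr 0) is dark */
        assert(dist <= DIST_MASK);             /* must fit in bits 0-1 */
        Data[sqr] = (uint8_t)(dist | (light ? COLOR_BIT : 0));
    }
}

/* Each lookup is one byte load plus one AND, with no shifts. */
int center_distance(int sqr) { return Data[sqr] & DIST_MASK; }
int is_light_square(int sqr) { return (Data[sqr] & COLOR_BIT) != 0; }
```

Keeping the 2-bit field in the lowest bits and treating the color bit as a plain nonzero test means neither lookup needs a shift, which is the point of packing it this way.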
(Btw, Joker uses Wesley's method for square color, but I am considering changing that.)