two values in one integer

bob · Post by **bob** » Sat Jan 21, 2012 7:45 pm

Sven Schüle wrote:
Pierre Bokma wrote:hi friends,

I want to have two evaluations returned from my evaluation function one for the endgame and one for the middle game. By scaling these two values the value of the position will be calculated like fruit and stockfish do.

I am wondering, can this be done by using only one integer? eg the endgame value in bits 0-15 and the midgame value in the other bits? I have expirimented with this approach but i cannot get it to work. Especially negative values in least significant bits give problems.

any help appreaciated
It can be done correctly, as discussed in detail in the thread given by Richard. In my opinion a struct with two 16-bit values can be handled easier and with less "headache".
Code: Select all
struct Score {
    int16_t mgPart;
    int16_t egPart;
};

inline void setScore(Score & score, int16_t mg, int16_t eg) {
    score.mgPart = mg;
    score.egPart = eg;
}

inline void addScore(Score & score, int16_t mg, int16_t eg) {
    score.mgPart += mg;
    score.egPart += eg;
}

Score myScore = { 0, 0 };

addScore(myScore, RookOn7thRankBonusMG, RookOn7thRankBonusEG);
Hard to beat in terms of coding effort, correctness, and runtime performance.

Sven

I am missing "the point." The original idea of this particular "trick" was to replace the pair of adds (endgame and middlegame scores) with a single add that folds in both at the same time. The above code doesn't do that.

So are we now just talking about making that place in the source a little simpler (one proc call rather than 2 adds) by letting the compiler expand the proc call to two adds at compile time?

So, is the point "readability" or "performance"? Seems to not be the latter...

Personally, I think the performance issue is completely moot anyway. With the code used today, the hardware has plenty of trouble keeping multiple integer pipes full anyway, because of data and control dependencies, an extra add doesn't add enough to be able to measure... It just slips into one of those empty pipe slots invisibly...

Sven · Post by **Sven** » Sat Jan 21, 2012 11:13 pm

bob wrote:
Sven Schüle wrote:
Pierre Bokma wrote:hi friends,

I want to have two evaluations returned from my evaluation function one for the endgame and one for the middle game. By scaling these two values the value of the position will be calculated like fruit and stockfish do.

I am wondering, can this be done by using only one integer? eg the endgame value in bits 0-15 and the midgame value in the other bits? I have expirimented with this approach but i cannot get it to work. Especially negative values in least significant bits give problems.

any help appreaciated
It can be done correctly, as discussed in detail in the thread given by Richard. In my opinion a struct with two 16-bit values can be handled easier and with less "headache".
Code: Select all
struct Score {
    int16_t mgPart;
    int16_t egPart;
};

// SNIP
Hard to beat in terms of coding effort, correctness, and runtime performance.
I am missing "the point." The original idea of this particular "trick" was to replace the pair of adds (endgame and middlegame scores) with a single add that folds in both at the same time. The above code doesn't do that.

So are we now just talking about making that place in the source a little simpler (one proc call rather than 2 adds) by letting the compiler expand the proc call to two adds at compile time?

So, is the point "readability" or "performance"? Seems to not be the latter...

Personally, I think the performance issue is completely moot anyway. With the code used today, the hardware has plenty of trouble keeping multiple integer pipes full anyway, because of data and control dependencies, an extra add doesn't add enough to be able to measure... It just slips into one of those empty pipe slots invisibly...

"The point" was actually easy to extract from my post, and was almost equivalent to the outcome of your last paragraph: the extra "add" instruction of the simple "struct" solution does not cause any measurable slowdown but avoids any headache being introduced by implementing the "tricky folding" solution.

Using an inline function like my "addScore()" above can be done for both, and does not reveal which of these two implementations is actually being used, so that was not my key point.

If you really would like to give a name to that "point" then it is neither "readability" nor "performance", it is "simplicity". If you take a look at Stockfish, for instance (implementation can be found in "types.h", search for "Score" - but I guess you know that code already), then you will see why I think that the "tricky folding" solution is not so trivial - the source comments actually state so, and furthermore the code to extract the MG and EG parts as well as all the required C++ operator definitions are considerably complex when compared to the "struct" solution.

To avoid any misunderstanding: we can be quite sure that the "tricky folding" solution, as shown in Stockfish for instance, works correctly, at least for today's compilers. And I also believe that the way how it is solved in Stockfish is quite nice! I just say two things:
1) it is not something that I would recommend to a beginner, and
2) even for the "advanced learner" as well as the "professional" I don't see that it has any measurable advantage in terms of performance over the "struct" solution.

Sven

wgarvin · Post by **wgarvin** » Sun Jan 22, 2012 3:48 pm

Manipulating two or more packed fields in parallel is called SWAR ("SIMD Within A Register"). SIMD is "Single Instruction, Multiple Data" (the vector register stuff Evert mentioned).

SWAR is the term that describes doing SIMD things without using special SIMD instructions or SIMD vector registers to do them.

I wanted to post that there were good resources for this kind of stuff on the Internet, but to my surprise, with 5 minutes of googling I could not find any of them.

If you really want to try and implement this two-scores-in-one-integer trick, I guess the best way to learn how to do it is to study how Stockfish does it.

This page shows the kind of tricks you can do if you understand how 2's-complement integers are represented. Its very helpful to understand how positive and negative numbers are represented (all CPUs made in the last >25 years use 2's-complement representation for integers). In a nutshell, an N-bit signed integer is just like an unsigned integer except the 2nd half of the values (those that start with a 1-bit) are interpreted as (the unsigned value - 2^N).

Example: Unsigned 4 bit integers (2^N == 16):
0000 = zero, 0001 = one, 0010 = two ... 0110 = six, 0111 = seven, 1000 = eight, 1001 = nine, .... 1110 = fourteen, 1111 = fifteen.
But with signed 2's complement format, the value 1000 represents (eight - 2^N) == -8. 1001 would be (nine - 2^N) == -7. 1010 == -6, 1011 = -5, 1100 = -4, 1101 = -3, 1110 = -2, and 1111 = -1. If you add one more, it will wrap around to 0 again. When you convert from e.g. a 4-bit unsigned type to an 8-bit unsigned type, you just "zero-extend" : 1001 (nine) --> 00001001 (nine). To convert from e.g. a 4-bit SIGNED type to an 8-bit SIGNED type, you "sign-extend" by replicating the MSB (most-significant bit) to fill the new digits: 1001 (-7) --> 11111001 (-7). So when people say "sign bit", they are referring to the MSB of the signed type. You can tell if the number is negative or not by looking at this sign bit, and when you widen a type you need to sign-extend it. When you truncate a type to a smaller size (e.g. cast a 32-bit type down to a 16-bit type) you're throwing away the upper bits including the sign bit. One of the lower bits (in this case bit 15) will become the new sign bit, and so a number which used to be positive might suddenly become negative, and vice versa. But if the number is in the representable range of the smaller type (e.g. -32768..+32767 for a signed 16-bit integer) then that means the upper bits will already either all be zeroes (positive), or all be ones (negative) and you don't lose anything by discarding them. E.g. the 32-bit representation of -32768 is 11111111111111111000000000000000, so truncating that to 1000000000000000 does not change the value.

Okay, back to keeping two scores in one register!

You can use signed values and simply let the lower half borrow one from the upper half, when the lower half is negative. That way you can add signed values together ((hi << 16) + lo) without any masking, and you can add combined pairs of values together as a single 32-bit value without any extra instructions like XOR, and you can multiply two values by the same small scaling factor with one multiply, as long as you are careful to keep the values small enough that they don't overflow from the lower half into the upper half.

But like Bob said, this is a very minor savings... my guess is that if this has any real benefit, it will be that the compiled code uses fewer integer registers, so the compiler might do a better job optimizing the surrounding code.

Anyway, when you want to extract the two values, the lower half is easy (just truncate it at 16 bits and sign-extend) but the upper half may be one lower than expected (because of a borrow from a negative lower half). You need to correct for this borrowing.. IIRC Stockfish does something tricky to try and remain standards-compliant? The simplest way is probably something like ((v + (v&32768))>>16), to take bit 15 which is the sign bit of the lower half, and add it to itself... if the lower half was positive, this will add nothing to the result. if the lower half was negative, this will carry into bit 16, thus adding one to the final result (i.e. putting back the one that was "borrowed" from the upper half when the lower half became negative). Always be careful about what type an expression has (signed or unsigned? what size type?) before shifting right, and before/after other operations. >> with a signed type will sign-extend, while >> with an unsigned type will zero-extend. Casting from smaller to larger types will either sign- or zero-extend depending on the SOURCE type... to save yourself some headache, I recommend casting it to a small type of the desired signness, then casting that to a large type of the same signness. Truncating large types to small ones is easier.. just cast directly to the type you want.

Beware of automatic promotions of smaller types to 'int' when using them with operators (e.g. adding them together). You should typedef your own types with a known byte size (e.g. typedef unsigned UInt32;) to make future porting of code easier and (more importantly) to make it easier to reason about what your code does when reading it. But even if you did this, you can still get screwed by the automatic promotions, because the size of 'int' is compiler-dependent! It's 32-bits on all modern desktop compilers, but there exist systems where its smaller or larger--and on these systems, automatic promotions might cause your tricky SWAR code to not work the way you think it does (for example: if your reasoning about it assumed that int was 32 bits, and you did a >> 16 that was supposed to replicate the sign bit 16 times, but on this system int was actually 32 bits... then you'll get 16 bits of whatever junk was in the register, not 16 copies of the sign bit). Explicit casts to the smaller type can fix this, but only if you cast to a smaller type with the proper sign. So if you want to take a signed 32-bit value and >> 16 on it to produce a sign-extended 16-bit value in a 32-bit type, then make sure to cast it to a signed 32-bit type before doing the shift (always use brackets to ensure things happen in the order you intended, and avoid any confusion about operator precedence). That way, even if the compiler promotes it back into a 64-bit int type, your cast down to 32-bit signed type will convince the compiler to sign-extend the 32-bits you want through the rest of the 64-bit register (bits 32..63) when it promotes it back to int. Then your >> 16 will still give you 16 copies of bit 31, as you intended.

Bit-tricks are fun, but the previous paragraph might convince you that this stuff is more hassle than its worth.

And often it is. A good rule of thumb is "Do the simplest thing that could possibly work".

Don · Post by **Don** » Sun Jan 22, 2012 4:00 pm

bob wrote: I am missing "the point." The original idea of this particular "trick" was to replace the pair of adds (endgame and middlegame scores) with a single add that folds in both at the same time. The above code doesn't do that.

So are we now just talking about making that place in the source a little simpler (one proc call rather than 2 adds) by letting the compiler expand the proc call to two adds at compile time?

So, is the point "readability" or "performance"? Seems to not be the latter...

Personally, I think the performance issue is completely moot anyway. With the code used today, the hardware has plenty of trouble keeping multiple integer pipes full anyway, because of data and control dependencies, an extra add doesn't add enough to be able to measure... It just slips into one of those empty pipe slots invisibly...

I spent a couple of days a couple of years ago reworking the entire evaluation function to pack values in a way I could operate simultaneously on them and came out on the other side with less than 1% speedup.

The code benefited from this simply because I had to reorganize and simplify it and I gave it more structure. But I could have done that without the packing.

It seemed like a great idea at the time ....

wgarvin · Post by **wgarvin** » Sun Jan 22, 2012 4:10 pm

Doh.. too late to edit a typo in my post above:

(for example: if your reasoning about it assumed that int was 32 bits, and you did a >> 16 that was supposed to replicate the sign bit 16 times, but on this system int was actually 64 bits... then you'll get 16 bits of whatever junk was in the register, not 16 copies of the sign bit).

wgarvin · Post by **wgarvin** » Sun Jan 22, 2012 5:14 pm

wgarvin wrote:You need to correct for this borrowing.. IIRC Stockfish does something tricky to try and remain standards-compliant? The simplest way is probably something like ((v + (v&32768))>>16), to take bit 15 which is the sign bit of the lower half, and add it to itself... if the lower half was positive, this will add nothing to the result. if the lower half was negative, this will carry into bit 16, thus adding one to the final result (i.e. putting back the one that was "borrowed" from the upper half when the lower half became negative).

Ugh.. I guess this is what happens when I post first thing after waking up in the morning..

There's an even better way to fix the borrowing... ((v + 32768) >> 16).

It generates the same carry into bit 16, but is even simpler.

bob · Post by **bob** » Sun Jan 22, 2012 5:50 pm

Don wrote:
bob wrote: I am missing "the point." The original idea of this particular "trick" was to replace the pair of adds (endgame and middlegame scores) with a single add that folds in both at the same time. The above code doesn't do that.

So are we now just talking about making that place in the source a little simpler (one proc call rather than 2 adds) by letting the compiler expand the proc call to two adds at compile time?

So, is the point "readability" or "performance"? Seems to not be the latter...

Personally, I think the performance issue is completely moot anyway. With the code used today, the hardware has plenty of trouble keeping multiple integer pipes full anyway, because of data and control dependencies, an extra add doesn't add enough to be able to measure... It just slips into one of those empty pipe slots invisibly...
I spent a couple of days a couple of years ago reworking the entire evaluation function to pack values in a way I could operate simultaneously on them and came out on the other side with less than 1% speedup.

The code benefited from this simply because I had to reorganize and simplify it and I gave it more structure. But I could have done that without the packing.

It seemed like a great idea at the time ....

I looked at it early on in the EG/MG interpolation conversion. Seemed to be an elegant solution. But it also seemed to be problematic, and I couldn't see any potential gain in performance anyway...

Daniel Shawul · Post by **Daniel Shawul** » Sun Jan 22, 2012 11:18 pm

Arggh..that is why chess programmers will never contribute to real AI

* I mean bit tricks are all cool and stuff, but use them where they are needed. IMO a good place for bit tricks and innovations would be GPU (SIMD, memory constraints, caching etc), not some completely unnecessary mg&eg score combination. Say I decided to have three game states, I would have to do the useless trick again... Bitboards are the only thing that came out of chess bit tricks remotely useful in other domains.
sorry for being grumpy

* I mean NOW, the current generation.

Rebel · Post by **Rebel** » Sun Jan 22, 2012 11:29 pm

I am still missing the need for the packing.

The compiler will created code like this:

Code: Select all

  add     EDI,100                  // midgame
  add     ESI,50                   // endgame

Both instruction are done simultaneously in one cycle.

bob · Post by **bob** » Sun Jan 22, 2012 11:40 pm

Rebel wrote:I am still missing the need for the packing.

The compiler will created code like this:
Code: Select all
  add     EDI,100                  // midgame
  add     ESI,50                   // endgame 
Both instruction are done simultaneously in one cycle.

If you do it like stockfish, there is ONE add. It is marginally cleaner to not have two updates everywhere, although semantically, you are just doing a simd-like pair of adds with one single add. Worth it? Don't see how it would gain anything at all since most programs can't keep all integer pipes busy every cycle anyway.

two values in one integer

Re: two values in one integer

Re: two values in one integer

Re: two values in one integer

Re: two values in one integer

Re: two values in one integer

Re: two values in one integer

Re: two values in one integer

Re: two values in one integer

Re: two values in one integer

Re: two values in one integer