hgm wrote:
Do you have any idea which code produces these i-cache misses? E.g. is it rarely used evaluation features, or promotions in Make / UnMake?
It is the total package, but especially of course the evaluation, as that is where most of the code executes at runtime.
Note that I forgot to mention that branches are really expensive as well: 12.2% branch misprediction overall on the old Opteron.
Cachegrind cannot produce misprediction statistics for Core2, as the way these processors do branch prediction is more top secret than how to build a nuke.
Maybe I should try VTune for that. Of course this is caused by Diep's huge number of patterns. With hard work it is relatively easy to get rid of a lot of branches, but then the code becomes totally unreadable.
I am considering, however, building a #define for this and seeing where the ship lands.
As Nalimov explained a while ago, the big problem with optimizations is that the compiler doesn't know a lot of the things we know about the engine.
We know a square never gets above 63 and never below 0. We know that we never make an illegal reference with a square index into most arrays, whereas the compiler must conservatively assume an illegal access is possible, so it can't optimize:
int sq = movelist[side][1];
....
if( board[sq_a5] == pawn && color[sq] == neutral ) { ...
The compiler doesn't know that even when the first condition is false, the second lookup is still legal.
To remove the branch, I am looking for a neat way to do it without increasing Diep's code size by 2MB of new code.
The next code is of course faster, as it removes one branch, but I consider it UNREADABLE:
if( !((board[sq_a5]-pawn)|(color[sq]-neutral)) ) { ..
That would remove a lot of instructions from Diep, though, as I assume the compiler not only reduces the number of branches but also emits fewer instructions.
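To keep such a branchless test readable, one option is to hide it behind the #define mentioned above. A minimal sketch, with illustrative names (`BOTH_EQ`, `both_equal`), not Diep's actual code:

```c
/* Branchless AND of two equality tests: (a == va && b == vb) without a
 * second branch.  Both subtractions are evaluated unconditionally, so
 * both lookups must always be legal.  Illustrative sketch, not Diep code. */
#define BOTH_EQ(a, va, b, vb)  (!(((a) - (va)) | ((b) - (vb))))

/* Wrapper so the trick can be tested in isolation:
 * returns 1 only when x == vx and y == vy, else 0. */
int both_equal(int x, int vx, int y, int vy) {
    return BOTH_EQ(x, vx, y, vy);
}
```

The call site then stays close to the readable original, e.g. `if( BOTH_EQ(board[sq_a5], pawn, color[sq], neutral) ) { ...`, while compiling to the single combined test.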
Furthermore, a lot of the instruction count comes from the fact that over the years I have optimized a lot of code for speed.
hgm wrote:
I never paid much attention to L1i misses, as my code is too small to produce them anyway. Furthermore, L1i is usually equipped with prefetch logic that would prevent any actual misses when execution proceeds linearly. So it is likely that taken branches are responsible for most of the misses.
I would argue that this is by definition true.
A lot of patterns only get evaluated when the pattern actually applies in the position. So take a generic condition, say checking whether we are in a certain game phase or whether we have a queen on the board: if for a few nodes we are in the endgame, and then we backtrack or promote to a queen, the pattern applies again, and these patterns get triggered in eval once more.
hgm wrote:
If the code is larger than the cache, there is of course nothing you can do (other than shrink the code). Although it might help a little if you would collect code of small if-clauses that are rarely used, but when they are used, are likely used together, in the same cache line.
You touch an interesting subject here. Yet we need to be careful in Diep's case, as there is also the risk of a big slowdown. If Diep were to look at and evaluate every pattern, it would slow down by a factor of 5 or so in nps.
You probably understand what I mean: generic 'if then else' patterns can skip multiple patterns, and even a lot of function calls can be avoided.
Getting junk from L2 is still faster than evaluating everything.
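The kind of gating meant here can be sketched like this; the function, names, and threshold (`eval_king_attack`, `stage >= 5`) are hypothetical, purely to illustrate how one cheap generic test skips a whole family of patterns:

```c
/* Hypothetical gate: when the position is endgame-ish or queens are off,
 * skip the entire family of king-attack patterns instead of evaluating
 * each one individually.  Not Diep's real code or thresholds. */
int eval_king_attack(int stage, int queens_on_board) {
    int s = 0;
    if (stage >= 5 || !queens_on_board)  /* generic test: patterns can't fire */
        return 0;                        /* dozens of patterns skipped here  */
    /* ...many individual attack patterns would be evaluated below... */
    s -= 20;  /* stand-in for the real pattern scores */
    return s;
}
```

One cold branch at the top replaces the dozens of pattern branches (and their code fetches) that would otherwise run on every node.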
[quote="hgm"]
E.g. the extra code for castling in Make and UnMake could be located in the same cache line (assuming they fit there together).
[/quote]
Yes, I understood what you meant, and it is a clever idea. Of course my first attempt is going to be to get rid of some historic idiocy and see whether that reduces the problem.
Note that the L1i and L1d statistics are for Core2, a processor with very little L1 cache. I didn't do the same comparison for AMD; maybe I should.
Probably all this is no big deal on AMD. I do know the misprediction rate on AMD, though; I suspect it is a lot better on Intel.
hgm wrote:
The compiler typically generates code that branches to some far-off place to do the conditional part, and then jumps back at the end of it. (This makes better use of the L1i pre-fetcher bandwidth, I suppose.) So if you Make a
Only GCC does this, in order to hurt AMD a bit more than Intel and to slow GCC itself down.
By default GCC is faster on AMD than Intel C++. By doing such jumps GCC of course screws itself for PGO, and therefore after PGO Intel C++ / Visual C++ are faster than GCC, and by a lot.
We must be open and honest here. GCC is a team of volunteers. It is easy to hurt GCC in ways the team doesn't soon notice, and some companies have 100 billion dollars' worth of reasons to do so. Nations like Georgia get overrun and thousands of soldiers die for much smaller reasons than that.
Linus Torvalds sometimes posts something to wake up the GCC team, but it doesn't help. They even show him the middle finger and tell the true reason: "but then it is slower on my Intel P4". In 2007/2008 that is of course no good reason to cripple the AMD flags.
So the biggest reason why Diep suffers such a big overall branch misprediction penalty of 12.2% is the sabotage done to GCC to hurt AMD.
If AMD were a $100 billion company and Intel ten times smaller, of course the above names would be reversed.
Have no illusions there.
hgm wrote:
castling move, and it would normally be a miss because castling code does not belong to the code executed so frequently that it can be kept in cache always, then at least the UnMake, which usually will follow not too much thereafter, will produce an automatic hit, while it would have been another miss if it was located in an independent cache line.
Note that on AMD the small associativity (2-way) might cause a problem, and lead to collisions of very frequently used code that could have been avoided by moving it around. A not-so-frequently-used piece of code that happens to map to the same set as two very frequently-used code parts will be flushed out, and almost always cause two misses when executed (one to bring it in, one to bring back the frequently-used code it replaced). And there might be sets where only one code part maps, so that the other way in that set remains completely unused. I am not sure at all if compilers take this into account. On the Intel 8-way caches this should almost never be a problem.
What I am also doing in Diep, and haven't changed yet in my evaluation, is indirect function calls. I use a table of functions it can call depending on the value it encounters somewhere.
I know this hurts on old processors, but I do not know about modern processors like Nehalem and Phenom. Vaguely I remember a statement somewhere that such an indirect call always causes a branch mispredict.
Is that the case?
It is a lot of work to test all this very thoroughly.
To show actual code of Diep, here is a part of the evaluation doing this function call; now we touch really historic code of Diep:
Code: Select all
MYKING = PieceList[white][0];
OPKING = PieceList[black][0];
MyAtt  = AttNum[white];
OpAtt  = AttNum[black];
MyPawn = Pawns[white];
OpPawn = Pawns[black];
TrfstndO = TorenAfstand[OPKING];
AfstndM  = Afstand[MYKING];
AfstndO  = Afstand[OPKING];
#if IncSorting
  preval = PreValue[white][0];
  dpiec  = DPiece[white][0];
#endif
#if Show
  VERB("AANVANG STUKEVALS SCORE TOTAAL IS",s,0,0);
#endif
/* assumption: PieceList[0][0] holds the white king */
psq = &PieceList[white][1];
sq  = *psq++;
do {
  if( sq != 64 ) {
    if( board[sq] == pawn ) {
      if( Afstand[sq][MYKING] <= 2 ) {
        s += (30-(5*stage));
#if Show
        VERB("white pawn : pawndistance to white king (from White)",
             (30-(5*stage)),sq,sq);
#endif
      }
      if( (OpAtt[sq]&15) >= (MyAtt[sq]&15)
       && ((MyAtt[sq]&ctlP) == 0 || (OpAtt[sq]&ctlP)) ) {
        s -= 5*(OpAtt[sq]&15)+5;
        s += 5*(MyAtt[sq]&15);
#if Show
        VERB("white pawn : attacks to white pawn (from White)",
             (5*(MyAtt[sq]&15))-(5*(OpAtt[sq]&15)+5),sq,sq);
#endif
      }
    }
    else {
#if ENGINEPARAMETERS
      if( (sq&7) >= 4 )
        s += engpar.flankbonus;
#endif
      s += (*(evalRoutines[board[sq]]))(sq,white);
    }
  }
} while( (sq=*psq++) != 128 );
That evalRoutines table looks like:

typedef int (*EVALFUNC)(int sq,int c);

static EVALFUNC evalRoutines[7] = {
  ErrorIt,
  ErrorIt,
  Knight,
  Bishop,
  Rook,
  Queen,
  ErrorIt };
Note there is a lot to tell about this manner of indirect function calls with respect to SPECint. Maybe GCP wants to open the stinking rotten room that is called SPECINT.
We had added similar function calls to specsjeng, and while I never had any problem with any compiler getting these function calls wrong, with the special compilers they use for SPECint many did get it wrong...
The question is: is it faster to get rid of the function table and call the functions in a different manner?
I had thought this was a neat way to do it at the time (well, early 90s, you're young and a student and want to try all the features a language like C offers). It never occurred to me that I was doing something risky.
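For comparison, a hedged sketch of the same dispatch written as a switch instead of a function-pointer table. The piece codes and dummy bodies are illustrative only, not Diep's real evaluators; a compiler may still emit a jump table for the switch, but it can also inline the cases, and the hard-to-predict indirect call disappears:

```c
enum { KNIGHT = 2, BISHOP = 3, ROOK = 4, QUEEN = 5 };

/* Dummy evaluators standing in for Diep's real Knight/Bishop/Rook/Queen. */
static int knight_eval(int sq, int c) { return 300; }
static int bishop_eval(int sq, int c) { return 310; }
static int rook_eval(int sq, int c)   { return 500; }
static int queen_eval(int sq, int c)  { return 900; }

/* Direct-dispatch alternative to (*(evalRoutines[board[sq]]))(sq, c). */
int piece_eval(int piece, int sq, int c) {
    switch (piece) {
    case KNIGHT: return knight_eval(sq, c);
    case BISHOP: return bishop_eval(sq, c);
    case ROOK:   return rook_eval(sq, c);
    case QUEEN:  return queen_eval(sq, c);
    default:     return 0;  /* pawns/illegal codes handled elsewhere */
    }
}
```

Whether this is actually faster depends on how well the indirect branch predictor of the target CPU handles the table version; measuring both on the machines in question is the only reliable answer.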
Vincent