hgm wrote:
Do you have any idea which code produces these i-cache misses? E.g. is it rarely used evaluation features, or promotions in Make / UnMake?
It is the total package, but especially of course the evaluation, as that is where most of the code executes at runtime.
Note that I forgot to mention that branches are really expensive as well: 12.2% branch misprediction overall on the old Opteron.
Cachegrind cannot produce misprediction statistics for Core2, as the way these processors do branch prediction is more top secret than how to build a nuke.
Maybe I should try VTune for that. Of course this is caused by Diep's huge number of patterns. With hard work it is relatively easy to get rid of a lot of branches, but then the code becomes totally unreadable.
I am considering, however, building a #define for this and seeing where the ship lands.
As Nalimov explained a while ago, the big problem with optimizations is that the compiler doesn't know a lot of the things we know about the engine.
We know a square never gets above 63 and never below 0. We know that we never make an illegal reference with a square index into most arrays, whereas the compiler must conservatively assume an illegal access is possible, so it can't optimize:
int sq = movelist[side][1];
....
if( board[sq_a5] == pawn && color[sq] == neutral ) { ...
The compiler doesn't know that even when the first condition is false, the second lookup is still legal.
To remove the branch, I am looking for a neat way to do it without increasing Diep's code size by 2MB of new code.
The next code is of course faster, as it removes one branch, but I consider it UNREADABLE:
if( !((board[sq_a5]-pawn)|(color[sq]-neutral)) ) { ..
That would remove a lot of instructions from Diep, though, as I assume the compiler not only reduces the number of branches but also emits fewer instructions.
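To keep such a branchless test readable, one option is to hide it behind the #define mentioned above. A minimal sketch, with illustrative names (`BOTH_EQ`, `both_equal`), not Diep's actual code:

```c
/* Branchless AND of two equality tests: (a == va && b == vb) without a
 * second branch.  Both subtractions are evaluated unconditionally, so
 * both lookups must always be legal.  Illustrative sketch, not Diep code. */
#define BOTH_EQ(a, va, b, vb)  (!(((a) - (va)) | ((b) - (vb))))

/* Wrapper so the trick can be tested in isolation:
 * returns 1 only when x == vx and y == vy, else 0. */
int both_equal(int x, int vx, int y, int vy) {
    return BOTH_EQ(x, vx, y, vy);
}
```

The call site then stays close to the readable original, e.g. `if( BOTH_EQ(board[sq_a5], pawn, color[sq], neutral) ) { ...`, while compiling to the single combined test.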
Furthermore, a lot of the instruction count comes from the fact that over the years I have optimized a lot of code for speed.
hgm wrote:
I never paid much attention to L1i misses, as my code is too small to produce them anyway. Furthermore, L1i is usually equipped with prefetch logic that would prevent any actual misses when execution proceeds linearly. So it is likely that taken branches are responsible for most of the misses.
I would argue that this is by definition true.
A lot of patterns only get evaluated when the pattern actually applies in the position. So take a generic condition, say checking whether we are in a certain game phase or whether we have a queen on the board: if for a few nodes we are in the endgame, and then we backtrack or promote to a queen, the pattern applies again, and these patterns get triggered in eval once more.
hgm wrote:
If the code is larger than the cache, there is of course nothing you can do (other than shrink the code). Although it might help a little if you would collect code of small if-clauses that are rarely used, but when they are used, are likely used together, in the same cache line.
You touch an interesting subject here. Yet we need to be careful in Diep's case, as there is also the risk of a big slowdown. If Diep were to look at and evaluate every pattern, it would slow down by a factor of 5 or so in nps.
You probably understand what I mean: generic 'if then else' patterns can skip multiple patterns, and even a lot of function calls can be avoided.
Getting junk from L2 is still faster than evaluating everything.
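The kind of gating meant here can be sketched like this; the function, names, and threshold (`eval_king_attack`, `stage >= 5`) are hypothetical, purely to illustrate how one cheap generic test skips a whole family of patterns:

```c
/* Hypothetical gate: when the position is endgame-ish or queens are off,
 * skip the entire family of king-attack patterns instead of evaluating
 * each one individually.  Not Diep's real code or thresholds. */
int eval_king_attack(int stage, int queens_on_board) {
    int s = 0;
    if (stage >= 5 || !queens_on_board)  /* generic test: patterns can't fire */
        return 0;                        /* dozens of patterns skipped here  */
    /* ...many individual attack patterns would be evaluated below... */
    s -= 20;  /* stand-in for the real pattern scores */
    return s;
}
```

One cold branch at the top replaces the dozens of pattern branches (and their code fetches) that would otherwise run on every node.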
[quote="hgm"]
E.g. the extra code for castling in Make and UnMake could be located in the same cache line (assuming they fit there together).
[/quote]
Yes, I understood what you meant, and it is a clever idea. Of course my first attempt is going to be to get rid of some historic idiocy and see whether that reduces the problem.
Note that the L1i and L1d statistics are for Core2, a processor with very little L1 cache. I didn't do the same comparison for AMD; maybe I should.
Probably all this is no big deal on AMD. I do know the misprediction rate on AMD, though; I suspect it is a lot better on Intel.
hgm wrote:
The compiler typically generates code that branches to some far-off place to do the conditional part, and then jumps back at the end of it. (This makes better use of the L1i pre-fetcher bandwidth, I suppose.) So if you Make a
Only GCC does this, in order to hurt AMD a bit more than Intel and to slow GCC itself down.
By default GCC is faster on AMD than Intel C++. By doing such jumps GCC of course screws itself for PGO, and therefore after PGO Intel C++ / Visual C++ are faster than GCC, and by a lot.
We must be open and honest here. GCC is a team of volunteers. It is easy to hurt GCC in ways the team doesn't soon notice, and some companies have 100 billion dollars' worth of reasons to do so. Nations like Georgia get overrun and thousands of soldiers die for much smaller reasons than that.
Linus Torvalds sometimes posts something to wake up the GCC team, but it doesn't help. They even show him the middle finger and tell the true reason: "but then it is slower on my Intel P4". In 2007/2008 that is of course no good reason to cripple the AMD flags.
So the biggest reason why Diep suffers such a big overall branch misprediction penalty of 12.2% is the sabotage done to GCC to hurt AMD.
If AMD were a $100 billion company and Intel ten times smaller, of course the above names would be reversed.
Have no illusions there.
hgm wrote:
castling move, and it would normally be a miss because castling code does not belong to the code executed so frequently that it can be kept in cache always, then at least the UnMake, which usually will follow not too much thereafter, will produce an automatic hit, while it would have been another miss if it was located in an independent cache line.
Note that on AMD the small associativity (2-way) might cause a problem, and lead to collisions of very frequently used code that could have been avoided by moving it around. A not-so-frequently-used piece of code that happens to map to the same set as two very frequently-used code parts will be flushed out, and almost always cause two misses when executed (one to bring it in, one to bring back the frequently-used code it replaced). And there might be sets where only one code part maps, so that the other way in that set remains completely unused. I am not sure at all if compilers take this into account. On the Intel 8-way caches this should almost never be a problem.
What I am also doing in Diep, and haven't changed yet in my evaluation, is indirect function calls. I use a table of functions it can call depending on the value it encounters somewhere.
I know this hurts on old processors, but I do not know about modern processors like Nehalem and Phenom. Vaguely I remember a statement somewhere that such an indirect call always causes a branch mispredict.
Is that the case?
It is a lot of work to test all this very thoroughly.
To show actual code of Diep, here is a part of the evaluation doing this function call; now we touch really historic code of Diep:
Code: Select all
MYKING = PieceList[white][0];
OPKING = PieceList[black][0];
MyAtt  = AttNum[white];
OpAtt  = AttNum[black];
MyPawn = Pawns[white];
OpPawn = Pawns[black];
TrfstndO = TorenAfstand[OPKING];
AfstndM  = Afstand[MYKING];
AfstndO  = Afstand[OPKING];
#if IncSorting
  preval = PreValue[white][0];
  dpiec  = DPiece[white][0];
#endif
#if Show
  VERB("AANVANG STUKEVALS SCORE TOTAAL IS",s,0,0);
#endif
/* assumption: PieceList[0][0] holds the white king */
psq = &PieceList[white][1];
sq  = *psq++;
do {
  if( sq != 64 ) {
    if( board[sq] == pawn ) {
      if( Afstand[sq][MYKING] <= 2 ) {
        s += (30-(5*stage));
#if Show
        VERB("white pawn : pawndistance to white king (from White)",
             (30-(5*stage)),sq,sq);
#endif
      }
      if( (OpAtt[sq]&15) >= (MyAtt[sq]&15)
       && ((MyAtt[sq]&ctlP) == 0 || (OpAtt[sq]&ctlP)) ) {
        s -= 5*(OpAtt[sq]&15)+5;
        s += 5*(MyAtt[sq]&15);
#if Show
        VERB("white pawn : attacks to white pawn (from White)",
             (5*(MyAtt[sq]&15))-(5*(OpAtt[sq]&15)+5),sq,sq);
#endif
      }
    }
    else {
#if ENGINEPARAMETERS
      if( (sq&7) >= 4 )
        s += engpar.flankbonus;
#endif
      s += (*(evalRoutines[board[sq]]))(sq,white);
    }
  }
} while( (sq=*psq++) != 128 );
That evalRoutines table looks like:

typedef int (*EVALFUNC)(int sq,int c);

static EVALFUNC evalRoutines[7] = {
  ErrorIt,
  ErrorIt,
  Knight,
  Bishop,
  Rook,
  Queen,
  ErrorIt };
Note there is a lot to tell about this manner of indirect function calls with respect to SPECint. Maybe GCP wants to open the stinking rotten room that is called SPECINT.
We had added similar function calls to specsjeng, and while I never had any problem with any compiler getting these function calls wrong, with the special compilers they use for SPECint many did get it wrong...
The question is: is it faster to get rid of the function table and call the functions in a different manner?
I had thought this was a neat way to do it at the time (well, early 90s, you're young and a student and want to try all the features a language like C offers). It never occurred to me that I was doing something risky.
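For comparison, a hedged sketch of the same dispatch written as a switch instead of a function-pointer table. The piece codes and dummy bodies are illustrative only, not Diep's real evaluators; a compiler may still emit a jump table for the switch, but it can also inline the cases, and the hard-to-predict indirect call disappears:

```c
enum { KNIGHT = 2, BISHOP = 3, ROOK = 4, QUEEN = 5 };

/* Dummy evaluators standing in for Diep's real Knight/Bishop/Rook/Queen. */
static int knight_eval(int sq, int c) { return 300; }
static int bishop_eval(int sq, int c) { return 310; }
static int rook_eval(int sq, int c)   { return 500; }
static int queen_eval(int sq, int c)  { return 900; }

/* Direct-dispatch alternative to (*(evalRoutines[board[sq]]))(sq, c). */
int piece_eval(int piece, int sq, int c) {
    switch (piece) {
    case KNIGHT: return knight_eval(sq, c);
    case BISHOP: return bishop_eval(sq, c);
    case ROOK:   return rook_eval(sq, c);
    case QUEEN:  return queen_eval(sq, c);
    default:     return 0;  /* pawns/illegal codes handled elsewhere */
    }
}
```

Whether this is actually faster depends on how well the indirect branch predictor of the target CPU handles the table version; measuring both on the machines in question is the only reliable answer.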
Vincent