I mean this inner loop in generation, and sorry, it looks ugly to my taste.
Daniel Shawul wrote:
Hey Srdja
I agree with reducing the memory usage, BUT I do not think your code is ugly at all. Trying to avoid branching code is a waste of time, and many people seem to miss that... GPU computation is memory bound (200x slower), even with the new L1 and L2. In fact, it is encouraged to compute / recompute stuff rather than do a single uncached probe... So all this stuff about branching code and parallel move generation doesn't make that much sense to me when you have far bigger issues to deal with. This is not CPU computation with wide registers (SWAR); there you have a couple of threads (say 4) where you can do vector computations (instead of scalar) to get better performance. The caching is really good there with few threads. For GPUs you have thousands of threads to work with. Unless the proposition is to launch few threads, say 32x fewer than what is possible, and hope to gain something from parallel move generation, evaluation etc., it doesn't make sense. AFAIK you and I have been trying to use each core for doing real search. So unless you think it is impossible to reach full occupancy with the current method, there is no need to waste cores for that purpose.
cheers
Daniel
It is not only about branches, but about avoiding domove/legaltest/undomove by predetermining pinned pieces and attacks.
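To make the pinned-piece idea concrete, here is a minimal sketch in plain C, not taken from either engine. It walks the eight rays from the king square instead of using the RAttacks/BAttacks tables from the snippet, and the function name, parameters and bitboard layout (bit 0 = a1, bit 63 = h8) are all assumptions for illustration. A piece is reported as pinned when it is the only own piece between the king and an enemy slider of the matching type.

```c
#include <stdint.h>

typedef uint64_t U64;

/* (file, rank) steps for the 8 rays: first 4 rook-like, last 4 bishop-like. */
static const int DF[8] = { 1, -1, 0,  0, 1,  1, -1, -1 };
static const int DR[8] = { 0,  0, 1, -1, 1, -1,  1, -1 };

/*
 * Hypothetical helper: returns a bitboard of own pieces pinned to the king.
 *   ksq        king square of the side to move (0..63)
 *   own        all pieces of the side to move
 *   enemy_all  all enemy pieces
 *   enemy_rq   enemy rooks and queens
 *   enemy_bq   enemy bishops and queens
 */
U64 pinned_pieces(int ksq, U64 own, U64 enemy_all, U64 enemy_rq, U64 enemy_bq)
{
    U64 pinned = 0;
    for (int d = 0; d < 8; d++) {
        U64 sliders = (d < 4) ? enemy_rq : enemy_bq;
        U64 candidate = 0;               /* first own piece met on this ray */
        int f = (ksq & 7) + DF[d];
        int r = (ksq >> 3) + DR[d];
        while (f >= 0 && f < 8 && r >= 0 && r < 8) {
            U64 bit = 1ULL << (r * 8 + f);
            if (own & bit) {
                if (candidate) break;    /* second own piece: nothing is pinned */
                candidate = bit;
            } else if (enemy_all & bit) {
                if (candidate && (sliders & bit))
                    pinned |= candidate; /* matching slider behind one own piece */
                break;
            }
            f += DF[d];
            r += DR[d];
        }
    }
    return pinned;
}
```

With such a mask computed once per node, moves of unpinned non-king pieces need no domove/check/undomove round trip when the king is not in check. Pinned pieces may still move along the pin line, and king moves and en passant captures need their own tests, so this only covers part of what a full legal generator has to do.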
while( bbMoves ) {
    // pop 1st bit
    to = ((Square)(BitTable[((bbMoves & -bbMoves) * 0x218a392cd3d5dbf) >> 58]) );
    bbMoves &= (bbMoves-1);
    cpt = to; // TODO: en passant
    pieceto = ( (piece>>1) == PAWN && ( (som == WHITE && (to>>3) == 7) || (som == BLACK && (to>>3) == 0) ) ) ? (QUEEN<<1 | som): piece; // pawn promotion
    piececpt = ((board[0]>>cpt) &1) + 2*((board[1]>>cpt) &1) + 4*((board[2]>>cpt) &1) + 8*((board[3]>>cpt) &1);
    // make move and store in global
    move = ((Move)pos | (Move)to<<6 | (Move)cpt<<12 | (Move)piece<<18 | (Move)pieceto<<22 | (Move)piececpt<<26 );
    // TODO: pseudo legal move gen: 2x speedup?
    // domove
    domove(board, pos, to, cpt, piece, pieceto, piececpt);
    // get king position
    bbTemp = board[0] ^ (board[1] | board[2] | board[3]);
    bbTemp = (som == BLACK)? board[0] : bbTemp ;
    bbTemp = bbTemp & board[1] & board[2] & ~board[3]; // get king
    kingpos = (Square)(BitTable[((bbTemp & -bbTemp) * 0x218a392cd3d5dbf) >> 58]);
    kic = 0;
    // king in check?
    kic = PieceInCheck(board, kingpos, som, RAttacks, BAttacks);
    if (kic == 0) {
        // copy move to global
        global_pid_moves[pid*max_depth*256+sd*256+n] = move;
        // Movecounters
        n++;
        COUNTERS[totalThreads*3+pid]++;
    }
    // undomove
    undomove(board, pos, to, cpt, piece, pieceto, piececpt);
}
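For readers who want to try the two bit tricks in that loop outside the kernel, here is a small host-side sketch in C. It shows the de Bruijn bit scan and the move packing with the same constant and shift amounts as above. The BitTable contents are not shown in the snippet, so the table here is rebuilt at startup from the multiplier; the names init_bit_table, pop_lsb, pack_move and the accessor functions are hypothetical, and the 64-bit Move type is an assumption.

```c
#include <stdint.h>

typedef uint64_t U64;
typedef uint64_t Move;  /* assumption: the kernel's Move type may be narrower */

#define DEBRUIJN64 0x218a392cd3d5dbfULL  /* multiplier used in the loop above */

static int bit_table[64];

/* Build the lookup: the top 6 bits of (1 << i) * DEBRUIJN64 differ for every i. */
void init_bit_table(void)
{
    for (int i = 0; i < 64; i++)
        bit_table[((1ULL << i) * DEBRUIJN64) >> 58] = i;
}

/* Return the index of the lowest set bit and clear it; *bb must be non-zero. */
int pop_lsb(U64 *bb)
{
    U64 lsb = *bb & (0 - *bb);   /* isolate the lowest set bit */
    *bb &= *bb - 1;              /* clear it */
    return bit_table[(lsb * DEBRUIJN64) >> 58];
}

/* Pack the six fields with the same layout as the `move = ...` line above. */
Move pack_move(int from, int to, int cpt, int piece, int pieceto, int piececpt)
{
    return (Move)from | (Move)to << 6 | (Move)cpt << 12
         | (Move)piece << 18 | (Move)pieceto << 22 | (Move)piececpt << 26;
}

/* Field accessors: squares take 6 bits each, piece codes 4 bits each. */
int move_from(Move m)     { return (int)(m & 0x3F); }
int move_to(Move m)       { return (int)((m >> 6) & 0x3F); }
int move_cpt(Move m)      { return (int)((m >> 12) & 0x3F); }
int move_piece(Move m)    { return (int)((m >> 18) & 0xF); }
int move_pieceto(Move m)  { return (int)((m >> 22) & 0xF); }
int move_piececpt(Move m) { return (int)((m >> 26) & 0xF); }
```

Call init_bit_table() once before the first pop_lsb(). The packed fields occupy 30 bits in total, so a 32-bit move word would also be enough for this layout.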
Gerd