Following HGM's post on processor caches, I have another question: how can I measure the impact of core port saturation? A modern processor core can execute up to 4 instructions in parallel, so its CPI can be as low as 0.25 if the program being executed has a good instruction mix (e.g. interleaving one memory instruction, one multiplication, etc.): the pipeline is then always full and no stall occurs. But looking at my chess code, I saw this (make-move after a capture):
Code:
TotalPiecesCount(opponent)--;
TotalPiecesCount(BOTH)--;
PieceCount(opponent, captured)--;
PieceCount(BOTH, captured)--;
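To make the snippet above concrete, here is a minimal sketch of how such counters might be laid out. The `Counters` struct, the enum values, the macro bodies and the `remove_captured` name are all hypothetical (the post only shows the four decrement lines); the point is just that each line is an independent load, subtract and store on a different memory location.

```c
#include <assert.h>

/* Hypothetical layout: the post does not show how the counters are stored. */
enum { WHITE = 0, BLACK = 1, BOTH = 2, SIDE_SLOTS = 3 };
enum { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING, PIECE_TYPES };

typedef struct {
    int total[SIDE_SLOTS];                /* pieces per side, plus BOTH */
    int by_type[SIDE_SLOTS][PIECE_TYPES]; /* pieces per side and type   */
} Counters;

Counters counters;

/* Hypothetical macro bodies for the names used in the post. */
#define TotalPiecesCount(side)   (counters.total[(side)])
#define PieceCount(side, piece)  (counters.by_type[(side)][(piece)])

/* Four read-modify-write operations on four distinct addresses.
 * None of them needs the result of another one. */
void remove_captured(int opponent, int captured)
{
    assert(opponent == WHITE || opponent == BLACK);
    assert(PieceCount(opponent, captured) > 0);

    TotalPiecesCount(opponent)--;
    TotalPiecesCount(BOTH)--;
    PieceCount(opponent, captured)--;
    PieceCount(BOTH, captured)--;
}
```

With this layout, calling `remove_captured(BLACK, PAWN)` lowers the black total, the combined total, the black pawn count and the combined pawn count by one each.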
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void test()
{
    unsigned long long size = 512ULL * 1024 * 1024;
    unsigned char *memory = (unsigned char*)malloc(size);
    if (memory == NULL)
        return;
    /* Clear the whole buffer; sizeof(memory) would only be the pointer size. */
    memset(memory, 0, size);
    clock_t t = clock();
    for (int j = 0; j < 8; j++)
    {
        unsigned char *current = memory;
        /* The increment must match the number of active rows below. */
        for (unsigned long long i = 0; i < size; i += 2 /* 16 */)
        {
            (*(current++))++;
            (*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
            //(*(current++))++;
        }
    }
    t = clock() - t;
    printf("%6.3f sec\n", t * (1. / CLOCKS_PER_SEC));
    free(memory);
}
Code:
1.852 sec ----> increment of 2, with 2 (*(current++))++; rows
1.531 sec ----> increment of 16, with 16 (*(current++))++; rows
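To put these two timings in perspective, the small sketch below converts them into increments per second and, for an assumed clock frequency, into cycles per increment. Both variants perform the same number of byte increments (8 passes over 512 MiB). The function names are made up for illustration, and the clock frequency is a parameter because the frequency the CPU actually ran at during the test is not known.

```c
#include <stdio.h>

/* Byte increments performed by the test: 8 passes over 512 MiB. */
double total_increments(void)
{
    return 8.0 * 512.0 * 1024.0 * 1024.0;
}

/* Throughput for a measured wall time in seconds. */
double increments_per_second(double seconds)
{
    return total_increments() / seconds;
}

/* Cycles spent per increment, for an ASSUMED clock frequency in Hz. */
double cycles_per_increment(double seconds, double assumed_hz)
{
    return seconds * assumed_hz / total_increments();
}

/* Prints both measurements from the post side by side. */
void print_report(double assumed_hz)
{
    const double t2 = 1.852;  /* 2 rows, increment of 2   */
    const double t16 = 1.531; /* 16 rows, increment of 16 */

    printf("2 rows : %.3e inc/s, %.2f cycles/inc\n",
           increments_per_second(t2), cycles_per_increment(t2, assumed_hz));
    printf("16 rows: %.3e inc/s, %.2f cycles/inc\n",
           increments_per_second(t16), cycles_per_increment(t16, assumed_hz));
    printf("ratio  : %.3f\n", t2 / t16);
}
```

For example, `print_report(3.4e9)` uses an assumed 3.4 GHz clock. Independently of the clock, the timings correspond to roughly 2.3 and 2.8 billion increments per second, so the 16-row version is about 21% faster.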
Code:
inc byte ptr[eax]
inc byte ptr[eax + 1]
Code:
inc byte ptr[eax]
inc byte ptr[eax + 1]
inc byte ptr[eax + 2]
...
inc byte ptr[eax + 15]
This clearly means I am missing something about these architectures...
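One way to narrow this down might be a more controlled version of the test, sketched below under a few assumptions. It puts both variants in the same binary, clears the whole buffer before timing so that first-touch page faults are not measured, and takes the buffer size as a parameter so the run can be repeated with a buffer that fits in cache. The names `inc_unroll2`, `inc_unroll16`, `time_passes` and `run_comparison` are made up for this sketch, and an optimizing compiler is free to transform these loops, so the generated assembly should be checked before drawing conclusions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Two increments per iteration; size must be a multiple of 2. */
void inc_unroll2(unsigned char *p, size_t size)
{
    for (size_t i = 0; i < size; i += 2, p += 2) {
        p[0]++;  p[1]++;
    }
}

/* Sixteen increments per iteration; size must be a multiple of 16. */
void inc_unroll16(unsigned char *p, size_t size)
{
    for (size_t i = 0; i < size; i += 16, p += 16) {
        p[0]++;  p[1]++;  p[2]++;  p[3]++;
        p[4]++;  p[5]++;  p[6]++;  p[7]++;
        p[8]++;  p[9]++;  p[10]++; p[11]++;
        p[12]++; p[13]++; p[14]++; p[15]++;
    }
}

/* CPU time in seconds for `passes` runs of `kernel` over the buffer. */
double time_passes(void (*kernel)(unsigned char *, size_t),
                   unsigned char *buf, size_t size, int passes)
{
    clock_t start = clock();
    for (int j = 0; j < passes; j++)
        kernel(buf, size);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Times both variants on the same buffer; returns 0 on success. */
int run_comparison(size_t size, int passes)
{
    unsigned char *buf = malloc(size);
    if (buf == NULL)
        return -1;
    memset(buf, 0, size); /* touch every page before the clock starts */

    double t2 = time_passes(inc_unroll2, buf, size, passes);
    double t16 = time_passes(inc_unroll16, buf, size, passes);
    printf("size %zu: unroll 2 = %6.3f sec, unroll 16 = %6.3f sec\n",
           size, t2, t16);

    free(buf);
    return 0;
}
```

Calling `run_comparison` once with a 512 MiB buffer and once with a small buffer of a few KiB (with proportionally more passes) would show whether the difference comes from memory traffic or from the loop overhead of the core itself.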
Another interesting question is what impact the instruction cache size has on a typical alpha-beta searcher. My CPU, for example, is an Intel i7-3630QM quad-core with HT, and it has 32 KB of instruction cache. Are the Eval+AB+Hash+...+printf+... instructions bigger than 32 KB? I guess yes, so keeping code compact should have very little impact on performance (please do not consider perft, where code locality is very high, so you would have to do something very wrong to get it wrong...). What do you think?
Best regards,
Natale.