Measuring cycles

Discussion of chess software programming and technical issues.

Moderator: Ras

Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Measuring cycles

Post by Gerd Isenberg »

Some VC code (32-bit mode) with x86 inline assembly to measure cycles with rdtsc instructions:

Code: Select all

unsigned int cycles;
unsigned int cpuidRDTSCcycles;

__forceinline void startRDTSC()
{
  __asm
  {
    xor   eax, eax
    cpuid
    rdtsc
    mov   [cpuidRDTSCcycles], eax
    xor   eax, eax
    cpuid
    rdtsc
    sub   eax, [cpuidRDTSCcycles]
    mov   [cpuidRDTSCcycles], eax

    xor   eax, eax
    cpuid
    rdtsc
    mov   [cpuidRDTSCcycles], eax
    xor   eax, eax
    cpuid
    rdtsc
    sub   eax, [cpuidRDTSCcycles]
    mov   [cpuidRDTSCcycles], eax

    xor   eax, eax
    cpuid
    rdtsc
    mov   [cpuidRDTSCcycles], eax
    xor   eax, eax
    cpuid
    rdtsc
    sub   eax, [cpuidRDTSCcycles]
    mov   [cpuidRDTSCcycles], eax

    xor   eax, eax
    cpuid
    rdtsc
    mov   [cycles], eax
  }
}

__forceinline void stopRDTSC()
{
  __asm
  {
    xor   eax, eax
    cpuid
    rdtsc
    sub   eax, [cycles]
    sub   eax, [cpuidRDTSCcycles]
    mov   [cycles], eax
  }
}
One may use it inside a small loop, to get cycles of the last run:

Code: Select all

void main (int argc, char**argv)
{
   for ( int i = 0; i < 7; i++)
   {
      startRDTSC();
      code2measure ...
      stopRDTSC();
   }
   printf("cycles by rdtsc = %d\n", cycles);
}
Or as proposed by HGM to address a histogram array by cycles:

Code: Select all

void main (int argc, char**argv)
{
   const int  N_TRIALS = 100000;
   unsigned int performanceHistogram[1024];
   memset (performanceHistogram, 0, sizeof (performanceHistogram));

   for ( int i = 0; i < N_TRIALS; i++)
   {
      startRDTSC();
      code2measure ...
      stopRDTSC();
      ++performanceHistogram[cycles & 1023];
   }
   // print histogram
}
There are issues with multi- and hyperthreading. For using rdtsc with gcc see http://www.cs.wm.edu/~kearns/001lab.d/rdtsc.html.

Winapi has QueryPerformanceCounter Function
http://msdn.microsoft.com/en-us/library ... S.85).aspx
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Measuring cycles

Post by Desperado »

hi Gerd, thx for this!

is there a problem to use this in 64 bit mode ?
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Measuring cycles

Post by Gerd Isenberg »

Desperado wrote:hi Gerd, thx for this!

is there a problem to use this in 64 bit mode ?
VC does not support inline asm in 64-bit mode. GCC does, but other syntax.
User avatar
hgm
Posts: 28461
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Measuring cycles

Post by hgm »

Gerd Isenberg wrote:Or as proposed by HGM to address a histogram array by cycles:

Code: Select all

void main (int argc, char**argv)
{
   const int  N_TRIALS = 100000;
   unsigned int performanceHistogram[1024];
   memset (performanceHistogram, 0, sizeof (performanceHistogram));

   for ( int i = 0; i < N_TRIALS; i++)
   {
      startRDTSC();
      code2measure ...
      stopRDTSC();
      ++performanceHistogram[cycles & 1023];
   }
   // print histogram
}
Note that I usually call RDTSC only once per loop, by using the stop time of one as the start time of the next. This because incrementing the histogram and loop counter takes totally negligible time compared to RDTSC. (Just 1 or 2 cycles.) RDTSC is pretty slow, and on some CPUs, athough the TSC measures clock cycles, it does not seem to increment in step of 1. A RDTSC can take as many as 50 clocks.

Code: Select all

void main (int argc, char**argv)
{
   const int  N_TRIALS = 100000;
   unsigned int performanceHistogram[1024];
   memset (performanceHistogram, 0, sizeof (performanceHistogram));

   for ( int i = 0; i < N_TRIALS; i++)
   {
      code2measure ...
      newTime = ReadTSC();
      cycles = newTime - OldTime;
      oldTime = newTime;
      ++performanceHistogram[cycles & 1023];
   }
   // print histogram
}
What I like about that is that you cannot have any "hidden jitter", as every time interval between two RDTSC events ends up in the histogram. With the other method it is in theory possible that, due to out-of-order instruction scheduling, the first call is systematically executed early, and the second call late, and you would never notice that you over-estimated the time needed to run the code at the expense of the time needed to make the histogram.

With a single RDTSC per loop, if almost all intervals are the same, you know there was no jitter and that the timing was reliable. If you see two peaks in the histogram for a code that should have constant execution time, you know there was jitter and tht you have to take the average.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Measuring cycles

Post by Zach Wegner »

Ah, thanks Gerd. Coincidentally, I was trying to use Agner Fog's code to measure clock cycles the past couple days. On Windows, his program does... something that causes it to crash, I suppose related to the DLLs it loads. On Unix here, it just outputs 0 for every cycle. The ReadTSC function just seems to output the same value for every iteration...

I'm also curious about your assembly. Why does it have the rdtsc/mov/rdtsc/sub sequence 3 times? I suppose just to synchronize it a bit more? :)
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Measuring cycles

Post by Daniel Shawul »

Some code I found from a link in wiki that seems to have everything. http://www.fftw.org/cycle.h
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Measuring cycles

Post by Gerd Isenberg »

Zach Wegner wrote:Ah, thanks Gerd. Coincidentally, I was trying to use Agner Fog's code to measure clock cycles the past couple days. On Windows, his program does... something that causes it to crash, I suppose related to the DLLs it loads. On Unix here, it just outputs 0 for every cycle. The ReadTSC function just seems to output the same value for every iteration...

I'm also curious about your assembly. Why does it have the rdtsc/mov/rdtsc/sub sequence 3 times? I suppose just to synchronize it a bit more? :)
Cpuid ensures rdtsc doesn't execute out of order, e.g. before the code to measure is finished. Since cpuid and rdtsc take some time, one needs to calibrate or measure the cycles they take. For some arbitrary reason this is repeated three time to to get a "reliable" time. This is probably not necessary for the loop test, as HG mentioned, he uses only one cpuid/rdtsc per loop cycle. Also note that only the lower dword of the performance counter is considered here, which should be fine for the purpuse to measure someting far below 2^32 cycles. I think the code was from some older intel or amd manual, but I am not that sure and can't find it actually. Simple play a bit around with that code.

As mentioned by Gian-Carlo rdtsc has some issues on multi-processors, and one should better rely on QueryPerformanceCounter.