SEE with alpha beta

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: SEE with alpha beta

Post by Rein Halbersma »

Don wrote: There is another way to measure the execution time of functions too: you can use high-resolution timers. But for short-running routines there is a kind of uncertainty principle involved, where the measurement gets in the way of the thing being measured.

I wonder how well that would work for something like this?
One Windows profiler that is very convenient to use is Very Sleepy. It will let you profile any running process without recompiling.
http://www.codersnotes.com/sleepy
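
On the high-resolution-timer idea quoted above, here is a minimal sketch of what that could look like in C, assuming POSIX clock_gettime() is available; the see()/position/moves names are only placeholders for whatever the engine actually uses. Batching many calls between the two clock readings keeps the timer overhead from swamping a routine as short as SEE, which is exactly the "uncertainty principle" problem Don mentions.

Code: Select all

/* Purely illustrative: average the cost of many SEE calls between two
 * clock readings, so the tens of nanoseconds spent reading the clock
 * do not dominate the measurement of a very short routine. */
#include <time.h>

typedef struct position position;     /* placeholder for the engine's own type */
extern int see(position *p, int mv);  /* the routine being measured */

double avg_see_ns(position *p, const int *moves, int n)
{
        struct timespec t0, t1;
        volatile int sink = 0;         /* keep the results live, so the calls survive optimization */
        int i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < n; i++)
                sink += see(p, moves[i]);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;

        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / n;    /* average ns per call */
}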
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SEE with alpha beta

Post by hgm »

Don wrote:I don't really know the exact details, but this is from the gcc manpage:
Well, easy enough to compile two times, "gcc -O2 -S" and "gcc -O2 -S -pg", and see what the difference is:

A small part of qperft:

Code: Select all

void board_init(char *b)
{ /* make an empty board surrounded by guard band of uncapturable pieces */
        int i;

        for(i= -1; i<0xBC; i++) b[i] = (i-0x22)&0x88 ? GUARD : DUMMY;
}
gcc -O2 -pg -S qperft.c:

Code: Select all

.globl board_init
	.type	board_init, @function
board_init:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	call	mcount
	movl	$-1, %edx
	movl	$-32, %eax
	movl	8(%ebp), %ebx
	leal	-1(%ebx), %ecx
	jmp	.L2
	.p2align 4,,7
	.p2align 3
.L5:
	leal	-34(%edx), %eax
	andl	$-120, %eax
	cmpb	$1, %al
	sbbl	%eax, %eax
	andl	$-65, %eax
	subl	$32, %eax
	leal	(%ebx,%edx), %ecx
.L2:
	addl	$1, %edx
	cmpl	$188, %edx
	movb	%al, (%ecx)
	jne	.L5
	popl	%ebx
	popl	%ebp
	ret
	.size	board_init, .-board_init
gcc -O2 -S qperft.c:

Code: Select all

.globl board_init
	.type	board_init, @function
board_init:
	pushl	%ebp
	movl	$-1, %edx
	movl	%esp, %ebp
	movl	$-32, %eax
	pushl	%ebx
	movl	8(%ebp), %ebx
	leal	-1(%ebx), %ecx
	jmp	.L2
	.p2align 4,,7
	.p2align 3
.L5:
	leal	-34(%edx), %eax
	andl	$-120, %eax
	cmpb	$1, %al
	sbbl	%eax, %eax
	andl	$-65, %eax
	subl	$32, %eax
	leal	(%ebx,%edx), %ecx
.L2:
	addl	$1, %edx
	cmpl	$188, %edx
	movb	%al, (%ecx)
	jne	.L5
	popl	%ebx
	popl	%ebp
	ret
	.size	board_init, .-board_init
A "diff -u" between the two

Code: Select all

@@ -5,11 +5,10 @@
 	.type	board_init, @function
 board_init:
 	pushl	%ebp
-	movl	%esp, %ebp
-	pushl	%ebx
-	call	mcount
 	movl	$-1, %edx
+	movl	%esp, %ebp
 	movl	$-32, %eax
+	pushl	%ebx
 	movl	8(%ebp), %ebx
 	leal	-1&#40;%ebx&#41;, %ecx
 	jmp	.L2
You can see the only extra code is the "call mcount" in the entry code of the routine, and a slight re-ordering of the instructions around it. The actual body of the routine is not touched at all.

So I don't think your skepticism towards profiling is justified. Of course there will be statistical noise, as the timer interrupt samples the code only 100 times/sec (i.e. 0.01 sec/sample). So you should expect sqrt(N) noise, and the results only become 1% accurate when you have spent 100 sec in the routine. (And even then, for programs with long-term very regular behavior, such as matrix inversions, the samples might not be independent.) I spent 29 sec in SEE, though:

Code: Select all

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 43.89    102.30   102.30 110301449     0.00     0.00  Evaluate
 32.30    177.60    75.30        8     9.41    29.13  Search
 12.59    206.94    29.34 196194872     0.00     0.00  SEE
  7.31    223.99    17.05 35669452     0.00     0.00  capt_gen
  2.71    230.31     6.32  8159522     0.00     0.00  move_gen
  1.16    233.02     2.71 31231960     0.00     0.00  capturable
  0.02    233.06     0.04   200272     0.00     0.00  make_list
  0.01    233.08     0.02   135676     0.00     0.00  Attacked
  0.01    233.10     0.02                             main
The mcount routine probably does not only count based on the address of the mcount call (i.e. the return address on the stack), but also digs deeper into the stack to figure out who called the routine that is now being counted, because gprof gives me that info too, afterwards.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: SEE with alpha beta

Post by Zach Wegner »

Don wrote:
hgm wrote:Indeed, I profiled it. Now that I ported Joker to Linux recently, I could finally do that. Under Cygwin every function always took 0%...

I had 48% Eval, 23% Search, 12.5% SEE, 7% GenCaptures, 5% GenMoves and some minor stuff. Not very scientific, as I used only a single position. Joker's eval is not very elaborate, but it is not very efficient either. A lot of time goes into Pawn evaluation, and it has no Pawn hash. So I could probably speed up Eval significantly, making the SEE contribution even heavier. And I don't subject LxH and equal captures to SEE.

Calling a function twice might be misleading, because the second time all the data it needs will be in cache, and perhaps the branch-prediction logic has learned to perfectly predict all branches from the previous time. (And newer CPUs might have a uop cache where the function is still loaded.) It sounds like a very error-prone method to me.
Actually, using the gcc profiler is the most error-prone method - it's only good for getting approximate numbers. Let me tell you why:

First of all, there is no caching issue because I'm not calling the same function twice. I made 2 copies of the same function and just changed the names. Then the original name I used as a wrapper - it is now the function that calls both clones and gets a result from each. And just to make absolutely sure the compiler doesn't notice that I ignore one of the results, I compare the 2 results and return the "larger" of the two (they will always be the same of course, but the compiler doesn't know that).

So I've actually increased the overhead as you will see - it looks something like this:

int see_cloneA( position *p, int mv );
int see_cloneB( position *p, int mv );

int see( position *p, int mv )
{
    int x = see_cloneA(p, mv);
    int y = see_cloneB(p, mv);

    if (x > y) return x;
    return y;
}

So now every time the program calls see(), instead of 1 function call/return you get three! And there is no special cache improvement from calling the same function twice in a row.

So I personally believe that profiling is very imprecise and this is far better. Here is the explanation for why that is:

Profiling is subject to a significant amount of noise because it's based on sampling, and it actually changes the code - the code you are profiling is NOT the same code you normally run, thanks to the -pg flag to gcc, which inserts extra code all over the place to make profiling possible. This can have really bad effects on caching and execution speed - far more than my method.

There is no perfect way (except perhaps with special-purpose hardware) to profile code, but this method is, I think, far more accurate (though also far more tedious) than using gcc to profile code.

Of course this is only good for measuring the overhead of a single function; it would not work well if I wanted a detailed full-program profile of every function, but in this case I did not need that.

However, I do occasionally use the gcc profiler, and see() is never even on the radar. The gcc profiler is a very useful tool for getting a rough idea of the execution speed of various functions.
I will just say that I really dislike your method of duplicating the function. It perturbs the system in big ways, enough that I wouldn't trust the results at all, and it requires far too much programmer effort. GCC's profiler sucks too, since it adds code in every function to determine the parent function. A simple sampling profiler is really what you want, ideally one that doesn't even touch the executable.

If you are really serious about performance, I would recommend a simulator, like this one: http://www.ptlsim.org
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: SEE with alpha beta

Post by Don »

hgm wrote:
Don wrote:I don't really know the exact details, but this is from the gcc manpage:
Well, easy enough to compile two times, "gcc -O2 -S" and "gcc -O2 -S -pg", and see what the difference is:

A small part of qperft:

Code: Select all

void board_init(char *b)
{ /* make an empty board surrounded by guard band of uncapturable pieces */
        int i;

        for(i= -1; i<0xBC; i++) b[i] = (i-0x22)&0x88 ? GUARD : DUMMY;
}
gcc -O2 -pg -S qperft.c:

Code: Select all

.globl board_init
	.type	board_init, @function
board_init:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	call	mcount
	movl	$-1, %edx
	movl	$-32, %eax
	movl	8(%ebp), %ebx
	leal	-1(%ebx), %ecx
	jmp	.L2
	.p2align 4,,7
	.p2align 3
.L5:
	leal	-34(%edx), %eax
	andl	$-120, %eax
	cmpb	$1, %al
	sbbl	%eax, %eax
	andl	$-65, %eax
	subl	$32, %eax
	leal	(%ebx,%edx), %ecx
.L2:
	addl	$1, %edx
	cmpl	$188, %edx
	movb	%al, (%ecx)
	jne	.L5
	popl	%ebx
	popl	%ebp
	ret
	.size	board_init, .-board_init
gcc -O2 -S qperft.c:

Code: Select all

.globl board_init
	.type	board_init, @function
board_init:
	pushl	%ebp
	movl	$-1, %edx
	movl	%esp, %ebp
	movl	$-32, %eax
	pushl	%ebx
	movl	8(%ebp), %ebx
	leal	-1(%ebx), %ecx
	jmp	.L2
	.p2align 4,,7
	.p2align 3
.L5:
	leal	-34(%edx), %eax
	andl	$-120, %eax
	cmpb	$1, %al
	sbbl	%eax, %eax
	andl	$-65, %eax
	subl	$32, %eax
	leal	(%ebx,%edx), %ecx
.L2:
	addl	$1, %edx
	cmpl	$188, %edx
	movb	%al, (%ecx)
	jne	.L5
	popl	%ebx
	popl	%ebp
	ret
	.size	board_init, .-board_init
A "diff -u" between the two

Code: Select all

@@ -5,11 +5,10 @@
 	.type	board_init, @function
 board_init:
 	pushl	%ebp
-	movl	%esp, %ebp
-	pushl	%ebx
-	call	mcount
 	movl	$-1, %edx
+	movl	%esp, %ebp
 	movl	$-32, %eax
+	pushl	%ebx
 	movl	8(%ebp), %ebx
 	leal	-1&#40;%ebx&#41;, %ecx
 	jmp	.L2
You can see the only extra code is the "call mcount" in the entry code of the routine, and a slight re-ordering of the instructions around it. The actual body of the routine is not touched at all.

So I don't think your skepticism towards profiling is justified.
Maybe it's not a big deal, but there is still the stack setup and the memory accesses (which will be necessary to put the timing data in the right buckets), which will basically make the profiled run different from an actual run. Is it a big deal? Probably not.

Of course there will be statistical noise, as the timer interrupt samples the code only 100 times/sec (i.e. 0.01 sec/sample). So you should expect sqrt(N) noise, and the results only become 1% accurate when you have spent 100 sec in the routine. (And even then, for programs with long-term very regular behavior, such as matrix inversions, the samples might not be independent.) I spent 29 sec in SEE, though:

Code: Select all

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 43.89    102.30   102.30 110301449     0.00     0.00  Evaluate
 32.30    177.60    75.30        8     9.41    29.13  Search
 12.59    206.94    29.34 196194872     0.00     0.00  SEE
  7.31    223.99    17.05 35669452     0.00     0.00  capt_gen
  2.71    230.31     6.32  8159522     0.00     0.00  move_gen
  1.16    233.02     2.71 31231960     0.00     0.00  capturable
  0.02    233.06     0.04   200272     0.00     0.00  make_list
  0.01    233.08     0.02   135676     0.00     0.00  Attacked
  0.01    233.10     0.02                             main
The mcount routine probably does not only count based on the address of the mcount call (i.e. the return address on the stack), but also digs deeper into the stack to figure out who called the routine that is now being counted, because gprof gives me that info too, afterwards.
The bottom line is that I do not feel that calling a cloned routine twice is horribly inaccurate. I will concede that if you run a profile long enough it's probably good enough, but my way doesn't touch any code except what I am specifically interested in.

What is more interesting to me is why see() is essentially free in my program while it takes 1/8 of the whole time spent in yours. That is an amazing difference, even given the fact that Komodo is a really heavy program. Of course I can see why you would want to optimize it as much as possible.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SEE with alpha beta

Post by hgm »

Zach Wegner wrote: GCC's profiler sucks too, since it adds code in every function to determine the parent function. A simple sampling profiler is really what you want, ideally one that doesn't even touch the executable.
Actually you can easily achieve that with the gcc profiler. The extra code is a consequence of compiling with the -pg option. If you don't want it, don't use the -pg option, but use -c instead, to obtain uncontaminated .o files.

Then, during linking of the .o files, do use the -pg option. This now links in another startup routine, which reserves space for the histogram, sets up the timer interrupt, and saves the gmon.out on exit. You will then be profiling 100% original code. The interruptions every 10 msec should hardly disturb anything, as you have 10 msec for things to settle back to normal (e.g. flushed-out cache entries to be reloaded, branch-prediction tables to be rebuilt) before you make the next probe.
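
Spelled out with the qperft example from above (my reading of the recipe, so treat the exact commands as an assumption), the two-step build would look something like this. Note that without -pg at compile time there are no mcount calls, so afterwards gprof can only show the time histogram, not call counts:

Code: Select all

gcc -O2 -c qperft.c          # compile: object file contains no mcount instrumentation
gcc -pg qperft.o -o qperft   # link with -pg: profiling startup, writes gmon.out on exit
./qperft                     # run the 100% original code
gprof ./qperft gmon.out      # flat profile: time per function, no call counts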
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: SEE with alpha beta

Post by Zach Wegner »

hgm wrote:Actually you can easily achieve that with the gcc profiler. The extra code is a consequence of compiling with the -pg option. If you don't want it, don't use the -pg option, but use -c instead, to obtain uncontaminated .o files.

Then, during linking of the .o files, do use the -pg option. This now links in another startup routine, which reserves space for the histogram, sets up the timer interrupt, and saves the gmon.out on exit. You will then be profiling 100% original code. The interruptions every 10 msec should hardly disturb anything, as you have 10 msec for things to settle back to normal (e.g. flushed-out cache entries to be reloaded, branch-prediction tables to be rebuilt) before you make the next probe.
That is certainly an improvement, but it still does perturb the code. Having a separate startup routine and histogram affects the alignment of the code, which can in turn affect set aliasing, etc. Of course, it's a pretty small effect, and most people probably don't care. Ideally there should not even be a recompile, just a symbol table produced during compilation that can map IP->function.
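
For what it's worth, the mechanism behind such a sampler is easy enough to sketch in-process, though that is not quite the zero-touch ideal described here. The sketch below assumes Linux/x86-64 with glibc: a SIGPROF handler records the interrupted instruction pointer, and the raw addresses are mapped back to function names afterwards with the symbol table (e.g. addr2line or nm).

Code: Select all

/* Rough sketch only: sample the instruction pointer 100 times per second
 * of consumed CPU time and dump the raw addresses; mapping address ->
 * function is done offline from the symbol table (addr2line -f -e ./engine). */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES (1 << 20)
static void *samples[MAX_SAMPLES];
static volatile unsigned nsamples;

static void prof_handler(int sig, siginfo_t *si, void *ctx)
{
        ucontext_t *uc = (ucontext_t *)ctx;
        if (nsamples < MAX_SAMPLES)
                samples[nsamples++] = (void *)uc->uc_mcontext.gregs[REG_RIP];
        (void)sig; (void)si;
}

void prof_start(void)
{
        struct sigaction sa;
        struct itimerval tv = { { 0, 10000 }, { 0, 10000 } };  /* 10 msec */

        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = prof_handler;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigaction(SIGPROF, &sa, NULL);
        setitimer(ITIMER_PROF, &tv, NULL);     /* fires on consumed CPU time */
}

void prof_dump(void)                           /* call on exit */
{
        unsigned i;
        for (i = 0; i < nsamples; i++)
                printf("%p\n", samples[i]);
}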
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SEE with alpha beta

Post by hgm »

Zach Wegner wrote:That is certainly an improvement, but it still does perturb the code. Having a separate startup routine and histogram affect the alignment of the code, which can in turn affect set aliasing, etc. Of course, it's a pretty small effect, and most people probably don't care. Ideally there should be not even be a recompile, just a symbol table produced during compilation that can map IP->function.
I think you are expecting too much now. Two runs will never have the same L2 set aliasing, even for an identical executable; it depends on how the OS assigns pages. The code itself contains lots of alignment directives, and it would be totally unnecessary (not to say plain stupid) if they had not made the length of the startup routine the same for the profiling and non-profiling versions, modulo the page length. So I suppose they will have done that.