In my case, I have the same speed with the MSVC 6 that with the MSVC 2005.
Pedro
Compilation question for gcc experts
Moderator: Ras
-
- Posts: 1056
- Joined: Fri Mar 10, 2006 6:07 am
- Location: Basque Country (Spain)
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Compilation question for gcc experts
I think all of that is beyond dangerous. How do you know that nothing else has been added to the stack after your alloca()? Certainly nothing in the C calling convention standard suggests that is the case.
The second alloca() has no chance. This is an amount to adjust the stack by, and I don't see any mention of signed values, since this is an address adjustment...
Adjusting sp also seems dangerous. Since the alloca() standard says the space is freed when the procedure returns, you could easily run afoul of how the compiler chooses to implement this feature and leave the stack corrupted...
If you want a temp array, why not just declare a local one that gets allocated on the stack when you enter??
The second alloca() has no chance. This is an amount to adjust the stack by, and I don't see any mention of signed values, since this is an address adjustment...
Adjusting sp also seems dangerous. Since the alloca() standard says the space is freed when the procedure returns, you could easily run afoul of how the compiler chooses to implement this feature and leave the stack corrupted...
If you want a temp array, why not just declare a local one that gets allocated on the stack when you enter??
-
- Posts: 28348
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Compilation question for gcc experts
Yes, I admit it is a dirty hack. I certainly wouldn't do it in something I wanted to be portable. But if I test it for a given compile, and it works, I see no reason to refrain from such tricks just for making release compiles tailored for running on machines that I don't use myself for development.
The reason I cannot declare the array as a normal local variable is that I do not know in advance how many moves will be generated. (And even if I would, I would have to use something like alloca() to declare it, as normal locals in C cannot have dynamic size.) So I would have to declare it large enough for the worst case, which is huge (256 moves or so, which in Joker's case is 1KB). That would be mostly wasted space, as the typical number of moves would only be 40.
As a result, the top of the system stack would advance through memory in huge strides. A Pentium 4 has only 8KB of L1 data cache, in 4 ways of 2KB. So after 2 ply I would already run out of one way, and the cache mapping would wrap around. This leads to possible L1 collisions between the latest stack frame and that of 2 ply earlier. At the frontier of the tree, in QS, you are continually diving a few ply deeper and back, so the last 4 ply or so are all in the working set. This is very detrimental to performance. (Part of the problem is that Joker has no separate Search and QS, otherwise I could of course do with a much smaller worst case for the move-list size in QS.) This is why I want to squeeze out the unused space, so that a stack frame of Search() typically only occupies 260 bytes or so (160 for 40 moves, plus 100 for local scalar variables). A single cache way can then hold 8 ply worth of stack, which should be enough to cover most excursions near the tree frontier without L1 cache misses.
The cache misses really cause a horrible slowdown. For a perft(7) from the opening the time varies between 204 sec and 118 sec, just depending on the addres where the move stack starts (if I have separate move stacks and system stacks). This is a far larger effect than the difference between code efficiency of the various compilers. And I cannot tailer the starting point of the move stack by hand, as moving it around will merely make the collision occur at another ply depth. So with separate stacks there is no static solution that will avoid such collisions at any search depth and branching ratio.
I could of course dynamically skip areas in the move-stack area that would collide with recently used (or soon to be used) stack frames on the system stack, trying to keep the two mapping half a cache way (1KB) apart. But this is not so easy either, and very machine dependent (as the program would have to be aware of the way size, wchich ranges from 2KB on the p4 tp 32KB on an AMD). Merging the two stacks would be a solution that (once it works) would work on any machine.
I don't think the freeing of space on the return of the procedure would require any specific action by the compiler. It just pops the stack frame, by restoring conditions as they were when the routine was entered. So every memory of how you manipulated the stack is automatically erased. The code used for leaving a procedure that used alloca() is simply:
It makes no assumptions on the stack pointer before the return.
The reason I cannot declare the array as a normal local variable is that I do not know in advance how many moves will be generated. (And even if I would, I would have to use something like alloca() to declare it, as normal locals in C cannot have dynamic size.) So I would have to declare it large enough for the worst case, which is huge (256 moves or so, which in Joker's case is 1KB). That would be mostly wasted space, as the typical number of moves would only be 40.
As a result, the top of the system stack would advance through memory in huge strides. A Pentium 4 has only 8KB of L1 data cache, in 4 ways of 2KB. So after 2 ply I would already run out of one way, and the cache mapping would wrap around. This leads to possible L1 collisions between the latest stack frame and that of 2 ply earlier. At the frontier of the tree, in QS, you are continually diving a few ply deeper and back, so the last 4 ply or so are all in the working set. This is very detrimental to performance. (Part of the problem is that Joker has no separate Search and QS, otherwise I could of course do with a much smaller worst case for the move-list size in QS.) This is why I want to squeeze out the unused space, so that a stack frame of Search() typically only occupies 260 bytes or so (160 for 40 moves, plus 100 for local scalar variables). A single cache way can then hold 8 ply worth of stack, which should be enough to cover most excursions near the tree frontier without L1 cache misses.
The cache misses really cause a horrible slowdown. For a perft(7) from the opening the time varies between 204 sec and 118 sec, just depending on the addres where the move stack starts (if I have separate move stacks and system stacks). This is a far larger effect than the difference between code efficiency of the various compilers. And I cannot tailer the starting point of the move stack by hand, as moving it around will merely make the collision occur at another ply depth. So with separate stacks there is no static solution that will avoid such collisions at any search depth and branching ratio.
I could of course dynamically skip areas in the move-stack area that would collide with recently used (or soon to be used) stack frames on the system stack, trying to keep the two mapping half a cache way (1KB) apart. But this is not so easy either, and very machine dependent (as the program would have to be aware of the way size, wchich ranges from 2KB on the p4 tp 32KB on an AMD). Merging the two stacks would be a solution that (once it works) would work on any machine.
I don't think the freeing of space on the return of the procedure would require any specific action by the compiler. It just pops the stack frame, by restoring conditions as they were when the routine was entered. So every memory of how you manipulated the stack is automatically erased. The code used for leaving a procedure that used alloca() is simply:
Code: Select all
leal -12(%ebp), %esp // resore stack
popl %ebx // except 3 saved registers
popl %esi // which are then restored
popl %edi
popl %ebp // restore frame pointer
ret