Questions about getting ready for multicore programming.


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Consistant performance penalty for C++ classes

Post by bob »

Aleks Peshkov wrote: Gerd, how efficient is the x86 implementation of thread-scope variables, which are supported by all major compilers and look set to be added to the next C++ standard?
No different than local data accesses today. They are just allocated on a stack in a block, one block per thread... You won't see any difference there at all, as "scope" is a compiler issue, not an execution issue.
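For anyone who wants to see what a thread-scope variable looks like in code, here is a minimal sketch. It assumes the C++11 `thread_local` keyword, which is the standardized spelling of what compilers of this era expose as `__thread` or `__declspec(thread)`. The names `nodes_searched`, `count_nodes` and `run_workers` are made up for illustration. How the per-thread copy is actually reached at run time depends on the compiler and platform.

```cpp
#include <thread>
#include <vector>

// Hypothetical per-thread counter: every thread sees its own copy.
thread_local unsigned long nodes_searched = 0;

// Pretend "search": bump this thread's private counter n times and
// return the value this thread ends up with.
unsigned long count_nodes(unsigned long n) {
    for (unsigned long i = 0; i < n; ++i)
        ++nodes_searched;  // no lock needed, no other thread touches this copy
    return nodes_searched;
}

// Run count_nodes(n) on `threads` threads and collect each thread's result.
std::vector<unsigned long> run_workers(unsigned threads, unsigned long n) {
    std::vector<unsigned long> results(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&results, t, n] { results[t] = count_nodes(n); });
    for (auto& th : pool)
        th.join();
    return results;
}
```

Calling `run_workers(4, 1000)` should give every worker its own count of 1000, while the calling thread's copy of `nodes_searched` stays untouched.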
Bo Persson
Posts: 243
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Consistant performance penalty for C++ classes

Post by Bo Persson »

Carey wrote:I just couldn't let this go.

When Hyatt says he ran such & such test and got such & such results, you can be sure he did. He may not have done the exact same test you would have or some such difference, but he is definitely a big believer in testing.

So, I began thinking that maybe the problem isn't C++ or my test program, but maybe my compiler.

I'm using the current version of MinGW / GCC.

Admittedly GCC has never been known for the best code generation, but it should have produced reasonable results.

But maybe it didn't. Maybe GCC has an optimization problem with C++.

So, I installed MSVC Express 2008 and ran the C++ tests again.

I can't give you the numbers because my laptop is on battery, so they are much slower than before, but this time, they are much closer together.

My battery is almost dead and it's late and I need to go to bed, but it looks like my problem might have been GNU C++ generating poor code for data in classes.

I'll run some more tests tomorrow.
If you are using the default gcc for Cygwin, this could explain a lot of the overhead. Going from ver 3.4 to 4.2 (or so) will give you a lot of added optimizations.

Like Gerd says elsethread, having a this-pointer (or a TREE*) in a register can save a lot on code size in a 64 bit build.
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Consistant performance penalty for C++ classes

Post by Carey »

Gerd Isenberg wrote:
Carey wrote: I'm thinking that the performance penalty is due to the 'this' pointer having to be referenced for every access to the 'Board[]' array. That's got to add some overhead. I wouldn't have thought 10% or more, but.... (shrug)
I have the impression that accessing global variables in 64-bit mode becomes more expensive. There is no compact mode with 32-bit addresses. There is a rip-relative addressing mode, but assembly generated by vc2005 indicates a pointer is needed to access globals all the time. Globals like static class members as well as statics inside the local scope of a function.


 lea r10, base of some data_segment
 mov rax, [r10 + offset global var]
Thus passing a board-, search- or equivalently a this-pointer around, even in a recursive search, might be faster than accessing globals. It might even make sense to keep all the constant data inside a one-time-initialized, embedded non-static "const" member.
I haven't tested any of this under a 64 bit OS.

Right now, I'm just working with a simple mailbox program, so 64 bits wouldn't help.

I had been working on a bitboard program, but things got to be way too cluttered & bloated so I abandoned it.

I then started working on doing a basic chess program with full winboard support. I just used a simple mailbox program because it was smaller and easier to deal with.

I then decided I might as well throw in simple multi-core support too.

Then I'll switch things over to bitboards.
A nice aspect of C++ is that you can wrap and hide SSE2 intrinsics (or other SIMD architectures like AltiVec) to use 128-bit registers as a vector of two bitboards for fill stuff, with the usual C++ operator syntax. The assembly generated by vc2005 is quite optimal for that stuff.
Well, I'm not much for doing SSE2 stuff.

If I've understood what some of your other messages have said, doing sse2 etc. stuff on an AMD X2 isn't worth the time due to the higher latency and lower throughput as compared to the Core2 stuff.
Whether you use move.isCapture() or isCapture(move) with TMove as an ordinal scalar int is a matter of taste or pragmatism. Whether you internally use a bitfield struct/class or explicit masking and filling by shift/and/or shouldn't matter, as long as you "hide" the implementation and the object fits inside a register and can be passed by value with the default fastcall convention. Endian issues matter for persistent objects, where one may use endian- and size-independent write/read methods.
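To make the point about hiding the implementation concrete, here is a small sketch of a register-sized move type. The bit layout (6 bits from-square, 6 bits to-square, one capture flag) and the names `kCaptureFlag`, `from()`, `to()` and `raw()` are assumptions for illustration only, not anybody's actual engine code. Both spellings, `move.isCapture()` and `isCapture(move)`, sit on top of the same masking.

```cpp
#include <cstdint>

// Assumed layout for this sketch: bits 0-5 = from square,
// bits 6-11 = to square, bit 12 = capture flag.
constexpr std::uint32_t kCaptureFlag = 1u << 12;

class TMove {
public:
    TMove(unsigned from, unsigned to, bool capture)
        : bits_((from & 63u) | ((to & 63u) << 6) | (capture ? kCaptureFlag : 0u)) {}

    unsigned from() const { return bits_ & 63u; }
    unsigned to() const { return (bits_ >> 6) & 63u; }
    bool isCapture() const { return (bits_ & kCaptureFlag) != 0; }

    // Raw value, e.g. for storing in a hash table entry.
    std::uint32_t raw() const { return bits_; }

private:
    std::uint32_t bits_;  // the whole object is one 32-bit scalar
};

// Free-function spelling of the same query; TMove is passed by value.
inline bool isCapture(TMove m) { return m.isCapture(); }

// The wrapper should cost nothing in size compared with a plain int.
static_assert(sizeof(TMove) == sizeof(std::uint32_t),
              "TMove should stay register-sized");
```

Callers never see the shifts and masks, so switching to a bitfield struct later would only touch the class body.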

Gerd
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Consistant performance penalty for C++ classes

Post by Carey »

Bo Persson wrote:
If you are using the default gcc for Cygwin, this could explain a lot of the overhead. Going from ver 3.4 to 4.2 (or so) will give you a lot of added optimizations.

Like Gerd says elsethread, having a this-pointer (or a TREE*) in a register can save a lot on code size in a 64 bit build.
64 bit does have more registers than the 32 bit mode.... Sad day for the computing world when the x86 architecture became dominant.


I am using the official current release of MinGW. That means it is v3 as opposed to v4.

The MinGW people seem to have an aversion to v4 and still refuse to officially release a current version of the GCC stuff. They keep claiming it's too buggy etc. etc.

I could find some home compiled copies of GCC4 to run, but I always have an aversion to doing things like that. I've had too many problems over the years when I used somebody's home compiled customized programs.


But yes, late last night I did consider that maybe the MinGW compiler was my problem. Since there weren't any official releases of MinGW 4, I had to give MSVC a try. (It is faster, but I don't like the IDE.)
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Questions about getting ready for multicore programming.

Post by Carey »

Okay, I just got through running some tests.

I am using the latest MinGW, which is v3.x. As Bo Persson points out, that may be a good part of my problem. (But since MinGW hasn't officially released a version based on a current version of GNU C 4, and probably won't until the next decade, there isn't a lot I am willing to do. I don't like running alphas or private builds.)

I tested my program with GCC v3, OpenWatcom and MSVC 2008 Express.

I did the 'data in class' vs. 'data outside of class'. I didn't bother testing the plain C version because that should be comparable to 'data outside of class'.
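For readers who want to try the comparison themselves, here is a rough sketch of what a 'data in class' vs. 'data outside of class' test can look like. This is not the actual test program from this thread. The names `g_board`, `Position`, `sum_global`, `sum_member` and `time_ms` are invented, and the loop body is a stand-in for real move generation.

```cpp
#include <chrono>
#include <cstdint>

// Variant 1: data outside of any class (a plain global array).
int g_board[64];

std::uint64_t sum_global(int rounds) {
    std::uint64_t total = 0;
    for (int r = 0; r < rounds; ++r)
        for (int sq = 0; sq < 64; ++sq)
            total += static_cast<std::uint64_t>(g_board[sq] + r);
    return total;
}

// Variant 2: the same data as a class member, reached through 'this'.
class Position {
public:
    int board[64];

    std::uint64_t sum_member(int rounds) const {
        std::uint64_t total = 0;
        for (int r = 0; r < rounds; ++r)
            for (int sq = 0; sq < 64; ++sq)
                total += static_cast<std::uint64_t>(board[sq] + r);
        return total;
    }
};

// Time one callable in milliseconds; its return value goes into 'result'
// so the work has a visible effect.
template <typename F>
double time_ms(F f, std::uint64_t& result) {
    const auto start = std::chrono::steady_clock::now();
    result = f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Fill both boards with the same values, time `sum_global` and `sum_member` with the same `rounds`, and check that the two results match before comparing times. A loop this simple is easy for an optimizer to transform heavily, so timings from a real search are a more trustworthy measure.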

Needless to say, OpenWatcom was the slowest. Dead dog slow. Almost twice as long as MSVC. Probably the only other 'professional' compiler that would be slower is Borland's current free Turbo C++.

Anyway, OpenWatcom had a performance penalty of about 8%.

I tried a couple versions of MinGW with a few different switches. (I didn't try all the switches it offers. Just what the CodeBlocks IDE offers.) The performance penalty was 9%.

I tried MSVC 2008 Express. The performance penalty was 3%.

(I don't know if the free student version of MSVC pro tools would do any better. I'm not a college student and unfortunately, I don't know any to ask if they'd get me a free copy from Microsoft.)


So it looks like there is indeed a non-trivial performance penalty when you put the data into a C++ class.

Much of that can be optimized away by using a state-of-the-art compiler with significant optimization abilities.

Anything with just 'average' or 'good' optimization abilities will be at a serious disadvantage with C++ code.


So I guess most of this thread has been taken care of.... If I want to do multi-core programming, I'm going to have to stay with MSVC. In which case it doesn't matter too much whether it's threads or processes or something else.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions about getting ready for multicore programming.

Post by bob »

Carey wrote: [...] If I want to do multi-core programming, I'm going to have to stay with MSVC. In which case it doesn't matter too much whether it's threads or processes or something else.
Just to clarify, there isn't any "something else".

:)
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Questions about getting ready for multicore programming.

Post by Carey »

bob wrote:
Carey wrote: [...] If I want to do multi-core programming, I'm going to have to stay with MSVC. In which case it doesn't matter too much whether it's threads or processes or something else.
Just to clarify, there isn't any "something else".

:)
Actually there is, although it's only a little removed.

A normal 'process' program just forks, shares some read-only data, and then communicates through shared memory while sharing the trans table.

The 'something else' is to go all the way and use entirely separate programs communicating through some channel (pipes, or LAN, or whatever you want). No shared memory, etc.

Each engine can be running on a different core or even a different processor.

It solves all the shared memory bugs & issues, while increasing the communication complexity.

Not much removed, but there are enough differences in what can be done that it's worth calling it 'something else'.

Threads are on the left, common fork()ing processes in the middle, and entirely separate programs on the right.
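As an illustration of the right-hand end of that range, here is a minimal POSIX-only sketch (it will not build with MinGW or MSVC as written) of two processes that talk purely through pipes. The function name `ask_engine` and the "ack:" reply format are invented for this example. fork() is used only as a cheap way to get a second process. A real setup would start a separate engine binary and would share nothing with it.

```cpp
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Send one short command (under 256 bytes) to a child "engine" process
// and return its reply. Returns an empty string on failure.
std::string ask_engine(const std::string& command) {
    int to_child[2], to_parent[2];
    if (pipe(to_child) != 0) return "";
    if (pipe(to_parent) != 0) {
        close(to_child[0]); close(to_child[1]);
        return "";
    }

    const pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]); close(to_child[1]);
        close(to_parent[0]); close(to_parent[1]);
        return "";
    }

    if (pid == 0) {
        // Child: the "engine". It only sees what arrives on the pipe.
        close(to_child[1]);
        close(to_parent[0]);
        char buf[256];
        const ssize_t n = read(to_child[0], buf, sizeof buf);
        if (n > 0) {
            const std::string reply = "ack:" + std::string(buf, static_cast<size_t>(n));
            const ssize_t w = write(to_parent[1], reply.data(), reply.size());
            if (w < 0) _exit(1);
        }
        _exit(0);
    }

    // Parent: the "controller". Send the command, then read until EOF.
    close(to_child[0]);
    close(to_parent[1]);
    const ssize_t sent = write(to_child[1], command.data(), command.size());
    close(to_child[1]);

    std::string reply;
    if (sent >= 0) {
        char buf[256];
        ssize_t n;
        while ((n = read(to_parent[0], buf, sizeof buf)) > 0)
            reply.append(buf, static_cast<size_t>(n));
    }
    close(to_parent[0]);
    waitpid(pid, nullptr, 0);
    return reply;
}
```

With this sketch, `ask_engine("go depth 8")` should come back as "ack:go depth 8". All the cost has moved into the protocol: everything the other side needs has to be written down the pipe.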
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Results for GCC v4.2

Post by Carey »

I just can't seem to leave this alone...

I downloaded the 'TDM' port of gcc v4.2 for MinGW and installed it into a separate directory.

I then did the 'data in class' versus 'data not in class' test.

My laptop was on battery, so the numbers aren't comparable to my other tests, but it was faster than with GCC 3.4 that MinGW normally offers.

However, the performance penalty was still bad. In this case, nearly 15% performance reduction for data in the class versus global data.

Maybe I missed the magical option that would improve this. My IDE (CodeBlocks) doesn't give a lot of choices for optimization or code tweaking.

Or maybe I screwed up the install of GCC 4.2.0 and it's somehow still running my old gcc. (I don't think so, because this is faster than what I was getting before.)


It still looks like my previous conclusion is right. If you intend to put data in a class or in a struct for multi-threading, you had better have a darn good compiler, else you will be getting a significant performance penalty with 32 bit code.

With this performance penalty for GNU C (which I prefer over MSVC), I'm definitely going to have to come up with an approach where I don't have to access the data via pointers.


I don't have a 64 bit compiler, so even if I installed Vista64 I wouldn't be able to check that aspect. The people here may be right when they say that in 64 bit mode, it's not at all a problem.

But VC 2008 Express doesn't support 64 bit code, and I don't think spending $700 for the professional one is worth it. And I'm not a student so I can't get it for free. So I won't be testing it.


Well.... I hope this has been entertaining, if not informative for everybody here. It certainly was informative for me. Not massively helpful (since I don't like MSVC and that one is the only one without a major performance penalty), but definitely informative.

Carey
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Results for GCC v4.2

Post by Gerd Isenberg »

Carey wrote: [...] However, the performance penalty was still bad. In this case, nearly 15% performance reduction for data in the class versus global data. [...] If you intend to put data in a class or in a struct for multi-threading, you had better have a darn good compiler, else you will be getting a significant performance penalty with 32 bit code.
One additional register for the this-pointer everywhere costs space and time in 32-bit mode, where only a few registers are available. The smaller the program is to begin with, the larger the relative cost. Compiler and optimization issues aside, you'll always see chaotic "non-linearities" when you add code or increase data inside your program. If you have already exceeded some threshold, you may add code and data to some extent without further slowdown (or even with a slight speedup). If you are below that threshold and cross some boundary while adding code or data, the slowdown may be notable, since you suddenly need more pages and cache lines for code and/or data/bss/stack.

Does your global version keep the variables in the same order as the class version, e.g. by using a global struct? Changing the order inside those structs may have enormous effects as well.
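A quick way to check what the member order does to the layout is to let the compiler report it. The sketch below uses two hypothetical structs, `StateA` and `StateB`, with invented member names. It shows only the layout side (size, padding, offsets), not the timing side, and the exact numbers depend on the compiler and target.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical search state, members in "as they came to mind" order.
struct StateA {
    std::uint8_t  side_to_move;
    std::uint64_t hash_key;      // usually forces padding in front of it
    std::uint8_t  castle_rights;
};

// Same members, widest first.
struct StateB {
    std::uint64_t hash_key;
    std::uint8_t  side_to_move;
    std::uint8_t  castle_rights;
};

// Layout facts as seen by the compiler that builds this file.
constexpr std::size_t size_a       = sizeof(StateA);
constexpr std::size_t size_b       = sizeof(StateB);
constexpr std::size_t key_offset_a = offsetof(StateA, hash_key);
constexpr std::size_t key_offset_b = offsetof(StateB, hash_key);
```

On common x86 targets `StateB` comes out smaller than `StateA`, and `hash_key` moves to offset 0. For a fair global-versus-class comparison, both versions should use the same order, so that any difference is not simply a layout effect.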

I recommend keeping the search thread-safe. The better your 64-bit speedup will be ;-)

Cheers,
Gerd
Volker Annuss
Posts: 180
Joined: Mon Sep 03, 2007 9:15 am

Re: Results for GCC v4.2

Post by Volker Annuss »

Carey wrote: I don't have a 64 bit compiler, so even if I installed Vista64 I wouldn't be able to check that aspect. The people here may be right when they say that in 64 bit mode, it's not at all a problem.

But VC 2008 Express doesn't support 64 bit code, and I don't think spending $700 for the professional one is worth it. And I'm not a student so I can't get it for free. So I won't be testing it.
You can get a 64-bit compiler for free by downloading the Windows SDK. It works from the command line, but I did not get it to work inside the VC2005 and VC2008 Express Editions.

Greetings
Volker