Questions about getting ready for multicore programming.

Carey · Post by **Carey** » Tue Apr 01, 2008 6:50 am

I've been writing very simplistic chess programs off & on for close to 20 years now. You'd think that in all that time I would have written a good program, but I never did. I had fun, though and that's all that mattered.

Well, I'm working on another program. This time I want to take advantage of my dual-core cpu.

Although I realize that parallel chess programing is a major topic, my question is more about getting ready to do it.

Just organizing the data & program so I can start to do dual core programming. I haven't stumbled upon the 'perfect' choice.

I'm drowning in choices and little details.

The first choice is to just use threads.

Obviously some data needs to be global. Like the chess clocks, the hash table, the 'master' board position & game history. And general user settings.

And just as obvious, a lot of data needs to be private to the search. Its copy of the board & game history, the move list, etc.

So what are good ways to do this?

I thought perhaps doing some basic C++ classes would be a good idea. (Although chess data structures don't really split well into classes.) That way each data access goes to its own class. But doing a move generation seems to take nearly 10% longer than using globals. (I didn't convert the whole program. Just the move gen to test its performance. I would expect the eval to have similar performance penalites.)

I would assume structs and passing pointers would behave similarly. (That is, a C version of 'classes'.)

Loosing 10% performance just doesn't sound good to me. (This is with a simple mailbox program at the moment. Eventually I'll switch to bitboard, but this was easier to get up & running.)

I then thought about Thread Local Storage. I haven't looked too closely at it, so I don't know whether there would be additional complexities & side effects. But it would seem to give each thread some private storage that it could easily access.

I also through about seperate processes. That would solve some of the problems, but create others as well. Hyatt seems to prefer processes, but that seems to be more for platform portability reasons. Something I don't care about.

I also thought about doing entirely seperate programs. I did this back in my 8-bit micro days on a system with 512k and running a multitasking OS. I split my program into a user interface and a chess engine. Each program ran in their own 64k address space and communicated via pipes & signals.

Any thoughts or advice?

I'm not too concerned about platform portability, only Windows compiler portability. (I tend to use MingW, OpenWatcom & MSVC to check for programing issues.)

I know, I'm probably agonizing over something that doesn't matter much or that is 'obvious', but I could really use some thoughts on this.

sje · Post by **sje** » Tue Apr 01, 2008 7:21 am

Use a firewall design philosophy as much as possible so that later changes (e.g., processes vs threads, dual core vs N core) will have minimal impact.

Compile and test using at least two different tool chains (Windows/Linux) to avoid unnecessary lock-in. If possible, also test the x86 vs PowerPC for endian dependencies; they're better found sooner than later. A suprising number of open source chess programs are sadly endian dependent. Oh, also test x86 vs x86-64 implementations if possible.

The many advantages of C++ over C will likely mitigate any speed differences.

The user interface should be separate from the rest of the program. In fact, you might want to just support the UCI protocol. If you do have a regular console command interface, do not make it also handlo the ancient xboard commands; use a separate xboard interface if you need it.

WinPooh · Post by **WinPooh** » Tue Apr 01, 2008 10:08 am

Steven Edwards wrote:If possible, also test the x86 vs PowerPC for endian dependencies; they're better found sooner than later. A suprising number of open source chess programs are sadly endian dependent.

Could you please give an example of such endian dependency? Which operations are most dangerous from this point of view?

sje · Post by **sje** » Tue Apr 01, 2008 10:13 am

WinPooh wrote:
sje wrote:If possible, also test the x86 vs PowerPC for endian dependencies; they're better found sooner than later. A suprising number of open source chess programs are sadly endian dependent.
Could you please give an example of such endian dependency? Which operations are most dangerous from this point of view?

Externalization of any binary data items longer than a single byte will trigger an endian specific behavior if the data is accessed by a different endian machine. Example: opening book data with sixteen bit counts or scores.

Another case: C/C++ unions of integer arrays with different sized element types.

bob · Post by **bob** » Tue Apr 01, 2008 5:40 pm

Carey wrote:I've been writing very simplistic chess programs off & on for close to 20 years now. You'd think that in all that time I would have written a good program, but I never did. I had fun, though and that's all that mattered.

Well, I'm working on another program. This time I want to take advantage of my dual-core cpu.

Although I realize that parallel chess programing is a major topic, my question is more about getting ready to do it.

Just organizing the data & program so I can start to do dual core programming. I haven't stumbled upon the 'perfect' choice.

I'm drowning in choices and little details.

The first choice is to just use threads.

Obviously some data needs to be global. Like the chess clocks, the hash table, the 'master' board position & game history. And general user settings.

And just as obvious, a lot of data needs to be private to the search. Its copy of the board & game history, the move list, etc.

So what are good ways to do this?

I thought perhaps doing some basic C++ classes would be a good idea. (Although chess data structures don't really split well into classes.) That way each data access goes to its own class. But doing a move generation seems to take nearly 10% longer than using globals. (I didn't convert the whole program. Just the move gen to test its performance. I would expect the eval to have similar performance penalites.)

I would assume structs and passing pointers would behave similarly. (That is, a C version of 'classes'.)

Loosing 10% performance just doesn't sound good to me. (This is with a simple mailbox program at the moment. Eventually I'll switch to bitboard, but this was easier to get up & running.)

I then thought about Thread Local Storage. I haven't looked too closely at it, so I don't know whether there would be additional complexities & side effects. But it would seem to give each thread some private storage that it could easily access.

I also through about seperate processes. That would solve some of the problems, but create others as well. Hyatt seems to prefer processes, but that seems to be more for platform portability reasons. Something I don't care about.

I also thought about doing entirely seperate programs. I did this back in my 8-bit micro days on a system with 512k and running a multitasking OS. I split my program into a user interface and a chess engine. Each program ran in their own 64k address space and communicated via pipes & signals.

Any thoughts or advice?

I'm not too concerned about platform portability, only Windows compiler portability. (I tend to use MingW, OpenWatcom & MSVC to check for programing issues.)

I know, I'm probably agonizing over something that doesn't matter much or that is 'obvious', but I could really use some thoughts on this.

The way you handle memory is almost independent of whether you decide to use threads or processes. In modern unix systems, both will work the same way, where all instruction pages are shared, and data that is initialized prior to creating the new processes or threads, and then never modified (magic data, rotated bitboard data, etc) is also shared and not duplicated.

So what you are left with is the data that you either want to share with other threads/processes, or data you want to keep private from other threads or processes.

The simplest solution is to create a large block of memory that contains your private data, but create N of them. GIve each thread/process a pointer to its private data and they will not interact with each other at all. For data that is shared and modified, you create one large block of memory and give all threads/processes a pointer to that data, and then you take proper care by using atomic locks. It will be likely that some of your private data will also be shared with processes helping you search that sub-tree, so you will need locks to limit that access as well... Other types of "private" data will work themselves out simply, but how depends on whether you use threads or processes.

bob · Post by **bob** » Tue Apr 01, 2008 5:43 pm

WinPooh wrote:
Steven Edwards wrote:If possible, also test the x86 vs PowerPC for endian dependencies; they're better found sooner than later. A suprising number of open source chess programs are sadly endian dependent.
Could you please give an example of such endian dependency? Which operations are most dangerous from this point of view?

Almost anything. For example, unless you take great pains to prevent it, the books will be incompatible between different architectures because the bytes in the hash signature will be reversed on one of them. Any program that uses a union might have a problem. How you do things like MSB()/LSB() can go wrong...

mathmoi · Post by **mathmoi** » Tue Apr 01, 2008 5:45 pm

sje wrote:The many advantages of C++ over C will likely mitigate any speed differences.

If you don't use polymorphism and you code carefully, I don't see why C++ code would run slower than C code. Can you explain why you think so?

Carey · Post by **Carey** » Tue Apr 01, 2008 6:15 pm

sje wrote:Use a firewall design philosophy as much as possible so that later changes (e.g., processes vs threads, dual core vs N core) will have minimal impact.

That was one of the reasons I mentioned my old 8 bit micro program that did that. The user interface was entirely seperate from the actual engine.

You can't get much more firewalled than that...

Of course, back then I had to write my own interface, rather than being able to just use WinBoard, but the principle still holds.

But I was thinking that might seem a bit extreme.

Not to mention that I'm unsure how easy it would be for multiple engine 'threads' to communicate with each other. My old method of pipes & signals and such wouldn't be suitable for Windows. Also the hash table would be a small problem since it couldn't be shared among the seperate engines.

That's the big issue I'm having. How to seperate the engine from the main program, but yet with OUT a performance penalty, but yet still be able to go multi-core.

I really don't like the idea of having a 10% or more performance penalty just due to the structural changes that would allow me to run on a multicore.

Compile and test using at least two different tool chains (Windows/Linux) to avoid unnecessary lock-in.

Not easy for me because I don't know Linux.

I've tried Linux a few times over the past few years, but I never really liked it. Not to mention the not so minor fact that as far as I can tell, no major linux distro supports my laptop hardware, so there's no network connection available. Without that, there's no great desire to waste my laptop's limited hard drive space on something I'd never use.

If possible, also test the x86 vs PowerPC for endian dependencies; they're better found sooner than later. A suprising number of open source chess programs are sadly endian dependent.

Considering the PowerPC is essentially dead, I don't see much point in that.

But I did think about that a few years back when I was working on my bitboard program. At the time the only option was to try it under an emulator, but I had trouble getting Darwin to run on it, so I gave up.

(I'm an ex-Motorola fan, so the 68k's BigEndian nature always made more sense to me than Intel's pea-brained approach.)

Oh, also test x86 vs x86-64 implementations if possible.

That I am planning to do. I have Vista Ultimate retail (christmas gift to go with the new laptop. It's the limited edition, numbered signature edition.)

I've been debating for the past couple months whether to switch the laptop's OS over to 64 bit or not. One of the biggest issues has been the availability of drivers. Acer is only providing older 64 bit drivers for the laptop. The drivers they do offer work, but there's a good chance they'll never provide any updates. And nVidia (the chipset maker) refuses to support laptops with public drivers.

The many advantages of C++ over C will likely mitigate any speed differences.

I'd like to hear more about that part.

I could never seperate the chess engine into isolated enough parts that C++'s clases made sense. Every part needs to access so much data.

I ended up creating arbitrary member functions to access the data. They did nothing but 'look pretty' and followed C++ guidelines about data hiding.

Take the MakeMove() for example. It has to access the board, the hash signature, the move itself and the move history. It essentially touches all parts of the data except the hash table.

Other places are almost as bad.

A few C++ chess programs I looked at were more C than C++. They did have the data in classes, but the data was public so all parts of the chess program could access them without wasting cpu cycles on useless member functions.

And even if I made all the search engine data public (or all into a single class), just the class nature itself gave my MoveGen a 10% performance penalty. Just due to the greater access cost of a class versus having the board as a global data array.

I can live with 10% penalty if I have to, but I don't like it.

So, I would like to hear some more details about how C++ might actually be profitably used. I am a very poor OOP programmer. I'm definetly more of a traditional structured programmer than an OOP programmer.

The user interface should be separate from the rest of the program. In fact, you might want to just support the UCI protocol. If you do have a

I've already got nearly all of the WinBoard stuff done on my previous program that I could use.

I'm not so sure that I like the UCI protocol. I'm not saying its bad, but I'm not sure I like it.

regular console command interface, do not make it also handlo the ancient xboard commands; use a separate xboard interface if you need it.

I'm not sure I agree with that part, because most console commands map 1:1 to what winboard does.

Some syntactic differences and a few exceptions or extensions, of course, but as a whole, most of what winboard does is the same kind of stuff I've been putting in my chess programs user interface for 20 years.

About the only major differences are that I never before bothered to process any commands during the search. But that's easily handled.

Carey · Post by **Carey** » Tue Apr 01, 2008 6:31 pm

Big quote snipped. I know you like to quote everything, but to me it gets in the way of reading what is currently being said.

bob wrote: The way you handle memory is almost independent of whether you decide to use threads or processes. In modern unix systems, both will work the same way, where all instruction pages are shared, and data that is initialized prior to creating the new processes or threads, and then never modified (magic data, rotated bitboard data, etc) is also shared and not duplicated.

I think I understand that part. I've read yours (and others) posts in here about that.

So what you are left with is the data that you either want to share with other threads/processes, or data you want to keep private from other threads or processes.

The simplest solution is to create a large block of memory that contains your private data, but create N of them. GIve each thread/process a pointer to its private data and they will not interact with each other at all. For data that is shared and modified, you create one large block of memory and give all threads/processes a pointer to that data, and then you take proper care by using atomic locks. It will be likely that some of your private data will also be shared with processes helping you search that sub-tree, so you will need locks to limit that access as well... Other types of "private" data will work themselves out simply, but how depends on whether you use threads or processes.

Right. I'm pretty sure I understand that.

That's just basically creating a big struct to hold the private data and pass that pointer to every routine. (Or some similar way.) Just like creating a C++ class for each thread, except the C++ nature saves you some notational effort about passing the data pointer, etc.

Or with processes you do the global stuff first, fork, then use special memory calls to look into the parent's memory space.

I think I understand the basics. I haven't done it, but I think I understand the idea.

The problems I have with that is that it seems 'clunky' and it hurts performance.

It just seems like a kludge doing it that way. It just doesn't feel 'right' manually handling all the data like that.

I know, going multicore will be a win over all, but it still seems fundamentally wrong to choose a method that will have 10% or so performance penalty when you aren't multi-core.

That's part of my problem. It's not so much that I can't just pick a method and make it work. I believe I could do threads or processes or even seperate external programs and 'make it work'.

It's that none of these seem to feel right. They all feel like they are a 'work around'. Like a kludge. Like I'm forcing a solution onto a problem and the fit isn't quite right.

And when something doesn't feel right, then there is something wrong somewhere. Either with the program or the tools or your own mental process.

(I guess I should have written my original message a little clearer about that part. But my only excuse is that it was late at night and I was about to go to bed.)

bob · Post by **bob** » Tue Apr 01, 2008 8:11 pm

Carey wrote:Big quote snipped. I know you like to quote everything, but to me it gets in the way of reading what is currently being said.

bob wrote: The way you handle memory is almost independent of whether you decide to use threads or processes. In modern unix systems, both will work the same way, where all instruction pages are shared, and data that is initialized prior to creating the new processes or threads, and then never modified (magic data, rotated bitboard data, etc) is also shared and not duplicated.
I think I understand that part. I've read yours (and others) posts in here about that.

So what you are left with is the data that you either want to share with other threads/processes, or data you want to keep private from other threads or processes.

The simplest solution is to create a large block of memory that contains your private data, but create N of them. GIve each thread/process a pointer to its private data and they will not interact with each other at all. For data that is shared and modified, you create one large block of memory and give all threads/processes a pointer to that data, and then you take proper care by using atomic locks. It will be likely that some of your private data will also be shared with processes helping you search that sub-tree, so you will need locks to limit that access as well... Other types of "private" data will work themselves out simply, but how depends on whether you use threads or processes.

Right. I'm pretty sure I understand that.

That's just basically creating a big struct to hold the private data and pass that pointer to every routine. (Or some similar way.) Just like creating a C++ class for each thread, except the C++ nature saves you some notational effort about passing the data pointer, etc.

Or with processes you do the global stuff first, fork, then use special memory calls to look into the parent's memory space.

I think I understand the basics. I haven't done it, but I think I understand the idea.

The problems I have with that is that it seems 'clunky' and it hurts performance.

It just seems like a kludge doing it that way. It just doesn't feel 'right' manually handling all the data like that.

I know, going multicore will be a win over all, but it still seems fundamentally wrong to choose a method that will have 10% or so performance penalty when you aren't multi-core.

That's part of my problem. It's not so much that I can't just pick a method and make it work. I believe I could do threads or processes or even seperate external programs and 'make it work'.

It's that none of these seem to feel right. They all feel like they are a 'work around'. Like a kludge. Like I'm forcing a solution onto a problem and the fit isn't quite right.

And when something doesn't feel right, then there is something wrong somewhere. Either with the program or the tools or your own mental process.

(I guess I should have written my original message a little clearer about that part. But my only excuse is that it was late at night and I was about to go to bed.)

You are not going to see a 10% penalty. Probably 1% at worst. I was originally afraid that passing the TREE pointer around in Crafty would be a 10% penalty, but when I finished, there was no significant speed difference at all. So you don't have to worry about speed, and actually should not be at this point anyway. You simply want to find the simplest way to share what is needed because this kind of programming is non-trivial and anything you can do to simplify the design will reduce the number of bugs later...

Here the approach that I use, which works with either threads or processes by the way, seems effective enough, and bearable in terms of effort required for results obtained... You are going to have independent tasks running since you will be searching different parts of the tree in parallel anyway, and that will begin to feel "natural" after a while...

Questions about getting ready for multicore programming.

Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.