Stockfish port to C# Complete

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Stockfish port to C# Complete

Post by mcostalba »

diep wrote:
mcostalba wrote:
diep wrote: Every comparison here is with GCC - a junk compiler I use daily, so I know everything about the junk it produces.
GCC has improved a bit since the 90s. Today this produces the fastest binary on Windows:

http://www.equation.com/servlet/equation.cmd?fa=fortran
Why do you claim this nonsense?
When did you stop testing and start going "by memory" - back in the '90s?

For your information, this is the compiler Jim used to produce the fastest x86-64 SSE4.2 Windows binary released for SF 2.2.2 (a few months ago). Until last year he was using the Intel compiler, but he found this one faster.
whittenizer
Posts: 85
Joined: Sun May 29, 2011 11:56 pm
Location: San Diego

Re: Stockfish port to C# Complete

Post by whittenizer »

Hi,

In my experience with my C# port, the best optimizations were using structs instead of classes where possible and inlining as much code as possible. Sure, I could get much better results if this weren't managed code. What's killing me is the workaround we did for all the pointer arithmetic; it's too much overhead.
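For example, roughly this kind of thing - just a sketch to show the idea, not code from the actual port (the names are made up, and the inlining attribute needs .NET 4.5+ rather than Silverlight, where it mostly meant inlining by hand):

Code: Select all

using System;
using System.Runtime.CompilerServices;

// Sketch: a value-type move plus an aggressively inlined helper - the kind of
// thing that cuts allocations and call overhead in managed code.
public struct Move
{
    public int From;
    public int To;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Move Create(int from, int to)
    {
        Move m;
        m.From = from;
        m.To = to;
        return m;
    }
}

public static class InlineDemo
{
    // Hot little helpers like this are the ones worth inlining.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int FileOf(int square) { return square & 7; }

    public static void Main()
    {
        Move m = Move.Create(12, 28);   // e2-e4 with squares numbered 0..63
        Console.WriteLine("{0} -> {1}, destination file {2}", m.From, m.To, FileOf(m.To));
    }
}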

This was a great learning process. I've moved back over to C++ and will focus my efforts on optimizing SF's search and eval functions so we can achieve greater depths in less time for any given position.

Enjoy :-)
kinderchocolate
Posts: 454
Joined: Mon Nov 01, 2010 6:55 am
Full name: Ted Wong

Re: Stockfish port to C# Complete

Post by kinderchocolate »

What do you mean by greater depth in a given amount of time? There's no point in searching to 100 plies if the algorithm can only calculate 100 nodes per second. Do you mean increasing the number of nodes per second?

If you want to go deeper, why not just modify the maximum search depth constant?
bpfliegel
Posts: 71
Joined: Fri Mar 16, 2012 10:16 am

Re: Stockfish port to C# Complete

Post by bpfliegel »

whittenizer wrote: In my experience with my C# port, the best optimizations were using structs instead of classes where possible and inlining as much code as possible. Sure, I could get much better results if this weren't managed code.
It was very tempting to go unsafe sometimes, no? :)
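Something like this is what I mean - just a sketch (it needs the /unsafe compiler switch), not code from your port:

Code: Select all

using System;

public static class UnsafeSketch
{
    // Walk a 64-entry board with a raw pointer instead of index-checked array access.
    public static unsafe int CountPieces(int[] board)   // 0 = empty square
    {
        int count = 0;
        fixed (int* p = board)                          // pin the array so the GC can't move it
        {
            for (int* sq = p; sq < p + board.Length; sq++)
                if (*sq != 0)
                    count++;
        }
        return count;
    }

    public static void Main()
    {
        int[] board = new int[64];
        board[4] = 6;     // made-up piece encoding, just for the demo
        board[60] = -6;
        Console.WriteLine(CountPieces(board));          // prints 2
    }
}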
Good luck with search/eval optimizations, that's also a nice area!
Balint
whittenizer
Posts: 85
Joined: Sun May 29, 2011 11:56 pm
Location: San Diego

Re: Stockfish port to C# Complete

Post by whittenizer »

Maybe I wasn't clear about what I mean. If I put in a given position and search with a move time of 10000, Houdini can reach a greater depth than SF with the same inputs. So yes, that probably comes down to nps. I'm still in research mode. If you can shed some light on what I might want to look into, I'd surely be thankful.

Thanks much
whittenizer
Posts: 85
Joined: Sun May 29, 2011 11:56 pm
Location: San Diego

Re: Stockfish port to C# Complete

Post by whittenizer »

This project was Silverlight-based, so it had to be managed code.

I have my work cut out but this is the way to go for top performance.

Thanks
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Stockfish port to C# Complete

Post by diep »

mcostalba wrote:
diep wrote:
mcostalba wrote:
diep wrote: Every comparison here is with GCC - a junk compiler I use daily, so I know everything about the junk it produces.
GCC has improved a bit since the 90s. Today this produces the fastest binary on Windows:

http://www.equation.com/servlet/equation.cmd?fa=fortran
Why do you claim this nonsense?
When did you stop testing and start going "by memory" - back in the '90s?

For your information, this is the compiler Jim used to produce the fastest x86-64 SSE4.2 Windows binary released for SF 2.2.2 (a few months ago). Until last year he was using the Intel compiler, but he found this one faster.
I tested the latest GCC snapshot a few weeks ago, and it's still light-years behind Intel C++ and even Visual Studio.

Mainly because it hardly gets any speedup from PGO.

To avoid a bug in GCC's PGO, I do the profile run single-threaded with Diep. Even then it gives just a 3% speedup.
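(For the record, the two-step build I mean is GCC's standard PGO sequence; the file names below are just an example:)

Code: Select all

 # build an instrumented binary, run it single-threaded on a representative workload,
 # then rebuild using the collected profile
 g++ -O3 -fprofile-generate -o diep_instrumented *.cpp
 ./diep_instrumented
 g++ -O3 -fprofile-use -o diep *.cpp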

I've posted extensive examples all over the net of how GCC messes things up, starting in 2007.

The latest snapshot still doesn't have that fixed.

So it already STARTS with a disadvantage of 25% or so compared to other compilers. Such bad PGO performance is of course a joke.

Note that around 2004-2005 some snapshots did do pretty OK at PGO; then suddenly, BOOM, it no longer worked at all (for Diep, that is).

Default PGO gives 0.5% in GCC. Bug after bug, and seven years later it still hasn't been fixed.

One of the big screw-ups in GCC, one which hits a lot of software hard, is the rewrite to the end of the function: instead of generating a simple CMOV, it grabs your code, moves it to the end of the function, sometimes jumps there, and then, after executing two instructions, jumps back to where it was.

That *hurts*.

To quote Linus: "there is no excuse to not generate CMOV's"

A Polish guy then replied to Linus back in 2007: "but then it is slower at my P4".

Only around the end of 2011 did they start moving. We're some months later now, but a snapshot from a few weeks ago was still TURTLE slow, still with the same bugs and bottlenecks.

Of course I am compiling for 64 bits, even though Diep's code would be faster in 32 bits; I just want efficient code that doesn't mess up the branch prediction.

I want normal PGO, just like other compilers have!

They aren't capable of producing that, and they have been overruling Linus, refusing to generate effectively shorter code, for *many* years.

Now that they have some competition from other compilers that are 'on the production line' to overtake GCC, it wouldn't amaze me if they 'magically' improve a lot all of a sudden. They need a kick in the butt, man.

The GCC team showed the middle finger to dozens of very important and influential guys such as Linus for many years.

I'm amazed they know how to produce SSE 4.2 code for SF, as they still haven't figured out how to produce efficient code for branches. Intel's entire fall-through model simply hasn't been implemented in GCC.

When did Intel introduce this?

Oh, 1994 or so?

The difference between GCC 4.0 and the latest snapshots I tried is just a few percent for Diep, while Visual Studio and Intel C++ got dozens of percent faster on the modern hardware I have here: Opteron (Barcelona core) and Core 2 Xeons.

Vincent

"x86-64 and IA-64 will prove to be the ultimate disaster for GCC"
Marc Lehmann, in a private email to me
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Stockfish port to C# Complete

Post by diep »

RoadWarrior wrote:
diep wrote:
RoadWarrior wrote: Interesting - Peter Österlund has just released a C++ implementation of his strong (Elo 2682) Java engine CuckooChess. He says that it's about twice as fast as the Java version: http://talkchess.com/forum/viewtopic.php?t=42999
Yeah, effectively it's about a factor of 4.
Vincent, just look at Peter's numbers for a clear and relevant comparison. Whatever you might say, all you have are words, while what he has are working programs and an operational comparison.

Give me a couple of years to whip Amaia into shape, and then we can have a match between Diep (Spyker C12 Zagato) and Amaia (Model T Ford). That would be very interesting. :D
Yeah, unlike the Dutch Spyker default, I run Diep here on 64 CPU cores, and I do have Mellanox and Nvidia as my sponsors, so blitz engines like Houdini 2.0 are not really a match for that either :)

In the end the 'rating lists' basically just play games at 10 minutes per 40 moves on 4 cores, or something like that. I bet some even use just 5 minutes for 40 moves. That basically 'determines' the Elo rating for the 'rating lists'. It's peanuts compared to what gets tested 'at home' for the engines.

But in all seriousness: we have seen zero tests so far that objectively measured *anything* for C#.

A few 'quickies', with a lousy default GCC compile - not the SSE4.2 one Marco is talking about. From the SPEC version of Sjeng we know that Intel C++, by applying SSE2/3/4.2 in a clever manner, suddenly won 30% in speed on the i7 (moving from Intel C++ 10.x to 11.x).

So that was just generic compiler optimization - the kind that GCC never gave Diep, by the way.

We haven't seen objective benchmarks and we haven't been able to confirm it. Basically we just hear about someone who has worked in C# for years and then does a few unconfirmed tests in a quick and dirty manner, on a single core. We don't know when it turbo-boosted, nor for which engine, or whether he ran it on an i7 at all, which I doubt. We basically know very little, except that there wasn't much of a test at all.

So: work for 2 years, of which 9+ months were spent optimizing the code further for speed, then a few quickie tests, and the claim that it's "just 2x slower" than C/C++.

That is, with all respect, the usual way people try to HIDE a performance problem.

In the meantime we see one posting here from someone who says: "if I test on 1 core it seems not a problem, if I test 4 cores versus 1 core it's 3x slower". That is with some SELF-WRITTEN C++ code. And we know that by default most C++ coders over here are a factor of 2 slower.

Also look at Tord. For years Glaurung was relatively slow. Then we see a bunch of programmers busy in his code and suddenly it's a lot, lot faster.

There are dozens of C++ programming stories like that, where people thought they were writing C++ code very well. Then a few other guys come along to help them out, and they magically speed up by another factor of 2+ in C++. I can confirm this, as I have personally helped a few of them speed up their C++ code. Now I don't pretend to be the world's best C++ programmer - I'm very far from that - but in most cases a factor of 2 was easy to find.

If we multiply that factor of 2 into the 3.0, we already get to a 6.0 speed difference: C/C++ being a factor of 6 faster than C#.

Vincent
RoadWarrior
Posts: 73
Joined: Fri Jan 13, 2012 12:39 am
Location: London, England

Re: Stockfish port to C# Complete

Post by RoadWarrior »

diep wrote: Yeah, unlike the Dutch Spyker default, I run Diep here on 64 CPU cores, and I do have Mellanox and Nvidia as my sponsors, so blitz engines like Houdini 2.0 are not really a match for that either :)

In the end the 'rating lists' basically just play games at 10 minutes per 40 moves on 4 cores, or something like that. I bet some even use just 5 minutes for 40 moves. That basically 'determines' the Elo rating for the 'rating lists'. It's peanuts compared to what gets tested 'at home' for the engines.
Is there anybody in the computer chess community whom you haven't disparaged? :)

When it's you against the world, most people will bet on the world.
There are two types of people in the world: Avoid them both.
Jim Ablett
Posts: 1383
Joined: Fri Jul 14, 2006 7:56 am
Location: London, England
Full name: Jim Ablett

Re: Stockfish port to C# Complete

Post by Jim Ablett »

diep wrote: I tested the latest GCC snapshot a few weeks ago, and it's still light-years behind Intel C++ and even Visual Studio.

Mainly because it hardly gets any speedup from PGO.

[...]

Try the latest link-time optimizations:

Code: Select all

 -Ofast -flto -fwhole-program -fprofile-generate / -fprofile-use
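(That is the usual GCC PGO workflow: build and run once with -fprofile-generate to collect a profile, then rebuild with -fprofile-use; -flto together with -fwhole-program enables whole-program optimization at link time.)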
Jim.