gcc again

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

gcc again

Post by bob »

I have run into yet another interesting gcc issue. Here's an explanation:

I just made a couple of small changes to the forward pruning and reduction code in crafty. In total, maybe 50 lines of code. We've had some A/C problems so I have not been able to cluster test as of yet, so I thought I would play a few test games between my laptop and a 12-core box I have at the office.

I did the usual PGO compile, and when I started the match, the xboard engine output window immediately pointed to a problem. Crafty normally searches 40-45M nps on this box, but the actual number was in the very low 20's. What happened? I puzzled over this, and decided to comment out 1/2 of my changes. Rather than do a full PGO just to test removing part of the change, I did a normal compile. Back to 44M on the test position.

I looked at the change for quite a while but could find NOTHING to change the SMP performance (I forgot to mention, PGO single-thread had normal speed, it was the 12-thread version that did not). I put the change back in, with a couple of prints to see what was happening. 42M nps. Hmm. I then re-compiled with PGO, back to 21M.

So, in summary, the new search, which differs by a total of 50 lines including comments, is 1/2 normal speed with PGO and normal speed without PGO. To be complete, I went back to the original search: 42M with or without PGO (PGO is always just a bit faster, but only by a couple of M nps with 12 threads, so I am ignoring that).

In summary, original with or without PGO, same rough speed. New version 1/2 speed with PGO, normal speed without.

About the craziest thing I have seen since I tried the newer profile optimization that tries to reorganize local data order to improve pre-fetching, but which hurt my program badly because it was already ordered for locality.

I am going to play with this by slowly making the changes and then compiling, but it takes about 5-6 minutes to do a PGO compile, which is a pain. But I can't see the loss in speed without doing that.

Note that the hardware I am using is an Intel 2 x 6-core box that I have been using for almost 4 years, and the gcc compiler is version 4.7.2 (not easy to upgrade, since this is a cluster used by many and we don't want to make changes that might break something, although looking at these results, something might already be broken).

Anyone seen anything similar?

BTW the old version and new version are identical everywhere except for the search code, specifically the pruning and reduction blocks (two of each, one in Search() and one in SearchParallel()). Everything else is absolutely identical.

Yes, I know, the simple solution is to not use PGO. But I really want to understand WHY. I have PGO'ed each individual file also, rather than compiling one large file where inlining goes crazy, to keep the PGO stuff localized to each individual source file. I suppose the next step is going to be to PGO and non-PGO compile search with -S and look at the asm to see what is going on...
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: gcc again

Post by michiguel »

bob wrote:I have run into yet another interesting gcc issue. Here's an explanation:

I just made a couple of small changes to the forward pruning and reduction code in crafty. In total, maybe 50 lines of code. We've had some A/C problems so I have not been able to cluster test as of yet, so I thought I would play a few test games between my laptop and a 12-core box I have at the office.

I did the usual PGO compile, and when I started the match, the xboard engine output window immediately pointed to a problem. Crafty normally searches 40-45M nps on this box, but the actual number was in the very low 20's. What happened? I puzzled over this, and decided to comment out 1/2 of my changes. Rather than do a full PGO just to test removing part of the change, I did a normal compile. Back to 44M on the test position.

I looked at the change for quite a while but could find NOTHING to change the SMP performance (I forgot to mention, PGO single-thread had normal speed, it was the 12-thread version that did not). I put the change back in, with a couple of prints to see what was happening. 42M nps. Hmm. I then re-compiled with PGO, back to 21M.

So, in summary, the new search, which differs by a total of 50 lines including comments, is 1/2 normal speed with PGO and normal speed without PGO. To be complete, I went back to the original search: 42M with or without PGO (PGO is always just a bit faster, but only by a couple of M nps with 12 threads, so I am ignoring that).

In summary, original with or without PGO, same rough speed. New version 1/2 speed with PGO, normal speed without.

About the craziest thing I have seen since I tried the newer profile optimization that tries to reorganize local data order to improve pre-fetching, but which hurt my program badly because it was already ordered for locality.

I am going to play with this by slowly making the changes and then compiling, but it takes about 5-6 minutes to do a PGO compile, which is a pain. But I can't see the loss in speed without doing that.

Note that the hardware I am using is an Intel 2 x 6-core box that I have been using for almost 4 years, and the gcc compiler is version 4.7.2 (not easy to upgrade, since this is a cluster used by many and we don't want to make changes that might break something, although looking at these results, something might already be broken).

Anyone seen anything similar?

BTW the old version and new version are identical everywhere except for the search code, specifically the pruning and reduction blocks (two of each, one in Search() and one in SearchParallel()). Everything else is absolutely identical.

Yes, I know, the simple solution is to not use PGO. But I really want to understand WHY. I have PGO'ed each individual file also, rather than compiling one large file where inlining goes crazy, to keep the PGO stuff localized to each individual source file. I suppose the next step is going to be to PGO and non-PGO compile search with -S and look at the asm to see what is going on...
Weird.

You had problems before with profile-generate and replaced it with profile-arcs. Which one do you use here?

Are you running the profile test with several cores? What if you use only one core?

Miguel
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: gcc again

Post by bob »

michiguel wrote:
bob wrote:I have run into yet another interesting gcc issue. Here's an explanation:

I just made a couple of small changes to the forward pruning and reduction code in crafty. In total, maybe 50 lines of code. We've had some A/C problems so I have not been able to cluster test as of yet, so I thought I would play a few test games between my laptop and a 12-core box I have at the office.

I did the usual PGO compile, and when I started the match, the xboard engine output window immediately pointed to a problem. Crafty normally searches 40-45M nps on this box, but the actual number was in the very low 20's. What happened? I puzzled over this, and decided to comment out 1/2 of my changes. Rather than do a full PGO just to test removing part of the change, I did a normal compile. Back to 44M on the test position.

I looked at the change for quite a while but could find NOTHING to change the SMP performance (I forgot to mention, PGO single-thread had normal speed, it was the 12-thread version that did not). I put the change back in, with a couple of prints to see what was happening. 42M nps. Hmm. I then re-compiled with PGO, back to 21M.

So, in summary, the new search, which differs by a total of 50 lines including comments, is 1/2 normal speed with PGO and normal speed without PGO. To be complete, I went back to the original search: 42M with or without PGO (PGO is always just a bit faster, but only by a couple of M nps with 12 threads, so I am ignoring that).

In summary, original with or without PGO, same rough speed. New version 1/2 speed with PGO, normal speed without.

About the craziest thing I have seen since I tried the newer profile optimization that tries to reorganize local data order to improve pre-fetching, but which hurt my program badly because it was already ordered for locality.

I am going to play with this by slowly making the changes and then compiling, but it takes about 5-6 minutes to do a PGO compile, which is a pain. But I can't see the loss in speed without doing that.

Note that the hardware I am using is an Intel 2 x 6-core box that I have been using for almost 4 years, and the gcc compiler is version 4.7.2 (not easy to upgrade, since this is a cluster used by many and we don't want to make changes that might break something, although looking at these results, something might already be broken).

Anyone seen anything similar?

BTW the old version and new version are identical everywhere except for the search code, specifically the pruning and reduction blocks (two of each, one in Search() and one in SearchParallel()). Everything else is absolutely identical.

Yes, I know, the simple solution is to not use PGO. But I really want to understand WHY. I have PGO'ed each individual file also, rather than compiling one large file where inlining goes crazy, to keep the PGO stuff localized to each individual source file. I suppose the next step is going to be to PGO and non-PGO compile search with -S and look at the asm to see what is going on...
Weird.

You had problems before with profile-generate and replaced it with profile-arcs. Which one do you use here?

Are you running the profile test with several cores? What if you use only one core?

Miguel
I am using -fprofile-arcs. The generate option doesn't just shuffle code to make prefetching match expected branch predictions, it also shuffles data values around in memory, and that hurt. But this problem is of the same order of magnitude, which MIGHT mean the underlying cause is the same.

All I have done is make two changes.

(1) I was reducing by 1 or 2 plies. I now go beyond that with a simple linear computation; so far, reducing by up to 5 plies seems to be best, but I am still fiddling around.

(2) I do some late forward pruning based on something I started working on before the last CCT event (I didn't use it there, as it was a "work in progress"). The idea is to prune obviously unsafe moves at later plies, moves that, for example, put a queen on a square attacked by an enemy pawn. There are some constraints beyond that, but the change is fairly simple. A rough sketch of both ideas follows below.
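
In rough, purely illustrative terms, the two changes look something like the sketch below. This is NOT the actual Crafty code; the helper PawnAttacks(), the QUEEN constant, and the specific ramp/cap numbers are placeholder assumptions used only to show the shape of the idea.

/* Illustrative only -- placeholder names and constants, not Crafty's code. */
#define QUEEN 5                            /* placeholder piece code         */
extern int PawnAttacks(int side, int sq);  /* hypothetical helper: nonzero
                                              if a pawn of "side" attacks
                                              square "sq"                    */

/* (1) linear reduction instead of a fixed 1-2 plies, capped at 5 */
static int Reduction(int depth, int moves_searched) {
  int r = 1 + moves_searched / 4;          /* some simple linear ramp        */
  if (r > 5)
    r = 5;                                 /* reducing up to 5 seems best    */
  if (r > depth - 1)
    r = depth - 1;                         /* never reduce below 1 ply left  */
  return r;
}

/* (2) late forward pruning of obviously unsafe moves, e.g. a queen
   moved onto a square attacked by an enemy pawn */
static int ObviouslyUnsafe(int piece, int to, int enemy) {
  return piece == QUEEN && PawnAttacks(enemy, to);
}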

When I first saw the slow-down, I thought OK, I am reducing more aggressively, and if I split 5 plies from the tip and reduce to q-search, that's not very efficient, because the split is not free and there is no real work to do in parallel. I started by increasing the split depth minimum and the split node count minimum, but saw very little change. I am convinced I am not changing the shape of the tree enough to break the parallel search splitting logic, and that was confirmed when the non-PGO run turned out to be just as fast as expected.

About all I can think of to try is to do what I did before: compile everything separately, PGO everything but search and produce the search.s file, then PGO search as well, again produce search.s, and compare the two. There has to be some major difference. It might take some looking, since the compiler has gotten very creative with the xmm stuff and who knows what else, but something is wrong. It could somehow be an odd bug in my code, to be sure, but I'd certainly be surprised. I might try gcc 4.8 on another box, linked statically so that it will run on the 12-core box, and see if it behaves any differently. And I have not tested this yet on my Mac, but that is on the queue for later tonight when I get home, to see if there is anything odd going on with just 2 physical / 4 logical cores and PGO.

BTW, in my profiling, most of the profile runs use 1 core. I have one multiple-core test just to get SearchParallel(), Thread() and such profiled. I also discovered the -fprofile-correction option, which seems to solve issues with corrupted profile data. It works perfectly, every time, for the version with the old search.c source. With the new version, it fails every time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: gcc again

Post by bob »

Here's another quirk:

SAME source code compiled twice, no PGO in ANY of these runs. The first run uses Crafty with each source file compiled separately; the second run uses Crafty compiled from "crafty.c", which puts every source file into one large wad and compiles that:

log.001 is compiled as separate files, 12 threads; log.002 is compiled as one large file, 12 threads; no PGO anywhere, all .gcda files removed, etc. Log.003 is separate files, 1 thread; log.004 is one big wad, 1 thread. Interestingly, the big wad is slightly slower there. But what about the first two? A factor of 3x on raw NPS?

log.001: time=21.36(90%) n=228673421 fh1=0.95% 50move=0 nps=11.2M
log.002: time=7.96(82%) n=286158701 fh1=0.94% 50move=0 nps=36.4M
log.003: time=41.43(100%) n=227615690 fh1=0.95% 50move=0 nps=6.0M
log.004: time=42.30(100%) n=227615690 fh1=0.95% 50move=0 nps=5.9M

Strangest thing I have seen. This is going to be my obsession for the next few days; I am going to figure out what is going on. I notice that with 4-6 threads the parallel performance is not as bad: the first two runs below are the same as before but with 4 threads, the next two with 6 threads. In the first pair the big compile is 2x faster, in the second 2.5x faster. Completely nuts.

log.001: time=26.29(98%) n=263679628 fh1=0.95% 50move=0 nps=10.5M
log.002: time=13.72(97%) n=254793247 fh1=0.95% 50move=0 nps=19.1M
log.003: time=19.49(97%) n=213935909 fh1=0.95% 50move=0 nps=11.5M
log.004: time=8.09(94%) n=204361836 fh1=0.95% 50move=0 nps=25.7M
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: gcc again - solved

Post by bob »

This turned out to be a linux glitch.

We lost A/C for about a week in our lab, so all clusters were powered down. When they were powered back up, the cluster I was using (2 x 6 cores) came back up with hyper-threading enabled, where it had always been disabled. It turns out the Linux kernel we run there is broken and does NOT handle hyper-threading correctly. I run 12 threads, and sometimes (rarely) they mostly land on separate physical cores, but they CAN pile up on 6 physical cores (12 logical cores that share those 6 physical cores).

When I discovered this, I requested that the machine be restarted with someone going into the BIOS and fixing this permanently, rather than using a software disable. While waiting, I chose to add some processor affinity code to Crafty to prevent this nonsense from happening again (a rough sketch of the idea is below). And I am now seeing normal results no matter what I do.
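
For reference, a minimal sketch of the kind of processor-affinity code meant here, assuming Linux and pthreads (this is not the actual Crafty implementation):

/* Minimal sketch: pin the calling thread to one core (Linux/pthreads).
   Not Crafty's actual code -- just the general technique. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

void PinThreadToCore(int core) {
  cpu_set_t set;

  CPU_ZERO(&set);
  CPU_SET(core, &set);                           /* allow only this core */
  pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

/* Each search thread would call something like PinThreadToCore(thread_id),
   so 12 threads land on 12 distinct cores instead of piling up on 6
   physical cores when hyper-threading sneaks back on. */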

I recall a scheduler bug introduced when Intel went to 6 cores, as the method for detecting which cores were logical and which were physical was broken since 6 is not a power of 2. But I was pretty sure that the kernel we run (3.2.0) on this cluster handled hyper-threading correctly. Wrong assumption.

I tend to paraphrase Linus Torvalds' famous quote and turn it into "Hyperthreading is not the answer, it is the question. NO is the answer."

Back to real testing once again after several days of wasted time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: gcc again - solved

Post by bob »

<sigh>, NOT solved. The hyper-threading was causing a set of symptoms, but not THE symptoms I described. UGH. Still looking...
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: gcc again

Post by mcostalba »

bob wrote: I am going to play with this by slowly making the changes and then compiling, but it takes about 5-6 minutes to do a PGO compile, which is a pain. But I can't see the loss in speed without doing that.
You were going along the right track... then you lost focus.

Don't mess with other people / hyperthreading / your own thoughts; simply follow, strictly, what you have written above. It will take a bit of time, but you will get to the reason for the strange behavior.

Any other approach will take you nowhere.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: gcc again

Post by bob »

mcostalba wrote:
bob wrote: I am going to play with this by slowly making the changes and then compiling, but it takes about 5-6 minutes to do a PGO compile, which is a pain. But I can't see the loss in speed without doing that.
You were going along the right track... then you lost focus.

Don't mess with other people / hyperthreading / your own thoughts; simply follow, strictly, what you have written above. It will take a bit of time, but you will get to the reason for the strange behavior.

Any other approach will take you nowhere.
I have tested all week. All I can say is that PGO is causing problems, sporadically. It used to crash all the time as recently as maybe 2 years ago, but then it started working.

I compile the same code with the Intel compiler, and PGO works just as it always has, no oddity whatsoever. But I have tested several different versions of Crafty now, and ALL are showing this strange PGO behavior: one version with threads runs at 20M nps; recompile it, and it runs at 40M.

Still testing to try to figure out why this is happening. The single-thread execution times are just as expected, so apparently there is some quirk related to threads that I have not discovered yet. But I plan on looking until I find it, even if I have to produce asm for the whole program to see what is up...