Houdini with a six point lead near the halfway point of TCEC

tmokonen
Posts: 1296
Joined: Sun Mar 12, 2006 6:46 pm
Location: Kelowna
Full name: Tony Mokonen

Re: Houdini with a six point lead near the halfway point of

Post by tmokonen »

Interesting that the Komodo team are using an old version of MinGW. Not too surprising though, after reading the thread "gcc4.8 outperforming gcc5, gcc6, gcc7" in the programming forum.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Houdini with a six point lead near the halfway point of

Post by syzygy »

velmarin wrote:Houdart is right to decline a change.
Yes, and in my view the request to replace Komodo should have been rejected without consulting Houdart.

(Someone will bring up the Stockfish "precedent", but the Stockfish team back then never requested a replacement of the binary. They may have asked for an increase in MoveOverhead - I am not entirely sure - but what they got was the old binary they did not ask for, and which then promptly lost another game on time because the problem was unrelated to the switch to lazy smp in the first place.)
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Houdini with a six point lead near the halfway point of

Post by syzygy »

AdminX wrote:It is important to point out that the approximately 8% speed reduction we noted on our best hardware (24 cores) is apparently as high as 23% on TCEC's 44-core machine based on Komodo's relative nodes per second vs. Houdini in Stage 2.
In my view, this has the signs of a case of false sharing, not of a compiler bug.
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Houdini with a six point lead near the halfway point of

Post by mjlef »

syzygy wrote:
AdminX wrote:It is important to point out that the approximately 8% speed reduction we noted on our best hardware (24 cores) is apparently as high as 23% on TCEC's 44-core machine based on Komodo's relative nodes per second vs. Houdini in Stage 2.
In my view, this has the signs of a case of false sharing, not of a compiler bug.
We have seen the slowdown grow with the number of threads when LTO is used. Is there any way to avoid "false sharing"? I assume the C++ keywords "volatile" and "static" guide the compiler to know when it has to go fetch real memory versus cached memory. In Komodo I cannot think of two memory fetches close to each other shared by the threads. But I will start looking for them. Threads have several eval-related hashes that they use exclusively. But they should not share more than a few bytes, if even that, given the alignment when we allocate. Code is shared, but that is read and not written. Anyway, a lot to think about.
Harvey Williamson
Posts: 2010
Joined: Sun May 25, 2008 11:12 pm
Location: Whitchurch. Shropshire, UK.
Full name: Harvey Williamson

Re: Houdini with a six point lead near the halfway point of

Post by Harvey Williamson »

Dirt wrote:
velmarin wrote:The Komodo team looked for one more thing, and lost strength.
Blaming the compiler is simply absurd.
No, the compiler shares the blame.
velmarin wrote:Houdart is right to decline a change.
Yes, I agree with that.
Was a change requested?
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Houdini with a six point lead near the halfway point of

Post by Ras »

mjlef wrote:I assume the C++ keywords "volatile" and "static" guide the compiler to know when it has to go fetch real memory versus cached memory.
No - because the compiler has no idea about the CPU cache.

"volatile" instructs the compiler not to optimise away accesses because the variable may have changed "from outside" the current control flow. Whether such an access is served from the CPU cache or from main memory is not visible to the compiler.

"static" controls scope and linkage, which of course makes optimisation easier.

For false sharing: if you give variables A and B an alignment directive of 64 bytes each, they cannot end up in the same 64-byte cache line. If that is enough to prevent the cache collision, I guess that should be fine:

Code:

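#include <stdint.h>
/* GCC/Clang attribute syntax: each variable starts on its own 64-byte boundary. */
uint32_t  __attribute__((aligned (64))) A;
uint32_t  __attribute__((aligned (64))) B;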
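
The same idea can be written in standard C++11 with alignas, which also works for per-thread data kept in an array. This is only a minimal sketch, assuming 64-byte cache lines; ThreadStats, its fields, the array size of 8 and the helper onDifferentLines are hypothetical names for illustration, not taken from any engine:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on current x86-64 CPUs.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread statistics block. alignas raises the alignment to a
// full cache line, and sizeof is padded up to a multiple of that alignment,
// so neighbouring array elements never share a line.
struct alignas(kCacheLine) ThreadStats {
    std::uint64_t nodes = 0;
    std::uint64_t tbHits = 0;
};

static_assert(sizeof(ThreadStats) == kCacheLine, "one element per cache line");
static_assert(alignof(ThreadStats) == kCacheLine, "elements start on a line boundary");

// One slot per search thread; thread i is meant to write only stats[i].
ThreadStats stats[8];

// Returns true if the two objects start in different cache lines.
inline bool onDifferentLines(const void* a, const void* b) {
    auto lineA = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    auto lineB = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return lineA != lineB;
}
```

The cost is memory: each 16-byte payload now occupies 64 bytes, which is usually acceptable for a handful of per-thread counters but not for large tables.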
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Houdini with a six point lead near the halfway point of

Post by syzygy »

mjlef wrote:
syzygy wrote:
AdminX wrote:It is important to point out that the approximately 8% speed reduction we noted on our best hardware (24 cores) is apparently as high as 23% on TCEC's 44-core machine based on Komodo's relative nodes per second vs. Houdini in Stage 2.
In my view, this has the signs of a case of false sharing, not of a compiler bug.
We have seen the slowdown grow with the number of threads when LTO is used. Is there any way to avoid "false sharing"? I assume the C++ keywords "volatile" and "static" guide the compiler to know when it has to go fetch real memory versus cached memory. In Komodo I cannot think of two memory fetches close to each other shared by the threads. But I will start looking for them. Threads have several eval-related hashes that they use exclusively. But they should not share more than a few bytes, if even that, given the alignment when we allocate. Code is shared, but that is read and not written. Anyway, a lot to think about.
Maybe "perf c2c" could help:
https://joemario.github.io/blog/2016/09/01/c2c-blog/

On a recent Linux system, perf should already have the "c2c" option.

Doing "perf c2c record -u stockfish bench 128 6 17" on my 6-core PC and then "perf c2c report --stats", I get:

Code:

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :         11
  Load HITs on shared lines         :         13
  Fill Buffer Hits on shared lines  :          2
  L1D hits on shared lines          :          0
  L2D hits on shared lines          :          0
  LLC hits on shared lines          :         11
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :          0
  Store L1D hits on shared lines    :          0
  Total Merged records              :         11
Interestingly, doing the same with Cfish I get:

Code:

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :        487
  Load HITs on shared lines         :       1692
  Fill Buffer Hits on shared lines  :        698
  L1D hits on shared lines          :        105
  L2D hits on shared lines          :         30
  LLC hits on shared lines          :        653
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :         20
  Store L1D hits on shared lines    :         12
  Total Merged records              :        540
I suspect this has to do with Cfish using per-node CMH tables and SF using per-thread CMH tables. Sharing cache lines between threads running on the same node is not terribly bad. (It is bad if the threads perform locked instructions on the cache line.)

Without --stats you get an overview of the shared cachelines. I only get addresses instead of symbols even when compiled with -g, perhaps because the shared cache lines are on the heap. Or I may be missing an option.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Houdini with a six point lead near the halfway point of

Post by syzygy »

Harvey Williamson wrote:
Dirt wrote:
velmarin wrote:The Komodo team looked for one more thing, and lost strength.
Blaming the compiler is simply absurd.
No, the compiler shares the blame.
velmarin wrote:Houdart is right to decline a change.
Yes, I agree with that.
Was a change requested?
That was not clear to me either from the thread's opening post, but indeed it was:
http://www.chessdom.com/houdini-with-a- ... t-of-tcec/
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Houdini with a six point lead near the halfway point of

Post by syzygy »

For komodo9:

Code:

# perf c2c record -u komodo9
Komodo 9.02 64-bit by Don Dailey, Larry Kaufman and Mark Lefler
using hardware POPCNT
info string Licensed to Komodochess.com
setoption name Hash value 128
setoption name Threads value 6
info string Threads now set to 6
go depth 24
...
quit
[ perf record: Woken up 426 times to write data ]
[ perf record: Captured and wrote 106.983 MB perf.data (1401799 samples) ]
[root@localhost Rustfish]# perf c2c report --stats
...
=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :         14
  Load HITs on shared lines         :       9138
  Fill Buffer Hits on shared lines  :       5033
  L1D hits on shared lines          :        528
  L2D hits on shared lines          :         54
  LLC hits on shared lines          :       1809
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :        178
  Store L1D hits on shared lines    :         14
  Total Merged records              :        598
Doing the same with Stockfish (go depth 29 to get about the same search time):

Code:

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :          1
  Load HITs on shared lines         :          1
  Fill Buffer Hits on shared lines  :          0
  L1D hits on shared lines          :          0
  L2D hits on shared lines          :          0
  LLC hits on shared lines          :          1
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :          0
  Store L1D hits on shared lines    :          0
  Total Merged records              :          1
So Komodo9 seems to have a few cache lines that are accessed relatively often by different threads.
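
If some of those lines turn out to be the edges of per-thread tables (which is only a guess, the report above does not show that), one thing to try would be to allocate each thread's table on a cache-line boundary and in whole lines. A minimal sketch, assuming 64-byte lines and a C++17 compiler; roundUpToLine, allocThreadTable and freeThreadTable are hypothetical helpers, not Komodo or Stockfish code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Round a byte count up to a whole number of cache lines.
constexpr std::size_t roundUpToLine(std::size_t bytes) {
    return (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
}

// Hypothetical helper: allocate one thread's private table so that it starts
// on a cache-line boundary and occupies whole lines only. Uses the C++17
// aligned form of operator new.
inline void* allocThreadTable(std::size_t bytes) {
    return ::operator new(roundUpToLine(bytes), std::align_val_t(kCacheLine));
}

// Memory from allocThreadTable must be released with the matching aligned delete.
inline void freeThreadTable(void* p) {
    ::operator delete(p, std::align_val_t(kCacheLine));
}
```

With this, two tables owned by different threads cannot share a line even when the allocator places them back to back.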
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Houdini with a six point lead near the halfway point of

Post by corres »

AdminX wrote:
Chessdom Write Up:
Statement by team Komodo

"....Although the bug has probably cost us some points it probably does not fully explain the current five point score deficit."

This is the essence.
Houdini is the better engine now.
I think TCEC would be a fairer competition if none of the engines originally entered were replaced.