Houdini with a six point lead near the halfway point of TCEC

tmokonen · Post by **tmokonen** » Tue Nov 28, 2017 8:25 pm

Interesting that the Komodo team are using an old version of MinGW. Not too surprising though, after reading the thread "gcc4.8 outperforming gcc5, gcc6, gcc7" in the programming forum.

syzygy · Post by **syzygy** » Tue Nov 28, 2017 8:39 pm

velmarin wrote:Houdart is right to decline a change.

Yes, and in my view the request to replace Komodo should have been rejected without consulting Houdart.

(Someone will bring up the Stockfish "precedent", but the Stockfish team back then never requested a replacement of the binary. They may have asked for an increase in MoveOverhead - I am not entirely sure - but what they got was the old binary they did not ask for, and which then promptly lost another game on time because the problem was unrelated to the switch to lazy smp in the first place.)

syzygy · Post by **syzygy** » Tue Nov 28, 2017 9:34 pm

AdminX wrote:It is important to point out that the approximately 8% speed reduction we noted on our best hardware (24 cores) is apparently as high as 23% on TCEC's 44-core machine based on Komodo's relative nodes per second vs. Houdini in Stage 2.

In my view, this has the signs of a case of false sharing, not of a compiler bug.

mjlef · Post by **mjlef** » Tue Nov 28, 2017 9:59 pm

syzygy wrote:
AdminX wrote:It is important to point out that the approximately 8% speed reduction we noted on our best hardware (24 cores) is apparently as high as 23% on TCEC's 44-core machine based on Komodo's relative nodes per second vs. Houdini in Stage 2.
In my view, this has the signs of a case of false sharing, not of a compiler bug.

We have seen the slowdown grow with the number of threads when LTO is used. Is there any way to avoid "false sharing"? I assume the C++ keywords "volatile" and "static" guide the compiler to know when it has to go fetch real memory versus cached memory. In Komodo I cannot think of two memory fetches close to each other shared by the threads. But I will start looking for them. Threads have several eval-related hashes that each uses exclusively. But they should not share more than a few bytes, if even that, given the alignment when we allocate. Code is shared, but that is read and not written. Anyway, a lot to think about.

Harvey Williamson · Post by **Harvey Williamson** » Tue Nov 28, 2017 10:09 pm

Dirt wrote:
velmarin wrote:The Komodo team looked for one more thing, and lost strength.
Blaming the compiler is simply absurd.
No, the compiler shares the blame.

velmarin wrote:Houdart is right to decline a change.
Yes, I agree with that.

Was a change requested?

Ras · Post by **Ras** » Tue Nov 28, 2017 10:23 pm

mjlef wrote:I assume the C++ key words "volatile" and "static" guide the compiler to know when it has to go fetch real memory versus cached memory.

No - because the compiler has no idea about the CPU cache.

"volatile" instructs the compiler not to optimise away accesses because the variable may have changed "from outside" the current control flow. Whether this access ends in CPU cache or in actual memory is not visible.

"static" is for scoping, thereby of course easing optimisation.

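To make the distinction concrete, here is a minimal C++ sketch (not taken from any engine; `stop_requested`, `request_stop`, `should_stop` and `read_twice` are hypothetical names). It assumes the usual reading of the language rules: "volatile" only forces the compiler to emit each access, while a flag that several threads read and write is normally expressed with `std::atomic`. Neither one says anything about which cache level serves the access.

```cpp
#include <atomic>
#include <cassert>

namespace {  // internal linkage, comparable to a file-scope "static"
// Hypothetical flag that one thread sets and search threads poll.
std::atomic<bool> stop_requested{false};
}  // namespace

// Atomic store: other threads will eventually observe the new value.
void request_stop() { stop_requested.store(true, std::memory_order_relaxed); }

// Atomic load: safe to call concurrently with request_stop().
bool should_stop() { return stop_requested.load(std::memory_order_relaxed); }

// "volatile" forces two separate loads here instead of one reused value.
// It does not make the reads atomic and does not order them between threads.
int read_twice(const volatile int* p) {
    assert(p != nullptr);
    int first = *p;
    int second = *p;
    return first + second;
}
```

Whether either load is served from L1, from another core's cache or from DRAM is decided by the hardware and is invisible at this level.
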
As for false sharing: if you give variables A and B each an alignment directive of 64 bytes, they cannot end up in the same 64-byte cache line. If that is enough to prevent the cache-line collision, I guess that should be fine:

Code: Select all

uint32_t  __attribute__((aligned (64))) A;
uint32_t  __attribute__((aligned (64))) B;
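
The same idea can be sketched in standard C++ with `alignas`, which also pads the struct size so that array elements stay on separate lines. This is only an illustration under the assumption of 64-byte cache lines; `PaddedCounter`, `PackedCounters`, `cache_line_of` and `same_cache_line` are made-up names, not part of any engine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size. 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// One counter per cache line: alignas raises both the alignment and the
// size of the struct to 64 bytes, so neighbours in an array never share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};
static_assert(alignof(PaddedCounter) == kCacheLine, "aligned to a cache line");
static_assert(sizeof(PaddedCounter) == kCacheLine, "padded to a full cache line");

// Four counters packed next to each other inside a single line. Threads
// updating value[0] and value[1] would contend for that one line.
struct alignas(kCacheLine) PackedCounters {
    std::uint64_t value[4] = {0, 0, 0, 0};
};

// Index of the cache line that contains address p (assuming kCacheLine bytes per line).
inline std::uintptr_t cache_line_of(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// True if both addresses fall into the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
    return cache_line_of(a) == cache_line_of(b);
}
```

With a `PaddedCounter stats[N]` array, `same_cache_line(&stats[0], &stats[1])` is false, whereas two neighbouring entries of `PackedCounters::value` report true. The price is memory: each padded counter occupies 64 bytes instead of 8.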

syzygy · Post by **syzygy** » Tue Nov 28, 2017 10:30 pm

mjlef wrote:
syzygy wrote:
AdminX wrote:It is important to point out that the approximately 8% speed reduction we noted on our best hardware (24 cores) is apparently as high as 23% on TCEC's 44-core machine based on Komodo's relative nodes per second vs. Houdini in Stage 2.
In my view, this has the signs of a case of false sharing, not of a compiler bug.
We have seen the slowdown grow with the number of threads when LTO is used. Is there any way to avoid "false sharing"? I assume the C++ keywords "volatile" and "static" guide the compiler to know when it has to go fetch real memory versus cached memory. In Komodo I cannot think of two memory fetches close to each other shared by the threads. But I will start looking for them. Threads have several eval-related hashes that each uses exclusively. But they should not share more than a few bytes, if even that, given the alignment when we allocate. Code is shared, but that is read and not written. Anyway, a lot to think about.

Maybe "perf c2c" could help:
https://joemario.github.io/blog/2016/09/01/c2c-blog/

On a recent Linux system, perf should already have the "c2c" option.

Doing "perf c2c record -u stockfish bench 128 6 17" on my 6-core PC and then "perf c2c report --stats", I get:

Code: Select all

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :         11
  Load HITs on shared lines         :         13
  Fill Buffer Hits on shared lines  :          2
  L1D hits on shared lines          :          0
  L2D hits on shared lines          :          0
  LLC hits on shared lines          :         11
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :          0
  Store L1D hits on shared lines    :          0
  Total Merged records              :         11

Interestingly, doing the same with Cfish I get:

Code: Select all

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :        487
  Load HITs on shared lines         :       1692
  Fill Buffer Hits on shared lines  :        698
  L1D hits on shared lines          :        105
  L2D hits on shared lines          :         30
  LLC hits on shared lines          :        653
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :         20
  Store L1D hits on shared lines    :         12
  Total Merged records              :        540

I suspect this has to do with Cfish using per-node CMH tables and SF using per-thread CMH tables. Sharing cache lines between threads running on the same node is not terribly bad. (It is bad if the threads perform locked instructions on the cache line.)

Without --stats you get an overview of the shared cachelines. I only get addresses instead of symbols even when compiled with -g, perhaps because the shared cache lines are on the heap. Or I may be missing an option.

syzygy · Post by **syzygy** » Tue Nov 28, 2017 10:34 pm

Harvey Williamson wrote:
Dirt wrote:
velmarin wrote:The Komodo team looked for one more thing, and lost strength.
Blaming the compiler is simply absurd.
No, the compiler shares the blame.

velmarin wrote:Houdart is right to decline a change.
Yes, I agree with that.
Was a change requested?

That was not clear to me either from the thread's opening post, but indeed it was:
http://www.chessdom.com/houdini-with-a- ... t-of-tcec/

syzygy · Post by **syzygy** » Tue Nov 28, 2017 10:44 pm

For komodo9:

Code: Select all

# perf c2c record -u komodo9
Komodo 9.02 64-bit by Don Dailey, Larry Kaufman and Mark Lefler
using hardware POPCNT
info string Licensed to Komodochess.com
setoption name Hash value 128
setoption name Threads value 6
info string Threads now set to 6
go depth 24
...
quit
[ perf record: Woken up 426 times to write data ]
[ perf record: Captured and wrote 106.983 MB perf.data (1401799 samples) ]
[root@localhost Rustfish]# perf c2c report --stats
...
=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :         14
  Load HITs on shared lines         :       9138
  Fill Buffer Hits on shared lines  :       5033
  L1D hits on shared lines          :        528
  L2D hits on shared lines          :         54
  LLC hits on shared lines          :       1809
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :        178
  Store L1D hits on shared lines    :         14
  Total Merged records              :        598

Doing the same with Stockfish (go depth 29 to get about the same search time):

Code: Select all

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :          1
  Load HITs on shared lines         :          1
  Fill Buffer Hits on shared lines  :          0
  L1D hits on shared lines          :          0
  L2D hits on shared lines          :          0
  LLC hits on shared lines          :          1
  Locked Access on shared lines     :          0
  Store HITs on shared lines        :          0
  Store L1D hits on shared lines    :          0
  Total Merged records              :          1

So Komodo9 seems to have a few cache lines that are accessed relatively often by different threads.

corres · Post by **corres** » Tue Nov 28, 2017 10:55 pm

AdminX wrote:
Chessdom Write Up:
**Statement by team Komodo**

"....Although the bug has probably cost us some points it probably does not fully explain the current five point score deficit."

This is the essence.
Houdini is the better engine now.
I think TCEC would be a fairer competition if none of the initially entered engines were swapped out.
