Strange sporadic speed limitation in engine running in Linux on Ryzen

RubiChess
Posts: 584
Joined: Fri Mar 30, 2018 7:20 am
Full name: Andreas Matthies

Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by RubiChess »

Hi.

Deep Linux system expertise badly needed...

While testing my engine on Andrew Grant's OpenBench framework, I discovered several very unbalanced test results.
E.g. when testing an eval parameter change (http://chess.grantnet.us/test/4856/), I got a +20 Elo result on my own worker, which runs the OpenBench client under Ubuntu 19.10 on a Ryzen 3700X. That is absolute nonsense. After stopping the OpenBench worker and running a bench on the two engine binaries, it turned out that one of them was extremely slow (1767721 nps vs. 2122860 nps).
It isn't a bad compilation, because if I copy the slow binary and run the copy, the speed is fine. So it seems that "something in the Linux system" slows down the binary. I tried "lsof" to see whether some process holds a handle on the slow binary, but found nothing.

Last time it happened, a reboot cured the slowness. This time the system is still running and waiting for your ideas to analyse the problem...

More description of the problem can be found here: https://github.com/AndyGrant/OpenBench/issues/50

Any idea is welcome.

./Andreas
Deberger
Posts: 91
Joined: Sat Nov 02, 2019 6:42 pm
Full name: ɹǝƃɹǝqǝᗡ ǝɔnɹꓭ

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by Deberger »

RubiChess wrote: Sat Mar 07, 2020 5:26 pm So it seems that "something in the Linux system" slows down the binary.

Any idea is welcome.
"temperature above threshold, cpu clock throttled"
abulmo2
Posts: 433
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by abulmo2 »

I recommend watching the following video about what can affect program performance:
https://www.youtube.com/watch?v=r-TLSBdHe1A
Many things can affect the memory layout of the program and affect its performance, including the directory where it runs from.
Richard Delorme
RubiChess
Posts: 584
Joined: Fri Mar 30, 2018 7:20 am
Full name: Andreas Matthies

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by RubiChess »

Deberger wrote: Sat Mar 07, 2020 8:37 pm
RubiChess wrote: Sat Mar 07, 2020 5:26 pm So it seems that "something in the Linux system" slows down the binary.

Any idea is welcome.
"temperature above threshold, cpu clock throttled"
Nope. I ran the slow engine, then the fast copy, then the slow engine again, in this order, many times. So temperature is not the problem.
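
The interleaved check described above can be automated. The following is only a minimal sketch, not something used in this thread: it assumes the bench searches a fixed workload, so that wall-clock time is a usable stand-in for nps, and the function name `interleaved_timings` is hypothetical.

```python
import statistics
import subprocess
import time


def interleaved_timings(commands, rounds=5):
    """Run every command once per round, in order, and record wall-clock times.

    `commands` maps a label to an argv list. Interleaving the runs means a
    slow drift such as thermal throttling hits all binaries alike.
    """
    timings = {label: [] for label in commands}
    for _ in range(rounds):
        for label, argv in commands.items():
            start = time.perf_counter()
            subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
            timings[label].append(time.perf_counter() - start)
    return timings


def median_timings(timings):
    """Reduce each list of run times to its median, which resists outliers."""
    return {label: statistics.median(times) for label, times in timings.items()}
```

A call such as `median_timings(interleaved_timings({"original": ["./RubiChess", "bench"], "copy": ["./RubiChess.copy", "bench"]}))` would then compare the two files; the name `RubiChess.copy` is made up for illustration. If the original stays slower in every round, throttling is an unlikely explanation.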
BeyondCritics
Posts: 396
Joined: Sat May 05, 2012 2:48 pm
Full name: Oliver Roese

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by BeyondCritics »

Last time it happened, a reboot cured the slowness.
Did you investigate that hint? E.g. before and after a test, start "top" and check for irregularities, like high system load or a thread hogging the CPU.
Or try booting into an older kernel. Or do some diagnostic system performance tests and check for irregularities.
RubiChess
Posts: 584
Joined: Fri Mar 30, 2018 7:20 am
Full name: Andreas Matthies

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by RubiChess »

BeyondCritics wrote: Sun Mar 08, 2020 12:51 pm
Last time it happened, a reboot cured the slowness.
Did you investigate that hint? E.g. before and after a test, start "top" and check for irregularities, like high system load or a thread hogging the CPU.
Or try booting into an older kernel. Or do some diagnostic system performance tests and check for irregularities.
I can exclude causes like other processes stressing the CPU. As I said, it is exactly "this" binary file that runs slow, nothing else. Running a copy of it the next moment gives full speed, and running the slow original right after gives another slow result. So the system itself and every other engine binary run fast; only the one that caused the bad result in OpenBench keeps running slow until the system is rebooted. On Windows I would probably blame the virus scanner, but on Linux??
What exactly do you mean by "do some diagnostic system performance tests"?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by bob »

Very first thing: run "top" and keep it active. See if, during the match, the CPU utilization jumps up due to something in your Linux distro. I had this happen to me years ago in SUSE, which I always considered to be overloaded/bloated anyway. If you don't want to watch top, you might try this:

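#!/bin/csh
# Once a minute, append a timestamp and the top CPU consumers to "logfile".
while (1)
    date >>logfile
    # Linux procps syntax: list all processes sorted by CPU usage, highest first
    ps aux --sort=-%cpu | head -n 10 >>logfile
    sleep 60
end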

When you find a strange slowdown, look in the log file for that time-frame and see if something unexpected is going on.
Deberger
Posts: 91
Joined: Sat Nov 02, 2019 6:42 pm
Full name: ɹǝƃɹǝqǝᗡ ǝɔnɹꓭ

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by Deberger »

Apparently I misunderstood the error you are reporting.

A copy of a binary, executed on the same machine, executed in the same way, has consistently differing results?

I would compare the file sizes and file ownerships and file permissions and md5sums.
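
The comparison above could be scripted roughly as follows. This is a sketch under assumptions, not a tool mentioned in this thread: the helper names `file_fingerprint` and `differing_fields` are hypothetical, and only the four properties listed above are checked.

```python
import hashlib
import os
import stat


def file_fingerprint(path, chunk_size=1 << 20):
    """Collect size, ownership, permissions and MD5 digest of one file."""
    st = os.stat(path)
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a large binary does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "size": st.st_size,
        "owner": (st.st_uid, st.st_gid),
        "mode": stat.filemode(st.st_mode),
        "md5": digest.hexdigest(),
    }


def differing_fields(path_a, path_b):
    """Return the names of the fingerprint fields on which the files differ."""
    a = file_fingerprint(path_a)
    b = file_fingerprint(path_b)
    return [key for key in a if a[key] != b[key]]
```

For example, `differing_fields("RubiChess", "RubiChess.copy")` (both file names are placeholders) returns an empty list when the slow original and the fast copy agree on all four properties.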

If everything is the same, I would back up any valuable data and check the file system with fsck.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by petero2 »

abulmo2 wrote: Sat Mar 07, 2020 9:05 pm I recommend to see the following video about what can affect program performance:
https://www.youtube.com/watch?v=r-TLSBdHe1A
Many things can affect the memory layout of the program and affect its performance, including the directory where it runs from.
In particular for your situation:
  1. Does the copied engine run faster even if the copy is in the same directory as the original and the filename of the copy has the same length as the original filename?
  2. If you rename the original engine file (instead of copying it), does it still run slow?
  3. Does this slowdown happen when you use a single search thread?
  4. Does this slowdown happen even if you lock the program to a single CPU core, e.g. something like:

    taskset 1 ./RubiChess bench
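
For scripted test runs, the same pinning can be done from Python. This is a Linux-only sketch with hypothetical helper names (`mask_to_cpus`, `run_pinned`); it assumes the mask is interpreted as taskset does, so that mask 1 means CPU 0.

```python
import os
import subprocess


def mask_to_cpus(mask):
    """Convert a taskset-style bit mask into a set of CPU indices."""
    return {bit for bit in range(mask.bit_length()) if (mask >> bit) & 1}


def run_pinned(argv, mask):
    """Run argv with its CPU affinity restricted to the cores in mask."""
    cpus = mask_to_cpus(mask)
    # preexec_fn runs in the child between fork and exec, so only the
    # child process is pinned and the calling script keeps its affinity.
    return subprocess.run(
        argv,
        check=True,
        preexec_fn=lambda: os.sched_setaffinity(0, cpus),
    )
```

`run_pinned(["./RubiChess", "bench"], 1)` would then correspond to the taskset command above.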
RubiChess
Posts: 584
Joined: Fri Mar 30, 2018 7:20 am
Full name: Andreas Matthies

Re: Strange sporadic speed limitation in engine running in Linux on Ryzen

Post by RubiChess »

petero2 wrote: Sun Mar 08, 2020 10:45 pm
abulmo2 wrote: Sat Mar 07, 2020 9:05 pm I recommend to see the following video about what can affect program performance:
https://www.youtube.com/watch?v=r-TLSBdHe1A
Many things can affect the memory layout of the program and affect its performance, including the directory where it runs from.
In particular for your situation:
  1. Does the copied engine run faster even if the copy is in the same directory as the original and the filename of the copy has the same length as the original filename?
  2. If you rename the original engine file (instead of copying it), does it still run slow?
  3. Does this slowdown happen when you use a single search thread?
  4. Does this slowdown happen even if you lock the program to a single CPU core, e.g. something like:

    taskset 1 ./RubiChess bench
Thanks for the response. Here are some answers:
  1. Yes. I even moved the fast copy to exactly the same folder and name as the slow original (after renaming that one) and it still ran faster. Renaming or moving the slow original file doesn't change its slow speed.
  2. Yes. Already answered above. It seems this doesn't fit the arguments mentioned in the video.
  3. Yes. All tests were done with a single thread.
  4. I have used taskset. I don't remember exactly, but I'm pretty sure it didn't change anything in the speed.
I tried to reproduce the problem by rebooting about 15 times and then running the bench on the two binaries again.
I had two "little" reproductions where the binary consistently reached only ~2.09 Mnps instead of 2.15 Mnps. A small difference, but noticeable. Both reproductions (and also the original problem I reported) happened after I had run Windows (the computer is dual-boot) and then rebooted directly into Linux (warm reboot), so this may be related. Maybe some hardware was configured by Windows and not completely reinitialized by Linux. But it is still strange that single binary files are affected.

From now on I will always do a clean cold boot before running the OpenBench client, and we'll see if it happens again. But I fear that I have seen results from other clients at OpenBench that are also very biased.

Thanks for your advice.
Andreas
Last edited by RubiChess on Tue Mar 10, 2020 6:45 pm, edited 1 time in total.