Prefetch and Threading

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Prefetch and Threading

Post by D Sceviour »

The Prefetch() instruction can be used to speed up the read operation for hash table probes. My tests indicate a speed improvement for one CPU using Prefetch(), but increasing threads decrease the effectiveness of Prefetch(). What is happening?

The Prefetch() command works something like this:
"Inter-core prefetching uses one compute thread and one or more prefetching threads. The prefetching threads execute on cores that
would otherwise be idle, prefetching the data that the compute thread will need." [ Inter-core Prefetching for Multi-core Processors Using Migrating Helper Threads; Md Kamruzzaman, Steven Swanson, Dean M. Tullsen ]
This next report indicates that hyper-threading should be turned off if using prefetch for a multi-threaded environment. That is, turn on hyper-threading or prefetch, but not both in the system BIOS.

https://www.ixsystems.com/community/thr ... etch.1226/

Supposedly, multiple prefetch buffers can be allocated on the L2 cache. Does each CPU have its own L2 cache prefetch buffer, or is there only one prefetch thread for the main bus transfer? For Shared Hash Table search threading design, the prefetch thread should apparently be attached to the hash table and not the search thread. The results might be different for engines where each thread has its own hash table.

Assuming the prefetch buffer is allocated to the CPU and not the number of threads using a CPU, then multi-threading on a CPU using prefetch would try to over-write each others buffers. Even if the prefetch thread used locks, the efficiency would be expected to deteriorate. More code identifying multi-thread conditions looks unwise since it would quickly defeat the efficiency expected of prefetch to reduce latency. It is looking as if prefetch can be an inefficient operation.

So the question, is prefetch effective on anything other than single CPU and single thread searches?
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Prefetch and Threading

Post by mar »

D Sceviour wrote: Thu Apr 25, 2019 3:04 pm The Prefetch() instruction can be used to speed up the read operation for hash table probes. My tests indicate a speed improvement for one CPU using Prefetch(), but increasing threads decrease the effectiveness of Prefetch(). What is happening?

The Prefetch() command works something like this:
"Inter-core prefetching uses one compute thread and one or more prefetching threads. The prefetching threads execute on cores that
would otherwise be idle, prefetching the data that the compute thread will need." [ Inter-core Prefetching for Multi-core Processors Using Migrating Helper Threads; Md Kamruzzaman, Steven Swanson, Dean M. Tullsen ]
This next report indicates that hyper-threading should be turned off if using prefetch for a multi-threaded environment. That is, turn on hyper-threading or prefetch, but not both in the system BIOS.

https://www.ixsystems.com/community/thr ... etch.1226/

Supposedly, multiple prefetch buffers can be allocated on the L2 cache. Does each CPU have its own L2 cache prefetch buffer, or is there only one prefetch thread for the main bus transfer? For Shared Hash Table search threading design, the prefetch thread should apparently be attached to the hash table and not the search thread. The results might be different for engines where each thread has its own hash table.

Assuming the prefetch buffer is allocated to the CPU and not the number of threads using a CPU, then multi-threading on a CPU using prefetch would try to over-write each others buffers. Even if the prefetch thread used locks, the efficiency would be expected to deteriorate. More code identifying multi-thread conditions looks unwise since it would quickly defeat the efficiency expected of prefetch to reduce latency. It is looking as if prefetch can be an inefficient operation.

So the question, is prefetch effective on anything other than single CPU and single thread searches?
You seem confused.
prefetch is a CPU instruction https://c9x.me/x86/html/file_module_x86_id_252.html

There are no "prefetch buffers" whatsoever, you simply prefetch a cache line which is 64 bytes for x86 across all levels.
"prefetching thread" would be certainly a nonsense for a chess engine; you want to dedicate all your available cores to search.
L1 is per core, L2 may/may not be shared among multiple cores and so on.

I never used prefetch in any of my projects; if you prefetch something you won't access, you risk wasting precious cache and actually running slower.
In order to prefetch TT, you can't do that unless you know the hash of the next move you'll search anyway, so not exactly a plenty of time until you enter a new node.
My guess is the gain would be minimal anyway, but those who actually use prefetch will know better. I'd be interested myself in how much prefetch really gains, I'm a bit skeptical (I may be wrong).
Martin Sedlak
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Prefetch and Threading

Post by Sesse »

I tried prefetch for TT in Stockfish at some point, and couldn't measure a gain. As you say, the prefetch is really only effective if you can issue the prefetch some time before the actual access.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Prefetch and Threading

Post by D Sceviour »

The following nodes per second measurement was taken from a search on identical positions using a single thread and fixed search depth of 28 ply. The hardware is an Intel i9-7960X::

Using prefetch, 1 core:
elap: 5.045000
Total Nodes=15911571 nps: 3153928 prefetch

Using no prefetch, 1 core:
elap: 5.107000
Total Nodes=15911571 nps: 3115639 no prefetch

The represents about a 1% increase in speed for one core using prefetch. Tests indicated no significant change in elo strength, but maybe a small increase of less than 5 elo might be possible.

Tests indicated a reduction in nps when using 8 cores on the same position. I attribute this to a thrashing bottleneck on the main bus transfer, but there might be another answer.

prefetch 8 cores:
elap: 5.744000
Total Nodes=17304351 nps: 3012595

no prefetch 8 cores:
elap: 5.399000
Total Nodes=15722698 nps: 2912150
abulmo2
Posts: 433
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Prefetch and Threading

Post by abulmo2 »

D Sceviour wrote: Thu Apr 25, 2019 3:04 pm The Prefetch() instruction can be used to speed up the read operation for hash table probes. My tests indicate a speed improvement for one CPU using Prefetch(), but increasing threads decrease the effectiveness of Prefetch(). What is happening?
I do not know what is happening, but I do observe a similar pattern with Amoeba 3.
Here my data:
  • CPU Ryzen 1700x (8 core, 16 threads), memory: 16 Gb of DDR4 3000Mhz.
  • Amoeba 3.1.x (in development) searching the bratko-kopec test suite at depth 15.
  • I compute the average time +/- 1.96 x mean error type, based on 10 runs for single threaded searches & 30 runs for multithreaded searches.

Code: Select all

single thread + prefetch:    50.12s +/- 0.05s
single thread + no prefetch: 52.82s +/- 0.04s
8 threads + prefetch:        29.33s +/- 0.61s
8 threads + no prefetch:     29.14s +/- 0.79s
Prefetch brings a clear benefit when a single thread is used, which vanishes when using 8 threads.
Richard Delorme