zullil wrote: So the non-NUMA-aware master is about 1% faster in nps.
Thanks Louis, this was expected.
The non-NUMA version uses a per-thread big table, while the NUMA version uses a per-node table, so memory contention is higher. The CFish result is different, very probably because non-NUMA CFish still uses a single global table shared by all threads.
The point is that the so-called SF NUMA patch does two things:
- Sets the affinity of each thread
- Changes the table to be per-node
To really evaluate the NUMA contribution alone we should test just the first, all other things being equal. Any other tweaks are second-order improvements, but first we have to characterize NUMA alone.
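To make the first point concrete, here is a minimal sketch of affinity-only pinning, assuming Linux and pthreads; the pin_to_cpu helper and the naive "thread i -> logical CPU i" mapping are illustrative, not taken from the SF patch.

[code]
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin the calling thread to one logical CPU (Linux, compile with -pthread).
static bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    const int num_threads = 4;               // illustrative thread count
    std::vector<std::thread> pool;

    for (int i = 0; i < num_threads; ++i)
        pool.emplace_back([i] {
            pin_to_cpu(i);                    // naive mapping: thread i -> logical CPU i
            // ... run the search loop here ...
        });

    for (auto& t : pool)
        t.join();
}
[/code]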
By NUMA I mean:
1. Setting thread affinity, one thread per core.
2. Allocating per-thread tables once affinity has been set.
The second point is not present in the so-called NUMA patch but makes perfect sense once you pin the threads. Note that if you don't set affinity, i.e. in the normal case, the second point has been tested and changes nothing (possibly because threads migrate too quickly, or because threads are more or less stable and the OS has the opportunity to migrate memory to the corresponding core).
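A sketch of the second point under the usual Linux first-touch policy: each already-pinned thread allocates and zeroes its own table, so the pages get mapped onto that thread's local NUMA node. The table type and size here are placeholders, not SF's actual structures.

[code]
#include <cstddef>
#include <cstring>

// Placeholder for a per-thread table; not the real SF data structure.
struct PerThreadTable {
    static constexpr std::size_t Entries = 1 << 20;
    int data[Entries];
};

// Call this from inside a search thread *after* its affinity has been set.
// Under the default Linux first-touch policy, writing the memory here makes
// the kernel back it with pages on the NUMA node the thread is pinned to.
PerThreadTable* allocate_local_table() {
    PerThreadTable* table = new PerThreadTable;
    std::memset(table->data, 0, sizeof(table->data));  // first touch
    return table;
}
[/code]

If affinity is not set (the normal case above), this first touch buys nothing, which matches the test result just mentioned.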
Following the other interesting posts in this thread, I see that a notable difference between Texel and SF is that in the former the threads sleep quite often, while with SF's lazy SMP the threads never sleep, and this allows the OS to place them optimally in a dynamic fashion.
I still prefer to let the OS place the threads for me, as long as I feed it a predictable set of long-running threads (to avoid tricking it with sleeping threads), rather than setting affinity manually, mainly because the OS knows more about the difference between physical and logical cores. For instance, on a 2-core machine with hyper-threading enabled (CPU 0, CPU 1, CPU 2, CPU 3), my manual setting would probably end up pinning 2 threads to CPU 0 and CPU 1, while the OS, realizing that CPU 0 and 1 are the same physical core, would perhaps place them on CPU 0 and CPU 3.
In general, telling hyper-threads apart from physical cores is far from trivial, especially on Intel hardware, so people who will cheaply start commenting "you need to detect ht ..." and other similar BS: please don't spam this thread.