The most muscular compiler switch I ever saw
Moderator: Ras
-
Dann Corbit
- Posts: 12828
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
The most muscular compiler switch I ever saw
The bleeding-edge Stockfish code, fetched from GitHub today, gave the following results:
Arch=avx2:
Total time (ms) : 258602
Nodes searched : 3311863930
Nodes/second : 12806799
Arch=native:
Total time (ms) : 109160
Nodes searched : 1347508396
Nodes/second : 12344342
The time is less than half.
The nodes are less than half.
The NPS is very close to equal.
The only thing that I can figure is that the profile-guided optimization caused much better move ordering when the architecture was set to native. Of course, with an AMD Ryzen Threadripper 3970X, one would think that avx2 and native would be nearly identical.
Thoughts?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
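For anyone who wants to reproduce the comparison, the two builds can presumably be produced along these lines with a recent Stockfish Makefile (a sketch, not the exact commands used above; the ARCH names should match those listed by "make help" in recent versions, and the executable may be named stockfish.exe on Windows builds):
$ cd Stockfish/src
$ make -j profile-build ARCH=x86-64-avx2
$ cp stockfish stockfish-avx2
$ make clean
$ make -j profile-build ARCH=native
$ cp stockfish stockfish-native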
-
Dann Corbit
- Posts: 12828
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: The most muscular compiler switch I ever saw
I should mention: 16GB RAM for hash, and 16 threads, depth = 20 for the benchmark.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
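Assuming those settings were passed straight to the bench command, the run would look something like this (Stockfish's bench arguments are hash size in MB, thread count, then depth):
$ ./stockfish bench 16384 16 20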
-
Dann Corbit
- Posts: 12828
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: The most muscular compiler switch I ever saw
One more thing (probably inconsequential): I profile longer than the standard Makefile does, using this bench command:
PGOBENCH = $(WINE_PATH) ./$(EXE) bench 16384 1 18
I have found that increasing the thread count destroys the profile. Seems odd. Other profilers I have used work fine with SMP.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
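For reference, those three bench arguments are hash size in MB, thread count, and depth, so the modified line profiles with a 16 GB hash, one thread, and depth 18 rather than the default bench. A sketch of one way to apply the same change without editing the Makefile, assuming GNU make's command-line variable override and an executable named stockfish:
$ make -j profile-build ARCH=native PGOBENCH="./stockfish bench 16384 1 18"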
-
Dann Corbit
- Posts: 12828
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: The most muscular compiler switch I ever saw
For those with similar hardware who want to replicate the result, here is the compiler version that I am using:
$ g++ --version
g++.exe (Rev11, Built by MSYS2 project) 15.2.0
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
AndrewGrant
- Posts: 1971
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: The most muscular compiler switch I ever saw
Dann Corbit wrote: ↑Thu Feb 05, 2026 7:44 am
The only thing that I can figure is that the profile-guided optimization caused much better move ordering when the architecture was set to native. Of course, with an AMD Ryzen Threadripper 3970X, one would think that avx2 and native would be nearly identical.
PGO does not produce functional differences. The move ordering is the same.
-
jdart
- Posts: 4423
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: The most muscular compiler switch I ever saw
-march=native will enable optimizations that are specific to the processor you are building on. Since Intel/AMD add new instructions regularly, processors can differ in terms of the exact instruction set they support. "avx2" will enable a common subset of instructions, but full use of the supported instruction set may result in better code.
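One way to see which extra instruction-set extensions -march=native turns on relative to a plain AVX2 build is to diff GCC's resolved target flags (a rough sketch; Stockfish's avx2 target actually passes a few more flags than -mavx2 alone):
$ g++ -mavx2 -Q --help=target | grep enabled | sort > avx2.txt
$ g++ -march=native -Q --help=target | grep enabled | sort > native.txt
$ diff avx2.txt native.txt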
-
syzygy
- Posts: 5896
- Joined: Tue Feb 28, 2012 11:56 pm
Re: The most muscular compiler switch I ever saw
Dann Corbit wrote: ↑Thu Feb 05, 2026 7:45 am
I should mention: 16GB RAM for hash, and 16 threads, depth = 20 for the benchmark.
So it's just SMP randomness.
Either use 1 thread or do 20 runs with both and take the average.
The only reason to use multiple threads to benchmark two functionally identical versions is to see if avx512 instructions (if -march=native uses those) cause a slowdown with multiple threads that you might not see with a single thread.
You have to understand that a compiler switch cannot be the reason for a different number of nodes unless there is a bug. The difference is entirely caused by the indeterminacy introduced by multithreading.
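A sketch of the averaging approach as a shell loop, assuming two binaries named stockfish-avx2 and stockfish-native (hypothetical names), with stderr redirected because the bench summary is printed there:
$ for i in $(seq 1 20); do ./stockfish-avx2 bench 16384 16 20; done 2>&1 | grep 'Nodes/second' | awk '{sum += $NF} END {print "avx2 mean NPS:", sum/NR}'
$ for i in $(seq 1 20); do ./stockfish-native bench 16384 16 20; done 2>&1 | grep 'Nodes/second' | awk '{sum += $NF} END {print "native mean NPS:", sum/NR}'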
-
Dann Corbit
- Posts: 12828
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: The most muscular compiler switch I ever saw
The Native version is also a much better problem solver.
Since time to ply is better than cut in half, it seems to be far more capable.
Now, the other version (avx2) searches wider. So you would think that it would solve some problems better. But so far I have not seen that.
It's not just SMP variation, or I would see a fluctuation like that for native on multiple runs, and a fluctuation of that scale for avx2 on multiple runs. But I don't. Now, I don't have any logical explanation of why the native tree seems to be much less bushy. But it is very interesting to me.
I have seen this effect for other engines as well, but those were also related to Stockfish, so it could still be something peculiar to the SF codebase.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
syzygy
- Posts: 5896
- Joined: Tue Feb 28, 2012 11:56 pm
Re: The most muscular compiler switch I ever saw
Dann Corbit wrote: ↑Fri Feb 06, 2026 12:52 am
The Native version is also a much better problem solver.
Nonsense. Unless you found a bug in either SF or your compiler, the two builds are functionally identical.
Do the two versions give identical benches when using 1 thread?
-
Dann Corbit
- Posts: 12828
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: The most muscular compiler switch I ever saw
Bench with no parameters:
avx2
Total time (ms) : 7936
Nodes searched : 2668754
Nodes/second : 336284
native
Total time (ms) : 7354
Nodes searched : 2668754
Nodes/second : 362898
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.