Clarification Question

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Werewolf
Posts: 1991
Joined: Thu Sep 18, 2008 10:24 pm

Clarification Question

Post by Werewolf »

On MacOS, to run Windows one simply needs to subscribe to Parallels. Within Parallels one can install Windows 11 and run normal chess software, including Stockfish.

Because Parallels is a hypervisor rather than an emulator, I have heard it couldn't handle instructions like AVX2 etc.
In other words Stockfish would run in Windows but using a much slower build.

Can I just double-check this lack of AVX support is still true?
Eelco de Groot
Posts: 4658
Joined: Sun Mar 12, 2006 2:40 am
Full name:   Eelco de Groot

Re: Clarification Question

Post by Eelco de Groot »

Hello, I really have no idea, but just reading what the AI says, my guess is that it does not really matter what Parallels is or does. Apple now runs on its own hardware. That seems to imply that any software more or less optimized for instruction sets available on PCs (AMD or Intel, that is my guess) may run into problems if you want to run it on an M3 or M4 from Apple, just as an example. It has nothing to do with how Parallels then tries to run that software. On top of this, Windows is not designed for that hardware, and neither Apple nor Microsoft seems very inclined to put much effort into making Windows more compatible, which, I would think, would make it possible to do real dual boots etc. I just read some Reddit post about that. There was Boot Camp for that, but not for Apple Silicon anymore, or something like that?

The main problem seems to be that if you want to run Stockfish on the Mac, not many GUIs are available. Stockfish is not really the problem; it will still be really fast even if not optimally tuned for the M4 or whatever, but the GUIs will not all be there. In Parallels I don't know if Stockfish can achieve full speed, but in macOS it should, assuming the processor was taken into consideration (maybe not the latest M4 or even M3, but still pretty fast). Running the designed-for-Mac version of Stockfish will maybe keep you out of trouble with the instruction sets not available on the M4, but the same cannot be said for run-of-the-mill software or other chess programs etc.

I am just guessing...
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
Eelco de Groot
Posts: 4658
Joined: Sun Mar 12, 2006 2:40 am
Full name:   Eelco de Groot

Re: Clarification Question

Post by Eelco de Groot »

If the 'designed for Mac ARM' version of Stockfish will not run in Windows, because the libraries are for macOS for instance, then that version will not run in Parallels either, and you need a generic Windows version of Stockfish, which will then be emulated on ARM by the Parallels software. None of the fastest instruction sets from Intel or AMD are available on ARM anyway (no complex instructions; speed comes from simple instructions). I imagine that is not what you want for speed.
Werewolf
Posts: 1991
Joined: Thu Sep 18, 2008 10:24 pm

Re: Clarification Question

Post by Werewolf »

This isn't really answering the question, but thanks anyway.

Does anyone actually know...?
smatovic
Posts: 3224
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Clarification Question

Post by smatovic »

As of Dec 6, 2024

https://github.com/official-stockfish/S ... 2522933228
Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.
As of today, from the Apple Rosetta documentation:

What Can’t Be Translated?
https://developer.apple.com/documentati ... nvironment
Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.
Idk if there is some technical limitation in the Arm NEON architecture that prevents AVX support from being implemented in Rosetta 2. I assume that Parallels uses Rosetta 2 to run Windows on M-series macOS.
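For anyone who wants to check this on their own machine: Apple's Rosetta documentation describes a sysctl key, sysctl.proc_translated, that reports whether the current process is being translated by Rosetta 2. A small Python sketch (the fall-back-to-False behaviour on other platforms is my own choice, not from Apple's docs):

```python
import subprocess

def running_under_rosetta() -> bool:
    """True if this process is being translated by Rosetta 2.

    Uses the macOS sysctl key `sysctl.proc_translated` (documented by
    Apple); on non-macOS systems, or where the key is absent, we
    conservatively return False (an assumption of this sketch).
    """
    try:
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True, text=True, timeout=5,
        )
        return out.stdout.strip() == "1"
    except (OSError, subprocess.SubprocessError):
        return False

print(running_under_rosetta())
```

Run under a Rosetta-translated interpreter this should print True; natively on Apple Silicon, or on any non-Mac platform, it prints False.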

--
Srdja
Werewolf
Posts: 1991
Joined: Thu Sep 18, 2008 10:24 pm

Re: Clarification Question

Post by Werewolf »

smatovic wrote: Fri Jun 06, 2025 10:37 am As of Dec 6, 2024

https://github.com/official-stockfish/S ... 2522933228
Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.
As of today, from the Apple Rosetta documentation:

What Can’t Be Translated?
https://developer.apple.com/documentati ... nvironment
Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.
Idk if there is some technical limitation in the Arm NEON architecture that prevents AVX support from being implemented in Rosetta 2. I assume that Parallels uses Rosetta 2 to run Windows on M-series macOS.

--
Srdja
Thanks.
Hai
Posts: 679
Joined: Sun Aug 04, 2013 1:19 pm

Re: Clarification Question

Post by Hai »

Werewolf wrote: Wed Jun 04, 2025 6:56 pm On MacOS, to run Windows one simply needs to subscribe to Parallels. Within Parallels one can install Windows 11 and run normal chess software, including Stockfish.

Because Parallels is a hypervisor rather than an emulator, I have heard it couldn't handle instructions like AVX2 etc.
In other words Stockfish would run in Windows but using a much slower build.

Can I just double-check this lack of AVX support is still true?
You're asking a very relevant and technical question about the performance implications of running x86 Windows (and thus x86 Stockfish) on Apple Silicon Macs via Parallels, specifically concerning AVX2 support.

Here's the situation based on recent information:

The good news: Parallels Desktop does now support AVX, AVX2, and other extensions for Windows VMs running on Apple Silicon.

A Reddit thread from December 2024 explicitly states: "Parallel Windows VM now supports AVX, AVX2, BMI, FMA, F16C". This is a significant development.
Another source (from November 2024, regarding Windows 11 ARM 24H2 and Parallels 20) mentions "better emulation through Prism, which now supports AVX2 translation".
However, there are still important caveats:

Emulation, Not Native Virtualization (for x86 Windows):
While Parallels is a hypervisor for running ARM-native Windows 11 on Apple Silicon, if you're trying to run x86 Windows applications (like a standard Stockfish.exe built for Intel/AMD CPUs), those applications still need to be translated.
Windows 11 for ARM (the version you run in Parallels on Apple Silicon) has its own x86/x64 emulation layer (similar in concept to Apple's Rosetta 2 for macOS). This emulation layer is what translates the x86 instructions, including AVX2.
So, while AVX2 instructions can now be translated, it's still an emulated environment for those specific x86 instructions, not a direct pass-through of native x86 hardware capabilities.

Performance Impact of Emulation:
Even with AVX2 translation, the performance will likely be slightly slower than a native ARM build of Stockfish running directly on macOS or a native x86 build running on an Intel/AMD CPU.
The emulation layer adds overhead. While a simple instruction might be translated quickly, complex AVX2 operations that would be lightning fast natively on an x86 CPU will incur a little performance penalty during translation.

Optimal Stockfish Builds:
For the best performance on Apple Silicon, the recommended approach is to use a native ARM64 build of Stockfish. Stockfish provides armv8 and newer builds. These builds utilize the ARM architecture's native instruction sets (like NEON, which is ARM's equivalent of SIMD instructions like AVX).
Running the x86 Windows version of Stockfish (even with AVX2 support now translated) will almost certainly be slightly slower than a well-optimized native ARM64 build of Stockfish.

In summary:
You were correct in your initial understanding that there was a lack of AVX/AVX2 support in Windows x86 emulation on Apple Silicon. However, this has recently changed with updates to Windows 11 ARM's emulation layer and Parallels Desktop.

So, yes, Stockfish will now run in Windows on Parallels and can theoretically utilize AVX2 instructions (via translation).

However, it will still likely run slightly slower than a native ARM64 Stockfish build on macOS. For optimal performance on your Mac, sticking with or compiling an armv8 or newer (or specific Apple Silicon) build of Stockfish directly on macOS is generally the superior option.

Take a look at:
viewtopic.php?p=979291#p979291
Wait for the new M5 chip from Apple, expected later this year (around September to November).
Also consider a Mac Studio (double the cores, and so on).

The most important things for you are probably these Stockfish Makefile targets:

# Default for armv9.2 and above
ARCH=armv9.2
ARMV92_GENERIC = yes
...
# Target for armv9.0
ifeq ($(ARCH), armv9)
ARMV9_GENERIC = yes
CXXFLAGS += -march=armv8-a+sve
endif
...
# Target for armv9.2
ifeq ($(ARCH), armv9.2)
ARMV92_GENERIC = yes
CXXFLAGS += -march=armv8-a+sve2
endif
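As a rough illustration of how one might choose an ARCH value when compiling, here is a hypothetical helper. The target names apple-silicon, armv8, x86-64-avx2, and general-32 do appear in the current Stockfish Makefile, but the selection logic below is my own simplified sketch, not official Stockfish build code:

```python
import platform

def guess_stockfish_arch() -> str:
    """Map the host machine to a plausible Stockfish Makefile ARCH value.

    The target names (apple-silicon, armv8, x86-64-avx2, general-32)
    exist in the Stockfish Makefile; the mapping itself is a simplified
    sketch, not official Stockfish build logic.
    """
    machine = platform.machine().lower()
    if machine in ("arm64", "aarch64"):
        # Apple Silicon reports arm64 on macOS; generic ARM boards report aarch64.
        return "apple-silicon" if platform.system() == "Darwin" else "armv8"
    if machine in ("x86_64", "amd64"):
        return "x86-64-avx2"  # assumes AVX2; older CPUs would want plain x86-64
    return "general-32"       # conservative fallback

print(guess_stockfish_arch())
```

The chosen string would then be passed as, e.g., `make -j profile-build ARCH=apple-silicon`.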

Speculated features for that generation: ARMv9.6 architecture, SVE2.2, SVE2, SME2, Rosetta 3, AVX-512, AVX-256, multithreading (SMT / hyperthreading).

SVE2 (Scalable Vector Extension 2), AVX256 (Advanced Vector Extensions 256), and AVX512 (Advanced Vector Extensions 512) are all SIMD (Single Instruction, Multiple Data) instruction sets. They are designed to accelerate computations by performing the same operation on multiple data elements simultaneously. However, they belong to different processor architectures and have distinct characteristics.

Here's a comparison:

1. AVX256 (Advanced Vector Extensions 256)
Architecture: Primarily x86-64 (Intel Haswell/Broadwell, AMD Excavator/Zen and later).
Vector Length: Fixed vector length of 256 bits. This means it can operate on, for example, 8 single-precision floating-point numbers (32-bit each) or 4 double-precision floating-point numbers (64-bit each) in one instruction.
Registers: Uses 256-bit YMM registers.
Key Features:
Enhances AVX with FMA (Fused Multiply-Add) instructions.
Introduces gather instructions for non-contiguous data access.
Improved integer processing capabilities compared to AVX.
Use Cases: General-purpose acceleration for scientific computing, image/video processing, cryptography, and more. Widely adopted due to its presence in many mainstream CPUs.

2. AVX512 (Advanced Vector Extensions 512)
Architecture: Primarily x86-64 (Intel Skylake-X/Xeon Phi, Ice Lake, Sapphire Rapids, and some later AMD Zen processors).
Vector Length: Fixed vector length of 512 bits. This allows it to process twice as much data as AVX256 in a single instruction (e.g., 16 single-precision floats).
Registers: Uses 512-bit ZMM registers (32 of them, compared to 16 YMM/XMM registers for AVX/AVX2).
Key Features:
Doubles the vector register width compared to AVX2.
Introduces new masking capabilities (operations only on selected elements).
Adds new instructions for specific workloads like deep learning (e.g., VNNI - Vector Neural Network Instructions).
Often comes with higher power consumption and potential for clock throttling on some CPUs when heavily utilized.
Use Cases: High-performance computing (HPC), AI/machine learning, scientific simulations, financial modeling, and other highly data-parallel workloads. Adoption is less widespread than AVX2 due to its presence mostly in higher-end or server CPUs.

3. SVE2 (Scalable Vector Extension 2)
Architecture: Primarily ARMv9 (and some later ARMv8-A designs, including some custom Apple Silicon cores that implement aspects of it).
Vector Length: Scalable/Variable vector length (from 128 bits up to 2048 bits). This is its most defining feature. The actual vector length is determined by the hardware implementation, not fixed by the instruction set.
Registers: Uses Z-registers with a scalable length.
Key Features:
Vector Length Agnostic (VLA): Code compiled for SVE2 can run efficiently on any SVE2-capable processor, regardless of its specific vector length. The same binary will adapt to the hardware's capabilities. This is a significant advantage for software development, as developers don't need to write multiple versions for different vector lengths.
Predication: Allows instructions to operate on only a subset of elements based on a predicate register, similar to AVX512's masking but with more flexibility.
Enhanced capabilities for integer processing, bit manipulation, and cryptographic operations.
Introduces specific instructions for various workloads, including DSP (Digital Signal Processing) and ML.
Use Cases: Broad range of applications from embedded systems to supercomputing, particularly strong in workloads requiring flexible vectorization, such as AI, scientific computing, signal processing, and general-purpose code where vectorization is beneficial.

Key Differences & Similarities:
Feature        | AVX256                                  | AVX512                                         | SVE2
Architecture   | x86-64                                  | x86-64                                         | ARMv9 (and some ARMv8)
Vector Length  | Fixed 256 bits                          | Fixed 512 bits                                 | Scalable (128-2048 bits, hardware-defined)
Flexibility    | Low (fixed)                             | Low (fixed)                                    | High (Vector Length Agnostic)
Registers      | YMM (16 registers, 256-bit)             | ZMM (32 registers, 512-bit)                    | Z-registers (32 registers, scalable length)
Masking/Pred.  | Limited (through specific instructions) | Yes (extensive, using dedicated mask registers) | Yes (comprehensive predication)
Complexity     | Moderate                                | High (more instructions, more registers)       | Moderate to High (VLA concept, but simpler once understood)
Power/Thermal  | Generally lower impact                  | Higher power, potential for throttling         | Designed for efficiency across various scales
Software Dev.  | Targets a specific fixed length         | Targets a specific fixed length                | Write once, run everywhere (VLA benefit)
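The fixed widths in the table translate directly into lane counts, which is where the raw throughput ratios come from. A quick arithmetic sketch in Python (register widths as listed above; "SVE2 max" assumes the architectural maximum of 2048 bits, which no shipping core currently implements):

```python
# Lanes per vector register: register width divided by element width.
def lanes(register_bits: int, element_bits: int) -> int:
    return register_bits // element_bits

WIDTHS = {"AVX2 (YMM)": 256, "AVX-512 (ZMM)": 512, "SVE2 max": 2048}
for name, bits in WIDTHS.items():
    print(f"{name}: {lanes(bits, 32)} x float32, {lanes(bits, 64)} x float64")

# Throughput ceiling of a (hypothetical) 2048-bit SVE2 unit vs AVX-512:
print(lanes(2048, 32) / lanes(512, 32))  # 4.0
```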

In essence:
AVX256 and AVX512 are fixed-size vector extensions on the x86 architecture. AVX512 is a superset of AVX256, offering wider vectors and more features but also comes with higher complexity and potential power concerns.
SVE2 (on ARM) represents a paradigm shift with its scalable vector length. This makes it incredibly flexible and future-proof, as code written for SVE2 will automatically take advantage of wider vectors on future hardware without recompilation. This "write once, run anywhere" philosophy for SIMD code is its major differentiating factor and a significant advantage for heterogeneous computing environments.

Theoretical Potential:
In a purely theoretical scenario, where both the hardware truly implements a 2048-bit SVE2 vector unit, and the software is perfectly optimized to use it 100% of the time, yes, you could see a 4x increase in raw vector computation throughput compared to AVX512.

Considering these, a realistic overall speedup for Stockfish on your M5 MAX using an SVE2-optimized build (compared to a highly optimized AVX512 build on a comparable x86 CPU) would likely be in the range of 2x to 3x. This is still an incredibly significant performance gain and highlights the power and efficiency of Apple Silicon and ARMv9's SVE2.
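To make the "vector-length agnostic" idea concrete, here is a toy Python sketch; it is not real SVE2 code, just an illustration of a loop that never hardcodes the vector width and so produces the same result on a narrow or wide "implementation":

```python
# Toy "vector-length agnostic" loop: the code asks the (simulated)
# hardware for its vector width at run time instead of hardcoding it.
def vla_sum(data, hw_vector_bits=512, element_bits=32):
    lanes = hw_vector_bits // element_bits  # decided by hardware, not by code
    total = 0
    for i in range(0, len(data), lanes):
        chunk = data[i:i + lanes]           # slicing stands in for predication
        total += sum(chunk)                 # one "vector add" per chunk
    return total

data = list(range(100))
# Same code, same answer, on a 128-bit and a 2048-bit implementation:
assert vla_sum(data, 128) == vla_sum(data, 2048) == sum(data)
```

The tail of the array (when the length is not a multiple of the lane count) is handled here by Python slicing; in real SVE2 that job is done by predicate registers.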
smatovic
Posts: 3224
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Clarification Question

Post by smatovic »

Werewolf wrote: Fri Jun 06, 2025 7:23 pm
smatovic wrote: Fri Jun 06, 2025 10:37 am As of Dec 6, 2024

https://github.com/official-stockfish/S ... 2522933228
Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.
As of today, from the Apple Rosetta documentation:

What Can’t Be Translated?
https://developer.apple.com/documentati ... nvironment
Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.
Idk if there is some technical limitation in the Arm NEON architecture that prevents AVX support from being implemented in Rosetta 2. I assume that Parallels uses Rosetta 2 to run Windows on M-series macOS.

--
Srdja
Thanks.
Seems I was wrong, and Parallels has its own x86 emulation not relying on Rosetta:
This update finally answers that issue and allows for the installation of x86_64 Linux distros without resorting to Rosetta.
https://www.tomshardware.com/software/o ... ux-distros

As of Jan 2025 the x86 feature was still experimental.
“Performance is slow—really slow. Windows boot time is about 2-7 minutes, depending on your hardware. Windows operating system responsiveness is also low.” So, if you want to play with it right now, you must be patient.
--
Srdja
smatovic
Posts: 3224
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Clarification Question

Post by smatovic »

Followup....

There is Apple Rosetta 2, to run x86 binaries on M-series macOS; it currently does not support AVX and newer instructions, according to Apple documentation.

There is MS Prism, to run x86 binaries in Windows on Arm; AVX2 support looks experimental, according to the web.

There is Parallels x86 emulation, as of Jan 2025 still a preview, according to the web.

Now Idk what your Windows in Parallels utilizes to run x86 binaries.

--
Srdja
Werewolf
Posts: 1991
Joined: Thu Sep 18, 2008 10:24 pm

Re: Clarification Question

Post by Werewolf »

smatovic wrote: Sun Jun 15, 2025 9:05 am Followup....

There is Apple Rosetta 2, to run x86 binaries on M-series macOS; it currently does not support AVX and newer instructions, according to Apple documentation.

There is MS Prism, to run x86 binaries in Windows on Arm; AVX2 support looks experimental, according to the web.

There is Parallels x86 emulation, as of Jan 2025 still a preview, according to the web.

Now Idk what your Windows in Parallels utilizes to run x86 binaries.

--
Srdja
It seems like they are trying.

I ran Hai's post above through ChatGPT, and it disagreed, thinking the post too optimistic, but it does look like they are aiming to solve these issues.

The reason I ask is that I've been burnt by Dell and others in the past, and Apple (for a monstrous fee, admittedly) does look after its customers.