Zobrist keys

Discussion of chess software programming and technical issues.

Moderator: Ras

syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Zobrist keys

Post by syzygy »

hgm wrote: Mon Jun 09, 2025 10:07 pm
syzygy wrote: Mon Jun 09, 2025 7:23 pm To write to a location in RAM, a CPU core first needs to issue a "request for ownership" on the location's cache line. An RFO fetches the cache line from RAM to cache.

It seems Intel has a patent on doing an RFO_NODATA, which does not fetch the cache line's content from RAM. But it seems this was intended for implementing special instructions that perform "non-temporal" memory accesses, such as the MOVNTDQ store and the MOVNTDQA load.
https://www.felixcloutier.com/x86/movntdqa

I think that the optimization you propose does not play well with the strongly ordered memory model of x86/x86-64, unless the CPU (or the compiler) can predict that the full cache line will be written, so that it knows in advance that an RFO_NODATA suffices. But even then there might be complications.
Well, I doubt it. The technique is known as 'write combining', and I get lots of hits on it from Google. E.g. https://stackoverflow.com/questions/772 ... ack-memory .
Writes can be combined, but the process still starts with a "read for ownership", which fetches the cache line even if you end up overwriting it completely. A CPU core cannot write to a cache line before it has exclusive ownership of it. There are some exceptions, such as rep stosb, which on modern CPUs is implemented with microcode that avoids reading the cache line.

("Write-combining memory" is yet another concept. It is used for uncached video memory and is only weakly ordered.)
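For concreteness, here is a minimal x86-64 sketch (mine, not from the thread) of a copy loop built on non-temporal stores, using the baseline SSE2 intrinsic _mm_stream_si128 (MOVNTDQ). The helper name and the alignment/size assumptions are mine:

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Copy n bytes using non-temporal (streaming) stores, i.e. MOVNTDQ.
 * Hypothetical helper: assumes dst is 16-byte aligned and n is a
 * multiple of 16. The loads are ordinary cached loads; only the
 * stores bypass the cache, so no RFO is issued for the destination
 * lines. */
static void memcpy_nt(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence(); /* NT stores are weakly ordered; fence before handing off */
}
```

The sfence at the end matters: non-temporal stores are only weakly ordered, so you have to fence before other code may rely on the data being globally visible.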
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Zobrist keys

Post by syzygy »

See also the top answer here:
Advantages for rep movs

1. When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:

- Avoiding the RFO request when it knows the entire cache line will be overwritten.
...
and a bit later:
The increased throughput of the non-temporal store approaches over the temporal ones is about 1.45x, which is very close to the 1.5x you would expect if NT eliminates 1 out of 3 transfers (i.e., 1 read, 1 write for NT vs 2 reads, 1 write). The rep movs approaches lie in the middle.
With non-temporal stores, copying memory involves 1 read and 1 write per cache line.
With temporal stores, it involves 2 reads and 1 write.
Those 2 reads are the RFO (whose read result is discarded) and the actual read of the source. Apparently, even in a memcpy situation an x86-64 CPU is unable to optimize away the RFO read.
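As a sanity check on those numbers (my own back-of-the-envelope sketch, not taken from the answer), the expected speedup follows directly from counting bus transfers per byte copied:

```c
/* Expected speedup of non-temporal over temporal copying when DRAM
 * bandwidth is the limit: temporal stores move 3 units per byte copied
 * (source read, destination RFO read, destination write), non-temporal
 * stores move 2 (source read, destination write). */
static double expected_nt_speedup(void)
{
    const double temporal_transfers = 3.0;    /* 2 reads + 1 write */
    const double nontemporal_transfers = 2.0; /* 1 read + 1 write  */
    return temporal_transfers / nontemporal_transfers; /* 1.5 */
}
```

The measured 1.45x is close to this ideal 1.5x, which supports the "RFO wastes one of three transfers" explanation.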

Indeed the answer gives this same explanation near the end:
Why I am going on and on about this? Because the best memcpy implementation often depends on which regime you are operating in. Once you are DRAM BW limited (as our chips apparently are, but most aren't on a single core), using non-temporal writes becomes very important since it saves the read-for-ownership that normally wastes 1/3 of your bandwidth. You see that exactly in the test results above: the memcpy implementations that don't use NT stores lose 1/3 of their bandwidth.
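And for reference, the rep movs variant discussed above boils down to something like this on x86-64 with GNU-style inline assembly (a sketch under my own assumptions; whether the microcode actually elides the RFO depends on the CPU generation):

```c
#include <stddef.h>

/* memcpy via REP MOVSB (x86-64, GNU inline asm). Because the CPU sees
 * the whole block length up front in RCX, microcoded implementations
 * ("enhanced REP MOVSB") may avoid the RFO for destination cache lines
 * that will be overwritten entirely. */
static void memcpy_rep(void *dst, const void *src, size_t n)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     : /* no other inputs */
                     : "memory");
}
```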
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist keys

Post by hgm »

What are you quoting from? This doesn't look like Intel docs; it reads more like a report from a user trying to optimize a memcpy by trial and error. If a CPU is smart enough to recognize that the target area of a string instruction contains a complete cache line, it should certainly be smart enough to recognize that an aligned store of a YMM register does so. Why doesn't he consider using those for memcpy?

But all this assumes that writing complete cache lines must be detected at the level of x64 machine code. This seems unlikely: cache access is quite far removed from this level. It takes place at the end of the store queue, i.e. after write combining has had an opportunity to take place. Store operations are presumably only executed when the store micro-op is retired from the re-order buffer, not while it is still being speculatively executed and would overwrite memory locations with data that might never have to be written. So in general there is ample time to combine the results of different store micro-ops while they are waiting for retirement. They might even have to wait in that queue long after that, while cache misses of earlier stores are being processed.
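For concreteness, an aligned vector-store copy of the kind described above might look like this (my own sketch; SSE2 stores rather than YMM so it compiles at the x86-64 baseline, and the alignment/size assumptions are mine). Per the measurements quoted earlier in the thread, even this pattern, which overwrites every destination line completely, still pays for the RFO:

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_store_si128 */
#include <stddef.h>

/* Temporal copy with aligned 16-byte vector stores; four of them
 * overwrite one 64-byte cache line completely. Hypothetical helper:
 * assumes src and dst are 16-byte aligned and n is a multiple of 64. */
static void memcpy_aligned(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i += 4) {
        _mm_store_si128(d + i,     _mm_load_si128(s + i));
        _mm_store_si128(d + i + 1, _mm_load_si128(s + i + 1));
        _mm_store_si128(d + i + 2, _mm_load_si128(s + i + 2));
        _mm_store_si128(d + i + 3, _mm_load_si128(s + i + 3));
    }
}
```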
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Zobrist keys

Post by syzygy »

hgm wrote: Wed Jun 11, 2025 2:02 pm What are you quoting from?
Oops, I somehow forgot to include the URL. I am quoting from Stackoverflow, like you did. I think it was one or two clicks away from your link.
https://stackoverflow.com/questions/433 ... for-memcpy
This doesn't look like Intel docs; it reads more like a report from a user trying to optimize a memcpy by trial and error. If a CPU is smart enough to recognize that the target area of a string instruction contains a complete cache line, it should certainly be smart enough to recognize that an aligned store of a YMM register does so. Why doesn't he consider using those for memcpy?
I only quoted the parts that are relevant for the current discussion. The current discussion is whether a CPU, when executing regular code, can optimise a "read for ownership" request into some sort of "take ownership but do not fetch the cacheline" operation.

As it turns out, x86-64 CPUs executing implementations of memset and memcpy that do not use non-temporal writes and do not use special microcoded instructions (such as rep stosb) do issue "read for ownership" requests even if the cache lines being read are fully overwritten. This is confirmed by measurement.

So the point is not that writes are never combined. The point is that the CPU will still fetch the cacheline being written to when it requests exclusive ownership of that cacheline. Even in the most obvious cases of memset and memcpy.