interesting SMP bug

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: interesting SMP bug

Post by bob »

syzygy wrote:
bob wrote:Don't know where you picked up the assumption about 8 byte atomic writes, but it is wrong.
It's in the Introduction part:
For the remainder of this discussion, assume that 64 bit memory read/write operations are atomic, that is the entire 64 bit value is read/written in one cycle. This leads to the conclusion that a 128 bit read/write is broken up into two 64 bit reads/writes, and therefore this is not atomic since it becomes two distinct cpu-to-memory transactions.
As I already wrote earlier in this thread, there is no indication that this assumption is actually needed for the lockless hashing scheme to work. I tend to agree with you that it is not needed. But I can see how a reader of the paper could be left with the impression that it is required.
That's an assumption to illustrate why the 128 bit read/write is a problem. If 128 bit writes were atomic, there would be no problem, but I've not seen such an architecture to date.

That assumption is irrelevant to lockless hashing. You create a 128 bit hash entry and xor the two halves together. For this position to ever be used, all 128 bits must remain intact. If memory writes are done byte by byte, it still works perfectly, because if ANY byte gets overwritten, a match is impossible (excluding the random change that doesn't screw up the hash signature).

I don't even believe the PC guarantees 8 byte writes to be atomic. In any case, it is unimportant to the idea, which also works for any size tt entry, 16 bytes is not a max, a min, or a requirement.
rtitle
Posts: 33
Joined: Wed Aug 14, 2013 7:23 pm

Re: interesting SMP bug

Post by rtitle »

If you're coding in C++ and you want atomic reads/writes of any primitive type from 1 byte to 8 bytes, you can and in fact *must* use the std::atomic< > template. For example:

std::atomic<unsigned char> uc; // declare 1 byte that is read & written atomically

std::atomic<long long> ll; // declare 8 bytes that are read/written atomically

Why you must use these types: The C++ standard says: If there's no enforced ordering between two accesses to a given memory location from separate threads (i.e. no locks), one or both of the accesses is not atomic (i.e. you did not use std::atomic), and one or both is a write, then this is a data race and the behavior is undefined. "Undefined" means the program can do anything, e.g. the written variable could receive a random value unrelated to what either thread wrote, or the program could even crash as far as the C++ standard is concerned. In practice you can often get away with data races without any of these bad things happening, but strictly speaking your program is not correct if it has data races in it.

See "C++ Concurrency in Action" by Anthony Williams. Excellent book. Chapter 5 talks about atomic operations and types. Later chapters talk about both lock-based and lock-free data structures (such as hash tables, which is what the transposition table is).

Cheers,

Rich Title
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: interesting SMP bug

Post by bob »

rtitle wrote:If you're coding in C++ and you want atomic reads/writes of any primitive type from 1 byte to 8 bytes, you can and in fact *must* use the std::atomic< > template. For example:

std::atomic<unsigned char> uc; // declare 1 byte that is read & written atomically

std::atomic<long long> ll; // declare 8 bytes that are read/written atomically

Why you must use these types: The C++ standard says: If there's no enforced ordering between two accesses to a given memory location from separate threads (i.e. no locks), one or both of the accesses is not atomic (i.e. you did not use std::atomic), and one or both is a write, then this is a data race and the behavior is undefined. "Undefined" means the program can do anything, e.g. the written variable could receive a random value unrelated to what either thread wrote, or the program could even crash as far as the C++ standard is concerned. In practice you can often get away with data races without any of these bad things happening, but strictly speaking your program is not correct if it has data races in it.

See "C++ Concurrency in Action" by Anthony Williams. Excellent book. Chapter 5 talks about atomic operations and types. Later chapters talk about both lock-based and lock-free data structures (such as hash tables, which is what the transposition table is).

Cheers,

Rich Title
Unfortunately the 8-byte assumption does not fix the problem. ttable entries are longer than that, which causes the ttable issue someone mentioned.
rtitle
Posts: 33
Joined: Wed Aug 14, 2013 7:23 pm

Re: interesting SMP bug

Post by rtitle »

For this reason, I manage to squeeze my TT entries into 8 bytes as follows:

- The 64-bit hash is divided into a 31-bit hash index (2 billion TT entries - I have a machine with > 16G memory) and 32 of the remaining 33 bits go into the TT entry itself to detect hash collisions.
- The "best move" (recorded by a previous search of the same position) is encoded into an 8-bit index into the movelist. This relies on the fact that from a given position, the move generator will generate the moves in a deterministic order (i.e. it's sufficient to remember "move index idx was the best-move last time I analyzed this position"; I don't have to store the full move representation of best-move).
- The only other information I store is depth info (what depth search the existing best-move info in this TT entry is based on); this is to help decide when to replace a TT entry. I try to keep higher-quality entries, i.e. those based on deeper searches.

The above assumes my transposition table is used for one purpose only - to remember best-move info from previous iterations of the iterative-deepening search (thus improving alpha-beta efficiency). I do not store evaluations in the TT entry; I decided it's too tricky to figure out when they're reusable (i.e. because of fail-highs and fail-lows, it's complex to figure out if a stored evaluation is actually reusable in the current context). Anyway, optimizing the alpha-beta search is far more useful than optimizing away transpositions. So I'd rather have a larger transposition table just for the former than a smaller one that serves both purposes. Also I do not use my transposition table for repetition detection; for that I simply do a backward walk through the move history back to the last irreversible move, comparing hash values to detect repetition.

I realize this is all rather non-traditional; but I prefer to go my own way in my chess engine. Much more interesting than just cloning other people's ideas. In some later post I'll describe my highly object-oriented way of representing positions and generating moves (also non-traditional; maybe not the speediest but rather elegant).

Cheers,

Rich Title