Problems using a single "byte array" for the transposition table

KhepriChess · Post by **KhepriChess** » Sat Aug 13, 2022 7:38 pm

I'm going down the rabbit hole of trying to add something like Lazy SMP (or some poor-man's version of it) to my engine. Or at least trying to see if a shared hash table will even be feasible.

JavaScript is pretty limited in sharing things across threads and the choices are to either pass around actual objects (terrible for the performance a chess engine needs) or using a "shared array buffer" (basically a byte array that can be accessed by multiple threads without having to pass it around).

Because the shared array buffer works on raw bytes, it's a bit tedious to work with. But it looks something like this:

Code: Select all

// You have to create a buffer and then, to work on it, create a view around it
const sharedBuffer = new SharedArrayBuffer((size * 1024 * 1024) / 16)); // size is defaulted to 32MB
const sharedDataView = new DataView(sharedBuffer);

// Setting an entry in the TT
function Set() {
    // hash is a 64-bit value. Conversions to Number and BigInt here because of how Javascript handles the two
    const index = Number(hash % BigInt(this.sharedDataView.byteLength)); // For 32MB, byteLength = 2097152

    if (score > this.Checkmate) {
        score += ply;
    }

    if (score < -this.Checkmate) {
        score -= ply;
    }

    // For all these functions, the first parameter is the place in the buffer at which the value should be set, the "byte offset"
    sharedDataView.setUint32(index, move, true);
    sharedDataView.setUint8(index + 4, depth); // + 4 to offset from the 32-bit entry above (the offset is in bytes)
    sharedDataView.setUint8(index + 5, flag);
    sharedDataView.setBigUint64(index + 6, hash, true);
    sharedDataView.setInt16(index + 14, score, true);
}

I left out the function to "Get" an entry, but it's basically the same except the "set...()" methods are replaced with "get...()".

For the most part, this actually seems to work. Getting and setting TT entries in the byte array works. But...

Sometimes, the "byte offset" in the set or get methods (e.g. "setUint32(byteOffset, ...)") is greater than 2097152 (or whatever the byteLength of the buffer is). So for example, sometimes "index + 6" is greater than the byteLength of the buffer (it's not always the "index + 6" one, just using that as an example).

I'm sort of lost as to why the index/byte offset I'm getting is larger than it apparently should be. Am I missing something obvious?

algerbrex · Post by **algerbrex** » Sat Aug 13, 2022 8:02 pm

From my understanding, the modulo-ing trick just makes sure you'll always get a value between 0 and the size of the table. But since you're table size is in terms of bytes and not entries, you might sometimes get an index that is within the size of the table but isn't aligned with a full entry, so you run out of bytes.

What you could do is divide the transposition table size in bytes by the size of an entry. This ensures the size is in terms of entries, not discrete bytes, so modulo-ing will always give us a number between 0 and the number of entries available, not the number of bytes available. So you'd avoid the problem of getting an index that is not aligned with a full entry slot. Then, to actually get the right byte offset, since you're still working in discrete bytes, multiply the value times the size of your TT entry.

The only other caveat is that you'll probably want to round down the number of bytes you use to whatever's evenly divisible by your TT entry size, so you're not wasting bytes.

JVMerlino · Post by **JVMerlino** » Sat Aug 13, 2022 8:14 pm

KhepriChess wrote: ↑Sat Aug 13, 2022 7:38 pm I'm going down the rabbit hole of trying to add something like Lazy SMP (or some poor-man's version of it) to my engine. Or at least trying to see if a shared hash table will even be feasible.

Not exactly related to your specific problem, but I can confirm that using shared memory is feasible (I call it "Very Lazy SMP"). In Myrddin, I used processes rather than threads, and all processes have access to the same shared memory for the transposition, eval, and pawn hashes. The parent process is the one that communicates with the GUI, and all the other processes are just put into analysis mode and fed moves as the game progresses, filling up the hash tables, thereby speeding up the parent process' search.

It's certainly not ideal if you want to get the most out of SMP, but CCRL testing at 4CPU shows +80-85 elo. It was pretty quick to implement, and that was my goal.

KhepriChess · Post by **KhepriChess** » Sat Aug 13, 2022 8:28 pm

algerbrex wrote: ↑Sat Aug 13, 2022 8:02 pm What you could do is divide the transposition table size in bytes by the size of an entry. This ensures the size is in terms of entries, not discrete bytes, so modulo-ing will always give us a number between 0 and the number of entries available, not the number of bytes available. So you'd avoid the problem of getting an index that is not aligned with a full entry slot.

Is that not what "(size * 1024 * 1024) / 16)" does? Convert the size (in MB) to bytes and then divide by 16 (16-byte entries). That's what I'm setting the size of the entire table size to, which is how I've seen it does elsewhere.

But does that make sense (and I'm sort of thinking out loud here)? A 32MB table would have 2,097,152 entries (assuming 16-byte entries). So should I be setting the table size to 2,097,152 or to 33,554,432?

KhepriChess · Post by **KhepriChess** » Sat Aug 13, 2022 9:03 pm

I missed the window to edit my above post, but I was going to add:

In Blunder, you do (and in other languages, where you can directly set objects into memory):

Code: Select all

    size := (sizeInMB * 1024 * 1024) / entrySize
    tt.entries = make([]Entry, size)
    tt.size = size

But that's because each Entry is 16-bytes, so you get 16 * (sizeInMB * 1024 * 1024) as the total size of your TT, right?

In my case, I don't have a way to split the array into 16-byte entries (which is really annoying). The array is blind to the size of the data it contains; the array just sees the total size and it can be split up into chunks of whatever, whenever.

So I think I have to set the total size of the TT to sizeInMB * 1024 * 1024 (and not divide it by the entrySize). And then for the index do the modulo thing but then divide that by 16 (and round it down). Does that make sense?

algerbrex · Post by **algerbrex** » Sat Aug 13, 2022 9:21 pm

Ah, I'm sorry, I didn't read your post carefully and missed you're already doing the same as I with regards to getting the correct TT size. So yes you're getting the size correctly.

The other problem I mentioned I think still stands though. Since you're working in individual bytes, you don't just need an index, you need a byte offset. So for whatever index you get, you need to multiply it by the size of your entries to get the correct byte offset in your table.

So I think I have to set the total size of the TT to sizeInMB * 1024 * 1024 (and not divide it by the entrySize). And then for the index do the modulo thing but then divide that by 16 (and round it down). Does that make sense?

I don't quite think that would work. So if you used that approach, the index/byte offset you would be getting would be between 0 and the number of bytes you're working with, not entries. So I think you'd actually run into the issue I thought you were facing but weren't. Namely, you might get a byte offset that's not aligned with the size of a TT entry. So you'd either run out of bytes if the offset was near the end of the array (like you were describing in your original post), or you'd overstep and be using bytes that should belong to another TT entry. And eventually, this would cause you to start overwriting and overlapping other entries.

So I think what you should do is set the byte table size to be sizeInMB * 1024 * 1024, but use sizeInMB * 1024 * 1024 / 16 as the size when calculating an index, so you get a number between 0 and the number of entries you want, which can then be multiplied by the TT entry size to get a correctly aligned byte offset. And the byte offset can then be used to index into the table, without worrying about whether you'll run out of bytes for the current entry.

So say sizeInMB = 1, then sizeInMB * 1024 * 1024 / 16 = 65,536, but the number of bytes is sizeInMB * 1024 * 1024 = 1,048,576. And lets say you get the index 65,535 after modulo-ing. So the last entry in the table. Then to get the right byte offset so you know where to start copying the bytes into the table, multiply the index, 65,535 by your TT entry size, 16, to get 65,535 x 16 = 1,048,560. And you'll notice that starting at the offset gives you exactly 16 bytes to store your entry in, so multiplying the index by the TT entry size aligns your byte offset correctly.

And I realize my earlier caveat was incorrect. You don't need to worry about rounding the number of bytes to a number that's evenly divisible by 16. I wasn't thinking correctly, as that's exactly what using 1024 * 1024 does. It always gives you a factor that's divisible by 16.

KhepriChess · Post by **KhepriChess** » Sat Aug 13, 2022 10:34 pm

Ah, I see what you're saying. Thanks for the example. I'm not sure I would have thought to use two different sizes for specifying the actual array size versus getting the index (and then multiplying the index by the entry size).

It's been a little weird figuring this out, because practically every other language has the concept of creating the array from slices whereas JS does not. It's annoying because JS had this nice idea of adding the ability to work with low-level byte arrays, but then limited you on how you can slice it (only up to 64-bit chunks) and only with raw bytes.

Running a test on that now, and so far it seems promising.

algerbrex · Post by **algerbrex** » Sun Aug 14, 2022 12:13 am

KhepriChess wrote: ↑Sat Aug 13, 2022 10:34 pm Ah, I see what you're saying. Thanks for the example. I'm not sure I would have thought to use two different sizes for specifying the actual array size versus getting the index (and then multiplying the index by the entry size).

It's been a little weird figuring this out, because practically every other language has the concept of creating the array from slices whereas JS does not. It's annoying because JS had this nice idea of adding the ability to work with low-level byte arrays, but then limited you on how you can slice it (only up to 64-bit chunks) and only with raw bytes.

Running a test on that now, and so far it seems promising.

Good to hear, interested to see how it comes out. You've certainly taken on a challenge I wouldn't have personally with using Javascript, due to many issues you'd face versus a more traditional language, including the one you mentioned here. But challenge is half the fun right?

KhepriChess · Post by **KhepriChess** » Sun Aug 14, 2022 6:46 am

Well, after doing a 10,000 game test run, strength is basically the same as just using regular arrays (Elo 3.1 +/ 5.6). Not necessarily a bad thing and I can work with it (and goes to say how well JS engines optimize the code). Now to see whether using the atomic functions slow it down too much...

You've certainly taken on a challenge I wouldn't have personally with using Javascript, due to many issues you'd face versus a more traditional language, including the one you mentioned here. But challenge is half the fun right?

Yup, pretty much. It's thankfully getting better (like the additions of bigints and shared memory), albeit very very slowly.

KhepriChess · Post by **KhepriChess** » Sun Aug 14, 2022 11:07 pm

A bit of a rant:

I swear whoever wrote the spec for ArrayBuffers and Atomics never actually tried using them.

Atomics is a nice "class" that provides static methods to perform atomic operations on a shared memory buffer. Stuff like "Atomics.add()" and "Atomics.load()". Really quite convenient.

Except you can't just pass the buffer into the atomics method. You have to pass a view. And not just any view, it has to be a typed array. i.e. "Atomics.add(buffer, data)" doesn't work, but "Atomics.add(new Uint32Array(buffer), data)" does. And since JS only has typed arrays as large as 64-bits, it gets practically unusable if you need to work with data that's larger than 64-bits, which of course is the case for transposition tables. It's really stupid because you can use the generic DataView on a buffer array, but those read/write operations aren't atomic.

So I don't think Atomics are going to work here. I see there's mention of lock-less methods, primarily the Xor one, on the chessprogramming site, but I don't have a singular 16-byte "data" object to work with.

Problems using a single "byte array" for the transposition table

Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table

Re: Problems using a single "byte array" for the transposition table