Couple of points here.
First, the concept is "fault in". When you malloc/shmget/whatever to allocate a block of memory, that does nothing to determine which node the data ends up on. The exception is the Windows "WinMallocInterleaved()" function, which handles this for you, I assume by touching each block while hopping from one core to the next. So the idea on unix-based systems is to malloc() the TT, but not touch it. Then you spawn the new threads and immediately pin each one to a specific core; CPU_SET (on systems that support this, like Linux) can be used to do the pinning. Once you do that, each thread should touch every PAGE of its chunk of the TT. The easiest way is to zero the thing, since this only gets done once.

The only question is how you decide what goes in each core's local memory. The obvious answer is something that is a multiple of the page size, which can be 4K or something MUCH larger if huge pages are being used. So you could put page N on core 0, page N+1 on core 1, and continue wrapping around until every page has been touched by the correct core. Or you could let each core touch N consecutive pages, so that you have large chunks interleaved over all the cores rather than many more small chunks. If you believe you potentially have hot spots (I have not tested, so I do not know whether this happens), then using 4K chunks makes the most sense.
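A minimal sketch of that fault-in logic, assuming Linux + pthreads; the thread count, chunk size, and TT size here are made-up numbers for illustration, not anything from a real engine:

/* Fault in the TT so its pages are distributed round-robin over the nodes. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 8
#define CHUNK    (2 * 1024 * 1024)      /* touch 2MB chunks; 4K also works */

static char  *tt;                        /* the transposition table */
static size_t tt_size;

static void *fault_in(void *arg)
{
    int id = (int)(long)arg;

    /* Pin this thread to core 'id' FIRST, before touching anything. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Now zero every NTHREADS-th chunk, wrapping around, so those pages
       fault in on this core's local node (first-touch placement). */
    for (size_t off = (size_t)id * CHUNK; off < tt_size; off += (size_t)NTHREADS * CHUNK) {
        size_t len = off + CHUNK <= tt_size ? CHUNK : tt_size - off;
        memset(tt + off, 0, len);
    }
    return NULL;
}

int main(void)
{
    tt_size = 1ull << 30;                /* say, a 1GB TT */
    tt = malloc(tt_size);                /* allocated, but NOT touched here */

    pthread_t th[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, fault_in, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);

    /* ... the search threads get pinned the same way and share tt ... */
    return 0;
}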
Second, so far as I know, ANY machine with a reasonable number of processors is NUMA, even on a single chip. I have not really looked at this since I retired, but I doubt it has changed. So most likely, you have NUMA issues no matter what system you are running. A fully shared memory is a pain to design/build, and putting it on a single chip is a tough (and expensive) proposition, limiting the # of cores due to the data paths required on the chip. For most applications, you can likely ignore this stuff. If you fault in a subset of pages on each core but fail to pin the thread to that core, you gain nothing in TT efficiency, since the thread can later migrate away from the memory it faulted in. There is definitely a gain for thread-local data (in the case of Crafty, "split blocks"). Blocks that contain purely thread-local data really should be on the node with that core.
Finally, there is one confounding issue that is harder to handle. MOST NUMA systems are M x N, that is, M cores per local memory block, with N such blocks (nodes). You really want to get all the threads that frequently share a large block of memory (split blocks and such) on the "connected" cores. Otherwise you have M*N threads on M*N different cores, scattered over the entire system in a really awkward configuration. IE cores i and i+1 assume they are using the same large block of memory, but they are on different nodes. This is pretty processor-specific and is messy to handle. Another issue is global memory containing globally used data, such as magic move indices, etc. Do you keep one copy, or do you create one for each node?
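If you do decide to replicate, libnuma can do it explicitly. A hedged sketch, assuming a recent libnuma is installed (link with -lnuma); "magic_copy", the table size, and the function names are made up for illustration:

#include <numa.h>
#include <string.h>

#define MAGIC_BYTES (1 << 20)            /* size of the read-only table */

static void *magic_copy[64];             /* one copy per node, indexed by node */

void replicate_magics(const void *magic_table)
{
    if (numa_available() < 0)
        return;                          /* not a NUMA system; keep one copy */

    int nodes = numa_num_configured_nodes();
    for (int n = 0; n < nodes; n++) {
        /* numa_alloc_onnode() memory must later be freed with numa_free(). */
        magic_copy[n] = numa_alloc_onnode(MAGIC_BYTES, n);
        memcpy(magic_copy[n], magic_table, MAGIC_BYTES);
    }
}

/* Each search thread, once pinned to core 'cpu', uses the copy on its node.
   numa_node_of_cpu() is also what you'd use to group threads by node for the
   M x N problem above. */
const void *local_magics(int cpu)
{
    return magic_copy[numa_node_of_cpu(cpu)];
}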
It is not a really easy issue to deal with, but it is also not that important; it might not make a lot of difference either way. Most NUMA machines I have used had a BIOS configuration option to use an "SMP" memory model, which simply does interleaved memory automatically, with block 0 going on node 0, block 1 going on node 1, etc. That is better than doing nothing, since otherwise you can end up with all the key data on one node, making it a hot spot. This is common when you initialize large arrays in the original thread before creating the new threads: ALL of that early-initialized data ends up on one node, which is not optimal.
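If the BIOS does not offer that option, libnuma can give you the same round-robin behavior for a single allocation. A small sketch, again assuming libnuma is available:

#include <numa.h>
#include <stdlib.h>

void *alloc_interleaved_tt(size_t bytes)
{
    if (numa_available() < 0)
        return malloc(bytes);            /* not NUMA; plain allocation is fine */

    /* Pages are distributed round-robin over all nodes, like the "SMP" BIOS
       setting, so no single node becomes the hot spot.  Memory from this call
       must be released with numa_free(), not free(). */
    return numa_alloc_interleaved(bytes);
}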
I am sure there is more I forgot when writing the above.
BTW Daniel: Allocating the TT on node 0 is a bad idea. You can completely fill node 0's memory with the TT, and it is going to spill over to other nodes anyway. But now ANY malloc() calls will also have to spill over to other nodes, meaning thread-local stuff will be "over there" rather than "over here". Better to use the fault-in logic above to distribute the TT evenly over all nodes, leaving local memory on each node for any thread-local data you might want. IE some use small eval TTs and try to make those thread-local. Having thread-local memory helps.
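A sketch of that thread-local eval TT idea, assuming the thread has already been pinned as in the fault-in code above; the size and names are illustrative:

#include <stdlib.h>
#include <string.h>

#define EVAL_HASH_BYTES (4 * 1024 * 1024)

void *search_thread(void *arg)
{
    (void)arg;
    /* ... pin this thread to its core first (see the fault-in sketch) ... */

    /* Allocate AND touch the eval hash from inside the pinned thread, so
       first-touch puts it in this node's local memory, not on node 0. */
    char *eval_hash = malloc(EVAL_HASH_BYTES);
    memset(eval_hash, 0, EVAL_HASH_BYTES);

    /* ... search, using eval_hash for this thread only ... */
    free(eval_hash);
    return NULL;
}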
Lucasart: You can set thread affinity before you spawn the threads, if your operating system supports this. Or you can set thread affinity as the VERY FIRST thing the new thread does, before it executes anything else; the O/S will instantly migrate that thread to the selected core. That said, setting it before spawning is better, since it is hard to say what gets executed before control reaches the first line of the procedure you pass to your thread-creation call. You would prefer to keep that unknown code from faulting in stuff on the current node before the affinity moves the thread to a different one.
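Both options look roughly like this on Linux/glibc (both affinity calls are GNU extensions); the function names here are just placeholders:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Option 1: set the affinity in the attributes BEFORE the thread exists,
   so it never runs a single instruction on the wrong core. */
void spawn_pinned(pthread_t *th, void *(*fn)(void *), void *arg, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
    pthread_create(th, &attr, fn, arg);
    pthread_attr_destroy(&attr);
}

/* Option 2: make the pin the VERY FIRST statement of the thread function. */
void *worker(void *arg)
{
    int core = (int)(long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* ... only now start touching thread-local data ... */
    return NULL;
}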