NUMA 101

flok

Re: NUMA 101

Post by flok »

FWIW: on my Celeron laptop I see a roughly 2% speed improvement from using "huge pages". Well, not that huge: the CPU in my laptop only supports 2 MB pages.
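
For anyone who wants to try this, here is a minimal sketch of one way to request 2 MB pages on Linux. It assumes transparent huge pages and uses madvise() with MADV_HUGEPAGE; the name alloc_hash_table and the `hinted` flag are made up for illustration and are not taken from any engine. The hint is best effort, so the fallback is ordinary pages.

```c
#define _GNU_SOURCE /* assumption: glibc on Linux, exposes MADV_HUGEPAGE */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE ((size_t)2 << 20) /* 2 MB, the page size mentioned above */

/* Hypothetical helper: allocate `bytes` rounded up to a 2 MB multiple,
   aligned to 2 MB, and ask the kernel to back the range with transparent
   huge pages. If the hint is refused, the memory is still usable. */
void *alloc_hash_table(size_t bytes, int *hinted) {
    void *p = NULL;
    int ok = 0;

    assert(bytes > 0);
    size_t rounded = (bytes + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

    if (posix_memalign(&p, HUGE_PAGE_SIZE, rounded) != 0)
        return NULL;

#ifdef MADV_HUGEPAGE
    /* Advisory only: returns an error if THP is disabled or unsupported. */
    ok = (madvise(p, rounded, MADV_HUGEPAGE) == 0);
#endif
    if (hinted)
        *hinted = ok;

    memset(p, 0, rounded); /* touch the pages so they are actually mapped */
    return p;              /* release with free() */
}
```

Whether the kernel really used huge pages can be checked afterwards in /proc/meminfo (the AnonHugePages line). The gain depends on how much of the run is spent on TLB misses, so a figure like 2% will vary from machine to machine.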
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: NUMA 101

Post by bob »

zullil wrote:
bob wrote:
zullil wrote:
bob wrote: First, since not all machines support NUMA, I have a -DNUMA Makefile option that turns this on. Leave -DNUMA off, and it doesn't do any NUMA-related tricks at all.
Including -DNUMA leads to the following error. This is on a Linux system:


$ make profile
make -j unix-gcc-profile
make[1]: Entering directory `/home/louis/Documents/Chess/Crafty'
make -j target=UNIX \
		CC=gcc-5 CXX=g++-5 \
		opt='-DTEST -DINLINEASM -DPOPCNT -DCPUS=20 -DAFFINITY -DNUMA' \
		CFLAGS='-Wall -Wno-array-bounds -pipe -O3 -march=native -fprofile-arcs \
		-pthread' \
		CXFLAGS='-Wall -Wno-array-bounds -pipe -O3 -march=native -fprofile-arcs \
		-pthread' \
		LDFLAGS=' -fprofile-arcs -pthread -lstdc++ ' \
		crafty-make
make[2]: Entering directory `/home/louis/Documents/Chess/Crafty'
make[3]: Entering directory `/home/louis/Documents/Chess/Crafty'
gcc-5 -Wall -Wno-array-bounds -pipe -O3 -march=native -fprofile-arcs \
-pthread -DTEST -DINLINEASM -DPOPCNT -DCPUS=20 -DAFFINITY -DNUMA -DUNIX -c crafty.c
g++-5 -c -Wall -Wno-array-bounds -pipe -O3 -march=native -fprofile-arcs \
-pthread -DTEST -DINLINEASM -DPOPCNT -DCPUS=20 -DAFFINITY -DNUMA -DUNIX egtb.cpp
In file included from crafty.c:45:0:
main.c: In function ‘main’:
main.c:4309:26: warning: passing argument 2 of ‘numa_node_to_cpus’ from incompatible pointer type [-Wincompatible-pointer-types]
     numa_node_to_cpus(0, cpus, 64);
                          ^
In file included from main.c:9:0,
                 from crafty.c:45:
/usr/include/numa.h:283:5: note: expected ‘struct bitmask *’ but argument is of type ‘long unsigned int *’
 int numa_node_to_cpus(int, struct bitmask *);
     ^
In file included from crafty.c:45:0:
main.c:4309:5: error: too many arguments to function ‘numa_node_to_cpus’
     numa_node_to_cpus(0, cpus, 64);
     ^
In file included from main.c:9:0,
                 from crafty.c:45:
/usr/include/numa.h:283:5: note: declared here
 int numa_node_to_cpus(int, struct bitmask *);
     ^
make[3]: *** [crafty.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: Leaving directory `/home/louis/Documents/Chess/Crafty'
make[2]: *** [crafty-make] Error 2
make[2]: Leaving directory `/home/louis/Documents/Chess/Crafty'
make[1]: *** [unix-gcc-profile] Error 2
make[1]: Leaving directory `/home/louis/Documents/Chess/Crafty'
make: *** [profile] Error 2
That library call was changed. I don't really use any of those any longer; that one was used strictly to print the greeting noting that this is a NUMA box.

Is this the latest 25.0, or the one I sent you a while back? I did fix that to use a different NUMA library routine, and thought the fix was in 25.0 as released.
Bob, the code that causes this error is in the released 25.0.
OK, it is fixed in 25.1, which will be out pretty soon. That version will also include some more NUMA updates that I was using privately but am now ready to release, since I am sure they work as expected.
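
As the compiler note in the log shows, the installed numa.h declares numa_node_to_cpus(int, struct bitmask *), so the old three-argument call with an unsigned long buffer no longer matches. For a greeting that only needs to report how many CPUs sit on node 0, one way to avoid the library call altogether is to read the kernel's sysfs cpulist file. The sketch below does that; it is not the fix that went into 25.1, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Count the CPUs named in a kernel "cpulist" string such as "0-9,20-29". */
int count_cpus_in_list(const char *list) {
    int count = 0;
    const char *p = list;

    assert(list != NULL);
    while (*p) {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p)
            break;              /* no number here: stop parsing */
        long hi = lo;
        p = end;
        if (*p == '-') {        /* a range "lo-hi" */
            hi = strtol(p + 1, &end, 10);
            if (end == p + 1)
                break;          /* malformed range */
            p = end;
        }
        if (hi >= lo)
            count += (int)(hi - lo + 1);
        if (*p == ',')
            p++;                /* more entries follow */
        else
            break;              /* newline or end of string */
    }
    return count;
}

/* Hypothetical helper: CPUs on `node`, or -1 if sysfs has no such node
   (not Linux, or a kernel without NUMA support). */
int cpus_on_node(int node) {
    char path[64], buf[4096];
    snprintf(path, sizeof path,
             "/sys/devices/system/node/node%d/cpulist", node);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int n = fgets(buf, sizeof buf, f) ? count_cpus_in_list(buf) : 0;
    fclose(f);
    return n;
}
```

A greeting could then print cpus_on_node(0) when it is non-negative and stay silent otherwise. This needs no -lnuma at link time, at the cost of being Linux-specific.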
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: NUMA 101

Post by stegemma »

Thanks for your very interesting post, it is very clear and complete.

A whole unknown world has been opened to me!
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: NUMA 101

Post by bob »

stegemma wrote:Thanks for your very interesting post, it is very clear and complete.

A whole unknown world has been opened to me!
There is a lot to think about. For example, when you use magics, you can end up with ALL of the magic lookup tables on a single node. Good, bad or indifferent? I haven't addressed this yet, but one idea is that such data can easily be duplicated, i.e. generate the original, and then let each thread (on each NUMA node) copy it to a private copy. I have not done this because I am doing this per-thread at the moment, since it is not easy to figure out how many actual NUMA nodes you have and how cache is configured and shared between nodes. The same sort of idea applies to the actual program executable code. Plenty to think about for the next year or two. :)
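
The duplication idea could look roughly like the sketch below. It assumes Linux's default local allocation policy, under which a freshly mapped page normally ends up on the node of the CPU that first touches it, so the copy has to be made by the thread that will use it, after that thread has been pinned. The helper name replicate_table is hypothetical and is not taken from Crafty.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: make a private copy of a read-only lookup table.
   Meant to be called by each search thread itself, after it is pinned,
   so that the memcpy below is the first touch of the new pages.
   Caveat: malloc may hand back memory that was touched earlier on some
   other node, so the placement is likely but not guaranteed. */
uint64_t *replicate_table(const uint64_t *master, size_t entries) {
    assert(master != NULL && entries > 0);

    uint64_t *copy = malloc(entries * sizeof *copy);
    if (!copy)
        return NULL;

    /* Writing the copy is what places its pages; reading `master`
       may cross nodes once, which is a one-time startup cost. */
    memcpy(copy, master, entries * sizeof *copy);
    return copy;
}
```

Each thread would then index its own copy instead of the shared original. Doing this per node rather than per thread saves memory, but it needs the node count and the thread-to-node mapping, which is the part described above as hard to get right.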
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: NUMA 101

Post by stegemma »

bob wrote:
stegemma wrote:Thanks for your very interesting post, it is very clear and complete.

A whole unknown world has been opened to me!
There is a lot to think about. [...] Plenty to think about for the next year or two. :)
This multi-processor and multi-threading stuff seems very important to me for the future of programming. We are talking about games, but I am now applying the multithreading environment written for Satana to a web-application server. Handling multiple clients is in fact somewhat similar to splitting the search across multiple threads, and the database looks like the hash table (but more complex, for some reason). The most important aspect is that to reach a true AI we should push multithreading to the extreme, because the only real AI unit is a brain, with hundreds of billions of simple processing units. Getting all these processing units to work together, sharing memory and other resources, is more than a hard challenge today.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: NUMA 101

Post by wgarvin »

bob wrote:
stegemma wrote:Thanks for your very interesting post, it is very clear and complete.

A whole unknown world has been opened to me!
There is a lot to think about. For example, when you use magics, you can end up with ALL of the magic lookup tables on a single node. Good, bad or indifferent? I haven't addressed this yet, but one idea is that such data can easily be duplicated, i.e. generate the original, and then let each thread (on each NUMA node) copy it to a private copy. I have not done this because I am doing this per-thread at the moment, since it is not easy to figure out how many actual NUMA nodes you have and how cache is configured and shared between nodes. The same sort of idea applies to the actual program executable code. Plenty to think about for the next year or two. :)
Read-only data that stays in a cache (e.g. L2 or L1 cache) shouldn't be much of a problem. As long as nobody writes to those cachelines, they can be stored in multiple caches at once.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: NUMA 101

Post by bob »

wgarvin wrote:
bob wrote:
stegemma wrote:Thanks for your very interesting post, it is very clear and complete.

A whole unknown world has been opened to me!
There is a lot to think about. For example, when you use magics, you can end up with ALL of the magic lookup tables on a single node. Good, bad or indifferent? I haven't addressed this yet, but one idea is that such data can easily be duplicated, i.e. generate the original, and then let each thread (on each NUMA node) copy it to a private copy. I have not done this because I am doing this per-thread at the moment, since it is not easy to figure out how many actual NUMA nodes you have and how cache is configured and shared between nodes. The same sort of idea applies to the actual program executable code. Plenty to think about for the next year or two. :)
Read-only data that stays in a cache (e.g. L2 or L1 cache) shouldn't be much of a problem. As long as nobody writes to those cachelines, they can be stored in multiple caches at once.
Some of it is big, however. For example, the rook magics are in the range of 800 KB, which blows out L1 and L2 (typically the processor-local caches).
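
The 800 KB figure can be checked with a little arithmetic. The sketch below assumes the common "fancy" magic layout, where each square gets its own sub-table of 2^bits entries and every entry is a 64-bit attack mask; that layout is an assumption, not something stated in this thread. A rook's relevant occupancy bits exclude the far edge square in each direction, so a line contributes 6 bits when the rook stands on the edge of it and 5 otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Relevant occupancy bits along one line (rank or file) for a rook whose
   coordinate on that line is `coord` (0..7). */
static int line_bits(int coord) {
    assert(coord >= 0 && coord < 8);
    return (coord == 0 || coord == 7) ? 6 : 5;
}

/* Total entries over all 64 squares: 4 corners with 12 bits, 24 other
   edge squares with 11 bits, 36 inner squares with 10 bits. */
size_t rook_table_entries(void) {
    size_t total = 0;
    for (int rank = 0; rank < 8; rank++)
        for (int file = 0; file < 8; file++)
            total += (size_t)1 << (line_bits(rank) + line_bits(file));
    return total;
}

/* Size in bytes, assuming one 64-bit attack bitboard per entry. */
size_t rook_table_bytes(void) {
    return rook_table_entries() * sizeof(uint64_t);
}
```

That works out to 4 x 4096 + 24 x 2048 + 36 x 1024 = 102400 entries, or 819200 bytes, which is exactly 800 KiB under these assumptions.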