jdart wrote: Lots of people would be happy with 45-50M nps. Crafty is already exceptionally fast in terms of nps compared to most programs.
But I agree YBWC is a bottleneck. I did some instrumentation on my program a while back and found that a significant amount of thread idle time was due to having no suitable thread candidate that could fulfill the YBWC conditions.
NUMA is also a factor. Having a shared hashtable across all NUMA nodes is always going to be a performance hit.
I also have a to-do on my list to use thread-local storage for the various per-thread caches. Currently they are allocated as class variables at startup, but if a thread becomes idle and then active again, it may have been migrated to a different NUMA node, and then its cache is no longer local.
--Jon
I print a percentage that reflects how much time I am sitting with a thread waiting to be "invited in" at a valid YBW position. I simply count the time each thread spends waiting for work, sum those up, and compute the percentage as (elapsed time for all threads (12*time) minus the sum of the waiting times) divided by 12*time. Here are a few samples from a longish game (shorter games are worse):
time=41.49(91%) n=1727396828(1.7B) fh1=91% nps=41.6M 50=1
time=33.66(96%) n=1491342621(1.5B) fh1=90% nps=44.3M 50=2
time=53.98(98%) n=2437374448(2.4B) fh1=90% nps=45.2M 50=0
time=54.44(98%) n=2428656401(2.4B) fh1=90% nps=44.6M 50=1
time=1:03(98%) n=2858692971(2.9B) fh1=91% nps=45.2M 50=0
...
and in the endgame:
time=1:11(83%) n=2883587159(2.9B) fh1=86% nps=40.4M 50=0
time=36.72(85%) n=1541041684(1.5B) fh1=87% nps=42.0M 50=2
time=55.91(85%) n=2563288010(2.6B) fh1=84% nps=45.8M 50=3
time=1:25(89%) n=4283303045(4.3B) fh1=81% nps=50.1M 50=0
time=49.32(87%) n=3058792257(3.1B) fh1=92% nps=62.0M 50=1
time=3.04(77%) n=177852925(177.9M) fh1=99% nps=58.5M 50=0
100 minus the percent above gives the time spent waiting. Those NPS values (at least the early ones) could reach 60M on that hardware, as measured by running 12 independent copies of Crafty at the same time (so no shared data of any kind). The waiting accounts for part of the missing NPS; some sort of cache/memory/contention issue accounts for the rest. The 98% numbers above are VERY good, but there is STILL a missing 15M NPS somewhere that I am going to find as time goes on.
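In pseudo-C the bookkeeping is just this (a sketch; the names are illustrative, not the actual Crafty variables):

/* printed "busy" percentage; 100 minus this is the % of time spent waiting */
double busy_percent(double elapsed, double wait_time[], int nthreads) {
  double total = elapsed * nthreads;      /* 12*time in the numbers above */
  double waiting = 0.0;
  for (int i = 0; i < nthreads; i++)
    waiting += wait_time[i];              /* time thread i sat idle at YBW split points */
  return 100.0 * (total - waiting) / total;
}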
Are you running Linux there? If so, processor affinity is your friend. I have been using it for quite a while: (a) I pin each thread to a single processor, and (b) each thread initializes its local data (split blocks and such) only after it is pinned, so that those pages of memory fault in to the right NUMA memory bank and are always "close by" their primary user.
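On Linux the pattern looks roughly like this (a sketch, not Crafty's actual code; the thread-to-CPU mapping and the split-block size are placeholders):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define SPLIT_BLOCK_SIZE (64 * 1024)      /* placeholder size */

void *thread_start(void *arg) {
  int cpu = (int)(long)arg;               /* assume thread i runs on cpu i */
  cpu_set_t set;

  /* (a) pin this thread to a single processor */
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

  /* (b) only now allocate and touch per-thread data, so first-touch
     places the pages on this CPU's local NUMA node */
  void *split_blocks = malloc(SPLIT_BLOCK_SIZE);
  memset(split_blocks, 0, SPLIT_BLOCK_SIZE);

  /* ... search loop uses split_blocks ... */
  return split_blocks;
}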
I don't think the hash table is a NUMA issue. It is infrequently accessed (at least for me; no hashing in the q-search, which is the biggest part of the tree). You can test the effect by running one process and pinning it to CPU 0. Run a 3-minute search and note the NPS. Now modify your code so that it pins itself to one processor on NUMA node 0 and then mallocs/zeroes the entire hash table. Then re-pin that thread to a different NUMA node and run the same search. Now ALL of your ttable references will be to the wrong NUMA bank. If it hurts NPS, you can measure exactly how much. I suspect you won't see much difference unless you happen to be a "hash q-search" kind of guy.
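A minimal sketch of that test, assuming CPU 0 is on node 0 and CPU 6 is on a different node (check with numactl --hardware); HASH_SIZE and the Search() call are placeholders, not real Crafty identifiers:

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE (1ULL << 30)             /* 1 GB ttable, for illustration */

static void pin_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  sched_setaffinity(0, sizeof(set), &set); /* pid 0 = calling thread */
}

int main(void) {
  pin_to_cpu(0);                           /* start on a node-0 processor */
  char *ttable = malloc(HASH_SIZE);
  memset(ttable, 0, HASH_SIZE);            /* first touch: pages land on node 0 */

  pin_to_cpu(6);                           /* migrate to a CPU on another node */
  /* Search(ttable, 180);                     same 3-minute search; compare NPS
                                              with the run where pages were local */
  free(ttable);
  return 0;
}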