info about zappa on 512 cores ?


syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: transposition tables

Post by syzygy »

Daniel Shawul wrote:Given N small hash tables of the same size, each with K entries, how do you hash a position with Zobrist key hash_key? Assume the small hash tables are physically separate (on different processors, if you will).
If I understand you correctly, the N small hash tables form one large hash table. Each position maps to exactly one entry (or bucket) in this large hash table.

If N = 2^n and K = 2^k, then the large hash table has 2^(n+k) entries (or buckets), so you use n+k bits of the hash key to find this entry of the (fictitious) large hash table. n of those bits determine which of the N small hash tables you use for this position, the other k bits determine the entry within this small hash table.

If N is not a power of two, you could e.g. use the lower k bits to determine the entry (or bucket) within the small hash table, and take ((hash_key >> k) % N) to find the small hash table.

If K is not a power of two either, you can take (hash_key % K) for the entry and ((hash_key / K) % N) for the small hash table.
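
A minimal sketch of both mappings in C, assuming 64-bit Zobrist keys; the function and parameter names are mine, for illustration only:

Code: Select all

#include <stdint.h>

/* Power-of-two case: N = 2^n tables of K = 2^k entries; pure bit masking. */
void map_pow2(uint64_t hash_key, unsigned n, unsigned k,
              unsigned *table, unsigned *entry)
{
    *entry = (unsigned)(hash_key & ((1ULL << k) - 1));         /* low k bits  */
    *table = (unsigned)((hash_key >> k) & ((1ULL << n) - 1));  /* next n bits */
}

/* General case: arbitrary N and K, using modulo and division. */
void map_general(uint64_t hash_key, unsigned N, unsigned K,
                 unsigned *table, unsigned *entry)
{
    *entry = (unsigned)(hash_key % K);
    *table = (unsigned)((hash_key / K) % N);
}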
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: info about zappa on 512 cores ?

Post by Daniel Shawul »

I will do that, I guess. He is currently doing a postdoc and I didn't want to disturb him. It's the best I can do for a guy who found a way out of this CCC grave :)
The performance guidelines for AMD Opteron ccNUMA pretty much say I should make sure local memory is allocated for each thread:
"Don't do explicit memory allocation, the OS does it better."
"Don't set thread affinity, the OS does it better." etc.
So in the end the programmer doesn't need to do much to get optimal performance other than follow the "first touch" rule, and maybe some NUMA-specific cache optimization.
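
For what it's worth, here is a minimal sketch of the "first touch" rule with pthreads (my own illustration, not from the AMD guidelines): pages are committed on the NUMA node of the thread that first writes them, so each thread should initialize the region it will later work on.

Code: Select all

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define CHUNK    (64UL * 1024 * 1024)

static char *buf;

/* Each thread zeroes ("first-touches") its own chunk, so the OS backs
 * that chunk with pages local to the node the thread runs on. */
static void *first_touch(void *arg)
{
    long id = (long)arg;
    memset(buf + id * CHUNK, 0, CHUNK);
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    buf = malloc(NTHREADS * CHUNK);   /* address space only; no pages committed yet */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, first_touch, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    return 0;
}

The benefit assumes each thread keeps working on its own chunk afterwards.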
Anyway, the scale of SMP Zappa used (512 cores) and the Itanium architecture are interesting for knowing the limits of SMP, and for deciding when to go with message passing instead. Surely there will be a lot of inefficiency, but how much? Most (lazy) people here actually use MPI even for NUMA machines, but that probably doesn't translate to chess, as efficiency is what we care about.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: transposition tables

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote:Given N small hash tables of the same size, each with K entries, how do you hash a position with Zobrist key hash_key? Assume the small hash tables are physically separate (on different processors, if you will).
If I understand you correctly, the N small hash tables form one large hash table. Each position maps to exactly one entry (or bucket) in this large hash table.

If N = 2^n and K = 2^k, then the large hash table has 2^(n+k) entries (or buckets), so you use n+k bits of the hash key to find this entry of the (fictitious) large hash table. n of those bits determine which of the N small hash tables you use for this position, the other k bits determine the entry within this small hash table.

If N is not a power of two, you could e.g. use the lower k bits to determine the entry (or bucket) within the small hash table, and take ((hash_key >> k) % N) to find the small hash table.

If K is not a power of two either, you can take (hash_key % K) for the entry and ((hash_key / K) % N) for the small hash table.
I thought about that, but the number of processors is usually not a power of 2. It turns out distributed hash tables have a much more important use in web caching and have been researched well. The key elements are decentralization, fault tolerance, and scalability. So I cannot make assumptions about the total number of hash table entries being a power of 2. A client could disconnect at any time, and the hash table should keep on working with the remaining tables. When you take these facts into consideration, the method with division and modulo is OK. So each small hash table is of equal size with 2^k entries, and the N clients together hold N * 2^k entries.

Code: Select all

processor = hash_key % N;
hash_key  = hash_key / N;   /* a right shift could replace the division only if N were a power of two, but N is dynamic */
entry     = hash_key & ((1ULL << k) - 1);   /* 2^k - 1 as a bit mask */
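
Wrapped in a function for clarity (my own wrapper, assuming 64-bit keys, not engine code):

Code: Select all

#include <stdint.h>

/* Pick the owning processor first, then the entry within that
 * processor's 2^k-entry table. */
void dht_address(uint64_t hash_key, unsigned N, unsigned k,
                 unsigned *processor, uint64_t *entry)
{
    *processor = (unsigned)(hash_key % N);
    hash_key  /= N;                            /* no shift: N is dynamic */
    *entry     = hash_key & ((1ULL << k) - 1);
}

Dividing by N before masking means the bits consumed by the modulo are not reused for the entry index. Note that with plain modulo, a client leaving changes the owning processor of nearly every key, but for a transposition table that just means some entries are temporarily lost, which a cache can tolerate.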
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: info about zappa on 512 cores ?

Post by Daniel Shawul »

I got a couple of suggestions from him now:
a) Avoid a global idle list at all costs, especially on such a high number of processors. Instead, use a scheme where an idle processor polls its neighbors for split points. The polling is done within a distance of 2, 4, 8, 16 to keep processors 'together' and to cut down the amount of work (though I am not sure how). I think this was thoroughly discussed in the recent "YBWC reparenting" thread. I did do something similar, but one that doesn't consider network topology and just sends a "help" message to a randomly selected processor. (See the first sketch after this list.)
b) Two-level hashes: one shared among the processors in a cabinet, which is probed in qsearch, and the other global, checked only if the depth is high enough. I think he did not try a distributed hash table, which is another option. (See the second sketch after this list.)
c) 95% of the time was spent on debugging a parallel search without a global lock, which is a PITA (his words).
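
First sketch, for a): a hypothetical reading of the neighbor-polling scheme in C. poll_for_split() is a stub I made up; in a real engine it would send a message to the target processor and report whether it offered a split point. The exact scheme Zappa used is not described above, so the growing radius is my interpretation of "a distance of 2, 4, 8, 16".

Code: Select all

#include <stdbool.h>

/* Stub standing in for a real poll message to processor 'target'. */
bool poll_for_split(int target) { (void)target; return false; }

/* Grow the search radius as 2, 4, 8, 16, polling only the newly
 * reachable processors, so an idle processor looks for split points
 * close to itself first instead of consulting a global idle list. */
bool find_work(int me, int nprocs)
{
    int prev = 0;
    for (int radius = 2; radius <= 16; radius *= 2) {
        for (int d = prev + 1; d <= radius; d++) {
            if (poll_for_split((me + d) % nprocs) ||
                poll_for_split((me - d + nprocs) % nprocs))
                return true;   /* joined a split point */
        }
        prev = radius;
    }
    return false;              /* stay idle and retry later */
}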
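
Second sketch, for b): the two-level probe logic as I understand it. The table names, stub bodies, and the depth threshold are my assumptions, not Zappa's actual values.

Code: Select all

#include <stdbool.h>
#include <stdint.h>

#define GLOBAL_HASH_MIN_DEPTH 6   /* assumed threshold */

/* Stubs: the local table is shared within one cabinet, the global
 * table by all cabinets over the interconnect. */
bool probe_local(uint64_t key, int *score)  { (void)key; (void)score; return false; }
bool probe_global(uint64_t key, int *score) { (void)key; (void)score; return false; }

/* Always probe the cheap cabinet-local table (including in qsearch);
 * pay for the slow remote probe only when the depth justifies it. */
bool probe_tt(uint64_t key, int depth, int *score)
{
    if (probe_local(key, score))
        return true;
    if (depth >= GLOBAL_HASH_MIN_DEPTH)
        return probe_global(key, score);
    return false;
}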

cheers