jdart wrote: Lots of people would be happy with 45-50M nps. Crafty is already exceptionally fast in terms of nps compared to most programs.
But I agree YBWC is a bottleneck. I did some instrumentation on my program a while back and found that a significant amount of thread idle time was due to having no suitable thread candidate that could fulfill the YBWC conditions.
NUMA is also a factor. Having a shared hashtable across all NUMA nodes is always going to be a performance hit.
I also have a to-do on my list to use thread-local storage for the various per-thread caches. Currently they are allocated as class variables at startup, but if a thread becomes idle and then active again, it may have been migrated to a different NUMA node, and then its cache is no longer local.
--Jon
I print a percentage that reflects how much time I am sitting with a thread waiting to be "invited in" at a valid YBW position. I simply count the time each thread spends waiting for work, sum those up, and compute the percentage as (elapsed time for all threads (12*time) minus the sum of the waiting times) divided by 12*time. Here are a few samples from a longish game (shorter games are worse):
time=41.49(91%) n=1727396828(1.7B) fh1=91% nps=41.6M 50=1
time=33.66(96%) n=1491342621(1.5B) fh1=90% nps=44.3M 50=2
time=53.98(98%) n=2437374448(2.4B) fh1=90% nps=45.2M 50=0
time=54.44(98%) n=2428656401(2.4B) fh1=90% nps=44.6M 50=1
time=1:03(98%) n=2858692971(2.9B) fh1=91% nps=45.2M 50=0
...
and in the endgame:
time=1:11(83%) n=2883587159(2.9B) fh1=86% nps=40.4M 50=0
time=36.72(85%) n=1541041684(1.5B) fh1=87% nps=42.0M 50=2
time=55.91(85%) n=2563288010(2.6B) fh1=84% nps=45.8M 50=3
time=1:25(89%) n=4283303045(4.3B) fh1=81% nps=50.1M 50=0
time=49.32(87%) n=3058792257(3.1B) fh1=92% nps=62.0M 50=1
time=3.04(77%) n=177852925(177.9M) fh1=99% nps=58.5M 50=0
100 minus the percent above gives the time spent waiting. Those NPS values (at least the early ones) could reach 60M on that hardware, as measured by running 12 independent copies of Crafty at the same time (so no shared data of any kind). The waiting accounts for part of the missing NPS; some sort of cache/memory/contention issue accounts for the rest. The 98% numbers above are VERY good, but there is STILL a missing 15M NPS somewhere that I am going to find as time goes on.
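In pseudo-C the bookkeeping is just this (a sketch; the names are illustrative, not the actual Crafty variables):

/* printed "busy" percentage; 100 minus this is the % of time spent waiting */
double busy_percent(double elapsed, double wait_time[], int nthreads) {
  double total = elapsed * nthreads;      /* 12*time in the numbers above */
  double waiting = 0.0;
  for (int i = 0; i < nthreads; i++)
    waiting += wait_time[i];              /* time thread i sat idle at YBW split points */
  return 100.0 * (total - waiting) / total;
}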
Are you running Linux there? If so, processor affinity is your friend. I have been using it for quite a while: (a) I pin each thread to a single processor, and (b) each thread initializes its local data (split blocks and such) only after it is pinned, so that those pages of memory fault in to the right NUMA memory bank and are always "close by" their primary user.
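On Linux the pattern looks roughly like this (a sketch, not Crafty's actual code; the thread-to-CPU mapping and the split-block size are placeholders):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define SPLIT_BLOCK_SIZE (64 * 1024)      /* placeholder size */

void *thread_start(void *arg) {
  int cpu = (int)(long)arg;               /* assume thread i runs on cpu i */
  cpu_set_t set;

  /* (a) pin this thread to a single processor */
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

  /* (b) only now allocate and touch per-thread data, so first-touch
     places the pages on this CPU's local NUMA node */
  void *split_blocks = malloc(SPLIT_BLOCK_SIZE);
  memset(split_blocks, 0, SPLIT_BLOCK_SIZE);

  /* ... search loop uses split_blocks ... */
  return split_blocks;
}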
I don't think the hash table is a NUMA issue. It is infrequently accessed (at least for me; no hashing in the q-search, which is the biggest part of the tree). You can test the effect by running one process and pinning it to CPU 0. Run a 3-minute search and note the NPS. Now modify your code so that it pins itself to one processor on NUMA node 0 and then mallocs/zeroes the entire hash table. Then re-pin that thread to a different NUMA node and run the same search. Now ALL of your ttable references will be to the wrong NUMA bank. If it hurts NPS, you can measure exactly how much. I suspect you won't see much difference unless you happen to be a "hash q-search" kind of guy.
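A minimal sketch of that test, assuming CPU 0 is on node 0 and CPU 6 is on a different node (check with numactl --hardware); HASH_SIZE and the Search() call are placeholders, not real Crafty identifiers:

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE (1ULL << 30)             /* 1 GB ttable, for illustration */

static void pin_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  sched_setaffinity(0, sizeof(set), &set); /* pid 0 = calling thread */
}

int main(void) {
  pin_to_cpu(0);                           /* start on a node-0 processor */
  char *ttable = malloc(HASH_SIZE);
  memset(ttable, 0, HASH_SIZE);            /* first touch: pages land on node 0 */

  pin_to_cpu(6);                           /* migrate to a CPU on another node */
  /* Search(ttable, 180);                     same 3-minute search; compare NPS
                                              with the run where pages were local */
  free(ttable);
  return 0;
}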