Batching improves performance, but running many threads on few cores degrades performance due to
time-sharing of threads. However, this is not a significant factor for an NN engine since almost no
computation is done on the CPU (I used to run alpha-beta on the CPU as well, but not anymore). So you could even launch
128 threads on a single core and get away with it.
There are two key parameters: delay = 0 or 1 milliseconds, and mt = 32 to 256 threads.
Each thread is put to sleep with Sleep(delay) while it is waiting for NN inference results from the GPU.
Without the delay, performance can tank by as much as 100x, and suddenly jump
back up by 100x when the right number of cores is used. Delaying for more than 1 ms doesn't help much, and I think
threads may be forcefully slept for a minimum of 10 milliseconds anyway.
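For concreteness, here is a minimal sketch of how such a wait loop might look; the names (BatchSlot, wait_for_nn_result, publish_nn_result, g_delay_ms) are illustrative assumptions and not the engine's actual code:

Code: Select all

#include <atomic>
#include <chrono>
#include <thread>

// One slot per search thread in the shared NN batch.
struct BatchSlot {
    std::atomic<bool> result_ready{false};
    float value = 0.0f;            // NN evaluation, filled in by the GPU worker
};

static int g_delay_ms = 1;         // the "delay" parameter (0 or 1 ms)

// Search thread side: wait until the GPU worker publishes the result.
float wait_for_nn_result(BatchSlot& slot) {
    while (!slot.result_ready.load(std::memory_order_acquire)) {
        if (g_delay_ms > 0) {
            // Yield the core; the OS may round this up to its timer
            // granularity (often ~10-15 ms), so values above 1 ms add little.
            std::this_thread::sleep_for(std::chrono::milliseconds(g_delay_ms));
        }
        // With g_delay_ms == 0 this loop spins, which is fine only when
        // there are enough cores for all the waiting threads.
    }
    return slot.value;
}

// GPU worker side: store the value, then signal the waiting thread.
void publish_nn_result(BatchSlot& slot, float value) {
    slot.value = value;
    slot.result_ready.store(true, std::memory_order_release);
}

With delay=0 the waiting threads spin, so on too few cores they compete with the thread that actually feeds the GPU, which presumably is why throughput collapses there.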
With 32 threads, delay=0 is actually better, even on a single core:
Code: Select all
         delay=0, mt=32   delay=1, mt=32
1-core   10407             7636
2-core   11528             8320
Code: Select all
         delay=0, mt=64   delay=1, mt=64
1-core   342               16438
2-core   17707             15446
Code: Select all
          delay=0, mt=128   delay=1, mt=128
1-core    161               16532
2-core    376               20114
4-core    21615             23331
8-core    28756             25208
16-core   29804             25199
32-core   29797             24500
Code: Select all
          delay=0, mt=256   delay=1, mt=256
1-core    65                15767
2-core    175               18990
4-core    413               22138
8-core    11924             24816
16-core   27527             26253
32-core   26689             25156
So my questions are:
a) Can you come up with a formula for choosing these parameters to optimize performance? Note that even when using 256 threads, only one GPU is used;
launching many threads is only for the sake of batching.
b) What is the underlying mechanism of thread oversubscription on Linux and Windows, and are there any differences?
regards,
Daniel