Batching improves performance, but running many threads on few cores degrades performance due to
time-sharing of threads. However, this is not a significant factor for an NN engine since almost no
computation is done on the CPU (I used to run alpha-beta on the CPU as well, but not anymore). So you could even launch
128 threads on a single core and get away with it.
There are two key parameters: delay = 0 or 1 milliseconds, and mt = 32 to 256 threads.
Each thread is put to sleep with Sleep(delay) while it is waiting for NN inference results from the GPU.
Without the delay, performance can tank by as much as 100x, and suddenly jump
back up by 100x when the right number of cores is used. Delaying for more than 1 ms doesn't help much, and I think
threads may be forcefully slept for a minimum of 10 milliseconds anyway.
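For concreteness, here is a minimal sketch of how such a wait loop might look; the names (BatchSlot, wait_for_nn_result, publish_nn_result, g_delay_ms) are illustrative assumptions and not the engine's actual code:

Code: Select all

#include <atomic>
#include <chrono>
#include <thread>

// One slot per search thread in the shared NN batch.
struct BatchSlot {
    std::atomic<bool> result_ready{false};
    float value = 0.0f;            // NN evaluation, filled in by the GPU worker
};

static int g_delay_ms = 1;         // the "delay" parameter (0 or 1 ms)

// Search thread side: wait until the GPU worker publishes the result.
float wait_for_nn_result(BatchSlot& slot) {
    while (!slot.result_ready.load(std::memory_order_acquire)) {
        if (g_delay_ms > 0) {
            // Yield the core; the OS may round this up to its timer
            // granularity (often ~10-15 ms), so values above 1 ms add little.
            std::this_thread::sleep_for(std::chrono::milliseconds(g_delay_ms));
        }
        // With g_delay_ms == 0 this loop spins, which is fine only when
        // there are enough cores for all the waiting threads.
    }
    return slot.value;
}

// GPU worker side: store the value, then signal the waiting thread.
void publish_nn_result(BatchSlot& slot, float value) {
    slot.value = value;
    slot.result_ready.store(true, std::memory_order_release);
}

With delay=0 the waiting threads spin, so on too few cores they compete with the thread that actually feeds the GPU, which presumably is why throughput collapses there.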
With 32 threads, delay=0 is actually better, even on a single core:
Code: Select all
         delay=0, mt=32   delay=1, mt=32
1-core   10407             7636
2-core   11528             8320
Code: Select all
         delay=0, mt=64   delay=1, mt=64
1-core   342               16438
2-core   17707             15446
Code: Select all
          delay=0, mt=128   delay=1, mt=128
1-core    161               16532
2-core    376               20114
4-core    21615             23331
8-core    28756             25208
16-core   29804             25199
32-core   29797             24500
Code: Select all
          delay=0, mt=256   delay=1, mt=256
1-core    65                15767
2-core    175               18990
4-core    413               22138
8-core    11924             24816
16-core   27527             26253
32-core   26689             25156
So my questions are:
a) Can you come up with a formula for choosing these parameters to optimize performance? Note that even when using 256 threads, only one GPU is used;
launching many threads is only for the sake of batching.
b) What is the underlying mechanism of thread oversubscription on Linux and Windows, and are there any differences?
regards,
Daniel