Jetson GPU architecture

Dann Corbit · Post by **Dann Corbit** » Tue Oct 18, 2016 10:25 pm

From the thesis "ASTRO - A LOW-COST, LOW-POWER CLUSTER FOR CPU-GPU HYBRID COMPUTING USING THE JETSON TK1" by Sean Sheen:

"The biggest advantage of the Jetson TK1 is its ability to support zero-copy mechanisms in CUDA. As mentioned previously most CUDA programs required memory to be transferred to and from the GPU which can incur high overhead costs. The TK1 avoids this cost by having the GPU and CPU share the same memory space, instead of transferring memory from the CPU to the GPU, only the pointer of the memory needs to be transferred and no additional space needs to be allocated. By having this
zero-copy mechanism the TK1 can support a wider variety of GPU programming due to the elimination of one of the biggest bottlenecks in CUDA programming."

Sounds like it would be useful for chess.

smatovic · Post by **smatovic** » Thu Oct 20, 2016 4:00 pm

The Jetson TK1 is an embedded system with a Tegra CPU and 192 CUDA Kepler cores; the Jetson TX1 has 256 CUDA Maxwell cores.

For comparison, the GeForce GTX 1080 has 2560 CUDA cores.

...not sure if a cluster of these embedded systems would pay off.

Maybe Ankan can elaborate on the Unified Memory features of different CUDA architectures?

http://www.nvidia.com/object/jetson-tk1 ... v-kit.html
http://www.nvidia.com/object/jetson-tx1-module.html
https://en.wikipedia.org/wiki/List_of_N ... _10_series
https://streamcomputing.eu/blog/2013-11 ... e=facebook

Dann Corbit · Post by **Dann Corbit** » Thu Oct 20, 2016 6:19 pm

The thing that is exciting is not the horsepower but the architecture.
A unified memory space means that you don't have to transfer anything.

I have read a lot of papers in which the transfer of data to and from the GPU card is what makes the break-even point so hard to reach. For example, a 4x4 matrix multiply on a current GPU system would be smoked by the same operation on your CPU: the GPU can do the arithmetic much faster, but the time to transfer the data to and from the board swamps that advantage.
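To make the break-even argument concrete, here is a toy cost model. Every rate and latency in it (10 GFLOP/s CPU, 1 TFLOP/s GPU, 10 GB/s bus, 10 µs fixed overhead) is a made-up placeholder rather than a measurement of any real system, and the function names are purely illustrative.

```python
# Toy cost model for "is the GPU worth it?" on an n x n matrix multiply.
# All rates and latencies below are illustrative assumptions.

def matmul_cost(n, bytes_per_element=4):
    """Return (flops, bytes_moved) for an n x n single-precision multiply."""
    flops = 2 * n ** 3                           # roughly n^3 multiplies + n^3 adds
    bytes_moved = 3 * n * n * bytes_per_element  # send A and B, fetch C
    return flops, bytes_moved

def cpu_seconds(n, cpu_flops=10e9):
    """Time on the CPU: pure arithmetic, data is already in host memory."""
    flops, _ = matmul_cost(n)
    return flops / cpu_flops

def gpu_seconds(n, gpu_flops=1000e9, bus_bytes_per_s=10e9, fixed_overhead_s=10e-6):
    """Time on the GPU: fixed launch/copy overhead + bus transfer + arithmetic."""
    flops, bytes_moved = matmul_cost(n)
    return fixed_overhead_s + bytes_moved / bus_bytes_per_s + flops / gpu_flops

def break_even_n(max_n=4096, **gpu_params):
    """Smallest n at which the modelled GPU beats the modelled CPU, or None."""
    for n in range(1, max_n + 1):
        if gpu_seconds(n, **gpu_params) < cpu_seconds(n):
            return n
    return None

# 4x4 case: the CPU wins by a wide margin under these assumptions.
print("4x4  cpu:", cpu_seconds(4), "gpu:", gpu_seconds(4))

# Discrete card (copy over the bus) versus a shared-memory system, modelled
# here as "no copy" (infinite bus bandwidth) and a smaller fixed overhead.
print("break-even, discrete:", break_even_n())
print("break-even, shared  :", break_even_n(bus_bytes_per_s=float("inf"),
                                            fixed_overhead_s=1e-6))
```

Under these assumed numbers the fixed overhead, not the bandwidth, dominates for tiny problems, so removing the copy only helps much if the per-call latency also drops.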

If the same technological innovation makes its way to the big GPU systems, then we would see a marvelous increase in compute power from truly heterogeneous CPU/GPU compute engines.

ankan · Post by **ankan** » Thu Oct 20, 2016 6:34 pm

I haven't personally used any of these embedded systems with unified CPU and GPU memory, but as Srdja pointed out, they are very low-end/small machines compared to discrete PCIe GPUs (at least an order of magnitude slower).

In some situations, where you need to pass a lot of data or need tightly coupled communication between CPU and GPU, unified memory might be very helpful.

With older GPUs I have encountered cases where you need somewhat tightly coupled interaction between CPU and GPU to launch more work depending on the results of previous work. However, with CUDA dynamic parallelism, you can launch work on the GPU itself, so there is no need to go back and forth.
https://devblogs.nvidia.com/parallelfor ... rinciples/

For chess (or perft) one approach could be to just pass the board position to the GPU, perform the full search on the GPU itself and get the output back (perft value, best move, PV, score, etc). The data passed between CPU and GPU is very small, so the transfer time should be negligible compared to the time the GPU takes to explore the search tree.
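To give a feel for how small that payload is, here is a sketch that packs a position into a flat byte buffer. The layout (12 piece bitboards plus a few state bytes) and the bus figures are assumptions chosen for illustration, not the format used by any particular GPU perft or search program.

```python
import struct

# Hypothetical wire layout, little-endian with no padding:
#   12 x uint64  piece bitboards (white P N B R Q K, then black P N B R Q K)
#   4  x uint8   side to move, castling rights, en-passant square, halfmove clock
#   1  x uint16  fullmove number
POSITION_FORMAT = "<12Q4BH"

# Starting position, with a1 = bit 0 and h8 = bit 63.
START_BITBOARDS = [
    0xFF00, 0x42, 0x24, 0x81, 0x08, 0x10,        # white P N B R Q K
    0xFF << 48, 0x42 << 56, 0x24 << 56,          # black P N B
    0x81 << 56, 0x08 << 56, 0x10 << 56,          # black R Q K
]

def pack_position(bitboards, side_to_move, castling, ep_square,
                  halfmove_clock, fullmove_number):
    """Serialize a position into the hypothetical flat layout above."""
    return struct.pack(POSITION_FORMAT, *bitboards, side_to_move, castling,
                       ep_square, halfmove_clock, fullmove_number)

def transfer_seconds(nbytes, bus_bytes_per_s=10e9, fixed_latency_s=10e-6):
    """Assumed cost of one host-to-device copy: fixed latency + bytes / bandwidth."""
    return fixed_latency_s + nbytes / bus_bytes_per_s

# ep_square = 64 is used here to mean "no en-passant square".
payload = pack_position(START_BITBOARDS, side_to_move=0, castling=0b1111,
                        ep_square=64, halfmove_clock=0, fullmove_number=1)
print("payload bytes:", len(payload))
print("assumed copy time (s):", transfer_seconds(len(payload)))
```

With a payload of roughly a hundred bytes, the modelled copy time is essentially just the fixed latency, which is tiny next to a search that runs for seconds.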

Another approach could be to perform the main search on the CPU (e.g. using standard serial alpha-beta/PVS algorithms, where the CPU is significantly faster), and use the GPU for certain tasks, like running a really big/complex evaluation function (e.g. one could possibly run a very big neural network quite efficiently on the GPU). Here you would be calling the GPU very often (like once for every leaf position) and unified memory might be very helpful.

Note that by unified memory above I mean memory that is physically unified and accessible at high bandwidth by both GPU and CPU.
Recent versions of CUDA also have a concept of 'unified memory' for systems with discrete PCIe GPUs: although there are separate system and device memories, the programmer sees a single unified virtual address space, and the GPU driver/OS tries to move data automatically behind the scenes based on where it is being used. This feature is mostly for programming convenience and IMO unlikely to be very performant.

ankan · Post by **ankan** » Thu Oct 20, 2016 7:17 pm

Dann Corbit wrote: If the same technological innovation makes its way to the big GPU systems, then we would see a marvelous increase in compute power from truly heterogeneous CPU/GPU compute engines.

Actually there are slightly bigger systems that already have similar technology - game consoles.
For example, the PS4 has a GPU rated at 1.84 teraflops and 8 x86-64 AMD CPU cores connected to 8 GB of unified memory that the GPU can access at 176 GB/s.
This is quite a bit faster than the Jetson TX1 (512 GFLOPS / 26 GB/s) but still nowhere near the fastest discrete GPUs like the Titan X (11 TFLOPS / 480 GB/s).
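As a quick back-of-the-envelope comparison, the snippet below takes only the peak figures quoted above and works out how each system compares to the Jetson, plus the ratio of peak compute to memory bandwidth. These are vendor peak numbers, so treat the ratios as rough guides rather than achievable performance.

```python
# Peak figures as quoted in the post: (peak FLOP/s, memory bandwidth in bytes/s).
SYSTEMS = {
    "Jetson TX1": (512e9, 26e9),
    "PS4": (1.84e12, 176e9),
    "Titan X": (11e12, 480e9),
}

def flops_per_byte(peak_flops, bytes_per_s):
    """Peak arithmetic available per byte of memory traffic."""
    return peak_flops / bytes_per_s

def relative_to(baseline, name):
    """Return (compute ratio, bandwidth ratio) of `name` versus `baseline`."""
    base_flops, base_bw = SYSTEMS[baseline]
    flops, bw = SYSTEMS[name]
    return flops / base_flops, bw / base_bw

for name, (flops, bw) in SYSTEMS.items():
    compute_x, bandwidth_x = relative_to("Jetson TX1", name)
    print(f"{name:10s}  {flops_per_byte(flops, bw):5.1f} FLOP/byte  "
          f"{compute_x:5.1f}x compute  {bandwidth_x:5.1f}x bandwidth vs TX1")
```

On these numbers the PS4 has the lowest compute-to-bandwidth ratio of the three, i.e. relatively the most memory bandwidth per unit of arithmetic.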

One problem with game consoles is that there are no programming tools available for general-purpose compute (like CUDA or OpenCL).
Using the same tools that are used for building 3D games might work for general-purpose compute too, but they tend to have a steep learning curve.

bhamadicharef · Post by **bhamadicharef** » Wed Dec 14, 2016 4:18 am

APALIS > https://www.toradex.com/computer-on-mod ... a-tegra-k1 ... One could stack many of them in a rack with a small switch and get a nice chess engine once OpenCL implementations are more mature ... Stacking many of them in a rack reminds me of the FPGA-based COPACOBANA and RIVYERA boxes at http://www.sciengines.com !
