No need, as the renamed DLLs worked fine. I am downloading it just to compare in case, but either way, thanks. The speed-up is insane.

Milos wrote: Download 9.0 directly; it is available from NVIDIA, just search the other releases.

Albert Silver wrote: OK, that worked, thanks. I had downloaded CUDA 9.1, so I renamed the files to 90 where needed. How do I do a full tune? Or is that no longer done?

Milos wrote: You need to compile a proper executable; however, there are already precompiled builds available.

Albert Silver wrote: Is it enough to download and install CUDA, or do I need something else such as a special command line or executable? If so, can you point the way?

Milos wrote: You should look at the Windows version of LC0-cudnn on the Titan V, since the CUDA libs for Windows are obviously better for tensor cores (Titan V) than the Linux ones.

Werewolf wrote: How do you work that out? The Titan V isn't 2.2x faster than a 1080 Ti according to either the GFLOPS (12300 vs 10600 respectively) or the benchmarks here

Milos wrote:
Judging by the current proper benchmark (LC0 on cuDNN), the Titan V is only 3.5 times faster than a GTX 960, and 2.2 times faster than a 1080 Ti. And a GTX 960 is at least 15 times cheaper than a Titan V.
https://docs.google.com/spreadsheets/d/ ... edit#gid=0
Unless I'm misreading them.
Look here:
https://github.com/mooskagh/leela-chess/tree/master/lc0
Tuning is not required since cuDNN is already tuned to your specific GPU.
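A quick back-of-the-envelope check of the price/performance claim above, using only the figures quoted in the post (3.5x speed advantage for the Titan V, at least 15x price advantage for the GTX 960):

```python
# Sanity check of the quoted claim: Titan V ~3.5x the LC0 speed of a GTX 960,
# at ~15x the price ("at least 15 times cheaper"). All figures are from the
# post above, not independent measurements.
titan_speed = 3.5   # LC0 speed relative to a GTX 960
titan_price = 15.0  # price relative to a GTX 960

# Speed per unit of money, GTX 960 normalized to 1.0 / 1.0
gtx960_value = 1.0 / 1.0
titan_value = titan_speed / titan_price

print(round(gtx960_value / titan_value, 1))  # -> 4.3
```

So per dollar the GTX 960 comes out roughly 4x ahead, if the quoted numbers hold.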
how good is a GeForce GTX 1060 6GB for Leela ?
Re: how good is a GeForce GTX 1060 6GB for Leela ?
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
You might actually gain some speed by renaming from 9.1, since there are some speed-ups in 9.1 compared to 9.0, but mainly on Tesla V100; not sure about "normal" GPUs.

Albert Silver wrote: No need, as the renamed DLLs worked fine. I am downloading it just to compare in case, but either way, thanks. The speed-up is insane.
The library name is hardcoded, but the actual dependency on it is not, which is why renaming works.
We will get more performance once NVIDIA releases 9.2, since they claim a 2-5x speed-up in HPC deep-learning convolutions.
OTOH, more speed can still be gained when the implementation is changed from cuDNN to TensorRT. Hopefully Alexander Lyashuk does it soon; so far he has done a great job of completely rewriting LC0 for CUDA.
Gian-Carlo's hand-written OpenCL implementation is really no match for NVIDIA's specialized libraries; that's why there is so much speed-up.
I'm really shocked at how big the speed-up is. If I run it for 30 seconds (the 26-ply limit is reached too early now) it shows nearly 12.5 kNPS on my GTX 1060, and with just the 26-ply limit it is 9.5 kNPS. I'll run some tests to see if this pans out in results.

Milos wrote: You might actually gain some with renaming it since 9.1, since there are some speed-ups in 9.1 compared to 9.0, but mainly on Tesla V100, not sure for "normal" GPUs.
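For anyone wanting to script this kind of comparison: lc0 speaks UCI, which reports speed in "info ... nps ..." lines during search. A minimal sketch of pulling the last reported figure out of captured engine output (the sample lines below are made up for illustration; real lines carry more fields):

```python
import re

# Extract the last nodes-per-second figure from UCI "info" lines, as printed
# by a UCI engine such as lc0 during search.
def last_nps(uci_output):
    """Return the last 'nps' value seen in the output, or None if absent."""
    nps = None
    for line in uci_output.splitlines():
        m = re.search(r"\bnps (\d+)\b", line)
        if m:
            nps = int(m.group(1))
    return nps

sample = (
    "info depth 10 nodes 95000 nps 9500 pv e2e4\n"
    "info depth 12 nodes 375000 nps 12500 pv e2e4 e7e5\n"
)
print(last_nps(sample))  # -> 12500
```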
I guess that you have to program specifically for the tensor cores.

Milos wrote: A 4x4 one, to be precise. Since the LC0 kernel is 3x3, there is roughly only 1/3 efficiency ((27+9*2)/(81+16*3) operations) when running LC0 only on tensor cores, assuming of course that they are fully loaded and that the cuDNN libs are efficient for them, which is a big question mark at the moment.

Dann Corbit wrote: They can multiply a small matrix in a single cycle (the tensor cores).

Werewolf wrote: The Titan V also "only" has 640 of them. I suspect its successor will cram in many more.

Dann Corbit wrote: Titan V has tensor cores.
But they are fiddly to use and you have to program especially for them.
If they were going to run on that hardware, it certainly makes sense to change to a 4x4 kernel. That is a very good point.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
3x3 kernels are extremely common in machine learning for good reason, so I think both the design engineers and the driver programmers at Nvidia are way ahead of Milos there. They probably use Winograd transforms to compute 3x3 kernels, just like the Leela Zero OpenCL implementation (which is not nearly as bad as Milos claims, I might add).

Dann Corbit wrote: I guess that you have to program specifically for the tensor cores.
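For the curious, the Winograd scheme jkiliani mentions (F(2x2, 3x3), from Lavin and Gray's paper) computes a 2x2 output tile of a 3x3 convolution using a 4x4 element-wise product, i.e. 16 multiplications instead of 36, which is also why 4x4-shaped hardware is not an awkward fit for 3x3 kernels. A minimal NumPy sketch using the standard transform matrices:

```python
import numpy as np

# Winograd F(2x2, 3x3): a 2x2 tile of a 3x3 cross-correlation via a 4x4
# element-wise product. Transform matrices are the standard ones from
# Lavin & Gray, "Fast Algorithms for Convolutional Neural Networks".
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile."""
    U = G @ g @ G.T        # transformed filter (4x4)
    V = B_T @ d @ B_T.T    # transformed input tile (4x4)
    return A_T @ (U * V) @ A_T.T  # the 16 products in U*V are the only "real" multiplies

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))

# Direct 3x3 cross-correlation over the same tile, for comparison
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
print(np.allclose(winograd_f2x2_3x3(d, g), direct))  # -> True
```

Whether cuDNN actually does this on tensor cores is, as the thread says, not publicly known; this only shows the arithmetic trick itself.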
Haha, sure you know what they use in CUDA; you probably had a glimpse into its source? Well, no.

jkiliani wrote: 3x3 kernels are extremely common in machine learning for good reason, so I think both the design engineers and the driver programmers at Nvidia are way ahead of Milos there. They probably use Winograd transforms to compute 3x3 kernels just like the Leela Zero OpenCL implementation.

Your knowledge of kernel arithmetic doesn't get further than Winograd transforms, does it?
You probably read that one paper from Lavin; I guess that makes you an expert on ML hardware acceleration.
Regarding Gian-Carlo's hand-written OpenCL implementation: sure, it is not that bad, it is only 8x slower than the cuDNN implementation with an appropriate batch size. Well, if 8x is "not bad", I wonder what you would call bad?
That's the kind of speed-up you get going from Intel clDNN on integrated 6xx-series graphics to a 1080 Ti on OpenCL.
Albert, what are we saying here?
You replace the included DLLs with ones from Nvidia and you get an 8x speedup?
Just like that? Why don't they change the GPU download package then to reflect this discovery?
Are you saying an Nvidia DLL would be suitable for all graphics cards out there?

Werewolf wrote: Albert, what are we saying here?
You replace the included DLLs with ones from Nvidia and you get an 8x speedup?
Just like that? Why don't they change the GPU download package then to reflect this discovery?
My engine was quite strong till I added knowledge to it.
http://www.chess.hylogic.de
I guess the OpenCL version also works with AMD GPUs, while CUDA does not. Since the project depends on voluntary contributions, it may make sense not to shut out a considerable number of potential volunteers.

Milos wrote: Gian-Carlo's hand-written OpenCL implementation is really not a match for NVIDIA specialized libraries.
I think the point Milos was making is that the tensor cores do not perform a 3x3 multiply; they perform a 4x4 multiply.

jkiliani wrote: They probably use Winograd transforms to compute 3x3 kernels just like the Leela Zero OpenCL implementation.
Now, a 3x3 matrix fits into a 4x4 matrix, so you can still multiply it in one cycle. But what is missing is that with a 4x4 kernel you would get:
2 * (4 * 4 * 4) - 4 * 4 = 112 operations in one cycle
versus
2 * (3 * 3 * 3) - 3 * 3 = 45 operations in one cycle
So you are getting 45/112 = 40% of the compute power available.
This assumes that an NxN matrix multiply costs 2N^3 - N^2 operations (N^3 multiplications plus N^2 * (N - 1) additions; a typical count, and I doubt you can do better on such a small matrix).
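Those counts check out; a quick sanity check under the same 2N^3 - N^2 assumption:

```python
# Verify the operation counts above, under the stated assumption that an NxN
# matrix multiply costs 2*N**3 - N**2 operations (N**3 multiplications plus
# N**2 * (N - 1) additions).
def matmul_ops(n):
    return 2 * n**3 - n**2

ops_4x4 = matmul_ops(4)  # work a 4x4 tensor-core multiply can deliver
ops_3x3 = matmul_ops(3)  # work a 3x3 multiply embedded in it actually uses

print(ops_4x4, ops_3x3, round(ops_3x3 / ops_4x4, 2))  # -> 112 45 0.4
```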