lc0-win-20180512-cuda90-cudnn712-00


Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by Milos »

Laskos wrote: Tue May 22, 2018 1:56 pm
Milos wrote: Tue May 22, 2018 1:05 pm
Werewolf wrote: Tue May 22, 2018 12:38 pm

Using the latest CUDA, how much stronger would you say LCZero is than the normal package, which on my 1060 card runs at about 800 nps?
~80 Elo
Add another 80-100 for Cpuct and FPU settings, different from defaults.
PUCT and FPUR additional settings bring very little if anything unless you go to really long TC.
shrapnel
Posts: 1339
Joined: Fri Nov 02, 2012 9:43 am
Location: New Delhi, India

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by shrapnel »

Milos wrote: Tue May 22, 2018 2:13 pm
PUCT and FPUR additional settings bring very little if anything unless you go to really long TC.
How long, in your opinion?
i7 5960X @ 4.1 Ghz, 64 GB G.Skill RipJaws RAM, Twin Asus ROG Strix OC 11 GB Geforce 2080 Tis
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by Laskos »

Milos wrote: Tue May 22, 2018 2:13 pm
Laskos wrote: Tue May 22, 2018 1:56 pm
Milos wrote: Tue May 22, 2018 1:05 pm
~80 Elo
Add another 80-100 for Cpuct and FPU settings, different from defaults.
PUCT and FPUR additional settings bring very little if anything unless you go to really long TC.
I had the impression that it gave me a lot, when I first started experimenting with my GPU, at something like 1'+ 1'' TC. Will have to check.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by Milos »

Laskos wrote: Tue May 22, 2018 2:36 pm
Milos wrote: Tue May 22, 2018 2:13 pm
Laskos wrote: Tue May 22, 2018 1:56 pm
Add another 80-100 for Cpuct and FPU settings, different from defaults.
PUCT and FPUR additional settings bring very little if anything unless you go to really long TC.
I had the impression that it gave me a lot, when I first started experimenting with my GPU, at something like 1'+ 1'' TC. Will have to check.
I ran the newest Lc0-cudnn with defaults vs. Lc0-cudnn with PUCT=3, FPUR=0 at 1'+0.6'' TC on a GTX 770, which is roughly a half to a third of the speed of a GTX 1060, and got only +14 Elo for the new settings after 500 games.
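
As a rough sanity check on what 500 games can resolve, here is a minimal Python sketch of the usual Elo arithmetic (the ~30% draw rate and the 95% z-value are assumptions, not measurements from that match):

import math

def elo_from_score(score):
    # Convert a match score fraction (0..1) into an Elo difference.
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_error_bar(score, draw_ratio, games, z=1.96):
    # Approximate 95% error bar on the Elo difference, using the usual
    # trinomial model: each game scores 1, 0.5 or 0.
    win = score - draw_ratio / 2.0
    variance = win + 0.25 * draw_ratio - score ** 2
    stdev = math.sqrt(variance / games)
    return (elo_from_score(score + z * stdev) - elo_from_score(score - z * stdev)) / 2.0

score = 1.0 / (1.0 + 10 ** (-14 / 400.0))        # score corresponding to +14 Elo
print(round(elo_from_score(score), 1))           # ~14.0
print(round(elo_error_bar(score, 0.3, 500), 1))  # ~25.5

Under those assumptions a 500-game match carries an error bar on the order of +/-25 Elo.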
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by Gian-Carlo Pascutto »

Albert Silver wrote: Mon May 21, 2018 4:39 am
It's a good theory, except that I tested all my PUCT values at 3+0 and 5+0, and then proposed them to GCP. He in turn tested them at very fast TCs, but stopped the test early due to disastrous results. The lower PUCT value was stronger at very short TCs, while the higher PUCT values only shone at longer TCs.
I don't remember the exact discussion but note that I was likely talking about the regular version, not the cuDNN one.

Things get significantly more complex if you use the cuDNN version with batching. Essentially, batching forces you to evaluate a lot of positions at once. It's a bit like getting a 256-core machine. Now, the best parameters for your serial search might not be the best ones to get the most Elo improvement on the 256-core one.

In theory the cuDNN version should be weaker at the same NPS because batching requires you to parallelize the search (regular Leela Chess Zero uses batch size=1), and MCTS or not, this costs efficiency. But in the case of chess + cuDNN the gain is large enough that you get a huge speedup. And because Leela Zero is still a bit blind for some things, it seems to be turning out that going very wide is plugging some of those holes and helps, especially as you win back the efficiency loss with the batching speedup.

I would suspect that fiddling with the virtual loss parameters can also help, but they're not exposed in either engine.
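
To make the batching point concrete, here is a minimal Python sketch of batched PUCT selection with virtual loss (a simplified illustration of the general technique, not the actual Leela/lc0 search; policy priors, FPU handling and sign flipping between plies are omitted):

import math

class Node:
    def __init__(self):
        self.children = {}        # move -> Node
        self.visits = 0
        self.value_sum = 0.0
        self.virtual_loss = 0

def puct_score(parent, child, cpuct=3.0):
    # Simplified PUCT: in-flight visits count as extra losses.
    n = child.visits + child.virtual_loss
    q = (child.value_sum - child.virtual_loss) / n if n else 0.0
    u = cpuct * math.sqrt(parent.visits + 1) / (1 + n)
    return q + u

def select_batch(root, batch_size):
    # Pick up to batch_size leaves, marking each path with a virtual loss
    # so the next selection is pushed away from lines already in flight.
    leaves = []
    for _ in range(batch_size):
        node, path = root, [root]
        while node.children:
            parent = node
            node = max(parent.children.values(),
                       key=lambda c: puct_score(parent, c))
            path.append(node)
        for n in path:
            n.virtual_loss += 1
        leaves.append((node, path))
    return leaves

def backup(path, value):
    # Replace the virtual losses with the real network evaluation.
    for n in path:
        n.virtual_loss -= 1
        n.visits += 1
        n.value_sum += value

The point is that every leaf after the first in a batch is chosen without the evaluations of the earlier ones, which is exactly the efficiency cost described above.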

The whole thing sounds very nasty to tune to me. You have a situation where you get huge NPS gains that should come at large losses in search efficiency, and probably does, but you're spending the inefficiency covering up holes in the engine.

For self-play, ideally you don't care at all and run everything serial, and then batch over concurrent games. This gives you the best of both worlds: full batching speedup and no loss of search efficiency. But I don't think Leela Chess Zero has this capability in the client or engines, and AFAIK no-one is developing it. (Someone made something like this for Leela Zero, FWIW, but it had too annoying install dependencies)
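
A minimal sketch of that "batch over concurrent games" idea, assuming a caller-supplied evaluate_batch function (e.g. one cuDNN forward pass) and a hypothetical Multiplexer class, so an illustration rather than actual client code:

import queue, threading

class Multiplexer:
    def __init__(self, evaluate_batch, batch_size):
        self.requests = queue.Queue()
        self.evaluate_batch = evaluate_batch   # one forward pass over a batch
        self.batch_size = batch_size
        threading.Thread(target=self._loop, daemon=True).start()

    def evaluate(self, position):
        # Called from a game thread; blocks until its own result is ready.
        done = threading.Event()
        slot = {}
        self.requests.put((position, slot, done))
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]
            while len(batch) < self.batch_size:
                try:
                    batch.append(self.requests.get(timeout=0.001))
                except queue.Empty:
                    break                      # run a partial batch
            results = self.evaluate_batch([p for p, _, _ in batch])
            for (_, slot, done), result in zip(batch, results):
                slot["result"] = result
                done.set()

Each game thread still runs a perfectly serial search; only the network evaluations from different games get stacked into one GPU batch.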

I understand there's talk about using lc0 as the default engine for training games. There's an interesting consideration here that the level of the games will change, at least if the visits are kept fixed and the default parameters are used (which IIRC use huge batches).
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by Milos »

Gian-Carlo Pascutto wrote: Wed May 23, 2018 6:23 pm The whole thing sounds very nasty to tune to me. You have a situation where you get huge NPS gains that should come at large losses in search efficiency, and probably does,
This is just your feeling, which has nothing to do with reality.
Lc0-cudnn is on par with LC0 when the same search parameters and a fixed-nodes TC are used.
It is also easily verified by looking at GPU usage. LC0 has very high GPU usage while processing essentially 10x fewer nodes than Lc0-cudnn, which means the only thing that is really inefficient is your hand-written LC0 inference code, i.e. that outdated Winograd 3x3 convolution implementation.
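
For reference, the Winograd trick being referred to trades multiplications for additions; a minimal 1-D F(2,3) sketch in Python (the standard textbook form, not the lc0 implementation):

def winograd_f2_3(d, g):
    # Two outputs of a 3-tap convolution with 4 multiplications instead of 6.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    # Reference: plain sliding-window convolution.
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

assert winograd_f2_3([1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125]) == direct([1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125])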
Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by Albert Silver »

Gian-Carlo Pascutto wrote: Wed May 23, 2018 6:23 pm
Albert Silver wrote: Mon May 21, 2018 4:39 am
It's a good theory, except that I tested all my PUCT values at 3+0 and 5+0, and then proposed them to GCP. He in turn tested them at very fast TCs, but stopped the test early due to disastrous results. The lower PUCT value was stronger at very short TCs, while the higher PUCT values only shone at longer TCs.
I don't remember the exact discussion but note that I was likely talking about the regular version, not the cuDNN one.

Things get significantly more complex if you use the cuDNN version with batching. Essentially, batching forces you to evaluate a lot of positions at once. It's a bit like getting a 256-core machine. Now, the best parameters for your serial search might not be the best ones to get the most Elo improvement on the 256-core one.

In theory the cuDNN version should be weaker at the same NPS because batching requires you to parallelize the search (regular Leela Chess Zero uses batch size=1), and MCTS or not, this costs efficiency. But in the case of chess + cuDNN the gain is large enough that you get a huge speedup. And because Leela Zero is still a bit blind for some things, it seems to be turning out that going very wide is plugging some of those holes and helps, especially as you win back the efficiency loss with the batching speedup.

I would suspect that fiddling with the virtual loss parameters can also help, but they're not exposed in either engine.

The whole thing sounds very nasty to tune to me. You have a situation where you get huge NPS gains that should come at large losses in search efficiency, and probably does, but you're spending the inefficiency covering up holes in the engine.

For self-play, ideally you don't care at all and run everything serial, and then batch over concurrent games. This gives you the best of both worlds: full batching speedup and no loss of search efficiency. But I don't think Leela Chess Zero has this capability in the client or engines, and AFAIK no-one is developing it. (Someone made something like this for Leela Zero, FWIW, but it had too annoying install dependencies)

I understand there's talk about using lc0 as the default engine for training games. There's an interesting consideration here that the level of the games will change, at least if the visits are kept fixed and the default parameters are used (which IIRC use huge batches).
I managed to get this running in CLOP and am running my own tests, which I will share once I have anything. Right now the error margin is too large for convergence (+/-52 Elo). Also, there may be something weird in my setup, since tc=1+1 is going a lot faster than my usual 1+1 games. I suspect I misunderstood this to be 1m+1s when it is actually 1s+1s... :-)
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
jkiliani
Posts: 143
Joined: Wed Jan 17, 2018 1:26 pm

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by jkiliani »

Gian-Carlo Pascutto wrote: Wed May 23, 2018 6:23 pm For self-play, ideally you don't care at all and run everything serial, and then batch over concurrent games. This gives you the best of both worlds: full batching speedup and no loss of search efficiency. But I don't think Leela Chess Zero has this capability in the client or engines, and AFAIK no-one is developing it. (Someone made something like this for Leela Zero, FWIW, but it had too annoying install dependencies)

I understand there's talk about using lc0 as the default engine for training games. There's an interesting consideration here that the level of the games will change, at least if the visits are kept fixed and the default parameters are used (which IIRC use huge batches).
As I understand the current plan, batching in self-play with lc0 will be exclusively over concurrent games, so the search efficiency in single games will not be affected at all.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by Milos »

jkiliani wrote: Wed May 23, 2018 7:12 pm
Gian-Carlo Pascutto wrote: Wed May 23, 2018 6:23 pm For self-play, ideally you don't care at all and run everything serial, and then batch over concurrent games. This gives you the best of both worlds: full batching speedup and no loss of search efficiency. But I don't think Leela Chess Zero has this capability in the client or engines, and AFAIK no-one is developing it. (Someone made something like this for Leela Zero, FWIW, but it had too annoying install dependencies)

I understand there's talk about using lc0 as the default engine for training games. There's an interesting consideration here that the level of the games will change, at least if the visits are kept fixed and the default parameters are used (which IIRC use huge batches).
As I understand the current plan, batching in self-play with lc0 will be exclusively over concurrent games, so the search efficiency in single games will not be affected at all.
So who is gonna write that concurrent self-play client, you? Do you have any idea how much effort is required to do this properly without introducing further bugs? I guess you don't...
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: lc0-win-20180512-cuda90-cudnn712-00

Post by IQ »

Milos wrote:
So who is gonna write that concurrent self-play client, you? Do you have any idea how much effort is required to do this properly without introducing further bugs? I guess you don't...
That's done already - try:

lc0-cudnn selfplay --parallelism=8 --backend=multiplexing "--backend-opts=cudnn(threads=2)" --games=1000 --visits=800 --tempdecay-moves=10
You can also pass different arguments to the two players by adding "player1: --argument=x player2: --argument=y".
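
For example, a hypothetical run giving the two sides different Cpuct values (assuming --cpuct is accepted as a per-player option by this build) would look like:

lc0-cudnn selfplay --parallelism=8 --backend=multiplexing "--backend-opts=cudnn(threads=2)" --games=1000 --visits=800 player1: --cpuct=3.0 player2: --cpuct=1.7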

You should really tone down your language and insults, Milos - why don't you contribute instead of complaining? Let us all see your awesome skills.