My non-OC RTX 2070 is very fast with Lc0


chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: My non-OC RTX 2070 is very fast with Lc0

Post by chrisw »

Laskos wrote: Sun Dec 16, 2018 2:46 pm
chrisw wrote: Sun Dec 16, 2018 12:51 pm
Laskos wrote: Sun Dec 16, 2018 12:26 pm
Milos wrote: Sat Dec 15, 2018 2:08 pm
Milos wrote: Sat Dec 15, 2018 1:56 am
Laskos wrote: Sat Dec 15, 2018 12:56 am Thanks, I was expecting 1.5-1.6 faster, and it comes at little below 1.5 faster. Clearly not the best value. I hope, in some 6-9 months I will have a Ryzen 8 or 16 core machine with 2 x RTX 2070. I am not sure about the effective speed-up from 2 GPUs, it can be anything from 1.4 to 1.95. Nobody seems to have performed strength tests on that, just NPS, which scale almost perfectly to 2 GPUs with Lc0.
I know it's not the same, but there is an interesting test you can perform. Run Lc0 with for example batch size 32 vs batch size 512 with fixed number of nodes per move and check Elo difference.
Yes, that's an interesting experiment. I will do that.
Here are some results:

5000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 51 - 79 - 270 [0.465] 400
Elo difference: -24.36 +/- 19.35

20000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 7 - 14 - 79 [0.465] 100
Elo difference: -24.36 +/- 31.03


It seems to lose not that much from 32 to 512 batch size. That would seem to indicate good effective speed-up from 1 GPU to 2. Do you have any estimate based on that? I would guess as high as 1.8 or even better.
It certainly does show that the 32-node batches are used far more effectively than the 512-node batches.

Not sure what would happen if you set batch=1; you might crash it. But batch=1 (or batch=2, if 1 won't work) versus batch=512 would show how many of those 512 nodes never get used at all. Well, not exactly how many, but it would give an idea.

Graph of Elo diff vs. batch size?
I will do that, but first I have to determine how much Elo the doubling from 2500 to 5000 nodes/move gives.
A few, one would imagine. If you give the search 5000 nodes, it will generate 10 batches (at 512). That's ten speculative batch decisions about 512 imagined possible positions at various (random) points in the tree. Most (many?) of those just go to waste. Double to 10000 nodes and 20 batches, and (I am guessing here) more of the first 10 batches' nodes will now be useful. And so on, if you allow a bigger search. Basically, as the tree expands, it eats more of the prior batched nodes, making them useful. Somewhere, presumably, there's a trade-off between tree size and batch size. But I guess that is what you are trying to find already. Nice if it were a linear relationship. Anyway, your scaling issue then becomes one of trying to pre-guess the best batch size prior to the start of the search. You have a rough idea of the number of nodes to search, then you set the batch size.
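
To make the speculative-batching picture above concrete, here is a toy Python sketch. It is an illustration only, not Lc0's actual code, and every name in it (Node, select_leaf, gather_batch) is invented for the sketch: a PUCT-style selector gathers one batch of leaves for a single NN call, using a crude virtual loss so it does not keep picking the same leaf, and the NN evaluation and value backup are omitted. Picks made from a stale tree, and requests that collide on the same leaf, are exactly the "wasted" batch slots being discussed; asking for many evaluations from a small tree makes such waste likely.

import math, random

class Node:
    def __init__(self, prior):
        self.prior = prior          # policy prior P(s,a)
        self.visits = 0             # visit count N(s,a)
        self.value_sum = 0.0        # accumulated value W(s,a); never updated here
        self.children = {}          # move -> Node

    def q(self):
        # Mean value; stays 0.0 in this sketch because backup is omitted.
        return self.value_sum / self.visits if self.visits else 0.0

def select_leaf(root, cpuct=3.4):
    # Walk down by the PUCT rule until a node with no children is reached.
    node, path = root, [root]
    while node.children:
        total = sum(c.visits for c in node.children.values()) + 1
        node = max(node.children.values(),
                   key=lambda c: c.q() + cpuct * c.prior * math.sqrt(total) / (1 + c.visits))
        path.append(node)
    return node, path

def gather_batch(root, batch_size):
    # Collect batch_size leaves for one NN call. A crude "virtual loss"
    # (bumping visits along the path) nudges later picks elsewhere, but
    # collisions still happen; those slots are wasted work.
    batch = []
    for _ in range(batch_size):
        leaf, path = select_leaf(root)
        for n in path:
            n.visits += 1
        batch.append((leaf, path))
    return batch

if __name__ == "__main__":
    random.seed(0)
    root = Node(1.0)
    # Build a small random tree so there is something to select from.
    frontier = [root]
    for _ in range(200):
        parent = random.choice(frontier)
        child = Node(random.random())
        parent.children[len(parent.children)] = child
        frontier.append(child)
    batch = gather_batch(root, 32)
    distinct = len({id(leaf) for leaf, _ in batch})
    print(f"asked for 32 leaf evaluations, reached {distinct} distinct leaves")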
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos »

chrisw wrote: Sun Dec 16, 2018 3:34 pm
Laskos wrote: Sun Dec 16, 2018 2:46 pm
chrisw wrote: Sun Dec 16, 2018 12:51 pm
Laskos wrote: Sun Dec 16, 2018 12:26 pm
Milos wrote: Sat Dec 15, 2018 2:08 pm
Milos wrote: Sat Dec 15, 2018 1:56 am
Laskos wrote: Sat Dec 15, 2018 12:56 am Thanks, I was expecting 1.5-1.6 faster, and it comes at little below 1.5 faster. Clearly not the best value. I hope, in some 6-9 months I will have a Ryzen 8 or 16 core machine with 2 x RTX 2070. I am not sure about the effective speed-up from 2 GPUs, it can be anything from 1.4 to 1.95. Nobody seems to have performed strength tests on that, just NPS, which scale almost perfectly to 2 GPUs with Lc0.
I know it's not the same, but there is an interesting test you can perform. Run Lc0 with for example batch size 32 vs batch size 512 with fixed number of nodes per move and check Elo difference.
Yes, that's an interesting experiment. I will do that.
Here are some results:

5000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 51 - 79 - 270 [0.465] 400
Elo difference: -24.36 +/- 19.35

20000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 7 - 14 - 79 [0.465] 100
Elo difference: -24.36 +/- 31.03


It seems to lose not that much from 32 to 512 batch size. That would seem to indicate good effective speed-up from 1 GPU to 2. Do you have any estimate based on that? I would guess as high as 1.8 or even better.
It certainly does show that the 32-node batches are used far more effectively than the 512-node batches.

Not sure what would happen if you set batch=1; you might crash it. But batch=1 (or batch=2, if 1 won't work) versus batch=512 would show how many of those 512 nodes never get used at all. Well, not exactly how many, but it would give an idea.

Graph of Elo diff vs. batch size?
I will do that, but first I have to determine how much Elo the doubling from 2500 to 5000 nodes/move gives.
A few, one would imagine. If you give the search 5000 nodes, it will generate 10 batches (at 512). That's ten speculative batch decisions about 512 imagined possible positions at various (random) points in the tree. Most (many?) of those just go to waste. Double to 10000 nodes and 20 batches, and (I am guessing here) more of the first 10 batches' nodes will now be useful. And so on, if you allow a bigger search. Basically, as the tree expands, it eats more of the prior batched nodes, making them useful. Somewhere, presumably, there's a trade-off between tree size and batch size. But I guess that is what you are trying to find already. Nice if it were a linear relationship. Anyway, your scaling issue then becomes one of trying to pre-guess the best batch size prior to the start of the search. You have a rough idea of the number of nodes to search, then you set the batch size.
I performed some interesting experiments:

Doubling from 2500 to 5000 nodes (both with batch size = 512):
5000:2500 nodes:
Score of lc0_v191_32013_5000 vs lc0_v191_32013_2500: 152 - 18 - 230 [0.667] 400
Elo difference: 121.06 +/- 21.45

Similar to what a strong A/B engine gains from a doubling at this "time control" under my conditions.

Batch size = 512 versus batch size = 1 at 5000 nodes (pretty slow games, as batch size = 1 engine is very slow):
Batch size = 512 versus batch size = 1:
Score of lc0_v191_32013_512 vs lc0_v191_32013_1: 12 - 21 - 67 [0.455] 100
Elo difference: -31.35 +/- 39.06


So, about a 0.25-doubling loss going from batch size = 1 to batch size = 512.
Or, 84% of the nodes used at batch size = 1 are effectively used when sending batches of 512, at 5000 nodes/move, which is amazingly high to me. I do not understand much here, but my guess would be that going to 2 GPUs would scale well in effective speedup, above 1.8 (if NPS scaling is close to perfect, 1.95 or so, which from posted results seems to be the case for 2 GPUs).
Do you or Milos have any idea about how it would scale to 2 GPUs (effective speedup)?
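
For reference, the arithmetic behind the 0.25-doubling and 84% figures, using only the numbers measured above; it assumes the Elo loss converts to doublings linearly at the measured Elo-per-doubling rate:

# Rough arithmetic behind the "0.25 doublings" / "84%" figures above,
# assuming Elo loss converts linearly at the measured ~121 Elo per doubling.
elo_per_doubling = 121.06   # from the 5000 vs 2500 nodes result
elo_loss = 31.35            # batch 512 vs batch 1 at 5000 nodes/move

doublings_lost = elo_loss / elo_per_doubling     # about 0.26
effective_fraction = 2 ** (-doublings_lost)      # about 0.84

print(f"{doublings_lost:.2f} doublings lost, "
      f"{effective_fraction:.0%} effective node usage")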
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: My non-OC RTX 2070 is very fast with Lc0

Post by chrisw »

Laskos wrote: Sun Dec 16, 2018 9:28 pm
chrisw wrote: Sun Dec 16, 2018 3:34 pm
Laskos wrote: Sun Dec 16, 2018 2:46 pm
chrisw wrote: Sun Dec 16, 2018 12:51 pm
Laskos wrote: Sun Dec 16, 2018 12:26 pm
Milos wrote: Sat Dec 15, 2018 2:08 pm
Milos wrote: Sat Dec 15, 2018 1:56 am
Laskos wrote: Sat Dec 15, 2018 12:56 am Thanks, I was expecting 1.5-1.6 faster, and it comes at little below 1.5 faster. Clearly not the best value. I hope, in some 6-9 months I will have a Ryzen 8 or 16 core machine with 2 x RTX 2070. I am not sure about the effective speed-up from 2 GPUs, it can be anything from 1.4 to 1.95. Nobody seems to have performed strength tests on that, just NPS, which scale almost perfectly to 2 GPUs with Lc0.
I know it's not the same, but there is an interesting test you can perform. Run Lc0 with for example batch size 32 vs batch size 512 with fixed number of nodes per move and check Elo difference.
Yes, that's an interesting experiment. I will do that.
Here are some results:

5000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 51 - 79 - 270 [0.465] 400
Elo difference: -24.36 +/- 19.35

20000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 7 - 14 - 79 [0.465] 100
Elo difference: -24.36 +/- 31.03


It seems to lose not that much from 32 to 512 batch size. That would seem to indicate good effective speed-up from 1 GPU to 2. Do you have any estimate based on that? I would guess as high as 1.8 or even better.
It certainly does show that the 32-node batches are used far more effectively than the 512-node batches.

Not sure what would happen if you set batch=1; you might crash it. But batch=1 (or batch=2, if 1 won't work) versus batch=512 would show how many of those 512 nodes never get used at all. Well, not exactly how many, but it would give an idea.

Graph of Elo diff vs. batch size?
I will do that, but first I have to determine how much Elo the doubling from 2500 to 5000 nodes/move gives.
A few, one would imagine. If you give the search 5000 nodes, it will generate 10 batches (at 512). That's ten speculative batch decisions about 512 imagined possible positions at various (random) points in the tree. Most (many?) of those just go to waste. Double to 10000 nodes and 20 batches, and (I am guessing here) more of the first 10 batches' nodes will now be useful. And so on, if you allow a bigger search. Basically, as the tree expands, it eats more of the prior batched nodes, making them useful. Somewhere, presumably, there's a trade-off between tree size and batch size. But I guess that is what you are trying to find already. Nice if it were a linear relationship. Anyway, your scaling issue then becomes one of trying to pre-guess the best batch size prior to the start of the search. You have a rough idea of the number of nodes to search, then you set the batch size.
I performed some interesting experiments:

Doubling from 2500 to 5000 nodes (both with batch size = 512):
5000:2500 nodes:
Score of lc0_v191_32013_5000 vs lc0_v191_32013_2500: 152 - 18 - 230 [0.667] 400
Elo difference: 121.06 +/- 21.45

Similar to what a strong A/B engine gains from a doubling at this "time control" under my conditions.

Batch size = 512 versus batch size = 1 at 5000 nodes (pretty slow games, as batch size = 1 engine is very slow)
Batch size = 512 versus batch size = 1:
Score of lc0_v191_32013_512 vs lc0_v191_32013_1: 12 - 21 - 67 [0.455] 100
Elo difference: -31.35 +/- 39.06


So, about a 0.25-doubling loss going from batch size = 1 to batch size = 512.
Or, 84% of the nodes used at batch size = 1 are effectively used when sending batches of 512, at 5000 nodes/move, which is amazingly high to me. I do not understand much here, but my guess would be that going to 2 GPUs would scale well in effective speedup, above 1.8 (if NPS scaling is close to perfect, 1.95 or so, which from posted results seems to be the case for 2 GPUs).
Do you or Milos have any idea about how it would scale to 2 GPUs (effective speedup)?
2 GPUs will either just allow larger batches or, if the software is clever, run two batches at once or something along these lines. I’ve not studied 2 GPUs yet.

You should run the 1 vs. 512 tournament for as long as you can, to reduce the error bars, else two things will go down in history as fact: one, that 84% is the figure, and two, that Crem has created a very efficient parallel algorithm.

Another thought ... is 512 the largest batch size possible on your system? You could also try setting the nodes limit = 512 and batch size = 512 (or both at the largest possible value); then the entire search has to run on the initial speculative assumption about which nodes to batch. That would be interesting info.
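
For what it's worth, a run like that can be driven over plain UCI; a minimal Python sketch, assuming an lc0 binary on the PATH and borrowing the --minibatch-size and --weights options from the command line quoted later in the thread (the helper name is invented):

# Minimal sketch of the suggested "nodes limit = batch size" run, driving
# Lc0 over standard UCI from Python. Assumes an 'lc0' binary on PATH and
# a weights file name (11250.pb) borrowed from later in this thread.
import subprocess

def one_fixed_node_search(nodes, batch):
    eng = subprocess.Popen(
        ["lc0", f"--minibatch-size={batch}", "--weights=11250.pb"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    def send(cmd):
        eng.stdin.write(cmd + "\n")
        eng.stdin.flush()
    send("uci")
    send("isready")
    send("position startpos")
    send(f"go nodes {nodes}")
    # Read engine output until the search reports its move.
    for line in eng.stdout:
        if line.startswith("bestmove"):
            send("quit")
            return line.strip()

print(one_fixed_node_search(nodes=512, batch=512))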
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Milos »

Laskos wrote: Sun Dec 16, 2018 9:28 pm
chrisw wrote: Sun Dec 16, 2018 3:34 pm
Laskos wrote: Sun Dec 16, 2018 2:46 pm
chrisw wrote: Sun Dec 16, 2018 12:51 pm
Laskos wrote: Sun Dec 16, 2018 12:26 pm
Milos wrote: Sat Dec 15, 2018 2:08 pm
Milos wrote: Sat Dec 15, 2018 1:56 am
Laskos wrote: Sat Dec 15, 2018 12:56 am Thanks, I was expecting 1.5-1.6 faster, and it comes at little below 1.5 faster. Clearly not the best value. I hope, in some 6-9 months I will have a Ryzen 8 or 16 core machine with 2 x RTX 2070. I am not sure about the effective speed-up from 2 GPUs, it can be anything from 1.4 to 1.95. Nobody seems to have performed strength tests on that, just NPS, which scale almost perfectly to 2 GPUs with Lc0.
I know it's not the same, but there is an interesting test you can perform. Run Lc0 with for example batch size 32 vs batch size 512 with fixed number of nodes per move and check Elo difference.
Yes, that's an interesting experiment. I will do that.
Here are some results:

5000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 51 - 79 - 270 [0.465] 400
Elo difference: -24.36 +/- 19.35

20000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 7 - 14 - 79 [0.465] 100
Elo difference: -24.36 +/- 31.03


It seems to lose not that much from 32 to 512 batch size. That would seem to indicate good effective speed-up from 1 GPU to 2. Do you have any estimate based on that? I would guess as high as 1.8 or even better.
It certainly does show that the 32-node batches are used far more effectively than the 512-node batches.

Not sure what would happen if you set batch=1; you might crash it. But batch=1 (or batch=2, if 1 won't work) versus batch=512 would show how many of those 512 nodes never get used at all. Well, not exactly how many, but it would give an idea.

Graph of Elo diff vs. batch size?
I will do that, but first I have to determine how much Elo the doubling from 2500 to 5000 nodes/move gives.
A few, one would imagine. If you give the search 5000 nodes, it will generate 10 batches (at 512). That's ten speculative batch decisions about 512 imagined possible positions at various (random) points in the tree. Most (many?) of those just go to waste. Double to 10000 nodes and 20 batches, and (I am guessing here) more of the first 10 batches' nodes will now be useful. And so on, if you allow a bigger search. Basically, as the tree expands, it eats more of the prior batched nodes, making them useful. Somewhere, presumably, there's a trade-off between tree size and batch size. But I guess that is what you are trying to find already. Nice if it were a linear relationship. Anyway, your scaling issue then becomes one of trying to pre-guess the best batch size prior to the start of the search. You have a rough idea of the number of nodes to search, then you set the batch size.
I performed some interesting experiments:

Doubling from 2500 to 5000 nodes (both with batch size = 512):
5000:2500 nodes:
Score of lc0_v191_32013_5000 vs lc0_v191_32013_2500: 152 - 18 - 230 [0.667] 400
Elo difference: 121.06 +/- 21.45

Similar to what a strong A/B engine gains from a doubling at this "time control" under my conditions.

Batch size = 512 versus batch size = 1 at 5000 nodes (pretty slow games, as batch size = 1 engine is very slow):
Batch size = 512 versus batch size = 1:
Score of lc0_v191_32013_512 vs lc0_v191_32013_1: 12 - 21 - 67 [0.455] 100
Elo difference: -31.35 +/- 39.06


So, about a 0.25-doubling loss going from batch size = 1 to batch size = 512.
Or, 84% of the nodes used at batch size = 1 are effectively used when sending batches of 512, at 5000 nodes/move, which is amazingly high to me. I do not understand much here, but my guess would be that going to 2 GPUs would scale well in effective speedup, above 1.8 (if NPS scaling is close to perfect, 1.95 or so, which from posted results seems to be the case for 2 GPUs).
Do you or Milos have any idea about how it would scale to 2 GPUs (effective speedup)?
Interesting results.
I believe you have the following case. Since you are running it at a relatively low number of nodes per move (you effectively have only 10 batches per move), and assuming that most nodes in the batch are from the best-move line, almost all of the nodes will be used in the next move. So parallelisation efficiency effectively becomes the ponder-hit ratio. Since you are running the engine against itself, the ponder-hit ratio is very high. If you ran it against some A/B engine, I believe you'd get a much higher loss. Also, if you ran it at more nodes per move you should get a higher loss.
At this relatively low number of nodes per move I don't think there would be any significant loss with 2 GPUs, i.e. scaling would be almost perfect.
The reason I believe scaling from 1 to 2 GPUs would be worse than going from e.g. a 256 to a 512 batch size at higher nodes per move is latency. When you increase the batch size, those nodes tend to be much more temporally close to each other than when you run 2 batches of 256 on 2 different GPUs.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: My non-OC RTX 2070 is very fast with Lc0

Post by chrisw »

Milos wrote: Sun Dec 16, 2018 10:27 pm
Laskos wrote: Sun Dec 16, 2018 9:28 pm
chrisw wrote: Sun Dec 16, 2018 3:34 pm
Laskos wrote: Sun Dec 16, 2018 2:46 pm
chrisw wrote: Sun Dec 16, 2018 12:51 pm
Laskos wrote: Sun Dec 16, 2018 12:26 pm
Milos wrote: Sat Dec 15, 2018 2:08 pm
Milos wrote: Sat Dec 15, 2018 1:56 am
Laskos wrote: Sat Dec 15, 2018 12:56 am Thanks, I was expecting 1.5-1.6 faster, and it comes at little below 1.5 faster. Clearly not the best value. I hope, in some 6-9 months I will have a Ryzen 8 or 16 core machine with 2 x RTX 2070. I am not sure about the effective speed-up from 2 GPUs, it can be anything from 1.4 to 1.95. Nobody seems to have performed strength tests on that, just NPS, which scale almost perfectly to 2 GPUs with Lc0.
I know it's not the same, but there is an interesting test you can perform. Run Lc0 with for example batch size 32 vs batch size 512 with fixed number of nodes per move and check Elo difference.
Yes, that's an interesting experiment. I will do that.
Here are some results:

5000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 51 - 79 - 270 [0.465] 400
Elo difference: -24.36 +/- 19.35

20000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 7 - 14 - 79 [0.465] 100
Elo difference: -24.36 +/- 31.03


It seems to lose not that much from 32 to 512 batch size. That would seem to indicate good effective speed-up from 1 GPU to 2. Do you have any estimate based on that? I would guess as high as 1.8 or even better.
It certainly does show that the 32-node batches are used far more effectively than the 512-node batches.

Not sure what would happen if you set batch=1; you might crash it. But batch=1 (or batch=2, if 1 won't work) versus batch=512 would show how many of those 512 nodes never get used at all. Well, not exactly how many, but it would give an idea.

Graph of Elo diff vs. batch size?
I will do that, but first I have to determine how much Elo the doubling from 2500 to 5000 nodes/move gives.
A few, one would imagine. If you give the search 5000 nodes, it will generate 10 batches (at 512). That's ten speculative batch decisions about 512 imagined possible positions at various (random) points in the tree. Most (many?) of those just go to waste. Double to 10000 nodes and 20 batches, and (I am guessing here) more of the first 10 batches' nodes will now be useful. And so on, if you allow a bigger search. Basically, as the tree expands, it eats more of the prior batched nodes, making them useful. Somewhere, presumably, there's a trade-off between tree size and batch size. But I guess that is what you are trying to find already. Nice if it were a linear relationship. Anyway, your scaling issue then becomes one of trying to pre-guess the best batch size prior to the start of the search. You have a rough idea of the number of nodes to search, then you set the batch size.
I performed some interesting experiments:

Doubling from 2500 to 5000 nodes (both with batch size = 512):
5000:2500 nodes:
Score of lc0_v191_32013_5000 vs lc0_v191_32013_2500: 152 - 18 - 230 [0.667] 400
Elo difference: 121.06 +/- 21.45

Similar to what a strong A/B engine gains from a doubling at this "time control" under my conditions.

Batch size = 512 versus batch size = 1 at 5000 nodes (pretty slow games, as batch size = 1 engine is very slow):
Batch size = 512 versus batch size = 1:
Score of lc0_v191_32013_512 vs lc0_v191_32013_1: 12 - 21 - 67 [0.455] 100
Elo difference: -31.35 +/- 39.06


So, about a 0.25-doubling loss going from batch size = 1 to batch size = 512.
Or, 84% of the nodes used at batch size = 1 are effectively used when sending batches of 512, at 5000 nodes/move, which is amazingly high to me. I do not understand much here, but my guess would be that going to 2 GPUs would scale well in effective speedup, above 1.8 (if NPS scaling is close to perfect, 1.95 or so, which from posted results seems to be the case for 2 GPUs).
Do you or Milos have any idea about how it would scale to 2 GPUs (effective speedup)?
Interesting results.
I believe you have the following case. Since you are running it at a relatively low number of nodes per move (you effectively have only 10 batches per move), and assuming that most nodes in the batch are from the best-move line, almost all of the nodes will be used in the next move. So parallelisation efficiency effectively becomes the ponder-hit ratio. Since you are running the engine against itself, the ponder-hit ratio is very high. If you ran it against some A/B engine, I believe you'd get a much higher loss. Also, if you ran it at more nodes per move you should get a higher loss.
At this relatively low number of nodes per move I don't think there would be any significant loss with 2 GPUs, i.e. scaling would be almost perfect.
The reason I believe scaling from 1 to 2 GPUs would be worse than going from e.g. a 256 to a 512 batch size at higher nodes per move is latency. When you increase the batch size, those nodes tend to be much more temporally close to each other than when you run 2 batches of 256 on 2 different GPUs.
Ah yes, of course, it benefits from the prior move. Well, so we assume, if the NN cache is still good from move to move. Milos is right: run the tests against an “equivalent”-strength A/B searcher that's inclined to play different types of move. An alternative would be to test against a suite of appropriate test positions; then the NN cache has to restart from scratch each time.
If using a test suite, then maybe all you would be looking for is nodes to solution, or nodes to some fixed point, like depth or something.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos »

chrisw wrote: Sun Dec 16, 2018 11:47 pm
Milos wrote: Sun Dec 16, 2018 10:27 pm
Laskos wrote: Sun Dec 16, 2018 9:28 pm
chrisw wrote: Sun Dec 16, 2018 3:34 pm
Laskos wrote: Sun Dec 16, 2018 2:46 pm
chrisw wrote: Sun Dec 16, 2018 12:51 pm
Laskos wrote: Sun Dec 16, 2018 12:26 pm
Milos wrote: Sat Dec 15, 2018 2:08 pm
Milos wrote: Sat Dec 15, 2018 1:56 am
Laskos wrote: Sat Dec 15, 2018 12:56 am Thanks, I was expecting 1.5-1.6 faster, and it comes at little below 1.5 faster. Clearly not the best value. I hope, in some 6-9 months I will have a Ryzen 8 or 16 core machine with 2 x RTX 2070. I am not sure about the effective speed-up from 2 GPUs, it can be anything from 1.4 to 1.95. Nobody seems to have performed strength tests on that, just NPS, which scale almost perfectly to 2 GPUs with Lc0.
I know it's not the same, but there is an interesting test you can perform. Run Lc0 with for example batch size 32 vs batch size 512 with fixed number of nodes per move and check Elo difference.
Yes, that's an interesting experiment. I will do that.
Here are some results:

5000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 51 - 79 - 270 [0.465] 400
Elo difference: -24.36 +/- 19.35

20000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 7 - 14 - 79 [0.465] 100
Elo difference: -24.36 +/- 31.03


It seems to lose not that much from 32 to 512 batch size. That would seem to indicate good effective speed-up from 1 GPU to 2. Do you have any estimate based on that? I would guess as high as 1.8 or even better.
It certainly does show that the 32-node batches are used far more effectively than the 512-node batches.

Not sure what would happen if you set batch=1; you might crash it. But batch=1 (or batch=2, if 1 won't work) versus batch=512 would show how many of those 512 nodes never get used at all. Well, not exactly how many, but it would give an idea.

Graph of Elo diff vs. batch size?
I will do that, but first I have to determine how much Elo the doubling from 2500 to 5000 nodes/move gives.
A few, one would imagine. If you give the search 5000 nodes, it will generate 10 batches (at 512). That's ten speculative batch decisions about 512 imagined possible positions at various (random) points in the tree. Most (many?) of those just go to waste. Double to 10000 nodes and 20 batches, and (I am guessing here) more of the first 10 batches' nodes will now be useful. And so on, if you allow a bigger search. Basically, as the tree expands, it eats more of the prior batched nodes, making them useful. Somewhere, presumably, there's a trade-off between tree size and batch size. But I guess that is what you are trying to find already. Nice if it were a linear relationship. Anyway, your scaling issue then becomes one of trying to pre-guess the best batch size prior to the start of the search. You have a rough idea of the number of nodes to search, then you set the batch size.
I performed some interesting experiments:

Doubling from 2500 to 5000 nodes (both with batch size = 512):
5000:2500 nodes:
Score of lc0_v191_32013_5000 vs lc0_v191_32013_2500: 152 - 18 - 230 [0.667] 400
Elo difference: 121.06 +/- 21.45

Similar to what a strong A/B engine gains from a doubling at this "time control" under my conditions.

Batch size = 512 versus batch size = 1 at 5000 nodes (pretty slow games, as batch size = 1 engine is very slow):
Batch size = 512 versus batch size = 1:
Score of lc0_v191_32013_512 vs lc0_v191_32013_1: 12 - 21 - 67 [0.455] 100
Elo difference: -31.35 +/- 39.06


So, about a 0.25-doubling loss going from batch size = 1 to batch size = 512.
Or, 84% of the nodes used at batch size = 1 are effectively used when sending batches of 512, at 5000 nodes/move, which is amazingly high to me. I do not understand much here, but my guess would be that going to 2 GPUs would scale well in effective speedup, above 1.8 (if NPS scaling is close to perfect, 1.95 or so, which from posted results seems to be the case for 2 GPUs).
Do you or Milos have any idea about how it would scale to 2 GPUs (effective speedup)?
Interesting results.
I believe you have the following case. Since you are running it at a relatively low number of nodes per move (you effectively have only 10 batches per move), and assuming that most nodes in the batch are from the best-move line, almost all of the nodes will be used in the next move. So parallelisation efficiency effectively becomes the ponder-hit ratio. Since you are running the engine against itself, the ponder-hit ratio is very high. If you ran it against some A/B engine, I believe you'd get a much higher loss. Also, if you ran it at more nodes per move you should get a higher loss.
At this relatively low number of nodes per move I don't think there would be any significant loss with 2 GPUs, i.e. scaling would be almost perfect.
The reason I believe scaling from 1 to 2 GPUs would be worse than going from e.g. a 256 to a 512 batch size at higher nodes per move is latency. When you increase the batch size, those nodes tend to be much more temporally close to each other than when you run 2 batches of 256 on 2 different GPUs.
Ah yes, of course, it benefits from the prior move. Well, so we assume, if the NN cache is still good from move to move. Milos is right: run the tests against an “equivalent”-strength A/B searcher that's inclined to play different types of move. An alternative would be to test against a suite of appropriate test positions; then the NN cache has to restart from scratch each time.
If using a test suite, then maybe all you would be looking for is nodes to solution, or nodes to some fixed point, like depth or something.
Hmmmm, it seems to give an even smaller difference against SF10, but the error margins are still large. It might also be due to the "compression" of Elo differences against regular engines. Lc0 at 5000 nodes/move, with batch sizes 512 and 32 used, as 1 is too slow. SF10 at a million nodes/move.

Score of lc0_v191_32013_512 vs SF10: 71 - 124 - 205 [0.434] 400
Elo difference: -46.31 +/- 23.76

Score of lc0_v191_32013_32 vs SF10: 78 - 112 - 210 [0.458] 400
Elo difference: -29.60 +/- 23.46

My picture seems to be that the scaling to 2 GPUs will be good with good settings. I was thinking that my new system would have either one RTX 2070, if the effective speedup is below 1.5-1.6, or 2x RTX 2070 if above. It seems to me it is almost surely above.
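
One crude way to turn these numbers into an effective 2-GPU speedup estimate, assuming NPS scales by ~1.95 and the extra batching needed to feed two GPUs costs about as much Elo as the batch penalties measured above; this is a back-of-envelope illustration, not a measurement:

# Back-of-envelope conversion of the Elo numbers above into an "effective
# speedup" for 2 GPUs, assuming (i) NPS scales by ~1.95 with 2 GPUs and
# (ii) the extra batching cost is comparable to the measured penalties.
elo_per_doubling = 121.06

def effective_speedup(nps_speedup, elo_penalty):
    return nps_speedup * 2 ** (-elo_penalty / elo_per_doubling)

print(effective_speedup(1.95, 24.36))          # self-play 512 vs 32 penalty -> ~1.70
print(effective_speedup(1.95, 46.31 - 29.60))  # vs-SF10 512 vs 32 penalty   -> ~1.77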
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: My non-OC RTX 2070 is very fast with Lc0

Post by chrisw »

Laskos wrote: Mon Dec 17, 2018 9:43 am
chrisw wrote: Sun Dec 16, 2018 11:47 pm
Milos wrote: Sun Dec 16, 2018 10:27 pm
Laskos wrote: Sun Dec 16, 2018 9:28 pm
chrisw wrote: Sun Dec 16, 2018 3:34 pm
Laskos wrote: Sun Dec 16, 2018 2:46 pm
chrisw wrote: Sun Dec 16, 2018 12:51 pm
Laskos wrote: Sun Dec 16, 2018 12:26 pm
Milos wrote: Sat Dec 15, 2018 2:08 pm
Milos wrote: Sat Dec 15, 2018 1:56 am
Laskos wrote: Sat Dec 15, 2018 12:56 am Thanks, I was expecting 1.5-1.6 faster, and it comes at little below 1.5 faster. Clearly not the best value. I hope, in some 6-9 months I will have a Ryzen 8 or 16 core machine with 2 x RTX 2070. I am not sure about the effective speed-up from 2 GPUs, it can be anything from 1.4 to 1.95. Nobody seems to have performed strength tests on that, just NPS, which scale almost perfectly to 2 GPUs with Lc0.
I know it's not the same, but there is an interesting test you can perform. Run Lc0 with for example batch size 32 vs batch size 512 with fixed number of nodes per move and check Elo difference.
Yes, that's an interesting experiment. I will do that.
Here are some results:

5000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 51 - 79 - 270 [0.465] 400
Elo difference: -24.36 +/- 19.35

20000 nodes/move
Score of lc0_v191_32013_512 vs lc0_v191_32013_32: 7 - 14 - 79 [0.465] 100
Elo difference: -24.36 +/- 31.03


It seems to lose not that much from 32 to 512 batch size. That would seem to indicate good effective speed-up from 1 GPU to 2. Do you have any estimate based on that? I would guess as high as 1.8 or even better.
It certainly does show that the 32-node batches are used far more effectively than the 512-node batches.

Not sure what would happen if you set batch=1; you might crash it. But batch=1 (or batch=2, if 1 won't work) versus batch=512 would show how many of those 512 nodes never get used at all. Well, not exactly how many, but it would give an idea.

Graph of Elo diff vs. batch size?
I will do that, but first I have to determine how much Elo the doubling from 2500 to 5000 nodes/move gives.
A few, one would imagine. If you give the search 5000 nodes, it will generate 10 batches (at 512). That's ten speculative batch decisions about 512 imagined possible positions at various (random) points in the tree. Most (many?) of those just go to waste. Double to 10000 nodes and 20 batches, and (I am guessing here) more of the first 10 batches' nodes will now be useful. And so on, if you allow a bigger search. Basically, as the tree expands, it eats more of the prior batched nodes, making them useful. Somewhere, presumably, there's a trade-off between tree size and batch size. But I guess that is what you are trying to find already. Nice if it were a linear relationship. Anyway, your scaling issue then becomes one of trying to pre-guess the best batch size prior to the start of the search. You have a rough idea of the number of nodes to search, then you set the batch size.
I performed some interesting experiments:

Doubling from 2500 to 5000 nodes (both with batch size = 512):
5000:2500 nodes:
Score of lc0_v191_32013_5000 vs lc0_v191_32013_2500: 152 - 18 - 230 [0.667] 400
Elo difference: 121.06 +/- 21.45

Similar to what a strong A/B engine gains from a doubling at this "time control" under my conditions.

Batch size = 512 versus batch size = 1 at 5000 nodes (pretty slow games, as batch size = 1 engine is very slow):
Batch size = 512 versus batch size = 1:
Score of lc0_v191_32013_512 vs lc0_v191_32013_1: 12 - 21 - 67 [0.455] 100
Elo difference: -31.35 +/- 39.06


So, about a 0.25-doubling loss going from batch size = 1 to batch size = 512.
Or, 84% of the nodes used at batch size = 1 are effectively used when sending batches of 512, at 5000 nodes/move, which is amazingly high to me. I do not understand much here, but my guess would be that going to 2 GPUs would scale well in effective speedup, above 1.8 (if NPS scaling is close to perfect, 1.95 or so, which from posted results seems to be the case for 2 GPUs).
Do you or Milos have any idea about how it would scale to 2 GPUs (effective speedup)?
Interesting results.
I believe you have the following case. Since you are running it at a relatively low number of nodes per move (you effectively have only 10 batches per move), and assuming that most nodes in the batch are from the best-move line, almost all of the nodes will be used in the next move. So parallelisation efficiency effectively becomes the ponder-hit ratio. Since you are running the engine against itself, the ponder-hit ratio is very high. If you ran it against some A/B engine, I believe you'd get a much higher loss. Also, if you ran it at more nodes per move you should get a higher loss.
At this relatively low number of nodes per move I don't think there would be any significant loss with 2 GPUs, i.e. scaling would be almost perfect.
The reason I believe scaling from 1 to 2 GPUs would be worse than going from e.g. a 256 to a 512 batch size at higher nodes per move is latency. When you increase the batch size, those nodes tend to be much more temporally close to each other than when you run 2 batches of 256 on 2 different GPUs.
Ah yes, of course, it benefits from the prior move. Well, so we assume, if the NN cache is still good from move to move. Milos is right: run the tests against an “equivalent”-strength A/B searcher that's inclined to play different types of move. An alternative would be to test against a suite of appropriate test positions; then the NN cache has to restart from scratch each time.
If using a test suite, then maybe all you would be looking for is nodes to solution, or nodes to some fixed point, like depth or something.
Hmmmm, it seems to give an even smaller difference against SF10, but the error margins are still large. It might also be due to the "compression" of Elo differences against regular engines. Lc0 at 5000 nodes/move, with batch sizes 512 and 32 used, as 1 is too slow. SF10 at a million nodes/move.

Score of lc0_v191_32013_512 vs SF10: 71 - 124 - 205 [0.434] 400
Elo difference: -46.31 +/- 23.76

Score of lc0_v191_32013_32 vs SF10: 78 - 112 - 210 [0.458] 400
Elo difference: -29.60 +/- 23.46

My picture seems to be that the scaling to 2 GPUs will be good with good settings. I was thinking that my new system would have either one RTX 2070, if the effective speedup is below 1.5-1.6, or 2x RTX 2070 if above. It seems to me it is almost surely above.
I think we are interested in somewhat different aspects here. You want to find the best batch size to use given the setup you have; I want to see how efficiently the Lc0 parallelisation is done. But anyway, the same experiments should answer both questions ...

Add runs at 256, 16 and 8 and you've got five graphing points.

I think some runs against test suites would be worth it; then we exclude the pre-warmed NN cache effect. Methodology: run the 512 batcher against a suite for, say, 10000 nodes, and get the actual node count for the penultimate "event" (when it spits out the line before the last one it gets to). Then the same thing for smaller batches; on average, a smaller batch ought to get to the same event in fewer nodes. A sketch of that measurement follows the list below.

At the end of all that, we should know
a) optimal batch sizings
b) the advantage from the ponder effect on the NN cache that Milos spotted
c) parallelisation efficiency
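
A rough Python sketch of the nodes-to-penultimate-event measurement described above, assuming an lc0 binary on the PATH; the helper name, weights file and FEN are placeholders, and a real run would loop over a whole suite of positions:

# Run Lc0 on one position at a fixed node budget for several minibatch
# sizes, collect the UCI 'info ... nodes N ...' lines, and report the node
# count of the line before the last one (the "penultimate event").
import subprocess, re

def nodes_to_penultimate_event(fen, batch, node_budget=10000):
    eng = subprocess.Popen(
        ["lc0", f"--minibatch-size={batch}", "--weights=11250.pb"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    def send(cmd):
        eng.stdin.write(cmd + "\n")
        eng.stdin.flush()
    send("uci")
    send("isready")
    send(f"position fen {fen}")
    send(f"go nodes {node_budget}")
    node_counts = []
    for line in eng.stdout:
        m = re.search(r"\bnodes (\d+)", line)
        if line.startswith("info") and m:
            node_counts.append(int(m.group(1)))
        if line.startswith("bestmove"):
            break
    send("quit")
    return node_counts[-2] if len(node_counts) >= 2 else None

# Placeholder position (the standard start position); a test-suite FEN
# would go here instead.
fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
for batch in (512, 256, 32, 8, 1):
    print(batch, nodes_to_penultimate_event(fen, batch))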
Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Albert Silver »

Albert Silver wrote: Fri Dec 14, 2018 11:11 pm
Laskos wrote: Thu Dec 06, 2018 3:43 pm
Albert Silver wrote: Thu Dec 06, 2018 2:25 pm
Laskos wrote: Thu Dec 06, 2018 3:59 am
brianr wrote: Thu Dec 06, 2018 3:33 am OK, something seems off.

Why are the 2080 depth 17 nodes so many more than the depth 19 with the 2070?

Maybe I am missing something.
Thanks.
Probably different nets used. But the speed should be fairly uniform with the latter test30 nets, so nps are probably fair to compare.
I used 11250, which I had on hand.
Ok, with this net I am getting:

info depth 16 seldepth 43 time 95727 nodes 2810327 score cp 25 hashfull 643 nps 29357,

so yours is about 23% higher, with about 28% more CUDA cores at a 7% higher frequency, i.e. a 37% expected speed-up in total. It seems memory speed and bandwidth also matter, as those are the same in the 2070 and 2080. Also, the price is 40% higher. I think the least effective would be the RTX 2080 Ti, and the most effective a dual RTX 2070.
Although still paired with a very old i5 (will compare with Threadripper later), the 2080ti yields:

info depth 17 seldepth 43 time 69846 nodes 3024184 score cp 26 hashfull 681 nps 43297
So the combined result of the two is 77knps (at that same peak point) though it does slow after:

info depth 12 seldepth 39 time 68041 nodes 5268761 score cp 21 hashfull 461 nps 77435 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8e7 c1f4 e8g8 e2e3 b8d7 f1e2 d5c4 e2c4 a7a6 a2a4 c7c5 e1g1 c5d4 d1d4 b7b6 f1d1 c8b7 f3e5 e7c5 d4d3 d7e5 f4e5 d8c8 d3e2 c5e7 c4d3 c8c5

Curiously, this is actually not as big an improvement with two as one might think. I reran the 2080ti (was 43knps as you will recall) on the Threadripper, and got a very large speed increase using the new backend "roundrobin" but on one GPU, and it peaked at over 60knps.

info depth 13 seldepth 38 time 138773 nodes 8348579 score cp 21 hashfull 692 nps 60159 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 b1d2 d7d5 f1g2 e8g8 g1f3 d5c4 a2a3 b4d2 c1d2 b8c6 e2e3 b7b5 b2b3 c4b3 d1b3 a8b8 e1g1 c8b7 f1c1 a7a6 f3e1 c6e7 g2b7 b8b7
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos »

Albert Silver wrote: Sun Dec 23, 2018 9:09 pm
Albert Silver wrote: Fri Dec 14, 2018 11:11 pm
Laskos wrote: Thu Dec 06, 2018 3:43 pm
Albert Silver wrote: Thu Dec 06, 2018 2:25 pm
Laskos wrote: Thu Dec 06, 2018 3:59 am
brianr wrote: Thu Dec 06, 2018 3:33 am OK, something seems off.

Why are the 2080 depth 17 nodes so many more than the depth 19 with the 2070?

Maybe I am missing something.
Thanks.
Probably different nets used. But the speed should be fairly uniform with the latter test30 nets, so nps are probably fair to compare.
I used 11250, which I had on hand.
Ok, with this net I am getting:

info depth 16 seldepth 43 time 95727 nodes 2810327 score cp 25 hashfull 643 nps 29357,

so yours is about 23% higher, with about 28% more CUDA cores at a 7% higher frequency, i.e. a 37% expected speed-up in total. It seems memory speed and bandwidth also matter, as those are the same in the 2070 and 2080. Also, the price is 40% higher. I think the least effective would be the RTX 2080 Ti, and the most effective a dual RTX 2070.
Although still paired with a very old i5 (will compare with Threadripper later), the 2080ti yields:

info depth 17 seldepth 43 time 69846 nodes 3024184 score cp 26 hashfull 681 nps 43297
So the combined result of the two is 77knps (at that same peak point) though it does slow after:

info depth 12 seldepth 39 time 68041 nodes 5268761 score cp 21 hashfull 461 nps 77435 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8e7 c1f4 e8g8 e2e3 b8d7 f1e2 d5c4 e2c4 a7a6 a2a4 c7c5 e1g1 c5d4 d1d4 b7b6 f1d1 c8b7 f3e5 e7c5 d4d3 d7e5 f4e5 d8c8 d3e2 c5e7 c4d3 c8c5

Curiously, this is actually not as big an improvement with two as one might think. I reran the 2080ti (was 43knps as you will recall) on the Threadripper, and got a very large speed increase using the new backend "roundrobin" but on one GPU, and it peaked at over 60knps.

info depth 13 seldepth 38 time 138773 nodes 8348579 score cp 21 hashfull 692 nps 60159 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 b1d2 d7d5 f1g2 e8g8 g1f3 d5c4 a2a3 b4d2 c1d2 b8c6 e2e3 b7b5 b2b3 c4b3 d1b3 a8b8 e1g1 c8b7 f1c1 a7a6 f3e1 c6e7 g2b7 b8b7
Seems almost perfect add-up of NPS. What engine version was used?

In the second case, hmmm, wasn't the second GPU used somehow too? "Roundrobin" isn't a multi-GPU option?

With the new engine (v0.20 rc2) I got this peak with the 2070 (same net, 11250):
info depth 13 seldepth 40 time 208040 nodes 6935721 score cp 22 hashfull 585 nps 33338

More than 10% faster than with v0.19.
Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Albert Silver »

Laskos wrote: Mon Dec 24, 2018 2:49 am
Albert Silver wrote: Sun Dec 23, 2018 9:09 pm
Albert Silver wrote: Fri Dec 14, 2018 11:11 pm
Laskos wrote: Thu Dec 06, 2018 3:43 pm
Albert Silver wrote: Thu Dec 06, 2018 2:25 pm
Laskos wrote: Thu Dec 06, 2018 3:59 am
brianr wrote: Thu Dec 06, 2018 3:33 am OK, something seems off.

Why are the 2080 depth 17 nodes so many more than the depth 19 with the 2070?

Maybe I am missing something.
Thanks.
Probably different nets used. But the speed should be fairly uniform with the latter test30 nets, so nps are probably fair to compare.
I used 11250, which I had on hand.
Ok, with this net I am getting:

info depth 16 seldepth 43 time 95727 nodes 2810327 score cp 25 hashfull 643 nps 29357,

so yours is about 23% higher, with about 28% more CUDA cores at a 7% higher frequency, i.e. a 37% expected speed-up in total. It seems memory speed and bandwidth also matter, as those are the same in the 2070 and 2080. Also, the price is 40% higher. I think the least effective would be the RTX 2080 Ti, and the most effective a dual RTX 2070.
Although still paired with a very old i5 (will compare with Threadripper later), the 2080ti yields:

info depth 17 seldepth 43 time 69846 nodes 3024184 score cp 26 hashfull 681 nps 43297
So the combined result of the two is 77knps (at that same peak point) though it does slow after:

info depth 12 seldepth 39 time 68041 nodes 5268761 score cp 21 hashfull 461 nps 77435 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 b1c3 f8e7 c1f4 e8g8 e2e3 b8d7 f1e2 d5c4 e2c4 a7a6 a2a4 c7c5 e1g1 c5d4 d1d4 b7b6 f1d1 c8b7 f3e5 e7c5 d4d3 d7e5 f4e5 d8c8 d3e2 c5e7 c4d3 c8c5

Curiously, this is actually not as big an improvement with two as one might think. I reran the 2080ti (was 43knps as you will recall) on the Threadripper, and got a very large speed increase using the new backend "roundrobin" but on one GPU, and it peaked at over 60knps.

info depth 13 seldepth 38 time 138773 nodes 8348579 score cp 21 hashfull 692 nps 60159 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 b1d2 d7d5 f1g2 e8g8 g1f3 d5c4 a2a3 b4d2 c1d2 b8c6 e2e3 b7b5 b2b3 c4b3 d1b3 a8b8 e1g1 c8b7 f1c1 a7a6 f3e1 c6e7 g2b7 b8b7
Seems almost perfect add-up of NPS. What engine version was used?

In the second case, hmmm, wasn't the second GPU used somehow too? "Roundrobin" isn't a multi-GPU option?

With the new engine (v0.20 rc2) I got this peak with the 2070 (same net, 11250):
info depth 13 seldepth 40 time 208040 nodes 6935721 score cp 22 hashfull 585 nps 33338

More than 10% faster than with v0.19.
No, in the second case there was no second-GPU usage for sure. Roundrobin is a new multi-GPU option I used in v0.20, and it is remarkably efficient with a single GPU as well. Here was my command line:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
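
Presumably a second GPU would be added by listing two backend groups in --backend-opts; only the single-GPU line above is confirmed, and the two-group syntax below is my assumption of what the dual-GPU variant would look like:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000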
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."