2900 Elo points progress, 10 million games, 330 nets

crem · Post by **crem** » Sun Nov 25, 2018 11:21 pm

1. So far in our tests we fail to reach A0 strength given the same number of training games (44 million).
2. We don't know why that is. Maybe we have a bug (probably), maybe we use the wrong FPU, maybe we guessed Cpuct wrong, maybe we misunderstood the paper, maybe we don't shuffle training games well enough, maybe we release new networks too rarely, maybe something else.
3. I agree that the best (or rather the only) way to get consistent improvements is to run lots of small tests with different ideas.
4. Currently there is no developed way to run such tests (it has been discussed for 6 months already, but it keeps being preempted by more urgent tasks [rushing a release for some CCCC/TCEC season, or changing Lc0 so that new features can be added in a more elegant way, or implementing some new Lc0 feature myself because it's more fun]).
5. Without an easy way of testing, setting up and running a fresh test is a cumbersome task. If it requires engine changes, it currently takes weeks to roll out. The server-side part is not a one-click thing either: it requires some hours of wiring up training scripts, transferring data, typing some SQL, making sure that clients no longer send training data from the old test after a restart, etc.
6. Often things are not changed just because the changes needed for a new idea are not implemented yet. Or sometimes it's because all devs are too busy with their non-Lc0 life for a week or two, etc.
7. Yes, the current use of contributors' GPUs is not optimal. But to make it more optimal, things have to be implemented, and devs just cannot keep up.
8. The current idea (as I perceive it) is "We'll do testing properly (with many small-scale experiments that anyone can submit, and statistically sound conclusions) when we have a framework. Until that's ready, let's run a full-size test with intuitively guessed params/ideas and hope it will be stronger than everything we had before."

So, yes we fail to reach A0 level, yes we should run well-designed experiments, yes we should have done lots of them, yes they should be small and frequent instead of rare and large (and largely based on just an intuitive guess instead of some scientifically sound method), but there's really no infrastructure and very little dev time to implement this infrastructure. And even doing it manually, starting a new small test every week is too time-consuming.

I totally agree that if a team of 2-3 full-time developers appeared, they would leave the LCZero project behind within one month. I don't know what to do with that knowledge though.

PS. Regarding "More resources were used than in DeepMind A0 project, not being at all near A0 level strength with 20xxx and 30xxx nets.": I hope you mean one run of DM vs one run of Lc0. As for the total amount of resources (for trials and testing), I'm sure DeepMind used hundreds if not thousands of times more resources than we have so far.
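
For readers unfamiliar with the two search parameters named in point 2: in AlphaZero-style MCTS, Cpuct scales the exploration term of the PUCT selection rule, and FPU ("first play urgency") is the value assumed for a child move that has not been visited yet. Below is a minimal Python sketch, assuming the textbook form of PUCT. The `Child` class, the `select_child` function and all numbers are hypothetical illustrations, not Lc0's actual code or defaults.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    prior: float            # policy prior P(s, a) from the network
    visits: int = 0         # visit count N(s, a)
    value_sum: float = 0.0  # sum of backed-up values, from the parent's point of view

    def q(self, fpu_value: float) -> float:
        # An unvisited child has no average value yet, so FPU supplies one.
        return self.value_sum / self.visits if self.visits else fpu_value


def select_child(children, cpuct=1.5, fpu_value=0.0):
    """Return the index of the child maximising Q + U (textbook PUCT)."""
    parent_visits = sum(c.visits for c in children)
    sqrt_total = math.sqrt(max(parent_visits, 1))

    def score(c):
        # Exploration bonus: large for high-prior, rarely visited moves.
        u = cpuct * c.prior * sqrt_total / (1 + c.visits)
        return c.q(fpu_value) + u

    return max(range(len(children)), key=lambda i: score(children[i]))


# One explored move with a modest average value, one unexplored move.
children = [Child(prior=0.6, visits=10, value_sum=2.0), Child(prior=0.4)]
print(select_child(children, cpuct=1.5, fpu_value=0.0))   # explores the new move
print(select_child(children, cpuct=0.3, fpu_value=-1.0))  # sticks with the known move
```

The same position yields a different choice depending on the two parameters, which is why a poor guess for either one could plausibly change the quality of the self-play games.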

MikeB · Post by **MikeB** » Mon Nov 26, 2018 3:40 am

Laskos wrote: ↑Sun Nov 25, 2018 10:46 am All wasted in the 30xxx run.

Score of lc0_v19_31214 vs SF_10: 26 - 97 - 77 [0.323] 200
Elo difference: -128.95 +/- 38.69
Finished match

Score of lc0_v19_31542 vs SF_10: 27 - 85 - 88 [0.355] 200
Elo difference: -103.73 +/- 36.38
Finished match

According to my rough order-of-magnitude estimate, and if I am not wrong, ~10 MWh was consumed, or about $3,000 in an average European country.

No other run of Leela had such waste.

And compared to the run 10xxx, not yet very close to it:

Score of lc0_v19_11261 vs SF_10: 99 - 82 - 219 [0.521] 400
Elo difference: 14.77 +/- 22.90
Finished match

+1
That's a real shame.
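
The Elo differences in the quoted match results follow from the win/loss/draw counts under the usual logistic Elo model, where a score fraction s maps to -400 * log10(1/s - 1). A minimal Python sketch follows. The function names are hypothetical, and the error margin assumes a 95% normal approximation on the per-game score; that is an assumption about how the +/- figures were produced, and it gives values close to, though not necessarily identical to, the quoted ones.

```python
import math
from statistics import NormalDist


def elo_from_score(score: float) -> float:
    """Logistic Elo difference for a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins: int, losses: int, draws: int, confidence: float = 0.95):
    """Return (elo_difference, error_margin) for a finished match.

    Assumes the result is not lopsided enough to push the confidence
    bounds of the score outside the open interval (0, 1).
    """
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + 0.5 * d  # mean score per game

    # Variance of the per-game score (a game scores 1, 0.5 or 0).
    var = w * (1.0 - mu) ** 2 + l * (0.0 - mu) ** 2 + d * (0.5 - mu) ** 2
    stderr = math.sqrt(var / n)

    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lo, hi = mu - z * stderr, mu + z * stderr
    margin = (elo_from_score(hi) - elo_from_score(lo)) / 2.0
    return elo_from_score(mu), margin


for name, (w, l, d) in {
    "lc0_v19_31214": (26, 97, 77),
    "lc0_v19_31542": (27, 85, 88),
    "lc0_v19_11261": (99, 82, 219),
}.items():
    diff, margin = match_elo(w, l, d)
    print(f"{name} vs SF_10: {diff:+.2f} +/- {margin:.2f}")
```

With only 200 games the margin is wide, so the two 30xxx nets are not clearly separated from each other, while both are clearly below the 11261 result.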

lucasart · Post by **lucasart** » Mon Nov 26, 2018 5:42 am

crem wrote: ↑Sun Nov 25, 2018 11:21 pm

Instead of wasting so much electricity, why don't you ask Demis Hassabis for insights?

He may not give you the exact secret sauce for everything, but he can at least bring you some clarifications on what you've assumed from his paper (where unclear), perhaps what parameters he used, or at least ideas on how to estimate such parameters.

A simple email could save the planet a few GWh. Think of the polar bears, man…

Laskos · Post by **Laskos** » Mon Nov 26, 2018 7:56 am

crem wrote: ↑Sun Nov 25, 2018 11:21 pm

Nice to hear from you, and you are probably the last person my pretty impolite (maybe unjustly so) post was aimed at. You developed the excellent engine, and the initial EXTREMELY successful 6x64 runs were mostly supervised by you. First, those are 12-15 times faster nets, so 10 million games with full-blown 20x256 nets are, in computing effort, equivalent to 120 million games of 6x64 runs. Second, everybody knows that reaching the global optimum with DCNNs is some sort of "black magic"; this is acknowledged in serious journals like "Nature". Wouldn't it have been better to have a "toy model" with 6x64 nets, which in early runs reached some local optima using only 100-150 nets for training and with games 15 times faster? And this "toy model" is not that "toyish": these nets were playing good chess, at some 3000+ CCRL 40/4 Elo level on a reasonable GPU. I often use toy models as a start in my research and even in some posts on this forum, trying to find some insights into the "real life" situation or the final model. 6x64 runs have a much simpler landscape in which to find tricks for reaching sweet spots in parameters that give better global results. The 20x256 runs are not only very slow, but their learning landscape is extremely weird, and trying to figure out everything at once could be an almost unreachable goal. To use them as an "experimental bedrock", while they are 15 times slower and settle to _local_ optima after some 1000 nets instead of 100, is just a squandering of resources. Going from toy models to real life is the usual procedure of the scientific method.
Sorry again, but those 10 million wasted games with the full-blown slow net, trying to figure out everything at once in an extremely complicated landscape, irritated me. You might end up learning nothing, or very little as far as procedure goes, from this 30xxx run.

Anyway, I am no expert on all this, and things like server maintenance could blow up with 6x64 nets coming out every 5 minutes or so.
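
The compute comparison in this post is simple arithmetic and can be written out as a short Python sketch. The function name is hypothetical, and the speed ratios are the rough 12-15x figures given in the post, not measurements.

```python
def equivalent_small_net_games(large_net_games: int, speed_ratio: int) -> int:
    """Games a small net could play for the compute spent on large-net games.

    speed_ratio is how many times faster the small net generates games.
    """
    return large_net_games * speed_ratio


games_20x256 = 10_000_000
for ratio in (12, 15):
    games_6x64 = equivalent_small_net_games(games_20x256, ratio)
    print(f"{ratio}x faster: {games_6x64:,} games of 6x64")
```

At the low end of the range this reproduces the 120 million figure in the post; at the high end it would be 150 million.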

pohl4711 · Post by **pohl4711** » Mon Nov 26, 2018 8:29 am

crem wrote: ↑Sun Nov 25, 2018 11:21 pm
So, yes we fail to reach A0 level

I am not sure this is true. Leela with late 11xxx nets (11250 or so) is at an Elo level around Fire 7.1. I doubt that A0 would score better if it had to play under valid test conditions. The competition vs. SF8 was a bad joke (fixed time per move, very small hash for SF, no openings). Under valid conditions, I believe SF8 would have played around 80-100 Elo better vs A0.

duncan · Post by **duncan** » Mon Nov 26, 2018 12:15 pm

crem wrote: ↑Sun Nov 25, 2018 11:21 pm

Is there any possibility of paying someone to create the framework, and of fundraising with that specific objective in mind, since it is so important?

btw what would you charge?

jp · Post by jp » Mon Nov 26, 2018 12:35 pm

crem wrote: ↑Sun Nov 25, 2018 11:21 pm

The worst part of the 3xxxx testing is the TB rescoring, because that is not "zero". (We talked about this in another thread.)
The only good excuse for going non-zero is that you believe 100% that the "zero" approach has maxed out & no more improvement is possible.

chrisw · Post by **chrisw** » Mon Nov 26, 2018 9:45 pm

crem wrote: ↑Sun Nov 25, 2018 11:21 pm

For such a brilliant piece of critical analysis of the project, I award Alexander the Order of Lenin.

Dann Corbit · Post by **Dann Corbit** » Mon Nov 26, 2018 10:51 pm

Can't we still collect the old net and use it, if we want to?

jp · Post by jp » Tue Nov 27, 2018 9:29 am

Dann Corbit wrote: ↑Mon Nov 26, 2018 10:51 pm Can't we still collect the old net and use it, if we want to?

They're all available. By "use", do you mean for play or more training?
