A paper about parameter tuning
Re: A paper about parameter tuning
Gian-Carlo Pascutto wrote: I think I understand your criticism now. Your point is that they only had 1 run (with 1 set of starting parameters) for their algorithm, so essentially the rate of improvement of their algorithm has an infinite error margin. So although this one run showed an improvement of roughly 500 Elo, that says nothing about the performance of their algorithm, only about the chess engine it happened to produce in this case.

Yes, IMO a more serious approach would have been to consider their algorithm on its own in the first instance. Do 10 trials, starting each time from a different random value set, and play, say, 1000 games per trial. Then at the end measure the Elo of the _final_ versions of the 10 engines produced (one per trial) and _verify_ that the Elo range of the resulting engines falls within a small margin. That would show the algorithm is reproducible and repeatable over a 1000-game trial, in other words that it converges in a reasonable number of games.
Then you could start to compare with others.
Re: A paper about parameter tuning
Dann Corbit wrote: So, if we don't believe it, we can try it ourselves.
mcostalba wrote: I think the logic should be reversed. Anyhow, I could even try it myself if I trusted the authors, but in this case they have run just 1000 games or so on a carefully chosen (non-working) engine, just to show they have something to publish....

They ran 16,000 training games (page 7, paragraph 1).
The Elo graph in figure 2 has a scale that extends to 10,000 games at its right-hand limit.
They played approximately 1000 online games (page 8, paragraph 3), which is enough to make a valid measurement of strength (about 400 is the lower limit for making a judgement, according to Heinz).
I think that they have enough games to know that the effect is real.
Now, will it translate to some program besides a stripped-down Bodo?
Nobody knows. But the research is the most interesting research in self-learning evaluation that has ever been published.
IMO-YMMV
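For readers who want to check the kind of claim made above, here is a rough back-of-the-envelope sketch (my own, not from the paper or from Heinz) of how the error bars on an Elo estimate shrink with the number of games, treating each game as an independent win/loss trial; counting draws would tighten the bars somewhat.

```cpp
// Back-of-the-envelope sketch: a 95% confidence interval in Elo for a
// measured score over N games, treating each game as an independent
// win/loss trial.  Counting draws would make the real error bars a bit
// tighter, so this is a conservative estimate.
#include <cmath>
#include <cstdio>

double scoreToElo(double p) {
    return -400.0 * std::log10(1.0 / p - 1.0);
}

int main() {
    double score = 0.60;   // measured score fraction (e.g. 60%)
    int    games = 1000;   // number of games played

    double se = std::sqrt(score * (1.0 - score) / games);
    double lo = score - 1.96 * se;
    double hi = score + 1.96 * se;

    std::printf("Elo: %+.0f  (95%% CI roughly %+.0f to %+.0f)\n",
                scoreToElo(score), scoreToElo(lo), scoreToElo(hi));
    // With 1000 games at 60% this prints about +70 with an interval of
    // roughly +49 to +93 Elo; rerunning with games = 400 widens it to
    // about +/-35 Elo.
}
```

With roughly 1000 games the margin is on the order of a couple of dozen Elo points, which is why a sample of that size is a reasonable basis for judging an effect in the hundreds of Elo.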
Re: A paper about parameter tuning
Rémi Coulom wrote: http://books.nips.cc/papers/files/nips2 ... 9_0508.pdf
I did not read it carefully. Some of you might be interested.
Rémi

Thank you!
http://jveness.info/publications/default.html
Re: A paper about parameter tuning
Rémi Coulom wrote: http://books.nips.cc/papers/files/nips2 ... 9_0508.pdf
zamar wrote: 1) Simple things are expressed in complicated math symbols. The writer spends great effort putting his ideas into a very abstract form, and the reader spends great effort extracting those ideas from that abstract form. What's the point? I could get the ideas much more easily from well-documented C-like pseudo-code.

Is there something like this around? I would be interested to see it.
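For what it is worth, here is one possible C-like sketch of the kind of update the paper describes, as I read it: a TreeStrap-style step with a linear evaluation, where every node visited by a search has its static eval pulled toward the minimax value of its own subtree. The names (Node, evalLinear, treeStrapUpdate) and the bookkeeping are mine, not the authors'; the paper's TreeStrap(alpha-beta) variant additionally works with search bounds rather than exact minimax values.

```cpp
// Unofficial sketch of a TreeStrap-style update with a linear evaluation
// eval(s) = dot(weights, features(s)).  After each search, every node the
// search visited has its static eval pulled toward the minimax value that
// its own subtree returned.  Names and bookkeeping are illustrative only.
#include <cstddef>
#include <vector>

struct Node {
    std::vector<double> features;   // feature vector phi(s) of the position
    double searchValue;             // minimax (or alpha-beta) value of its subtree
};

double evalLinear(const std::vector<double>& w, const std::vector<double>& phi) {
    double v = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) v += w[i] * phi[i];
    return v;
}

// One weight update after a single search; `nodes` holds the positions the
// search visited, `eta` is the learning rate.
void treeStrapUpdate(std::vector<double>& w,
                     const std::vector<Node>& nodes, double eta) {
    for (const Node& n : nodes) {
        double err = evalLinear(w, n.features) - n.searchValue;
        // For a linear eval the gradient w.r.t. the weights is just the
        // feature vector, so each weight moves in proportion to its feature.
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= eta * err * n.features[i];
    }
}
```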
Re: A paper about parameter tuning
Aaron Becker wrote: I can't speak for the authors, but I think they're trying to prove that their technique is superior to the earlier TD techniques that they refer to as TD-Leaf and TD-Root, and I think they make a reasonable case. You say that you could use any learning technique to improve on random data, but those earlier TD techniques don't work well without a reasonable starting vector. Obviously improving a real top engine would be a better result, but rewriting the evaluation of an engine like Stockfish to be expressed as a weighted feature vector would be a lot of work on its own, and improving values that have been painstakingly hand-tuned is an awfully high bar to clear for a new technique.

That's pretty much spot on.
Yes, I am also curious to know whether this technique (or an adaptation of) could help a strong engine. It is something I would personally be interested in trying in the future...that said, I think it would be crazy to invest such a huge effort in a new technique, without first validating it with some self play / random weight experiments on a simple program.
Some more general comments (directed at the whole thread):
a) This paper looks at learning from self play, starting from random weights....and nothing more. So please try and understand it in this context, and don't get angry when it doesn't answer _your_ question.
b) I agree that the paper could have been improved by reporting the results of multiple learning runs. If I get a chance in the future, I will address this. For what it is worth, in my private testing (with different, and progressively more sophisticated sets of features), the TreeStrap method consistently outperformed TD-Leaf by similar margins.
c) Apologies if the paper is difficult to read. The mathematics is really not that much. If you can understand TD-Leaf, or an ANN, then you can easily understand this work. I might post a chess programmer friendly version on my website if there is sufficient interest, in the meantime, please email me if you have any questions.
d) Automatic evaluation improvement of strong Chess engines is an entire research topic in itself. It's different and more open ended than learning from scratch with self-play. And I dare say more difficult. e.g. You can try to use the information provided by strong engines in a number of different ways (build a score based training set, try to match moves, etc)... Or, as mentioned somewhere in this thread, you could look at methods that only tune a small number of parameters in isolation... Or you can use a cluster and some computation heavy technique like simulated annealing (with ELO as the objective function)... Take your pick, be creative, and enjoy the age of auto-tuning.
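As an illustration of the last option mentioned in point d) (an Elo-like objective driven by match results), here is a hedged sketch of a simulated-annealing-style tuner. playMatch() is a hypothetical placeholder that would run a short gauntlet against a fixed reference engine and return the score fraction; that score is a noisy stand-in for Elo, so in practice you would want far more games per step than a sketch suggests.

```cpp
// Hedged sketch of "simulated annealing with an Elo-like objective":
// perturb one parameter at a time, accept improvements, and accept
// regressions with a probability that falls as the temperature cools.
// playMatch() is a hypothetical placeholder that would run a short
// gauntlet against a fixed reference engine and return the score in [0,1].
#include <cmath>
#include <random>
#include <vector>

double playMatch(const std::vector<int>& params, int games);  // hypothetical

std::vector<int> annealParams(std::vector<int> params, int iterations,
                              int gamesPerStep, double startTemp) {
    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> pick(0, static_cast<int>(params.size()) - 1);
    std::uniform_int_distribution<int> step(-5, 5);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    double current = playMatch(params, gamesPerStep);
    for (int it = 0; it < iterations; ++it) {
        // Linear cooling schedule; never reaches zero inside the loop.
        double temp = startTemp * (1.0 - static_cast<double>(it) / iterations);

        std::vector<int> cand = params;
        cand[pick(rng)] += step(rng);              // perturb one parameter
        double score = playMatch(cand, gamesPerStep);

        if (score >= current || unit(rng) < std::exp((score - current) / temp)) {
            params = cand;        // accept the candidate
            current = score;
        }
    }
    return params;
}
```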
Re: A paper about parameter tuning
joelv wrote: That's pretty much spot on.
Yes, I am also curious to know whether this technique (or an adaptation of) could help a strong engine. It is something I would personally be interested in trying in the future...that said, I think it would be crazy to invest such a huge effort in a new technique, without first validating it with some self play / random weight experiments on a simple program.

Actually, IMHO you have not validated anything. Instead you have _tried_ to prove that your algorithm is better than others. But these are two different and _separate_ goals, and the first should be attempted before the second.
joelv wrote: a) This paper looks at learning from self play, starting from random weights....and nothing more. So please try and understand it in this context, and don't get angry when it doesn't answer _your_ question.

In chess engine testing, numbers are more important than ideas (because they take much longer to obtain properly), and much more important than the race to be first. In your paper the experimental data is small, not clearly documented, and too narrowly focused on the attempted direction.
joelv wrote: c) Apologies if the paper is difficult to read. The mathematics is really not that much. If you can understand TD-Leaf, or an ANN, then you can easily understand this work. I might post a chess programmer friendly version on my website if there is sufficient interest, in the meantime, please email me if you have any questions.

For your information, the person who said the mathematics is difficult is actually a mathematician. He very kindly used the word 'difficult' instead of the ruder 'useless', because he is a gentleman.
I am not a mathematician, but I can read the paper too and I can understand the notation, and because I am _not_ a gentleman I will add that if an idea is sound it doesn't need to be 'defended' by an abstract notation; it can be expressed in a simple working code snippet, which at the end of the day is more precise anyway, because you have to define all the details unambiguously.
joelv wrote: d) Automatic evaluation improvement of strong Chess engines is an entire research topic in itself. It's different and more open ended than learning from scratch with self-play. And I dare say more difficult.

Yes, I dare to say so too, and I also dare to say more useful and effective. If the effort of parametrizing all the tuning parameters of a real engine scares people, I can say that we, for our auto-tuning, use a completely parametrized version of Stockfish where all the evaluation parameters (and not only those) are reachable and changeable as UCI options: many hundreds of parameters can be set and changed on the fly without restarting the engine. I would also add that in a real engine you have to proceed by parameter families, because it is naive to even think of optimizing all the values in one go.

I have really appreciated that you replied directly in this thread.
Thanks
Marco
Re: A paper about parameter tuning
mcostalba wrote: Actually, IMHO you have not validated anything. Instead you have _tried_ to prove that your algorithm is better than others. But these are two different and _separate_ goals, and the first should be attempted before the second.

You are welcome to your opinion. However, I genuinely believe this work is an improvement on TD-based methods for learning from self play, and I would therefore like to encourage others to try the idea. If the current set of experiments isn't convincing enough, maybe I can add some more material to my webpage to address this...
Suppose I did what you suggested earlier, and repeated the experiment 10 times (I don't have the CPU time to think about any more) with different sets of random starting weights and reported the results. Assuming they were similar to what was reported, would this be validation enough for you? If not, what would it take?
From my point of view, I actually am not that interested in just chess. When it comes down to it, there is no need to learn from scratch using self-play with chess, for any kind of practical reason. A better validation of the idea (in my opinion) would be seeing the method reliably working across a number of different games, across a number of different implementations.
mcostalba wrote: For your information, the person who said the mathematics is difficult is actually a mathematician. He very kindly used the word 'difficult' instead of the ruder 'useless', because he is a gentleman.
I am not a mathematician, but I can read the paper too and I can understand the notation, and because I am _not_ a gentleman I will add that if an idea is sound it doesn't need to be 'defended' by an abstract notation; it can be expressed in a simple working code snippet, which at the end of the day is more precise anyway, because you have to define all the details unambiguously.

It has nothing at all to do with 'defending' anything.
The reason to present the technique abstractly is to highlight the fact that the method (in principle) can work with alternate forms of parametrized evaluation functions (anything that can be optimized using gradient descent), rather than only the linear evaluation function used in the paper. For example, you could train an ANN with it (though I am not saying it would work well).
I at least thought this might be useful to someone. Honestly.
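To make that abstraction argument concrete, here is a small sketch (the interface and names are mine, not from the paper) of an update written purely against "a value and its gradient", so a linear eval, a small neural network, or anything else differentiable could be plugged in.

```cpp
// Sketch of the abstraction being described: the update only needs the
// evaluator's output and its gradient w.r.t. the tunable parameters, so a
// linear eval, a small net, or anything else differentiable plugs in.
// The interface and names here are mine, not from the paper.
#include <vector>

struct Evaluator {
    virtual double value(const std::vector<double>& features) const = 0;
    // Gradient of the output with respect to the tunable parameters.
    virtual std::vector<double> gradient(const std::vector<double>& features) const = 0;
    // Apply params += scale * grad.
    virtual void nudge(const std::vector<double>& grad, double scale) = 0;
    virtual ~Evaluator() = default;
};

// One bootstrapping step written only against the interface: pull the
// evaluator's output toward a target value produced by search.
void updateTowards(Evaluator& e, const std::vector<double>& features,
                   double searchTarget, double eta) {
    double err = e.value(features) - searchTarget;
    e.nudge(e.gradient(features), -eta * err);
}
```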
mcostalba wrote: Yes, I dare to say so too, and I also dare to say more useful and effective. If the effort of parametrizing all the tuning parameters of a real engine scares people, I can say that we, for our auto-tuning, use a completely parametrized version of Stockfish where all the evaluation parameters (and not only those) are reachable and changeable as UCI options: many hundreds of parameters can be set and changed on the fly without restarting the engine. I would also add that in a real engine you have to proceed by parameter families, because it is naive to even think of optimizing all the values in one go.

Is this parametrized version of Stockfish available as open source? It sounds like it would be the perfect test-bed for some of the ideas in the paper...
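For readers unfamiliar with the approach mcostalba describes, here is an illustrative sketch (not Stockfish's actual code; the option names are hypothetical) of how an engine can expose tunable evaluation terms as UCI options so that a tuning script can change them on the fly via "setoption".

```cpp
// Illustrative sketch (not Stockfish's actual code) of exposing evaluation
// terms as UCI options: a registry of named integer parameters that a
// "setoption name <X> value <Y>" command changes at runtime, so a tuning
// script can adjust weights without recompiling or restarting the engine.
#include <iostream>
#include <map>
#include <sstream>
#include <string>

std::map<std::string, int> TuneOptions = {
    {"PawnValueMg",    90},   // hypothetical parameter names
    {"KnightOutpost",  20},
    {"RookOnOpenFile", 25},
};

// Handles a line such as: setoption name KnightOutpost value 24
// (real UCI option names may contain spaces; this sketch assumes they don't).
void handleSetOption(const std::string& line) {
    std::istringstream is(line);
    std::string token, name, valueStr;
    is >> token;             // "setoption"
    is >> token >> name;     // "name" <identifier>
    is >> token >> valueStr; // "value" <number>

    auto it = TuneOptions.find(name);
    if (it != TuneOptions.end())
        it->second = std::stoi(valueStr);
    else
        std::cout << "info string unknown option " << name << std::endl;
}
```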
Re: A paper about parameter tuning
joelv wrote:
Aaron Becker wrote: I can't speak for the authors, but I think they're trying to prove that their technique is superior to the earlier TD techniques that they refer to as TD-Leaf and TD-Root, and I think they make a reasonable case. You say that you could use any learning technique to improve on random data, but those earlier TD techniques don't work well without a reasonable starting vector. Obviously improving a real top engine would be a better result, but rewriting the evaluation of an engine like Stockfish to be expressed as a weighted feature vector would be a lot of work on its own, and improving values that have been painstakingly hand-tuned is an awfully high bar to clear for a new technique.
That's pretty much spot on.
Yes, I am also curious to know whether this technique (or an adaptation of) could help a strong engine. It is something I would personally be interested in trying in the future...that said, I think it would be crazy to invest such a huge effort in a new technique, without first validating it with some self play / random weight experiments on a simple program.
Some more general comments (directed at the whole thread):
a) This paper looks at learning from self play, starting from random weights....and nothing more. So please try and understand it in this context, and don't get angry when it doesn't answer _your_ question.
b) I agree that the paper could have been improved by reporting the results of multiple learning runs. If I get a chance in the future, I will address this. For what it is worth, in my private testing (with different, and progressively more sophisticated sets of features), the TreeStrap method consistently outperformed TD-Leaf by similar margins.
c) Apologies if the paper is difficult to read. The mathematics is really not that much. If you can understand TD-Leaf, or an ANN, then you can easily understand this work. I might post a chess programmer friendly version on my website if there is sufficient interest, in the meantime, please email me if you have any questions.
d) Automatic evaluation improvement of strong Chess engines is an entire research topic in itself. It's different and more open ended than learning from scratch with self-play. And I dare say more difficult. e.g. You can try to use the information provided by strong engines in a number of different ways (build a score based training set, try to match moves, etc)... Or, as mentioned somewhere in this thread, you could look at methods that only tune a small number of parameters in isolation... Or you can use a cluster and some computation heavy technique like simulated annealing (with ELO as the objective function)... Take your pick, be creative, and enjoy the age of auto-tuning.
I read your paper. Congratulations, it is very good, and I think Marco's criticisms are way too retentive and do not address what you are actually trying to tackle. You showed that the learning speed of your algorithm is roughly 30-fold faster than that of the other methods (you should have highlighted this), and that is closer to a breakthrough than a mere improvement. Of course, not every algorithm works in every single environment, but it is not your job to prove that.
If that part (the ~30-fold speedup) is correct, and it is what the data you showed suggests, the criticisms completely evaporate.
I am not a mathematician, but I could follow the text very well. I have some questions about the implementation, but that is always the case. An appendix with pseudo-code would help the impact of the paper.
Miguel
Re: A paper about parameter tuning
joelv wrote: Suppose I did what you suggested earlier, and repeated the experiment 10 times (I don't have the CPU time to think about any more) with different sets of random starting weights and reported the results. Assuming they were similar to what was reported, would this be validation enough for you? If not, what would it take?

Even 5 runs are enough. Here are the conditions:
1) Take 5 different random starting sets and run the trial for N games each.
2) At the end, run a tournament among the 5 resulting engines plus one engine chosen at random from the 5 random starting sets (to show that there actually was an improvement).
3) If the estimated Elo of all 5 trained engines ends up within a 20 Elo range, and is higher than that of the starting engine by a margin comparable to your paper's result, we (I) will applaud the miracle.

Now _you_ have to choose N; of course the bigger N is, the easier the task becomes, because you give your algorithm more time to converge. The choice is up to you, but if you really want to amaze me you can take something like N=1000 (which I consider very small) and show that 1000 training games are enough to stabilize the result.
I would also like the test conditions to be documented:
- Time control used for trial games
- Book used for trial games, ponder off, etc.
- Hardware used: CPU and number of cores, and possibly the average middlegame search depth (if it is easy for you to get this important number)
- Time control used for Elo measurement and the final tournament verification
- Book used for the final Elo verification, ponder off, etc.
- Number of tournament rounds: how many games are used to estimate the Elo of the resulting engines
- Hardware used for the final tournament
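To make the proposed protocol concrete, here is a sketch of the harness it implies; randomWeights(), trainEngine(), and estimateElo() are hypothetical placeholders for the actual learning run and the tournament-based Elo measurement described in the conditions above.

```cpp
// Sketch of the repeatability test proposed above.  randomWeights(),
// trainEngine() and estimateElo() are hypothetical placeholders: one
// learning trial of N self-play games from a random weight set, and a
// common tournament-based Elo measurement for each result plus one
// untrained random starter, as described in the conditions.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Weights { std::vector<double> w; };

Weights randomWeights(unsigned seed);                       // hypothetical
Weights trainEngine(const Weights& start, int trainGames);  // hypothetical
double  estimateElo(const Weights& engine);                 // hypothetical

int main() {
    const int trials = 5, N = 1000;
    std::vector<double> elo;

    for (int t = 0; t < trials; ++t)
        elo.push_back(estimateElo(trainEngine(randomWeights(t), N)));

    double lowest   = *std::min_element(elo.begin(), elo.end());
    double highest  = *std::max_element(elo.begin(), elo.end());
    double baseline = estimateElo(randomWeights(0));   // untrained reference

    std::printf("spread = %.1f Elo, gain over a random start = %.1f Elo\n",
                highest - lowest, lowest - baseline);
    std::printf("repeatable within 20 Elo: %s\n",
                (highest - lowest <= 20.0) ? "yes" : "no");
}
```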
joelv wrote: Is this parametrized version of Stockfish available as open source? It sounds like it would be the perfect test-bed for some of the ideas in the paper...

I have sent you a PM on this point.