Aaron Becker wrote: I can't speak for the authors, but I think they're trying to prove that their technique is superior to the earlier TD techniques that they refer to as TD-Leaf and TD-Root, and I think they make a reasonable case. You say that you could use any learning technique to improve on random data, but those earlier TD techniques don't work well without a reasonable starting vector. Obviously improving a real top engine would be a better result, but rewriting the evaluation of an engine like Stockfish to be expressed as a weighted feature vector would be a lot of work on its own, and improving values that have been painstakingly hand-tuned is an awfully high bar to clear for a new technique.

That's pretty much spot on.
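For readers unfamiliar with the "weighted feature vector" formulation mentioned in the quote: the idea is that the static evaluation is a dot product between a learned weight vector and a vector of position features. A minimal sketch (the feature names below are invented placeholders, not any engine's actual feature set):

```python
# Minimal sketch of a linear evaluation function: a dot product between
# a weight vector and a vector of hand-crafted position features.
# The features (material balance, mobility, king safety) are illustrative
# placeholders, not the feature set of any particular engine.

def extract_features(position):
    """Map a position to a feature vector (here: a plain list of floats)."""
    return [
        position["material_balance"],   # e.g. pawn-units of material up/down
        position["mobility"],           # e.g. legal-move count difference
        position["king_safety"],        # e.g. some king-shelter measure
    ]

def evaluate(weights, position):
    """Linear evaluation: the weighted sum of the feature values."""
    features = extract_features(position)
    return sum(w * f for w, f in zip(weights, features))

weights = [1.0, 0.1, 0.5]
position = {"material_balance": 1.0, "mobility": 4.0, "king_safety": -2.0}
score = evaluate(weights, position)  # 1.0*1.0 + 0.1*4.0 + 0.5*(-2.0), i.e. ~0.4
```

With the evaluation in this form, "learning" reduces to adjusting the entries of `weights`, which is what makes the TD-style and TreeStrap techniques applicable.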
Yes, I am also curious to know whether this technique (or an adaptation of it) could help a strong engine. It is something I would personally be interested in trying in the future... that said, I think it would be crazy to invest such a huge effort in a new technique without first validating it with some self-play / random-weight experiments on a simple program.
Some more general comments (directed at the whole thread):
a) This paper looks at learning from self-play, starting from random weights... and nothing more. So please try to understand it in this context, and don't get angry when it doesn't answer _your_ question.
b) I agree that the paper could have been improved by reporting the results of multiple learning runs. If I get a chance in the future, I will address this. For what it is worth, in my private testing (with different, and progressively more sophisticated, sets of features), the TreeStrap method consistently outperformed TD-Leaf by similar margins.
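For context on what is being compared: TD-Leaf updates the evaluation toward the value of the principal-variation leaf at a later time step, while TreeStrap updates the evaluation of every searched interior node toward the minimax value backed up to that node from the same search. A toy sketch of a TreeStrap(minimax)-style weight update for a linear evaluation; the node contents are invented for illustration, where a real engine would supply them from its search tree:

```python
# Toy sketch of a TreeStrap(minimax)-style update for a linear evaluation:
# every searched node pulls the static evaluation toward the minimax value
# backed up to that node.  The example nodes below are fictional.

def evaluate(weights, features):
    """Linear evaluation: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))

def treestrap_update(weights, tree_nodes, alpha=0.01):
    """One TreeStrap step: for each searched node, move the weights in the
    direction that reduces (backed_up_value - evaluation)^2.  For a linear
    evaluation the gradient of the evaluation is just the feature vector."""
    for features, backed_up_value in tree_nodes:
        delta = backed_up_value - evaluate(weights, features)
        weights = [w + alpha * delta * f for w, f in zip(weights, features)]
    return weights

# Two fictional searched nodes: (feature vector, minimax value from search).
nodes = [([1.0, 0.0], 0.5), ([0.0, 1.0], -0.25)]
new_weights = treestrap_update([0.0, 0.0], nodes, alpha=0.1)
```

The key difference from TD-Leaf is the volume of training signal per search: one update per searched node rather than one per root position, which is one plausible reason for the margins reported above.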
c) Apologies if the paper is difficult to read. The mathematics is really not that demanding: if you can understand TD-Leaf or an ANN, then you can easily understand this work. I might post a chess-programmer-friendly version on my website if there is sufficient interest; in the meantime, please email me if you have any questions.
d) Automatic evaluation improvement of strong chess engines is an entire research topic in itself. It is different from, and more open-ended than, learning from scratch with self-play. And, I dare say, more difficult. E.g. you can try to use the information provided by strong engines in a number of different ways (build a score-based training set, try to match moves, etc.)... Or, as mentioned somewhere in this thread, you could look at methods that tune only a small number of parameters in isolation... Or you could use a cluster and some computation-heavy technique like simulated annealing (with ELO as the objective function)... Take your pick, be creative, and enjoy the age of auto-tuning.
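The last option in (d) can be sketched in a few lines. This is a standard simulated-annealing loop, not a method from the paper; `play_match` is a stand-in that here scores candidate weights against a fictional hidden optimum, whereas a real tuner would play games and return a match score or ELO estimate:

```python
import math
import random

# Toy sketch of "simulated annealing with playing strength as the objective".
# play_match() is a synthetic stand-in: it peaks at a fictional hidden
# optimum.  A real tuner would play engine matches here instead.

HIDDEN_OPTIMUM = [1.0, 3.0, 0.5]  # fictional "true" best weights

def play_match(weights):
    """Fake objective: higher is better, maximal at HIDDEN_OPTIMUM."""
    return -sum((w - o) ** 2 for w, o in zip(weights, HIDDEN_OPTIMUM))

def anneal(weights, steps=2000, temp=1.0, cooling=0.995, step_size=0.2):
    """Standard simulated annealing: propose a random perturbation, always
    accept improvements, and accept regressions with probability
    exp(delta / temperature), so early exploration can escape local optima."""
    score = play_match(weights)
    for _ in range(steps):
        candidate = [w + random.uniform(-step_size, step_size) for w in weights]
        cand_score = play_match(candidate)
        delta = cand_score - score
        if delta > 0 or random.random() < math.exp(delta / temp):
            weights, score = candidate, cand_score
        temp *= cooling  # gradually become greedier
    return weights, score

random.seed(0)
tuned, final_score = anneal([0.0, 0.0, 0.0])
```

The obvious practical catch, and why a cluster gets mentioned, is that each objective evaluation costs a full match of games rather than a cheap function call.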