A paper about parameter tuning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: A paper about parameter tuning

Post by Dann Corbit »

Gian-Carlo Pascutto wrote:
This is total crap, of course, because you cannot improve by training on fewer games than it takes to measure the ELO gain itself, and in 100 games you measure nothing, so this is just another piece of crap.

Just to be more clear: if the engine improved by 500 ELO points in 100 games, it can _theoretically_ only be pure luck.
While there are some things in the paper that can be criticized, your criticism itself is just as flawed.

ELO gains cannot be measured exactly without an infinite number of games, so "measuring ELO gain" can realistically only be done with some error margin.

Measuring an improvement of 500 ELO in 100 games is going to be statistically significant. The error margin might be 300 ELO, but this means worst case there is still an improvement of 200 ELO. An ELO improvement of 500 ELO over 100 games means a score of something like 5 - 95!
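As a back-of-the-envelope check of those numbers (a sketch using the standard Elo logistic model and a crude binomial error bar that ignores draws; not taken from the paper):

Code:

import math

def expected_score(elo_diff):
    # Standard Elo logistic model: expected score of the stronger side.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def score_to_elo(score):
    # Inverse of the model: map a match score (strictly between 0 and 1) to an Elo difference.
    return -400.0 * math.log10(1.0 / score - 1.0)

games = 100
p = expected_score(500)                  # ~0.95, i.e. roughly 95 - 5 over 100 games
se = math.sqrt(p * (1.0 - p) / games)    # crude binomial standard error, draws ignored
low, high = p - 2.0 * se, p + 2.0 * se   # ~95% interval on the score
print(f"score {p:.2f}, Elo roughly {score_to_elo(low):.0f} to {score_to_elo(high):.0f}")

Even with only 100 games, the lower end of that interval stays hundreds of ELO above zero, which is the point: the error bar is huge, but a 95 - 5 score is still unambiguous evidence of a large improvement.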
The thing that I gleaned from the paper is that the new approach looks much better than the old one.

So, if we don't believe it, we can try it ourselves.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: A paper about parameter tuning

Post by mcostalba »

Dann Corbit wrote:
Gian-Carlo Pascutto wrote:
This is total crap, of course, because you cannot improve by training on fewer games than it takes to measure the ELO gain itself, and in 100 games you measure nothing, so this is just another piece of crap.

Just to be more clear: if the engine improved by 500 ELO points in 100 games, it can _theoretically_ only be pure luck.
While there are some things in the paper that can be criticized, your criticism itself is just as flawed.

ELO gains cannot be measured exactly without an infinite number of games, so "measuring ELO gain" can realistically only be done with some error margin.

Measuring an improvement of 500 ELO in 100 games is going to be statistically significant. The error margin might be 300 ELO, but this means worst case there is still an improvement of 200 ELO. An ELO improvement of 500 ELO over 100 games means a score of something like 5 - 95!
Please take a look at the graph's error bars; they were kind enough to add them too :-)

My point is that the paper is not serious, otherwise the authors would have thought harder about what they were doing.

There are two kinds of papers: papers that start from experimental results and try to figure out a satisfying underlying idea (these are the rare, good ones), and papers that start from an idea and try to figure out satisfying experimental results (these are just garbage). I think this one goes in the second bucket.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: A paper about parameter tuning

Post by mcostalba »

Dann Corbit wrote:
So, if we don't believe it, we can try it ourselves.
I think the logic should be reversed. Anyhow, I might even try it myself if I trusted the authors, but in this case they have played just 1000 games or so with a carefully chosen (non-working) engine, just to show they have something to publish....
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: A paper about parameter tuning

Post by zamar »

michiguel wrote:
A scientific paper should try to tackle a problem of artificial intelligence, not how to make a competitive chess engine. Of course, you should start from scratch if that is your goal.

Miguel
I understand. But if they say that by using this method you can get +X elo in y games, and I know that we could very likely obtain a similar result with simple gradient hill climbing through self-play, it's not clear to me what they are trying to prove. That their method, among hundreds of other methods, works? Yes, it works, as does almost any other optimization method when you start from random data. I can find tens of other methods which work just as well or better in terms of elo gain. Should I write a paper on each of them?
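For readers who have not seen that baseline: a minimal sketch of what coordinate-wise hill climbing on evaluation weights through self-play could look like (a generic illustration only, not the paper's method; play_match is a hypothetical callback returning the candidate's score over n self-play games against the current best).

Code:

import random

def hill_climb(weights, play_match, iterations=100, games_per_step=200, step=5):
    # Naive hill climbing: nudge one randomly chosen weight at a time and keep
    # the change only if the candidate outscores the current best in self-play.
    # play_match(candidate, best, n) -> candidate's score in [0, 1]; hypothetical,
    # not defined here.
    best = list(weights)
    for _ in range(iterations):
        i = random.randrange(len(best))
        candidate = list(best)
        candidate[i] += random.choice((-step, step))
        if play_match(candidate, best, games_per_step) > 0.5:
            best = candidate
    return best

Even a scheme this naive climbs quickly when the starting weights are random, which is why improving on random initial values is a low bar.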
Joona Kiiski
Aaron Becker
Posts: 292
Joined: Tue Jul 07, 2009 4:56 am

Re: A paper about parameter tuning

Post by Aaron Becker »

zamar wrote: I understand. But if they say that by using this method you can get +X elo in y games, and I know that we could very likely obtain a similar result with simple gradient hill climbing through self-play, it's not clear to me what they are trying to prove. That their method, among hundreds of other methods, works? Yes, it works, as does almost any other optimization method when you start from random data. I can find tens of other methods which work just as well or better in terms of elo gain. Should I write a paper on each of them?
I can't speak for the authors, but I think they're trying to prove that their technique is superior to the earlier TD techniques that they refer to as TD-Leaf and TD-Root, and I think they make a reasonable case. You say that you could use any learning technique to improve on random data, but those earlier TD techniques don't work well without a reasonable starting vector. Obviously improving a real top engine would be a better result, but rewriting the evaluation of an engine like Stockfish to be expressed as a weighted feature vector would be a lot of work on its own, and improving values that have been painstakingly hand-tuned is an awfully high bar to clear for a new technique.
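To make the "weighted feature vector" point concrete: the evaluation in this family of papers is assumed to be a dot product of feature values and tunable weights, roughly as sketched below (the feature names and the position.features() method are purely illustrative, not taken from Stockfish or from the paper).

Code:

def evaluate(position, weights):
    # Linear evaluation: sum of weight * feature value over all features.
    # position.features() is assumed to return a dict of feature name -> value
    # from the side to move's point of view; it is hypothetical, not defined here.
    return sum(weights.get(name, 0) * value
               for name, value in position.features().items())

# Illustrative weights in centipawns; a real engine has hundreds or thousands of such
# terms, and the paper under discussion reportedly uses around 1800 features.
weights = {
    "pawn": 100, "knight": 320, "bishop": 330, "rook": 500, "queen": 900,
    "passed_pawn": 20, "king_shelter": 15, "mobility": 4,
}

Recasting an existing hand-written evaluation into that linear form is the rewrite referred to above.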

Also, you're misinterpreting their results. The number of games given in the plot is the number of games they used to train their evaluation heuristic. They determined the resulting engine's elo separately, because their method does not allow simultaneous training and tournament play. They say they use BayesElo to compute their error bars, so presumably their elo figures are determined from a large number of testing games, separate from the smaller number of training games.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: A paper about parameter tuning

Post by mcostalba »

Aaron Becker wrote: Also, you're misinterpreting their results. The number of games given in the plot is the number of games they used to train their evaluation heuristic. They determined the resulting engine's elo separately, because their method does not allow simultaneous training and tournament play. They say they use BayesElo to compute their error bars, so presumably their elo figures are determined from a large number of testing games, separate from the smaller number of training games.
The error bars they show would need at least 10,000 games per sample point, and there are many sample points, so in total you would have needed 100,000 games to measure the ELO of all the points, while the whole test spanned just 1000 games. And if you consider each sample point for each different technique, the number would be much higher than 100,000 games just to measure ELO with the published precision. So I simply don't believe they have done that (also because it is not documented, and it surely would have been, considering that the ELO-measuring effort would have been orders of magnitude larger than the actual learning-algorithm test).

But the fundamental point that made me think they have no idea what measuring ELO in chess engines means is that they stated that after 100 games there was a 500 ELO gain. This is wrong, but not wrong because of the number in itself: even if, after 100 games, the resulting engine really were 500 ELO stronger (when properly measured), you, as a serious experimenter, should notice that this result can only be pure luck, so you should either redo the test or not publish the result after just 100 games.

Publishing _any_ result after just 100 games means you have not understood the problem you are trying to tackle.
Aaron Becker
Posts: 292
Joined: Tue Jul 07, 2009 4:56 am

Re: A paper about parameter tuning

Post by Aaron Becker »

mcostalba wrote: The error bars they show would need at least 10,000 games per sample point, and there are many sample points, so in total you would have needed 100,000 games to measure the ELO of all the points, while the whole test spanned just 1000 games. And if you consider each sample point for each different technique, the number would be much higher than 100,000 games just to measure ELO with the published precision. So I simply don't believe they have done that (also because it is not documented, and it surely would have been, considering that the ELO-measuring effort would have been orders of magnitude larger than the actual learning-algorithm test).

But the fundamental point that made me think they have no idea what measuring ELO in chess engines means is that they stated that after 100 games there was a 500 ELO gain. This is wrong, but not wrong because of the number in itself: even if, after 100 games, the resulting engine really were 500 ELO stronger (when properly measured), you, as a serious experimenter, should notice that this result can only be pure luck, so you should either redo the test or not publish the result after just 100 games.

Publishing _any_ result after just 100 games means you have not understood the problem you are trying to tackle.
You're completely misreading their experimental design. They trained their engines for 1000 games each, taking several snapshots throughout the process for each learning technique. They then took all of the engines produced by this process and played them in a tournament. Only this tournament, and not the training games, was used to compute elo. According to their paper, the tournament was around 16000 games. The error bars they came up with range from +-30 to +-60 elo. This is not the kind of precision that requires 10K games per engine. This is explained pretty clearly in section 6.1.
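As a rough cross-check of those figures (a crude approximation, not BayesElo's actual model; it treats each game as an independent win or loss and ignores draws, which only shrink the error):

Code:

import math

C = 400.0 / math.log(10.0)   # constant in the Elo <-> expected-score conversion

def elo_margin(games, p=0.5, z=2.0):
    # Approximate +/- Elo error (z standard errors) around an expected score p.
    return z * C / math.sqrt(games * p * (1.0 - p))

def games_for_margin(margin, p=0.5, z=2.0):
    # Approximate number of games needed for a +/- margin Elo error bar.
    return (z * C / margin) ** 2 / (p * (1.0 - p))

print(round(elo_margin(1333)))       # ~19 Elo with ~1333 games per engine
print(round(games_for_margin(30)))   # a bit over 500 games for +/- 30 Elo
print(round(games_for_margin(5)))    # ~19000 games for +/- 5 Elo

Under that crude model, roughly 1333 games per engine already gives error bars around +/- 20 Elo, so the published +/- 30 to +/- 60 figures are plausible, while +/- 5 Elo precision would indeed need well over 10,000 games per point.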
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: A paper about parameter tuning

Post by Gian-Carlo Pascutto »

mcostalba wrote: This is wrong, but not wrong because of the number in itself: even if, after 100 games, the resulting engine really were 500 ELO stronger (when properly measured), you, as a serious experimenter, should notice that this result can only be pure luck,
Why should it be luck?

Their method also learns inside the search tree, so even though there might only have been 100 games, there can be an order of magnitude more positions available to do learning on.
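A heavily simplified sketch of the kind of update described here, nudging the static evaluation toward the backed-up search values of the positions visited inside the tree (a generic gradient-style illustration only, not the paper's exact rule; all the callbacks are hypothetical):

Code:

def update_from_tree(weights, tree_positions, features, search_value, static_eval, lr=1e-4):
    # For each position seen during search, move the weights so that the static
    # evaluation gets closer to the deep (backed-up) search value.
    # features(pos) -> dict of feature name -> value, search_value(pos) -> backed-up
    # minimax value, static_eval(pos, weights) -> current linear evaluation;
    # all three are hypothetical and not defined here.
    for pos in tree_positions:
        error = search_value(pos) - static_eval(pos, weights)
        for name, value in features(pos).items():
            # For a linear evaluation, the gradient with respect to a weight is its feature value.
            weights[name] = weights.get(name, 0.0) + lr * error * value
    return weights

With thousands of interior nodes per search, 100 games can easily supply far more training positions than 100 games' worth of final results.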
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: A paper about parameter tuning

Post by zamar »

Aaron Becker wrote:Obviously improving a real top engine would be a better result, but rewriting the evaluation of an engine like Stockfish to be expressed as a weighted feature vector would be a lot of work on its own,
You don't need to express the whole evaluation function as a weighted vector, just select the most meaningful properties, fine-tune them and prove that an improvement has been made. 50 carefully selected variables should clearly be enough for testing purposes. And it can be done in less than a day. I can't understand why they used 1800 features.
and improving values that have been painstakingly hand-tuned
Actually most of Stockfish's current evaluation function is automatically, or at least semi-automatically, tuned through self-play. But we have now reached a point where no reasonable progress can be made anymore with the techniques we currently use. That's why I'm interested in these papers: to get fresh ideas and learn possibly better techniques.

But I'd really hope that they would do some more serious tests than an elo increase from random data to master level.

Of course I can try this stuff myself, but it could take weeks or months to implement properly, so I'd just like to be a bit more convinced that there is even some likelihood that this stuff actually works at a higher level!
Joona Kiiski
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: A paper about parameter tuning

Post by mcostalba »

Aaron Becker wrote: You're completely misreading their experimental design. They trained their engines for 1000 games each, taking several snapshots throughout the process for each learning technique. They then took all of the engines produced by this process and played them in a tournament. Only this tournament, and not the training games, was used to compute elo. According to their paper, the tournament was around 16000 games. The error bars they came up with range from +-30 to +-60 elo. This is not the kind of precision that requires 10K games per engine. This is explained pretty clearly in section 6.1.
OK, I have counted 24 sample points, which means 12 games per round over 1333 rounds to reach 16000 games, so each engine has played about 1333 games to measure its ELO.

But the point is that after 100 games it _happened_ that the engine implementing the authors' idea gained 500 ELO, and that is (at least for me) an absolutely non-repeatable result: if you start again and measure the ELO reached after 100 games, you could get any other number, including a regression.

So I think using just one learning trial is very misleading. They should have done more than one learning trial and checked the progression in each trial, because I think it can be very different from trial to trial, at least until you reach thousands of games.
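A sketch of the kind of repetition suggested above: run several independent training trials and report the spread of the measured strength rather than a single learning curve (train_engine and measure_elo are hypothetical stand-ins for the training run and a separate, sufficiently long rating match).

Code:

import statistics

def repeated_trials(train_engine, measure_elo, trials=5, training_games=100):
    # Train several engines independently (different seeds), rate each one
    # separately, and report the mean and spread of the resulting Elo estimates.
    elos = [measure_elo(train_engine(training_games, seed=s)) for s in range(trials)]
    return statistics.mean(elos), statistics.stdev(elos)

If the spread across trials is comparable to the claimed gain, a single 100-game trial tells you very little.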