brianr wrote:It depends on what you want as tuning approaches vary (score, bm, result). With the CCRL games you already have a result.
Of course, but using the result from the database has several problems.
* You don't see the same kind of positions that you will see during a search, because some moves in the search have terrible quality. Giraffe's random-move trick is a way to address this. I am addressing it much more directly, by extracting my positions from actual calls to the evaluation function.
* If you are trying to learn how good or bad a particular feature is (say, having a white queen on b7 early in the game), it is possible that engines only do that when it's a good idea, for reasons that your evaluation function may not understand. But the training would learn that it is a good idea, period.
* Using all the positions from each game means you have many positions with similar inputs and identical results, so in effect you have a lot less data than you think you have.
* You don't know if an early position was won by black because it was a good position for black or if the engine playing white was weak. You can try to use the Elo rating difference to adjust for this, but what I am producing is a more pure data set to concentrate on the virtues of the position.
The CPW page on Texel's Tuning Method expresses similar concerns.
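To make the Elo adjustment mentioned in the last bullet concrete, here is a minimal sketch using the standard logistic Elo expectation. The function names are mine, and treating "result minus expected score" as the adjusted signal is just one assumption about how such a correction could be done; it is not something I am actually using.

```python
def expected_score(elo_white: float, elo_black: float) -> float:
    """Expected score for White under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((elo_black - elo_white) / 400.0))


def rating_adjusted_result(result: float, elo_white: float, elo_black: float) -> float:
    """Hypothetical adjustment: how much White over- or under-performed.

    result is from White's point of view: 1.0 win, 0.5 draw, 0.0 loss.
    A positive value means White did better than the ratings predicted.
    """
    return result - expected_score(elo_white, elo_black)


if __name__ == "__main__":
    # A draw between equally rated engines carries no surprise.
    print(rating_adjusted_result(0.5, 3000, 3000))
    # A White win as a 400-point favourite is only a small surprise.
    print(rating_adjusted_result(1.0, 3000, 2600))
```

Even with a correction like this, the label still mixes position quality with player strength, which is why I prefer to generate cleaner data in the first place.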
I just modified Stockfish to provide scores for FENs and have 100+MM. You could also use another super strong oracle engine like Komodo (less straightforward but doable with 'go depth <shallow>').
Yes, this is a reasonable alternative, but I am afraid this could result in the network repeating the vices of the oracle. Imagine Stockfish doesn't like non-isolated doubled pawns in the opening (it's hard to find a realistic example, so just bear with me), but it turns out they don't actually hurt your winning chances. I think using results of games will be less affected by the vice in the engine I am learning from.
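For reference, fitting against game results rather than oracle scores usually means a loss of the kind described on the CPW Texel page: map the evaluation to an expected result with a sigmoid and take the mean squared error against the actual result. The sketch below assumes scores in centipawns and a scaling constant `k` that would have to be fitted; the function names and the sample data are made up for illustration.

```python
def win_probability(score_cp: float, k: float = 1.0) -> float:
    """Map a centipawn score to an expected result in [0, 1]."""
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))


def mean_squared_error(samples, k: float = 1.0) -> float:
    """Average squared gap between predicted and actual results.

    samples is a list of (score_cp, result) pairs, with result from
    White's point of view: 1.0 win, 0.5 draw, 0.0 loss.
    """
    total = sum((result - win_probability(score, k)) ** 2
                for score, result in samples)
    return total / len(samples)


if __name__ == "__main__":
    # Made-up (score, result) pairs, purely to show the call.
    data = [(120, 1.0), (-35, 0.5), (-200, 0.0), (10, 0.5)]
    print(mean_squared_error(data))
```

Minimizing this over the evaluation's parameters ties them to results directly, so a quirk in some oracle's scoring never enters the target.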
Giraffe's approach of making random moves is also helpful as tuning seems to require poor positions as well as "good" ones.
Agreed. But is one random move enough, or should it be several? I think sampling from the positions that the evaluation function actually sees eliminates this issue.
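One way to sample from the positions the evaluation function actually sees, without storing every call, is reservoir sampling. This is only a sketch of the idea: the class name, the `offer` hook, and the use of FEN strings are assumptions for illustration, not a description of my actual code.

```python
import random


class ReservoirSampler:
    """Keep a uniform random sample of fixed size from a stream (Algorithm R)."""

    def __init__(self, capacity: int, seed=None):
        self.capacity = capacity
        self.seen = 0          # number of items offered so far
        self.sample = []       # current reservoir
        self._rng = random.Random(seed)

    def offer(self, item) -> None:
        """Call once per evaluated position, e.g. with its FEN string."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            # Reservoir not full yet: keep everything.
            self.sample.append(item)
            return
        # Keep the new item with probability capacity / seen,
        # replacing a uniformly chosen slot.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.sample[j] = item


if __name__ == "__main__":
    sampler = ReservoirSampler(capacity=5, seed=42)
    for i in range(1000):
        sampler.offer(f"position-{i}")   # stand-in for a FEN from evaluate()
    print(sampler.sample)
```

Every position that reaches the evaluation function then has the same chance of ending up in the data set, however bad the moves leading to it were.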
In any case, on the order of 1MM positions is far too few, IMO.
This is possibly true, and getting a sense for it is part of why I posted this question. I know AlphaGo used 30M positions to train their evaluation function, but I intend to have much simpler networks than what they used, so perhaps I don't need as much data.
We'll see. I might have to run this for much longer, or perhaps I have to switch to some cheaper alternative, if I find out I am starved for more positions.
Finally, testing methodology and process must be precise. Adding tuning to testing for actual improvements requires meticulous attention to details. I have been spinning wheels with this combo for years with occasional 100 elo jumps (very few and far between)
Agreed, but I haven't gotten that far yet. I have several ideas for how to create evaluation functions once I have this data, and in deciding among them I will definitely need to be disciplined about testing.
EDIT: Oh, and thanks for your comments! Much appreciated.