marcelk wrote:Don wrote:marcelk wrote:
So far I have done evaluation parameter tuning and opening book learning and I consider those solved. My immediate next step will be search tuning. After that comes evaluation pattern synthesis, and I think somewhere around that point it would probably stop. (I can remotely imagine a program learning by itself about passers and king safety, but I cannot see an algorithm coming up with concepts such as null-move search.)
So how do you do parameter tuning? This discussion should be in the programming and technical discussions section, but I am interested in hearing your ideas.
[ I have asked the moderators to move the thread ]
Without going into the specifics of the realization (which I prefer not to disclose before the program development has leveled out), it is:
1. a least square fit,
2. established through hill-climbing,
3. on an oracle function that is comparing the result of 2-ply searches with the original game outcome,
4. in positions that were randomly selected from GM games (no filtering, not even for duplicates),
5. with the tuning holistically applied to the set of ~300 parameters in my program,
6. and resisting the urge to correct 'wrong' parameters manually.
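To make the six points above concrete, here is a minimal Python sketch of that kind of tuning loop. It is not marcelk's actual implementation, which he has not disclosed. Everything named here is an assumption for illustration: the linear `search_score` stands in for the 2-ply search, the logistic `expected_score` with a scale of 400 is one common way to map a score to an expected result, and the +/-1 coordinate step is the simplest possible form of hill-climbing.

```python
def expected_score(cp, scale=400.0):
    """Map a centipawn score to an expected result in [0, 1].

    The logistic shape and the scale are assumptions, not from the post.
    """
    cp = max(-4000.0, min(4000.0, cp))  # clamp to keep the power finite
    return 1.0 / (1.0 + 10.0 ** (-cp / scale))


def search_score(weights, features):
    """Hypothetical stand-in for the 2-ply search: a plain linear evaluation."""
    return sum(w * f for w, f in zip(weights, features))


def squared_error(weights, positions):
    """Mean squared difference between predicted and actual game outcome.

    positions: list of (features, outcome), outcome in {0.0, 0.5, 1.0}.
    """
    total = 0.0
    for features, outcome in positions:
        total += (outcome - expected_score(search_score(weights, features))) ** 2
    return total / len(positions)


def hill_climb(weights, positions, step=1, max_passes=100):
    """Coordinate-wise hill climbing over all parameters at once.

    Each parameter is nudged by +step, then -step; a change is kept only
    if it lowers the least-squares error. Stops when a full pass over the
    parameters finds no improvement.
    """
    weights = list(weights)
    best = squared_error(weights, positions)
    for _ in range(max_passes):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                weights[i] += delta
                err = squared_error(weights, positions)
                if err < best:
                    best = err
                    improved = True
                    break
                weights[i] -= delta  # undo a change that did not help
        if not improved:
            break
    return weights, best
```

A call would look like `tuned, err = hill_climb([0] * 300, positions)`, with `positions` built from randomly selected game positions paired with the result of the game they came from. Point 6 corresponds to accepting whatever `tuned` contains without editing it by hand.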
My results so far show the program is extremely eager to sacrifice 1 or 2 pawns in the opening (which I need to correct with the book) and has an aggressive, Tal-like attacking style that the super-strong programs have no problem defending against, but that the weaker ones can't handle. (But since Rookie v3 gets out-searched by 6 ply by those strong programs, I think that is solvable by focussing on search next.)
I'm speculating that this style is because the tuning now favors positions where humans tend to lose their games. Therefore I'm currently experimenting with replacing the set of GM positions with a hybrid set from GMs (45%), CCRL games (45%) and self server games (10%).
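For illustration only, drawing such a hybrid training set could look like the Python sketch below. The function name, the source labels and the data layout are hypothetical; only the 45/45/10 split comes from the post. Sampling is with replacement, in line with the "no filtering, not even for duplicates" remark above.

```python
import random


def sample_hybrid(sources, shares, n, seed=None):
    """Draw n positions from several pools according to the given shares.

    sources: dict mapping a source name to a list of positions.
    shares:  dict mapping the same names to relative weights.
    Returns a list of (source_name, position) pairs, duplicates allowed.
    """
    rng = random.Random(seed)
    names = list(sources)
    weights = [shares[name] for name in names]
    picks = []
    for _ in range(n):
        name = rng.choices(names, weights=weights)[0]  # pick a pool
        picks.append((name, rng.choice(sources[name])))  # pick a position
    return picks


# Shares taken from the post; the pool contents are placeholders.
SHARES = {"gm": 45, "ccrl": 45, "server": 10}
```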
It's difficult to determine success with automatic tuning because it depends on what you are comparing it to. If you get more out of it than if you had hand-tuned the weights yourself, then it's a limited form of success, but the result may still be far inferior to weights someone else might be able to produce manually.
On another subject: in learning there is the concept of the "training signal", that is, what do you compare against to know whether something is good or bad? In your description the training signal was the game outcome. With Temporal Difference Learning the training signal is the evaluation of future positions and your distance from them. With population-based algorithms (such as PBIL and genetic algorithms) the training signal is the "fitness function."
You want something that produces the strongest signal with the least effort. I'm not sure game outcome is an ideal signal because it seems very weak to me. In other words, it takes a LOT of effort to get back 1 or 2 bits of information. However, it does have some nice characteristics too: you cannot argue with results.
A very strong signal is produced by temporal concepts. You use the score of a future position to indicate the goodness of a current position and basically operate upon the delta. An example to clarify: if your evaluation at move 5 reports a negative score but at move 15 you find that you are winning, then there is some reason to believe your score at move 5 was in error. It's possible that the opponent blundered of course, but in self-play it's a reasonable assumption that even if the score was negative there may have been more resources in the position than you were aware of. So in this kind of learning the idea is to assign some credit to each move leading up to the current state and gradually make weight modifications that bring the program's thinking in line with reality.
The beauty of this is that the final result essentially becomes a training signal too, but so does every intermediate position.
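A rough Python sketch of this idea, in the TD(lambda) style, is given below. It is not the Cilkchess code; the linear evaluator, the logistic squashing, and the `alpha` and `lam` values are all assumptions chosen for illustration. Each position's weights are nudged by the discounted sum of the later changes in prediction, and the game result is appended as the last target, so it becomes a training signal alongside every intermediate position.

```python
import math


def predict(weights, features):
    """Squash a linear evaluation into a win-probability-like value in (0, 1)."""
    s = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-s))


def td_lambda_update(weights, game_features, final_result, alpha=0.1, lam=0.7):
    """One TD(lambda) pass over a single game; returns new weights.

    game_features: feature vectors of successive positions, all seen from
                   the same side's point of view.
    final_result:  1.0 win, 0.5 draw, 0.0 loss for that side.
    """
    preds = [predict(weights, f) for f in game_features]
    preds.append(final_result)  # the outcome is the final target

    # Temporal differences: how much the prediction changed at each step.
    deltas = [preds[t + 1] - preds[t] for t in range(len(game_features))]

    new_weights = list(weights)
    for t, feats in enumerate(game_features):
        # Credit for position t: later surprises count, discounted by lam.
        credit = sum(lam ** (j - t) * deltas[j] for j in range(t, len(deltas)))
        slope = preds[t] * (1.0 - preds[t])  # derivative of the logistic
        for i, f in enumerate(feats):
            new_weights[i] += alpha * credit * slope * f
    return new_weights
```

In the move 5 versus move 15 example, the positive deltas that show up around move 15 flow back, discounted, into the update for move 5, which pushes the weights toward scoring that earlier position higher next time.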
Years ago we used TDL on a version of Cilkchess and had some success. As in your experience, we found that the program was aggressive: it was making all sorts of interesting sacrifices. I cannot say that they were all sound, but we were testing at limited depth, I think 6 ply. At any depth, a move is a calculated gamble, and being wrong about a sacrifice, whether that means making an unsound one or passing up a sound one, can cost you. So if there is to be error, then why not err on the side of being too aggressive once in a while?
For technical reasons we didn't keep that version, even though it tested quite well.