Ok, sorry for spreading wrong information.

Ferdy wrote: ↑Tue Jan 19, 2021 8:56 am
The material.epd is not a subset of ccrl_3200_texel.epd.

Desperado wrote: ↑Sun Jan 17, 2021 9:12 pm
Now the really interesting part...
THE VALIDATION SETUP
Algorithm: cpw-algorithm
Stepsize: 5
evaltype: qs() tapered - material only
initial vector: 100,100,300,300,300,300,500,500,1000,1000
param-content: P,P,N,N,B,B,R,R,Q,Q
anchor: none
K: 1.0
database: material.epd 13994 positions
batchsize: 13994
Data: no modification
THE CHALLENGE SETUP
MG: 65 350 350 410 970 EG: 115 330 370 615 1080 best: 0.136443 epoch: 31
Algorithm: cpw-algorithm
Stepsize: 8,4,2,1
evaltype: qs() tapered - material only
initial vector: 100,100,300,300,300,300,500,500,1000,1000
param-content: P,P,N,N,B,B,R,R,Q,Q
anchor: none
K: 1.0
database: material.epd 13994 positions
batchsize: 13994
Data: no modification
THE TRIAL SETUP
MG: 64 348 344 404 952 EG: 124 384 436 696 1260 best: 0.135957 epoch: 56 (Ferdy reports mse: 0.13605793198602772)
Algorithm: cpw-algorithm
Stepsize: 8,7,6,5,4,3,2,1
evaltype: qs() tapered - material only
initial vector: 100,100,300,300,300,300,500,500,1000,1000
param-content: P,P,N,N,B,B,R,R,Q,Q
anchor: none
K: 1.0
database: material.epd 13994 positions
batchsize: 13994
Data: no modification
The better the MSE, the more the material phase values diverge. Ferdy chose the data (it is a subset of ccrl_3200_texel.epd).
MG: 62 333 333 389 899 EG: 134 416 476 752 1400 best: 0.135663 epoch: 107 (even better)
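For readers who have not implemented this: the listings above are parameter dumps of a cpw-style tuner, which is essentially coordinate descent on the MSE with a fixed step-size schedule. A minimal sketch; the toy data format and all names here are illustrative, not Desperado's actual code:

```python
def mse(params, positions, K=1.0):
    """Mean squared error between predicted and actual game results.
    Each position is (features, result): features[i] is the count
    difference for piece type i, result is 1.0 / 0.5 / 0.0."""
    total = 0.0
    for features, result in positions:
        score = sum(f * p for f, p in zip(features, params))  # material-only eval in cp
        sig = 1.0 / (1.0 + 10.0 ** (-K * score / 400.0))      # Texel sigmoid
        total += (result - sig) ** 2
    return total / len(positions)

def tune(params, positions, step_sizes=(8, 4, 2, 1)):
    """Coordinate descent: nudge one parameter at a time by +/-step,
    keep the change only if the MSE improves; shrink the step over time."""
    best = mse(params, positions)
    for step in step_sizes:
        improved = True
        while improved:
            improved = False
            for i in range(len(params)):
                for delta in (step, -step):
                    params[i] += delta
                    e = mse(params, positions)
                    if e < best:
                        best, improved = e, True
                        break
                    params[i] -= delta
    return params, best
```

A single coarse step size (the validation setup's 5) confines the search to a lattice of multiples of 5 around the start vector, while a shrinking schedule (8,4,2,1 or 8..1) can refine further, which is consistent with the three setups stopping at different vectors and MSE values.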
Tapered Evaluation and MSE (Texel Tuning)
Moderators: hgm, Rebel, chrisw
Posted by Desperado (879 posts, joined Mon Dec 15, 2008)
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by Sven (Sven Schüle; 4052 posts; joined Thu May 15, 2008; Berlin, Germany)
Best BS post ever, from someone with broad practical experience in applying Texel tuning.

hgm wrote: ↑Mon Jan 18, 2021 10:39 pm
Basically what you are saying is that Texel tuning sucks, and that you should try to break it as much as possible in order to prevent it from completing its task, so that it won't spoil your starting parameters as much as it otherwise would.
I don't doubt your experience, (in fact it is exactly what I expect from the currently derived set: pretty weak play). But your reaction to this is not consistent. More logical would be to not use it at all, or at least not use it for the piece values, but fix these, and use it to tune only the more subtle eval terms.
And the problem is not in the method, the problem is in the tuning set. If you tune on a set that doesn't contain any info on the piece values, you will of course get poor piece values, because the tuner will abuse the piece values to improve some minor terms that accidentally correlate with the material composition, not caring how much that spoils the evaluation of the heavily unbalanced positions missing from your test set.
Engines that play over 3200 Elo search a tree that for >99.9% consists of moronic play, visiting positions even a 1000 Elo player would not tolerate in his games. They can only find the 3200 Elo PV if they score the garbage positions well enough to realize they are worse than the PV. Tuning the eval on the very narrow range of 'sensible' positions will give an eval that very wrongly extrapolates to idiotic positions. To the point where it might start preferring those. E.g. trade Queen for two Bishops, if the piece values are off.
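hgm's Queen-for-two-Bishops example is easy to make concrete. With an invented, mistuned value set (these numbers are purely for illustration, not from any engine in this thread), a material-only eval scores the bad trade as a gain:

```python
# Invented piece values: the queen tuned far too low relative to the bishop.
mistuned = {"B": 420, "Q": 800}
sound = {"B": 330, "Q": 950}

def trade_gain(values):
    # Material swing for giving up a queen to win two bishops.
    return 2 * values["B"] - values["Q"]

print(trade_gain(mistuned))  # +40 cp: the engine happily trades Q for 2 B
print(trade_gain(sound))     # -290 cp: sound values reject the trade
```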
Jumbo got a boost of slightly more than 100 Elo points by correctly applying the method. I used the original description, except for calling eval() instead of qsearch() since I used the well-known "quiet-labeled.epd" dataset. Using step sizes greater than 1 resulted in different parameter values that caused Jumbo to play weaker than with those values resulting from tuning with step size 1.
There is no reason to believe that the data set I used does not contain sufficient information to tune piece values. In fact I tuned all my several hundreds of eval parameters at once and have been successful with it.
The attempt to only tune piece values with a different algorithm while using a much smaller data set may work or not ...
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by Desperado
You can compare 2 mse from different data, but that is wrong. The results of the tuner do not prove that what you are doing is right.

Ferdy wrote: ↑Tue Jan 19, 2021 7:15 am
I can compare the mse from 2 different data sets, but I have to know what is in those data sets, and of course I know what parameters I am trying to optimize.
The training position has a result field; if the result is 1-0, its target score is 1.0. Calculating the material-only evaluation score of the engine would, say, give you +300cp for a position with one knight ahead. Now convert that to a sigmoid and you get a value sig. The error is 1.0 - sig. As you increase the knight value, sig also increases, and if sig increases the error goes down.

Desperado wrote: ↑Mon Jan 18, 2021 5:52 pm
And what do you say to the fact that the values drift further and further apart the more efficiently the algorithm works?
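The quoted error computation is easy to check numerically. A small sketch with the standard Texel sigmoid (K = 1; the knight values are illustrative): for a 1-0 position that is one knight ahead, raising the knight value raises sig and therefore lowers the squared error.

```python
def sig(score_cp, K=1.0):
    # Maps a centipawn score to an expected result in [0, 1].
    return 1.0 / (1.0 + 10.0 ** (-K * score_cp / 400.0))

# Target result 1.0 (White won), position is one knight up:
for knight in (250, 300, 350):
    error = (1.0 - sig(knight)) ** 2
    print(knight, round(error, 4))
```

This is exactly why, on a data set where being ahead in material correlates strongly with winning, the tuner keeps pushing such values up.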
The values shown should also yield a smaller MSE for you. Can you please confirm or deny this? (more results)
On the contrary, it certainly introduces an error into your algorithm.
HGM gave more information on the effect.
That means you risk a regression if you do not ensure that the MSE on the total set is better, for sure!
You must at least adjust the basis for comparing two vectors: you need the MSE on the same data set for both vectors you are comparing. Only then are they comparable at all (this requirement is mandatory). In this case, and without considering the whole dataset, you could at least determine whether a new vector is better on this new dataset. (However, a regression is still not excluded.)
The properties of the positions are irrelevant in that context.
Don't get me wrong, I like the idea of exploring a bigger space of positions while keeping the sample size constant. Really nice idea!
But it requires an update of the MSE on the corresponding data. Then you can compare against it.
But what HGM did not point out is that there are algorithms that work on "your" idea.
You need to define a threshold function that controls the potential regression. The basic idea is that you would then be able to escape local optima.
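One such algorithm is threshold accepting, a deterministic cousin of simulated annealing: a candidate vector is accepted even when it is slightly worse, as long as the regression stays below a shrinking threshold, which is what lets the search leave local optima. A hypothetical sketch; the names and the schedule are illustrative, not from any tuner in this thread:

```python
import random

def threshold_accept(params, loss, thresholds=(0.01, 0.003, 0.001, 0.0),
                     iters=200, step=5, seed=0):
    """Random coordinate moves; tolerate regressions up to the current
    threshold, shrinking the threshold to 0 over the schedule."""
    rng = random.Random(seed)
    cur, cur_loss = list(params), loss(params)
    best, best_loss = list(cur), cur_loss
    for t in thresholds:
        for _ in range(iters):
            cand = list(cur)
            i = rng.randrange(len(cand))
            cand[i] += rng.choice((-step, step))
            cand_loss = loss(cand)
            if cand_loss - cur_loss <= t:      # accept small regressions too
                cur, cur_loss = cand, cand_loss
                if cur_loss < best_loss:       # but remember the best vector seen
                    best, best_loss = list(cur), cur_loss
    return best, best_loss
```

The threshold plays exactly the role described above: it bounds the regression that the tuner may accept while wandering out of a local minimum.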
Your second answer does not address my question at all. Sorry.
Either you present the MSE of the vectors I provided, or you show with a smaller MSE that the material values of a piece stay close together, e.g. pawn 40/60 instead of 10/110.
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by hgm (H G Muller; 27819 posts; joined Fri Mar 10, 2006; Amsterdam)
Except that you just told us that you did not correctly apply the method at all. Correctly applying it would be to minimize the MSE. Instead you sabotaged the optimizer to prevent it from finding the minimum MSE. As you now repeat, correctly applying the method would make the engine weaker.
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by Ferdy (4833 posts; joined Sun Aug 10, 2008; Philippines)
I know what I am doing.

Desperado wrote: ↑Tue Jan 19, 2021 9:38 am
You can compare 2 mse from different data, but that is wrong. The results of the tuner do not prove that what you are doing is right. [...]
I hope you fixed the bugs in your tuner
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by Desperado
Hi Sven,

Sven wrote: ↑Tue Jan 19, 2021 9:26 am
Jumbo got a boost of slightly more than 100 Elo points by correctly applying the method. [...] The attempt to only tune piece values with a different algorithm while using a much smaller data set may work or not ...
I agree with HGM on at least one point: it is not consistent to ignore results that you do not like. That makes the complete process random.
Of course you do not need to put BS material values into your engine just because they result in a lower MSE either. Another useful approach would be to find out why the tuner reports an obvious BS vector as best. You can try to improve your data, for example; that is useful too.
On the other hand, I agree with you that the Texel tuning method can help a lot. The difficulty is preparing the prerequisites.
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by Desperado
Ferdy wrote: ↑Tue Jan 19, 2021 10:03 am
I know what I am doing. [...] I hope you fixed the bugs in your tuner
That is not a serious or professional answer. A lot of people here know what they are doing.
It does not mean they don't make mistakes! If you don't trust me, then trust other experts. Ask them...
Keep an open mind and do not be offended.
Well, the point is that there are no bugs in my tuner. The longer the thread went on, the clearer it became that it is the data.
I showed that, on the given data, the tuner produces a lower MSE than other people (for example you) reported.
The only way to bring evidence is to report an MSE that is lower than what I reported and has "normal" values.
Since I provided these facts, you either do not answer factually or you ignore the questions. Please report facts, not statements like "I know what I am doing". Thanks a lot (especially for the effort and the numbers you already reported).
Last edited by Desperado on Tue Jan 19, 2021 10:30 am, edited 1 time in total.
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by Sven (Sven Schüle)
Parameter tuning without measuring playing strength is nonsense. And I have not seen any proof that "MSE(paramVector1) < MSE(paramVector2)" always implies "Elo(paramVector1) >= Elo(paramVector2)". Whether a parameter vector that results in a smaller MSE for a given data set really improves playing strength depends heavily on many properties of the engine. I would accept a statement like "then it is your engine that sucks", since Jumbo is still slightly below 2600 Elo @CCRL ... but it was even far below that level before tuning it.

hgm wrote: ↑Tue Jan 19, 2021 10:01 am
Except that you just told us that you did not correctly apply the method at all. [...]
Also I did not "sabotage the optimizer", that is another bullshit statement. I used the same algorithm and the same step size as the original author of Texel tuning.
Last edited by Sven on Tue Jan 19, 2021 10:28 am, edited 1 time in total.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by Sven (Sven Schüle)
My point was not that I did not like results; it was that I discarded tuning results due to a significant difference in playing strength.

Desperado wrote: ↑Tue Jan 19, 2021 10:16 am
Hi Sven, i agree in one point at least with HGM. It is not consequent to ignore results that you do not like. [...]
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Re: Tapered Evaluation and MSE (Texel Tuning)
Posted by Desperado
I agree, that is another subject. The translation into Elo is important and is measured in a completely different way.

Sven wrote: ↑Tue Jan 19, 2021 10:27 am
My point was not that I did not like results, it was that I discarded tuning results due to a significant difference in playing strength. [...]
Well, it is not the fault of the tuner or the tuning algorithm itself if a smaller MSE does not translate into better gameplay. At the same time I would always ask: why the hell do I get a lower MSE that does not do that? Your goal should be to maximize how often a lower MSE does help, and to close the gaps as often as possible. The efficiency will rise then, imho.
@HG if there is something wrong in the complete scenario, then it would certainly be the human-made assumption that the lowest MSE always leads to better gameplay. The tuner or the tuning algorithm only provides this information; it does not tell anybody that the result will perform better in gameplay. You simply do not ask it that question; you only ask what the best vector looks like for a given data set. That is basically a different question.
Last edited by Desperado on Tue Jan 19, 2021 11:03 am, edited 1 time in total.