1 Million Training Positions for Texel Tuning

Discussion of chess software programming and technical issues.

Moderator: Ras

JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

1 Million Training Positions for Texel Tuning

Post by JoAnnP38 »

I know that training data sets are probably a dime a dozen, but if you are just starting out and want some labeled data to use for Texel-based tuning, feel free to grab my training data.

There are 1,000,000 unique positions taken from OTB grandmaster games where each player has an Elo of at least 2500. In theory, only quiet positions are in the dataset, which I determined by comparing my evaluation function's result with my quiescence search's result. This is much too large to be used directly, but I plan to randomly sample from this file whenever I start a training or optimization session. Of course, for advanced programs, positions based on human games may not be ideal. Also, I skip the first 4 full moves. Here are a few lines from the comma-delimited file:

Code: Select all

Hash,Ply,GamePly,FEN,Score
20C2CC5E683EF587,8,106,rnbqkb1r/pp2pppp/3p1n2/8/3NP3/8/PPP2PPP/RNBQKB1R w KQkq - 1 5,0.0
DB55D4FCAADC775E,9,106,rnbqkb1r/pp2pppp/3p1n2/8/3NP3/2N5/PPP2PPP/R1BQKB1R b KQkq - 2 5,0.0
09A2250F4DFC8F82,10,106,rnbqkb1r/1p2pppp/p2p1n2/8/3NP3/2N5/PPP2PPP/R1BQKB1R w KQkq - 0 6,0.0
7AAA9EE3CCC6940C,11,106,rnbqkb1r/1p2pppp/p2p1n2/8/3NP3/2N1B3/PPP2PPP/R2QKB1R b KQkq - 1 6,0.0
0CDD6CD2D4E4045E,12,106,rnbqkb1r/1p3ppp/p2ppn2/8/3NP3/2N1B3/PPP2PPP/R2QKB1R w KQkq - 0 7,0.0
7F5DEF2BD5C14EFD,13,106,rnbqkb1r/1p3ppp/p2ppn2/8/3NP3/2N1B3/PPP1BPPP/R2QK2R b KQkq - 1 7,0.0
7942D00AB98A01FC,14,106,rnb1kb1r/1pq2ppp/p2ppn2/8/3NP3/2N1B3/PPP1BPPP/R2QK2R w KQkq - 2 8,0.0
I haven't started to use this data yet (but soon), so it is entirely possible that you will find some issues with it. Please let me know and I can quickly correct them. BTW, the Score column is actually the game result (i.e. 1.0 white wins, 0.5 draw, 0.0 black wins). I'm not 100% certain that is the best way to represent it, but I should have a better idea in the coming days.
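For anyone new to this: in Texel tuning the Score/result column is compared against a sigmoid of the engine's evaluation, and the weights are adjusted to minimize the squared error. A minimal Go sketch of consuming one record of this file (the scaling constant K and the 35 cp evaluation are illustrative placeholders, not values from the dataset; K is normally fit to the data first):

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// sigmoid maps a centipawn score onto [0,1] so it can be compared
// with a game result (1.0 win, 0.5 draw, 0.0 loss). K is the usual
// Texel scaling constant, found by minimizing error with weights fixed.
func sigmoid(scoreCp, k float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, -k*scoreCp/400.0))
}

// parseLine splits one record of the comma-delimited file above
// (Hash,Ply,GamePly,FEN,Score) into the FEN and the game result.
func parseLine(line string) (fen string, result float64) {
	fields := strings.SplitN(line, ",", 5)
	fen = fields[3]
	fmt.Sscanf(fields[4], "%f", &result)
	return fen, result
}

func main() {
	fen, res := parseLine("20C2CC5E683EF587,8,106,rnbqkb1r/pp2pppp/3p1n2/8/3NP3/8/PPP2PPP/RNBQKB1R w KQkq - 1 5,0.0")
	// squared-error contribution of this position for some eval score;
	// 35 cp is a made-up evaluation, K=1.13 a placeholder constant
	e := res - sigmoid(35.0, 1.13)
	fmt.Println(fen, e*e)
}
```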
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: 1 Million Training Positions for Texel Tuning

Post by dangi12012 »

I've been looking for something like this and used newer TCEC Stockfish and Komodo games. There are many MB of quiet positions in there, and the engines are rated 3300+ Elo.

From a software engineering perspective you should train on engine games, and really only on games between the best engines, for quiescent game outcomes.
Also take symmetry into account, so that your "unique" positions really are unique.

So IMO you should think carefully about your training data: use engine games, and remove results decided by time loss (a winning position that was lost on time should not be trained as a bad position).

https://github.com/TCEC-Chess/tcecgames
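The time-loss filter could be as simple as scanning each game's PGN tag section. The sketch below assumes the standard PGN "Termination" tag with the value "time forfeit"; TCEC archives may encode this differently (e.g. in their own tags), so adjust the check to the actual files:

```go
package main

import (
	"fmt"
	"strings"
)

// keepGame reports whether a game's PGN tag section suggests the
// result reflects play rather than a clock artifact. It assumes the
// standard PGN "Termination" tag; check how your source encodes
// terminations before relying on this.
func keepGame(tagSection string) bool {
	for _, line := range strings.Split(tagSection, "\n") {
		if strings.HasPrefix(line, "[Termination ") &&
			strings.Contains(line, "time forfeit") {
			return false
		}
	}
	return true
}

func main() {
	tags := "[Event \"TCEC\"]\n[Termination \"time forfeit\"]"
	// excluded: the winning side may have lost on time
	fmt.Println(keepGame(tags))
}
```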
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
alessandro
Posts: 52
Joined: Tue Aug 12, 2014 11:21 am
Location: Lund
Full name: Alessandro Iavicoli

Re: 1 Million Training Positions for Texel Tuning

Post by alessandro »

JoAnnP38 wrote: Thu Feb 23, 2023 2:35 pm I know that training data sets are probably a dime a dozen, but if you are just starting out and want some labeled data to use for Texel-based tuning, feel free to grab my training data.

There are 1,000,000 unique positions taken from OTB grandmaster games where each player has an Elo of at least 2500. In theory, only quiet positions are in the dataset, which I determined by comparing my evaluation function's result with my quiescence search's result. This is much too large to be used directly, but I plan to randomly sample from this file whenever I start a training or optimization session. Of course, for advanced programs, positions based on human games may not be ideal. Also, I skip the first 4 full moves. Here are a few lines from the comma-delimited file:

Code: Select all

Hash,Ply,GamePly,FEN,Score
20C2CC5E683EF587,8,106,rnbqkb1r/pp2pppp/3p1n2/8/3NP3/8/PPP2PPP/RNBQKB1R w KQkq - 1 5,0.0
DB55D4FCAADC775E,9,106,rnbqkb1r/pp2pppp/3p1n2/8/3NP3/2N5/PPP2PPP/R1BQKB1R b KQkq - 2 5,0.0
09A2250F4DFC8F82,10,106,rnbqkb1r/1p2pppp/p2p1n2/8/3NP3/2N5/PPP2PPP/R1BQKB1R w KQkq - 0 6,0.0
7AAA9EE3CCC6940C,11,106,rnbqkb1r/1p2pppp/p2p1n2/8/3NP3/2N1B3/PPP2PPP/R2QKB1R b KQkq - 1 6,0.0
0CDD6CD2D4E4045E,12,106,rnbqkb1r/1p3ppp/p2ppn2/8/3NP3/2N1B3/PPP2PPP/R2QKB1R w KQkq - 0 7,0.0
7F5DEF2BD5C14EFD,13,106,rnbqkb1r/1p3ppp/p2ppn2/8/3NP3/2N1B3/PPP1BPPP/R2QK2R b KQkq - 1 7,0.0
7942D00AB98A01FC,14,106,rnb1kb1r/1pq2ppp/p2ppn2/8/3NP3/2N1B3/PPP1BPPP/R2QK2R w KQkq - 2 8,0.0
I haven't started to use this data yet (but soon), so it is entirely possible that you will find some issues with it. Please let me know and I can quickly correct them. BTW, the Score column is actually the game result (i.e. 1.0 white wins, 0.5 draw, 0.0 black wins). I'm not 100% certain that is the best way to represent it, but I should have a better idea in the coming days.
Thanks for sharing your data with us. I suggest filtering the positions anyway, keeping only those where the score from the static evaluation and the score from a quiescence search are the same, using your own quiescence search. I've seen that my quiescence still finds discrepancies in some positions (that's the beauty of it: different implementations = different results).
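The filter described above is a one-liner once you have your engine's eval and quiescence search. In this sketch, Board, Evaluate, and Quiescence are toy stand-ins for your own engine's types and functions:

```go
package main

import "fmt"

// Board is a stand-in for your engine's position type; the field just
// lets the toy Quiescence below return something different from Evaluate.
type Board struct{ pendingCaptureGain int }

const Infinity = 1 << 20

// Evaluate is a placeholder static evaluation.
func Evaluate(b *Board) int { return 0 }

// Quiescence is a placeholder quiescence search; a real one resolves
// captures (and possibly checks) before scoring.
func Quiescence(b *Board, alpha, beta int) int { return b.pendingCaptureGain }

// isQuiet keeps only positions where static eval and quiescence agree,
// i.e. no pending tactics would change the score.
func isQuiet(b *Board) bool {
	return Evaluate(b) == Quiescence(b, -Infinity, Infinity)
}

func main() {
	fmt.Println(isQuiet(&Board{pendingCaptureGain: 0}))   // quiet position
	fmt.Println(isQuiet(&Board{pendingCaptureGain: 300})) // a capture hangs
}
```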
--
AdaChess - Smart Chess Engine - https://github.com/adachess/AdaChess

User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: 1 Million Training Positions for Texel Tuning

Post by algerbrex »

JoAnnP38 wrote: Thu Feb 23, 2023 2:35 pm This is much too large to be used directly, ...
Just a note. If you haven't already started optimizing your tuner, you'll be happy to know there is definitely room for improvement :) I've managed to get Blunder's tuner to comfortably handle 1M+ positions when tuning, and it finishes even several thousand iterations in a reasonable amount of time (10-20 minutes).

The biggest improvement is to use a sparse array for the feature vector of a position. If you're not sure what this all means, I'll be happy to explain more because I'm not entirely sure about all your implementation details and how idiosyncratic they may be.
Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Re: 1 Million Training Positions for Texel Tuning

Post by Whiskers »

algerbrex wrote: Fri Feb 24, 2023 6:57 pm
JoAnnP38 wrote: Thu Feb 23, 2023 2:35 pm This is much too large to be used directly, ...
Just a note. If you haven't already started optimizing your tuner, you'll be happy to know there is definitely room for improvement :) I've managed to get Blunder's tuner to comfortably handle 1M+ positions when tuning, and it finishes even several thousand iterations in a reasonable amount of time (10-20 minutes).

The biggest improvement is to use a sparse array for the feature vector of a position. If you're not sure what this all means, I'll be happy to explain more because I'm not entirely sure about all your implementation details and how idiosyncratic they may be.
Forgive me if I'm getting this totally wrong, but does this basically mean
1. reformatting your dataset so that instead of FENs, it contains the weights that apply to the evaluation of that particular position, and
2. rewriting your evaluation function so that instead of using a board it just uses the weights to make an evaluation? Something like: for (i = 0; i < N; i++) { eval += params[i] * weights[i]; }
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: 1 Million Training Positions for Texel Tuning

Post by algerbrex »

Whiskers wrote: Fri Feb 24, 2023 10:46 pm
algerbrex wrote: Fri Feb 24, 2023 6:57 pm
JoAnnP38 wrote: Thu Feb 23, 2023 2:35 pm This is much too large to be used directly, ...
Just a note. If you haven't already started optimizing your tuner, you'll be happy to know there is definitely room for improvement :) I've managed to get Blunder's tuner to comfortably handle 1M+ positions when tuning, and it finishes even several thousand iterations in a reasonable amount of time (10-20 minutes).

The biggest improvement is to use a sparse array for the feature vector of a position. If you're not sure what this all means, I'll be happy to explain more because I'm not entirely sure about all your implementation details and how idiosyncratic they may be.
Forgive me if I'm getting this totally wrong, but does this basically mean
1. reformatting your dataset so that instead of FENs, it contains the weights that apply to the evaluation of that particular position, and
2. rewriting your evaluation function so that instead of using a board it just uses the weights to make an evaluation? Something like: for (i = 0; i < N; i++) { eval += params[i] * weights[i]; }

Actually, you don't need to reformat your data set! Your dataset can stay exactly the same.

As you wrote in your second point, that approach is commonly used to model the evaluation using just two vectors. The set of evaluation weights is treated as one long vector, and the challenge is then to find a feature vector for each position such that the dot product (pairwise element multiplication, then summing the products) between the weight vector and this feature vector gives you the correct evaluation score. You don't have to model the evaluation this way, but it makes computing the gradients fairly straightforward.

The feature vector isn't terribly hard to compute, but it definitely takes some thought. The way I've figured out how to compute mine for the current version of Blunder I'm writing is to sit down and break down the main expression of my evaluation (mgPhase * mgScore + egPhase * egScore), and then ask myself how I can rework that expression into the dot product between a weight vector and a feature vector. Let me know if you want me to elaborate on this more.
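One way to see that reworking (my own sketch, not necessarily Blunder's exact layout): each weight w appears in the tapered eval as (phase * rawCount) * w, so the feature value for a weight is just its raw count scaled by the game phase, with the weight vector laid out as all middlegame weights followed by all endgame weights.

```go
package main

import "fmt"

// tunedEval evaluates a position as a dot product. counts holds the raw
// feature counts (e.g. net pawn count); weights is laid out as
// [mg weights..., eg weights...]. The feature value paired with each
// weight is phase * count, which recovers mgPhase*mgScore + egPhase*egScore.
func tunedEval(weights, counts []float64, mgPhase, egPhase float64) float64 {
	n := len(counts)
	eval := 0.0
	for i := 0; i < n; i++ {
		eval += mgPhase * counts[i] * weights[i]   // middlegame half
		eval += egPhase * counts[i] * weights[n+i] // endgame half
	}
	return eval
}

func main() {
	// one feature: net pawn count (white - black) of 2,
	// with mg weight 100 and eg weight 120, at 75% middlegame
	fmt.Println(tunedEval([]float64{100, 120}, []float64{2}, 0.75, 0.25))
	// 0.75*2*100 + 0.25*2*120 = 210
}
```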

Once you've figured out how to compute this feature vector, you'll likely realize that the majority of the entries in the vector are 0s. This wastes time when computing the dot product, since those indices contribute nothing to the final product. So what you can do instead is pre-process the feature vector: go through and pick out every index with a non-zero entry, and use a tuple-like data structure to add the non-zero entry, along with the index it occurred at, to a new array.

To compute the dot product you can then just iterate over this feature array, which will be much smaller, and perform the dot product by multiplying the value of each feature with the weight at the index saved alongside the feature's value. In my code this looks something like this:

Code: Select all

func computeDotProduct(weights []float64, features []Feature) (sum float64) {
	for _, feature := range features {
		sum += weights[feature.Index] * feature.Value
	}
	return sum
}
Where Feature looks like:

Code: Select all

type Feature struct {
	Value float64
	Index uint16
}
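The pre-processing step described above (dense vector in, sparse feature list out) might look like the following. The helper name toSparse is mine, not Blunder's:

```go
package main

import "fmt"

// Feature pairs a non-zero feature value with its index in the
// dense feature vector (same layout as in the post above).
type Feature struct {
	Value float64
	Index uint16
}

// toSparse keeps only the non-zero entries of a dense feature vector,
// each paired with its original index, so the dot product later only
// touches weights that actually matter for the position.
func toSparse(dense []float64) []Feature {
	var features []Feature
	for i, v := range dense {
		if v != 0 {
			features = append(features, Feature{Value: v, Index: uint16(i)})
		}
	}
	return features
}

func main() {
	dense := []float64{0, 0, 2, 0, -1} // mostly zeros, as is typical
	fmt.Println(toSparse(dense))       // only indices 2 and 4 survive
}
```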
Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Re: 1 Million Training Positions for Texel Tuning

Post by Whiskers »

algerbrex wrote: Sat Feb 25, 2023 6:39 am
Whiskers wrote: Fri Feb 24, 2023 10:46 pm
algerbrex wrote: Fri Feb 24, 2023 6:57 pm
JoAnnP38 wrote: Thu Feb 23, 2023 2:35 pm This is much too large to be used directly, ...
Just a note. If you haven't already started optimizing your tuner, you'll be happy to know there is definitely room for improvement :) I've managed to get Blunder's tuner to comfortably handle 1M+ positions when tuning, and it finishes even several thousand iterations in a reasonable amount of time (10-20 minutes).

The biggest improvement is to use a sparse array for the feature vector of a position. If you're not sure what this all means, I'll be happy to explain more because I'm not entirely sure about all your implementation details and how idiosyncratic they may be.
Forgive me if I'm getting this totally wrong, but does this basically mean
1. reformatting your dataset so that instead of FENs, it contains the weights that apply to the evaluation of that particular position, and
2. rewriting your evaluation function so that instead of using a board it just uses the weights to make an evaluation? Something like: for (i = 0; i < N; i++) { eval += params[i] * weights[i]; }

Actually, you don't need to reformat your data set! Your dataset can stay exactly the same.

As you wrote in your second point, that approach is commonly used to model the evaluation using just two vectors. The set of evaluation weights is treated as one long vector, and the challenge is then to find a feature vector for each position such that the dot product (pairwise element multiplication, then summing the products) between the weight vector and this feature vector gives you the correct evaluation score. You don't have to model the evaluation this way, but it makes computing the gradients fairly straightforward.

The feature vector isn't terribly hard to compute, but it definitely takes some thought. The way I've figured out how to compute mine for the current version of Blunder I'm writing is to sit down and break down the main expression of my evaluation (mgPhase * mgScore + egPhase * egScore), and then ask myself how I can rework that expression into the dot product between a weight vector and a feature vector. Let me know if you want me to elaborate on this more.

Once you've figured out how to compute this feature vector, you'll likely realize that the majority of the entries in the vector are 0s. This wastes time when computing the dot product, since those indices contribute nothing to the final product. So what you can do instead is pre-process the feature vector: go through and pick out every index with a non-zero entry, and use a tuple-like data structure to add the non-zero entry, along with the index it occurred at, to a new array.

To compute the dot product you can then just iterate over this feature array, which will be much smaller, and perform the dot product by multiplying the value of each feature with the weight at the index saved alongside the feature's value. In my code this looks something like this:

Code: Select all

func computeDotProduct(weights []float64, features []Feature) (sum float64) {
	for _, feature := range features {
		sum += weights[feature.Index] * feature.Value
	}
	return sum
}
Where Feature looks like:

Code: Select all

type Feature struct {
	Value float64
	Index uint16
}
How is that so much faster than a normal evaluation if you still have to pick out all the weights from the current board state and don't know them beforehand? That's what my "reformatting the dataset" point was meant to address - so that you can just read "w[0] = 1, w[3] = -2, w[45] = 1, w[37] = -1" off of the file rather than looking through the board for passed pawns.
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: 1 Million Training Positions for Texel Tuning

Post by algerbrex »

Whiskers wrote: Sat Feb 25, 2023 7:53 am How is that so much faster than a normal evaluation if you still have to pick out all the weights from the current board state and don't know them beforehand? That's what my "reformatting the dataset" point was meant to address - so that you can just read "w[0] = 1, w[3] = -2, w[45] = 1, w[37] = -1" off of the file rather than looking through the board for passed pawns.
I'm not quite sure I follow what you mean.

The process I'm describing involves reading over a dataset file of quiet FENs. These FENs are read and converted into an array of objects that store the features of each FEN and the outcome of the game it came from. This array of objects is then iterated over to compute the gradient of the cost function.
The computation of each feature vector can be a bit involved, but creating this array of objects is all done before any tuning starts, so there's no per-iteration overhead, as there would be if the features of each position were computed from scratch every time we compute the gradient. And notice we never modify the dataset file.

The code I showed is faster because normally you just compute the dot product between the full feature vector for a particular position and the weight vector. But since the feature vector for a position is usually mostly zeros, time can be saved when computing the dot product by using the approach I showed in my last post.
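The sparsity pays off again in the gradient: each position only touches the handful of gradient entries whose features are non-zero. A sketch assuming the usual Texel sigmoid/MSE setup (names and the constant folding are mine, not Blunder's exact code):

```go
package main

import (
	"fmt"
	"math"
)

// Feature and TuningPosition mirror the structs from the posts above.
type Feature struct {
	Value float64
	Index uint16
}

type TuningPosition struct {
	Features []Feature
	Outcome  float64
}

func sigmoid(x, k float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, -k*x/400.0))
}

// accumulateGradient adds one position's contribution to the gradient of
// the squared error (outcome - sigmoid(eval))^2, with constant factors
// folded into the learning rate. Only the indices stored in the sparse
// feature list are touched, both for the eval and for the gradient.
func accumulateGradient(grad, weights []float64, pos TuningPosition, k float64) {
	eval := 0.0
	for _, f := range pos.Features {
		eval += weights[f.Index] * f.Value
	}
	s := sigmoid(eval, k)
	coeff := -2 * (pos.Outcome - s) * s * (1 - s)
	for _, f := range pos.Features {
		grad[f.Index] += coeff * f.Value
	}
}

func main() {
	grad := make([]float64, 4)
	weights := []float64{100, 0, 50, 0}
	pos := TuningPosition{Features: []Feature{{Value: 1, Index: 0}}, Outcome: 1.0}
	accumulateGradient(grad, weights, pos, 1.0)
	fmt.Println(grad) // only grad[0] is non-zero
}
```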

Or am I still missing your question?
Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Re: 1 Million Training Positions for Texel Tuning

Post by Whiskers »

algerbrex wrote: Sat Feb 25, 2023 8:17 am
Whiskers wrote: Sat Feb 25, 2023 7:53 am How is that so much faster than a normal evaluation if you still have to pick out all the weights from the current board state and don't know them beforehand? That's what my "reformatting the dataset" point was meant to address - so that you can just read "w[0] = 1, w[3] = -2, w[45] = 1, w[37] = -1" off of the file rather than looking through the board for passed pawns.
I'm not quite sure I follow what you mean.

The process I'm describing involves reading over a dataset file of quiet FENs. These FENs are read and converted into an array of objects that store the features of each FEN and the outcome of the game it came from. This array of objects is then iterated over to compute the gradient of the cost function.
The computation of each feature vector can be a bit involved, but creating this array of objects is all done before any tuning starts, so there's no per-iteration overhead, as there would be if the features of each position were computed from scratch every time we compute the gradient. And notice we never modify the dataset file.

The code I showed is faster because normally you just compute the dot product between the full feature vector for a particular position and the weight vector. But since the feature vector for a position is usually mostly zeros, time can be saved when computing the dot product by using the approach I showed in my last post.

Or am I still missing your question?
I’m sorry if we’re talking past each other, lol. It’s the “convert FENs to an array of objects” part that seems like it would take the longest. After all, don’t you still need to perform calculations for, say, mobility?
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: 1 Million Training Positions for Texel Tuning

Post by algerbrex »

Whiskers wrote: Sat Feb 25, 2023 4:36 pm I’m sorry if we’re talking past each other, lol. It’s the “convert FENs to an array of objects” part that seems like it would take the longest. After all, don’t you still need to perform calculations for, say, mobility?
Ah, ok.

So, when I talk about reading in the FENs and creating an array of objects, I mean that we read in each FEN, extract the "features" of that position, record the outcome, and store all of this in a position object. This is what that looks like in my code:

Code: Select all

type TuningPosition struct {
	Features []Feature
	Outcome  float64
}
That feature vector (read: list) is what we use to compute the evaluation of each position, by taking the dot product between it and the weight vector.

Now, you're correct, it takes time to extract the features from each position. But (1) the important thing is that this is done only once, at the beginning of the program, and from that point onward you have a very compact representation of all the FENs that lets you quickly compute the evaluation. And (2) this extraction of features actually doesn't take that long: when I start up my tuner it takes about 20s to convert the FEN dataset into an array of TuningPosition objects, and after that we never have to worry about feature extraction again.
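Putting the pieces of this thread together, the one-time preprocessing loop might be sketched like this. Here extractFeatures is a toy stand-in (it only counts net pawns) for a real engine's feature extraction, and parseLine assumes the CSV layout from the first post:

```go
package main

import (
	"fmt"
	"strings"
)

type Feature struct {
	Value float64
	Index uint16
}

type TuningPosition struct {
	Features []Feature
	Outcome  float64
}

// parseLine splits a record of the CSV from the first post
// (Hash,Ply,GamePly,FEN,Score) into the FEN and the game result.
func parseLine(line string) (string, float64) {
	fields := strings.SplitN(line, ",", 5)
	var outcome float64
	fmt.Sscanf(fields[4], "%f", &outcome)
	return fields[3], outcome
}

// extractFeatures stands in for a real engine's feature extraction
// (material, PSQT indices, mobility, ...). Here it produces a single
// feature: white minus black pawn count, dropped when zero (sparsity).
func extractFeatures(fen string) []Feature {
	board := strings.Fields(fen)[0]
	diff := float64(strings.Count(board, "P") - strings.Count(board, "p"))
	if diff == 0 {
		return nil
	}
	return []Feature{{Value: diff, Index: 0}}
}

// loadDataset performs the one-time conversion described above: each
// FEN becomes a compact TuningPosition up front, so the tuner never
// parses a FEN again while iterating.
func loadDataset(lines []string) []TuningPosition {
	positions := make([]TuningPosition, 0, len(lines))
	for _, line := range lines {
		fen, outcome := parseLine(line)
		positions = append(positions, TuningPosition{
			Features: extractFeatures(fen),
			Outcome:  outcome,
		})
	}
	return positions
}

func main() {
	lines := []string{
		"20C2CC5E683EF587,8,106,rnbqkb1r/pp2pppp/3p1n2/8/3NP3/8/PPP2PPP/RNBQKB1R w KQkq - 1 5,0.0",
	}
	ds := loadDataset(lines)
	fmt.Println(len(ds), ds[0].Outcome, ds[0].Features)
}
```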