I have noticed that the weights in a polyglot bin book (created with polyglot) depend on the order of games in the pgn collection used to build the book.
I took the file gm2016.pgn from the database at http://www.nk-qy.info/40h/. It contains 137405 games. The games in the file are ordered by ECO. I created a polyglot book with default parameters. I then ordered the pgn by date (using scid), exported it, and created another polyglot book. I loaded both books in scid to compare the weights. Here is what I found.
Ordered by date: 1.e4 42%, 1.d4 37%, 1.Nf3 12%, 1.c4 7%, ...
Ordered by ECO: 1.d4 78%, 1.Nf3 10%, 1.c4 6%, 1.e4 6%
I also did a similar experiment with jja and with other databases, and found that the weights are badly distorted if the pgn is ordered by ECO. It seems that the later ECO codes (1.d4, 1.Nf3) dominate the weights if the pgn is ordered by ascending ECO codes. When ordered by date, the ECO codes get sufficiently randomised, and the weights are what I would expect.
Is this a bug?
polyglot weights bug?
-
- Posts: 348
- Joined: Thu Jul 21, 2022 12:30 am
- Full name: Chesskobra
-
- Posts: 1632
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: polyglot weights bug?
Polyglot uses a pretty simple algorithm; if this is happening, it must be a bug. Maybe Scid messed up the database somehow?
-
- Posts: 348
- Joined: Thu Jul 21, 2022 12:30 am
- Full name: Chesskobra
Re: polyglot weights bug?
It is unlikely that scid is messing things up. I have now ordered the games by descending ECO (E99 to A00), and the weights are now
1.Nf3 28%, 1.e4 25%, 1.d4 22%, 1.c4 20%.
This again supports my suspicion that the games that appear later in the pgn file dominate the weights.
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: polyglot weights bug?
Well, the weights are 16 bit, and they count half-points. So after reaching 32K they would overflow.
I am not sure how Polyglot handles this. It could be it just renormalizes the weights it obtained so far, by dividing those by 2. That means indeed that later games will get more weight.
I am not sure what else it could do. Ignoring the overflow would cause the relative weights to be basically random, as some weights would have been overflowing, and others not. There is no easy fix for this, as in principle any position could be visited so often that (some of) its weights will overflow, depending on the PGN collection you feed it.
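The suspected halve-on-overflow behaviour is easy to simulate. Below is a toy sketch, not Polyglot's actual code: the 16-bit cap and the rule "halve all of a position's weights when one would overflow" are assumptions, used here only to show that games fed in later end up dominating.

```python
# Toy model: weights are 16-bit half-point counters; when one would
# exceed 0xFFFF, every weight for that position is halved.
MAX16 = 0xFFFF

def add_points(weights, move, half_points):
    if weights.get(move, 0) + half_points > MAX16:
        for m in weights:          # renormalize: halve all counters
            weights[m] //= 2
    weights[move] = weights.get(move, 0) + half_points

weights = {}
# 60000 wins with 1.d4 first, then 60000 wins with 1.e4 (2 half-points each),
# mimicking a PGN sorted so that one opening comes entirely before the other
for _ in range(60000):
    add_points(weights, "d4", 2)
for _ in range(60000):
    add_points(weights, "e4", 2)

total = sum(weights.values())
print({m: round(100 * w / total) for m, w in weights.items()})
```

Although both blocks contain the same number of identically scored games, the later-fed e4 block ends up with roughly four times the weight of d4, because each overflow halving discounts everything accumulated before it.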
-
- Posts: 348
- Joined: Thu Jul 21, 2022 12:30 am
- Full name: Chesskobra
Re: polyglot weights bug?
Thank you for the explanation. So if I understand correctly, if the pgn is ordered so that opening sequences are randomised, then despite the overflow we get reasonable weights. Also, a few plies from the start position, most positions probably appear fewer than 32k times in the database, so their weights may be quite reliable. Perhaps weights could be assigned only to positions beyond the first few plies, and the weights for positions within those initial plies could then be constructed from the weights downstream. Or the normalised weights could be calculated as the games are processed. I am not sure whether any of this is feasible.
-
- Posts: 1632
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: polyglot weights bug?
hgm wrote: ↑Tue Aug 01, 2023 9:56 pm
Well, the weights are 16 bit, and they count half-points. So after reaching 32K they would overflow.
I am not sure how Polyglot handles this. It could be it just renormalizes the weights it obtained so far, by dividing those by 2. That means indeed that later games will get more weight.
I am not sure what else it could do. Ignoring the overflow would cause the relative weights to be basically random, as some weights would have been overflowing, and others not. There is no easy fix for this, as in principle any position could be visited so often that (some of) its weights will overflow, depending on the PGN collection you feed it.
It would be better to use 32-bit (or 64-bit) weights internally when generating the database and to normalize these when they are stored to disk.
I remember now that a very long time ago I had a discussion with Fabien about this, and that he made a version of Polyglot which uses 32 bit weights internally that could only be used to generate databases. Probably I still have the source somewhere in a backup.
Many years back I made my own Polyglot book generator which doesn't exhibit this behavior, so I completely forgot this problem exists.
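The fix described above could look something like this (a hypothetical sketch, not the code Fabien wrote): accumulate half-points in wide integers while reading the PGN, and only scale each position's counts down to the 16-bit range when the book is written out, so the relative proportions survive intact.

```python
# Accumulate raw half-point counts in Python's unbounded ints (or 32/64-bit
# ints in C), then scale down per position at write time.

def scale_to_16bit(counts):
    """Scale a position's raw counts so the largest fits in 16 bits,
    preserving the relative proportions as closely as integer math allows."""
    top = max(counts.values())
    if top <= 0xFFFF:
        return dict(counts)
    # keep at least weight 1 so a rarely played move is not dropped entirely
    return {move: max(1, (c * 0xFFFF) // top) for move, c in counts.items()}

# made-up raw counts for one position, far beyond the 16-bit range
raw = {"e4": 5_100_000, "d4": 4_500_000, "Nf3": 1_400_000}
print(scale_to_16bit(raw))   # {'e4': 65535, 'd4': 57825, 'Nf3': 17990}
```

The scaled weights keep the same ratios as the raw counts regardless of the order in which the games were processed, which is exactly what the overflow halving destroys.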
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: polyglot weights bug?
Well, the Polyglot algorithm for building the book is extremely primitive; it just counts points achieved with the move without paying attention to the number of plays. So a move that was played 1000 times, with a 10% score gets a weight of 200, and the chances it will be played are twice as large as a move that was played 100 times with a 50% score (which gets a weight of 100). In other words, it completely goes by the confidence that the players appeared to put in the moves, completely ignoring the actual performance.
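The scheme described above, with the two example moves, can be written out directly (a minimal sketch of the counting rule, not Polyglot's source):

```python
# A move's weight is simply the half-points scored with it:
# 2 per win, 1 per draw, 0 per loss - the number of plays is ignored.

def polyglot_weight(wins, draws, losses):
    return 2 * wins + draws   # losses contribute nothing

# played 1000 times with a 10% score (50 wins, 100 draws, 850 losses)
print(polyglot_weight(wins=50, draws=100, losses=850))   # 200
# played 100 times with a 50% score (25 wins, 50 draws, 25 losses)
print(polyglot_weight(wins=25, draws=50, losses=25))     # 100
```

So the heavily played 10% move gets twice the weight of the 50% move, exactly as the example states: the weight tracks popularity-times-score, not performance.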
A more sensible algorithm would calculate performance, and not just by statistics of the position itself, but by minimaxing that statistics through the tree. And then take some weighted average between performance and player preference, where the weight of the latter depends on player Elo.
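The minimaxing idea could be sketched as follows (hypothetical; the tree layout and scores are made up): instead of using a move's raw historical score, back it up from the opponent's best reply, so a line that was later refuted scores badly even though its accumulated statistics look fine.

```python
def minimax_scores(tree, node):
    """tree maps a node id to {move: (raw_score, child_node_or_None)};
    scores are expected points for the side to move, in [0, 1]."""
    result = {}
    for move, (raw, child) in tree[node].items():
        if child is not None and child in tree:
            # the opponent picks the best reply; our score is the complement
            result[move] = 1.0 - max(minimax_scores(tree, child).values())
        else:
            result[move] = raw   # leaf: fall back on the raw statistic
    return result

# tiny made-up tree: 1.e4 scored 55% historically, but a strong reply
# scoring 0.70 for the opponent was later found
tree = {
    "start":    {"e4": (0.55, "after_e4"), "d4": (0.54, None)},
    "after_e4": {"c5": (0.70, None), "e5": (0.45, None)},
}
scores = minimax_scores(tree, "start")
print({m: round(s, 2) for m, s in scores.items()})
```

After backing up, e4's score drops from 0.55 to 0.30, below d4's, which the raw per-position statistics would never reveal.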
-
- Posts: 1632
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: polyglot weights bug?
I agree, there are a lot of things that could be improved. The book generator I made in the past uses the Elo ratings of the players and adjusts the weights by means of Elo's percentage expectancy formula. Normally I use 5 to 10 million games to build my book. I've been thinking about minimaxing it, but I'm afraid that it would take too long.
Often the most frequently played move is also the best move, a book based on pure statistics is not so bad at all.
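The post does not spell out the adjustment, but one plausible sketch (the reference rating of 2200 is an assumption, not something stated above) scales each game's contribution by the player's expected score against a fixed reference, so moves chosen by stronger players count for more:

```python
# Standard Elo percentage expectancy formula.
def elo_expectancy(rating, opponent):
    """Expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))

# hypothetical per-game weight factors against a 2200 reference rating
print(round(elo_expectancy(2700, 2200), 2))   # 0.95
print(round(elo_expectancy(2300, 2200), 2))   # 0.64
```

With factors like these, a 2700 player's move choice contributes roughly half again as much weight as a 2300 player's, without minimaxing anything.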
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: polyglot weights bug?
You can get away with it if the PGN doesn't span a large time interval. Otherwise there is the danger that at some point in the interval a refutation was discovered for a line that used to be popular because of its good score, after which the line quickly went out of fashion. The accumulated statistics for that line still look good, even though it is refuted.