polyglot weights bug?

chesskobra
Posts: 348
Joined: Thu Jul 21, 2022 12:30 am
Full name: Chesskobra

Post by chesskobra »

I have noticed that the weights in a polyglot bin book (created with polyglot) depend on the order of games in the pgn collection used to build the book.

I took the file gm2016.pgn from the database at http://www.nk-qy.info/40h/. It contains 137405 games, ordered by ECO. I created a polyglot book with default parameters. I then sorted the pgn by date (using scid), exported it, and created another polyglot book. I loaded both books in scid to see the weights. Here is what I found.

Ordered by date: 1.e4 42%, 1.d4 37%, 1.Nf3 12%, 1.c4 7%, ...

Ordered by ECO: 1.d4 78%, 1.Nf3 10%, 1.c4 6%, 1.e4 6%

I also did a similar experiment with jja and with other databases, and found that the weights are very distorted if the pgn is ordered by ECO. It seems that the later ECO codes (1.d4, 1.Nf3) dominate the weights if the pgn is ordered by ascending ECO code. When the pgn is ordered by date, the ECO codes get sufficiently randomised, and the weights are what I would expect.

Is this a bug?
Joost Buijs
Posts: 1632
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Post by Joost Buijs »

Polyglot uses a pretty simple algorithm; if this is happening, it must be a bug. Maybe Scid messed up the database somehow?

Post by chesskobra »

It is unlikely that scid is messing up. I have now ordered the games by descending ECO (E99 to A00), and the weights are

1.Nf3 28%, 1.e4 25%, 1.d4 22%, 1.c4 20%.

This again supports my suspicion that the games that appear later in the pgn file dominate the weights.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Post by hgm »

Well, the weights are 16 bit, and they count half-points. So after reaching 32K they would overflow.

I am not sure how Polyglot handles this. It could be it just renormalizes the weights it obtained so far, by dividing those by 2. That means indeed that later games will get more weight.

I am not sure what else it could do. Ignoring the overflow would cause the relative weights to be basically random, as some weights would have been overflowing, and others not. There is no easy fix for this, as in principle any position could be visited so often that (some of) its weights will overflow, depending on the PGN collection you feed it.
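To make the order dependence concrete, here is a minimal sketch of the halving scheme guessed at above. It is not Polyglot's actual code: the cap of 1000 and the names `build_weights` and `shares` are assumptions, chosen small so the effect shows up with only a few thousand games.

```python
COUNT_MAX = 1000  # hypothetical cap; a real 16-bit counter would allow far more


def build_weights(games, cap=COUNT_MAX):
    """Accumulate half-point weights per move for one position.

    Whenever the weight just updated reaches the cap, every weight for the
    position is halved, so the counters never overflow.
    """
    weights = {}
    for move, half_points in games:
        weights[move] = weights.get(move, 0) + half_points
        if weights[move] >= cap:
            for m in weights:
                weights[m] //= 2
    return weights


def shares(weights):
    """Convert raw weights to rounded percentages."""
    total = sum(weights.values())
    return {m: round(100 * w / total) for m, w in weights.items()}


# 3000 drawn games with each move, one half-point per game.
sorted_games = [("e4", 1)] * 3000 + [("d4", 1)] * 3000  # "ordered by ECO"
mixed_games = [("e4", 1), ("d4", 1)] * 3000             # interleaved

print(shares(build_weights(sorted_games)))  # {'e4': 3, 'd4': 97}
print(shares(build_weights(mixed_games)))   # {'e4': 50, 'd4': 50}
```

Both inputs contain exactly the same games. In the sorted case the e4 weight keeps getting halved while the d4 games are processed, so the move seen last dominates, which matches the pattern reported at the top of the thread.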

Post by chesskobra »

Thank you for the explanation. So if I understand correctly, if the pgn is ordered so that the opening sequences are randomised, then we get reasonable weights despite the overflow. Also, a few plies from the start position, most positions probably appear fewer than 32k times in the database, so their weights may be quite reliable. Perhaps weights could be counted directly only for positions a few plies in, and the weights for positions within the first few plies could then be constructed from the weights downstream. Or the normalised weights could be calculated as the games are processed. I am not sure if any of this is possible.

Post by Joost Buijs »

hgm wrote: Tue Aug 01, 2023 9:56 pm Well, the weights are 16 bit, and they count half-points. So after reaching 32K they would overflow.

I am not sure how Polyglot handles this. It could be it just renormalizes the weights it obtained so far, by dividing those by 2. That means indeed that later games will get more weight.

I am not sure what else it could do. Ignoring the overflow would cause the relative weights to be basically random, as some weights would have been overflowing, and others not. There is no easy fix for this, as in principle any position could be visited so often that (some of) its weights will overflow, depending on the PGN collection you feed it.

It would be better to use 32-bit (or 64-bit) weights internally when generating the database, and to normalize them when they are stored to disk.
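A minimal sketch of that idea, assuming the raw weights for one position are kept in wide counters during the build and scaled only at save time. The name `scale_to_u16` and the rule of keeping at least 1 for any move with a nonzero weight are assumptions, not taken from any Polyglot version. Python integers are unbounded, so they stand in for the 32-bit or 64-bit counters here.

```python
MAX_U16 = 0xFFFF  # largest value a 16-bit book weight can hold


def scale_to_u16(raw_weights):
    """Scale the weights of one position so the largest fits in 16 bits.

    Ratios between moves are preserved up to integer rounding. Moves with a
    nonzero raw weight keep a weight of at least 1 (an assumption of this
    sketch), so rare moves do not vanish from the book.
    """
    largest = max(raw_weights.values(), default=0)
    if largest <= MAX_U16:
        return dict(raw_weights)
    return {
        move: max(1, w * MAX_U16 // largest) if w > 0 else 0
        for move, w in raw_weights.items()
    }


raw = {"e4": 120_000, "d4": 90_000, "Nf3": 30_000}
scaled = scale_to_u16(raw)
print(scaled)  # {'e4': 65535, 'd4': 49151, 'Nf3': 16383}
```

Because the scaling happens once, after all games have been counted, the result no longer depends on the order of the games in the pgn.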

I remember now that a very long time ago I discussed this with Fabien, and that he made a version of Polyglot that uses 32-bit weights internally and could only be used to generate databases. I probably still have the source somewhere in a backup.

Many years back I made my own Polyglot book generator which doesn't exhibit this behavior, so I completely forgot this problem exists.

Post by hgm »

Well, the Polyglot algorithm for building the book is extremely primitive; it just counts points achieved with the move without paying attention to the number of plays. So a move that was played 1000 times, with a 10% score gets a weight of 200, and the chances it will be played are twice as large as a move that was played 100 times with a 50% score (which gets a weight of 100). In other words, it completely goes by the confidence that the players appeared to put in the moves, completely ignoring the actual performance.
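The arithmetic in that example can be checked with a short sketch. It assumes the weight is simply the number of half-points scored (2 per win, 1 per draw) and that a move is picked with probability proportional to its weight. The function names are made up for illustration, and each score is modelled with wins and losses only.

```python
def half_point_weight(wins, draws):
    """Weight as a count of half-points: 2 per win, 1 per draw."""
    return 2 * wins + draws


def play_probabilities(weights):
    """Probability of picking each move, proportional to its weight."""
    total = sum(weights.values())
    return {move: w / total for move, w in weights.items()}


weights = {
    "A": half_point_weight(wins=100, draws=0),  # 1000 games, 10% score
    "B": half_point_weight(wins=50, draws=0),   # 100 games, 50% score
}
probs = play_probabilities(weights)
print(weights)  # {'A': 200, 'B': 100}
print(probs)    # A is picked about 2/3 of the time, B about 1/3
```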

A more sensible algorithm would calculate performance, and not just from the statistics of the position itself, but by minimaxing those statistics through the tree. It would then take some weighted average of performance and player preference, where the weight of the latter depends on player Elo.
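One possible reading of "minimaxing the statistics" is sketched below. The `Node` layout, the `min_games` threshold and the example numbers are all assumptions made for illustration; the blend with player preference is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    games: int    # games that reached this position
    score: float  # empirical score of the side to move, between 0 and 1
    children: dict = field(default_factory=dict)  # move -> Node


def backed_up_score(node, min_games=20):
    """Score for the side to move, backed up from the leaves.

    Children with too few games are ignored. If none are usable, fall back
    to the position's own statistics.
    """
    usable = [c for c in node.children.values() if c.games >= min_games]
    if not usable:
        return node.score
    # The opponent's score in the child is 1 minus ours, so pick the best.
    return max(1.0 - backed_up_score(c, min_games) for c in usable)


# Move A looks good on raw statistics, but there is a strong reply to it.
after_reply = Node(games=40, score=0.3)
after_a = Node(games=500, score=0.4, children={"reply": after_reply})
after_b = Node(games=300, score=0.45)
root = Node(games=800, score=0.55, children={"A": after_a, "B": after_b})

raw = {m: 1.0 - c.score for m, c in root.children.items()}
backed = {m: 1.0 - backed_up_score(c) for m, c in root.children.items()}
print(raw)     # A scores about 0.60, B about 0.55
print(backed)  # A drops to about 0.30, B stays at about 0.55
```

On raw statistics move A looks better, but once the reply is backed up through the tree, B comes out ahead.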

Post by Joost Buijs »

I agree, there are a lot of things that could be improved. The book generator I made in the past uses the Elo rating of the players and adjusts the weights by means of Elo's percentage expectancy formula. Normally I use 5 to 10 million games to build my book. I've been thinking about minimaxing it, but I'm afraid that it would take too long.
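For reference, the expectancy formula in its common logistic form looks like this. The post does not say how the weights are adjusted, so `rating_adjusted_credit` is only one hypothetical possibility (result minus expectation), not the generator described above.

```python
def expected_score(rating, opponent_rating):
    """Elo expected score (common logistic form) against a given opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def rating_adjusted_credit(result, rating, opponent_rating):
    """Hypothetical adjustment: game result (1, 0.5 or 0) minus expectation.

    A win against a much weaker opponent earns little credit. A win against
    an equal opponent earns 0.5.
    """
    return result - expected_score(rating, opponent_rating)


print(expected_score(2000, 2000))               # 0.5
print(round(expected_score(2400, 2000), 3))     # 0.909
print(rating_adjusted_credit(1.0, 2000, 2000))  # 0.5
```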

Often the most frequently played move is also the best move, so a book based on pure statistics is not so bad at all.

Post by hgm »

You can get away with it if the PGN doesn't span a large time interval. Otherwise there is the danger that at some point in the interval a refutation was discovered for a line that used to be popular because of a good score, and the line then quickly went out of fashion because of the refutation. In that case the statistics stay good even though the line is refuted.