Utility to delete PGN games with same final position?

Discussion of chess software programming and technical issues.

Moderator: Ras

Aser Huerga
Posts: 812
Joined: Tue Jun 16, 2009 10:09 am
Location: Spain

Utility to delete PGN games with same final position?

Post by Aser Huerga »

Hello,

I'm preparing a collection of games truncated to 3 moves for testing purposes. I used Norm Pollock's PGN collections (thanks, Norm).
I plan to check how balanced the positions are, but for now I have a PGN with duplicate final positions (sometimes the same game, sometimes a different move order), so I need a utility to delete games with the same final position.

An EPD collection won't do, because the first 3 moves would be lost and I'd like complete games.

I tried pgnscanner, pgn-extract, CDB ... but I didn't get satisfactory results.
If there is no utility to delete PGN games with the same final position, maybe someone could easily write a script or something to do it?

Any hint will be highly appreciated!
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Utility to delete PGN games with same final position?

Post by wgarvin »

I don't know what PGN utilities are already available, but if you have a library that can read PGN, you could write the utility yourself:

(1) It needs to read games from the PGN file one by one (keep the block of text for the game around in a buffer, in case you need to write it to the output file).
(2) Use the board representation from your own chess engine (or someone else's). Set up the initial position, and make the first 3 moves in the PGN.
(3) After 3 moves, store the Zobrist key of the position in a big hash table. If it was already in the hash table, it's a duplicate, so you can move on to the next game. If it was not already in the hash table, write out a copy of the game into the output PGN file.

Parsing PGN properly is the hardest part, but someone probably has a library that can do that. The best case would be if you have source for an engine that can read PGN games and continue them. You could use its board, its Zobrist hash and its PGN reading stuff.
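
As a rough illustration of the Zobrist part of step (3), here is a minimal Python sketch that is not taken from any engine. It assumes the position is already available as a FEN string (producing that from the PGN moves is still the job of the board code), and the names `zobrist_key`, `PIECE_KEYS` and so on are made up for this example. The move counters at the end of the FEN are deliberately left out of the key so that transpositions compare equal.

```python
import random

PIECES = "PNBRQKpnbrqk"
_rng = random.Random(2010)  # fixed seed so keys are reproducible between runs

# One random 64-bit number per (piece, square), plus side to move,
# castling rights and en passant file.
PIECE_KEYS = {(p, sq): _rng.getrandbits(64) for p in PIECES for sq in range(64)}
BLACK_TO_MOVE = _rng.getrandbits(64)
CASTLE_KEYS = {c: _rng.getrandbits(64) for c in "KQkq"}
EP_FILE_KEYS = {f: _rng.getrandbits(64) for f in "abcdefgh"}


def zobrist_key(fen):
    """XOR together the random numbers for everything that defines the position."""
    placement, side, castling, ep = fen.split()[:4]
    key = 0
    for rank_index, rank in enumerate(placement.split("/")):  # rank 8 comes first
        file_index = 0
        for ch in rank:
            if ch.isdigit():
                file_index += int(ch)  # run of empty squares
            else:
                key ^= PIECE_KEYS[(ch, rank_index * 8 + file_index)]
                file_index += 1
    if side == "b":
        key ^= BLACK_TO_MOVE
    for c in castling:
        if c != "-":
            key ^= CASTLE_KEYS[c]
    if ep != "-":
        key ^= EP_FILE_KEYS[ep[0]]
    return key


# 1.d4 d5 2.c4 e6 3.Nc3 Nf6 and 1.c4 e6 2.Nc3 d5 3.d4 Nf6 reach the same
# position; only the halfmove clock in the FEN differs.
a = "rnbqkb1r/ppp2ppp/4pn2/3p4/2PP4/2N5/PP2PPPP/R1BQKBNR w KQkq - 2 4"
b = "rnbqkb1r/ppp2ppp/4pn2/3p4/2PP4/2N5/PP2PPPP/R1BQKBNR w KQkq - 1 4"

seen = set()
for fen in (a, b):
    key = zobrist_key(fen)
    print("duplicate" if key in seen else "new")
    seen.add(key)
```

One caveat with this sketch: some FEN writers set the en passant square after every double pawn push, even when no capture is possible, so two positions that are identical for practical purposes could still get different keys.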
rjgibert
Posts: 317
Joined: Mon Jun 26, 2006 9:44 am

Re: Utility to delete PGN games with same final position?

Post by rjgibert »

If one game ends with the same KQK mating position as a prior game, you want to filter that game out. If another game ends with a unique KQK mate, you want to keep that one. I'm curious how such a distinction can be useful.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Utility to delete PGN games with same final position?

Post by michiguel »

rjgibert wrote:If in one game, the game ends with the same KQK mating position as prior game, that game you want to filter out. If another game ending with a unique KQK mate occurs, that game you want to keep. I'm curious as to how such a distinction can be useful?
He wants to have starting positions for games. In fact, the first three moves.
Finding duplicate games does not work well enough. You may end up with different transpositions to the same starting position.

I need this tool too, so I may eventually write it myself if no one finds a solution.

Miguel
rjgibert
Posts: 317
Joined: Mon Jun 26, 2006 9:44 am

Re: Utility to delete PGN games with same final position?

Post by rjgibert »

I would aim for a unique move 40 position or last position if game is shorter. This would save going to the end of the game and be more reliable. There are opening variations that go past move 30, so a smaller number might not be enough. Using last position of game will filter out a lot of games you want to keep.

EDIT: This isn't enough either. Sometimes a game will have a position repetition inserted. The problem will require a bit more thought.
Last edited by rjgibert on Thu Jun 10, 2010 7:53 am, edited 1 time in total.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Utility to delete PGN games with same final position?

Post by michiguel »

rjgibert wrote:I would aim for a unique move 40 position or last position if game is shorter. This would save going to the end of the game and be more reliable. There are opening variations that go past move 30, so a smaller number might not be enough. Using last position of game will filter out a lot of games you want to keep.
Sorry, I was not clear. The algorithm is
1) Get a pgn collection
2) truncate all games to <n> plies with pgnextract.
3) discard games that end up in the same position (retain only one). In this case, it is the position after <n> plies, because after that, the moves have been deleted.

That is what I meant and I am almost sure that is what Aser needs.

Miguel
PS: In my case, for 2) I am trying to build an opening book for my engine: take two instances, make them play each other, and make one of them resign after <n> plies. The difference is that the games will be chosen if the moves are statistically sound.
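
Step 3) of the algorithm above could be sketched roughly like this in Python. This is only an outline under some assumptions: the games are already truncated, every game starts with an `[Event` tag, and the caller supplies a function that turns a game into a key for its final position (for example a Zobrist key computed by a real board). The `toy_key` function here is a stand-in invented for the demo; it just compares the sets of White and Black moves, which happens to work for simple transpositions but is not a real position key.

```python
import re

SAMPLE_PGN = """\
[Event "A"]

1. d4 d5 2. c4 e6 3. Nc3 Nf6 *

[Event "B"]

1. c4 e6 2. Nc3 d5 3. d4 Nf6 *

[Event "C"]

1. e4 c5 2. Nf3 d6 3. d4 cxd4 *
"""


def split_games(pgn_text):
    """Yield one game at a time, assuming each game starts with an [Event tag."""
    block = []
    for line in pgn_text.splitlines():
        if line.startswith("[Event ") and block:
            yield "\n".join(block).strip() + "\n"
            block = []
        block.append(line)
    if any(ln.strip() for ln in block):
        yield "\n".join(block).strip() + "\n"


def keep_one_per_position(games, final_key):
    """Yield only the first game seen for each final-position key."""
    seen = set()
    for game in games:
        key = final_key(game)
        if key not in seen:
            seen.add(key)
            yield game


def toy_key(game):
    """Stand-in key for the demo only: the sets of White and Black moves.
    A real key must come from playing the moves on a board."""
    movetext = " ".join(ln for ln in game.splitlines() if not ln.startswith("["))
    tokens = [t for t in re.sub(r"\d+\.+", " ", movetext).split()
              if t not in ("1-0", "0-1", "1/2-1/2", "*")]
    return (frozenset(tokens[0::2]), frozenset(tokens[1::2]))


# Game B transposes into game A, so only A and C are written out.
kept = list(keep_one_per_position(split_games(SAMPLE_PGN), toy_key))
print("\n".join(kept))
```

Keeping the game text as an opaque block means the output PGN preserves the original tags and moves exactly; only the key function needs to understand chess.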
rjgibert
Posts: 317
Joined: Mon Jun 26, 2006 9:44 am

Re: Utility to delete PGN games with same final position?

Post by rjgibert »

See my EDIT of my prior post.
rjgibert
Posts: 317
Joined: Mon Jun 26, 2006 9:44 am

Re: Utility to delete PGN games with same final position?

Post by rjgibert »

michiguel wrote:
rjgibert wrote:I would aim for a unique move 40 position or last position if game is shorter. This would save going to the end of the game and be more reliable. There are opening variations that go past move 30, so a smaller number might not be enough. Using last position of game will filter out a lot of games you want to keep.
Sorry, I was not clear. The algorithm is
1) Get a pgn collection
2) truncate all games to <n> plies with pgnextract.
3) discard games that end up in the same position (retain only one). In this case, it is the position after <n> plies, because after that, the moves have been deleted.

That is what I meant and I am almost sure that is what Aser needs.

Miguel
PS: In my case, for 2) I am trying to build an opening book for my engine, get two instances , make them play to each other, and make one of them to resign after <n> plies. The difference is that the games will be chosen if the moves are statistically sound.
In 3), you need to have a Zobrist hash table of all the end positions and use it to compare against the Zobrist hash keys of all the positions of the candidate game from move <m> to move <n>. This will filter otherwise identical games that differ by the insertion of a repetition. It will also filter games that reach the same position via a longer non-repeating sequence. IOW, you need to check a range of moves of the candidate game.

Some openings offer the option to repeat the position. Other openings, like the Sveshnikov Sicilian, can reach the same position by inserting the moves e6 by Black and Bf4 by White, followed by a later e5 by Black and Bg5 by White. That is a longer sequence that does not include a repetition but reaches the same position.

With these modifications, it will still not be perfect, but perhaps good enough for practical purposes.
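
To make the range check concrete, here is a small Python sketch under stated assumptions. Each game is represented only by a list of position keys, one per ply (in practice these would be Zobrist keys; short strings are used here so the example is readable). The function name `filter_by_window` and the parameters `end_ply` and `last_ply` are invented for this illustration.

```python
def filter_by_window(games, end_ply, last_ply):
    """Return the indices of the games to keep.

    games[i][p - 1] is the key of the position after ply p of game i.
    A kept game stores its position at end_ply. A candidate is dropped
    if any of its positions from end_ply through last_ply is already stored.
    """
    known = set()
    kept = []
    for index, keys in enumerate(games):
        if not keys:
            continue  # nothing to compare
        window = keys[end_ply - 1:last_ply]
        if any(k in known for k in window):
            continue  # reaches a stored end position somewhere in the range
        known.add(keys[:end_ply][-1])  # last position if the game is shorter
        kept.append(index)
    return kept


# Game A reaches position "p4" after 4 plies.
game_a = ["p1", "p2", "p3", "p4"]
# Game B is the same line with a 4-ply repetition inserted after ply 2,
# so it reaches "p4" only at ply 8.
game_b = ["p1", "p2", "x", "y", "z", "p2", "p3", "p4"]
# Game C is unrelated.
game_c = ["q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]

print(filter_by_window([game_a, game_b, game_c], end_ply=4, last_ply=4))
print(filter_by_window([game_a, game_b, game_c], end_ply=4, last_ply=8))
```

Comparing only the position at ply 4 keeps all three games, while checking plies 4 through 8 drops game B. The result still depends on the order of the games: if B came first, A would not be recognised as a duplicate of it, which fits the point that this is not perfect.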
Aser Huerga
Posts: 812
Joined: Tue Jun 16, 2009 10:09 am
Location: Spain

Re: Utility to delete PGN games with same final position?

Post by Aser Huerga »

michiguel wrote:
rjgibert wrote:I would aim for a unique move 40 position or last position if game is shorter. This would save going to the end of the game and be more reliable. There are opening variations that go past move 30, so a smaller number might not be enough. Using last position of game will filter out a lot of games you want to keep.
Sorry, I was not clear. The algorithm is
1) Get a pgn collection
2) truncate all games to <n> plies with pgnextract.
3) discard games that end up in the same position (retain only one). In this case, it is the position after <n> plies, because after that, the moves have been deleted.

That is what I meant and I am almost sure that is what Aser needs.
That is exactly what I'm looking for. Read a PGN collection, look up the final position of each game (or the position at ply 6 or X) and keep only one game per position. Doing this manually is a lot of work!
Since a utility that truncates a PGN to a given ply exists, I would expect a utility that deletes games with duplicate final positions to exist too ... :roll:
Aser Huerga
Posts: 812
Joined: Tue Jun 16, 2009 10:09 am
Location: Spain

Re: Utility to delete PGN games with same final position?

Post by Aser Huerga »

A partner in CCRL has made a script using well-known utilities such as pgn-extract, pgnscanner, etc., and it seems to work well, because Ferdinand Mosca got the same results with a different method. Thanks to both. PM me if you are interested.

I also want to thank Miguel Ballicora for his efforts to incorporate such a utility into Gaviota.

Best regards.