handling huge pgn databases

Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

handling huge pgn databases

Post by Jonathan003 »

What software can handle huge PGN databases, on the order of 60,000,000 games?
I want to merge two PGN databases of 30,000,000 games each, then search the resulting database for duplicates and delete them. By duplicates I mean games with exactly the same moves and result. So two games with exactly the same moves and result count as duplicates even if the player names, date, tournament or other information differ. When duplicates are found, I want to keep the better game (higher Elo, more recent, ...).

I have tried SCID 4.7, and I get an error message saying something like "too many player names" when I try to merge the two databases.
I hope someone can help me with this. I have no programming knowledge.
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: handling huge pgn databases

Post by dkappe »

Jonathan003 wrote: Mon Jan 31, 2022 6:51 pm What software can handle huge PGN databases, on the order of 60,000,000 games?
I want to merge two PGN databases of 30,000,000 games each, then search the resulting database for duplicates and delete them. By duplicates I mean games with exactly the same moves and result. So two games with exactly the same moves and result count as duplicates even if the player names, date, tournament or other information differ. When duplicates are found, I want to keep the better game (higher Elo, more recent, ...).

I have tried SCID 4.7, and I get an error message saying something like "too many player names" when I try to merge the two databases.
I hope someone can help me with this. I have no programming knowledge.
For bulk operations such as removing duplicates I’d suggest the blazing fast pgn-extract (https://www.cs.kent.ac.uk/people/staff/djb/pgn-extract/). Obviously for more complex operations you’ll want a database like Hiarcs Chess Explorer, SCID or Chessbase, but for an initial cut at a big pgn like a monthly lichess file, pgn-extract is the only way to fly.

For the more complex logic you’re suggesting, you might want to extract the duplicates with pgn-extract first and then massage them with something like SCID.
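
The duplicate rule from the question (same moves and same result, keep the higher-rated or more recent game) can be illustrated with a small Python sketch. This is not how pgn-extract or SCID work internally; the function names are made up for illustration, the parser assumes simple PGN without nested variations, and everything is held in memory, so for tens of millions of games you would need to store hashes and file offsets instead of whole games.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

Game = Tuple[Dict[str, str], str]  # (tag pairs, raw movetext)


def split_games(lines: Iterable[str]) -> Iterator[Game]:
    """Yield (tags, movetext) for each game found in a stream of PGN lines."""
    tags: Dict[str, str] = {}
    moves: List[str] = []
    in_moves = False
    for raw in lines:
        line = raw.strip()
        match = TAG_RE.match(line)
        if match and in_moves:
            # A tag line that follows movetext starts the next game.
            yield tags, " ".join(moves)
            tags, moves, in_moves = {}, [], False
        if match:
            tags[match.group(1)] = match.group(2)
        elif line:
            moves.append(line)
            in_moves = True
    if tags or moves:
        yield tags, " ".join(moves)


def move_key(movetext: str) -> str:
    """Normalise movetext so that formatting differences do not matter.

    The result token (1-0, 0-1, ...) stays in the key, so games with the
    same moves but a different result are NOT treated as duplicates.
    """
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop {comments}
    text = re.sub(r"\$\d+", " ", text)          # drop NAGs such as $1
    text = re.sub(r"\d+\.+", " ", text)         # drop move numbers
    return " ".join(text.split())


def quality(tags: Dict[str, str]) -> Tuple[int, str]:
    """Rank a game: higher Elo first, then the more recent date."""
    def elo(name: str) -> int:
        value = tags.get(name, "")
        return int(value) if value.isdigit() else 0

    # PGN dates look like 2022.01.31; unknown parts ("??") sort as oldest.
    date = tags.get("Date", "").replace("?", "0")
    return max(elo("WhiteElo"), elo("BlackElo")), date


def dedupe(games: Iterable[Game]) -> List[Game]:
    """Keep one game per move key, preferring the better-ranked one."""
    best: Dict[str, Game] = {}
    for tags, movetext in games:
        key = move_key(movetext)
        if key not in best or quality(tags) > quality(best[key][0]):
            best[key] = (tags, movetext)
    return list(best.values())
```

A typical call would be `dedupe(split_games(open("merged.pgn", encoding="utf-8", errors="replace")))`, where `merged.pgn` is a placeholder file name.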
Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: handling huge pgn databases

Post by Jonathan003 »

I will try pgn-extract.
However, in my experience, using pgn-extract to clean huge PGN databases before making Polyglot .bin books is much slower than importing and exporting the databases in SCID 4.7. In general, SCID 4.7 handles huge PGN databases much faster than, for example, ChessBase.
The only problem is that I get errors when trying to merge the two huge pgn databases.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: handling huge pgn databases

Post by Desperado »

Jonathan003 wrote: Mon Jan 31, 2022 8:25 pm I will try pgn-extract.
However, in my experience, using pgn-extract to clean huge PGN databases before making Polyglot .bin books is much slower than importing and exporting the databases in SCID 4.7. In general, SCID 4.7 handles huge PGN databases much faster than, for example, ChessBase.
The only problem is that I get errors when trying to merge the two huge pgn databases.
Well, if the only problem is to merge two pgn (text) files, then you can simply open the command prompt (cmd.exe) and type

Option 1

Code: Select all

type file1.epd >> file2.epd
The content of file1 will be added to file2.

Merging many files works like this:

Option 2

Code: Select all

for %f in (*.epd) do type "%f" >> c:\different_folder\merged.epd
Be careful with the second command: it can recurse, because merged.epd itself matches the pattern "*.epd". That is why the example writes to a different folder. If you choose an output name that does not match the pattern, for example merged.txt with the pattern *.epd, you can simply run the command in the same folder and rename the file extension after the operation has completed.

For merging two files, the first option is the easiest way to go.
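
For readers who would rather script the merge, here is a minimal Python sketch of the same idea. The function name `merge_files` is made up for illustration. It copies the files in binary mode, in large chunks, and explicitly skips the output file, so the recursion problem described above cannot occur even when the output sits in the same folder and matches the pattern.

```python
import shutil
from pathlib import Path
from typing import Iterable


def merge_files(sources: Iterable[Path], target: Path) -> int:
    """Concatenate all source files into target; return how many were merged."""
    target = Path(target).resolve()
    merged = 0
    with open(target, "wb") as out:
        for src in sorted(Path(s).resolve() for s in sources):
            if src == target:
                continue  # never feed the output file back into itself
            with open(src, "rb") as handle:
                # Stream in 16 MB chunks so file size does not matter.
                shutil.copyfileobj(handle, out, length=16 * 1024 * 1024)
            out.write(b"\n")  # keep the last game of one file apart from the next
            merged += 1
    return merged
```

For example, `merge_files(Path(".").glob("*.pgn"), Path("merged.pgn"))` would merge every .pgn file in the current folder.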
Modern Times
Posts: 3554
Joined: Thu Jun 07, 2012 11:02 pm

Re: handling huge pgn databases

Post by Modern Times »

If I want to merge many pgn files (which are plain txt files) in a single folder, I do this at the command prompt

copy *.pgn merged.pgn

However, that is with small files; I have never tried it with big files.
phhnguyen
Posts: 1440
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: handling huge pgn databases

Post by phhnguyen »

Since you are open to converting into another database format and then exporting back to PGN, you may try OCGDB; the forum thread is here.

The program can create databases in SQL format that match SCID's in size and speed, but it can work with very large numbers of games. In this test, it worked very well with 94 million Lichess games, needing about 1.5 hours to process that file (on my 5-year-old 4-core computer). We estimate it could work with a billion games too.

The latest release (version Beta 3) has enough functions for your needs:

- Create a new database from multi PGN files:

Code: Select all

ocgdb -db bigdb.ocgdb.db3 -pgn file1.pgn  -pgn file2.pgn -pgn file3.pgn -o moves2 -cpu 4
- Check duplicate games and remove the redundant ones:

Code: Select all

ocgdb -db bigdb.ocgdb.db3 -dup -o remove;printall -cpu 4
- Export a database into a PGN file

Code: Select all

ocgdb -db bigdb.ocgdb.db3 -pgn bigdb.pgn -export

The other strong point of OCGDB is that it can do simple, fast position searching on both PGN files and databases. For example:

Code: Select all

// find all positions having 3 White Queens
ocgdb -db bigdb.ocgdb.db3 -q "Q = 3" -cpu 4 -o printpgn

// Find all positions having two Black Rooks in the middle squares
ocgdb -pgn file1.pgn -q "r[e4, e5, d4, d5] = 2" -cpu 4 -o printfen


// White Pawns in d4, e5, f4, g4, Black King in b7
ocgdb -db bigdb.ocgdb.db3 -q "P[d4, e5, f4, g4] = 4 and kb7" -cpu 4
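
To make clear what such queries test, here is a rough Python sketch that evaluates the same kinds of conditions on a single position given as a FEN string. It is only an illustration of the idea, not OCGDB's implementation or query parser, and the helper names `board_from_fen` and `count_piece` are invented for this example.

```python
from typing import Dict, Iterable, Optional


def board_from_fen(fen: str) -> Dict[str, str]:
    """Map occupied squares (e.g. 'e4') to piece letters from a FEN string."""
    placement = fen.split()[0]
    board: Dict[str, str] = {}
    # FEN lists ranks from 8 down to 1, files from a to h.
    for rank_index, row in enumerate(placement.split("/")):
        file_index = 0
        for ch in row:
            if ch.isdigit():
                file_index += int(ch)  # a digit means that many empty squares
            else:
                square = "abcdefgh"[file_index] + str(8 - rank_index)
                board[square] = ch
                file_index += 1
    return board


def count_piece(board: Dict[str, str], piece: str,
                squares: Optional[Iterable[str]] = None) -> int:
    """Count a piece letter (uppercase = White, lowercase = Black).

    With squares given, only those squares are examined.
    """
    if squares is None:
        return sum(1 for p in board.values() if p == piece)
    return sum(1 for sq in squares if board.get(sq) == piece)


# Rough equivalent of the last query above, for one position:
board = board_from_fen("8/1k6/8/4P3/3P1PP1/8/8/4K3 w - - 0 1")
matches = (count_piece(board, "P", ["d4", "e5", "f4", "g4"]) == 4
           and board.get("b7") == "k")
```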
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Ozymandias
Posts: 1535
Joined: Sun Oct 25, 2009 2:30 am

Re: handling huge pgn databases

Post by Ozymandias »

Modern Times wrote: Tue Feb 01, 2022 3:44 am If I want to merge many pgn files (which are plain txt files) in a single folder, I do this at the command prompt

copy *.pgn merged.pgn

However, that is with small files; I have never tried it with big files.
It works flawlessly up to hundreds of GBs. Fastest too (depends on the HDD, mostly).
Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: handling huge pgn databases

Post by Jonathan003 »

phhnguyen wrote: Tue Feb 01, 2022 5:58 am The program can create databases in SQL format that match SCID's in size and speed, but it can work with very large numbers of games. In this test, it worked very well with 94 million Lichess games, needing about 1.5 hours to process that file (on my 5-year-old 4-core computer). We estimate it could work with a billion games too.
Thanks for the information, I will try this 'Open Chess Game Database Standard (OCGDB)' tool.
Can I download this database of 94 million Lichess games somewhere?
I know I can download Lichess databases here: https://database.lichess.org/
But I'm looking for high-quality collections of Lichess games.
For example, all standard human games played on Lichess.org in which at least one of the players has a rating of 2000 Elo or higher.
KLc
Posts: 140
Joined: Wed Jun 03, 2020 6:46 am
Full name: Kurt Lanc

Re: handling huge pgn databases

Post by KLc »

There's the Lichess Elite Database https://database.nikonoel.fr. Otherwise, you have to filter yourself (e.g. with pgn-extract).
phhnguyen
Posts: 1440
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: handling huge pgn databases

Post by phhnguyen »

Jonathan003 wrote: Wed Feb 02, 2022 1:13 am
phhnguyen wrote: Tue Feb 01, 2022 5:58 am The program can create databases in SQL format that match SCID's in size and speed, but it can work with very large numbers of games. In this test, it worked very well with 94 million Lichess games, needing about 1.5 hours to process that file (on my 5-year-old 4-core computer). We estimate it could work with a billion games too.
Thanks for the information, I will try this 'Open Chess Game Database Standard (OCGDB)' tool.
Can I download this database of 94 million Lichess games somewhere?
I know I can download Lichess databases here: https://database.lichess.org/
But I'm looking for high-quality collections of Lichess games.
For example, all standard human games played on Lichess.org in which at least one of the players has a rating of 2000 Elo or higher.
I have also used that link to download Lichess games, and I don't know of a better source.

You can create a high-quality database yourself with OCGDB by adding the parameter -elo 2000 when creating it, as below:

Code: Select all

ocgdb -db bigdb.ocgdb.db3 -pgn file1.pgn  -pgn file2.pgn -pgn file3.pgn -o moves2 -cpu 4 -elo 2000
OCGDB could run much faster if you filter out more games.
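
If you prefer to pre-filter the PGN file before importing it anywhere, the same idea can be sketched in Python. Note the assumptions: this sketch keeps a game when at least one player is rated 2000 or higher (the rule asked for above), which is not necessarily how OCGDB's -elo option decides, and the function names are made up for illustration. It reads the file line by line, so memory use stays small even for very large files.

```python
import re
from typing import Iterable, Iterator, List

ELO_RE = re.compile(r'^\[(?:White|Black)Elo\s+"(\d+)"\]', re.MULTILINE)


def iter_game_blocks(lines: Iterable[str]) -> Iterator[str]:
    """Yield each game (tags plus movetext) as one block of text."""
    block: List[str] = []
    seen_moves = False
    for line in lines:
        is_tag = line.startswith("[")
        if is_tag and seen_moves:
            # A tag line that follows movetext starts the next game.
            yield "".join(block)
            block, seen_moves = [], False
        if not is_tag and line.strip():
            seen_moves = True
        block.append(line)
    if any(item.strip() for item in block):
        yield "".join(block)


def filter_by_elo(lines: Iterable[str], min_elo: int = 2000) -> Iterator[str]:
    """Yield only games where at least one player has Elo >= min_elo."""
    for block in iter_game_blocks(lines):
        # Non-numeric ratings such as "?" do not match and are ignored.
        ratings = [int(value) for value in ELO_RE.findall(block)]
        if ratings and max(ratings) >= min_elo:
            yield block
```

To use it, open the input file for reading and an output file for writing, then write every block from `filter_by_elo(infile)` to the output file.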