What software is able to handle huge pgn databases? Databases of 60,000,000 games.
I want to merge two pgn databases of 30,000,000 games each, then search the resulting database for doubles and delete them. By doubles I mean games with exactly the same moves and result. So two games with exactly the same moves and result are considered doubles even if the player names, date, tournament or other information are different. If doubles are found, I want to keep the better game (higher Elo, more recent...).
I have tried SCID 4.7 and I get an error message saying something like "too many player names" when I try to merge the two databases.
I hope someone can help me with this. I have no programming knowledge.
handling huge pgn databases
-
- Posts: 239
- Joined: Fri Jul 06, 2018 4:23 pm
- Full name: Jonathan Cremers
-
- Posts: 1632
- Joined: Tue Aug 21, 2018 7:52 pm
- Full name: Dietrich Kappe
Re: handling huge pgn databases
Jonathan003 wrote: ↑Mon Jan 31, 2022 6:51 pm
What software is able to handle huge pgn databases? Databases of 60,000,000 games.
I want to merge two pgn databases of 30,000,000 games each, then search the resulting database for doubles and delete them. By doubles I mean games with exactly the same moves and result. So two games with exactly the same moves and result are considered doubles even if the player names, date, tournament or other information are different. If doubles are found, I want to keep the better game (higher Elo, more recent...).
I have tried SCID 4.7 and I get an error message saying something like "too many player names" when I try to merge the two databases.
I hope someone can help me with this. I have no programming knowledge.

For bulk operations such as removing duplicates I'd suggest the blazing fast pgn-extract (https://www.cs.kent.ac.uk/people/staff/djb/pgn-extract/). Obviously for more complex operations you'll want a database like Hiarcs Chess Explorer, SCID or Chessbase, but for an initial cut at a big pgn like a monthly lichess file, pgn-extract is the only way to fly.
For the more complex logic you're suggesting, you might want to extract the duplicates using the pgn-extract tool and then massage them with something like SCID.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
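For readers who do want to script it, the duplicate rule described in this thread (same moves and same result, keep the better game) can be sketched in Python. This is only a minimal sketch under stated assumptions, not how pgn-extract or SCID work internally: it assumes well-formed PGN in which a line starting with `[` after movetext begins a new game, it ignores variations in parentheses, and it treats "better" as the higher combined Elo with the more recent date as tie-breaker. The function names are made up for this illustration.

```python
import re

TAG = re.compile(r'\[(\w+)\s+"(.*)"\]$')

def iter_games(lines):
    """Yield (headers, movetext) for each game in an iterable of PGN lines."""
    headers, moves = {}, []
    for line in lines:
        line = line.strip()
        if line.startswith("[") and moves:
            # A tag line after movetext starts the next game.
            yield headers, " ".join(moves)
            headers, moves = {}, []
        if line.startswith("["):
            match = TAG.match(line)
            if match:
                headers[match.group(1)] = match.group(2)
        elif line:
            moves.append(line)
    if headers or moves:
        yield headers, " ".join(moves)

def move_key(movetext):
    """Reduce movetext to bare moves plus result, used as the duplicate key."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop comments
    text = re.sub(r"\$\d+", " ", text)          # drop numeric annotations
    text = re.sub(r"\d+\.+", " ", text)         # drop move numbers
    text = re.sub(r"[!?]+", "", text)           # drop !, ?, !? marks
    return " ".join(text.split())

def quality(headers):
    """Assumed ranking: combined Elo first, then date (YYYY.MM.DD)."""
    def elo(tag):
        value = headers.get(tag, "")
        return int(value) if value.isdigit() else 0
    date = headers.get("Date", "").replace("?", "0")
    return (elo("WhiteElo") + elo("BlackElo"), date)

def dedupe(lines):
    """Keep one game per move key, preferring the higher quality() value."""
    best = {}
    for headers, movetext in iter_games(lines):
        key = move_key(movetext)
        if key not in best or quality(headers) > quality(best[key][0]):
            best[key] = (headers, movetext)
    return list(best.values())

SAMPLE = '''[White "A"]
[Date "2020.01.01"]
[WhiteElo "2100"]
[BlackElo "2000"]
[Result "1-0"]

1. e4 e5 2. Qh5 Ke7 3. Qxe5# 1-0

[White "C"]
[Date "2021.05.05"]
[WhiteElo "2400"]
[BlackElo "2300"]
[Result "1-0"]

1.e4 e5 2.Qh5 {blunder follows} Ke7 3.Qxe5# 1-0

[White "E"]
[Result "0-1"]

1. f3 e5 2. g4 Qh4# 0-1
'''

kept = dedupe(SAMPLE.splitlines())
for headers, movetext in kept:
    print(headers.get("White"), "|", movetext)
```

Holding every game in memory like this will not scale to 60 million games. At that size one would store only a hash of the key plus a file offset, or use a dedicated database tool.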
-
- Posts: 239
- Joined: Fri Jul 06, 2018 4:23 pm
- Full name: Jonathan Cremers
Re: handling huge pgn databases
I will try pgn-extract.
However, my experience with pgn-extract for cleaning huge pgn databases before making polyglot bin books is that it is much slower than importing and exporting the pgn databases in SCID 4.7. In general, SCID 4.7 is much faster at handling huge pgn databases than, for example, Chessbase.
The only problem is that I get errors when trying to merge the two huge pgn databases.
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: handling huge pgn databases
Jonathan003 wrote: ↑Mon Jan 31, 2022 8:25 pm
I will try pgn-extract.
However, my experience with pgn-extract for cleaning huge pgn databases before making polyglot bin books is that it is much slower than importing and exporting the pgn databases in SCID 4.7. In general, SCID 4.7 is much faster at handling huge pgn databases than, for example, Chessbase.
The only problem is that I get errors when trying to merge the two huge pgn databases.

Well, if the only problem is to merge two pgn (text) files, then you can simply open the command prompt (cmd.exe) and type
1. Option
Code: Select all
type file1.epd >> file2.epd
Merging many files works like this:
2. Option
Code: Select all
for %f in (*.epd) do type "%f" >> c:\different_folder\merged.epd
The output goes to a different folder because merged.epd matches the pattern "*.epd" too. If you choose an output name that does not match your pattern (merged.txt when the pattern is *.epd, for example),
you can simply execute the command in the same folder (and rename the file extension after the operation has completed).
For merging two files, the first option is the easiest way to go.
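The same merge can be sketched in Python for anyone not on cmd.exe. This is a hedged illustration, not a replacement for the commands above: `merge_files` is a made-up helper name, and the file names in the demo are placeholders. It streams each file in chunks, so file size should not matter much, and it skips the output file if that file happens to match the pattern, which is the pitfall mentioned above.

```python
import shutil
import tempfile
from pathlib import Path

def merge_files(folder, pattern, output):
    """Concatenate all files in `folder` matching `pattern` into `output`."""
    folder, output = Path(folder), Path(output)
    # Exclude the output file itself in case it matches the pattern.
    sources = sorted(
        p for p in folder.glob(pattern) if p.resolve() != output.resolve()
    )
    with open(output, "wb") as out:
        for src in sources:
            with open(src, "rb") as handle:
                shutil.copyfileobj(handle, out)  # copies in chunks
            out.write(b"\n")  # make sure the next file starts on a new line
    return sources

# Small demo with placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "file1.pgn").write_bytes(b"1. e4 e5 1-0\n")
    Path(tmp, "file2.pgn").write_bytes(b"1. d4 d5 0-1\n")
    merged = Path(tmp, "merged.pgn")
    used = merge_files(tmp, "*.pgn", merged)
    print([p.name for p in used])
    print(merged.read_text())
```

Files are opened in binary mode so that no character in the input is interpreted or converted along the way.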
-
- Posts: 3557
- Joined: Thu Jun 07, 2012 11:02 pm
Re: handling huge pgn databases
If I want to merge many pgn files (which are plain txt files) in a single folder, I do this at the command prompt
copy *.pgn merged.pgn
However, that is with small files; I have never tried it with big files.
-
- Posts: 1444
- Joined: Wed Apr 21, 2010 4:58 am
- Location: Australia
- Full name: Nguyen Hong Pham
Re: handling huge pgn databases
Since you are willing to convert into another database format and then export back to PGN, you could try OCGDB; the forum thread is here.
The program can create databases in SQL format which are the same size and the same speed as SCID, but it can work with very large numbers of games. In this test, it worked very well with 94 million games from Lichess, needing about 1.5 hours to process that file (on my 5-year-old 4-core computer). We have estimated it could work with a billion games too.
The latest release (version Beta 3) has enough functions for your needs:
- Create a new database from multiple PGN files:
Code: Select all
ocgdb -db bigdb.ocgdb.db3 -pgn file1.pgn -pgn file2.pgn -pgn file3.pgn -o moves2 -cpu 4
- Check duplicate games and remove the redundant ones:
Code: Select all
ocgdb -db bigdb.ocgdb.db3 -dup -o remove;printall -cpu 4
- Export a database into a PGN file:
Code: Select all
ocgdb -db bigdb.ocgdb.db3 -pgn bigdb.pgn -export
The other strong point of OCGDB is that it can do position searching, simple and fast, for both PGN files and databases. For example:
Code: Select all
// find all positions having 3 White Queens
ocgdb -db bigdb.ocgdb.db3 -q "Q = 3" -cpu 4 -o printpgn
// Find all positions having two Black Rooks in the middle squares
ocgdb -pgn file1.pgn -q "r[e4, e5, d4, d5] = 2" -cpu 4 -o printfen
// White Pawns in d4, e5, f4, g4, Black King in b7
ocgdb -db bigdb.ocgdb.db3 -q "P[d4, e5, f4, g4] = 4 and kb7" -cpu 4
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
-
- Posts: 1535
- Joined: Sun Oct 25, 2009 2:30 am
Re: handling huge pgn databases
Modern Times wrote: ↑Tue Feb 01, 2022 3:44 am
If I want to merge many pgn files (which are plain txt files) in a single folder, I do this at the command prompt
copy *.pgn merged.pgn
However that is with small files, I have never tried with big files

It works flawlessly up to hundreds of GBs. Fastest too (depends on the HDD, mostly).
-
- Posts: 239
- Joined: Fri Jul 06, 2018 4:23 pm
- Full name: Jonathan Cremers
Re: handling huge pgn databases
phhnguyen wrote: ↑Tue Feb 01, 2022 5:58 am
The program can create databases in SQL format which are the same size and the same speed as SCID, but it can work with very large numbers of games. In this test, it worked very well with 94 million games from Lichess, needing about 1.5 hours to process that file (on my 5-year-old 4-core computer). We have estimated it could work with a billion games too.

Thanks for the information, I will try this 'Open Chess Game Database Standard (OCGDB)' tool.
Can I download this database of 94 million Lichess games somewhere?
I know I can download Lichess databases here: https://database.lichess.org/
But I'm looking for collections of high-quality Lichess games,
such as all standard human games played on Lichess.org where one of the players has a minimum rating of 2000 Elo.
-
- Posts: 140
- Joined: Wed Jun 03, 2020 6:46 am
- Full name: Kurt Lanc
Re: handling huge pgn databases
There's the Lichess Elite Database https://database.nikonoel.fr. Otherwise, you have to filter yourself (e.g. with pgn-extract).
-
- Posts: 1444
- Joined: Wed Apr 21, 2010 4:58 am
- Location: Australia
- Full name: Nguyen Hong Pham
Re: handling huge pgn databases
Jonathan003 wrote: ↑Wed Feb 02, 2022 1:13 am
phhnguyen wrote: ↑Tue Feb 01, 2022 5:58 am
The program can create databases in SQL format which are the same size and the same speed as SCID, but it can work with very large numbers of games. In this test, it worked very well with 94 million games from Lichess, needing about 1.5 hours to process that file (on my 5-year-old 4-core computer). We have estimated it could work with a billion games too.

Thanks for the information, I will try this 'Open Chess Game Database Standard (OCGDB)' tool.
Can I download this database of 94 million Lichess games somewhere?
I know I can download Lichess databases here: https://database.lichess.org/
But I'm looking for collections of high-quality Lichess games,
such as all standard human games played on Lichess.org where one of the players has a minimum rating of 2000 Elo.

I have used that link too to download Lichess games and I don't know of anywhere with better ones.
You can create a high-quality database yourself with OCGDB by adding the parameter -elo 2000 when creating it, as below:
Code: Select all
ocgdb -db bigdb.ocgdb.db3 -pgn file1.pgn -pgn file2.pgn -pgn file3.pgn -o moves2 -cpu 4 -elo 2000
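For comparison, the rating filter asked for in this thread can also be sketched in plain Python, without any chess library. This is a rough sketch under assumptions and says nothing about how OCGDB applies its -elo option: it keeps a game when at least one of the WhiteElo and BlackElo tags is at or above the threshold, it assumes a line starting with `[` after movetext begins a new game, and the function names are invented for the example.

```python
import re

TAG = re.compile(r'\[(\w+)\s+"(.*)"\]')

def split_games(lines):
    """Yield each game as a list of raw lines (tags, movetext, blank lines)."""
    game, seen_moves = [], False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("[") and seen_moves:
            # A tag line after movetext starts the next game.
            yield game
            game, seen_moves = [], False
        if stripped and not stripped.startswith("["):
            seen_moves = True
        game.append(line)
    if any(l.strip() for l in game):
        yield game

def max_elo(game):
    """Highest numeric Elo found in the game's tags, or 0 if none is given."""
    elos = [0]
    for line in game:
        match = TAG.match(line.strip())
        if match and match.group(1) in ("WhiteElo", "BlackElo"):
            if match.group(2).isdigit():
                elos.append(int(match.group(2)))
    return max(elos)

def filter_by_elo(lines, min_elo=2000):
    """Yield the lines of every game where at least one player has min_elo."""
    for game in split_games(lines):
        if max_elo(game) >= min_elo:
            yield from game

SAMPLE = '''[White "Weak"]
[WhiteElo "1900"]
[BlackElo "1950"]

1. e4 e5 1-0

[White "Strong"]
[WhiteElo "2100"]
[BlackElo "1800"]

1. d4 d5 0-1

[White "Unrated"]
[WhiteElo "?"]

1. c4 c5 1/2-1/2
'''

result = "".join(filter_by_elo(SAMPLE.splitlines(keepends=True)))
print(result)
```

Because `filter_by_elo` is a generator, it can be fed an open file object and written out with `writelines`, so a large PGN file does not have to fit in memory.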