scid - pgn database size limitation?

Guenther
Posts: 4605
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

scid - pgn database size limitation?

Post by Guenther »

I am still extracting one of those gigantic lichess databases out of curiosity.
The one I downloaded (2018/01) will probably be around 35 GB (!) decompressed.

My plan is to convert it to a Scid database (Scid vs. PC, to be precise) with the bundled command-line tool.
(I have long used this tool for huge databases, e.g. for GURL-2.)
I have no idea, though, whether it will work on a file like the one described above with
my RAM limit of 4 GB (which cannot be extended because of the BIOS).
This is under Win7-64 Ultimate with NTFS, of course, so the OS file size limit
is no problem.

Will it work, and if so, roughly how long will the conversion take?

A fairly old thread about Scid database limitations:
http://talkchess.com/forum/viewtopic.ph ... size+limit

This is what the lichess database maintainers say about their gigantic PGN files:
https://database.lichess.org/

Code: Select all

Open PGN files

Traditional PGN databases, like SCID or ChessBase, fail to open large PGN files. Until they fix it, you can split the PGN files, or use programmatic APIs such as python-chess or Scoutfish.
https://rwbc-chess.de

trollwatch:
Chessqueen + chessica + AlexChess + Eduard + Sylwy
Guenther
Posts: 4605
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: scid - pgn database size limitation?

Post by Guenther »

update:

After extraction the file size was around 36 GB.
The Scid command-line conversion tool started smoothly (batch),
but the cmd always stops at the same point and no err file is written.

I see now that the number of games is > 17M, which would exceed Scid's
limit from the 2011 thread. However, because there is no error report and it stops
relatively early, after around 700 MB (sg4 file size), I don't know the real reason
for the failure.

Seems I have to try another one, a bit 'smaller'. (18/01 was the biggest)
I guess the format with eval/clock shouldn't be the problem, otherwise
it would have stopped earlier with an error message?

Code: Select all

1. e4 { [%eval 0.17] [%clk 0:00:30] }
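
One way to test that guess would be to strip the eval/clock comments from a small sample before converting it. A minimal Python sketch, assuming the comments always have the shape shown above (only `[%eval ...]` and `[%clk ...]` commands inside the braces); `strip_clock_and_eval` is a hypothetical helper name, not part of Scid or any PGN tool:

```python
import re

# A brace comment that contains nothing but [%eval ...] / [%clk ...] commands.
CLK_EVAL = re.compile(r'\s*\{\s*(?:\[%(?:eval|clk)\s[^\]]*\]\s*)+\}')

def strip_clock_and_eval(movetext):
    """Remove pure eval/clock comments; other comments are left untouched."""
    return CLK_EVAL.sub("", movetext)

sample = "1. e4 { [%eval 0.17] [%clk 0:00:30] } e5 { [%clk 0:00:29] } 0-1"
print(strip_clock_and_eval(sample))
```

If the conversion of the stripped sample behaves the same as the original, the comment format is unlikely to be the cause.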

Code: Select all

Month           Size            Games           Clock   Download
January 2018    4.63 GB         17,945,784      ?       2018-01.pgn.bz2
December 2017   4.19 GB         16,232,215      ?       2017-12.pgn.bz2
November 2017   3.69 GB         14,306,375      ?       2017-11.pgn.bz2
October 2017    3.54 GB         13,703,878      ?       2017-10.pgn.bz2
September 2017  3.24 GB         12,564,109      ?       2017-09.pgn.bz2
August 2017     3.23 GB         12,458,761      ?       2017-08.pgn.bz2
July 2017       3.13 GB         12,080,314      ?       2017-07.pgn.bz2
June 2017       2.98 GB         11,512,600      ?       2017-06.pgn.bz2
May 2017        3.03 GB         11,693,919      ?       2017-05.pgn.bz2
April 2017      2.95 GB         11,348,506      ?       2017-04.pgn.bz2
styx
Posts: 338
Joined: Tue Mar 13, 2012 9:59 pm
Location: Germany

Re: scid - pgn database size limitation?

Post by styx »

There are two things to consider:

1.) SCID databases can only hold up to 16777214 games (I think the developer of SCID mentioned somewhere, that he wants to remove this limit in future versions)
2.) there are also limitations for the PGN tags
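
The first point can be checked before starting a long conversion. A minimal Python sketch, assuming every game starts with an `[Event ...]` tag line (true for the lichess sample in this thread); `count_games` and `fits_in_scid` are hypothetical helper names, and the limit is simply the number quoted above:

```python
import io

SCID_MAX_GAMES = 16_777_214  # limit quoted above; numerically 2**24 - 2

def count_games(lines):
    """Count games by counting lines that open a new [Event ...] tag."""
    return sum(1 for line in lines if line.startswith("[Event "))

def fits_in_scid(n_games, limit=SCID_MAX_GAMES):
    """True if the game count does not exceed the quoted limit."""
    return n_games <= limit

# In-memory demo; for a real file pass an open() handle, which is read line by line.
sample = io.StringIO(
    '[Event "Rated Bullet game"]\n[Site "lichess.org"]\n\n1. e4 e5 0-1\n\n'
    '[Event "Rated Bullet game"]\n[Site "lichess.org"]\n\n1. d4 d5 1-0\n'
)
n = count_games(sample)
print(n, fits_in_scid(n))
print(fits_in_scid(17_945_784))  # January 2018 game count from the lichess table
```

Because the file is only iterated line by line, memory use stays small even for a 36 GB PGN.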

The lichess database stores the full game path in the "[Site]" tag. For example:

Code: Select all

[Site "https://lichess.org/HNWwubot"]
So there are as many different "Sites" as games in the database.

If I remember correctly, SCID can only store around 400,000 different Site Tags per database. So you would need to strip the "Site"-tag out of your PGN file before importing into SCID.
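
A minimal Python sketch of such a cleanup, which reads and writes one line at a time so that the file size does not matter. It replaces the per-game URL with a single value rather than deleting the tag (an assumption about what is wanted); `normalize_site_tags` is a hypothetical helper name:

```python
import io
import re

# Matches a Site tag whose value is a per-game lichess URL.
SITE_TAG = re.compile(r'^\[Site\s+"https?://lichess\.org/[^"]*"\]\s*$')

def normalize_site_tags(src, dst, replacement='[Site "lichess.org"]'):
    """Copy src to dst line by line, collapsing per-game URLs into one Site value."""
    changed = 0
    for line in src:
        if SITE_TAG.match(line):
            dst.write(replacement + "\n")
            changed += 1
        else:
            dst.write(line)
    return changed

# In-memory demo; for real files pass two open() handles instead.
src = io.StringIO(
    '[Event "Rated Bullet game"]\n[Site "https://lichess.org/HNWwubot"]\n\n1. e4 e5 0-1\n'
)
dst = io.StringIO()
print(normalize_site_tags(src, dst))
print(dst.getvalue())
```

Only lines that are a complete Site tag are rewritten, so movetext and other tags pass through unchanged.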
Guenther
Posts: 4605
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: scid - pgn database size limitation?

Post by Guenther »

styx wrote:There are two things to consider:

1.) SCID databases can only hold up to 16777214 games (I think the developer of SCID mentioned somewhere, that he wants to remove this limit in future versions)
2.) there are also limitations for the PGN tags

The lichess database stores the full game path in the "[Site]" tag. For example:

Code: Select all

[Site "https://lichess.org/HNWwubot"]
So there are as many different "Sites" as games in the database.

If I remember correctly, SCID can only store around 400,000 different Site Tags per database. So you would need to strip the "Site"-tag out of your PGN file before importing into SCID.
Thanks for the hint about the Site tag!
After my first attempt failed, I split the file into several chunks with pgnsplit
(5 GB each), and the conversion tool still stopped at the same point,
which of course indicates that there is another problem as well.

I am trying now to clean up the site tags, but even UltraEdit needs quite
some time to open a 5GB pgn file.
styx
Posts: 338
Joined: Tue Mar 13, 2012 9:59 pm
Location: Germany

Re: scid - pgn database size limitation?

Post by styx »

Yes it's all a bit complicated. As far as I know, there is no tool that can handle such huge chess databases in a convenient way.
Guenther
Posts: 4605
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: scid - pgn database size limitation?

Post by Guenther »

styx wrote:Yes it's all a bit complicated. As far as I know, there is no tool that can handle such huge chess databases in a convenient way.
I see. With UltraEdit I replaced all Event and Site tag values that contained https://lichess.org/XYZ... with plain lichess.org, and the conversion nearly got through
the first 5 GB chunk, but now the Scid conversion tool really crashes (before, it always stopped silently) after around 2.1 GB.

Any idea what else could be fixed in the PGN file before conversion? Without error messages I have no clue what it could be.

BTW, what is the usual size difference between the Scid main database file and the PGN? I no longer remember; I last used it 1.5 years ago.

Guenther
Ozymandias
Posts: 1532
Joined: Sun Oct 25, 2009 2:30 am

Re: scid - pgn database size limitation?

Post by Ozymandias »

styx wrote:If I remember correctly, SCID can only store around 400,000 different Site Tags per database.
I wonder if that was the problem I experienced.
Ozymandias
Posts: 1532
Joined: Sun Oct 25, 2009 2:30 am

Re: scid - pgn database size limitation?

Post by Ozymandias »

Guenther wrote:I am trying now to clean up the site tags, but even UltraEdit needs quite
some time to open a 5GB pgn file.
You can try Norman Pollock's tagRemove, to completely get rid of the tag. It should be faster.
styx
Posts: 338
Joined: Tue Mar 13, 2012 9:59 pm
Location: Germany

Re: scid - pgn database size limitation?

Post by styx »

I have no idea what might have caused the crash. Maybe your tool messed up the PGN structure at some point due to some resource problems?

I think I had some problems too when I used a small bash-script in linux to remove everything after the top-level-domain in the lichess URL.

I was able to import the PGN into a SCID database. But there was an error and afterwards, fewer games were available than expected. I never managed to find the problem. I still think that sed messed up the file at some point. I don't know why and where and eventually gave up.

Are there any tools for an integrity check on PGN files? Because you need a lot of RAM to open a big PGN file in an editor and even if it works, you need to find the wrong syntax within millions of games.
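
A rough structural check can at least be done without loading the file into an editor. A minimal Python sketch, assuming tags and movetext look like the lichess sample in this thread; `check_pgn` is a hypothetical helper, the brace count is only a heuristic, and this is far from a full PGN validator (it does not check move legality):

```python
import io
import re

TAG = re.compile(r'^\[([A-Za-z0-9_]+)\s+"((?:[^"\\]|\\.)*)"\]\s*$')
RESULTS = ("1-0", "0-1", "1/2-1/2", "*")

def check_pgn(lines):
    """Return (line_number, message) pairs; keeps only one game in memory."""
    problems = []
    tags, moves, start = {}, [], 0

    def finish():
        text = " ".join(moves)
        tokens = text.split()
        if text.count("{") != text.count("}"):
            problems.append((start, "unbalanced comment braces"))
        if not tokens or tokens[-1] not in RESULTS:
            problems.append((start, "movetext does not end with a result token"))
        elif tags.get("Result", tokens[-1]) != tokens[-1]:
            problems.append((start, "Result tag disagrees with movetext"))

    for number, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and not line.startswith("[%"):
            if moves:  # a tag line after movetext starts the next game
                finish()
                tags, moves = {}, []
            if not tags:
                start = number  # line where this game's header begins
            match = TAG.match(line)
            if match:
                tags[match.group(1)] = match.group(2)
            else:
                problems.append((number, "malformed tag line"))
        else:
            moves.append(line)
    if tags or moves:
        finish()
    return problems

good = '[Event "A"]\n[Result "0-1"]\n\n1. e4 { [%clk 0:01:00] } e5 0-1\n\n'
bad = '[Event "B"]\n[Result "1-0"]\n\n1. d4 { [%clk 0:01:00 d5 0-1\n'
for problem in check_pgn(io.StringIO(good + bad)):
    print(problem)
```

The reported line number is where the affected game's header starts, which narrows the search to one game instead of millions.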

The compression rate is quite good. My SCID database with 3.56 million games (no annotated games) is 554 MB in size. That's 2.4 GB in plain PGN.
Guenther
Posts: 4605
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: scid - pgn database size limitation?

Post by Guenther »

styx wrote:I have no idea what might have caused the crash. Maybe your tool messed up the PGN structure at some point due to some resource problems?
If by 'my tool' you mean UltraEdit, then no, it did not mess up anything.
It is actually a high-performance professional editor for Windows, far above the free Notepad++, for example. It also does _not_ need enormous amounts of RAM to load very big files, and I looked through the complete processed PGN afterwards.

It really seems to be a problem similar to the one Juan described in his linked thread, with files > 2 GB in the pgnscid command-line tool.
I used Scid vs. PC 4.18, though, instead of Scid 4.64. In any case, my pgnscid.exe (the command-line conversion tool) is from July 2017 and is 552,960 bytes.
Importing the PGN via the UI is impossible for me: as I said, I only have 4 GB of RAM here, and when I tested it, Scid closed after a while
when it ran out of memory (watched in the Task Manager).
styx wrote: I think I had some problems too when I used a small bash-script in linux to remove everything after the top-level-domain in the lichess URL.

I was able to import the PGN into a SCID database. But there was an error and afterwards, fewer games were available than expected. I never managed to find the problem. I still think that sed messed up the file at some point. I don't know why and where and eventually gave up.

Are there any tools for an integrity check on PGN files? Because you need a lot of RAM to open a big PGN file in an editor and even if it works, you need to find the wrong syntax within millions of games.

The compression rate is quite good. My SCID database with 3.56 million games (no annotated games) is 554 MB in size. That's 2.4 GB in plain PGN.
Well, that's strange, because I had assumed as much. If you read my post carefully, you will see that the first PGN chunk (after splitting) is 5 GB and the pgnscid
command-line tool crashes after around 2.1 GB, which is already much more than I had expected for a Scid conversion.
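
For reference, the arithmetic behind that expectation, using styx's figures (554 MB of Scid database for 2.4 GB of PGN) and treating 1 GB as 1000 MB for simplicity. The projection is only a rough estimate, since styx's database has no annotations while the lichess games carry clock comments:

```python
# Figures quoted from styx's post: 554 MB Scid database from 2.4 GB of PGN.
scid_mb = 554
pgn_mb = 2.4 * 1000  # assumption: 1 GB = 1000 MB

ratio = scid_mb / pgn_mb          # fraction of the PGN size left after conversion
chunk_mb = 5 * 1000               # one 5 GB pgnsplit chunk
projected_mb = chunk_mb * ratio   # naive projection for that chunk

print(f"ratio: {ratio:.2f}")
print(f"projected Scid size for a 5 GB chunk: {projected_mb:.0f} MB")
```

By this naive estimate a 5 GB chunk would end up well below the 2.1 GB at which the tool crashes.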

My guess is that the immense number of tags bloats it a lot, so
I have already considered reducing it to 7R (the seven tag roster) with pgnextract,
but that is quite slow and I wanted to try other things first.
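
For a quick experiment, the tag reduction can also be sketched as a line-by-line filter in Python. This is not how pgnextract works; it is a hypothetical helper (`reduce_to_roster`) that assumes one tag per line, and it fills the missing Date tag from UTCDate and the missing Round tag with "?", both of which are my own assumptions:

```python
import io
import re

SEVEN_TAG_ROSTER = ("Event", "Site", "Date", "Round", "White", "Black", "Result")
TAG = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

def reduce_to_roster(src, dst):
    """Stream src to dst, rewriting each game's header to the seven roster tags."""
    header = {}

    def emit_header():
        # Assumption: UTCDate stands in for the missing Date tag.
        header.setdefault("Date", header.get("UTCDate", "????.??.??"))
        for name in SEVEN_TAG_ROSTER:
            dst.write('[%s "%s"]\n' % (name, header.get(name, "?")))
        header.clear()

    for line in src:
        match = TAG.match(line)
        if match:
            header[match.group(1)] = match.group(2)
        else:
            if header:  # first non-tag line after a header
                emit_header()
            dst.write(line)
    if header:  # header at end of file without movetext
        emit_header()

src = io.StringIO(
    '[Event "Rated Bullet game"]\n[Site "lichess.org"]\n[White "ARM13"]\n'
    '[Black "selvasaravanan123"]\n[Result "0-1"]\n[UTCDate "2018.01.05"]\n'
    '[WhiteElo "1502"]\n[TimeControl "60+0"]\n\n1. e4 d5 0-1\n'
)
dst = io.StringIO()
reduce_to_roster(src, dst)
print(dst.getvalue())
```

Movetext lines such as those starting with `[%clk` do not match the tag pattern, so they are copied through unchanged.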

This is the last game of the first 5 GB chunk (the one I tested), and you can see it
has 15 (!) tags, like all games in this lichess database.
The PGN file has 45,198,338 lines and 2,510,113 games.
I will try again with a very tiny portion of it, to rule out that something else in the PGN structure is the cause...

BTW, I wonder why they did not split it up by time control.
I was more interested in the blitz TC, not bullet (nor the longer TCs).
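
Splitting by time control can be approximated with a streaming filter on the Event tag. A minimal Python sketch, assuming every game starts with an `[Event ...]` line and that blitz games are labelled "Rated Blitz game" by analogy with the "Rated Bullet game" label in the sample below (I have not verified the exact wording); `filter_games` is a hypothetical helper name:

```python
import io

def filter_games(src, dst, keyword="Blitz"):
    """Copy only games whose Event line contains keyword; buffers one game at a time."""
    game, kept = [], 0

    def flush():
        nonlocal kept
        if game and keyword in game[0]:
            dst.writelines(game)
            kept += 1

    for line in src:
        if line.startswith("[Event ") and game:
            flush()          # previous game is complete
            game = []
        game.append(line)
    flush()                  # last game in the file
    return kept

src = io.StringIO(
    '[Event "Rated Bullet game"]\n[TimeControl "60+0"]\n\n1. e4 d5 0-1\n\n'
    '[Event "Rated Blitz game"]\n[TimeControl "300+0"]\n\n1. d4 d5 1-0\n'
)
dst = io.StringIO()
print(filter_games(src, dst))
print(dst.getvalue())
```

Filtering before conversion would also shrink the input, which may help with the game-count and file-size limits discussed in this thread.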

Code: Select all

[Event "Rated Bullet game"]
[Site "lichess.org"]
[White "ARM13"]
[Black "selvasaravanan123"]
[Result "0-1"]
[UTCDate "2018.01.05"]
[UTCTime "14:26:41"]
[WhiteElo "1502"]
[BlackElo "1536"]
[WhiteRatingDiff "-9"]
[BlackRatingDiff "+11"]
[ECO "C02"]
[Opening "French Defense: Advance Variation #4"]
[TimeControl "60+0"]
[Termination "Time forfeit"]

1. e4 { [%clk 0:01:00] } d5 { [%clk 0:01:00] } 2. e5 { [%clk 0:00:59] } e6 { [%clk 0:00:59] } 3. d4 { [%clk 0:00:59] } c5 { [%clk 0:00:59] } 4. Be3 { [%clk 0:00:57] } cxd4 { [%clk 0:00:59] } 5. Bxd4 { [%clk 0:00:56] } Nc6 { [%clk 0:00:59] } 6. c3 { [%clk 0:00:55] } Be7 { [%clk 0:00:58] } 7. Nf3 { [%clk 0:00:55] } a6 { [%clk 0:00:57] } 8. Bd3 { [%clk 0:00:54] } b5 { [%clk 0:00:57] } 9. O-O { [%clk 0:00:53] } Qc7 { [%clk 0:00:54] } 10. a3 { [%clk 0:00:52] } Bd8 { [%clk 0:00:53] } 11. b4 { [%clk 0:00:51] } Nge7 { [%clk 0:00:51] } 12. Qc1 { [%clk 0:00:48] } O-O { [%clk 0:00:50] } 13. Qc2 { [%clk 0:00:47] } g6 { [%clk 0:00:48] } 14. Ng5 { [%clk 0:00:43] } Nf5 { [%clk 0:00:46] } 15. Bc5 { [%clk 0:00:40] } Re8 { [%clk 0:00:44] } 16. Nf3 { [%clk 0:00:38] } Ng7 { [%clk 0:00:41] } 17. Qd2 { [%clk 0:00:35] } Be7 { [%clk 0:00:40] } 18. Qh6 { [%clk 0:00:35] } Bxc5 { [%clk 0:00:38] } 19. bxc5 { [%clk 0:00:35] } Nf5 { [%clk 0:00:36] } 20. Nd4 { [%clk 0:00:35] } Nxh6 { [%clk 0:00:34] } 21. h3 { [%clk 0:00:33] } Nxd4 { [%clk 0:00:33] } 22. Nd2 { [%clk 0:00:33] } Nc6 { [%clk 0:00:32] } 23. Nf3 { [%clk 0:00:31] } Nxe5 { [%clk 0:00:30] } 24. Nxe5 { [%clk 0:00:29] } Qxe5 { [%clk 0:00:30] } 25. Rae1 { [%clk 0:00:26] } Qxc3 { [%clk 0:00:29] } 26. Be2 { [%clk 0:00:23] } Qxc5 { [%clk 0:00:28] } 27. Bg4 { [%clk 0:00:23] } Qxa3 { [%clk 0:00:27] } 28. Ra1 { [%clk 0:00:21] } Qe7 { [%clk 0:00:25] } 29. Rfe1 { [%clk 0:00:19] } e5 { [%clk 0:00:24] } 30. f4 { [%clk 0:00:15] } Nxg4 { [%clk 0:00:21] } 31. hxg4 { [%clk 0:00:13] } e4 { [%clk 0:00:20] } 32. Rac1 { [%clk 0:00:10] } Bxg4 { [%clk 0:00:19] } 33. g3 { [%clk 0:00:08] } Bf3 { [%clk 0:00:17] } 34. Kh2 { [%clk 0:00:08] } Qf6 { [%clk 0:00:15] } 35. Kg1 { [%clk 0:00:08] } Qg7 { [%clk 0:00:14] } 36. Kf2 { [%clk 0:00:08] } Qh6 { [%clk 0:00:13] } 37. Ke3 { [%clk 0:00:06] } Qg7 { [%clk 0:00:13] } 38. Kf2 { [%clk 0:00:05] } Qf8 { [%clk 0:00:13] } 39. Ke3 { [%clk 0:00:05] } Qe7 { [%clk 0:00:12] } 40. Kd4 { [%clk 0:00:05] } Qe6 { [%clk 0:00:12] } 41. 
Red1 { [%clk 0:00:04] } Qf6+ { [%clk 0:00:12] } 42. Ke3 { [%clk 0:00:04] } Rac8 { [%clk 0:00:12] } 43. Rxd5 { [%clk 0:00:03] } Rxc1 { [%clk 0:00:10] } 44. Rd8 { [%clk 0:00:01] } Rc3+ { [%clk 0:00:10] } 0-1