styx wrote: I have no idea what might have caused the crash. Maybe your tool messed up the PGN structure at some point due to some resource problems?
If by 'my tool' you mean UltraEdit, then no, it did not mess anything up.
It is actually a high-performance professional editor for Windows, far above the free Notepad++, for example. It also does _not_ need enormous RAM to load very large files, and I looked through the complete processed PGN afterwards.
It really does seem to be a problem similar to the one Juan described in his linked thread, with files > 2GB in the pgnscid command-line tool.
I used Scid vs. PC 4.18 though, instead of Scid 4.64. Anyhow my pgnscid.exe (which is the cmd conversion tool) is from July 2017 with 552.960 bytes.
Importing the PGN via the UI is impossible for me: as I already said, I only have 4GB RAM here. I tested it, and Scid closes after a while
when it runs out of memory (watched in the Task Manager).
styx wrote:
I think I had some problems too when I used a small bash-script in linux to remove everything after the top-level-domain in the lichess URL.
I was able to import the PGN into a SCID database. But there was an error and afterwards, fewer games were available than expected. I never managed to find the problem. I still think that sed messed up the file at some point. I don't know why and where and eventually gave up.
Are there any tools for an integrity check on PGN files? Because you need a lot of RAM to open a big PGN file in an editor and even if it works, you need to find the wrong syntax within millions of games.
The compression rate is quite good. My SCID database with 3,56 million games (no annotated games) is 554 MB in size. That's 2,4 GB in plain PGN.
Well, that's strange, and I had assumed as much. If you read my post carefully, you will see that the first PGN chunk (after splitting) is 5GB and the pgnscid
command-line tool crashes after around 2.1GB, which is already much more than I had expected for a Scid conversion.
My guess is that the immense number of tags bloats it a lot. I have therefore
already considered reducing it to 7R (Seven Tag Roster) with pgn-extract,
but that is quite slow and I wanted to try other things first.
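As a possible alternative to a dedicated tool for this one step, here is a minimal Python sketch that streams the file line by line and drops every tag pair outside the Seven Tag Roster, so it should not need much RAM. The names `keep_seven_tag_roster` and `convert` and the regular expression are my own assumptions, not part of Scid or any PGN tool, and I have not measured whether it is actually faster. Note that the lichess game shown in this post has no Date or Round tag, so fewer than seven tags would remain per game.

```python
import re

# The Seven Tag Roster defined by the PGN standard.
SEVEN_TAG_ROSTER = {"Event", "Site", "Date", "Round", "White", "Black", "Result"}

# A tag pair line looks like: [Name "value"]
# Clock comments such as "[%clk 0:00:04]" do not match because of the '%'.
TAG_LINE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def keep_seven_tag_roster(lines):
    """Yield all lines except tag pairs that are not in the Seven Tag Roster."""
    for line in lines:
        m = TAG_LINE.match(line)
        if m is None or m.group(1) in SEVEN_TAG_ROSTER:
            yield line


def convert(src_path, dst_path):
    """Stream src_path to dst_path without loading the whole file (hypothetical helper)."""
    with open(src_path, encoding="utf-8", errors="replace") as fin, \
            open(dst_path, "w", encoding="utf-8") as fout:
        fout.writelines(keep_seven_tag_roster(fin))
```

Movetext and blank lines pass through unchanged, so the clock comments stay in the file; removing those would be a separate step.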
Below is the last game of the first 5GB chunk that I tested, and you can see it
has 15(!) tags (like all games in this lichess database).
The pgn file has 45.198.338 lines and 2.510.113 games.
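Counts like these can be obtained in a single streaming pass without loading the file into an editor, which also gives a crude check against the number of games a database reports after import. The sketch below is only an illustration: the function names are made up, and it assumes every game starts with an `[Event "` line, as the game in this post does.

```python
def count_lines_and_games(lines):
    """Count lines and games in an iterable of byte lines.

    Assumption: each game begins with a line starting with '[Event "'.
    """
    n_lines = 0
    n_games = 0
    for line in lines:
        n_lines += 1
        if line.startswith(b'[Event "'):
            n_games += 1
    return n_lines, n_games


def count_file(path):
    """Hypothetical helper: stream a PGN file from disk in binary mode."""
    with open(path, "rb") as f:
        return count_lines_and_games(f)
```

Binary mode is used so that an odd byte somewhere in a multi-gigabyte file cannot raise a decoding error halfway through.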
I will try it again with a very tiny portion of it, to rule out that something else in the PGN structure is the cause...
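One way to cut such a tiny portion without opening the 5GB file in an editor is to copy only the first few games in a streaming pass. This is a rough sketch under the same assumption that each game starts with an `[Event "` line; `first_games` and `write_first_games` are hypothetical names.

```python
def first_games(lines, n):
    """Yield the lines of the first n games, then stop reading."""
    seen = 0
    for line in lines:
        if line.startswith('[Event "'):
            seen += 1
            if seen > n:
                return  # the (n+1)-th game begins here, so we are done
        if seen:
            # Skip anything that appears before the first game header.
            yield line


def write_first_games(src_path, dst_path, n):
    """Hypothetical helper: write the first n games of src_path to dst_path."""
    with open(src_path, encoding="utf-8", errors="replace") as fin, \
            open(dst_path, "w", encoding="utf-8") as fout:
        fout.writelines(first_games(fin, n))
```

If the small file converts cleanly, the portion can be grown step by step to narrow down where the conversion starts to fail.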
BTW, I wonder why they did not split it up by time control?
I am more interested in blitz than in bullet (and not in longer time controls either).
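Splitting by time control afterwards could be sketched roughly as follows: group the lines into games and keep only those whose Event tag contains a keyword. This assumes that blitz games carry the word "Blitz" in the Event tag in the same way the game in this post carries "Bullet"; I have only checked the bullet case. The function names are made up for illustration.

```python
def games(lines):
    """Group an iterable of lines into lists, one list per game.

    Assumption: each game begins with a line starting with '[Event "'.
    """
    game = []
    for line in lines:
        if line.startswith('[Event "') and game:
            yield game
            game = []
        game.append(line)
    if game:
        yield game


def filter_by_event(lines, keyword):
    """Yield the lines of every game whose first line contains keyword."""
    for game in games(lines):
        if keyword in game[0]:
            yield from game
```

Only one game is held in memory at a time, so this should also work on a machine with 4GB RAM. Filtering on the TimeControl tag instead would be more precise, but it would need a decision about which values count as blitz.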
Code:
[Event "Rated Bullet game"]
[Site "lichess.org"]
[White "ARM13"]
[Black "selvasaravanan123"]
[Result "0-1"]
[UTCDate "2018.01.05"]
[UTCTime "14:26:41"]
[WhiteElo "1502"]
[BlackElo "1536"]
[WhiteRatingDiff "-9"]
[BlackRatingDiff "+11"]
[ECO "C02"]
[Opening "French Defense: Advance Variation #4"]
[TimeControl "60+0"]
[Termination "Time forfeit"]
1. e4 { [%clk 0:01:00] } d5 { [%clk 0:01:00] } 2. e5 { [%clk 0:00:59] } e6 { [%clk 0:00:59] } 3. d4 { [%clk 0:00:59] } c5 { [%clk 0:00:59] } 4. Be3 { [%clk 0:00:57] } cxd4 { [%clk 0:00:59] } 5. Bxd4 { [%clk 0:00:56] } Nc6 { [%clk 0:00:59] } 6. c3 { [%clk 0:00:55] } Be7 { [%clk 0:00:58] } 7. Nf3 { [%clk 0:00:55] } a6 { [%clk 0:00:57] } 8. Bd3 { [%clk 0:00:54] } b5 { [%clk 0:00:57] } 9. O-O { [%clk 0:00:53] } Qc7 { [%clk 0:00:54] } 10. a3 { [%clk 0:00:52] } Bd8 { [%clk 0:00:53] } 11. b4 { [%clk 0:00:51] } Nge7 { [%clk 0:00:51] } 12. Qc1 { [%clk 0:00:48] } O-O { [%clk 0:00:50] } 13. Qc2 { [%clk 0:00:47] } g6 { [%clk 0:00:48] } 14. Ng5 { [%clk 0:00:43] } Nf5 { [%clk 0:00:46] } 15. Bc5 { [%clk 0:00:40] } Re8 { [%clk 0:00:44] } 16. Nf3 { [%clk 0:00:38] } Ng7 { [%clk 0:00:41] } 17. Qd2 { [%clk 0:00:35] } Be7 { [%clk 0:00:40] } 18. Qh6 { [%clk 0:00:35] } Bxc5 { [%clk 0:00:38] } 19. bxc5 { [%clk 0:00:35] } Nf5 { [%clk 0:00:36] } 20. Nd4 { [%clk 0:00:35] } Nxh6 { [%clk 0:00:34] } 21. h3 { [%clk 0:00:33] } Nxd4 { [%clk 0:00:33] } 22. Nd2 { [%clk 0:00:33] } Nc6 { [%clk 0:00:32] } 23. Nf3 { [%clk 0:00:31] } Nxe5 { [%clk 0:00:30] } 24. Nxe5 { [%clk 0:00:29] } Qxe5 { [%clk 0:00:30] } 25. Rae1 { [%clk 0:00:26] } Qxc3 { [%clk 0:00:29] } 26. Be2 { [%clk 0:00:23] } Qxc5 { [%clk 0:00:28] } 27. Bg4 { [%clk 0:00:23] } Qxa3 { [%clk 0:00:27] } 28. Ra1 { [%clk 0:00:21] } Qe7 { [%clk 0:00:25] } 29. Rfe1 { [%clk 0:00:19] } e5 { [%clk 0:00:24] } 30. f4 { [%clk 0:00:15] } Nxg4 { [%clk 0:00:21] } 31. hxg4 { [%clk 0:00:13] } e4 { [%clk 0:00:20] } 32. Rac1 { [%clk 0:00:10] } Bxg4 { [%clk 0:00:19] } 33. g3 { [%clk 0:00:08] } Bf3 { [%clk 0:00:17] } 34. Kh2 { [%clk 0:00:08] } Qf6 { [%clk 0:00:15] } 35. Kg1 { [%clk 0:00:08] } Qg7 { [%clk 0:00:14] } 36. Kf2 { [%clk 0:00:08] } Qh6 { [%clk 0:00:13] } 37. Ke3 { [%clk 0:00:06] } Qg7 { [%clk 0:00:13] } 38. Kf2 { [%clk 0:00:05] } Qf8 { [%clk 0:00:13] } 39. Ke3 { [%clk 0:00:05] } Qe7 { [%clk 0:00:12] } 40. Kd4 { [%clk 0:00:05] } Qe6 { [%clk 0:00:12] } 41. 
Red1 { [%clk 0:00:04] } Qf6+ { [%clk 0:00:12] } 42. Ke3 { [%clk 0:00:04] } Rac8 { [%clk 0:00:12] } 43. Rxd5 { [%clk 0:00:03] } Rxc1 { [%clk 0:00:10] } 44. Rd8 { [%clk 0:00:01] } Rc3+ { [%clk 0:00:10] } 0-1