The first release of the CGR games database

Posted: Mon Feb 20, 2017 6:47 am
by bstjean
As I was collecting games to build an opening book for my chess engine (in development), I thought, "Why not share my games with everyone?" And then it became a project of its own!

Well, here's the very first release of the CGR database!

For those interested, the details are here:

https://chessgamesrepository.wordpress. ... 0219-full/

Re: The first release of the CGR games database

Posted: Mon Feb 20, 2017 12:38 pm
by retep1
Thanks for your great work. Sadly, the archive is corrupt.

Re: The first release of the CGR games database

Posted: Mon Feb 20, 2017 1:36 pm
by Ozymandias
It’s been known for a while, games from the CCRL (Computer Chess Rating Lists) use a “custom” round number and many chess database programs don’t seem to like it. Besides, importing CCRL games will often cause the famous “Round Name limit of 262143 exceeded” error in Scid or Scid vs PC. So I have decided to replace the round number in the CCRL games by the default value of “?”.
So, that's why the CCRL404 was messing up my DB! I had to filter it through several programs, without actually knowing what was wrong with it. Good to know.
we will hit the 16 million games limit in Scid… For those who use other chess database software, are there similar limits? Do you want the database in multiple Zip files or just one Zip file?
16? More like 12. I haven't read about CB or CA having a hardcoded limit, but I'm sure any of them will crash with enough games. Finally, using the Zip format isn't the best way to go.
Finally, as a side note, the next release will probably be another FULL one as I have another 83G of PGN games ready ! I have kept the 206G that made it into this first release
Are you saying that the DB is already 206G big, but below 16 Million games?

Re: The first release of the CGR games database

Posted: Mon Feb 20, 2017 5:03 pm
by Fulvio
bstjean wrote: It’s been known for a while, games from the CCRL (Computer Chess Rating Lists) use a “custom” round number and many chess database programs don’t seem to like it. Besides, importing CCRL games will often cause the famous “Round Name limit of 262143 exceeded” error in Scid or Scid vs PC. So I have decided to replace the round number in the CCRL games by the default value of “?”. Does any one have a problem with this? Do you have any idea/suggestion/comment on this?
CCRL tags are like this:
[Event "CCRL 40/40"]
[Round "529.2.403"]
and I believe the best thing would be to change them to:
[Event "CCRL 40/40 - 529"]
[Round "2.403"]
if 529 is the tournament number.
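
A minimal sketch of that rewrite, assuming the first component of a three-part Round value really is the tournament number (which is unconfirmed here). The function name `rewrite_ccrl_headers` and both regular expressions are hypothetical, made up for illustration; they are not part of any existing tool.

```python
import re

# Hypothetical patterns: a CCRL event tag, and a three-part round such as "529.2.403".
EVENT_TAG = re.compile(r'^\[Event "(CCRL [^"]*)"\]$')
ROUND_TAG = re.compile(r'^\[Round "(\d+)\.(\d+\.\d+)"\]$')


def rewrite_ccrl_headers(header_lines):
    """Move the leading number of a three-part Round tag into the Event tag.

    Assumption: the leading number is the tournament number.
    Headers that do not match both patterns are returned unchanged.
    """
    out = list(header_lines)
    event_idx = round_idx = None
    for i, line in enumerate(out):
        if EVENT_TAG.match(line):
            event_idx = i
        elif ROUND_TAG.match(line):
            round_idx = i
    if event_idx is None or round_idx is None:
        return out

    event = EVENT_TAG.match(out[event_idx]).group(1)
    tournament, rest = ROUND_TAG.match(out[round_idx]).groups()
    out[event_idx] = f'[Event "{event} - {tournament}"]'
    out[round_idx] = f'[Round "{rest}"]'
    return out


if __name__ == "__main__":
    for line in rewrite_ccrl_headers(['[Event "CCRL 40/40"]', '[Round "529.2.403"]']):
        print(line)
```

Because the Round pattern only matches three-part values, running the function a second time on its own output leaves the tags alone.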

Re: The first release of the CGR games database

Posted: Mon Feb 20, 2017 7:48 pm
by bstjean
retep1 wrote: Thanks for your great work. Sadly, the archive is corrupt.
Looks like the problem is on your end.

I downloaded the zip file myself, again, to test it and it works just fine. Besides, it's been downloaded 50+ times so far and I haven't received any comment or email from anyone saying the archive was corrupted.

I'm using 7-Zip (on Windows) and it unzips fine, if that helps.

Re: The first release of the CGR games database

Posted: Mon Feb 20, 2017 7:49 pm
by bstjean
Fulvio wrote:
bstjean wrote: It’s been known for a while, games from the CCRL (Computer Chess Rating Lists) use a “custom” round number and many chess database programs don’t seem to like it. Besides, importing CCRL games will often cause the famous “Round Name limit of 262143 exceeded” error in Scid or Scid vs PC. So I have decided to replace the round number in the CCRL games by the default value of “?”. Does any one have a problem with this? Do you have any idea/suggestion/comment on this?
CCRL tags are like this:
[Event "CCRL 40/40"]
[Round "529.2.403"]
and I believe the best thing would be to change them to:
[Event "CCRL 40/40 - 529"]
[Round "2.403"]
if 529 is the tournament number.
I'm just processing PGN files the way they were produced! Right now, doing that kind of "magic" is not an option, but it's definitely doable in the not-so-distant future!

Re: The first release of the CGR games database

Posted: Mon Feb 20, 2017 7:52 pm
by bstjean
Are you saying that the DB is already 206G big, but below 16 Million games?
No! I'm saying I have downloaded 206G of PGN games and this database was built from that. Obviously, it looks like everyone has MANY games in common!!

Re: The first release of the CGR games database

Posted: Mon Feb 20, 2017 9:24 pm
by styx
A very weird ZIP format, but I managed to unzip it (using 7-Zip). It just took an unnecessarily long time.
Have you considered providing the database in SCID format? It occupies 100 MB less space than the zipped PGN file.

As for the database: nice stuff. Thank you.

Just for your information: There are at least 431000 doublets in this database.

Re: The first release of the CGR games database

Posted: Mon Feb 20, 2017 9:38 pm
by bstjean
styx wrote: A very weird ZIP format, but I managed to unzip it (using 7-Zip). It just took an unnecessarily long time.
Have you considered providing the database in SCID format? It occupies 100 MB less space than the zipped PGN file.

As for the database: nice stuff. Thank you.

Just for your information: There are at least 431000 doublets in this database.
1) I wasn't sure the sudden burst of downloads wouldn't cause problems, so I zipped the file with the maximum compression I could find (see my post on the blog regarding this).
2) For now, I will stick to the PGN format as not everyone uses Scid. The goal is to provide a quality database to *everyone*, not only Scid users! And the portability of the PGN format is currently the best solution!
3) Thanks for the info. But which options did you use to detect those? I made 5 passes of "twin checks" and, obviously, I missed some! I'm interested to know how you detected those duplicates!
4) Next release, I will probably provide 5 zip files, one per ECO classification.

Thanks!
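
The per-ECO split mentioned in point 4 could look roughly like the sketch below. It assumes every game starts with an `[Event "..."]` tag and that the ECO code sits in an `[ECO "..."]` tag; games without a usable code land in an "unclassified" bucket. The function name `split_by_eco` is hypothetical, and writing each bucket to its own zip file is left out. It needs Python 3.7 or later, since it splits on an empty match.

```python
import re
from collections import defaultdict

# ECO codes run from A00 to E99, so the first letter gives one of five volumes.
ECO_TAG = re.compile(r'^\[ECO "([A-E])\d{2}[^"]*"\]', re.MULTILINE)
# Zero-width split point in front of each Event tag (assumed to open every game).
GAME_START = re.compile(r'^(?=\[Event ")', re.MULTILINE)


def split_by_eco(pgn_text):
    """Group raw game strings by the letter of their ECO code."""
    buckets = defaultdict(list)
    for game in GAME_START.split(pgn_text):
        if not game.strip():
            continue
        match = ECO_TAG.search(game)
        volume = match.group(1) if match else "unclassified"
        buckets[volume].append(game.strip())
    return dict(buckets)
```

Each bucket can then be joined with blank lines and written to its own PGN file before compression.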