The first release of the CGR games database

gbtami
Posts: 389
Joined: Wed Sep 26, 2012 1:29 pm
Location: Hungary

Re: The first release of the CGR games database

Post by gbtami »

bstjean wrote:
gbtami wrote:
bstjean wrote:
gbtami wrote:
bstjean wrote: 2) Because they take an awful lot of space. You could easily double the size of the files.

3) I was thinking of providing the WhiteIsComp/BlackIsComp tag (when it is not already there) in a future release so you could filter out games played by computers. As I said, I don't want to impose my choices on people. I'd rather give them everything while making sure they have all they need to tailor their copy of the database to their own needs!
2) I disagree here. Very few games in .pgn files have annotations. The extra space is negligible compared to the full .pgn size.

3) Good idea!
Anyway, keeping the comments/annotations just creates another problem. Often times, you'll find the same "historical" games annotated multiple times by multiple people. Do I keep Karpov's comments? Or Kortchnoi's annotations? Or Seirawan's comments? I would always find people who'd prefer one analysis over the other... Besides, most annotated games are also annotated on a lot of online sites if one absolutely needs comments. And nowadays, a LOT of people prefer having a chess engine running (and alerting you of blunders, better lines) as they go through the game...
I don't see "another problem" here. If one game occurs with annotation by Karpov and by Kortchnoi too, I want to read both! This is not a problem but a huge value! Chessbase mega database has this feature. "The exclusive annotated database. Contains more than 6.8 millions games from 1560 to 2016 in the highest ChessBase quality standard. 70,000 games contain commentary from top players"
Why do you think it's a problem?

Analysis engines in a GUI will never give you insights like the ones Karpov and other human giants give in their annotations!
Let's put it this way: I started with 206G of games and ended up with only 7G. There are **TONS** of games annotated by chess engines out there. We would end up with **TONS** of crap most people don't want. You'd be amazed to see how many times I've seen the Karpov-Korchnoi 1974 match, second game (http://www.chessgames.com/perl/chessgame?gid=1067858) annotated by chess engines or even some John Doe! Besides, since annotations mostly originate from stuff that has been published and reproduced, there is a copyright problem with this!
I never thought cleaning up crap like engine analysis would be an easy task, but it is definitely doable. Copyright problems are real, but if you list your public sources in readme.txt you may be safe. If any copyright holder ever complains, you just have to strip that annotation from your database.
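For reference, stripping annotations from PGN movetext is mechanically simple. Below is a minimal sketch in Python, not the process actually used for the CGR database. The function name `strip_annotations` is hypothetical. It removes brace comments, `;` line comments, parenthesized variations, numeric annotation glyphs such as `$1`, and `!`/`?` suffixes. It strips everything without distinguishing engine output from human commentary, and it assumes well-formed movetext (it does not handle `%` escape lines).

```python
import re


def strip_annotations(movetext: str) -> str:
    """Return PGN movetext with comments, variations, and glyphs removed."""
    out = []
    depth = 0         # nesting depth of ( ... ) variations
    in_brace = False  # inside a { ... } comment
    in_line = False   # inside a ; comment that runs to end of line
    for ch in movetext:
        if in_brace:
            if ch == "}":
                in_brace = False
            continue
        if in_line:
            if ch == "\n":
                in_line = False
                out.append(" ")
            continue
        if ch == "{":
            in_brace = True
        elif ch == ";":
            in_line = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(0, depth - 1)
        elif depth == 0:
            out.append(ch)  # keep only main-line characters
    text = "".join(out)
    text = re.sub(r"\$\d+", " ", text)            # numeric annotation glyphs
    text = re.sub(r"(?<=[\w+#])[!?]+", "", text)  # suffixes such as ! or ??
    text = re.sub(r"\d+\.\.\.", " ", text)        # "1..." left behind by a removed comment
    return " ".join(text.split())


annotated = (
    "1. e4 {King's pawn} e5 (1... c5 2. Nf3 {Sicilian (open)}) "
    "2. Nf3! $1 Nc6?? ; weak\n3. Bb5 1-0"
)
print(strip_annotations(annotated))
```

Keeping only selected annotations (for example, dropping engine output but keeping human commentary) would need a heuristic on the comment text itself, which this sketch does not attempt.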
Ozymandias
Posts: 1534
Joined: Sun Oct 25, 2009 2:30 am

Re: The first release of the CGR games database

Post by Ozymandias »

bstjean wrote:I zipped the file with the maximum compression I could find (see my post on the blog regarding this)
Link to that post on the blog, please?
Guenther
Posts: 4605
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: The first release of the CGR games database

Post by Guenther »

Ozymandias wrote:
bstjean wrote:I zipped the file with the maximum compression I could find (see my post on the blog regarding this)
Link to that post on the blog, please?
Second post from the main url: PGN Files & Compression
https://chessgamesrepository.wordpress.com/
gbtami
Posts: 389
Joined: Wed Sep 26, 2012 1:29 pm
Location: Hungary

Re: The first release of the CGR games database

Post by gbtami »

It would be cool to create a GitHub repo for this process. I mean for the .pgn sources and the scripts you use to download, clean, merge, etc. This way others can help/contribute and follow how it goes. Involving the community in creating a public game database can help a lot. Just look at how more contributors made Stockfish better and better.
retep1
Posts: 44
Joined: Sun Aug 07, 2016 5:24 pm

Re: The first release of the CGR games database

Post by retep1 »

You are right. With 7-zip it works. Winrar wasn't able to do this.
I can't load the file with Chessbase 14, but I can with Aquarium 2017.
Am I right that there are games only until 2008?
styx
Posts: 338
Joined: Tue Mar 13, 2012 9:59 pm
Location: Germany

Re: The first release of the CGR games database

Post by styx »

There are games until February 12th, 2017.
bstjean
Posts: 19
Joined: Sat Oct 08, 2016 10:10 pm
Location: Montreal
Full name: Benoît St-Jean

Re: The first release of the CGR games database

Post by bstjean »

retep1 wrote: You are right. With 7-zip it works. Winrar wasn't able to do this.
I can't load the file with Chessbase 14, but I can with Aquarium 2017.
Am I right that there are games only until 2008?
You're wrong. There are even 384K games since Jan 1st 2016!
bstjean
Posts: 19
Joined: Sat Oct 08, 2016 10:10 pm
Location: Montreal
Full name: Benoît St-Jean

Re: The first release of the CGR games database

Post by bstjean »

retep1 wrote: You are right. With 7-zip it works. Winrar wasn't able to do this.
I can't load the file with Chessbase 14, but I can with Aquarium 2017.
Am I right that there are games only until 2008?
I am curious. What is the error you encounter with Chessbase 14?
Ozymandias
Posts: 1534
Joined: Sun Oct 25, 2009 2:30 am

Re: The first release of the CGR games database

Post by Ozymandias »

Guenther wrote:
Ozymandias wrote:
bstjean wrote:I zipped the file with the maximum compression I could find (see my post on the blog regarding this)
Link to that post on the blog, please?
Second post from the main url: PGN Files & Compression
https://chessgamesrepository.wordpress.com/
A compression ratio of 14.588 is quite good, but depending on how fast it is (especially when decompressing), it might not beat nanozip's 11.8 ratio.
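To make the ratio-versus-speed trade-off concrete, here is a rough sketch that measures both for a block of PGN-like text. It uses only the codecs in Python's standard library (zlib, bz2, lzma) as stand-ins; it does not run 7-zip or nanozip, and the helper name `measure` and the sample data are made up for illustration. The sample is highly repetitive, so the ratios it prints are far higher than a real database would give.

```python
import bz2
import lzma
import time
import zlib


def measure(name, compress, decompress, data):
    """Return compression ratio and decompression time for one codec."""
    packed = compress(data)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    if restored != data:
        raise ValueError(f"{name} round trip failed")
    return {
        "codec": name,
        "ratio": len(data) / len(packed),  # uncompressed size / compressed size
        "decompress_s": elapsed,
    }


# Made-up sample: one tiny game repeated many times.
sample = (
    b'[Event "Example"]\n[Result "1-0"]\n\n'
    b"1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 1-0\n\n"
) * 5000

results = [
    measure("zlib", lambda d: zlib.compress(d, 9), zlib.decompress, sample),
    measure("bz2", lambda d: bz2.compress(d, 9), bz2.decompress, sample),
    measure("lzma", lzma.compress, lzma.decompress, sample),
]

for r in results:
    print(f'{r["codec"]:5s} ratio={r["ratio"]:.1f} decompress={r["decompress_s"]:.4f}s')
```

Running the same measurement on a real slice of the database would give a fairer comparison than published ratios taken from different inputs.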
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: The first release of the CGR games database

Post by jdart »

A few comments: I am sure this took a lot of effort and contributing it to the public is a fine idea. But there are some issues:

1. I generally don't use computer match games played without book or with a limited-depth book. CCRL for example uses a limited book - the lines change from time to time but the same lines are repeated many times. They are not necessarily the best opening moves. Also many engines will play a suboptimal move in the first few moves out of book.

2. Quite a few games, especially correspondence games and those on Playchess, are lost by forfeit, so they have a result but that result is not a good guide to how well the players were doing when the game was terminated. For example, White might have been winning but overstepped the time limit and lost. I have a filter program that will weed these games out, since my book building program considers game results.

3. I mostly don't use blitz games due to the higher rate of errors in blitz play. I don't even use computer blitz games, although there is less reason to avoid those now, since with today's search speeds engines reach quite high depths even in blitz. Still, I started weeding them out when engines were much slower.

4. While there are a lot of free game collections by human players available on the Internet, the quality of these is often low. The games themselves sometimes have errors, but the metadata (Event/Site/Date/Round) very frequently is wrong/incomplete or has things like round in the Site tag. I think some of these were pirated from Chessbase in the early days and auto-converted out of their format by buggy software. TWIC though is a very good source and doesn't have these issues.

--Jon
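The tag-based filtering discussed in this thread (the WhiteIsComp/BlackIsComp tags, and weeding out forfeited games) can be sketched as follows. This is not Jon's filter program or anything used for the CGR database; all function names are hypothetical. It assumes that computer players are marked either with WhiteIsComp/BlackIsComp set to "Yes" (an assumption about the value) or with the PGN standard WhiteType/BlackType value "program", and that forfeits are recorded in the standard Termination tag. Many files omit these tags, in which case the game is kept.

```python
import re

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

# Termination values from the PGN standard that indicate an unfinished game.
FORFEIT_TERMINATIONS = {"time forfeit", "abandoned", "rules infraction"}


def split_games(pgn_text):
    """Yield (tags, movetext) pairs; a tag line after movetext starts a new game."""
    tags, moves = {}, []
    for line in pgn_text.splitlines():
        match = TAG_RE.match(line)
        if match and moves:
            yield tags, " ".join(moves)
            tags, moves = {}, []
        if match:
            tags[match.group(1)] = match.group(2)
        elif line.strip():
            moves.append(line.strip())
    if tags or moves:
        yield tags, " ".join(moves)


def is_computer_game(tags):
    """True if either side is flagged as a computer (assumed tag values)."""
    for side in ("White", "Black"):
        if tags.get(side + "IsComp", "").lower() == "yes":
            return True
        if tags.get(side + "Type", "").lower() == "program":
            return True
    return False


def is_forfeit(tags):
    """True if the Termination tag says the game did not end normally."""
    return tags.get("Termination", "").lower() in FORFEIT_TERMINATIONS


def keep(tags, drop_computer=True, drop_forfeit=True):
    if drop_computer and is_computer_game(tags):
        return False
    if drop_forfeit and is_forfeit(tags):
        return False
    return True


SAMPLE = """[White "Karpov, Anatoly"]
[Black "Korchnoi, Viktor"]
[Result "1-0"]

1. e4 c5 1-0

[White "SomeEngine"]
[Black "Doe, John"]
[WhiteIsComp "Yes"]
[Result "1-0"]

1. d4 d5 1-0

[White "Player A"]
[Black "Player B"]
[Termination "time forfeit"]
[Result "0-1"]

1. c4 e5 0-1
"""

kept = [tags["White"] for tags, _ in split_games(SAMPLE) if keep(tags)]
print(kept)
```

The `drop_computer` and `drop_forfeit` switches are there so that each user can tailor the filtering to their own needs rather than having one choice imposed on them.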