The first release of the CGR games database

Norm Pollock
Posts: 1056
Joined: Thu Mar 09, 2006 4:15 pm
Location: Long Island, NY, USA

Re: The first release of the CGR games database

Post by Norm Pollock »

You claim the (wh) and (bl) tags are useless and idiotic on the basis of redundancy. If it were just to indicate which side he is playing I would agree with you. However...

What these tags do is allow a player to be analyzed as 2 different entities. We can examine the player from each color. We can compute his Elo for each color, examine his repertoire (ECO), see style differences like aggressiveness, and so on.

You would first have to standardize each player's name and make sure each player is not playing under multiple spellings of his name. This is not practical in a super-mega-database.
bstjean
Posts: 19
Joined: Sat Oct 08, 2016 10:10 pm
Location: Montreal
Full name: Benoît St-Jean

Re: The first release of the CGR games database

Post by bstjean »

Norm Pollock wrote:You claim the (wh) and (bl) tags are useless and idiotic on the basis of redundancy. If it were just to indicate which side he is playing I would agree with you. However...

What these tags do is allow a player to be analyzed as 2 different entities. We can examine the player from each color. We can compute his Elo for each color, examine his repertoire (ECO), see style differences like aggressiveness, and so on.

You would first have to standardize each player's name and make sure each player is not playing under multiple spellings of his name. This is not practical in a super-mega-database.
That kind of filter can be applied in 2 seconds in any decent chess database! Specify the player, the color, you're done. Repeat for black.
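
For readers without a database GUI at hand, the same per-color filter can be sketched in a few lines of Python. This is only an illustration: it assumes a plain PGN file with standard `White`, `Black` and `Result` tags and exact name matching, and the helper names `iter_headers` and `score_by_color` are made up for this example.

```python
import re

# Matches one PGN tag pair line such as: [White "Karpov, Anatoly"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

# Points for (White, Black) for each finished PGN result string.
POINTS = {"1-0": (1.0, 0.0), "0-1": (0.0, 1.0), "1/2-1/2": (0.5, 0.5)}


def iter_headers(pgn_text):
    """Yield one dict of tag pairs per game (hypothetical helper)."""
    headers, in_tags = {}, False
    for line in pgn_text.splitlines():
        match = TAG_RE.match(line.strip())
        if match:
            # A tag line that follows movetext starts a new game.
            if not in_tags and headers:
                yield headers
                headers = {}
            in_tags = True
            headers[match.group(1)] = match.group(2)
        else:
            in_tags = False
    if headers:
        yield headers


def score_by_color(pgn_text, player):
    """Return {color: (points, games)} for an exact player name."""
    totals = {"white": [0.0, 0], "black": [0.0, 0]}
    for headers in iter_headers(pgn_text):
        points = POINTS.get(headers.get("Result"))
        if points is None:  # unfinished or unknown result
            continue
        if headers.get("White") == player:
            totals["white"][0] += points[0]
            totals["white"][1] += 1
        elif headers.get("Black") == player:
            totals["black"][0] += points[1]
            totals["black"][1] += 1
    return {color: (pts, n) for color, (pts, n) in totals.items()}
```

Because the match on the name is exact, inconsistent spellings of a player's name would still have to be cleaned up first for the totals to be meaningful.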
gbtami
Posts: 389
Joined: Wed Sep 26, 2012 1:29 pm
Location: Hungary

Re: The first release of the CGR games database

Post by gbtami »

bstjean wrote:
styx wrote: A very weird ZIP format, but I managed to unzip it (using 7zip). It just took an unnecessarily long time.
Have you considered providing the database in SCID format? It occupies 100 MB less space than the zipped PGN file.

As for the database: nice stuff. Thank you.

Just for your information: There are at least 431000 doublets in this database.
1) I wasn't sure the sudden burst of downloads wouldn't cause problems so I zipped the file with the maximum compression I could find (see my post on the blog regarding this)
2) For now, I will stick to the PGN format, as not everyone uses Scid. The goal is to provide a quality database to *everyone*, not only Scid users! And PGN is currently the most portable format!
3) Thanks for the info. But which options did you use to detect those? I made 5 passes of "twin checks" and, obviously, I missed some! I'm interested to know how you detected those duplicates!
4) Next release, I will probably provide 5 zip files, one per ECO classification.

Thanks!
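
One possible way to look for such duplicates, offered only as a rough sketch and not as the method used by Scid or by anyone in this thread: reduce each game to its bare move sequence and group the games whose sequences are identical. The helper names (`split_games`, `move_key`, `find_twins`) are hypothetical, and `;` line comments and malformed PGN are not handled.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def split_games(pgn_text):
    """Yield (headers, movetext) for each game in a PGN string."""
    headers, moves, in_tags = {}, [], False
    for line in pgn_text.splitlines():
        stripped = line.strip()
        match = TAG_RE.match(stripped)
        if match:
            # A tag line that follows movetext starts a new game.
            if not in_tags and (headers or moves):
                yield headers, " ".join(moves)
                headers, moves = {}, []
            in_tags = True
            headers[match.group(1)] = match.group(2)
        else:
            in_tags = False
            if stripped:
                moves.append(stripped)
    if headers or moves:
        yield headers, " ".join(moves)


def move_key(movetext):
    """Reduce movetext to a tuple of bare moves for comparison."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)   # {brace comments}
    previous = None
    while previous != text:                       # nested (variations)
        previous = text
        text = re.sub(r"\([^()]*\)", " ", text)
    text = re.sub(r"\$\d+", " ", text)            # NAGs such as $1
    text = re.sub(r"\d+\.+", " ", text)           # move numbers: 12. / 12...
    tokens = (tok.rstrip("!?+#") for tok in text.split())
    return tuple(tok for tok in tokens if tok and tok not in RESULTS)


def find_twins(pgn_text):
    """Return lists of game indexes that share the same move sequence."""
    groups = defaultdict(list)
    for index, (_headers, movetext) in enumerate(split_games(pgn_text)):
        groups[move_key(movetext)].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]
```

Since only the moves are compared, short identical games between different players (quick draws, for instance) would be grouped as well; a stricter check would also compare players, date and event.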
I've tried to unzip it on Ubuntu 16.04 without success. I finally uncompressed it on Windows, but it took a huge amount of time. I suggest using the default .zip compression settings for the next version!

May I ask why you strip annotations/comments? They are one of the most useful parts of some .pgn files, IMO.

Another idea would be to split the database into ICS, computer, and OTB games instead of by ECO.

Thx for your hard work!
bstjean
Posts: 19
Joined: Sat Oct 08, 2016 10:10 pm
Location: Montreal
Full name: Benoît St-Jean

Re: The first release of the CGR games database

Post by bstjean »

gbtami wrote: I've tried to unzip it on Ubuntu 16.04 without success. I finally uncompressed it on Windows, but it took a huge amount of time. I suggest using the default .zip compression settings for the next version!

May I ask why you strip annotations/comments? They are one of the most useful parts of some .pgn files, IMO.

Another idea would be to split the database into ICS, computer, and OTB games instead of by ECO.

Thx for your hard work!

1) So far, the time it takes to decompress the file and the compression method/format are THE major complaint! Duly noted, guys!

2) Because they take an awful lot of space. You could easily double the size of the files.

3) I was thinking of providing the WhiteIsComp/BlackIsComp tag (when it is not already there) in a future release so you could filter out games played by computers. As I said, I don't want to impose my choices on people. I'd rather give them everything while making sure they have all they need to tailor their copy of the database to their own needs!
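
As an illustration of how such tags could be used once they are present, here is a minimal Python sketch that keeps only the games where neither side is flagged as a computer. It assumes the tags carry the value "Yes" (as some ICS exports do) and that a missing tag means a human player; both are assumptions, and the helper names are invented for this example.

```python
import re

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def split_raw_games(pgn_text):
    """Yield (headers, raw_game_text) for each game in a PGN string."""
    headers, lines, in_tags = {}, [], False
    for line in pgn_text.splitlines():
        match = TAG_RE.match(line.strip())
        # A tag line that follows movetext starts a new game.
        if match and not in_tags and headers:
            yield headers, "\n".join(lines).strip()
            headers, lines = {}, []
        in_tags = bool(match)
        if match:
            headers[match.group(1)] = match.group(2)
        lines.append(line)
    if headers:
        yield headers, "\n".join(lines).strip()


def is_computer_game(headers):
    """True if either side is flagged as a computer (assumed value: "Yes")."""
    flags = (headers.get("WhiteIsComp", ""), headers.get("BlackIsComp", ""))
    return any(flag.strip().lower() == "yes" for flag in flags)


def human_only(pgn_text):
    """Return PGN text containing only games without a computer flag."""
    kept = [raw for headers, raw in split_raw_games(pgn_text)
            if not is_computer_game(headers)]
    return ("\n\n".join(kept) + "\n") if kept else ""
```

Inverting the test in `human_only` would give the opposite selection, so the same tags serve people who want only engine games.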
gbtami
Posts: 389
Joined: Wed Sep 26, 2012 1:29 pm
Location: Hungary

Re: The first release of the CGR games database

Post by gbtami »

bstjean wrote: 2) Because they take an awful lot of space. You could easily double the size of the files.

3) I was thinking of providing the WhiteIsComp/BlackIsComp tag (when it is not already there) in a future release so you could filter out games played by computers. As I said, I don't want to impose my choices on people. I'd rather give them everything while making sure they have all they need to tailor their copy of the database to their own needs!
2) I disagree here. Very few games in .pgn files have annotations, so the extra space is negligible compared to the full .pgn size.

3) Good idea!
styx
Posts: 338
Joined: Tue Mar 13, 2012 9:59 pm
Location: Germany

Re: The first release of the CGR games database

Post by styx »

gbtami wrote: I've tried to unzip it on Ubuntu 16.04 without success. I finally uncompressed it on Windows, but it took a huge amount of time. I suggest using the default .zip compression settings for the next version!
Ubuntu: make sure you have the p7zip-full package installed, and then

Code:

7z e filename.zip

works.
bstjean
Posts: 19
Joined: Sat Oct 08, 2016 10:10 pm
Location: Montreal
Full name: Benoît St-Jean

Re: The first release of the CGR games database

Post by bstjean »

gbtami wrote:
bstjean wrote: 2) Because they take an awful lot of space. You could easily double the size of the files.

3) I was thinking of providing the WhiteIsComp/BlackIsComp tag (when it is not already there) in a future release so you could filter out games played by computers. As I said, I don't want to impose my choices on people. I'd rather give them everything while making sure they have all they need to tailor their copy of the database to their own needs!
2) I disagree here. Very few games in .pgn files have annotations, so the extra space is negligible compared to the full .pgn size.

3) Good idea!
Anyway, keeping the comments/annotations just creates another problem. Often times, you'll find the same "historical" games annotated multiple times by multiple people. Do I keep Karpov's comments? Or Kortchnoi's annotations? Or Seirawan's comments? I would always find people who'd prefer one analysis over the other... Besides, most annotated games are also annotated on a lot of online sites if one absolutely needs comments. And nowadays, a LOT of people prefer having a chess engine running (and alerting you of blunders, better lines) as they go through the game...
gbtami
Posts: 389
Joined: Wed Sep 26, 2012 1:29 pm
Location: Hungary

Re: The first release of the CGR games database

Post by gbtami »

styx wrote:
gbtami wrote: I've tried to unzip it on Ubuntu 16.04 without success. I finally uncompressed it on Windows, but it took a huge amount of time. I suggest using the default .zip compression settings for the next version!
Ubuntu: make sure you have the p7zip-full package installed, and then

Code:

7z e filename.zip

works.
Good idea, thx!
gbtami
Posts: 389
Joined: Wed Sep 26, 2012 1:29 pm
Location: Hungary

Re: The first release of the CGR games database

Post by gbtami »

bstjean wrote:
gbtami wrote:
bstjean wrote: 2) Because they take an awful lot of space. You could easily double the size of the files.

3) I was thinking of providing the WhiteIsComp/BlackIsComp tag (when it is not already there) in a future release so you could filter out games played by computers. As I said, I don't want to impose my choices on people. I'd rather give them everything while making sure they have all they need to tailor their copy of the database to their own needs!
2) I disagree here. Very few games in .pgn files have annotations, so the extra space is negligible compared to the full .pgn size.

3) Good idea!
Anyway, keeping the comments/annotations just creates another problem. Often times, you'll find the same "historical" games annotated multiple times by multiple people. Do I keep Karpov's comments? Or Kortchnoi's annotations? Or Seirawan's comments? I would always find people who'd prefer one analysis over the other... Besides, most annotated games are also annotated on a lot of online sites if one absolutely needs comments. And nowadays, a LOT of people prefer having a chess engine running (and alerting you of blunders, better lines) as they go through the game...
I don't see "another problem" here. If one game occurs with annotations by both Karpov and Kortchnoi, I want to read both! This is not a problem but a huge value! The ChessBase Mega Database has this feature: "The exclusive annotated database. Contains more than 6.8 millions games from 1560 to 2016 in the highest ChessBase quality standard. 70,000 games contain commentary from top players"
Why do you think it's a problem?

Analysis engines in a GUI will never give you insights like the ones Karpov and other human giants give in their annotations!
bstjean
Posts: 19
Joined: Sat Oct 08, 2016 10:10 pm
Location: Montreal
Full name: Benoît St-Jean

Re: The first release of the CGR games database

Post by bstjean »

gbtami wrote:
bstjean wrote:
gbtami wrote:
bstjean wrote: 2) Because they take an awful lot of space. You could easily double the size of the files.

3) I was thinking of providing the WhiteIsComp/BlackIsComp tag (when it is not already there) in a future release so you could filter out games played by computers. As I said, I don't want to impose my choices on people. I'd rather give them everything while making sure they have all they need to tailor their copy of the database to their own needs!
2) I disagree here. Very few games in .pgn files have annotations, so the extra space is negligible compared to the full .pgn size.

3) Good idea!
Anyway, keeping the comments/annotations just creates another problem. Often times, you'll find the same "historical" games annotated multiple times by multiple people. Do I keep Karpov's comments? Or Kortchnoi's annotations? Or Seirawan's comments? I would always find people who'd prefer one analysis over the other... Besides, most annotated games are also annotated on a lot of online sites if one absolutely needs comments. And nowadays, a LOT of people prefer having a chess engine running (and alerting you of blunders, better lines) as they go through the game...
I don't see "another problem" here. If one game occurs with annotations by both Karpov and Kortchnoi, I want to read both! This is not a problem but a huge value! The ChessBase Mega Database has this feature: "The exclusive annotated database. Contains more than 6.8 millions games from 1560 to 2016 in the highest ChessBase quality standard. 70,000 games contain commentary from top players"
Why do you think it's a problem?

Analysis engines in a GUI will never give you insights like the ones Karpov and other human giants give in their annotations!
Let's put it this way: I started with 206G of games and ended up with only 7G. There are **TONS** of games annotated by chess engines out there. We would end up with **TONS** of crap most people don't want. You'd be amazed how many times I've seen the second game of the Karpov-Korchnoi 1974 match (http://www.chessgames.com/perl/chessgame?gid=1067858) annotated by chess engines or even by some John Doe! Besides, since annotations mostly originate from material that has been published and reproduced, there is a copyright problem with this!