Open Chess Game Database Standard

phhnguyen · Post by **phhnguyen** » Thu Feb 03, 2022 11:02 pm

hgm wrote: ↑Thu Feb 03, 2022 7:15 pm When I was writing assemblers, the method I used for hashing the identifiers in the symbol table was to calculate the hash key for a symbol of N characters as A * character_N + B * Hash_of_first_N-1_characters, for suitable constants A and B. I don't see how the current problem is any different from hashing character strings. The internal representation of the game can always be thought of as a string.

Do you mean to use arrays of hash keys to compare?

I compare the arrays of moves themselves, which are shorter. I can't "compress" an array into a single key for comparison, since that would lose the ordering information.

However, the main problem with using arrays is that, for a huge database, I can't keep them all in memory (that would amount to storing all the moves). That is why I have to develop a method that loads and processes the data with O(n) complexity, to avoid reloading it too many times.
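
A minimal sketch of one way to stay within memory in a single pass, and not necessarily how OCGDB does it: keep only a small fixed-size fingerprint per game instead of its move array, and reload the moves only for games whose fingerprints collide. The function name `candidate_groups`, the `(game_id, moves_text)` record shape, and the choice of `hashlib.blake2b` as the fingerprint are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def candidate_groups(games):
    """Group game IDs by an 8-byte fingerprint of their move text.

    `games` is any iterable of (game_id, moves_text) pairs (hypothetical
    record shape). It is consumed exactly once, so it can be a generator
    that streams rows from disk.
    """
    buckets = defaultdict(list)
    for game_id, moves_text in games:
        # 8 bytes per game stay in memory, whatever the game length.
        fingerprint = hashlib.blake2b(
            moves_text.encode("utf-8"), digest_size=8
        ).digest()
        buckets[fingerprint].append(game_id)
    # Only groups with two or more games need their moves reloaded
    # and compared move by move to rule out fingerprint collisions.
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Because the fingerprint is computed over the whole move text, two games with the same moves in a different order land in different buckets, so the ordering information is not lost.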

hgm · Post by **hgm** » Fri Feb 04, 2022 2:04 pm

No, I mean calculating the hash key in such a way that all moves of the game contribute to it. E.g. as a linear combination of the encoding of all individual moves (whatever that may be). Or a CRC checksum.
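
A minimal sketch of both ideas, assuming moves are given as SAN strings. The constants `A` and `B`, the 64-bit width, and the use of `zlib.crc32` as the per-move encoding are arbitrary choices for illustration, not anything prescribed in this thread.

```python
import zlib

MASK64 = (1 << 64) - 1
A = 0x9E3779B97F4A7C15  # arbitrary odd constant (assumption)
B = 0x100000001B3       # arbitrary odd constant (assumption)


def game_key(moves):
    """Order-sensitive key: key_N = A * code(move_N) + B * key_(N-1), mod 2^64."""
    key = 0
    for move in moves:
        # Stand-in for "the encoding of an individual move".
        code = zlib.crc32(move.encode("ascii"))
        key = (A * code + B * key) & MASK64
    return key


def game_crc(moves):
    """Alternative: a CRC checksum over the whole move text."""
    return zlib.crc32(" ".join(moves).encode("ascii"))
```

Since each step multiplies the previous key by `B`, every move contributes with a weight that depends on its position, so a transposition such as `d4 Nf6 c4` versus `c4 Nf6 d4` yields a different key even though the same moves were played.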

Sopel · Post by **Sopel** » Sat Feb 05, 2022 12:46 pm

Why exactly are "duplicate" games undesirable? I mean ones that were actually played twice, not multiple copies of one game. It's natural that common sequences reoccur and that going around that breaks the data integrity.

Jonathan003 · Post by **Jonathan003** » Sat Feb 05, 2022 11:33 pm

I wonder why SCID and the new HIARCS Chess Explorer Pro don't detect doubles when the player names are different. What if they differ only because the player names are misspelled in the doubled games? Could it be that it's normal to have games in a database with exactly the same moves that are not duplicates? I can imagine that could be the case in short draw games. But not in longer complete games, and I delete these short draw games anyway. There should at least be an option to consider these games duplicates as well.

phhnguyen · Post by **phhnguyen** » Mon Feb 07, 2022 4:44 am

Sopel wrote: ↑Sat Feb 05, 2022 12:46 pm Why exactly are "duplicate" games undesirable? I mean ones that were actually played twice, not multiple copies of one game. It's natural that common sequences reoccur and that going around that breaks the data integrity.

IMHO, there is no exact answer. It all comes down to the tastes, understandings, or views of users, which differ from one to another. That is why the program should have multiple parameters that users can change.

Typically, "good" duplicates happen in short games. I usually set plycount to 20, sometimes 40, to ignore games shorter than that.

phhnguyen · Post by **phhnguyen** » Mon Feb 07, 2022 4:50 am

Jonathan003 wrote: ↑Sat Feb 05, 2022 11:33 pm I wonder why SCID and the new HIARCS Chess Explorer Pro don't detect doubles when the player names are different. What if they differ only because the player names are misspelled in the doubled games?

I am neither an author nor a user of those programs, so I can't answer.

IMO, it is best if users have some options to choose from.

Jonathan003 wrote: ↑Sat Feb 05, 2022 11:33 pm Could it be that it's normal to have games in a database with exactly the same moves that are not duplicates? I can imagine that could be the case in short draw games. But not in longer complete games, and I delete these short draw games anyway. There should at least be an option to consider these games duplicates as well.

Yes, that typically happens for short games.

We already have a parameter named "plycount" that makes the duplicate check ignore short games. In our tests above, it was set to 20, sometimes to 40.
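
To illustrate how such options could interact, here is a minimal sketch and not OCGDB's actual implementation. The function `find_duplicates`, the dict-based game records, and the `match_names` switch are hypothetical; only the idea of a `plycount` threshold comes from the tool.

```python
from collections import defaultdict


def find_duplicates(games, plycount=20, match_names=False):
    """Return groups of game IDs whose move lists are identical.

    `games` is an iterable of dicts with the keys "id", "white", "black"
    and "moves" (a list with one entry per ply); this shape is assumed.
    Games shorter than `plycount` plies are ignored.
    """
    groups = defaultdict(list)
    for game in games:
        if len(game["moves"]) < plycount:
            continue  # short games often repeat legitimately
        key = tuple(game["moves"])
        if match_names:
            # Stricter mode: the player names must match as well.
            key = (game["white"], game["black"], key)
        groups[key].append(game["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

With `match_names=False`, two games with the same moves but differently spelled player names are reported as duplicates; with `match_names=True` they are treated as distinct games, which leaves the decision to the user.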

Modern Times · Post by **Modern Times** » Mon Feb 07, 2022 5:01 am

I'd be interested in running this duplicate detection on the CCRL databases. We do remove duplicates, and I think it is done by custom-written code in the Perl scripts. I'm not sure exactly how it works - for example, I don't know how duplicate games played by different engines are treated. It would be useful to see what your code detects. I know that a while ago someone (Norm Schmidt, I think) said we do have a small number of duplicates, so our code may not be working 100% correctly. But it certainly works very well - just this week Graham submitted his games and he submitted nearly 600 of them twice, which the scripts detected and eliminated.

Jonathan003 · Post by **Jonathan003** » Sat Feb 12, 2022 12:23 am

I have tried OCGDB by merging some pgn databases with millions of games.
I merged this database:
https://lichess-elite-db.s3.eu-west-3.a ... 2021-03.7z

with this:
http://rebel13.nl/dl.html?file=dl/human.7z

I'm sure there are many doubles when these databases are merged, but I couldn't find any with OCGDB.
Maybe I did something wrong?

Importing a pgn into the OCGDB format is fast, and so is searching for doubles in the OCGDB format. But the export back to pgn is slow on my PC.
I am still learning to work with the tool, so maybe I did something wrong.
I want to test the tool with some smaller databases. Can you recommend some pgn databases for testing functions like checking for doubles?

sarona · Post by **sarona** » Sat Feb 12, 2022 5:01 am

I downloaded the files from both your links. Converted the SCID database to a pgn and merged it with human.pgn using ocgdb (Beta 5). After making a backup of the database, I used the command below and found thousands of "doubles."

ocgdb -db (yourdatabasename).ocgdb.db3 -cpu 4 -dup -plycount 40 -o printall; remove

Some of the matches contain the same moves, but are, in fact, different games.

Here is an example:

Jonathan003 · Post by **Jonathan003** » Sun Feb 13, 2022 2:34 pm

sarona wrote: ↑Sat Feb 12, 2022 5:01 am I downloaded the files from both your links. Converted the SCID database to a pgn and merged it with human.pgn using ocgdb (Beta 5). After making a backup of the database, I used the command below and found thousands of "doubles."

ocgdb -db (yourdatabasename).ocgdb.db3 -cpu 4 -dup -plycount 40 -o printall; remove

Some of the matches contain the same moves, but are, in fact, different games.

Here is an example:

Thanks for the explanation!
How do you insert an image here on the forum without using a link to https://nl.imgbb.com/ for example?
I did find the doubles, but because they were not deleted (#removed: 0)
I thought ocgdb hadn't found any doubles.
What do I have to type to remove the found doubles? And to export the result (without the doubles) back to a pgn database?
