SCID5

Discussion of chess software programming and technical issues.

Moderator: Ras

Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

SCID5

Post by Fulvio »

The implementation of the SCID5 codec is finished (now the test and benchmark phase begins).
I have created a wiki where I am documenting the database structure:
https://github.com/benini/scid/wiki/Dat ... dec:-SCID5

Please let me know if you think the new limits are too low, notice that something is missing, or have any suggestions (but please refrain from asking "why don't you use SQL?"... :wink: ).
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: SCID5

Post by dangi12012 »

Congratulations!
More speed is always good.

Now assuming I would like to load a very big pgn file from here:
https://database.lichess.org/

How many games/s can this current implementation insert?
To compare file sizes and performance.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: SCID5

Post by Fulvio »

dangi12012 wrote: Sun Jul 17, 2022 7:08 pm How many games/s can this current implementation insert?
Up to 4 billion games
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: SCID5

Post by dangi12012 »

Fulvio wrote: Sun Jul 17, 2022 9:21 pm
dangi12012 wrote: Sun Jul 17, 2022 7:08 pm How many games/s can this current implementation insert?
Up to 4 billion games
No - not the total. The number of games that can be inserted per second.
So how fast is the path pgn file -> database?
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: SCID5

Post by phhnguyen »

Nice work, congratulations!

I have seen some screenshots of SCID that show players' photos (e.g., https://en.wikipedia.org/wiki/Shane's_C ... 2-Win7.png).

However, I did not see any field for that information (players' photos) in your database structure. Did you drop it?
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: SCID5

Post by Fulvio »

dangi12012 wrote: Sun Jul 17, 2022 11:45 pm The number of games that can be inserted per second.
Let me mention that I have added a new menu item to convert a database (database -> copy all games to -> new).
It should also be noted that the only challenge from a performance point of view is to look for a position in a few milliseconds.

Inserting games is super fast, it is simply a matter of appending the data to the 3 files.
There is an interesting bottleneck however when it comes to lichess files.
There are 3 types of queries that are performed on the Site table:
  • get_name (SiteID)
  • exists (name)
  • match_name_prefix (name_prefix)
(the last query is for autocomplete: when the user enters some characters, suggest sites that are already present in the database)

I implemented it with a simple std::map, but lichess inserts a unique URL for each game.
In databases with millions of games, and therefore millions of unique URLs, checking whether a name already exists in the table becomes relatively slow and, above all, wasteful, because the answer is always no.
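
For readers following along, here is a minimal sketch of how the three queries can sit on top of an ordered std::map. It is not SCID's actual code: the class name SiteTable, the add() helper and the max_results parameter are assumptions made for illustration. The point is that an ordered map answers the prefix query with a single lower_bound followed by a short scan, while every exists() call on a never-seen URL still pays a full O(log n) string-comparing lookup.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical name table, for illustration only (not SCID's real class).
class SiteTable {
public:
    using SiteID = std::uint32_t;

    // Returns the ID of `name`, inserting it first if it is not present.
    SiteID add(std::string_view name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        auto id = static_cast<SiteID>(names_.size());
        ids_.emplace(std::string(name), id);
        names_.emplace_back(name);
        return id;
    }

    // get_name (SiteID): throws std::out_of_range for an unknown ID.
    const std::string& get_name(SiteID id) const { return names_.at(id); }

    // exists (name): O(log n) lookup, each step comparing strings.
    bool exists(std::string_view name) const {
        return ids_.find(name) != ids_.end();
    }

    // match_name_prefix (name_prefix): keys sharing a prefix are contiguous
    // in an ordered map, so start at lower_bound and scan while they match.
    std::vector<SiteID> match_name_prefix(std::string_view prefix,
                                          std::size_t max_results) const {
        std::vector<SiteID> result;
        for (auto it = ids_.lower_bound(prefix);
             it != ids_.end() && result.size() < max_results &&
             it->first.compare(0, prefix.size(), prefix) == 0;
             ++it) {
            result.push_back(it->second);
        }
        return result;
    }

private:
    // std::less<> enables lookups by string_view without a temporary string.
    std::map<std::string, SiteID, std::less<>> ids_;
    std::vector<std::string> names_;  // index == SiteID
};
```

With lichess data every game contributes a new URL, so in this sketch add() always takes the insert path and the find() before it is pure overhead.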

The solution that comes to mind is to use a bloom filter.
Can anyone recommend a good library for that?
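
While waiting for library suggestions, the idea itself fits in a few lines. The following is only a sketch under stated assumptions, not a recommendation of a specific implementation: the class name BloomFilter, the FNV-1a hash and the double-hashing scheme are choices made here for illustration, and the bit count and number of hash functions would need tuning for the expected number of names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <utility>
#include <vector>

// Illustrative Bloom filter: no false negatives, tunable false positives.
class BloomFilter {
public:
    // num_bits must be > 0; num_hashes is the number of probes per key.
    BloomFilter(std::size_t num_bits, unsigned num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {}

    void insert(std::string_view key) {
        auto [h1, h2] = hashes(key);
        for (unsigned i = 0; i < num_hashes_; ++i)
            bits_[(h1 + i * h2) % bits_.size()] = true;
    }

    // false: the key was definitely never inserted.
    // true:  the key may have been inserted (confirm with the real table).
    bool maybe_contains(std::string_view key) const {
        auto [h1, h2] = hashes(key);
        for (unsigned i = 0; i < num_hashes_; ++i)
            if (!bits_[(h1 + i * h2) % bits_.size()]) return false;
        return true;
    }

private:
    // Double hashing: probe i uses h1 + i * h2.
    static std::pair<std::uint64_t, std::uint64_t> hashes(std::string_view key) {
        // h1: 64-bit FNV-1a over the bytes of the key.
        std::uint64_t h1 = 14695981039346656037ull;
        for (unsigned char c : key) {
            h1 ^= c;
            h1 *= 1099511628211ull;
        }
        // h2: h1 passed through the splitmix64 finalizer, forced odd so
        // that the probe stride is never zero.
        std::uint64_t h2 = h1;
        h2 ^= h2 >> 30; h2 *= 0xbf58476d1ce4e5b9ull;
        h2 ^= h2 >> 27; h2 *= 0x94d049bb133111ebull;
        h2 ^= h2 >> 31;
        return {h1, h2 | 1};
    }

    std::vector<bool> bits_;
    unsigned num_hashes_;
};
```

The intended use would be to call maybe_contains() before the std::map lookup: a false answer skips the lookup entirely, which is the common case when every name is a fresh URL, and a true answer falls through to the map as before.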
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: SCID5

Post by Fulvio »

phhnguyen wrote: Mon Jul 18, 2022 1:05 am However, did not see any field about that info (photos of players) in your database structure. Do you cut off them?
Player information is shared across databases.
There is a spelling file (so called because it is mainly used to correct the names of the players) that contains various information such as date of birth, FIDE title, the full Elo progression, etc.
There are also files with photos.
The wiki has not been updated (the new menu item is Options->Resources):
https://sourceforge.net/p/scid/wiki/How ... ersPhotos/
but it should be enough to clarify.

The files can be downloaded here:
https://sourceforge.net/projects/scid/f ... er%20Data/
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: SCID5

Post by phhnguyen »

Looks like I missed some points. Do you mean you still use multiple files instead of a single file for your database? Is it your own binary database structure, or is it built on a general open-source database? (Sorry, I am confused, since I knew you had been trying out some general open-source databases.)
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: SCID5

Post by Fulvio »

phhnguyen wrote: Mon Jul 18, 2022 7:58 am Do you mean you still use multi-files instead of one file for your database? Is it your own binary database
Yes, this is still a dedicated open-source binary database format.
The main objectives are:
1) maximum speed for finding an exact position
2) minimum compressed size
Finding an exact position is the most common and most critical operation. When studying an opening it is necessary to have the results in less than a second.
Having the smallest compressed size makes a difference in terms of bandwidth when downloading.

As for the general databases, RocksDB is great.
It is not really a database, it has no search functions.
But it has bloom filters, LRU caches, lz4/zstd compression and auto compaction built in.
However, when searching for a position, the ability to reorder the games makes the difference. When a SCID database is compacted, it also reorders the games optimally and becomes over 6x faster than rocksdb.
When it comes to compression, which is again fantastic and super fast, there is a catch though. Compressing the entire database into a single file (rocksdb uses an entire directory for its files) no longer reduces its size much. It varies a lot, but the compressed rocksdb database becomes ~20% larger than the same compressed SCID5 database.

I also checked your sqlite database structure (is it complete? Does importing a PGN file with many tag pairs, comments, NAGs and variations, and then exporting it back to PGN, produce a file identical to the original?).
Adding the "TimeControl" tag to the Game record is a very interesting idea. It is fundamental information in my opinion. Unfortunately, I believe it is only present in lichess PGNs.
Adding a "Ply" column to the "Comment" table, on the other hand, is a mistake in my opinion. For example, lichess adds a %clk comment to each move, and you end up storing all those consecutive plies even though it is not necessary.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: SCID5

Post by dangi12012 »

Fulvio wrote: Mon Jul 18, 2022 10:06 am Adding a "Ply" column in the "Comment" table instead is a mistake in my opinion. For example lichess adds a %clk comment for each move and you end up memorizing all those consecutive plies even if it is not necessary.
Lichess also adds a %eval comment. This makes thousands of CPU hours' worth of SF 14/15 evaluations available across millions of games.
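
As a rough sketch of how those annotations could be pulled out of a move comment, assuming the usual lichess export shape such as `[%eval 0.17] [%clk 0:00:30]` (with mates written like `[%eval #-3]`): the struct and function names below are made up for this example, and the patterns cover only these two forms.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical holder for the two lichess annotations discussed here.
struct LichessTags {
    std::optional<std::string> eval;  // e.g. "0.17", "-1.5" or "#-3"
    std::optional<std::string> clk;   // e.g. "0:00:30"
};

// Extracts the first %eval and %clk tokens from one PGN comment body.
// The token text is returned unparsed; converting it to numbers is left out.
LichessTags extract_lichess_tags(const std::string& comment) {
    static const std::regex eval_re(
        R"re(\[%eval\s+(#?-?\d+(?:\.\d+)?)\])re");
    static const std::regex clk_re(
        R"re(\[%clk\s+(\d+:\d{2}:\d{2})\])re");

    LichessTags tags;
    std::smatch m;
    if (std::regex_search(comment, m, eval_re)) tags.eval = m[1].str();
    if (std::regex_search(comment, m, clk_re)) tags.clk = m[1].str();
    return tags;
}
```

Since both tags appear on almost every move, a storage format could keep them in a compact per-game array rather than as one free-text comment row per ply, which is one way to read the objection quoted above.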