SCID4 database

Fulvio · Post by **Fulvio** » Wed Oct 06, 2021 12:14 pm

I intend to write a series of posts in which I describe the format of the SCID4 database.
The goal is to document the format in a simple way and to collect any suggestions on how to improve it.

It should also be noted that the format is still the one designed by Shane Hudson over 20 years ago.
And the main reason why a classic SQL database is not used is to obtain better performance.

The database is divided into 3 files:
.sn4 -> NameBase
.si4 -> Index
.sg4 -> GamesData

The NameBase consists of 4 reference tables which contain the names of the players, events, sites and rounds.
The correspondent in an SQL database would be:

Code: Select all

CREATE TABLE Players (Name varchar (255), NameID int);
CREATE TABLE Events  (Name varchar (255), NameID int);
CREATE TABLE Sites   (Name varchar (255), NameID int);
CREATE TABLE Rounds  (Name varchar (255), NameID int);

It is a fairly common practice when it comes to databases:
https://en.wikipedia.org/wiki/Associative_entity
which allows to save space and speed up searches.

The binary format is described in the source code:
https://sourceforge.net/p/scid/code/ci/ ... d4.cpp#l31
The names are slightly compressed:
https://en.wikipedia.org/wiki/Incremental_encoding
The main drawback is the need to rewrite the entire file when something changes.
From a performance point of view, however, it is not a significant problem.
When there are multiple changes, for example because games are added to the database,
the tables are modified in memory and written to the file only once at the end of the operation.

In the next post I will describe the .si4 Index where the NameIDs of the NameBase are used.

Fulvio · Post by **Fulvio** » Thu Oct 07, 2021 12:03 pm

At the beginning of the .si4 file there is an header of 182 bytes which contains some information about the database:

Code: Select all

description // a string (max 107 chars) describing the database.
flagDesc[6] // short description (8 chars) for CUSTOM flags
autoLoad // game number to autoload: 0 = none, 1 = 1st,> numGames = last
baseType // Type, e.g. tournament, theory, etc.

After the header, the .si4 file (Index) contains a record of 47 bytes for each game.
This record (IndexEntry) contains informations about a game.
Specifically, the value of the most common tags ( http://www.saremba.de/chessgml/standard ... e.htm#c8.1 ):

Code: Select all

Event
Site
Date
Round
White
Black
Result
WhiteElo (* see note)
BlackElo (* see note)
EventDate
ECO
ScidFlags

some information on the moves of the game:

Code: Select all

half moves (ply) count
variations count
comments count
NAGs count

some values calculated to speed up searching:

Code: Select all

StoredLine
HomePawnData
FinalMatSig

and finally:
the reference where to find the rest of the data in the .sg4 file (extra tags, moves, comments)

The value of the Event, Site, Round, White and Black tags are stored in the tables of the NameBase and the corresponding NameID is stored in the IndexEntry.

The binary representation of an IndexEntry is quite complex:
https://sourceforge.net/p/scid/code/ci/ ... 4.cpp#l322
and it creates some limitations that I will describe in the next post.

* NOTE:
A player can have different ratings, the recognized ones are:
WhiteElo
WhiteRating
WhiteRapid
WhiteICCF
WhiteUSCF
WhiteDWZ
WhiteECF
and the one considered "main" and stored in the IndexEntry is the first one, the others are stored in the .sg4 file.

dangi12012 · Post by **dangi12012** » Thu Oct 07, 2021 8:01 pm

This looks a lot like what would normally be inside list of pgn files and you need the 4 relational tables from that.
Why not have a normal relational database in sqlite?

The language bindings are there for C++. C#, Java and every Language you can think of. That way any language could read the project file and interact with it.

As you said 20 years ago - there was little notion of open source and every developer had to come up with something on his own.
IMO sqlite has performance - is tested in enterprise software - and has good example code available?

When size is an issue you cann still store compressed blobs. Either way performance will be very good.

What would be your motivation to improve readability and performance of this project? - when the replacement is tested free open source enterprise grade maintained code with a lot of bindings?

Fulvio · Post by **Fulvio** » Fri Oct 08, 2021 10:24 am

dangi12012 wrote: ↑Thu Oct 07, 2021 8:01 pm What would be your motivation to improve readability and performance of this project? - when the replacement is tested free open source enterprise grade maintained code with a lot of bindings?

I agree on the benefits you described, but sqlite performance is too poor.
The only thing I found with acceptable performance is rocksdb:
https://github.com/facebook/rocksdb.
Anyway, it is possible to implement different types of codecs:
https://sourceforge.net/p/scid/code/ci/ ... rc/codec.h
My current goal is to add a SCID5 codec that offer maximum performance and support chess960, but probably in the future I can also add other codecs for rocksdb or traditional sql databases.
It is quite simple, for example:
https://github.com/benini/scid/commit/c ... 1d68676098
the most complicated thing is to compile the dependencies (and compiling rocksdb is incredibly slow).

dangi12012 · Post by **dangi12012** » Fri Oct 08, 2021 12:36 pm

Poor in what way? Inserts/s or Query.

I am telling you that the queryspeed of sqlite is insane. With inserts you can do bulk inserts.
You have to make sure to also mark the columns as primary keys.

Do you think you need more performance than 30 Million Rows/s?

My suggestion is: stick with sql. Just creating a DB is not enough - also mark the colums as keys. The lookup performance will be O(1).

I know this because I wrote a sql importer for every single lichess game from 2015 - 2020 which are hundreds of millions of games. And billions and billions of unique positions.
Beware - these files have 100s of gigabytes.

I am telling you - if you think sqlite is too slow you are using it wrong.

https://database.lichess.org/

Fulvio · Post by **Fulvio** » Sat Oct 09, 2021 12:55 pm

Anyway, back to the limitations.

1) number of characters for tags.
The tag name must consist of at least one character and at most 240 (range [1: 240] ).
The tag value must have a maximum of 255 characters (range [0: 255] ).

2) number of unique values for some tags.
White and Black tags: max 1048575 unique values
Event and Site tags: max 524287 unique values
Round tag: max 262143 unique values
These limits are caused by the number of bits used for nameIDs.

3) The year of the date stored in the EventDate tag must be a maximum of 3 years before or 3 years after the year of the date stored in the Date tag.

4) Maximum number of games: 16777214

5) Maximum .sg4 file size: 4GB

6) Maximum size of the game's data (extra tags, moves, comments): 128KB

7) Games are not actually deleted. The IndexEntry records must be in order, and to delete a game it would then be necessary to rewrite all subsequent ones as well. Instead, the 'D' flag is set. There is the advantage that the user can eventually restore the games. The downside is that it is necessary to "compact" the database to actually delete the games and recover the space.

Documenting how StoredLine, HowePawnData and FinalMatSig values are used for positional searches would be interesting. However I don't think I will change them in SCID5, so I'll leave it out for the moment.

Next I will describe the structure of the .sg4 file and then post the changes I am considering making and how much extra space they require.

dangi12012 · Post by **dangi12012** » Sat Oct 09, 2021 3:04 pm

I read your 7 points and sqlite literally has none of that. Is ready to use - and has good documentation.

Seems to me like you want to reinvent a database from scratch - you can go to 100s of Gigabytes with sqlite...
After bulk deletion you can just call VACUUM; COMMIT; - done.

My question for you is: Why do you want to modernize an esoteric dataformat - when alternatives are already developed. industry tested and ready to use?

Fulvio · Post by **Fulvio** » Mon Oct 11, 2021 4:21 pm

dangi12012 wrote: ↑Sat Oct 09, 2021 3:04 pm My question for you is: Why do you want to modernize an esoteric dataformat - when alternatives are already developed. industry tested and ready to use?

I think I have already answered exhaustively.
It's not just me, read for example this:
https://triplehappy.wordpress.com/2016/ ... mentation/

And, for reference, this is the performance level of the SCID4 format:

dangi12012 · Post by **dangi12012** » Mon Oct 11, 2021 5:26 pm

Fulvio wrote: ↑Mon Oct 11, 2021 4:21 pm
dangi12012 wrote: ↑Sat Oct 09, 2021 3:04 pm My question for you is: Why do you want to modernize an esoteric dataformat - when alternatives are already developed. industry tested and ready to use?
I think I have already answered exhaustively.
It's not just me, read for example this:
https://triplehappy.wordpress.com/2016/ ... mentation/

And, for reference, this is the performance level of the SCID4 format:

There you have your answer already: "However I absolutely don’t discount the possibility that a more experienced and capable SQL developer could think of dramatic improvements to the model I was using. I concede I don’t know how Chessbase searches work – they have a very practical solution with compact databases, reasonable creation times, reasonable search speed and (I think) a disk based rather than memory based model."

I promise you that with this approach you can list all positions and variations that a position reached in under 10ms:
even if there are 100 Million games.
First use SQLITE.

Second you have a table of all positions - here is the important part: The primary key is the zobrist hash of the position. Then you have gameID etc..
Third you have a table of all games - here you have the primary key gameID
Thats it. Thats the whole list.

Explanation:
When you query a position and want to know all games - you have the hash which is a O(1) lookup and can query how many times a position was reached in O(1):
select count(*) from position p, games g where p.pk = 'Zobristhash' AND p.gameID = g.gameID
group by outomce

You can subsequently query the outome statistics for all ~30 possibilities in O(1) also.
*Do 30x or write a good SQL query here*
select * from position p, games g where p.pk = 'Zobristhash' AND p.gameID = g.gameID
group by outomce

I already did that to generate statistics on all 100Million+ games of lichess... I used SQLite. You need to know what a primary key is - what a normal key is in SQL. Why you dont use strings as a PK - and get the gist of SQL.

The important part is that looking up by zobrist is O(1). No matter how big the db - this will be almost instant. Second - if you are worried about collisions you can add a second zobrist with a different seed for each square. Now collisions are impossible - and you only need to add AND pk2 = Hash.

Its really easy - and an industry standard for much larger and more performance critical lookups than chess stats for a position I see in the youtube video.

Fulvio · Post by **Fulvio** » Mon Oct 11, 2021 9:31 pm

dangi12012 wrote: ↑Mon Oct 11, 2021 5:26 pm Its really easy - and an industry standard for much larger and more performance critical lookups than chess stats for a position I see in the youtube video.

Please post a link to some code that can confirm these claims.

SCID4 database

SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database