SCID4 database

Discussion of chess software programming and technical issues.

Moderator: Ras

Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: SCID4 database

Post by Fulvio »

Fulvio wrote: Sat Oct 09, 2021 12:55 pm Anyway, back to the limitations.

1) number of characters for tags.
The tag name must consist of at least one character and at most 240 (range [1: 240] ).
The tag value must have a maximum of 255 characters (range [0: 255] ).
Fulvio wrote: Sun Oct 17, 2021 1:18 pm 10) The most important limitation in my opinion is that neither the tags nor the comments are compressed in any way. It is now common to have games where the clock time and, where available, the evaluation of each position are stored as comments, so compressing them could save a lot of space.
Although these are not overly important limits, it would still be useful to remove them.
The NameBase has a couple of useful properties:
- the IDs are sequentially increasing values starting with value 0
- the tag values cannot contain the null char ('\0').
The simplest idea is therefore to store only the tag values, sorted by ID, as c-strings (terminated by a null char '\0').
By preceding each section (player names, events, sites, rounds) with an 8-byte value indicating the entire size of that section, it is possible to concatenate them.
And then you can compress everything with zstd.
In summary decoding would become:
- read the .sn4 file in memory
- ZSTD_getFrameContentSize -> total size of all uncompressed sections
- ZSTD_decompress -> unzip all sections
- Convert the first 8-bytes to uint64_t -> size of the next section
- Read the strings and load them into memory (currently a std::vector<std::unique_ptr<const char[]>> is used)
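
The last two steps (after decompression) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual SCID code: the names appendSection and parseSections are hypothetical, the 8-byte size is assumed to be written in host byte order, std::string is used instead of the current unique_ptr storage for brevity, and the zstd step is left out so that the example has no dependencies.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical layout (an assumption, not the real .sn4 format): each section
// is an 8-byte size followed by that many bytes of '\0'-terminated strings,
// stored in ID order, so the ID of a name is simply its position.
using Section = std::vector<std::string>;

// Append one section to the buffer: 8-byte size (host byte order, an
// assumption), then every tag value as a c-string.
void appendSection(std::vector<char>& out, const Section& names) {
    std::uint64_t size = 0;
    for (const auto& s : names) {
        // Tag values cannot contain the null char, as stated in the post.
        assert(s.find('\0') == std::string::npos);
        size += s.size() + 1;
    }
    char header[sizeof size];
    std::memcpy(header, &size, sizeof size);
    out.insert(out.end(), header, header + sizeof size);
    for (const auto& s : names) {
        out.insert(out.end(), s.begin(), s.end());
        out.push_back('\0');
    }
}

// Parse all concatenated sections from an already-decompressed buffer.
std::vector<Section> parseSections(const std::vector<char>& buf) {
    std::vector<Section> sections;
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::uint64_t size = 0;
        if (buf.size() - pos < sizeof size)
            throw std::runtime_error("truncated section header");
        std::memcpy(&size, buf.data() + pos, sizeof size);
        pos += sizeof size;
        if (size > buf.size() - pos)
            throw std::runtime_error("truncated section");
        const std::size_t end = pos + static_cast<std::size_t>(size);

        Section names;
        while (pos < end) {
            // Find the terminator of the next string inside this section.
            const void* nul = std::memchr(buf.data() + pos, '\0', end - pos);
            if (!nul)
                throw std::runtime_error("missing string terminator");
            const char* p = static_cast<const char*>(nul);
            names.emplace_back(buf.data() + pos, p);
            pos = static_cast<std::size_t>(p - buf.data()) + 1;
        }
        sections.push_back(std::move(names));
    }
    return sections;
}
```

Since every section carries its own size, a reader could also first collect the section offsets and then parse the four sections independently, which is what would make parallel loading possible.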

Benefits:
- no limit on length
- probably a smaller size
- the four sections can possibly be loaded in parallel
- an easier format to decode

Disadvantages:
- dependence on the zstd library. Managing dependencies in C++ is always a nightmare. Simply linking dynamically against the system's library could lead to incompatibility problems, for example a database created with a newer version that older versions cannot read.
- reading the file requires twice the RAM.

Another interesting idea, since the NameBase is always written entirely anyway, would be to remove the .sn4 file and insert the Namebase at the end of the .si4 file.
Benefits:
- the database would consist of only 2 files
- there can be no incompatibility problems between the .si4 file and the values it refers to in the .sn4 file

Disadvantages:
- when adding games, the old NameBase is overwritten and only remains in memory. If an error or a power failure occurs, the database is irremediably damaged (with two separate files, only the newly added names are lost).
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: SCID4 database

Post by Sopel »

Fulvio wrote: Tue Oct 26, 2021 11:01 am Although these are not overly important limits, it would still be useful to remove them.
The NameBase has a couple of useful properties:
- the IDs are sequentially increasing values starting with value 0
- the tag values cannot contain the null char ('\0').
The simplest idea is therefore to store only the tag values, sorted by ID, as c-strings (terminated by a null char '\0').
By preceding each section (player names, events, sites, rounds) with an 8-byte value indicating the entire size of that section, it is possible to concatenate them.
And then you can compress everything with zstd.
In summary decoding would become:
- read the .sn4 file in memory
- ZSTD_getFrameContentSize -> total size of all uncompressed sections
- ZSTD_decompress -> unzip all sections
- Convert the first 8-bytes to uint64_t -> size of the next section
- Read the strings and load them into memory (currently a std::vector<std::unique_ptr<const char[]>> is used)
You could compress in, say, 64 KiB chunks (or N games). Then you can keep the data compressed in memory and only decompress the relevant chunk when needed. A dictionary could be used to improve compression and speed up decompression, which would make even smaller chunk sizes possible. ZSTD is very fast.

If you don't want to keep the data in compressed form in memory then I don't see a point in doing any form of compression at all, since you say you need to keep the whole file loaded in memory anyway.
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: SCID4 database

Post by Fulvio »

Sopel wrote: Tue Oct 26, 2021 12:06 pm If you don't want to keep the data in compressed form in memory then I don't see a point in doing any form of compression at all, since you say you need to keep the whole file loaded in memory anyway.
I'm not sure I understand the objection.
It is a bit like an image in a jpg file: even if it is used uncompressed in memory, it is still useful that it takes less space as a file.
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: SCID4 database

Post by Sopel »

Fulvio wrote: Wed Oct 27, 2021 10:21 am
Sopel wrote: Tue Oct 26, 2021 12:06 pm If you don't want to keep the data in compressed form in memory then I don't see a point in doing any form of compression at all, since you say you need to keep the whole file loaded in memory anyway.
I'm not sure I understand the objection.
It is a bit like an image in a jpg file: even if it is used uncompressed in memory, it is still useful that it takes less space as a file.
That's because usually you have a lot of small images. How many databases does one usually have? Either way, feel free to ignore that sentence as it's off-topic.