SCID4 database

Fulvio · Post by **Fulvio** » Tue Oct 26, 2021 11:01 am

Fulvio wrote: ↑Sat Oct 09, 2021 12:55 pm Anyway, back to the limitations.

1) number of characters for tags.
The tag name must consist of at least one character and at most 240 (range [1:240]).
The tag value must have a maximum of 255 characters (range [0:255]).

Fulvio wrote: ↑Sun Oct 17, 2021 1:18 pm 10) The most important limitation in my opinion is that neither the tags nor the comments are compressed in any way. It is now common to have games where the clock and the eventual evaluation of each position are stored as comments, and it would be possible to save a lot of space.

Although these are not overly important limits, it would still be useful to remove them.
The NameBase has a couple of useful properties:
- the IDs are sequentially increasing values starting with value 0
- the tag values cannot contain the null char ('\0').
The simplest idea is therefore to store only the tag values, sorted by ID, as C strings (terminated by a null char '\0').
By preceding each section (player names, events, sites, rounds) with 8 bytes indicating the entire size of that section, it is possible to concatenate them.
And then you can compress everything with zstd.
In summary decoding would become:
- read the .sn4 file in memory
- ZSTD_getFrameContentSize -> total size of all uncompressed sections
- ZSTD_decompress -> unzip all sections
- Convert the first 8 bytes to uint64_t -> size of the next section
- Read the strings and load them into memory (currently a std::vector<std::unique_ptr<const char[]>> is used)
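
The layout described above can be sketched in C++17 as follows. This is only an illustration, not the actual SCID code: it assumes the zstd frame has already been decompressed into `buf`, that the 8-byte size is stored little-endian, and that it counts only the payload after the prefix. The names `NameList`, `readSize`, `appendSection` and `readSection` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias mirroring the container mentioned in the post.
using NameList = std::vector<std::unique_ptr<const char[]>>;

// Assumption: the size prefix is little-endian and excludes its own 8 bytes.
inline std::uint64_t readSize(const unsigned char* p) {
    std::uint64_t v = 0;
    for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
    return v;
}

// Writer side: size prefix, then every value as a null-terminated string.
// The position of a value in `names` is its ID.
inline void appendSection(std::vector<unsigned char>& out,
                          const std::vector<std::string>& names) {
    std::uint64_t size = 0;
    for (const auto& n : names) {
        assert(n.find('\0') == std::string::npos);  // values never contain '\0'
        size += n.size() + 1;
    }
    for (int i = 0; i < 8; ++i)
        out.push_back(static_cast<unsigned char>(size >> (8 * i)));
    for (const auto& n : names) {
        out.insert(out.end(), n.begin(), n.end());
        out.push_back('\0');
    }
}

// Reader side: parses the section starting at `pos` in the decompressed
// buffer and advances `pos` to the start of the next section.
inline NameList readSection(const std::vector<unsigned char>& buf,
                            std::size_t& pos) {
    if (pos > buf.size() || buf.size() - pos < 8)
        throw std::runtime_error("truncated section header");
    const std::uint64_t size = readSize(buf.data() + pos);
    pos += 8;
    if (size > buf.size() - pos)
        throw std::runtime_error("truncated section");
    const std::size_t end = pos + static_cast<std::size_t>(size);

    NameList names;
    while (pos < end) {
        const unsigned char* start = buf.data() + pos;
        const void* nul = std::memchr(start, '\0', end - pos);
        if (!nul) throw std::runtime_error("unterminated string");
        const std::size_t len = static_cast<std::size_t>(
            static_cast<const unsigned char*>(nul) - start);
        auto copy = std::make_unique<char[]>(len + 1);
        std::memcpy(copy.get(), start, len + 1);  // includes the terminator
        names.push_back(std::move(copy));
        pos += len + 1;
    }
    return names;
}
```

Calling `readSection` four times in a row on the same buffer would yield the four sections in order. Because each prefix gives the start of the next section, the offsets could also be collected first and the sections parsed independently.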

Benefits:
- no limit on length
- probably a smaller size
- the four sections can possibly be loaded in parallel
- an easier format to decode

Disadvantages:
- dependence on the zstd library. Managing dependencies in C++ is always a nightmare. Simply linking dynamically to the system's library could lead to incompatibility problems: for example, a database created with a newer version might not be readable by older versions.
- reading the file requires double the RAM.

Another interesting idea, since the NameBase is always rewritten in full anyway, would be to remove the .sn4 file and append the NameBase to the end of the .si4 file.
Benefits:
- the database would consist of only 2 files
- there can be no incompatibility problems between the .si4 file and the values it refers to in the .sn4 file

Disadvantages:
- when adding games, the old NameBase is overwritten and survives only in memory. If an error or a power failure occurs, the database is irremediably damaged (with two separate files, only the newly added names are lost).

Sopel · Post by **Sopel** » Tue Oct 26, 2021 12:06 pm

Fulvio wrote: ↑Tue Oct 26, 2021 11:01 am Although these are not overly important limits, it would still be useful to remove them.
The NameBase has a couple of useful properties:
- the IDs are sequentially increasing values starting with value 0
- the tag values cannot contain the null char ('\0').
The simplest idea is therefore to store only the tag values, sorted by ID, as C strings (terminated by a null char '\0').
By preceding each section (player names, events, sites, rounds) with 8 bytes indicating the entire size of that section, it is possible to concatenate them.
And then you can compress everything with zstd.
In summary decoding would become:
- read the .sn4 file in memory
- ZSTD_getFrameContentSize -> total size of all uncompressed sections
- ZSTD_decompress -> unzip all sections
- Convert the first 8 bytes to uint64_t -> size of the next section
- Read the strings and load them into memory (currently a std::vector<std::unique_ptr<const char[]>> is used)

You could compress in, say, 64 KiB chunks (or N games). Then you can keep the data compressed in memory and only decompress the relevant chunk when needed. A dictionary could be used to improve compression and speed up decompression, with the possibility of even smaller chunk sizes. ZSTD is very fast.
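
A rough sketch of that lookup pattern, under several assumptions: chunks hold a fixed number of names rather than a fixed byte size, only the most recently used chunk is kept decompressed, and the decompressor is passed in as a callback so the example compiles without zstd (a real version would wrap a zstd call there). `ChunkedNames`, `packChunk`, `Bytes` and `Decompressor` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<char>;
// Stand-in for a real decompressor, e.g. a wrapper around zstd.
using Decompressor = std::function<Bytes(const Bytes&)>;

// Packs names as consecutive null-terminated strings (uncompressed form).
inline Bytes packChunk(const std::vector<std::string>& names) {
    Bytes out;
    for (const auto& n : names) {
        out.insert(out.end(), n.begin(), n.end());
        out.push_back('\0');
    }
    return out;
}

class ChunkedNames {
public:
    ChunkedNames(std::vector<Bytes> chunks, std::size_t namesPerChunk,
                 Decompressor decompress)
        : chunks_(std::move(chunks)),
          perChunk_(namesPerChunk),
          decompress_(std::move(decompress)) {
        assert(perChunk_ > 0);
    }

    // Returns the name with the given ID, decompressing only its chunk.
    std::string get(std::size_t id) {
        const std::size_t chunk = id / perChunk_;
        if (chunk >= chunks_.size()) throw std::out_of_range("unknown name ID");
        if (!cacheValid_ || cached_ != chunk) {
            cache_ = decompress_(chunks_[chunk]);
            cached_ = chunk;
            cacheValid_ = true;
            ++decompressions_;
        }
        // Walk past (id % perChunk_) strings, then return the next one.
        std::size_t pos = 0;
        for (std::size_t skip = id % perChunk_;; --skip) {
            const void* nul =
                pos < cache_.size()
                    ? std::memchr(cache_.data() + pos, '\0', cache_.size() - pos)
                    : nullptr;
            if (!nul) throw std::out_of_range("unknown name ID");
            const std::size_t len = static_cast<std::size_t>(
                static_cast<const char*>(nul) - (cache_.data() + pos));
            if (skip == 0) return std::string(cache_.data() + pos, len);
            pos += len + 1;
        }
    }

    // How many times a chunk had to be decompressed so far.
    std::size_t decompressions() const { return decompressions_; }

private:
    std::vector<Bytes> chunks_;  // kept compressed in memory
    std::size_t perChunk_;
    Decompressor decompress_;
    Bytes cache_;                // the single decompressed chunk
    std::size_t cached_ = 0;
    bool cacheValid_ = false;
    std::size_t decompressions_ = 0;
};
```

With an identity callback such as `[](const Bytes& b) { return b; }` the class can be exercised without any compression library. Consecutive lookups that hit the same chunk reuse the cached copy.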

If you don't want to keep the data in compressed form in memory then I don't see a point in doing any form of compression at all, since you say you need to keep the whole file loaded in memory anyway.

Fulvio · Post by **Fulvio** » Wed Oct 27, 2021 10:21 am

Sopel wrote: ↑Tue Oct 26, 2021 12:06 pm If you don't want to keep the data in compressed form in memory then I don't see a point in doing any form of compression at all, since you say you need to keep the whole file loaded in memory anyway.

I'm not sure I understand the objection.
It is a bit like an image in a jpg file: even if it is used uncompressed in memory, it is still useful that it takes less space as a file.

Sopel · Post by **Sopel** » Wed Oct 27, 2021 1:26 pm

Fulvio wrote: ↑Wed Oct 27, 2021 10:21 am
Sopel wrote: ↑Tue Oct 26, 2021 12:06 pm If you don't want to keep the data in compressed form in memory then I don't see a point in doing any form of compression at all, since you say you need to keep the whole file loaded in memory anyway.
I'm not sure I understand the objection.
It is a bit like an image in a jpg file: even if it is used uncompressed in memory, it is still useful that it takes less space as a file.

That's because usually you have a lot of small images. How many databases does one usually have? Either way, feel free to ignore that sentence as it's off-topic.
