Open Chess Game Database Standard

Discussion of chess software programming and technical issues.

Moderator: Ras

Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

Fulvio wrote: Mon Nov 15, 2021 6:55 pm
Sopel wrote: Mon Nov 15, 2021 2:47 pm I highly doubt this is the case. You have a pipeline with 4 stages. 1. Reading the file. 2. Parsing the file. 3. Creating the import statements. 4. Importing into DB
In my experience (SCID reads PGN files in 128KB chunks, automatically doubling the buffer up to 128MB if it encounters larger games), operating systems are pretty good at optimizing point 1. Moving it to a separate thread may increase complexity without improving performance.
My experience (win 7) is that reading files in text mode is relatively slow even if the file is cached, because Windows handles a lot of special characters and there are copies being made. I don't have the data right now, but I have definitely seen file reads take on the order of 10% of runtime when processing PGNs at ~100MB/s (which means about +10% worst-case speed improvement from offloading that part to a different thread). Admittedly, using binary mode does close that gap significantly, so if someone is designing a parser I'd suggest going this way and handling the special characters in the parser instead. (mmap only operates in binary mode.)

Overall I don't like to rely on the OS caching the file, as I have less feedback about whether the data is ready or not. When doing asynchronous IO explicitly you get a better understanding of the actual bottlenecks. I'd go as far as disabling caching (so opening with FILE_FLAG_NO_BUFFERING on Windows) for large files that are to be read sequentially (and relying on explicit buffers with async reads), so as not to fill the memory available for file caches with useless data. This makes mmap undesirable, as it forces caching.
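For example, something like this - a minimal sketch of a plain binary-mode chunked read in standard C++ (no async IO or FILE_FLAG_NO_BUFFERING here; the function name is made up), leaving the special characters to the parser:

Code: Select all

#include <cstddef>
#include <fstream>
#include <vector>

// Read a file in large binary chunks and hand each chunk to a callback.
// Binary mode avoids the newline/character translation done by text-mode reads.
template <typename Callback>
void readInChunks(const char* path, std::size_t chunkSize, Callback onChunk)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(chunkSize);
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got > 0)
            onChunk(buffer.data(), static_cast<std::size_t>(got));
    }
}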
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

dangi12012 wrote: Mon Nov 15, 2021 3:12 pm
phhnguyen wrote: Mon Nov 15, 2021 12:31 pm I don't have problems with having some ideas and/or implementing them in general. However, sometimes I struggle with how to apply them efficiently. In this case, it is multi-threading. Both reading the input (the PGN file) and writing to the database (inserting records) are almost entirely sequential. Thus, logically, multi-threading won't help much, especially when the other steps (such as parsing PGN tags) are very fast too (not much work to share between threads).
As usual very bad trolling advice from sopel:
Sopel wrote: Mon Nov 15, 2021 2:47 pm I highly doubt this is the case. You have a pipeline with 4 stages. 1. Reading the file. 2. Parsing the file. 3. Creating the import statements. 4. Importing into DB
Here's the answer:
Of course you can parse a file multi-threaded and very fast - and here is how:
You have to index the offsets and lengths of all games in a PGN without parsing. Just the raw offsets, via JSON or a plain text seeker. This can be insanely fast because you only search for {} tokens or double newlines.
For the Lichess DB I did this already, and there I just search for doubled newlines with C++ memchr() on a memory-mapped file.

Once you have this vector of offsets and lengths, you can easily spawn 32 threads and parse each offset/length range separately via the mapped file.

The trick is that seeking for simple tokens in phase 0 is much faster than fully parsing PGN - so you just remember the pointer offset where each game starts, and on a second pass you parse in parallel (a sketch of this first pass follows below). If your game is a class you can even generate functions to prepare an SQL statement for the insert. This can also be done in parallel.

DB inserts cannot be done in parallel with SQLite (with proper server SQL DBs it is faster) - but SQLite is thread-safe anyway, so no worries: just commit transactions from multiple threads. It should also be a little faster because inserts don't stall.
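A rough sketch of that first indexing pass, assuming the whole PGN is mapped or loaded into one contiguous buffer with '\n' line endings (the struct and function names are made up):

Code: Select all

#include <cstddef>
#include <cstring>
#include <vector>

struct GameSpan { std::size_t offset; std::size_t length; };

// Pass 0: find game boundaries by scanning for blank lines with memchr,
// without parsing any PGN. A blank line also separates the tag section from
// the move section, so a boundary only starts a new game if the text after it
// begins with '[' (the first tag pair of the next game).
std::vector<GameSpan> indexGames(const char* data, std::size_t size)
{
    std::vector<GameSpan> spans;
    std::size_t gameStart = 0;
    const char* p = data;
    const char* end = data + size;
    while ((p = static_cast<const char*>(
                std::memchr(p, '\n', static_cast<std::size_t>(end - p)))) != nullptr) {
        ++p;                                  // move past the newline
        if (p < end && *p == '\n') {          // blank line found
            const char* next = p + 1;
            while (next < end && (*next == '\n' || *next == '\r'))
                ++next;
            if (next < end && *next == '[') { // next game starts here
                const auto boundary = static_cast<std::size_t>(next - data);
                spans.push_back({gameStart, boundary - gameStart});
                gameStart = boundary;
            }
        }
    }
    if (gameStart < size)
        spans.push_back({gameStart, size - gameStart});
    return spans;
}
Each GameSpan can then be handed to a worker thread that runs a full PGN parser on just that slice of the buffer.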

Thanks for the suggestion.

However, it won't help since I use only one pass to parse all data.

Basically, all the main functions in my code require only one pass of processing: read one, parse one, write one. Thus effectively applying multi-threading, mmap... is really a challenge.

Below is my code to parse a block of data:

Code: Select all

void Builder::processDataBlock(char* buffer, long sz, bool connectBlock)
{
    assert(buffer && sz > 0);
    
    std::unordered_map<std::string, std::string> tagMap;
    
    auto st = 0, eventCnt = 0;
    auto hasEvent = false;
    char *tagName = nullptr, *tagContent = nullptr, *event = nullptr, *moves = nullptr;

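    // State machine: st == 0 looks for the start of a tag ('[') or of the move
    // text; 1 reads the tag name; 2 skips to the opening quote of the value;
    // 3 reads the tag value; the default state skips the rest of the tag line.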
    for(char *p = buffer, *end = buffer + sz; p < end; p++) {
        char ch = *p;
        
        switch (st) {
            case 0:
            {
                if (ch == '[') {
                    p++;
                    if (!isalpha(*p)) {
                        continue;
                    }
                    
                    // has a tag
                    if (moves) {
                        if (hasEvent && p - buffer > 2) {
                            *(p - 2) = 0;

                            auto moveText = std::string(moves);
                            if (!addGame(tagMap, moveText)) {
                                errCnt++;
                            }
                        }

                        tagMap.clear();
                        hasEvent = false;
                        moves = nullptr;
                    }

                    tagName = p;
                    st = 1;
                } else if (ch > ' ') {
                    if (!moves && hasEvent) {
                        moves = p;
                    }
                }
                break;
            }
            case 1: // name tag
            {
                if (!isalpha(ch)) {
                    if (ch <= ' ') {
                        *p = 0; // end of the tag name
                        st = 2;
                    } else { // something wrong
                        st = 0;
                    }
                }
                break;
            }
            case 2: // between name and content of a tag
            {
                if (ch == '"') {
                    st = 3;
                    tagContent = p + 1;
                }
                break;
            }
            case 3:
            {
                if (ch == '"' || ch == 0) { // == 0 trick to process half begin+end
                    *p = 0;
                    
                    std::string name = tagName;
                    if (name == "Event") {
                        event = tagName - 1;
                        if (eventCnt == 0 && connectBlock) {
                            long len =  (event - buffer) - 1;
                            processHalfEnd(buffer, len);
                        }
                        hasEvent = true;
                        eventCnt++;
                        gameCnt++;
                    }

                    if (hasEvent) {
                        tagMap[name] = tagContent;
                    }

                    tagName = tagContent = nullptr;
                    st = 4;
                }
                break;
            }
            default: // the rest of the tag
            {
                if (ch == '\n' || ch == 0) {
                    st = 0;
                }
                break;
            }
        }
    }
    
    if (connectBlock) {
        processHalfBegin(event, (long)sz - (event - buffer));
    } else if (moves) {
        auto moveText = std::string(moves);
        if (!addGame(tagMap, moveText)) {
            errCnt++;
        }
    }
}
https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia - the chess tournament manager
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Open Chess Game Database Standard

Post by dangi12012 »

phhnguyen wrote: Mon Nov 15, 2021 11:26 pm
Well then I have another option for you:
Remove all the inner stuff from your parser - and copy all that inner code into a function.
Then push a consumer function onto a thread pool, together with a buffer that you fill inside your loop.
https://github.com/vit-vit/CTPL

I recommend a thread-safe blocking queue, so that your total memory consumption is well defined.

This is called the producer-consumer pattern. You still have one pass - but the actual parsing happens on multiple threads.
But you are playing the optimisation game - and it's a dangerous game, because you can waste a lot of time on a speedup that ultimately may not even matter.
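For what it's worth, a minimal sketch of that producer-consumer split with a bounded, mutex-protected queue and plain std::thread instead of CTPL (the names and the parseAndStore() hook are made up):

Code: Select all

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// A bounded, thread-safe queue: the producer blocks when it is full,
// so the total memory consumption stays well defined.
class BlockingQueue {
public:
    explicit BlockingQueue(std::size_t maxSize) : maxSize_(maxSize) {}

    void push(std::string item) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [&] { return q_.size() < maxSize_; });
        q_.push(std::move(item));
        notEmpty_.notify_one();
    }

    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return !q_.empty() || closed_; });
        if (q_.empty())
            return false;                 // closed and fully drained
        out = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return true;
    }

    void close() {                        // called by the producer when done
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        notEmpty_.notify_all();
    }

private:
    std::queue<std::string> q_;
    std::size_t maxSize_;
    bool closed_ = false;
    std::mutex m_;
    std::condition_variable notEmpty_, notFull_;
};

// Usage sketch: one reader thread produces raw game texts (still a single
// pass over the file); a few worker threads consume and parse in parallel.
void convert()
{
    BlockingQueue queue(1024);
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&queue] {
            std::string gameText;
            while (queue.pop(gameText)) {
                // parseAndStore(gameText);   // hypothetical per-game parser
            }
        });
    }
    // ... read the PGN here and queue.push(oneGameText) for each game ...
    queue.close();
    for (auto& w : workers)
        w.join();
}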
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

dangi12012 wrote: Mon Nov 15, 2021 11:58 pm
phhnguyen wrote: Mon Nov 15, 2021 11:26 pm
Well then I have another option for you:
Remove all the inner stuff from your parser - and copy all that inner code into a function.
Then push a consumer function onto a thread pool, together with a buffer that you fill inside your loop.
https://github.com/vit-vit/CTPL

I recommend a thread-safe blocking queue, so that your total memory consumption is well defined.

This is called the producer-consumer pattern. You still have one pass - but the actual parsing happens on multiple threads.
But you are playing the optimisation game - and it's a dangerous game, because you can waste a lot of time on a speedup that ultimately may not even matter.
Hmm, I am still missing the point: how do you divide the job between threads? I don't believe in magic from just using some libraries. Note that we always have two bottlenecks: reading from the hard disk and writing to the database. There is not much room between them for multi-threading.

BTW, I am going to push the latest code to GitHub today. If you could, please help by making some pull requests. We can all check and compare the performance.
https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia - the chess tournament manager
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Open Chess Game Database Standard

Post by dangi12012 »

phhnguyen wrote: Tue Nov 16, 2021 12:15 am
dangi12012 wrote: Mon Nov 15, 2021 11:58 pm
phhnguyen wrote: Mon Nov 15, 2021 11:26 pm
Well then I have another option for you:
Remove all the inner stuff from your parser - and copy all that inner code into a function.
Then push a consumer function onto a thread pool, together with a buffer that you fill inside your loop.
https://github.com/vit-vit/CTPL

I recommend a thread-safe blocking queue, so that your total memory consumption is well defined.

This is called the producer-consumer pattern. You still have one pass - but the actual parsing happens on multiple threads.
But you are playing the optimisation game - and it's a dangerous game, because you can waste a lot of time on a speedup that ultimately may not even matter.
Hmm, I am still missing the point: how do you divide the job between threads? I don't believe in magic from just using some libraries. Note that we always have two bottlenecks: reading from the hard disk and writing to the database. There is not much room between them for multi-threading.

BTW, I am going to push the latest code to GitHub today. If you could, please help by making some pull requests. We can all check and compare the performance.
A Git PR is fine for me! The point is, you are right: only that stage can be done faster... and if the performance is "good enough", no optimisations matter anymore.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

OK, I have pushed the new code to GitHub. It has been cleaned up and is thus much easier to read. The main code is in the file builder.cpp.

The conversion stats are below (tested with the MillionBase database of 3.45 million games):

1) Convert into an SQLite database file (.db3, stored in the hard disk), 45 seconds:

Code: Select all

#games: 3457050, #errors: 372, elapsed: 45922 ms, 00:45, speed: 75280 games/s
2) Convert into a memory database (:memory:, stored in the RAM), 29 seconds:

Code: Select all

#games: 3457050, #errors: 372, elapsed: 29159 ms, 00:29, speed: 118558 games/s
The converter needs well under 1 minute to convert the MillionBase. All of this used only a single thread.

Hope someone can improve its speed :D
https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia - the chess tournament manager
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Open Chess Game Database Standard

Post by dangi12012 »

phhnguyen wrote: Tue Nov 16, 2021 2:13 am OK, I have pushed the new code to GitHub. It has been cleaned up and is thus much easier to read. The main code is in the file builder.cpp.

The conversion stats are below (tested with the MillionBase database of 3.45 million games):

1) Convert into an SQLite database file (.db3, stored in the hard disk), 45 seconds:

Code: Select all

#games: 3457050, #errors: 372, elapsed: 45922 ms, 00:45, speed: 75280 games/s
2) Convert into a memory database (:memory:, stored in the RAM), 29 seconds:

Code: Select all

#games: 3457050, #errors: 372, elapsed: 29159 ms, 00:29, speed: 118558 games/s
The converter needs well under 1 minute to convert the MillionBase. All of this used only a single thread.

Hope someone can improve its speed :D
Wow, very impressive - I am glad to take a look!
- also, you have 100% disproven the forum sceptics who are against anything new.

Does this now also support extracting the comments from the moves?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

dangi12012 wrote: Tue Nov 16, 2021 12:51 pm
phhnguyen wrote: Tue Nov 16, 2021 2:13 am OK, I have pushed the new code to GitHub. It has been cleaned up and is thus much easier to read. The main code is in the file builder.cpp.

The conversion stats are below (tested with the MillionBase database of 3.45 million games):

1) Convert into an SQLite database file (.db3, stored in the hard disk), 45 seconds:

Code: Select all

#games: 3457050, #errors: 372, elapsed: 45922 ms, 00:45, speed: 75280 games/s
2) Convert into a memory database (:memory:, stored in the RAM), 29 seconds:

Code: Select all

#games: 3457050, #errors: 372, elapsed: 29159 ms, 00:29, speed: 118558 games/s
The converter needs well under 1 minute to convert the MillionBase. All of this used only a single thread.

Hope someone can improve its speed :D
Wow, very impressive - I am glad to take a look!
- also, you have 100% disproven the forum sceptics who are against anything new.

Does this now also support extracting the comments from the moves?
Still waiting for it to match even a fraction of SCID's capabilities.
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Open Chess Game Database Standard

Post by dangi12012 »

Sopel wrote: Tue Nov 16, 2021 2:01 pm Still waiting for it to match even a fraction of SCID's capabilities.
You even participated in the SCID discussion, which proves you are a 100% confirmed forum troll who only wants to derail any discussion thread you find.
http://www.talkchess.com/forum3/viewtop ... ilit=scid4
Fulvio wrote:1) number of characters for tags.
The tag name must consist of at least one character and at most 240 (range [1: 240] ).
The tag value must have a maximum of 255 characters (range [0: 255] ).

2) number of unique values for some tags.
White and Black tags: max 1048575 unique values
Event and Site tags: max 524287 unique values
Round tag: max 262143 unique values
These limits are caused by the number of bits used for nameIDs.

3) The year of the date stored in the EventDate tag must be a maximum of 3 years before or 3 years after the year of the date stored in the Date tag.

4) Maximum number of games: 16777214

5) Maximum .sg4 file size: 4GB

6) Maximum size of the game's data (extra tags, moves, comments): 128KB

7) Games are not actually deleted. The IndexEntry records must be in order, and to delete a game it would then be necessary to rewrite all subsequent ones as well. Instead, the 'D' flag is set. There is the advantage that the user can eventually restore the games. The downside is that it is necessary to "compact" the database to actually delete the games and recover the space.
You read this. You know this. And yet you still write "waiting for it to match even a fraction of SCID's capabilities" while SQLite has no 4GB limit, no 16777214-game limit, nor any of the other limits. Troll somewhere else, please.

SQLite is a very good format for chess game lookups. You can insert all years' games in 45s and query any of them in O(1). And there is no "max 1048575 unique values" and no 4GB limit.
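For illustration, this is roughly what the fast bulk insert looks like with the SQLite C API - one prepared statement and one enclosing transaction. The table and column names here are invented for the example, not the project's actual schema:

Code: Select all

#include <sqlite3.h>
#include <string>
#include <vector>

struct GameRow { std::string white, black, moves; };   // illustrative only

// Bulk-insert games with a prepared statement inside a single transaction;
// batching the inserts like this is what makes SQLite ingestion fast.
bool insertGames(sqlite3* db, const std::vector<GameRow>& games)
{
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS games ("
        " id INTEGER PRIMARY KEY, white TEXT, black TEXT, moves TEXT);"
        "CREATE INDEX IF NOT EXISTS idx_games_white ON games(white);",
        nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db,
            "INSERT INTO games (white, black, moves) VALUES (?, ?, ?);",
            -1, &stmt, nullptr) != SQLITE_OK)
        return false;

    sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, nullptr);
    for (const auto& g : games) {
        sqlite3_bind_text(stmt, 1, g.white.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, g.black.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 3, g.moves.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
    sqlite3_finalize(stmt);
    return true;
}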
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

dangi12012 wrote: Tue Nov 16, 2021 3:35 pm
Sopel wrote: Tue Nov 16, 2021 2:01 pm Still waiting for it to match even a fraction of SCID's capabilities.
You even participated in the SCID discussion, which proves you are a 100% confirmed forum troll who only wants to derail any discussion thread you find.
http://www.talkchess.com/forum3/viewtop ... ilit=scid4
Fulvio wrote:1) number of characters for tags.
The tag name must consist of at least one character and at most 240 (range [1: 240] ).
The tag value must have a maximum of 255 characters (range [0: 255] ).

2) number of unique values for some tags.
White and Black tags: max 1048575 unique values
Event and Site tags: max 524287 unique values
Round tag: max 262143 unique values
These limits are caused by the number of bits used for nameIDs.

3) The year of the date stored in the EventDate tag must be a maximum of 3 years before or 3 years after the year of the date stored in the Date tag.

4) Maximum number of games: 16777214

5) Maximum .sg4 file size: 4GB

6) Maximum size of the game's data (extra tags, moves, comments): 128KB

7) Games are not actually deleted. The IndexEntry records must be in order, and to delete a game it would then be necessary to rewrite all subsequent ones as well. Instead, the 'D' flag is set. There is the advantage that the user can eventually restore the games. The downside is that it is necessary to "compact" the database to actually delete the games and recover the space.
You read this. You know this. And yet you still write "waiting for it to match even a fraction of SCID's capabilities" while SQLite has no 4GB limit, no 16777214-game limit, nor any of the other limits. Troll somewhere else, please.

SQLite is a very good format for chess game lookups. You can insert all years' games in 45s and query any of them in O(1). And there is no "max 1048575 unique values" and no 4GB limit.
You have participated in the SCID discussion and you're still not aware that SCID can do position queries?
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.