That's going in a completely wrong direction. You want to AVOID file caching for large files being read sequentially (as PGN files usually are). Otherwise you'll pollute the file cache with unused data. Mmap forces caching, so it's the worst thing you can do.
dangi12012 wrote: ↑Thu Nov 25, 2021 1:48 pm
Use memory mapped IO. All seeking will go away and you can use extremely optimized code like memchr to read line by line!
Dann Corbit wrote: ↑Thu Nov 25, 2021 11:20 am
I had to change the code for Windows. Here is the link:
fseek()/ftell() do not work for files bigger than 2 GB (their offsets are only 32 bits).
I get 140,000 games per second but I have a fast disk.
There is a visual studio project and binary in the archive, but it uses /arch:AVX2 on the command line.
If you do not have an advanced CPU, you will want to change the command line option.
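For reference, a minimal sketch of the usual 64-bit replacements for fseek()/ftell() (assuming MSVC's _fseeki64/_ftelli64 and POSIX's fseeko/ftello; the file name is only a placeholder):

Code:

#include <cstdio>

// fseek()/ftell() take a long, which is 32 bits on Windows (and on
// 32-bit Linux unless _FILE_OFFSET_BITS=64), so they fail past 2 GB.
#ifdef _MSC_VER
  #define fseek64 _fseeki64   // MSVC extensions with __int64 offsets
  #define ftell64 _ftelli64
#else
  #define fseek64 fseeko      // POSIX; build with -D_FILE_OFFSET_BITS=64
  #define ftell64 ftello
#endif

int main()
{
    FILE* f = std::fopen("big.pgn", "rb");   // placeholder file name
    if (!f) return 1;
    fseek64(f, 0, SEEK_END);
    long long size = ftell64(f);             // valid past the 2 GB mark
    std::printf("file size: %lld bytes\n", size);
    std::fclose(f);
    return 0;
}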
-
- Posts: 391
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Open Chess Game Database Standard
dangi12012 wrote:
No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 1524
- Joined: Wed Apr 21, 2010 4:58 am
- Location: Australia
- Full name: Nguyen Hong Pham
Re: Open Chess Game Database Standard
Thanks, Dann, the code will be updated!
Dann Corbit wrote: ↑Thu Nov 25, 2021 11:20 am
I had to change the code for Windows. Here is the link:
fseek()/ftell() do not work for files bigger than 2 GB (their offsets are only 32 bits).
I get 140,000 games per second but I have a fast disk.
There is a visual studio project and binary in the archive, but it uses /arch:AVX2 on the command line.
If you do not have an advanced CPU, you will want to change the command line option.
https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia chess tournament manager
-
- Posts: 1524
- Joined: Wed Apr 21, 2010 4:58 am
- Location: Australia
- Full name: Nguyen Hong Pham
Re: Open Chess Game Database Standard
I have mentioned it once. The problem is that the memory-mapped file is only the second-best method, with a large gap behind the current one (reading by blocks; a minimal sketch follows the quote below). I don't see its chances, since we read all the data once, sequentially. Unless you can make it better.
dangi12012 wrote: ↑Thu Nov 25, 2021 1:48 pm
Use memory mapped IO. All seeking will go away and you can use extremely optimized code like memchr to read line by line!
Dann Corbit wrote: ↑Thu Nov 25, 2021 11:20 am
I had to change the code for Windows. Here is the link:
fseek()/ftell() do not work for files bigger than 2 GB (their offsets are only 32 bits).
I get 140,000 games per second but I have a fast disk.
There is a visual studio project and binary in the archive, but it uses /arch:AVX2 on the command line.
If you do not have an advanced CPU, you will want to change the command line option.
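A minimal sketch of the reading-by-blocks baseline being compared against mmap here (file name and block size are placeholders; tune the block size for your disk):

Code:

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    constexpr std::size_t kBlock = 1 << 20;      // 1 MiB per fread
    std::vector<char> buf(kBlock);
    FILE* f = std::fopen("games.pgn", "rb");     // placeholder name
    if (!f) return 1;
    std::size_t n, total = 0;
    while ((n = std::fread(buf.data(), 1, kBlock, f)) > 0) {
        total += n;                               // parse buf[0..n) here
    }
    std::fclose(f);
    std::printf("read %zu bytes\n", total);
    return 0;
}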

https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia chess tournament manager
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Open Chess Game Database Standard
You can tell the OS at open time that you are doing a sequential scan of the file.
Sopel wrote: ↑Fri Nov 26, 2021 1:37 am
That's going in a completely wrong direction. You want to AVOID file caching for large files being read sequentially (as PGN files usually are). Otherwise you'll pollute the file cache with unused data. Mmap forces caching, so it's the worst thing you can do.
dangi12012 wrote: ↑Thu Nov 25, 2021 1:48 pm
Use memory mapped IO. All seeking will go away and you can use extremely optimized code like memchr to read line by line!
Dann Corbit wrote: ↑Thu Nov 25, 2021 11:20 am
I had to change the code for Windows. Here is the link:
fseek()/ftell() do not work for files bigger than 2 GB (their offsets are only 32 bits).
I get 140,000 games per second but I have a fast disk.
There is a visual studio project and binary in the archive, but it uses /arch:AVX2 on the command line.
If you do not have an advanced CPU, you will want to change the command line option.
But you don't know what I mean:
When the parser reaches multiple GB/s (which it will only reach if parallelized), the buffer copy from fread becomes a bottleneck.
A multithreaded approach can access the mmapped file from all offsets in parallel, and that will be much slower with fseek!
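A sketch of the "tell it at open time" idea (assuming Win32's FILE_FLAG_SEQUENTIAL_SCAN and POSIX's posix_fadvise; OpenSequential is a hypothetical helper name):

Code:

// Hint sequential access at open time so the OS reads ahead and
// recycles pages behind the reader instead of keeping them cached.
#ifdef _WIN32
#include <windows.h>

HANDLE OpenSequential(const wchar_t* path)
{
    return CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
}
#else
#include <fcntl.h>
#include <unistd.h>

int OpenSequential(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd >= 0)
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // read-ahead hint
    return fd;
}
#endif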
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 12777
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Open Chess Game Database Standard
Chess games are stateful. SMP processing will be tricky.
If I can read 30 million games in five minutes and I never have to read those games again, how fast does it need to be?
I think we are now looking at premature optimization that, even if correctly implemented, would yield very little.
Much better to add functionality at this point.
Suggestion:
Parse SCID tables like ratings.ssp, add automatic linking of game positions to EPD, etc.
That kind of thing.
The number one rule of optimization is, "Don't do it."
The number two rule of optimization (for experts only) is, "Don't do it yet."
Here is my opinion about the right way to optimize:
When something is clearly not fast enough, profile it.
Identify the bottleneck.
Find or write a better algorithm for the bottleneck code.
All other optimization is bad optimization.
Something to think about:
Accomplish needed functionality.
Establish correctness.
Speed up if necessary.
First make it right, then make it fast.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 391
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Open Chess Game Database Standard
Did you actually read what I wrote? How is that an answer?
dangi12012 wrote: ↑Fri Nov 26, 2021 9:37 am
You can tell the OS at open time that you are doing a sequential scan of the file.
Sopel wrote: ↑Fri Nov 26, 2021 1:37 am
That's going in a completely wrong direction. You want to AVOID file caching for large files being read sequentially (as PGN files usually are). Otherwise you'll pollute the file cache with unused data. Mmap forces caching, so it's the worst thing you can do.
dangi12012 wrote: ↑Thu Nov 25, 2021 1:48 pm
Use memory mapped IO. All seeking will go away and you can use extremely optimized code like memchr to read line by line!
Dann Corbit wrote: ↑Thu Nov 25, 2021 11:20 am
I had to change the code for Windows. Here is the link:
fseek()/ftell() do not work for files bigger than 2 GB (their offsets are only 32 bits).
I get 140,000 games per second but I have a fast disk.
There is a visual studio project and binary in the archive, but it uses /arch:AVX2 on the command line.
If you do not have an advanced CPU, you will want to change the command line option.
Who is talking about buffer copies? Ever heard of `setvbuf`? Ever actually read the docs on implementations of fread? I guess not.
dangi12012 wrote:
But you don't know what I mean:
When the parser reaches multiple GB/s (which it will only reach if parallelized), the buffer copy from fread becomes a bottleneck.
https://docs.microsoft.com/en-us/cpp/c- ... w=msvc-170
When used on a text mode stream, if the amount of data requested (that is, size * count) is greater than or equal to the internal FILE * buffer size (by default this is 4096 bytes, configurable by using setvbuf), stream data is copied directly into the user-provided buffer, and newline conversion is done in that buffer.
What??? What a load of bullcrap.
dangi12012 wrote:
A multithreaded approach can access the mmapped file from all offsets in parallel, and that will be much slower with fseek!
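For what it's worth, a sketch of the setvbuf point (file name and sizes are placeholders; the bypass behavior is the one described in the MSVC docs quoted above, shown there for text-mode streams, and implementations differ):

Code:

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    FILE* f = std::fopen("games.pgn", "rb");   // placeholder name
    if (!f) return 1;

    // Option A: enlarge the stdio buffer. Must be called after fopen
    // and before the first read.
    setvbuf(f, nullptr, _IOFBF, 1 << 20);      // 1 MiB, full buffering

    // Option B: request at least the stream buffer size per fread, so
    // the runtime copies straight into the caller's buffer.
    std::vector<char> big(8u << 20);           // 8 MiB user buffer
    std::size_t n;
    while ((n = std::fread(big.data(), 1, big.size(), f)) > 0) {
        // parse big[0..n)
    }
    std::fclose(f);
    return 0;
}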
dangi12012 wrote:
No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 12777
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Open Chess Game Database Standard
No matter what approach is taken, we will have to read the data exactly once from disk. A sequential scan is the fastest method for that because buffering is predictable.
What a memory map would give you is simple, logical access to any point in the file. If the file does not exceed RAM, we could do the same thing with a sequential read into a memory buffer the size of the file. Memory-mapped access will simplify access if we need to page.
You cannot simply divide the file by thread count and simultaneously read the sections because chess is stateful. You will have to read above and below and detect the start of games. It is not as simple as it sounds. For example:
[Event "CRO-chT"]
[Site "Sibenik CRO"]
[Date "2007.10.13"]
[Round "1.12"]
[White "Lalic, Bogdan"]
[Black "Tratar, Marko"]
[Result "1/2-1/2"]
[ECO "E92"]
[WhiteElo "2500"]
[BlackElo "2502"]
[PlyCount "45"]
[EventDate "2007.10.13"]
[EventType "team"]
[EventRounds "9"]
[EventCountry "CRO"]
1. d4 d6 2. Nf3 Nf6 3. c4 g6 4. Nc3 Bg7 5. e4 O-O 6. Be2 e5 7. Be3 Ng4 8. Bg5
f6 9. Bc1 Nc6 10. d5 Ne7 11. h3 Nh6 12. h4 Nf7 13. h5 f5 14. Ng5 Nxg5
{
I remember a similar game from 2000, but Branko Damljanovic continued with hxg6 rather than Ng5:
[Event "YUG Team Ch 52nd"]
[Site "Novi Sad SRB"]
[Date "2000.08.31"]
[Round "7"]
[White "Damljanovic, Branko"]
[Black "Nevednichy, Vladislav"]
[Result "1/2-1/2"]
[WhiteElo "2559"]
[BlackElo "2582"]
[ECO "E92k"]
1.c4 g6 2.d4 Bg7 3.e4 d6 4.Nc3 Nf6 5.Nf3 O-O 6.Be2 e5 7.Be3 Ng4 8.Bg5 f6 9.Bc1 Nc6 10.d5 Ne7 11.h3 Nh6 12.h4 Nf7 13.h5 f5 14.hxg6 Nxg6 15.Qc2 f4 16.Bd2 c5 17.dxc6 bxc6 18.c5 d5 19.Bd3 d4 20.Na4 Bf6 21.O-O-O Kg7 22.Rdg1 Ng5 23.Nxg5 Bxg5 24.b3 Qe7 25.Bc4 Bg4 26.f3 Bd7 27.Be1 Rh8 28.Rh5 Be8 29.Rgh1 Nf8 30.R5h2 Nd7 31.Kb1 Bg6 32.Ka1 Rab8 33.Bd3 Rhc8 34.Ba6 Rh8 35.Ba5 Nf6 36.Bd3 Rb7 37.b4 Rhb8 38.Rb1 Bf7 39.Ba6 Rd7 40.Nb2 Ne8 41.Bc4 Bg6 42.Nd3 Nc7 43.Bxc7 Rxc7 44.Rhh1 Bf6 45.Qf2 Kh8 46.Rb2 Qe8 47.Qc2 Rg7 48.Rhb1 Bd8 49.Qa4 Bc7 50.Qa6 h5 51.b5 cxb5 52.Rxb5 Rxb5 53.Rxb5 Qd8 54.Rb1 Qg5 55.Rb2 Kh7 56.Be6 Bd8 57.Bh3 Qe7 58.Qd6 1/2-1/2
}
15. Bxg5 Bf6 16. Qd2 f4 17. Bxf6 Rxf6 18. hxg6 Rxg6 19. Rh2 Bd7 20. O-O-O Kh8 21. Bh5
Rg7 22. Ne2 Ng8 23. g3 1/2-1/2
It is not impossible that a 200GB PGN file is a single annotated game (though nobody could realistically read it all). I actually saw something like that once, in the form of an annotated opening book. It wasn't 200 GB, but it was many megabytes.
And just because something is crazy does not mean that people won't do it.
I often see keywords as column names in database tables and things like database names or table names with a minus sign in them.
So if we are to do it correctly, it seems we cannot really, truly know where we are in the file without scanning all the way from the top to the bottom. I imagine it can be done, but it would be really difficult.
What a memory-mapped file will give you is a paging, array-addressable version of the data. If it were not for recursive annotations, it would not be too difficult to do a multi-threaded PGN reader that can use all of the machine's threads to parse the file. But I suspect that the process is file I/O limited anyway.
A simple experiment to look for speedup would be to read a PGN file into memory and then parse it from the bottom and the top until the threads meet (and then have one thread finish the current game if we did not stop between games). I guess that the savings will be small.
How often do we read PGN from a source? For me, it is one time. So I do not think optimization of PGN reading is an optimal use of time.
That having been said, I am all ears if someone writes an ultra-fast PGN reader. And I did enjoy looking over this body of PGN reading code.
Which would YOU rather have:
A PGN reader that can parse 100 GB of PGN in 10 minutes
OR
A PGN reader that can parse 100 GB of PGN in 20 minutes, while simultaneously creating links to a tree of EPD records
?
That's a no-brainer for me, but everyone's use case is different.
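The brace-comment trap above is exactly what any splitter has to handle. Here is a minimal sketch of boundary detection that would not treat the embedded "[Event" as a new game (GameStarts is a hypothetical helper; it assumes the whole file is in memory and ignores ';' comments and quoted strings for brevity, which a real parser must also handle):

Code:

#include <cstddef>
#include <cstring>
#include <vector>

std::vector<std::size_t> GameStarts(const char* data, std::size_t len)
{
    std::vector<std::size_t> starts;
    int braceDepth = 0;                        // nesting of { } comments
    for (std::size_t i = 0; i < len; ++i) {
        char c = data[i];
        if (c == '{') ++braceDepth;
        else if (c == '}' && braceDepth > 0) --braceDepth;
        else if (c == '[' && braceDepth == 0 &&
                 (i == 0 || data[i - 1] == '\n') &&
                 i + 7 <= len && std::strncmp(data + i, "[Event ", 7) == 0)
            starts.push_back(i);               // a real game header
    }
    return starts;
}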
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Open Chess Game Database Standard
First of all: I think it's fast enough as is. Second, you can easily do multithreading with memory-mapped files, and it can be parallelized.
Dann Corbit wrote: ↑Sun Nov 28, 2021 5:53 pm
No matter what approach is taken, we will have to read the data exactly once from disk. A sequential scan is the fastest method for that because buffering is predictable.
I know this because I have played with an NVMe RAID array before: sequential reading caps out at 4-5 GB/s, while the array could do tens of GB/s.
The solution (and this only applies to well-formed files) is to single-threadedly do a memchr or string find on the file, but you don't read it line by line; instead you build an offset vector for all games.
For this, in a first pass, you remember the ptrdiff from the starting pointer for every place where "[Event" occurs.
The second pass is a parallel for (say, std::for_each with std::execution::par) over all offsets that concurrently parses all games in the file. It will be much, much faster if your parsing is complex.
Also, multithreading is a top-level construct. Maybe it's better to parse multiple files in parallel and have a fast single-threaded parser available.
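A sketch of that two-pass scheme (assuming data points at a memory-mapped or fully loaded file; ParseGame is a hypothetical per-game parser, the boundary test is the naive one with the brace-comment caveat from earlier, and C++17 parallel algorithms stand in for the parallel for, which may need -ltbb with GCC):

Code:

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <execution>
#include <vector>

void ParseAllGames(const char* data, std::size_t len)
{
    // Pass 1: single-threaded memchr scan collecting "[Event" offsets.
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while (pos < len) {
        const void* hit = std::memchr(data + pos, '[', len - pos);
        if (!hit) break;
        std::size_t i = static_cast<const char*>(hit) - data;
        if (i + 6 <= len && std::strncmp(data + i, "[Event", 6) == 0)
            offsets.push_back(i);
        pos = i + 1;
    }
    // Pass 2: parse every game concurrently.
    std::for_each(std::execution::par, offsets.begin(), offsets.end(),
                  [&](std::size_t off) {
                      // ParseGame(data + off);   // hypothetical parser
                      (void)off;
                  });
}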
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 391
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Open Chess Game Database Standard
It's also just as easy without memory-mapped files, with the added benefit that you can disable caching so that it's not polluted by large files that are read once. Read a large chunk [asynchronously], say 10MB, identify full games (preserve the remainder at the end, move it to the start of the next chunk), divide by games, process in parallel (sketched at the end of this post).
dangi12012 wrote: ↑Sun Nov 28, 2021 6:29 pm
Second, you can easily do multithreading with memory-mapped files, and it can be parallelized.
Why do you assume that not using a memmap requires reading the whole file into memory at once?
Dann Corbit wrote: ↑Sun Nov 28, 2021 5:53 pm
What a memory map would give you is simple, logical access to any point in the file. If the file does not exceed RAM, we could do the same thing with a sequential read into a memory buffer the size of the file. Memory-mapped access will simplify access if we need to page.
And that buffering is precisely what you want to AVOID for files that are meant to be read once.
Dann Corbit wrote: ↑Sun Nov 28, 2021 5:53 pm
A sequential scan is the fastest method for that because buffering is predictable.
You can do it such that it works well for well-behaved inputs and falls back to whatever implementation for anomalous inputs. The fact that some inputs might not benefit from an optimization doesn't mean it's worthless, because nice inputs will benefit from it.
Dann Corbit wrote: ↑Sun Nov 28, 2021 5:53 pm
You cannot simply divide the file by thread count and simultaneously read the sections because chess is stateful. You will have to read above and below and detect the start of games. It is not as simple as it sounds.
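A sketch of that chunked scheme (chunk size from the post; ProcessGamesInParallel is a hypothetical worker entry point, and the boundary test is the naive one, with the brace-comment caveat from earlier in the thread):

Code:

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

void ReadInChunks(FILE* f)
{
    constexpr std::size_t kChunk = 10 * 1024 * 1024;  // ~10 MB per read
    std::vector<char> buf(kChunk);
    std::string carry;                                // unfinished tail
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, kChunk, f)) > 0) {
        carry.append(buf.data(), n);
        // Cut before the last game header; everything up to it is complete.
        std::size_t cut = carry.rfind("\n[Event ");
        if (cut == std::string::npos) continue;       // no boundary yet
        std::string complete = carry.substr(0, cut + 1);
        carry.erase(0, cut + 1);
        (void)complete;
        // ProcessGamesInParallel(complete);          // hypothetical
    }
    // ProcessGamesInParallel(carry);                 // trailing game(s)
}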
dangi12012 wrote:
No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 12777
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Open Chess Game Database Standard
Re: "And that buffering is precisely what you want to AVOID for files that are meant to be read once."
11/03/2021 05:39 PM 441,396,530 lichess_gm_2020-09.pgn
Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

char buffer[512] = {0};

int main(int argc, char **argv)
{
    FILE *pFile;
    clock_t start = 0, end = 0;
    int buftype = 0;
    float seconds = 0;

    pFile = fopen("C:\\lichess\\lichess_gm_2020-09.pgn", "r");
    if (pFile == NULL) { /* check before setvbuf touches the stream */
        perror("fopen");
        return EXIT_FAILURE;
    }
    if (argc > 1)
    {
        buftype = atoi(argv[1]);
        if (buftype < 0) buftype = 0;
        if (buftype > 2) buftype = 2;
    }
    if (buftype == 0) /* full buffering */
    {
        setvbuf(pFile, NULL, _IOFBF, 32767);
        puts("Full buffering with 32 K");
    }
    else if (buftype == 1) /* line buffering */
    {
        setvbuf(pFile, NULL, _IOLBF, 512);
        puts("line buffering with 0.5 K");
    }
    else /* no buffering */
    {
        setvbuf(pFile, NULL, _IONBF, 0);
        puts("no buffering");
    }
    /* Drain the whole file and time it */
    start = clock();
    while (fread(buffer, 1, sizeof buffer, pFile) > 0) {
    }
    end = clock();
    fclose(pFile);

    seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("Elapsed time is %g seconds.\n", seconds);
    return 0;
}
/*
G:\cc>gcc buftest.c
G:\cc>a 0
Full buffering with 32 K
Elapsed time is 1.027 seconds.
G:\cc>a 1
line buffering with 0.5 K
Elapsed time is 3.34 seconds.
G:\cc>a 2
no buffering
Elapsed time is 5.363 seconds.
*/
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.