PGN for dummies

Fulvio · Post by **Fulvio** » Thu Nov 11, 2021 5:44 pm

Each character belongs to a category, and the most important one is the one called "symbols":
"letter characters ("A-Za-z"), digit characters ("0-9"), the underscore ("_"), the plus sign ("+"), the octothorpe sign ("#"), the equal sign ("="), the colon (":"), and the hyphen ("-")."

Parsing a PGN file is done by reading a character and using its category to identify the end of the token.
Some tokens consist of only one character (the standard calls them "self terminating").
Some of those are usually simply ignored:
spaces (' ' '\t' '\v' '\r' '\n')
angle brackets ('<' '>')
period ('.')

others represent a token:
( --> variation start
) --> variation end
* --> end of a game with an unknown or otherwise unavailable result
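The categories listed so far can be sketched as a small lookup. This is only an illustration of the idea, not code from any existing parser; the names `SYMBOL_CHARS`, `IGNORED_CHARS` and `category` are made up for the example.

```python
import string

# Symbol characters, as listed in the passage quoted from the PGN standard.
SYMBOL_CHARS = frozenset(string.ascii_letters + string.digits + "_+#=:-")

# Characters that are simply skipped: whitespace, angle brackets, period.
IGNORED_CHARS = frozenset(" \t\v\r\n<>.")

# Single-character tokens: variation start, variation end, unknown result.
SINGLE_CHAR_TOKENS = frozenset("()*")


def category(ch: str) -> str:
    """Return a coarse category name for one character (hypothetical helper)."""
    if ch in SYMBOL_CHARS:
        return "symbol"
    if ch in IGNORED_CHARS:
        return "ignored"
    if ch in SINGLE_CHAR_TOKENS:
        return "self-terminating"
    return "other"
```

Using frozensets keeps the per-character test a constant-time membership check, which matters when the same test runs once for every byte of a large file.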

Other tokens are composed of several characters, and the first character determines where the token ends:
{ --> read next chars while char != '}' --> comment
; --> read next chars while char != '\n' --> comment
$ --> read next chars while char is a digit --> NAG
a symbol (see above) --> read next chars while char is a symbol

if the first char is not a digit --> a move
if all the chars are digits --> a move number
if it is "0-1" or "1-0" or "1/2-1/2" --> end of the game
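A minimal sketch of these rules as a movetext tokenizer follows. It is an illustration under a few assumptions, not a reference implementation: the token names are invented, the tag section is not handled, and '/' is accepted inside a symbol token so that "1/2-1/2" stays in one piece (the slash is not in the symbol list above).

```python
import string

SYMBOL = frozenset(string.ascii_letters + string.digits + "_+#=:-")
DIGITS = frozenset(string.digits)
IGNORED = frozenset(" \t\v\r\n<>.")
RESULTS = {"1-0", "0-1", "1/2-1/2"}


def tokenize(text):
    """Yield (kind, value) pairs for PGN movetext (tag pairs not handled)."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in IGNORED:
            i += 1
        elif ch == "(":
            yield ("variation_start", ch)
            i += 1
        elif ch == ")":
            yield ("variation_end", ch)
            i += 1
        elif ch == "*":
            yield ("result", ch)
            i += 1
        elif ch in "{;":
            # Brace comments end at '}', rest-of-line comments at '\n'.
            closer = "}" if ch == "{" else "\n"
            end = text.find(closer, i + 1)
            if end == -1:
                end = n  # tolerate an unterminated comment
            yield ("comment", text[i + 1:end])
            i = end + 1
        elif ch == "$":
            j = i + 1
            while j < n and text[j] in DIGITS:
                j += 1
            yield ("nag", text[i + 1:j])
            i = j
        elif ch in SYMBOL:
            j = i + 1
            # Assumption: also accept '/' so that "1/2-1/2" is one token.
            while j < n and (text[j] in SYMBOL or text[j] == "/"):
                j += 1
            word = text[i:j]
            if word in RESULTS:
                kind = "result"
            elif all(c in DIGITS for c in word):
                kind = "move_number"
            elif word[0] not in DIGITS:
                kind = "move"
            else:
                kind = "unknown"
            yield (kind, word)
            i = j
        else:
            i += 1  # skip anything else, e.g. '!' or '?'


tokens = list(tokenize("1. e4 e5 {best} 2. Nf3 $1 (2. f4) 1/2-1/2"))
```

The move tokens are still plain strings at this point; working out which piece actually moves is a separate step.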

As for the tag pairs:
[ --> skip spaces
--> read next chars while char is a symbol --> tag name
--> skip spaces
--> read next chars until char == ']' and [char-1] != '\' --> tag value

That's all.
Then it is actually necessary to interpret the SAN moves, which can be annoying, for example when a move is unambiguous only because one of the pieces is pinned.
You also have to decide how tolerant you want to be of non-compliant input.
But at its core it's really simple.
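Before the board-dependent part, a SAN token can at least be split into its fields. The regular expression below is a rough sketch for strict SAN only; the pattern and the name `parse_san` are assumptions made for this example. It says nothing about legality or pins, which need a move generator.

```python
import re

# Strict SAN: castling, or [piece][from-file][from-rank][x]<to>[=promotion],
# optionally followed by a check or mate marker.
SAN_RE = re.compile(
    r"^(?:(?P<castle>O-O(?:-O)?)"
    r"|(?P<piece>[KQRBN])?(?P<from_file>[a-h])?(?P<from_rank>[1-8])?"
    r"(?P<capture>x)?(?P<to>[a-h][1-8])(?:=(?P<promo>[QRBN]))?)"
    r"(?P<check>[+#])?$"
)


def parse_san(move: str):
    """Return a dict of SAN fields, or None if the text is not strict SAN."""
    m = SAN_RE.match(move)
    return m.groupdict() if m else None


rook_capture = parse_san("Raxa5")
pawn_push = parse_san("e4")
promotion = parse_san("e8=Q+")
```

A tolerant reader would need extra patterns on top of this; that is the "how tolerant do you want to be" decision.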

Fulvio · Post by **Fulvio** » Thu Nov 11, 2021 6:49 pm

Correction for the tag pairs:
[ --> skip spaces
--> read next chars while char is a symbol --> tag name
--> skip spaces
--> skip quote char "
--> read next chars until char == '"' and [char-1] != '\' --> tag value
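A sketch of the corrected steps, for illustration only. The function name and return shape are made up. One deliberate difference: instead of looking back at the previous character, it consumes a backslash together with the character it escapes, so a value that ends in an escaped backslash (`\\"`) still terminates correctly.

```python
import string

SYMBOL = frozenset(string.ascii_letters + string.digits + "_+#=:-")


def parse_tag_pair(text, i=0):
    """Parse one tag pair starting at text[i] == '['.

    Returns (name, value, index just past the closing ']').
    """
    n = len(text)
    if i >= n or text[i] != "[":
        raise ValueError("expected '['")
    i += 1
    while i < n and text[i].isspace():      # skip spaces
        i += 1
    start = i
    while i < n and text[i] in SYMBOL:      # tag name
        i += 1
    name = text[start:i]
    while i < n and text[i].isspace():      # skip spaces
        i += 1
    if i >= n or text[i] != '"':
        raise ValueError("expected opening quote")
    i += 1                                  # skip quote char
    chars = []
    while i < n and text[i] != '"':         # tag value
        if text[i] == "\\" and i + 1 < n:
            i += 1                          # backslash escapes the next char
        chars.append(text[i])
        i += 1
    if i >= n:
        raise ValueError("unterminated tag value")
    i += 1                                  # closing quote
    while i < n and text[i] != "]":         # tolerate junk before ']'
        i += 1
    if i >= n:
        raise ValueError("expected ']'")
    return name, "".join(chars), i + 1


line = '[Event "F/S Return Match"]'
name, value, end = parse_tag_pair(line)
```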

hgm · Post by **hgm** » Thu Nov 11, 2021 7:08 pm

I agree, I don't see any problem in parsing PGN. The only difficulty is really to extract the semantics of the SAN moves. But I cannot imagine any serious application where it would be acceptable to be unable to parse SAN moves. SAN is everywhere, because it is the move format preferred by humans. And you will encounter many places where games are published for human reading, and it would seriously hurt any application if it could not read those.

dangi12012 · Post by **dangi12012** » Thu Nov 11, 2021 7:23 pm

Fulvio wrote: ↑Thu Nov 11, 2021 5:44 pm Parsing a PGN file is done reading a char and using its category to identify the end of the token. [...] But at its core it's really simple.

And now, please, a high-performance library for that.
Parsing 80 GB of PGN for a single year is SLOW!

Chess programming really needs performance improvements on all ends. To be honest, I don't really see JSON replacing that.

hgm · Post by **hgm** » Fri Nov 12, 2021 8:02 am

Parsing 80GB of PGN is not slower than reading 80GB of anything from disk, right? It is slow because 80GB is a lot, not because it is PGN. Binary formats would be more compact, and could thus be faster.

Kotlov · Post by **Kotlov** » Fri Nov 12, 2021 9:07 am

hgm wrote: ↑Fri Nov 12, 2021 8:02 am It is slow because 80GB is a lot, not because it is PGN.

Golden words

mvanthoor · Post by **mvanthoor** » Fri Nov 12, 2021 10:58 am

hgm wrote: ↑Thu Nov 11, 2021 7:08 pm I agree, I don't see any problem in parsing PGN. The only difficulty is really to extract the semantics of the SAN moves. But I cannot imagine any serious application where it would be acceptable to be unable to parse SAN moves. SAN is everywhere, because it is the move format preferred by humans. And you will encounter many places where games are published for human reading, and it would seriously hurt any application if it could not read those.

I disagree. SAN should never have existed. Too many mistakes can be, and are being, made. Even in publications.

Ra1xa5 is infinitely more readable than "Raa5:".

Then there's the thing that some publications use "Rxaa5" or "Raxa5"; when pawns capture pawns, it can be written as "exd5", "ed5:", "ed:", "exd", or even just "ed".

SAN may be the preferred format for _some_ humans, but it's definitely not mine.

dangi12012 · Post by **dangi12012** » Fri Nov 12, 2021 1:24 pm

hgm wrote: ↑Fri Nov 12, 2021 8:02 am Parsing 80GB of PGN is not slower than reading 80GB of anything from disk, right? It is slow because 80GB is a lot, not because it is PGN. Binary formats would be more compact, and could thus be faster.

Parsing at 500 kB/s or at 500 MB/s makes a difference.
The first is a naive first implementation; the other would be an optimized parser.

If it's binary, we wouldn't need to parse anything and could map the file directly into the memory space as a memory-mapped file.
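To illustrate the memory-mapping idea, here is a small sketch with a made-up fixed-width format: two bytes per move (from-square and to-square, 0..63). The format and the helper names are purely hypothetical and do not correspond to any existing chess database format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: from-square, to-square, one byte each.
RECORD = struct.Struct("<BB")


def write_moves(path, moves):
    """Write (from_square, to_square) pairs as fixed-width records."""
    with open(path, "wb") as f:
        for frm, to in moves:
            f.write(RECORD.pack(frm, to))


def read_move(path, index):
    """Read record number `index` straight from the mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # No tokenizing: the offset is computed, not searched for.
            return RECORD.unpack_from(mm, index * RECORD.size)


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "moves.bin")
write_moves(path, [(12, 28), (52, 36), (6, 21)])  # e2e4, e7e5, g1f3
second = read_move(path, 1)
```

Random access by index is the part a text format cannot offer without building an index first.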

dangi12012 · Post by **dangi12012** » Fri Nov 12, 2021 1:31 pm

mvanthoor wrote: ↑Fri Nov 12, 2021 10:58 am I disagree. SAN should never have existed. [...] SAN may be the preferred format for _some_ humans, but it's definitely not mine.

So is there an open source SAN parser?

mvanthoor · Post by **mvanthoor** » Fri Nov 12, 2021 1:52 pm

dangi12012 wrote: ↑Fri Nov 12, 2021 1:31 pm So is there an open source SAN parser?

Several, but they're probably integrated into other programs. In Rust, there's the (massive) Shakmaty library, which also contains a SAN parser. (Shakmaty aims to be the Rust equivalent of python-chess, as far as I can see.)

In the end, I'll probably just split off Rustic's business code into a library myself. I dislike basing my programs on massive external libraries, especially for the parts I can (somewhat easily) write myself.
