PGN for dummies

Fulvio
Posts: 395
Joined: Fri Aug 12, 2016 8:43 pm

PGN for dummies

Post by Fulvio »

Each character belongs to a category, and the most important one is the one called "symbols":
"letter characters ("A-Za-z"), digit characters ("0-9"), the underscore ("_"), the plus sign ("+"), the octothorpe sign ("#"), the equal sign ("="), the colon (":"), and the hyphen ("-")."

Parsing a PGN file is done by reading one character at a time and using its category to find the end of the current token.
Some tokens consist of only one character (the standard calls them "self-terminating").
Some of those are usually simply ignored:
spaces (' ' '\t' '\v' '\r' '\n')
angle brackets ('<' '>')
period ('.')

others represent a token:
( --> variation start
) --> variation end
* --> end of a game with an unknown or otherwise unavailable result
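
The categories above could be written down as a small lookup. This is only a sketch: the category names and the `classify` helper are made up for illustration and are not taken from the standard.

```python
import string

# Symbol characters as listed in the passage quoted from the standard.
SYMBOL_CHARS = set(string.ascii_letters + string.digits + "_+#=:-")
# Single characters that are usually skipped.
IGNORED_CHARS = set(" \t\v\r\n<>.")
# Single characters that are a token on their own (names are illustrative).
SINGLE_CHAR_TOKENS = {
    "(": "variation_start",
    ")": "variation_end",
    "*": "result_unknown",
}


def classify(ch: str) -> str:
    """Return an illustrative category name for one character."""
    if ch in SYMBOL_CHARS:
        return "symbol"
    if ch in IGNORED_CHARS:
        return "ignored"
    if ch in SINGLE_CHAR_TOKENS:
        return SINGLE_CHAR_TOKENS[ch]
    # Everything else ('{', ';', '$', '[', ...) starts a longer token
    # or is simply unexpected input.
    return "other"
```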

Other tokens consist of several characters, and the first character determines where the token ends:
{ --> read next chars until char == '}' --> comment
; --> read next chars until char == '\n' --> comment
$ --> read next chars while char is a digit --> NAG
a symbol (see above) --> read next chars while char is a symbol
  • if the first char is not a digit --> a move
  • if all the chars are digits --> a move number
  • if it is "0-1" or "1-0" or "1/2-1/2" --> end of the game
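
Put together, the movetext rules above might look roughly like this. It is a minimal sketch, not a compliant parser: the function name and the token names are invented, unknown characters such as '!' and '?' are simply skipped, and '/' is accepted inside a symbol run so that "1/2-1/2" comes out as one token, which is an assumption on top of the quoted symbol list.

```python
import string

SYMBOL_CHARS = set(string.ascii_letters + string.digits + "_+#=:-")
RESULTS = {"1-0", "0-1", "1/2-1/2"}


def tokenize_movetext(text):
    """Split movetext into (kind, value) tuples."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in " \t\v\r\n<>.":
            i += 1  # ignored single-character tokens
        elif ch == "(":
            tokens.append(("variation_start", ch))
            i += 1
        elif ch == ")":
            tokens.append(("variation_end", ch))
            i += 1
        elif ch == "*":
            tokens.append(("result", ch))
            i += 1
        elif ch in "{;":
            # Brace comments end at '}', rest-of-line comments at '\n'.
            end = text.find("}" if ch == "{" else "\n", i + 1)
            if end == -1:
                end = n  # unterminated comment: take the rest of the input
            tokens.append(("comment", text[i + 1:end]))
            i = end + 1
        elif ch == "$":
            j = i + 1
            while j < n and text[j].isdigit():
                j += 1
            tokens.append(("nag", text[i + 1:j]))
            i = j
        elif ch in SYMBOL_CHARS:
            j = i + 1
            # Assumption: '/' is allowed here only to keep "1/2-1/2" whole.
            while j < n and (text[j] in SYMBOL_CHARS or text[j] == "/"):
                j += 1
            word = text[i:j]
            if word in RESULTS:
                kind = "result"
            elif word.isdigit():
                kind = "move_number"
            else:
                kind = "move"
            tokens.append((kind, word))
            i = j
        else:
            i += 1  # anything else is skipped in this sketch
    return tokens
```

The moves come out as raw SAN strings; nothing here checks that they are legal or even well formed.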

As for the tag pairs:
[ --> skip spaces
--> read next chars while char is a symbol --> tag name
--> skip spaces
--> read next chars until char == ']' and [char-1] != '\' --> tag value

That's all.
After that you still have to interpret the SAN moves, which can be annoying, for example when a move is unambiguous only because one of the candidate pieces is pinned.
You also have to decide how tolerant you want to be of non-compliant input.
But at its core it's really simple.
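
The purely syntactic part of reading a SAN move can be sketched with a regular expression. This assumes strict SAN as exported by most software; `SAN_RE` and `parse_san` are names made up for the example. It only splits the string into fields. Deciding which piece actually moves, including the pinned-piece case, needs the board position and a legal move generator, which is not shown.

```python
import re

# Strict SAN: castling, or [piece][from-file][from-rank][x]target[=promo],
# optionally followed by '+' or '#'.
SAN_RE = re.compile(
    r"^(?:(?P<castle>O-O(?:-O)?)"
    r"|(?P<piece>[KQRBN])?(?P<from_file>[a-h])?(?P<from_rank>[1-8])?"
    r"(?P<capture>x)?(?P<to>[a-h][1-8])(?:=(?P<promo>[QRBN]))?)"
    r"(?P<check>[+#])?$"
)


def parse_san(move):
    """Return a dict of SAN fields, or None if the string is not strict SAN."""
    m = SAN_RE.match(move)
    return m.groupdict() if m else None
```

Fields that are absent come back as `None`, so `parse_san("e4")` has only `to` filled in. Anything that is not strict SAN is rejected outright; how much of that to accept is the tolerance question mentioned above.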
Fulvio
Posts: 395
Joined: Fri Aug 12, 2016 8:43 pm

Re: PGN for dummies

Post by Fulvio »

Correction for the tag pairs:
[ --> skip spaces
--> read next chars while char is a symbol --> tag name
--> skip spaces
--> skip quote char "
--> read next chars until char == '"' and [char-1] != '\' --> tag value
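
As a rough sketch, the corrected steps could be coded like this. `parse_tag_pair` is a made-up name, and instead of looking back at the previous character it consumes backslash escapes going forward, which also copes with a value that ends in an escaped backslash.

```python
import string

SYMBOL_CHARS = set(string.ascii_letters + string.digits + "_+#=:-")


def parse_tag_pair(line):
    """Parse one '[Name "Value"]' tag pair; return (name, value) or None."""
    i, n = 0, len(line)
    while i < n and line[i].isspace():
        i += 1
    if i >= n or line[i] != "[":
        return None
    i += 1
    while i < n and line[i].isspace():
        i += 1
    start = i
    while i < n and line[i] in SYMBOL_CHARS:
        i += 1
    name = line[start:i]
    while i < n and line[i].isspace():
        i += 1
    if not name or i >= n or line[i] != '"':
        return None
    i += 1  # skip the opening quote
    value = []
    while i < n and line[i] != '"':
        # A backslash escapes a following quote or backslash.
        if line[i] == "\\" and i + 1 < n and line[i + 1] in '"\\':
            i += 1
        value.append(line[i])
        i += 1
    if i >= n:
        return None  # no closing quote found
    return name, "".join(value)
```

The closing ']' is not checked here; a stricter parser would skip spaces after the closing quote and require it.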
hgm
Posts: 27836
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

I agree, I don't see any problem in parsing PGN. The only difficulty is really to extract the semantics of the SAN moves. But I cannot imagine any serious application where it would be acceptable to be unable to parse SAN moves. SAN is everywhere, because it is the move format preferred by humans. And you will encounter many places where games are published for human reading, and it would seriously hurt any application if it could not read those.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: PGN for dummies

Post by dangi12012 »

Fulvio wrote: Thu Nov 11, 2021 5:44 pm [...] But at its core it's really simple.
And now, please, a high-performance library for that.
Parsing 80 GB of PGN for a single year is SLOW!

Chess programming really needs performance improvements on all ends. To be honest, I don't really see JSON replacing that.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
hgm
Posts: 27836
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

Parsing 80GB of PGN is not slower than reading 80GB of anything from disk, right? It is slow because 80GB is a lot, not because it is PGN. Binary formats would be more compact, and could thus be faster.
Kotlov
Posts: 266
Joined: Fri Jul 10, 2015 9:23 pm
Location: Russia

Re: PGN for dummies

Post by Kotlov »

hgm wrote: Fri Nov 12, 2021 8:02 am It is slow because 80GB is a lot, not because it is PGN.
Golden words
Eugene Kotlov
Hedgehog 2.1 64-bit coming soon...
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: PGN for dummies

Post by mvanthoor »

hgm wrote: Thu Nov 11, 2021 7:08 pm I agree, I don't see any problem in parsing PGN. The only difficulty is really to extract the semantics of the SAN moves. But I cannot imagine any serious application where it would be acceptable to be unable to parse SAN moves. SAN is everywhere, because it is the move format preferred by humans. And you will encounter many places where games are published for human reading, and it would seriously hurt any application if it could not read those.
I disagree. SAN should never have existed. Too many mistakes can be, and are being, made, even in publications.

Ra1xa5 is infinitely more readable than "Raa5:".

Then there's the fact that some publications use "Rxaa5" or "Raxa5"; and when pawns capture pawns, the move can be written as "exd5", "ed5:", "ed:", "exd", or even just "ed".

SAN may be the preferred format for _some_ humans, but it's definitely not mine.
Author of Rustic, an engine written in Rust.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: PGN for dummies

Post by dangi12012 »

hgm wrote: Fri Nov 12, 2021 8:02 am Parsing 80GB of PGN is not slower than reading 80GB of anything from disk, right? It is slow because 80GB is a lot, not because it is PGN. Binary formats would be more compact, and could thus be faster.
Parsing at 500 kB/s or at 500 MB/s makes a difference.
The first is a naive first implementation; the second would be an optimized parser.

If it's binary, we wouldn't need to parse anything and could map it directly into the address space as a memory-mapped file.
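
For reference, memory mapping itself is easy to try, even on a text PGN. A minimal sketch using Python's standard `mmap` module: `count_games` is a made-up helper that counts games by looking for `[Event ` at the start of a line, assuming every game begins with an Event tag.

```python
import mmap


def count_games(path):
    """Count '[Event ' tags at line starts without reading the file into memory."""
    with open(path, "rb") as f:
        # Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The first game has no preceding newline.
            count = 1 if mm[:7] == b"[Event " else 0
            pos = mm.find(b"\n[Event ")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n[Event ", pos + 1)
            return count
```

The operating system pages the file in on demand, so the scan never holds the whole file in a Python object. A binary format would go one step further and skip the scanning as well.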
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: PGN for dummies

Post by dangi12012 »

mvanthoor wrote: Fri Nov 12, 2021 10:58 am
I disagree. SAN should never have existed. Too many mistakes can be, and are being, made, even in publications.

Ra1xa5 is infinitely more readable than "Raa5:".

Then there's the fact that some publications use "Rxaa5" or "Raxa5"; and when pawns capture pawns, the move can be written as "exd5", "ed5:", "ed:", "exd", or even just "ed".

SAN may be the preferred format for _some_ humans, but it's definitely not mine.
So is there an open source SAN parser?
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: PGN for dummies

Post by mvanthoor »

dangi12012 wrote: Fri Nov 12, 2021 1:31 pm So is there an open source SAN parser?
Several, but they're probably integrated in other programs. In Rust, there's the (massive) shakmaty library, which also contains a SAN parser. (Shakmaty aims to be the Rust equivalent of python-chess, as far as I can see.)

In the end, I'll probably just split off Rustic's business code into a library myself. I dislike basing my programs on massive external libraries, especially for the parts I can (somewhat easily) write myself.