Loving Parquet (Beyond PGN)

Fulvio · Post by **Fulvio** » Tue Nov 28, 2023 8:35 pm

I wrote a simple Python script that converts PGNs to Parquet files:
https://github.com/benini/pgn_to_parquet
As an example I converted the latest 1516 and 1515 archives from The Week In Chess and uploaded them to Github.

Why it is so cool?
1) It is possible to run SQL queries on the Parquet files!
2) It is possible to query the file remotely, without downloading the entire file!
3) It is possible to join multiple files into a single database, again without downloading them!
4) It is possible to output the queries in multiple file formats: json, csv, markdown, etc...

Example using DuckDB:

Code: Select all

CREATE VIEW chess_games AS SELECT * FROM read_parquet(
  ['https://github.com/benini/pgn_to_parquet/raw/main/twic1515.parquet',
   'https://github.com/benini/pgn_to_parquet/raw/main/twic1516.parquet']);
.mode markdown
SELECT White,Black,Result,Date FROM chess_games WHERE White ilike '%vachier%';

Code: Select all

|       White       |      Black       | Result  |        Date         |
|-------------------|------------------|---------|---------------------|
| Vachier Lagrave,M | Caruana,F        | 1/2-1/2 | 2023-11-14 00:00:00 |
| Vachier Lagrave,M | Nepomniachtchi,I | 1/2-1/2 | 2023-11-15 00:00:00 |
| Vachier Lagrave,M | Robson,R         | 1/2-1/2 | 2023-11-16 00:00:00 |
| Vachier Lagrave,M | Sevian,Samuel    | 1/2-1/2 | 2023-11-16 00:00:00 |
| Vachier Lagrave,M | Nepomniachtchi,I | 0-1     | 2023-11-17 00:00:00 |
| Vachier Lagrave,M | Giri,A           | 1-0     | 2023-11-17 00:00:00 |
| Vachier Lagrave,M | So,W             | 1-0     | 2023-11-17 00:00:00 |
| Vachier Lagrave,M | Caruana,F        | 1/2-1/2 | 2023-11-17 00:00:00 |
| Vachier Lagrave,M | Le,Quang Liem    | 0-1     | 2023-11-18 00:00:00 |
| Vachier Lagrave,M | Firouzja,Alireza | 1-0     | 2023-11-18 00:00:00 |
| Vachier Lagrave,M | Robson,R         | 1/2-1/2 | 2023-11-18 00:00:00 |
| Vachier Lagrave,M | Xiong,Jeffery    | 1/2-1/2 | 2023-11-18 00:00:00 |
| Vachier Lagrave,M | Sevian,Samuel    | 1/2-1/2 | 2023-11-18 00:00:00 |
| Vachier Lagrave,M | Aronian,L        | 1/2-1/2 | 2023-11-21 00:00:00 |
| Vachier Lagrave,M | Caruana,F        | 1/2-1/2 | 2023-11-23 00:00:00 |
| Vachier Lagrave,M | Nepomniachtchi,I | 1/2-1/2 | 2023-11-25 00:00:00 |

So cool!!!

flok · Post by **flok** » Wed Nov 29, 2023 8:40 am

Fulvio wrote: ↑Tue Nov 28, 2023 8:35 pm I wrote a simple Python script that converts PGNs to Parquet files:

Am I right that parquet-files are binary? That's not very convenient to process from e.g. the commandline.

Fulvio · Post by **Fulvio** » Wed Nov 29, 2023 1:20 pm

flok wrote: ↑Wed Nov 29, 2023 8:40 am Am I right that parquet-files are binary?

Yes, but usually PGNs are also compressed binaries.
For instance, the TWIC archive 1516 is a 1.92 MB zipped file. The equivalent Parquet file is slightly smaller at 1.74 MB and can easily be converted back to PGN or other text formats like JSON, CSV, etc.

The most significant advantage, however, is the ability to download only the necessary data.
Let's say you followed the recent Kramnik-Nakamura cheating drama and are curious to know how top players perform on Lichess.
You could go to https://database.lichess.org/
Download the 30+ GB zstd compressed PGNs.
Decompress them, which is not so easy on Windows.
Parse the PGNs, which is not so easy anywhere, each file is over 200GB.

That's very time-consuming.
If they where Parquet files it would be possible to open DuckDB on the command line and run:

Code: Select all

SELECT Date, White, WhiteElo, Black, BlackElo, Result
FROM 'https://github.com/benini/pgn_to_parquet/raw/main/twic1516.parquet'
WHERE WhiteElo > 2700 OR BlackElo > 2700;

(that downloads about 200KB)

Loving Parquet (Beyond PGN)

Loving Parquet (Beyond PGN)

Re: Loving Parquet (Beyond PGN)

Re: Loving Parquet (Beyond PGN)