Gaviota TBs, compression schemes

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

Gaviota TBs, compression schemes

Post by michiguel » Wed Dec 30, 2009 9:07 am

I am done writing the compressor for the endgame tablebases (tbcp). I have done some basic research on a number of compression schemes, all of them incorporated as options. The compressor/decompressor help is at the end.

The analysis of the different schemes is here:
http://sites.google.com/site/gaviotachessengine/schemes

All of them, AFAIK, are either BSD-licensed or public domain. I wrote the Huffman coding and the RLE myself; the others are from free libraries. There are gazillions of good alternatives, but most of them are GPLed (and the idea is to have a more permissive option).

The default will be LZMA, which is the one that produces the smallest files. AFAIK, it is smaller than Nalimov's and it fits on an 8 GB USB stick. However, there are other valid alternatives with potential (Zlib-9 and LZF). Those could give better performance in the future as people move to solid-state disks. The decompression speed of LZF is close to the speed of light (and the source code is really small!!).

I still have to try several ideas to improve this, and a couple of other compression libraries (which may be faster, but have no chance of competing with LZMA's compression ratio). But I do not want to delay this any further.

So there is a chance (but a slim one) that I will find a better alternative, or, more likely, that a better compression algorithm and free library will be designed by somebody else in the future. In those cases, I would like to keep the flexibility for progress. I think I may set LZMA as the default that every engine developer should support if they support Gaviota TBs; if in the future we find something better, we could move to a new scheme, but always keep supporting LZMA to guarantee backward compatibility. In other words, if testers decide to maintain only one copy, the LZMA version, they will always be covered. The library with the probing code that I will release should be able to do this, but engine developers will be responsible for providing a switch, a WB/UCI option, etc., so the user can tell the engine which TBs are available.

Miguel

Code:

miguel@avefenix:/media/bigdisk/dev/tbcp$ ./tbcp -h

Compression/decompression utility for Gaviota Table Bases.
It uses different free compression algorithms. More information at
http://sites.google.com/site/gaviotachessengine/

quick example: tbcp gtb
 compresses all the endgame table base files in the folder "gtb"
 with the default compression scheme 2, and no deletion of the source files

usage: tbcp [-OPTION]... [folder]

 -h        give this help
 -v        give version number and exits
 -c        compress (default)
 -d        decompress
 -k        keep   source files, after compression/decompression (default)
 -r        remove source files, after compression/decompression
 -V        verify encoding/decoding on the fly
 -q        quiet (no output except error messages)
 -i <file> work on the given individual file, rather than a folder
 -s   #    sets compression scheme (0-2, default=2)
 -n   #    operates on up to 3-, 4- or 5-piece tablebases (default=5)

Copyright (c) 2009-2010 Miguel A. Ballicora
There is NO WARRANTY of any kind

Gian-Carlo Pascutto
Posts: 1192
Joined: Sat Dec 13, 2008 6:00 pm
Contact:

Re: Gaviota TBs, compression schemes

Post by Gian-Carlo Pascutto » Wed Dec 30, 2009 10:45 am

For my own EGTB (used in Stoofvlees) I applied some of the ideas from LZO and wrote my own variant of LZF, which was significantly more performant than the regular one.

Note that you can use a zlib-style hashtable with buckets to make an LZF compressor that will greatly outdo the standard one in compression ratio, with no penalty in decompression speed.

Neither will come close to LZMA in ratio, though.

Pradu
Posts: 287
Joined: Sat Mar 11, 2006 2:19 am
Location: Atlanta, GA
Contact:

Re: Gaviota TBs, compression schemes

Post by Pradu » Wed Dec 30, 2009 9:57 pm

michiguel wrote:I am done writing the compressor for the endgame tablebases (tbcp)...
Are there any good resources describing LZMA? I realize that there's already some C code, but I don't think it will be that useful for learning the algorithm.

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

Re: Gaviota TBs, compression schemes

Post by michiguel » Wed Dec 30, 2009 10:15 pm

Gian-Carlo Pascutto wrote:For my own EGTB (used in Stoofvlees) I applied some of the ideas from LZO and wrote my own variant of LZF, which was significantly more performant than the regular one.

Note that you can use a zlib-style hashtable with buckets to make an LZF compressor that will greatly outdo the standard one in compression ratio, with no penalty in decompression speed.

Neither will come close to LZMA in ratio, though.
I think that writing my own variant of those algorithms would be overkill for me, but... writing a chess engine is already overkill, not to mention an EGTB :-) If anybody had told me years ago that I would be in this situation, I would have laughed so hard.

Anyway, there are a couple of fast decompressors I did not test:
http://www.fastlz.org/download.htm

and LZJB (it needs some work to compile because it has some weird typedefs and casts; I tried to tweak it and it crashed on me, though I gave up too quickly):
http://src.opensolaris.org/source/xref/ ... compress.c
It is very appealing that the whole source code is so small.

Anyway, there are things I want to try. For instance, any compressor will benefit from a canonical transformation of the data. We could filter out specific positions from a given block on the fly (pieces that overlap, stalemates, mates, etc.) during compression, and put them back during decompression. Something similar was suggested in another forum by Urban Koistinen (or at least that is what I understood). This may add another layer of compression. As a proof of concept, this morning I tested filter(forbidden positions)+RLE+Huffman and got 8.1 GiB. As good as bzip2 alone, and much faster.

I searched for Stoofvlees and found that it is an exquisite "Carbonada flamenca", mmmhhh... yummy. Is that a game?

Miguel

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

Re: Gaviota TBs, compression schemes

Post by michiguel » Wed Dec 30, 2009 10:29 pm

Pradu wrote:
michiguel wrote:I am done writing the compressor for the endgame tablebases (tbcp)...
Are there any good resources describing LZMA? I realize that there's already some C code, but I don't think it will be that useful for learning the algorithm.
From what I read, LZMA is LZ77 + arithmetic coding, so I guess there is literature on both. From all the links I surveyed to collect free libraries, I conclude it may be a good idea to ask this guy:
http://www.mattmahoney.net/
He is a professor who is an expert in data compression and runs an impressive benchmark page:
http://www.mattmahoney.net/dc/text.html

Miguel

ernest
Posts: 1875
Joined: Wed Mar 08, 2006 7:30 pm

Re: Gaviota TBs, compression schemes

Post by ernest » Thu Dec 31, 2009 12:10 am

michiguel wrote:it is smaller than Nalimov's and it fits on an 8 GB USB stick.
Info:
Nalimov's 3-4-5 man is 290 files and 7 580 348 624 bytes

Dann Corbit
Posts: 10269
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: Gaviota TBs, compression schemes

Post by Dann Corbit » Thu Dec 31, 2009 1:59 am

ernest wrote:
michiguel wrote:it is smaller than Nalimov's and it fits on an 8 GB USB stick.
Info:
Nalimov's 3-4-5 man is 290 files and 7 580 348 624 bytes
Yes, Miguel gets about the same size using compression method v1:

Code:

Scheme  Algorithm   Size (GiB)  Comp. Time (min)  Dec. Time (min)  Comp. MiB/s  Dec. MiB/s
v1      Zlib-9      7.9         188.0             21.7             3.5          30.3
v2      LZMA-5-4k   6.5         394.0             30.2             1.7          21.7
However, an 8 GB memory stick really holds only about 7.4 GiB, so compression scheme v2 is needed in order to get the data to actually fit on the stick.

alpha123
Posts: 660
Joined: Sat Dec 05, 2009 4:13 am
Location: Colorado, USA

Re: Gaviota TBs, compression schemes

Post by alpha123 » Fri Jan 01, 2010 12:15 am

Info:
Nalimov's 3-4-5 man is 290 files and 7 580 348 624 bytes
However an 8 GB memory stick really has only 7.4 GiB
:shock:

That's so ironic that I actually laughed :lol:. Why is it called an 8GB stick then, and not a 7.4GB stick?!

Peter

Teemu Pudas
Posts: 88
Joined: Wed Mar 25, 2009 11:49 am

Re: Gaviota TBs, compression schemes

Post by Teemu Pudas » Fri Jan 01, 2010 1:13 am

Because hard drive manufacturers are liars. 1GB = 10^9 bytes according to them.

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

Re: Gaviota TBs, compression schemes

Post by michiguel » Fri Jan 01, 2010 1:20 am

alpha123 wrote:
Info:
Nalimov's 3-4-5 man is 290 files and 7 580 348 624 bytes
However an 8 GB memory stick really has only 7.4 GiB
:shock:

That's so ironic that I actually laughed :lol:. Why is it called an 8GB stick then, and not a 7.4GB stick?!

Peter
According to the metric system and international standards:

1000 bytes is a kilobyte,
1000 x 1000 bytes is a megabyte, and
1000 x 1000 x 1000 bytes is a gigabyte.

For convenience, computer scientists call 1024 bytes a kilobyte (which is of course 2^10), so
1024 x 1024 bytes is a megabyte for computer scientists, and
1024 x 1024 x 1024 bytes is a gigabyte = 1,073,741,824 bytes.

They really screwed up giving these the same names as the metric system's mega and giga.

So for the metric system, 8 GB is 8,000,000,000 bytes; for comp. sci., 8 GB is 8,589,934,592 bytes. A true metric 8 GB is therefore 7.45 GB to a computer scientist. In fact, there is now a name for the "comp. sci. GB": the gibibyte (giga binary byte), or GiB. That is what everybody should be using, and it is what I am using on the website.

So companies may be using one or the other definition when they sell stuff. The capacity of my USB stick is "8 GB", but when I look at the capacity in "Properties" it says 7.4.

To be accurate, 8 GB equals 7.45 GiB.

Miguel
Last edited by michiguel on Fri Jan 01, 2010 1:25 am, edited 1 time in total.

Post Reply