public domain mini-LZ library

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

public domain mini-LZ library

Post by mar »

Just finished my very simple LZ compression library in C.
It's public domain (unlicense), so if it might be useful to you, grab it :)
Link: https://github.com/kmar/mlz

Can be used to embed compressed data in executables or simply for experiments.
Since it doesn't use entropy coding, it produces larger data than zlib,
but still the compression ratio seems fine (better than LZF/LZ4/LZO in my tests).

I focused primarily on compression ratio and decompression speed,
but still it can't compete with super-fast codecs like LZ4 or Snappy (because I use bit accumulator to help tag literals/match types).
I only get ~400-500MB/sec raw decompression speed on my i7, YMMV.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: public domain mini-LZ library

Post by mar »

mar wrote:It's public domain (unlicense)
Changed to Boost software license (I've read that PD is not a good idea)
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: public domain mini-LZ library

Post by syzygy »

mar wrote:
mar wrote:It's public domain (unlicense)
Changed to Boost software license (I've read that PD is not a good idea)
But the Boost license seems to require that all derivate works include "the above license grant". So it seems "viral", which is not necessarily bad, but probably not what you intended. Other people cannot simply take the code, modify it to their liking, and include it in their own program. They have to add the Boost license at least to the files that include (remnants of) your code and perhaps to the source code as a whole (the FSF would tell you that the program as a whole becomes a derivative work... the FSF may be wrong, but still).

http://www.boost.org/users/license.html
and the fact that the Boost license is not "viral": if you distribute your own code along with some Boost code, the Boost license applies only to the Boost code (and modified versions thereof); you are free to license your own code under any terms you like.
How far does "modified versions" go? Does it stop at function boundaries? File boundaries?
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: public domain mini-LZ library

Post by mar »

syzygy wrote:But the Boost license seems to require that all derivate works include "the above license grant". So it seems "viral", which is not necessarily bad, but probably not what you intended. Other people cannot simply take the code, modify it to their liking, and include it in their own program. They have to add the Boost license at least to the files that include (remnants of) your code and perhaps to the source code as a whole (the FSF would tell you that the program as a whole becomes a derivative work... the FSF may be wrong, but still).

http://www.boost.org/users/license.html
and the fact that the Boost license is not "viral": if you distribute your own code along with some Boost code, the Boost license applies only to the Boost code (and modified versions thereof); you are free to license your own code under any terms you like.
How far does "modified versions" go? Does it stop at function boundaries? File boundaries?
I hope this is not the case. The license only needs to remain with the library code as I understand it.
I also considered other licenses. BSD/MIT (if I'm not mistaken) require the license to be included even with binary distributions, which is not what I want.
(L)GPL is naturally out of question, that's not a free license at all.
So I wonder what remains, there's also zlib license, Apache and more.
CC0 is similar to Unlicense and I'm not quite sure it can be applied to software.
I'm not a lawyer but I simply want the most permissive license available while still legal...
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: public domain mini-LZ library

Post by syzygy »

How about:
Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: public domain mini-LZ library

Post by syzygy »

Ah, you started with the Unlicense:
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Looks fine to me.

Whether the "dedicate to public domain" clause is really non-revokable is not within the EU is not entirely clear, but it's the best you can do. EU copyright law does not really have a notion of "public domain" (except for works for which copyright has expired).

Another option: http://www.wtfpl.net/about/
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: public domain mini-LZ library

Post by Daniel Shawul »

Martin,
I wrote a custom LZ + Huffman (Deflate) lib for my bitbases with the limited knowledge I have at the time https://github.com/dshawul/compress. Do you think the LZ part of it could be improved for compression size or decompression speed with your library?
Daniel
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: public domain mini-LZ library

Post by mar »

Daniel Shawul wrote:Martin,
I wrote a custom LZ + Huffman (Deflate) lib for my bitbases with the limited knowledge I have at the time https://github.com/dshawul/compress. Do you think the LZ part of it could be improved for compression size or decompression speed with your library?
Daniel
Hi Daniel,
Regarding compression size, you already do lazy matching so my implementation can't beat it as I do the same.
Since your compressor uses entropy coding, it should be superior to my plain LZ in terms of ratio.
I only do plain hash-chain (or hash-list) and my compressor is VERY slow (in part because I use 64k "sliding dictionary").
However since it's block-based, I plan to parallelize the compressor.

Next step could be optimal parsing (using suffix trees - slow & probably patented?) or you may want to look at this: http://fastcompression.blogspot.cz/2011 ... egies.html
My deep lazy matching does about 1% worse than that (compared to my LZ4 implementation vs original LZ4 compressor).
If you want something better than Huffman (still static), you can look at ANS (Jarek Duda) or FSE (two terms for basically the same thing), it's very interesting to say the least (note: I haven't implemented it so can't really compare to Huffman).

Unless you already do, there are two things to improve compression performance (which is not important at all for offline compression anyway):
- when looking for a better (=longer match), first check byte at src[-cur_match_offset+best_len] == src[best_len], this can give a nice gain
- when looking for a lazy match, first check if the tail (=[greedy]best_dist + best_len + 2) is present for MIN_MATCH_LEN before looking for full lazy match

Speaking of decompression performance, your Huffman decoding should be pretty fast already (from what I've seen in your code - you slice it)
In my inflate I use direct lookup table (11 or 10 bits, don't remember) and then I return extra bits. Another possiblity is this (haven't implemented it but Ronald did in Syzygy bases): http://www.eecs.harvard.edu/~michaelm/E210/huffman.pdf
Your bitstream (or bitset in your code) uses accumulator + length (number of bits)
What I do in my "bit accumulator" is I use a special bit flag while pulling bits.
Since my accumulator goes lsb=>msb, I use a "guard" bit.
It's 24 bits so I read 3 bytes and then or 1 << 24, shifting right as I pull bits.
So I can always simply use a bit mask to check how many bits are available (=can be read without reloading accumulator).
Maybe a similar approach could be used in your decoder? (but reversed).

PS It's a bit late so I hope that what I wrote makes sense - just sharing some thoughts.

Oh and speaking of CRC-32, you may want to try slicing which is significantly faster: http://create.stephan-brumme.com/crc32/
For example, just computing Adler32 on uncompressed data in my library eats about 17% of total decompression time (and adler32 is much faster but alas also much worse in terms of quality)
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: public domain mini-LZ library

Post by wgarvin »

By the way, if you haven't come across them before, Charles Bloom has written years worth of good blog posts about LZ parsing. I suggest to start with these two:
http://cbloomrants.blogspot.ca/2012/09/ ... -tree.html
http://cbloomrants.blogspot.ca/2015/03/ ... redux.html
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: public domain mini-LZ library

Post by Rein Halbersma »

syzygy wrote:Ah, you started with the Unlicense:
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Looks fine to me.

Whether the "dedicate to public domain" clause is really non-revokable is not within the EU is not entirely clear, but it's the best you can do. EU copyright law does not really have a notion of "public domain" (except for works for which copyright has expired).

Another option: http://www.wtfpl.net/about/
I would not recommend a modified Boost license (or any other less familiar license) because it increases transaction costs for further development. The whole proliferation of slightly different licenses is a real pain.

You can freely can copy Boost licensed code into your own code. For clarity it is probably easier to keep a separation at the file level (#include makes it identical for the compiler anyway).