Zobrist hashing for text

Cardoso · Post by **Cardoso** » Sun Jan 21, 2018 12:33 am

Who would thought of that?
Apparently someone did:
https://www.codeproject.com/Articles/12 ... on-ZobHash

mar · Post by **mar** » Sun Jan 21, 2018 12:22 pm

Yes zobrist hashing can be used outside of the domain of chess (and has been done before).
But hashing byte by byte can never be fast no matter what you do, this is why CRC is slow like hell even with tables.

Therefore I find it rather dubious considering the guy claims xxHash (easily the fastest and best hash function out there) is slower than his byte-by-byte zobrist hashing of text, but that's obviously nonsense.
xxHash does 32/64 bits at a time (actually 128/256 bit chunks) (fast hashes work this way, Murmur3 is also awesome).

So no idea what he did (most likely something rather suboptimal), but it's obvious his numbers don't make sense.

See here where Yann Collet evaluates performance of his xxHash:
https://github.com/Cyan4973/xxHash

mar · Post by **mar** » Sun Jan 21, 2018 2:21 pm

Here are my results:

Code: Select all

all hashes are 32-bit, no loop unrolling for zobrist/fnv, VS2017 64-bit C++ compiler
avg word len&#58; 9.4 chars
total words&#58; 428395

timings &#40;best of 10&#41;&#58;
xxHash&#58; 5462 us &#40;2% slower than FNV-1a, 23% faster than zobrist&#41;
zobrist&#58; 6743 us &#40;26% slower than FNV-1a, 23% slower than xxHash&#41;
FNV-1a&#58; 5351 us &#40;2% faster than xxHash, 26% faster than zobrist&#41;

zobrist&#58; 14 collisions (*depends on values, I used MT in C++)
xxHash &#40;seed 0&#41;&#58; 14 collisions (*depends on seed, not a typo - different 14&#41;
FNV-1a&#58; 19 collisions

So we can say that zobrist hashing of strings has good quality and is certainly comparable in performance.
BUT the figures the author gives are zobrist being 48% faster than xxHash and 70% faster than fnv-1a, I wonder how he came up with those...

xxHash has a bit of distadvantage of having internal state of 128-bits in this test, but it shines at hashing large buffers

Ras · Post by **Ras** » Sun Jan 21, 2018 11:53 pm

mar wrote:But hashing byte by byte can never be fast no matter what you do, this is why CRC is slow like hell even with tables.

There are a number of processors out there that support CRC in hardware, and if you can leverage that, it should be somewhat faster. Besides, CRC is not really designed for hashing, but for detecting bit errors. Though in practice, it performs indeed reasonably well as hash function. I'm using a combined CRC-40 for the opening book to identify positions.

Therefore I find it rather dubious considering the guy claims xxHash (easily the fastest and best hash function out there) is slower than his byte-by-byte zobrist hashing of text

Taking a look at the xxHash sources, there is some unrolling that kicks in only for len>=16, but many English words are a lot shorter than that. So I'd expect xxHash to take a performance hit due to short inputs in this application, which are not what the xxHash author has used in his benchmarking.

The next thing is that the Zobrist guy has given benchmarks with and without string conversion to byte arrays, and the conversion itself seems to impact performance negatively. "Zobhash" is given with and without, and without, it is 5 times faster. The speedup for "GetHashCode .net" is even more - and the number of collisions also goes down without conversion. I'm not sure what this conversion actually does, but it seems to have a major influence to the extent that it dominates the execution time.

Zobrist hashing for text

Zobrist hashing for text

Re: Zobrist hashing for text

Re: Zobrist hashing for text

Re: Zobrist hashing for text