Don Dailey

Joined: 29 Apr 2008
Posts: 4323

Post subject: Re: how to measure frequency of hash collisions.    Posted: Sun Jun 17, 2012 1:26 am

Daniel Shawul wrote:
 Don wrote: You can estimate the collision rate by setting aside N bits for checking. So if your key is 64 bits, pretend it's only 60 bits and 4 bits are for collision testing. If the 60 bits match but the 4 bits do not, it was a collision. You can extrapolate to get the 64-bit collision rate estimate - each time you add a bit you can expect half the number of collisions. Don

But that won't work because in a hash collision, the hash signature (all 64 bits) is the same for two completely different positions... You need a key from another sequence of random numbers (be it from the same or a different hash function). Am I missing something?

Daniel,

I'm going to do a reset here and answer this post again. I think I understand what caused the misunderstanding. I actually am pretty sure we both misunderstood each other.

My idea is probably a lot simpler than you assumed it to be. Essentially you have 64 hash signatures and 64 random number tables. Each signature is 1 bit wide and each table holds 1-bit entries. Thanks to 64-bit processors they can all be computed in parallel, but that is just a detail.
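To make the "64 one-bit signatures" view concrete, here is a minimal sketch, assuming an ordinary Zobrist-style XOR hash. The feature count, the seed and all function names are made up for illustration; they are not from any particular engine. It shows that XORing 64-bit table entries gives exactly the same result as computing 64 separate 1-bit signatures from 64 separate 1-bit tables.

```python
import random

NUM_FEATURES = 16            # hypothetical number of (piece, square) features
rng = random.Random(1)       # fixed seed so the sketch is repeatable

# One 64-bit random entry per feature, as in ordinary Zobrist hashing.
table64 = [rng.getrandbits(64) for _ in range(NUM_FEATURES)]

# The same data viewed as 64 separate 1-bit tables:
# bit_tables[b][f] is bit b of the entry for feature f.
bit_tables = [[(entry >> b) & 1 for entry in table64] for b in range(64)]

def signature64(features):
    """XOR the 64-bit entries of every feature present in the position."""
    key = 0
    for f in features:
        key ^= table64[f]
    return key

def signature_bit(features, b):
    """The 1-bit signature computed from 1-bit table number b alone."""
    bit = 0
    for f in features:
        bit ^= bit_tables[b][f]
    return bit

def assemble(features):
    """Pack the 64 independent 1-bit signatures back into one integer."""
    key = 0
    for b in range(64):
        key |= signature_bit(features, b) << b
    return key

position = [0, 3, 7, 12]     # hypothetical position: a list of feature indices
assert signature64(position) == assemble(position)
```

The 64-bit XOR is just the parallel version of the bit-by-bit computation, which is why any subset of the bits can be treated as its own signature.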

You can compare each of these 64 signatures against the corresponding signature of another position. If the least significant bit is the same, it might be a match or it may be a collision. You can use the second bit to test whether it's a collision, with a zero percent false positive rate but a 50% false negative rate. In other words, if the second bit does not match then the match on the first bit was a collision for sure. The more bits you use to test, the more accurate your assessment of whether it's a collision.
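As a rough worked version of that argument, assume the check bits are independent and uniformly random (an idealization of the random tables). With k check bits, a real collision slips through only if all k bits happen to agree:

```latex
P(\text{missed} \mid \text{collision}) = 2^{-k},
\qquad
P(\text{detected} \mid \text{collision}) = 1 - 2^{-k}.
% k = 1: detected 1/2 of the time (the 50% false negative case)
% k = 4: detected 15/16 of the time
```

A mismatch in any check bit can never happen for identical positions, which is why the false positive rate is zero.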

You could take 60 of those hash signatures, and if they all agree with the 60 signatures of the position being tested against, they probably represent the same position. But you can use 4 more of these independent signatures to see if they agree too. If they don't, you have detected a collision with 100% certainty. By using 4 of these signatures you will detect real collisions 15 out of 16 times. But once in a while one will slip through undetected. If you want better detection then you must use more bits.
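The measurement itself can be sketched as a small simulation. This is only an illustration under stated assumptions: random 64-bit integers stand in for the keys of distinct positions, the stored signature is shrunk to 6 bits (instead of 60) so that collisions actually show up in a short run, and the table is a plain always-replace dict. The names `INDEX_BITS`, `SIG_BITS`, `CHECK_BITS` and `run` are hypothetical.

```python
import random

INDEX_BITS = 10   # low bits of the key choose the table slot
SIG_BITS = 6      # stored signature; stands in for the 60 bits, kept tiny on purpose
CHECK_BITS = 4    # extra bits used only to test for collisions

def run(num_probes=200_000, seed=2012):
    """Return (true_collisions, detected) for a stream of random keys."""
    rng = random.Random(seed)
    table = {}
    true_collisions = detected = 0
    for _ in range(num_probes):
        key = rng.getrandbits(64)  # each draw stands for a distinct position
        slot = key & ((1 << INDEX_BITS) - 1)
        sig = (key >> INDEX_BITS) & ((1 << SIG_BITS) - 1)
        check = (key >> (INDEX_BITS + SIG_BITS)) & ((1 << CHECK_BITS) - 1)
        entry = table.get(slot)
        if entry is not None:
            old_key, old_sig, old_check = entry
            # Signature matches but the positions differ: a real collision.
            if old_sig == sig and old_key != key:
                true_collisions += 1
                # The check bits disagree: the collision is caught for certain.
                if old_check != check:
                    detected += 1
        table[slot] = (key, sig, check)  # always-replace scheme
    return true_collisions, detected

true_collisions, detected = run()
# Correct for the 1-in-16 collisions that the 4 check bits cannot see.
estimated = detected * 2**CHECK_BITS / (2**CHECK_BITS - 1)
# Extrapolate: each bit returned to the signature halves the collisions.
estimated_wide = estimated / 2**CHECK_BITS
print(true_collisions, detected, round(estimated), round(estimated_wide, 1))
```

In a real engine you cannot count `true_collisions` directly; only `detected` is observable, and the last two lines are the extrapolation step. With 4 check bits the estimate for the full-width key works out to `detected / 15`.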

You did a total context switch, which confused me because I was talking about this and you started talking about using separate hash functions. I may have been referring to my 64 hash signatures as 64 different hash functions, but the function is really the same XOR function operating on 64 different sets of data, so this was probably confusing to you. I was careful this time to call them "signatures".

Does it make sense to you now?
