I'm not even sure about that. I see little if any logic in Steven's deduction.mvk wrote:This is only the case if you compare each new hash with every previously generated hash.sje wrote:How many bits should a hash signature have?
With a 64 bit signature, the program will encounter a false positive about once every 2^32 comparisons, or about once every seven minutes. That's bad.
He seems to argue that the probability that two randomly selected strings of N bits long are identical is 1 in 2^(N/2). The reasoning behind this seems to be that you can expect N/2 bits to be identical anyway, so the probability that the other N/2 bits coincide "must be" 2^(N/2). This is very wrong.
The probability that two randomly selected strings of N bits long are identical is 1 in 2^N. Not too much to explain about that, but maybe an example will help. Suppose the first string happens to be 0000...0 (N bits). What is the probability that the second strings will also be 0000...0? That's easy: 2^N. Now ask yourself whether the answer is different if the first string is another bit pattern. (Hopefully you'll answer "no".)
Just for fun: the probability that k bits coincide is Binomial(k,N) / 2^N.
The expected number of coinciding bits is indeed N/2: the ith bit coincides half the time, so E(# of coinciding bits) = E(\sum_i (ith bit coincides ? 1 : 0)) = \sum_i E(ith bit coincides ? 1 : 0) = \sum_i (1/2) = N / 2.
Indeed, the mathematics is very simple if you just assume that all table entries are in use (which is the normal case).But once the table is saturated, that is not what happens because you lose old hashes, and with that the opportunity to find collisions. From that moment onwards the ratio of false hits scales only with the number of non-addressing bits minus the binary logarithm of the bucket size. This is easy to see intuitively if you start to imagine a table size of 1.
With a bucket size of 1, you can expect 1 false hit every 2^N probes that are supposed not to hit, where N is the number of bits in the hash key that were not used to identify the bucket.
With a bucket size of K, you can expect K false hits every 2^N probes.