hgm wrote:Sven Schüle wrote:If I set K=24 (which would correlate to 256MB hash and 16 byte entry size, just as an example) for a collision probability of 1/(2^40) then within 2^40 probes (=1.1e12, about 12.7 days of searching with 1 Mnodes/sec if probing would occur at 100% of visited nodes, so realistically much longer) the probability for one or more collisions in that time would be about 0.632.

(EDITED the last number, initially I had (1 - that value). )

It _is_ a variation of the birthday paradox.

Does my calculation make sense to you?

Yes, with 40 bits signature you would have a collission probability of 2^-40 (per comparison). That is what I was saying all along. The signature length determines the probability of a false match.

If you search 4 entries per probe (e.g. when you do shallowest-of-four replacement) the probability becomes 4 times larger.

The table size does not come in, other than that it might affect the effective signature length (i.e. discounting the redundant bits).

Your "yes" in conjunction with what you are writing afterwards is more like a "no" since you disagree to me

But I think we are saying the same things, only with different words.

My definitions:

K := number of address bits

S := log(2) of number of slots per address (e.g. S=2 for four slots)

E := log(2) of entry size in bytes (to simplify, let E be an integer)

N := log(2) of table size in bytes

N = K+S+E, i.e. table size = 2^K * 2^S * 2^E

L := stored signature length

L_eff := effective signature length = min(L, 64-K)

Following my previous example of a 256 MB hash table with 16 bytes per entry we would have N=28, E=4 and with S=2 finally K=(28-4-2)=22.

a) The case you are talking about is:

L < 64-K => L_eff = L

b) The case I was talking about is:

L >= 64-K => L_eff = 64-K

a) Your case has no redundant signature bits so in that case it is right that the table size has no effect on the probability of a false match. This is true as long the condition above holds, it becomes false by sufficiently increasing the table size until L_eff = 64-K.

In your case it can occur, in theory, that a matching entry is "found" even though it was initially stored for a different hash key. That happens since some bits are dropped when storing, and can't be used any longer to reject the entry. But that probability is small. For D dropped bits (D := 64-N-K) it is something like (2^D - 1)/(2^64). So the expected case is that this does not occur.

Given that assumption and the equation L_eff = L that I gave above, it is clear that L_eff is independent from N at least, as you said. I have to think about the question whether the probability for a false match depends on S, though. Maybe you are right since my "number of probes" in my previous post is not identical to the number of signature comparisons (we can have 1,2,3, or 4 for 4 slots). OTOH even with 4 comparisons per probe I would either have 0 or 1 false matches. The first match terminates the probe. So I am not sure about it. It is possible that you are right but I did not get to the deciding point yet.

To clear this up it will also be necessary to define which change of parameters we intend to look at:

- increase S (by the programmer) but keep N constant (by the environment), thereby reducing K?

- increase S (by the programmer) and increase N by the same amount as S (by the environment), thereby keeping K constant?

b) My case has either zero or more redundant signature bits but no "dropped bits" (D=0). K directly depends on both N and S, and we have:

L_eff = 64-(N-S-E)

so here the table size has an influence.

Could you please verify my theory, and try to explain your theory about case a) based on the terms I introduced above so that we are talking about the same thing? That would be helpful.

Sven