Bug: CountingBloomFilter should not use hash1 membership if enable_repeat_insert=true #4
Comments
Actually, the CountingBloomFilter in fastbloom does not use an extra hash to test membership. Whether enable_repeat_insert is true or not, it uses k hashes to address k slots (each slot is a four-bit counter) and treats an element as a member when the counters in all k slots are > 0. The four-bit counter is used for ...
For example, suppose we insert two elements, 'hello' and 'world', into the CountingBloomFilter:
cbf.add('hello')  # suppose the k hash slot indices are [234, 5678, 23451, 47684, 109871]
cbf.add('world')  # suppose the k hash slot indices are [234, 1089, 5678, 41320, 110956]
The colliding indices are 234 and 5678, so the counters at those indices are 2 and the others are 1. If 'world' is removed from the CountingBloomFilter, the counters at indices 234 and 5678 drop back to 1, which does not affect the membership test for 'hello'! So for #3, I think it could return the minimum value across all k index counters. For this bug, can you show me code to reproduce it?
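To make the walkthrough above concrete, here is a minimal toy sketch in plain Python. It is not fastbloom's implementation: `ToyCountingFilter` is a hypothetical class, the slot indices are copied from the comment instead of being derived from real hashes, and saturating the counter at 15 is only an assumption about how a four-bit counter might behave.

```python
# Slot indices taken from the example above; a real filter would compute
# them from k hash functions. Passing them in directly is an assumption
# made purely for illustration.
HELLO = [234, 5678, 23451, 47684, 109871]
WORLD = [234, 1089, 5678, 41320, 110956]

MAX_COUNT = 15  # largest value a four-bit counter can hold


class ToyCountingFilter:
    """Hypothetical toy model of a counting Bloom filter (not fastbloom's code)."""

    def __init__(self):
        self.counters = {}  # slot index -> counter value

    def add(self, slots):
        # Increment every slot, saturating at the four-bit maximum (assumed).
        for i in slots:
            self.counters[i] = min(self.counters.get(i, 0) + 1, MAX_COUNT)

    def remove(self, slots):
        # Decrement every slot that is currently non-zero.
        for i in slots:
            if self.counters.get(i, 0) > 0:
                self.counters[i] -= 1

    def contains(self, slots):
        # Membership test: all k counters must be greater than zero.
        return all(self.counters.get(i, 0) > 0 for i in slots)


cbf = ToyCountingFilter()
cbf.add(HELLO)
cbf.add(WORLD)
after_both = dict(cbf.counters)  # snapshot: 234 and 5678 are 2, the rest 1

cbf.remove(WORLD)
print(cbf.contains(HELLO))  # True: every 'hello' slot is still > 0
print(cbf.contains(WORLD))  # False: slot 1089 dropped back to 0
```

The shared slots 234 and 5678 go from 2 back to 1 when 'world' is removed, so 'hello' still passes the all-counters-greater-than-zero test.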
For this example, if enable_repeat_insert=true and we insert 'hello' twice, the counters will be:
234:2 5678:2 23451:2 47684:2 109871:2
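The repeated-insert case, together with the "minimum across all k counters" idea mentioned for #3, can be sketched as follows. This is a hedged illustration only: `insert` and `estimate_count` are hypothetical helper names, not fastbloom functions, and the slot indices are again the ones assumed in this thread.

```python
from collections import Counter

# Slot indices assumed in this thread; real indices would come from k hashes.
HELLO_SLOTS = [234, 5678, 23451, 47684, 109871]
WORLD_SLOTS = [234, 1089, 5678, 41320, 110956]


def insert(counters, slots):
    # With repeated inserts allowed, every insert increments all k slots.
    for i in slots:
        counters[i] += 1


def estimate_count(counters, slots):
    # Hypothetical estimate: the smallest counter among the k slots.
    # Collisions can only inflate counters, so this is an upper bound
    # on the true number of inserts for the element.
    return min(counters[i] for i in slots)


counters = Counter()
insert(counters, HELLO_SLOTS)
insert(counters, HELLO_SLOTS)
hello_only = {i: counters[i] for i in HELLO_SLOTS}
print(hello_only)  # every 'hello' slot holds 2

insert(counters, WORLD_SLOTS)
print(estimate_count(counters, HELLO_SLOTS))  # 2, despite collisions at 234 and 5678
print(estimate_count(counters, WORLD_SLOTS))  # 1
```

Taking the minimum ignores the inflated shared slots (234 and 5678 hold 3 here), which is why the estimate for 'hello' stays at 2 in this particular example.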
Got it, my bad! I mistook this line which uses the direct ... I think I've figured out the reason for the bug in my code now, and it's normal, expected counting Bloom filter behaviour. Going to close this bug now.
Found a tricky bug in my code which I think is caused by this.
CountingBloomFilter uses k hashes to track the count, but it also adds one more hash (hash1) to test membership. Even if enable_repeat_insert=true, this is still created and used. I'm not sure whether this additional hash is a typical design.
Maybe when enable_repeat_insert=true, it should not use hash1 anywhere and should rely on count > 0 as the only membership test.
In my weird case I believe the above is impacting me because:
Theoretically, even with collisions, I think this shouldn't affect any of the underlying numbers. But I think it does, because 'abc' was not actually present in the CountingBloomFilter (it was a collision case), so when the first increment is processed, a presence hash is added. This then breaks some other numbers in my Bloom filter, causing the error.
Also, adding hash1 for presence tracking increases the error rate unnecessarily in the enable_repeat_insert=true scenario.