Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: CountingBloomFilter should not use hash1 membership if enable_repeat_insert=true #4

Closed
glmcdona opened this issue Feb 12, 2023 · 3 comments

Comments

@glmcdona
Copy link

glmcdona commented Feb 12, 2023

Found a tricky bug in my code which I think is caused by this.

CountingBloomFilter is using k hashes to track the count, but is also adding one more hash to test membership (hash1). Even if enable_repeat_insert=true, this count is still created and leveraged. I'm not sure if this additional hash is typical design?

Maybe when enable_repeat_insert=true, it should not be using the hash1 anywhere and should use the count>0 as the membership test only.

In my weird case I believe the above is impacting me because:

  1. Test membership of key 'abc', it thinks correctly that it is present because it has non-zero count, but is actually an error collision
  2. I loop decrement until 'abc' is not present anymore to get the count
  3. Now I loop increment 'abc' until it's back to it's original count

Theoretically even with collisions I think this shouldn't impact any of the underlying numbers. But I think it is, because 'abc' was not actually present in the CountingBloomFilter (it was a collision case), so when it processes the first increment it'll add a presence hash. This then breaks some other numbers in my bloom filter causing the error.

Also the addition of hash1 for presence tracking adds to the error rate unnecessarily in the enable_repeat_insert=true scenario.

@yankun1992
Copy link
Owner

Actually, the CountingBloomFilter in fastbloom is not using one more hash to test membership. Whether enable_repeat_insert is true or not, it using k hashes to track the k slots (every slot is a four bit counter) and use all the slots counter>0 as membership test. The four bit counter is use for remove element from CountingBloomFilter.

For example, we insert two element 'hello' 'world' to CountingBloomFilter:

cbf.add('hello') #  If the k hashes slot index is [234, 5678, 23451, 47684, 109871]
cbf.add('world') #  If the k hashes slot index is [234, 1089, 5678, 41320, 110956]

the collision indices is 234 and 5678, so the counter value in these indices is 2, and the others are 1. If remove 'world' from the CountingBloomFilter, the counter value at indices 234 and 5678 will become 1 again, and it not impact the membership test of 'hello' ! So for #3 , i think it may return the minimum value for all k index counter .

For this bug, can you show me for code for recurrence?

@yankun1992
Copy link
Owner

For this example, if enable_repeat_insert=true and if we insert 'hello' twice, the counter will be like:

cbf.add('hello') #  If the k hashes slot index is [234, 5678, 23451, 47684, 109871]
cbf.add('hello') 
cbf.add('world') #  If the k hashes slot index is [234, 1089, 5678, 41320, 110956]

234:2 5678:2 23451:2 47684:2 109871:2
and after insert 'world', the counter will be
234:3 5678:3 23451:2 47684:2 109871:2 1089:1 41320:1 110956:1

@glmcdona
Copy link
Author

Got it, my bad! I mistook this line which uses the direct hash1 key to be the feature that enables the enable_repeat_insert capability:
https://github.com/yankun1992/fastbloom/blob/main/fastbloom-rs/src/bloom.rs#L471

I think I've figured out the reason for the bug in my code now - and it's normal expected counting bloom filter behaviour. Going to close this bug now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants