chg: [blog] fixed typos

Signed-off-by: qjerome <qjerome@rawsec.lu>
pull/100/head
qjerome 2024-03-25 12:49:01 +01:00
parent 1f371a5e2c
commit 07edfc3507
1 changed file with 2 additions and 2 deletions

@@ -52,7 +52,7 @@ Another surprising thing is the very high value of `m` which is `uint64::MAX - 5`
Everything was going fine until we looked for further optimizations to apply.
-Digging into other datastructures (like hash tables) implementations, we noticed a nice optimization can be done on the bit **index generation algorithm**. This optimization consists of taking a bitset (holding all the bits of the filter) with a size being a **power of two**. If we do so, instead of computing bit index with a **modulo** operation, we can compute bit index with a **bit and** operation. This comes from the following nice property, for any `m` being a power of two `i % m == i & (m - 1)`. It is important to understand that this particular optimization will allow us to win on CPU cycles as less instructions are needed to compute the bit indexes.
+Digging into the implementations of other data structures (like hash tables), we noticed that a nice optimization can be applied to the bit **index generation algorithm**. This optimization consists of giving the bitset (holding all the bits of the filter) a size that is a **power of two**. If we do so, instead of computing a bit index with a **modulo** operation, we can compute it with a **bitwise AND** operation. This comes from the following nice property: for any `m` being a power of two, `i % m == i & (m - 1)`. It is important to understand that this particular optimization saves CPU cycles, as fewer instructions are needed to compute the bit indexes.
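To make the identity concrete, here is a minimal, self-contained Rust sketch (the hash values are made up for illustration) showing that `i % m` and `i & (m - 1)` agree whenever `m` is a power of two:

```rust
fn main() {
    // `m` must be a power of two for the identity to hold
    let m: u64 = 1 << 20; // example bitset size: 2^20 bits
    let mask = m - 1; // low bits all set: 0b0111...1

    // made-up hash values standing in for real digest outputs
    for i in [42u64, 0xdead_beef, u64::MAX - 7] {
        let by_mod = i % m; // classic bit index: integer remainder
        let by_and = i & mask; // optimized bit index: a single AND
        assert_eq!(by_mod, by_and);
    }
    println!("i % m == i & (m - 1) holds for all tested values");
}
```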
Applying this to a Bloom filter is fairly trivial: when computing the desired size (in bits) of the filter, we just need to round up to **the next power of two** (so as not to penalize the false positive probability). When implementing this in our current Rust port of the [bloom](https://github.com/DCSO/bloom) library, we realized the original implementation was broken. When the bit size of the filter becomes a power of two, the **real false positive probability** of the filter literally explodes (**x20** in some cases). Without going further into the details, we believe the weakness comes from the **bit index generation** algorithm described above. A [GitHub issue](https://github.com/DCSO/bloom/issues/19) has been opened in the original project to track it.
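The sizing step is easy to sketch. Assuming the classical formula `m = -n * ln(p) / ln(2)^2` for the optimal bit count (a textbook result; poppy's actual code may differ), rounding up to the next power of two looks like this:

```rust
/// Sketch of the sizing step described above: compute the classical optimal
/// bit size for `n` items at false positive probability `p`, then round it up
/// to the next power of two so the AND trick can be used. Rounding up only
/// adds bits, so it cannot worsen the theoretical false positive probability.
fn bit_size_pow2(n: u64, p: f64) -> u64 {
    let m = -(n as f64) * p.ln() / core::f64::consts::LN_2.powi(2);
    (m.ceil() as u64).next_power_of_two()
}

fn main() {
    // 1 million entries at a 0.1% false positive probability
    let m = bit_size_pow2(1_000_000, 0.001);
    assert!(m.is_power_of_two());
    println!("filter size: {m} bits"); // 16777216 bits (2^24)
}
```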
@@ -124,7 +124,7 @@ poppy -j 0 create -p 0.001 /path/to/output/filter.pop /path/to/dataset/*.txt
### Lessons learned
-Bloom filters belongs to the category of probabilistic datastructures which means that many non trivial factors might alter its behavior. Maybe the more confusing aspect about such a structure is that what you get, in term of false positive probability, is not necessarily what you expect. Over the course of this implementation, we had to do many adjustments to limit side effects (such as the one breaking the original implementation). So one wanting to implement it's own Bloom filter really has to pay attention to the quality of the **bit index generation algorithm** to make sure it does not create unexpected collisions. The other important parameters is the **hashing function** used. This one needs to have a low collision rate regardless of the input data. To make sure everything works as expected, thorough testing mixing filter properties and data types is mandatory.
+Bloom filters belong to the category of probabilistic data structures, which means that many non-trivial factors might alter their behavior. Maybe the most confusing aspect of such a structure is that what you get, in terms of false positive probability, is not necessarily what you expect. Over the course of this implementation, we had to make many adjustments to limit side effects (such as the one breaking the original implementation). So anyone wanting to implement their own Bloom filter really has to pay attention to the quality of the **bit index generation algorithm** to make sure it does not create unexpected collisions. The other important parameter is the **hashing function** used. It needs to have a low collision rate regardless of the input data. To make sure everything works as expected, thorough testing mixing filter properties and data types is mandatory.
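As a minimal sketch of that kind of testing (the helper and its parameters are illustrative, not poppy's actual API), one can measure the observed false positive rate of any filter exposed as a membership closure and compare it against the configured probability:

```rust
/// Queries `trials` keys that were never inserted and returns the observed
/// false positive rate; `contains` stands in for the filter under test.
fn observed_fpp(contains: impl Fn(&[u8]) -> bool, trials: u32) -> f64 {
    let mut false_positives = 0u32;
    for i in 0..trials {
        // probe keys crafted to be disjoint from anything inserted
        let probe = format!("never-inserted-{i}");
        if contains(probe.as_bytes()) {
            false_positives += 1;
        }
    }
    f64::from(false_positives) / f64::from(trials)
}

fn main() {
    // stand-in filter that never matches, so the observed rate must be 0.0;
    // a real test would plug in the Bloom filter under test here
    let fpp = observed_fpp(|_| false, 100_000);
    assert!(fpp <= 0.001);
    println!("observed false positive rate: {fpp}");
}
```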
### Future work