Compare commits

...

5 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Alexandre Dulaunoy | 309222b6b9 | Merge pull request #100 from qjerome/review: last review | 2024-03-25 13:30:25 +01:00 |
| qjerome | 07edfc3507 | chg: [blog] fixed typos (Signed-off-by: qjerome <qjerome@rawsec.lu>) | 2024-03-25 12:54:53 +01:00 |
| qjerome | 1f371a5e2c | chg: [blog] last review before publishing (Signed-off-by: qjerome <qjerome@rawsec.lu>) | 2024-03-25 12:54:49 +01:00 |
| Alexandre Dulaunoy | e2b045d05e | chg: [blog] fixed | 2024-03-25 11:16:31 +01:00 |
| Alexandre Dulaunoy | 983fd8969c | chg: [image] image resized | 2024-03-25 11:00:49 +01:00 |
2 changed files with 13 additions and 11 deletions


@@ -12,7 +12,7 @@ banner: /img/blog/poppy/2.png
At [CIRCL](https://www.circl.lu) we regularly use Bloom filters for some of our use cases, especially in digital forensics, such as providing a small, fast and shareable caching mechanism for the [Hashlookup](https://hashlookup.io/) database which can be used by incident responders.
We initially worked with an existing great project, [bloom](https://github.com/DCSO/bloom) from [DCSO](https://github.com/DCSO), as it provided convenient features we were looking for, such as data serialization. To better suit our growing Bloom filter needs, we decided to re-implement the [bloom project](https://github.com/DCSO/bloom) in Rust, in a project called [Poppy](https://github.com/hashlookup/poppy). Over the course of the re-implementation, we noticed some challenges for our use cases with the original implementation. Therefore, we decided to move towards a new implementation, which we detail in this blog post.
So that readers fully enjoy the content of this blog post, we highly recommend familiarizing themselves with the classical [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) implementation.
@@ -52,13 +52,13 @@ Another surprising thing is the very high value of `m` which is `uint64::MAX - 5
Everything was going fine until we looked for further optimizations to apply.
Digging into the implementations of other data structures (like hash tables), we noticed that a nice optimization can be applied to the bit **index generation algorithm**. This optimization consists of taking a bitset (holding all the bits of the filter) whose size is a **power of two**. If we do so, instead of computing a bit index with a **modulo** operation, we can compute it with a **bitwise and** operation. This comes from the following nice property: for any `m` being a power of two, `i % m == i & (m - 1)`. It is important to understand that this particular optimization saves CPU cycles, as fewer instructions are needed to compute the bit indexes.
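As a quick illustration of this property (a standalone sketch with arbitrary values, not Poppy code):

```rust
fn main() {
    let m: u64 = 1 << 20; // a power-of-two filter size (1,048,576 bits)
    let mask = m - 1;

    for i in [0u64, 12_345, 987_654_321, u64::MAX] {
        // for a power-of-two m, the bitwise AND gives the same index as modulo
        assert_eq!(i % m, i & mask);
    }
    println!("i % m == i & (m - 1) holds for any power-of-two m");
}
```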
Applying this to a Bloom filter is fairly trivial: when computing the desired size (in bits) of the filter, we just need to take **the next power of two** (so as not to penalize the false positive probability). When implementing this in our Rust port of [bloom](https://github.com/DCSO/bloom), we realized the implementation was broken. When the bit size of the filter becomes a power of two, the **real false positive probability** of the filter literally explodes (**x20** in some cases). Without going further into the details, we believe the weakness comes from the **bit index generation** algorithm described above. A [GitHub issue](https://github.com/DCSO/bloom/issues/19) has been opened in the original project to track this issue.
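For context, this is roughly what the sizing step looks like, using the textbook formulas `m = -n·ln(p) / ln(2)²` and `k = (m/n)·ln(2)`; the function names below are ours for illustration, not Poppy's API:

```rust
// Textbook Bloom filter sizing: n = expected number of items,
// p = target false positive probability. Illustrative, not Poppy's API.
fn optimal_bits(n: u64, p: f64) -> u64 {
    let ln2 = std::f64::consts::LN_2;
    (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as u64
}

fn optimal_hashes(m: u64, n: u64) -> u64 {
    ((m as f64 / n as f64) * std::f64::consts::LN_2).round().max(1.0) as u64
}

fn main() {
    let (n, p) = (1_000_000u64, 0.001);
    let m = optimal_bits(n, p);
    let k = optimal_hashes(m, n);
    // the speed-optimized variant pads the bit size to the next power of two
    println!("m = {m} bits, k = {k}, padded m = {} bits", m.next_power_of_two());
}
```

For example, with n = 1,000,000 and p = 0.001 this yields m ≈ 14.4M bits (~1.7 MiB) and k = 10; the power-of-two variant pads m up to 2^24 bits (2 MiB).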
We are now quite embarrassed by this issue, since fixing it requires changing the **bit index generation algorithm**. Such a change would break compatibility between versions before and after the fix.
This was not the only challenge we had to deal with during this optimization attempt, but the following one is not directly related to the current implementation. Optimizing a filter with a **power of two bit size** is not ideal, as in the worst case we need to **double** the size of the filter, which is an **exponential growth** issue. This fact might be negligible when dealing with small filters, but not so much for big ones, such as the one (currently **~700MB**) generated to hold the full [Hashlookup](https://hashlookup.io/) database. In this particular example, we would need to increase the size of the filter to **1GB** to benefit from such a speed optimization.
Given those conditions, we had a pretty solid motivation to move towards an improved version of the format, library and tools.
@@ -74,22 +74,24 @@ In spite of its advantages, this implementation also has some cons:
* the minimal size of the filter is always **4096 bytes**
* we always need to retrieve one filter from memory for a lookup
It is now time to address the main problem we found in [bloom](https://github.com/DCSO/bloom), namely the **bit index algorithm**. To address this, we opted for [double hashing](https://en.wikipedia.org/wiki/Double_hashing), a more traditional approach seen in other **Bloom filter** implementations. The only freedom taken in this regard is the **hashing function** used. The original implementation uses [fnv1](https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function), which is rather easy to implement but not well suited to hashing long strings: FNV-family algorithms process bytes one by one, which hurts hashing performance on large inputs. After several benchmarks, we decided to use [wyhash](https://github.com/wangyi-fudan/wyhash) for the following reasons (the resulting index generation is sketched right after this list):
* one of the best performers in our benchmark (maybe for a later blog post); one can also look at this other [hash function benchmark](https://github.com/tkaitchuck/aHash?tab=readme-ov-file#comparison-with-other-hashers) comparing several algorithms
* portable between CPU architectures
* implemented in other languages (important if this work needs to be ported)
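To make this concrete, here is a minimal sketch of double-hashing index generation; std's `DefaultHasher` stands in for wyhash, and the seeds and names are ours, so this illustrates the technique rather than Poppy's exact code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for wyhash: hash `data` under a seed (illustrative only).
fn seeded_hash(data: &[u8], seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    data.hash(&mut h);
    h.finish()
}

// Double hashing: the k bit indexes are derived from only two base hashes,
// g_i(x) = (h1(x) + i * h2(x)) mod m, instead of k independent hash calls.
fn bit_indexes(data: &[u8], k: u64, m: u64) -> Vec<u64> {
    let h1 = seeded_hash(data, 0x9E37_79B9_7F4A_7C15);
    let h2 = seeded_hash(data, 0xD6E8_FEB8_6659_FD93) | 1; // odd step value
    (0..k).map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % m).collect()
}

fn main() {
    // 7 bit positions in a filter of 2^20 bits for the item "hello"
    println!("{:?}", bit_indexes(b"hello", 7, 1 << 20));
}
```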
In the spirit of providing an improved implementation, we explored some further optimizations applicable to this specific structure.
1. one can increase throughput (at the cost of space) by choosing a **table size** (cf. structure drawing) that is a **power of two**
2. a **trade-off optimization** (a small space cost in exchange for some speed) by having a pre-filter bitset keeping track of inserted hashes (sketched below)
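As a rough illustration of the second idea (our own sketch of a pre-filter bitset, not necessarily Poppy's exact layout): a small bitset, sized to stay cache-resident, is checked before touching the larger filter, so lookups for items that were never inserted can often return early:

```rust
// Sketch of a pre-filter bitset guarding a larger filter (illustrative).
struct PreFiltered {
    pre: Vec<u64>, // small bitset intended to stay in CPU cache
    mask: u64,     // pre-filter bit count is a power of two: mask = bits - 1
    // ... the main (bigger) filter would live here ...
}

impl PreFiltered {
    fn pre_insert(&mut self, hash: u64) {
        let bit = hash & self.mask;
        self.pre[(bit / 64) as usize] |= 1 << (bit % 64);
    }

    // If the pre-filter bit is unset, the item was certainly never inserted,
    // so the more expensive main-filter lookup can be skipped entirely.
    fn pre_contains(&self, hash: u64) -> bool {
        let bit = hash & self.mask;
        self.pre[(bit / 64) as usize] & (1 << (bit % 64)) != 0
    }
}

fn main() {
    // 65,536-bit pre-filter (8 KiB): 1024 u64 words
    let mut f = PreFiltered { pre: vec![0u64; 1024], mask: 65_535 };
    f.pre_insert(0xdead_beef);
    assert!(f.pre_contains(0xdead_beef));
    assert!(!f.pre_contains(0xcafe_babe)); // different bit, early reject
}
```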
## Benchmarks
In order to evaluate our modifications and compare them with the previous implementation, we ran some benchmarks on a common dataset. It is worth noting that benchmarking against the [DCSO implementation](https://github.com/DCSO/bloom) (i.e. v1) has been done fully in Rust. Hereafter, one can find information about our evaluation dataset.
**Dataset size**: varies between **10MB** and **7GB**
**Data type**: sha1 strings (40 bytes wide)
**False positive probability**: 0.001 (0.1%)
![](/img/blog/poppy/2.png)
@@ -102,7 +104,7 @@ On the above graph, zooming on the start of the curves, we notice an interesting
![](/img/blog/poppy/4.png)
On the plot, we can observe that both implementations (in their unoptimized forms) have the same size. We can clearly see the impact of aligning the filter size on the **next power of two** done by the speed-optimized variant. For the trade-off optimization, we can observe that the size grows more linearly, making it a reasonable default choice.
So that [Poppy](https://github.com/hashlookup/poppy/) users can compare and choose the best settings for their use case, we integrated a specific **bench** command. This special command allows users to assess the speed of the filter, but also verifies that the **false positive probability** of the filter matches the expected one.
@@ -122,7 +124,7 @@ poppy -j 0 create -p 0.001 /path/to/output/filter.pop /path/to/dataset/*.txt
### Lessons learned
Bloom filters belong to the category of probabilistic data structures, which means that many non-trivial factors might alter their behavior. Maybe the most confusing aspect of such a structure is that what you get, in terms of false positive probability, is not necessarily what you expect. Over the course of this implementation, we had to make many adjustments to limit side effects (such as the one breaking the original implementation). So anyone wanting to implement their own Bloom filter really has to pay attention to the quality of the **bit index generation algorithm**, to make sure it does not create unexpected collisions. The other important parameter is the **hashing function** used: it needs to have a low collision rate regardless of the input data. To make sure everything works as expected, thorough testing mixing filter properties and data types is mandatory; a sketch of such a test follows.
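To illustrate the kind of test we mean, here is a self-contained sketch (a toy filter built from the pieces above, not Poppy's test suite): insert `n` items, query `n` items that were never inserted, and compare the measured false positive rate against the target:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy Bloom filter, only here to demonstrate the testing methodology.
struct Bloom { bits: Vec<u64>, m: u64, k: u64 }

fn seeded(x: u64, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    x.hash(&mut h);
    h.finish()
}

impl Bloom {
    fn new(n: u64, p: f64) -> Self {
        let ln2 = std::f64::consts::LN_2;
        let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as u64;
        let k = ((m as f64 / n as f64) * ln2).round().max(1.0) as u64;
        Bloom { bits: vec![0; ((m + 63) / 64) as usize], m, k }
    }
    // double-hashing bit index generation, as sketched earlier
    fn indexes(&self, x: u64) -> Vec<u64> {
        let (h1, h2) = (seeded(x, 1), seeded(x, 2) | 1);
        (0..self.k).map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % self.m).collect()
    }
    fn insert(&mut self, x: u64) {
        for b in self.indexes(x) { self.bits[(b / 64) as usize] |= 1 << (b % 64); }
    }
    fn contains(&self, x: u64) -> bool {
        self.indexes(x).iter().all(|&b| self.bits[(b / 64) as usize] & (1 << (b % 64)) != 0)
    }
}

fn main() {
    let (n, p) = (100_000u64, 0.001);
    let mut f = Bloom::new(n, p);
    for x in 0..n { f.insert(x); }
    // none of the items in n..2n were inserted: every hit is a false positive
    let fp = (n..2 * n).filter(|&x| f.contains(x)).count();
    println!("target fpp = {p}, measured fpp = {}", fp as f64 / n as f64);
}
```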
### Future work

Binary file not shown.

Before: 51 KiB | After: 27 KiB