14 releases
Uses new Rust 2024
| 0.13.11 | Jan 16, 2026 |
|---|---|
| 0.13.10 | Jan 16, 2026 |
| 0.12.6 | Jan 15, 2026 |
#141 in #fuse
65KB
770 lines
jdb_xorf: High-Performance Binary Fuse Filter in Rust
Introduction
jdb_xorf is a high-performance Binary Fuse filter implementation for Rust. Compared to Bloom or Cuckoo filters, this probabilistic data structure offers faster query speeds and smaller memory footprint. Binary Fuse filters represent the state-of-the-art in static set membership testing.
Binary Fuse is the pinnacle of the Xor Filter family and is currently the most efficient static membership test structure known. It has significant advantages over other filters in all key metrics:
Why Choose Binary Fuse?
- Faster Construction: Uses Graph Partitioning technology to break the problem into small blocks adaptable to L1/L2 cache, making construction 10-20 times faster than traditional Xor Filters.
- More Space Efficient: Lower space occupancy for the same false positive rate.
Bf8requires only about 8.64 bits/entry to achieve a false positive rate of about 0.39% (space overhead is only 1.08x the theoretical lower limit). - Better Locality: Partitioned design significantly reduces CPU cache misses.
- Ultra-Fast Query: Query is strictly O(1), requiring only 3 memory accesses + 1 hash mixing + 2 XOR calculations.
- No False Negatives: Guaranteed to return True if the element is in the set.
Table of Contents
- Prerequisites and Caveats
- Usage Demo
- Features
- Design
- Technology Stack
- Directory Structure
- API
- Construction Failure Probability
- Historical Background
- References
Prerequisites and Caveats
No Duplicate Keys Allowed
The construction algorithm for Binary Fuse filters (Bf8, Bf16, Bf32) has a strict prerequisite: the input data structure must not contain duplicate keys. If the input u64 hash values contain duplicates, the construction process will almost certainly fail. If you use the raw filter directly, you must remove duplicates yourself before construction.
automatic deduplication with Bf
It is recommended to use the Bf wrapper to handle arbitrary types (such as String, &[u8], etc.). Bf internally automatically handles all hash calculation, sorting, and deduplication work to ensure construction success rate. You only need to pass in the data, and leave the rest to it.
Usage Demo
Basic Binary Fuse Filter
use jdb_xorf::{Filter, Bf8};
let keys = vec![1u64, 2, 3];
let filter = Bf8::from(&keys);
assert!(filter.has(&1));
assert!(!filter.has(&4));
Construction from Arbitrary Types with Borrowed Query
Bf supports borrow-based querying similar to HashMap. For example, after referencing a String filter, you can query it directly using &str.
use jdb_xorf::{Filter, Bf, Bf8};
let fruits = vec!["apple".to_string(), "banana".to_string()];
// Bf automatically handles hashing and deduplication.
// Uses RapidHasher by default for extremely high performance.
let filter: Bf<String, Bf8> = Bf::from(&fruits);
// Query with &str directly on a String filter
assert!(filter.has("apple"));
Wrapper (Wrap) Mode: Using Bf<str>
If you want the filter to semantically represent a "Filter of strings" (rather than a "Filter of string references") at the type level, you can use .into() to convert Bf<&str> to Bf<str>. This is particularly useful for Unsized Types.
use jdb_xorf::{Bf, Bf8, Filter};
let keys = vec!["apple", "banana"];
// 1. Build a normal Bf<&str> first
let temp_filter: Bf<&str, Bf8> = Bf::from(&keys);
// 2. Convert into Bf<str> (Unsized)
let str_filter: Bf<str, Bf8> = temp_filter.into();
// 3. Query
assert!(str_filter.has("apple"));
Binary Data: Using Bf<[u8]>
Similarly, you can create filters for byte slices.
use jdb_xorf::{Bf, Bf8, Filter};
let data: Vec<&[u8]> = vec![b"hello", b"world"];
let temp_filter: Bf<&[u8], Bf8> = Bf::from(&data);
// Convert into Bf<[u8]>
let bytes_filter: Bf<[u8], Bf8> = temp_filter.into();
assert!(bytes_filter.has(b"hello".as_slice()));
Serialization and Deserialization (Optional Feature)
After enabling the bitcode feature, you can use bitcode::encode / bitcode::decode for serialization directly.
use jdb_xorf::{Bf, Bf8};
// 1. Serialize (Encode)
let keys = vec!["apple", "banana"];
// No need to convert to String, use &str directly
let filter: Bf<&str, Bf8> = Bf::from(&keys);
// Use bitcode library function directly
let bytes = bitcode::encode(&filter);
// 2. Deserialize (Decode)
// Can fully restore type information, including generic parameters
let loaded: Bf<&str, Bf8> = bitcode::decode(&bytes).expect("Decode failed");
assert!(loaded.has("apple"));
Embedding in Structs
Thanks to bitcode support, you can easily embed Bf into your own structs, such as an SSTable structure in a database.
use bitcode::{Decode, Encode};
use jdb_xorf::{Bf, Bf8};
#[derive(Encode, Decode)]
pub struct Sst {
// Filter for binary data
pub xorf: Bf<[u8], Bf8>,
}
let keys: Vec<&[u8]> = vec![b"key1", b"key2"];
let filter: Bf<&[u8], Bf8> = Bf::from(&keys);
let sst = Sst {
xorf: filter.into(),
};
// Serialize the entire struct
let bytes = bitcode::encode(&sst);
Features
- Ultra-Fast: Picosecond-level query latency.
- High Efficiency: Extremely high space utilization (Bf8 requires only about 8.64 bit per entry, space overhead is 1.08x theoretical lower limit).
- Flexible: Provides
Bfadapter, supports non-u64 types and automatic deduplication. - Portable: Fully supports
no_std, suitable for embedded environments. - Serialization: Optional support for
bitcodeserialization.
Algorithm Details (Mermaid)
1. Construction Phase (Peeling Phase)
graph TD
Start["Start Construction"] --> Init["Calculate Parameters: seg_len, capacity"]
Init --> SeedIter["Try Next Seed"]
SeedIter --> Mapping["Map Key: Calculate 3 slots h0, h1, h2"]
Mapping --> Bucketing["Update Bucket State: t2count++ / t2hash XOR= hash"]
Bucketing --> FindAlone["Scan Buckets: Find alone buckets with count == 1"]
FindAlone --> Queue["Add to alone queue"]
Queue --> PeelLoop{"Is Queue Empty?"}
PeelLoop -- "No" --> Pop["Pop bucket index, Push key to reverse_order stack"]
Pop --> Update["Update 2 adjacent buckets: Decrease count and XOR hash sum"]
Update --> NewAlone{"New alone bucket produced?"}
NewAlone -- "Yes" --> Queue
NewAlone -- "No" --> PeelLoop
PeelLoop -- "Yes" --> Success{"All keys processed?"}
Success -- "No" --> SeedIter
Success -- "Yes" --> Done["Enter Solver Phase"]
2. Solver Phase (Solver Phase)
graph TD
SStart["Start Solver"] --> SInit["Initialize fingerprints array"]
SInit --> PopStack["Pop key and slot info from reverse_order stack top"]
PopStack --> ReadOther["Read other 2 determined or initial fingerprints"]
ReadOther --> Assign["Calculate current fingerprint: fp = target_f XOR fp_other1 XOR fp_other2"]
Assign --> Next{"Is Stack Empty?"}
Next -- "No" --> PopStack
Next -- "Yes" --> SDone["BinaryFuse Construction Successful"]
3. Query Phase (Query Phase)
graph TD
QKey["Input Query Key"] --> QHash["mix64 Hash Mixing"]
QHash --> QSlots["Determine 3 slots: h0, h1, h2"]
QSlots --> QRead["Atomic Read: fp0, fp1, fp2"]
QRead --> QXor["XOR Operation: res = fp0 XOR fp1 XOR fp2"]
QXor --> QMatch{"res == (hash as Fingerprint)?"}
QMatch -- "Yes" --> QPres["Probably Present"]
QMatch -- "No" --> QNot["Definitely Not Present"]
- Hashing: Mix keys using RapidHash or custom hasher.
- Mapping: Determine three slots in the partition graph based on hash value.
- Lookup: Determine membership by XORing the fingerprints of these three slots.
Technology Stack
- Language: Rust (Edition 2024).
- Core Algorithm: Binary-partitioned Fuse Graph algorithm.
- Hash Algorithm: RapidHash (based on
rapidhashcrate), high-quality mixing functionmix64. - Performance Evaluation: Criterion micro-benchmarks.
Directory Structure
src/: Core implementation.base/: Generic Binary Fuse algorithm implementation (construction, query, tools).bf.rs: Generic wrapperBfimplementation (arbitrary types & deduplication).hash.rs: Hasher and high-quality mixing function implementation.
benches/: Performance benchmark suite.analysis/: Uniformity and zero-distribution analysis tools.
API
Trait
Filter<T>: Core trait for membership testing.has(&self, key: &T) -> boollen(&self) -> usize
Types
Bf8,Bf16,Bf32: Managed memory filters (Aliases toBase<u8>,Base<u16>,Base<u32>).Bf<T, F, H = RapidHasher>: Generic wrapper, uses hasherHand filterFto handle arbitrary typeT, with automatic deduplication.
Summary Comparison Table
| Filter | Memory Footprint | Query Speed | Construction Speed | Cache Friendliness | Scenario |
|---|---|---|---|---|---|
| Binary Fuse | Very Low (≈1.08x theoretical limit) | Very Fast (3 accesses) | Very Fast (partition optimized) | Excellent | Best choice for static massive data |
| Xor Filter | Low | Fast | Slow | Poor | Previous generation solution |
| Bloom Filter | Medium | Slow (multiple hashes) | Fast | Poor | Dynamic data/Simple scenarios |
| Cuckoo Filter | Low | Medium (random probing) | Slow | Poor | Deletion support needed |
Construction Failure Probability
The theoretical probability of Binary Fuse filter construction failure is extremely low. This library automatically retries 1000 times during construction (using a different random seed each time).
According to Mueller & Lemire's paper [2], the lower bound for single construction success rate is 90%. Based on measured data from this library (1000 rounds of testing on 100,000 random keys):
- First-time Construction Success Rate: 98.70% (Successful on first try in 987 out of 1000 times)
- Average Attempts: 1.013
This means the failure rate for a single construction is only .
Therefore, the probability of failure for 1000 consecutive constructions is:
is a value that can be considered absolutely 0.
It is much smaller than the reciprocal of the total number of atoms in the universe, and hundreds of orders of magnitude lower than the probability of uncorrectable errors in modern computer hardware (about
).
Google's research on large-scale data centers shows that about 1.3% of machines experience at least one Uncorrectable Error per year. Assuming a construction process takes 0.1 seconds, the probability of a hardware error occurring during this period is on the order of .
| Event Type | Approximate Probability | Risk Qualification |
|---|---|---|
| Hardware Bit Flip During Construction | Real existent extremely low risk | |
| Binary Fuse Construction Failure | Physically "Absolutely Impossible" |
Therefore, the library's design principle is: Treat construction failure as an unrecoverable fatal error (panic), rather than a runtime error (Result/TryFrom).
If you encounter a panic during construction, it is more likely due to:
- Duplicate keys in input data (This is the most common reason, even if you think you have deduplicated).
- Hardware failure (Memory bit flip, etc.).
- Extremely rare probabilistic event (In this case, a simple retry is sufficient, but it is almost impossible to encounter within the timescale of human civilization).
For ease of use, we prioritize using the From trait because it matches the psychological expectation in most scenarios: the construction process is always successful.
Historical Background
The technical evolution of probabilistic filters began with Bloom filters (1970) and went through improvements with Cuckoo filters (2014). The emergence of Xor filters in 2020 brought a paradigm shift, achieving better performance through perfect XOR summation. In 2022, Mueller and Lemire further proposed Binary Fuse filters, which approached theoretical limits in space and time efficiency through graph partitioning technology.
References
- Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters
- Binary Fuse Filters: Fast and Smaller Than Xor Filters
- Fuse Graph
- Go Implementation
- C Implementation
- fuse graph
About
This project is an open-source component of js0.site ⋅ Refactoring the Internet Plan.
We are redefining the development paradigm of the Internet in a componentized way. Welcome to follow us:
Bench
Performance Benchmark
| Library | Filter | Bf Ops | Query Ops | Memory | Speedup |
|---|---|---|---|---|---|
| jdb | Bf8 | 4959.87 | 86727.74 | 116.04 KB | 1.65x |
| xorf | BinaryFuse8 | 3971.65 | 52493.29 | 116.04 KB | - |
| jdb | Bf16 | 4809.60 | 76974.42 | 232.04 KB | 1.56x |
| xorf | BinaryFuse16 | 4015.96 | 49222.88 | 232.04 KB | - |
| jdb | Bf32 | 4676.60 | 67297.71 | 464.04 KB | 1.39x |
| xorf | BinaryFuse32 | 4016.81 | 48393.79 | 464.04 KB | - |
Accuracy
| Library | Filter | False Positive Rate | False Negative Rate |
|---|---|---|---|
| jdb | Bf8 | 0.39196% | 0 |
| xorf | BinaryFuse8 | 0.38768% | 0 |
| jdb | Bf16 | 0.00132% | 0 |
| xorf | BinaryFuse16 | 0.00165% | 0 |
| jdb | Bf32 | 0.00000% | 0 |
| xorf | BinaryFuse32 | 0.00000% | 0 |
About
This project is an open-source component of js0.site ⋅ Refactoring the Internet Plan.
We are redefining the development paradigm of the Internet in a componentized way. Welcome to follow us:
jdb_xorf : 极致性能的 Rust Binary Fuse 过滤器
项目介绍
jdb_xorf 是针对 Rust 开发的高性能 Binary Fuse 过滤器实现。此类概率型数据结构相较于 Bloom 或 Cuckoo 过滤器,具备更快的查询速度与更小的内存占用。Binary Fuse 过滤器代表了目前静态集合成员检测技术的最高水平。
Binary Fuse 是 Xor Filter 系列的巅峰之作,也是目前已知最高效的静态成员检测结构。相比于其他过滤器,它在所有关键指标上都具有显著优势:
为什么选择 Binary Fuse?
- 构建速度更快: 采用了图分区 (Graph Partitioning) 技术,将问题拆解为适应 L1/L2 缓存的小型块,构建速度比传统 Xor Filter 快 10-20 倍。
- 更节省空间: 相同误报率下空间占用更低。
Bf8仅需约 8.64 bits/entry 即可达到约 0.39% 的误报率(空间开销仅为理论下限的 1.08x)。 - 更好的局部性: 分区设计极大幅度减少了 CPU 缓存失效。
- 极速查询: 查询是严格的 O(1),仅需 3 次内存访问 + 1 次哈希混淆 + 2 次 XOR 计算。
- 无误漏: 保证如果元素在集合中,一定会返回 True。
目录
注意事项与前提条件
严禁重复元素
Binary Fuse 过滤器 (Bf8, Bf16, Bf32) 的构建算法有一个严格的前提条件:输入的数据结构中不得包含重复的键 (duplicate keys)。如果输入的 u64 哈希值存在重复,构建过程几乎肯定会失败。如果您直接使用原始过滤器,必须在构建前自行去除重复项。
Bf 自动去重
推荐使用 Bf 包装器来处理任意类型(如 String, &[u8] 等)。Bf 会在内部自动处理所有的哈希计算、排序和去重工作,确保构建成功率。您只需传入数据,剩下的交给它处理。
使用演示
基础 Binary Fuse 过滤器
use jdb_xorf::{Filter, Bf8};
let keys = vec![1u64, 2, 3];
let filter = Bf8::from(&keys);
assert!(filter.has(&1));
assert!(!filter.has(&4));
任意类型的构建与借用查询
Bf 支持类似 HashMap 的借用查询。例如,构建 String 过滤器后,可以直接使用 &str 查询。
use jdb_xorf::{Filter, Bf, Bf8};
let fruits = vec!["apple".to_string(), "banana".to_string()];
// Bf 会自动处理哈希和去重。
// 默认使用 RapidHasher 以获得极高性能。
let filter: Bf<String, Bf8> = Bf::from(&fruits);
// 直接使用 &str 查询 String 类型的过滤器
assert!(filter.has("apple"));
包装 (Wrap) 模式:使用 Bf<str>
如果你希望过滤器在类型层面上就代表 "字符串过滤器" (而非 "字符串引用过滤器"),可以使用 into() 方法将 Bf<&str> 转换为 Bf<str>。
use jdb_xorf::{Bf, Bf8, Filter};
let keys = vec!["apple", "banana"];
// 1. 先构建常规的 Bf<&str>
let temp_filter: Bf<&str, Bf8> = Bf::from(&keys);
// 2. 转换为 Bf<str> (Unsized)
let str_filter: Bf<str, Bf8> = temp_filter.into();
// 3. 查询
assert!(str_filter.has("apple"));
二进制数据:使用 Bf<[u8]>
同样地,你可以创建针对字节切片的过滤器。
use jdb_xorf::{Bf, Bf8, Filter};
let data: Vec<&[u8]> = vec![b"hello", b"world"];
let temp_filter: Bf<&[u8], Bf8> = Bf::from(&data);
// 转换为 Bf<[u8]>
let bytes_filter: Bf<[u8], Bf8> = temp_filter.into();
assert!(bytes_filter.has(b"hello".as_slice()));
序列化与反序列化 (可选特性)
开启 bitcode 特性后,可直接使用 bitcode::encode / bitcode::decode 进行序列化。
use jdb_xorf::{Bf, Bf8};
// 1. 序列化 (Encode)
let keys = vec!["apple", "banana"];
// 无需转换为 String,直接使用 &str
let filter: Bf<&str, Bf8> = Bf::from(&keys);
// 直接使用 bitcode 库函数
let bytes = bitcode::encode(&filter);
// 2. 反序列化 (Decode)
// 能够完全还原类型信息,包括泛型参数
let loaded: Bf<&str, Bf8> = bitcode::decode(&bytes).expect("Decode failed");
assert!(loaded.has("apple"));
将过滤器作为结构体字段
得益于 bitcode 支持,你可以轻松地将 Bf 嵌入到自己的结构体中,甚至是数据库的 SSTable 结构。
use bitcode::{Decode, Encode};
use jdb_xorf::{Bf, Bf8};
#[derive(Encode, Decode)]
pub struct Sst {
// 针对二进制数据的过滤器
pub xorf: Bf<[u8], Bf8>,
}
let keys: Vec<&[u8]> = vec![b"key1", b"key2"];
let filter: Bf<&[u8], Bf8> = Bf::from(&keys);
let sst = Sst {
xorf: filter.into(),
};
// 序列化整个结构体
let bytes = bitcode::encode(&sst);
特性介绍
- 极速: 皮秒级查询延迟。
- 高效: 空间利用率极高(Bf8 每条目仅需约 8.64 bit,空间开销为理论下限的 1.08x)。
- 灵活: 提供
Bf适配器,支持非 u64 类型并自动去重。 - 便携: 完整支持
no_std,适用于嵌入式环境。 - 序列化: 可选支持
bitcode或 DMA 零拷贝加载。
算法细节 (Mermaid)
1. 构造阶段 (Peeling Phase)
graph TD
Start["开始构造"] --> Init["计算参数: seg_len, capacity"]
Init --> SeedIter["尝试下一种子 (Seed)"]
SeedIter --> Mapping["映射键: 计算 3 个槽位 h0, h1, h2"]
Mapping --> Bucketing["更新桶状态: t2count++ / t2hash XOR= hash"]
Bucketing --> FindAlone["扫描桶: 寻找 count == 1 的孤立桶"]
FindAlone --> Queue["加入 alone 队列"]
Queue --> PeelLoop{"队列是否为空?"}
PeelLoop -- "否" --> Pop["弹出桶索引, 将键压入 reverse_order 栈"]
Pop --> Update["更新相邻 2 个桶: 减少计数并异或哈希和"]
Update --> NewAlone{"产生新孤立桶?"}
NewAlone -- "是" --> Queue
NewAlone -- "否" --> PeelLoop
PeelLoop -- "是" --> Success{"是否处理了所有键?"}
Success -- "否" --> SeedIter
Success -- "是" --> Done["进入求解阶段"]
2. 求解阶段 (Solver Phase)
graph TD
SStart["开始求解"] --> SInit["初始化指纹数组 fingerprints"]
SInit --> PopStack["从 reverse_order 栈顶弹出键与槽位信息"]
PopStack --> ReadOther["读取另外 2 个已确定或初始的指纹"]
ReadOther --> Assign["计算当前指纹: fp = target_f XOR fp_other1 XOR fp_other2"]
Assign --> Next{"栈是否为空?"}
Next -- "否" --> PopStack
Next -- "是" --> SDone["BinaryFuse 构建成功"]
3. 查询阶段 (Query Phase)
graph TD
QKey["输入查询键"] --> QHash["mix64 哈希混淆"]
QHash --> QSlots["确定 3 个槽位: h0, h1, h2"]
QSlots --> QRead["原子读取: fp0, fp1, fp2"]
QRead --> QXor["异或运算: res = fp0 XOR fp1 XOR fp2"]
QXor --> QMatch{"res == (hash as Fingerprint)?"}
QMatch -- "是" --> QPres["可能存在 (Probably)"]
QMatch -- "否" --> QNot["绝对不存在 (Definitely)"]
- 哈希化: 使用 RapidHash 或自定义哈希器对键进行混淆。
- 映射: 根据哈希值在分区图中确定三个槽位。
- 查找: 通过对这三个槽位的指纹进行 XOR 运算,判断成员身份。
技术堆栈
- 语言: Rust (Edition 2024)。
- 核心算法: 二进制分区保险丝图 (Binary-partitioned Fuse Graph) 算法。
- 哈希算法: RapidHash (基于
rapidhashcrate), 高质量混淆函数mix64。 - 性能评估: Criterion 微基准测试。
目录结构
src/: 核心实现。base/: 泛型 Binary Fuse 算法实现 (构建、查询、工具)。bf.rs: 通用包装器Bf实现 (支持任意类型与去重)。hash.rs: 哈希器与高质量混淆函数 implementation。
benches/: 性能基准测试集。analysis/: 均匀性与零分布分析工具。
API 说明
Trait
Filter<T>: 成员检测核心 trait。has(&self, key: &T) -> boollen(&self) -> usize
类型
Bf8,Bf16,Bf32: 托管内存的过滤器 (别名Base<u8>,Base<u16>,Base<u32>)。Bf<T, F, H = RapidHasher>: 通用包装器,使用哈希器H与过滤器F处理任意类型T,具备自动去重功能。
总结对比表
| 过滤器 | 内存占用 | 查询速度 | 构建速度 | 缓存友好度 | 场景 |
|---|---|---|---|---|---|
| Binary Fuse | 极低 (≈1.08x 理论下限) | 极快 (3 次访问) | 极快 (分区优化) | 优秀 | 静态海量数据最佳选择 |
| Xor Filter | 低 | 快 | 慢 | 差 | 旧一代方案 |
| Bloom Filter | 中 | 慢 (多次哈希) | 快 | 差 | 动态数据/简单场景 |
| Cuckoo Filter | 低 | 中 (随机探测) | 慢 | 差 | 需要支持删除 |
构建失败概率
Binary Fuse 过滤器构建失败的理论概率极低。本库在构建时会自动进行 1000 次重试(每次使用不同的随机种子)。
根据 Mueller & Lemire 的论文 [2],单次构建的成功率下界为 90%。而基于本库的实测数据(针对 100,000 个随机键进行 1000 轮测试):
- 一次性构建成功率: 98.70% (1000 次中 987 次首次成功)
- 平均尝试次数: 1.013
这意味着单次构建的失败率仅为 。
因此,连续 1000 次构建均失败的概率为:
是一个完全可以视为 0 的数值。
它比宇宙中的原子总数倒数还要小得多,更比现代计算机硬件发生不可纠正错误的概率(约为
) 低几百个数量级。
Google 在大规模数据中心的研究表明,每年约有 1.3% 的机器经历过至少一次不可纠正的内存错误 (Uncorrectable Error)。假设一次构建过程耗时 0.1 秒,那么在此期间发生硬件错误的概率约为 量级。
| 事件类型 | 近似概率 | 风险定性 |
|---|---|---|
| 构建期间硬件位翻转 | 真实存在的极低风险 | |
| Binary Fuse 构建失败 | 物理上的“绝对不可能” |
因此,库的设计原则是:将构建失败视为不可恢复的致命错误(panic),而非运行时错误(Result/TryFrom)。
如果您在构建过程中遇到了 panic,更有可能是因为:
- 输入数据存在重复键(这是最常见的原因,即使您认为已经去重)。
- 硬件故障(内存位翻转等)。
- 极度罕见的概率事件(此时只需简单的重试即可,但在人类文明的时间尺度内几乎不可能遇到)。
出于易用性考虑,我们优先使用 From trait,因为它符合绝大多数场景下的心理预期:构建过程总是成功的。
历史背景
概率过滤器的技术演进从 Bloom 过滤器 (1970) 开始,历经 Cuckoo 过滤器 (2014) 的改进。2020 年 Xor 过滤器的出现带来了范式转移,通过完美的 XOR 求和实现更优性能。2022 年,Mueller 与 Lemire 进一步提出 Binary Fuse 过滤器,通过图分区技术使其在空间和时间效率上逼近了理论极限。
参考引用
- Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters
- Binary Fuse Filters: Fast and Smaller Than Xor Filters
- Fuse Graph
- Go 实现
- C 实现
- fuse graph
关于
本项目是 js0.site ⋅ 重构互联网计划 的开源组件。
我们正在以组件化的方式重新定义互联网的开发范式。欢迎关注我们:
评测
性能基准
| 库 | 过滤器 | 构建(万ops/s) | 查询(万ops/s) | 内存占用 | 对比 |
|---|---|---|---|---|---|
| jdb | Bf8 | 4959.87 | 86727.74 | 116.04 KB | 1.65x |
| xorf | BinaryFuse8 | 3971.65 | 52493.29 | 116.04 KB | - |
| jdb | Bf16 | 4809.60 | 76974.42 | 232.04 KB | 1.56x |
| xorf | BinaryFuse16 | 4015.96 | 49222.88 | 232.04 KB | - |
| jdb | Bf32 | 4676.60 | 67297.71 | 464.04 KB | 1.39x |
| xorf | BinaryFuse32 | 4016.81 | 48393.79 | 464.04 KB | - |
准确率
| 库 | 过滤器 | 假阳率 | 假阴率 |
|---|---|---|---|
| jdb | Bf8 | 0.39196% | 0 |
| xorf | BinaryFuse8 | 0.38768% | 0 |
| jdb | Bf16 | 0.00132% | 0 |
| xorf | BinaryFuse16 | 0.00165% | 0 |
| jdb | Bf32 | 0.00000% | 0 |
| xorf | BinaryFuse32 | 0.00000% | 0 |
关于
本项目为 js0.site ⋅ 重构互联网计划 的开源组件。
我们正在以组件化的方式重新定义互联网的开发范式,欢迎关注:
Dependencies
~1MB
~18K SLoC