#fm-index #pattern-matching #bioinformatics #fmindex #burrows-wheeler

sview-fmindex

FM-index library with slice view architecture for efficient text indexing and pattern matching

2 releases

Uses new Rust 2024

0.1.1 Aug 15, 2025
0.1.0 Jul 14, 2025

#141 in Biology

MIT license

135KB
3K SLoC

sview-fmindex

crates.io CI license

Data is in single blob, and FM-index is slice view.

sview-fmindex is a Rust library for building FM-index into pre-allocated blob and using it with minimal copying through slice view.

  • -fmindex: FM-index is a compressed text index that provides two main operations:
    1. Count the number of occurrences in text
    2. Locate the positions in text
  • sview-: In this library, data is stored in one contiguous blob, and the FM-index structure is created as a slice view into this blob.

Architecture

           builder
          ┌────────┐
          │ header │
          └────────┘
               │
               |-(build with text)
 blob          ▼
┌────────┬──────────────────────┐
│ header │     body (LARGE)     │
└────────┴──────────────────────┘
               │
               │-(load index)
      fm-index ▼
      ┌────────┬────────┐
      │ header │  view  │
      └────────┴────────┘

Usage

Basic Example

use sview_fmindex::{FmIndexBuilder, FmIndex};
use sview_fmindex::blocks::Block2; // Block2 can index 4 symbols
use sview_fmindex::text_encoders::EncodingTable;

// (1) Define symbols to use
let symbols: &[&[u8]] = &[b"Aa", b"Cc", b"Gg", b"Tt"];
let encoding_table = EncodingTable::from_symbols(symbols);
let symbol_count = encoding_table.symbol_count(); // 4

// (2) Build index
let text = b"CTCCGTACACCTGTTTCGTATCGGAXXYYZZ".to_vec();
let builder = FmIndexBuilder::<u32, Block2<u64>, EncodingTable>::new(
    text.len(),
    symbol_count,
    encoding_table,
).unwrap();
// You have to prepare a blob to build the index.
let blob_size = builder.blob_size();
let mut blob = vec![0; blob_size];
// Build the fm-index to the blob.
builder.build(text, &mut blob).unwrap();
// Load the fm-index from the blob.
let fm_index = FmIndex::<u32, Block2<u64>, EncodingTable>::load(&blob[..]).unwrap();

// (3) Match with pattern
let pattern = b"TA";
//   - count
let count = fm_index.count(pattern);
assert_eq!(count, 2);
//   - locate
let mut locations = fm_index.locate(pattern);
locations.sort();  // The locations may not be in order.
assert_eq!(locations, vec![5,18]);

// When using EncodingTable, the last symbol is treated as wild card.
// In the text, X, Y, and Z can match any other unindexed character
let mut locations = fm_index.locate(b"UNDEF");
locations.sort();
// Using the b"XXXXX", b"YYYYY", or b"!@#$%" gives the same result.
assert_eq!(locations, vec![25,26]);

Benchmarks

1 Gbp nucleotide text · 20 bp pattern

Load strategy Avg RSS Peak RSS Blob load (ms) Locate per pattern (ns)
Full in‑memory 2.82 GiB 4.50 GiB 2,388 1,369 ns
mmap (no Advice) 0.48 GiB 0.52 GiB 0.04 1,365 ns
Test setup
  • Data
    • Text: 1 Gbp random nucleotide
    • Patterns: 1 000 000 × 20 bp
    • Index: Position: u32, Block: Block2<u64>, Uncompressed
  • Hardware
    • CPU Intel Xeon E5‑2680 v4 @ 2.40 GHz
    • Memory 256 GiB
    • OS (Kernel) Ubuntu 20.04.2 LTS (5.4.0‑171‑generic)
    • Page size 4 KiB

References

Base repo: This repository was forked from baku4/lt-fm-index (v0.7.0, commit 1327896).

FM-index implementation:

  • Ferragina, P., & Manzini, G. (2004). An Alphabet-Friendly FM-Index. String Processing and Information Retrieval, 150-160.
  • Wang, Y., et al. (2018). Accelerating FM-index Search for Genomic Data Processing. Proceedings of the 47th International Conference on Parallel Processing.

K-mer Lookup table implementation:

  • Anderson, T., & Wheeler, T. J. (2021). An optimized FM-index library for nucleotide and amino acid search. Algorithms for Molecular Biology, 16(1), 25.

Burrows-Wheeler Transform:

  • libdivsufsort by Yuta Mori.
  • Rust-Bio: A fast and safe bioinformatics library, introduced in Köster, J. (2016), "Rust-Bio: a fast and safe bioinformatics library," Bioinformatics, 32(3), 444-446.

Dependencies

~1.6–2.4MB
~43K SLoC