#pattern-matching #lookup-tables #bwt #bio #fm-index

lt-fm-index

Fm-index using k-mer lookup table for exact pattern matching

23 releases

0.7.0-alpha.2 Jun 26, 2023
0.7.0-alpha.1 Feb 28, 2023
0.6.0 Feb 4, 2023
0.5.6 Sep 14, 2022
0.1.0 Jul 21, 2021

#456 in Algorithms

Download history 8/week @ 2024-02-23 7/week @ 2024-03-01 3/week @ 2024-03-08 2/week @ 2024-03-15 61/week @ 2024-03-29 9/week @ 2024-04-05

70 downloads per month
Used in 2 crates

MIT license

81KB
2K SLoC

LtFmIndex

CI crates.io

LtFmIndex is a Rust library for building and using a FM-index that contains a lookup table of the first k-mer of a pattern. This index can be used to (1) count the number of occurrences and (2) locate the positions of a pattern in an indexed text.

Usage

Add to dependency

To use this library, add lt_fm_index to your Cargo.toml:

[dependencies]
lt_fm_index = "0.7.0-alpha"
  • About fastbwt features
    • This feature can accelerate the indexing, but needs cmake to build libdivsufsort and cannot be built as WASM.

Example code

use lt_fm_index::LtFmIndex;
use lt_fm_index::blocks::Block2; // `Block2` can index 3 types of characters.

// (1) Define characters to use
let characters_by_index: &[&[u8]] = &[
    &[b'A', b'a'], // 'A' and 'a' are treated as the same
    &[b'C', b'c'], // 'C' and 'c' are treated as the same
    &[b'G', b'g'], // 'G' and 'g' are treated as the same
];
// Alternatively, you can use this simpler syntax:
let characters_by_index: &[&[u8]] = &[
    b"Aa", b"Cc", b"Gg"
];

// (2) Build index
let text = b"CTCCGTACACCTGTTTCGTATCGGAXXYYZZ".to_vec();
let lt_fm_index= LtFmIndex::<u32, Block2<u128>>::build(
    text,
    characters_by_index,
    2,
    4,
).unwrap();

// (3) Match with pattern
let pattern = b"TA";
//   - count
let count = lt_fm_index.count(pattern);
assert_eq!(count, 2);
//   - locate
let mut locations = lt_fm_index.locate(pattern);
locations.sort();  // The locations may not be in order.
assert_eq!(locations, vec![5,18]);
// All unindexed characters are treated as the same character.
// In the text, X, Y, and Z can match any other unindexed character
let mut locations = lt_fm_index.locate(b"UNDEF");
locations.sort();
// Using the b"XXXXX", b"YYYYY", or b"!@#$%" gives the same result.
assert_eq!(locations, vec![25,26]);

// (4) Save and load
let mut buffer = Vec::new();
lt_fm_index.save_to(&mut buffer).unwrap();
let loaded = LtFmIndex::load_from(&buffer[..]).unwrap();
assert_eq!(lt_fm_index, loaded);

Repository

https://github.com/baku4/lt-fm-index

API Doc

https://docs.rs/lt-fm-index/

Reference

  • Ferragina, P., et al. (2004). An Alphabet-Friendly FM-Index, Springer Berlin Heidelberg: 150-160.
  • Anderson, T. and T. J. Wheeler (2021). An optimized FM-index library for nucleotide and amino acid search, Cold Spring Harbor Laboratory.
  • Wang, Y., X. Li, D. Zang, G. Tan and N. Sun (2018). Accelerating FM-index Search for Genomic Data Processing, ACM.
  • Yuta Mori. libdivsufsort

Dependencies

~17MB
~299K SLoC