8 releases
0.2.7 | Nov 18, 2021 |
---|---|
0.2.6 | Jun 20, 2021 |
0.1.2 |
|
#1582 in Algorithms
42KB
675 lines
Xfind
This library provides fast forward and backward substring searchers for stream searches.
Usage
Add this to your Cargo.toml
xfind = "0.2"
Documentation
License
Licensed under either of MIT-LICENSE or UNLICENSE, at your option.
lib.rs
:
Provides forward and backward substring searchers for stream searches.
This crate is built on top of memchr
, a heavily optimized routines for string searches.
But unlike memchr
, utilties provided by this crate operate directly on stream (i.e.,
Read
instances) rather than in-memory buffers.
Note that this crate provides no advantage when searching substring in a source that is already
in memory, in this case consider using the memchr
library instead. Besides, if you want to
search multiple substrings at once, take a look at aho-corasick
.
Complexity
Forward and backward stream search routines provided by this crate is guaranteed to have worst
case linear time complexity with respect to both needle.len()
and stream.len()
, and worst
case constant space complexity with respect to needle.len()
.
Performance
Below is a collected benchmark result for searching all occurrences of dear
in a 767KB book
Pride and Prejudice
.
test group_1::stream_find_iter::aho_corasick ... bench: 558,530 ns/iter (+/- 8,705)
test group_1::stream_find_iter::memchr ... bench: 89,728 ns/iter (+/- 4,979)
test group_1::stream_find_iter::xfind ... bench: 112,766 ns/iter (+/- 2,453)
test group_2::stream_rfind_iter::memchr ... bench: 613,183 ns/iter (+/- 10,610)
test group_2::stream_rfind_iter::xfind ... bench: 681,210 ns/iter (+/- 6,990)
test group_3::memory_find_iter::aho_corasick ... bench: 454,277 ns/iter (+/- 2,030)
test group_3::memory_find_iter::memchr ... bench: 21,564 ns/iter (+/- 657)
test group_3::memory_find_iter::xfind ... bench: 41,548 ns/iter (+/- 2,028)
test group_4::memory_rfind_iter::memchr ... bench: 543,737 ns/iter (+/- 4,420)
test group_4::memory_rfind_iter::xfind ... bench: 590,744 ns/iter (+/- 14,684)
-
When performing forward stream searches,
xfind
is about 1.3x slower thanmemchr::memmem
(group 1), which is actually quite fast becausememmem
itself operates on in-memory buffer butxfind
operates directly on stream. The main difference is memory usage,xfind
done its jobs by using a 8KB-only buffer, butmemmem
needed to read all the contents of the file into a file-sized buffer (767KB in this case). -
xfind
provides no advantage when searching through in-memory buffers (nearly 2x slower) (group 3), so please don't use it for in-memory searches. -
When searching only one substrings,
xfind
beatsaho-corasick
in all cases above (group 1, 3), which is still fair becauseaho-corasick
is mainly used for searching multiple substrings at once. -
Reverse stream searches are by its nature much slower than forward stream searches (group 2, 4). The performances of
xfind
andmemmem
are pretty close, only memory usages differ.
Examples
- Checks if a substring exists in a file.
use std::fs::File;
fn main() -> std::io::Result<()> {
let mut rdr = File::open("foo.txt")?;
let found = xfind::find(b"bar", &mut rdr).is_some();
Ok(())
}
- Gets the indexes of the first 10 occurrences of a substring in a file.
use std::fs::File;
use std::io;
fn main() -> io::Result<()> {
let mut rdr = File::open("foo.txt")?;
let indexes = xfind::find_iter(b"bar", &mut rdr)
.take(10)
.collect::<io::Result<Vec<usize>>>()?;
println!("{:?}", indexes);
Ok(())
}
- Constructs a searcher once and searches for the same needle in multiple streams.
use std::fs::File;
use std::io;
use xfind::StreamFinder;
fn main() -> io::Result<()> {
let mut f1 = File::open("foo.txt")?;
let mut f2 = File::open("bar.txt")?;
let mut finder = StreamFinder::new(b"baz");
let found_in_f1 = finder.find(&mut f1).is_some();
let found_in_f2 = finder.find(&mut f2).is_some();
Ok(())
}
- Reads the last line of a file, without loading the entire contents of the file into memory.
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;
fn main() -> io::Result<()> {
let path = "foo.txt";
let mut buf = Vec::new();
read_last_line(path, &mut buf)?;
// For simplicity, we just assume the contents is valid UTF-8 and unwrap here.
println!("{}", std::str::from_utf8(&buf).unwrap());
Ok(())
}
fn read_last_line<P: AsRef<Path>>(
path: P,
buf: &mut Vec<u8>,
) -> io::Result<usize> {
let mut f = File::open(path)?;
let mut iter = xfind::rfind_iter(b"\n", &mut f)?;
let read_pos = match iter.next().transpose()? {
// if the file contains no newline, we read from the start.
None => 0,
// if the file ends with a newline, we need to perform another search.
Some(pos) if pos + 1 == iter.stream_len() => {
(iter.next().transpose()?.map(|x| x + 1)).unwrap_or(0)
}
// if the file doesn't end with a newline, then `pos + 1` is the `read_pos`.
Some(pos) => pos + 1,
};
iter.seek_to(read_pos)?;
f.read_to_end(buf)
}
Dependencies
~110–250KB