### 11 releases (1 stable)

1.0.0 | Aug 1, 2022 |
---|---|

1.0.0-rc.1 | Jul 7, 2022 |

0.4.3 | Jun 7, 2022 |

0.4.1 | Mar 6, 2022 |

0.2.1 | Nov 1, 2021 |

#**39** in Text processing

**4,123** downloads per month

Used in **6** crates
(4 directly)

**MIT/Apache**

165KB

3K
SLoC

# π daachorse: Double-Array Aho-Corasick

A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.

The main technical ideas behind this library appear in this paper.

## Overview

Daachorse is a crate for fast multiple pattern matching using the Aho-Corasick algorithm, running in linear time over the length of the input text. This crate uses the compact double-array data structure for implementing the pattern match automaton for time and memory efficiency. The data structure not only supports constant-time state-to-state traversal but also represents each state in the space of only 12 bytes.

For example, compared to the NFA of the aho-corasick
crate, which is the most popular Aho-Corasick implementation in Rust, Daachorse can perform pattern
matching **3.0β5.2 times faster** while consuming **56β60% smaller** memory when using a word
dictionary of 675K patterns. Other experimental results are available on
Wiki.

## Requirements

Rust 1.58 or higher is required to build this crate.

## Example usage

Daachorse contains some search options, ranging from standard matching with the Aho-Corasick algorithm to trickier matching. They will run very fast based on the double-array data structure and can be easily plugged into your application, as shown below.

### Finding overlapped occurrences

To search for all occurrences of registered patterns that allow for positional overlap in the input
text, use

. When you use `find_overlapping_iter``(``)`

for construction, the library assigns a
unique identifier to each pattern in the input order. The match result has the byte positions of
the occurrence and its identifier.`new``(``)`

`use` `daachorse``::`DoubleArrayAhoCorasick`;`
`let` patterns `=` `vec!``[``"`bcd`"``,` `"`ab`"``,` `"`a`"``]``;`
`let` pma `=` `DoubleArrayAhoCorasick``::`new`(`patterns`)``.``unwrap``(``)``;`
`let` `mut` it `=` pma`.``find_overlapping_iter``(``"`abcd`"``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``0``,` `1``,` `2``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``0``,` `2``,` `1``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``1``,` `4``,` `0``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`assert_eq!``(``None``,` it`.``next``(``)``)``;`

### Finding non-overlapped occurrences with the standard matching

If you do not want to allow positional overlap, use

instead.
It performs the search on the Aho-Corasick automaton
and reports patterns first found in each iteration.`find_iter``(``)`

`use` `daachorse``::`DoubleArrayAhoCorasick`;`
`let` patterns `=` `vec!``[``"`bcd`"``,` `"`ab`"``,` `"`a`"``]``;`
`let` pma `=` `DoubleArrayAhoCorasick``::`new`(`patterns`)``.``unwrap``(``)``;`
`let` `mut` it `=` pma`.``find_iter``(``"`abcd`"``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``0``,` `1``,` `2``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``1``,` `4``,` `0``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`assert_eq!``(``None``,` it`.``next``(``)``)``;`

### Finding non-overlapped occurrences with the longest matching

If you want to search for the longest pattern without positional overlap in each iteration, use

with specifying `leftmost_find_iter``(``)`

in the construction.`MatchKind ::`LeftmostLongest

`use` `daachorse``::``{`DoubleArrayAhoCorasickBuilder`,` MatchKind`}``;`
`let` patterns `=` `vec!``[``"`ab`"``,` `"`a`"``,` `"`abcd`"``]``;`
`let` pma `=` `DoubleArrayAhoCorasickBuilder``::`new`(``)`
`.``match_kind``(``MatchKind``::`LeftmostLongest`)`
`.``build``(``&`patterns`)`
`.``unwrap``(``)``;`
`let` `mut` it `=` pma`.``leftmost_find_iter``(``"`abcd`"``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``0``,` `4``,` `2``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`assert_eq!``(``None``,` it`.``next``(``)``)``;`

### Finding non-overlapped occurrences with the leftmost-first matching

If you want to find the earliest registered pattern among ones starting from the search position,
use

with specifying `leftmost_find_iter``(``)`

.`MatchKind ::`LeftmostFirst

This is the so-called *leftmost first match*, a tricky search option supported in the
aho-corasick crate. For example, in the following
code,

is reported because it is the earliest registered one.`ab`

`use` `daachorse``::``{`DoubleArrayAhoCorasickBuilder`,` MatchKind`}``;`
`let` patterns `=` `vec!``[``"`ab`"``,` `"`a`"``,` `"`abcd`"``]``;`
`let` pma `=` `DoubleArrayAhoCorasickBuilder``::`new`(``)`
`.``match_kind``(``MatchKind``::`LeftmostFirst`)`
`.``build``(``&`patterns`)`
`.``unwrap``(``)``;`
`let` `mut` it `=` pma`.``leftmost_find_iter``(``"`abcd`"``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``0``,` `2``,` `0``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`assert_eq!``(``None``,` it`.``next``(``)``)``;`

### Associating arbitrary values with patterns

To build the automaton from pairs of a pattern and user-defined value, instead of assigning identifiers
automatically, use

.`with_values``(``)`

`use` `daachorse``::`DoubleArrayAhoCorasick`;`
`let` patvals `=` `vec!``[``(``"`bcd`"``,` `0``)``,` `(``"`ab`"``,` `10``)``,` `(``"`a`"``,` `20``)``]``;`
`let` pma `=` `DoubleArrayAhoCorasick``::`with_values`(`patvals`)``.``unwrap``(``)``;`
`let` `mut` it `=` pma`.``find_overlapping_iter``(``"`abcd`"``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``0``,` `1``,` `20``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``0``,` `2``,` `10``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``1``,` `4``,` `0``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`assert_eq!``(``None``,` it`.``next``(``)``)``;`

### Building faster automata on multibyte characters

To build a faster automaton on multibyte characters, use

instead.`CharwiseDoubleArrayAhoCorasick`

The standard version

handles strings as UTF-8 sequences and defines
transition labels using byte values. On the other hand, `DoubleArrayAhoCorasick`

uses
Unicode code point values, reducing the number of transitions and faster matching.`CharwiseDoubleArrayAhoCorasick`

`use` `daachorse``::`CharwiseDoubleArrayAhoCorasick`;`
`let` patterns `=` `vec!``[``"`ε
¨δΈη`"``,` `"`δΈη`"``,` `"`γ«`"``]``;`
`let` pma `=` `CharwiseDoubleArrayAhoCorasick``::`new`(`patterns`)``.``unwrap``(``)``;`
`let` `mut` it `=` pma`.``find_iter``(``"`ε
¨δΈηδΈγ«`"``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``0``,` `9``,` `0``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`let` m `=` it`.``next``(``)``.``unwrap``(``)``;`
`assert_eq!``(``(``12``,` `15``,` `2``)``,` `(`m`.``start``(``)``,` m`.``end``(``)``,` m`.``value``(``)``)``)``;`
`assert_eq!``(``None``,` it`.``next``(``)``)``;`

`no_std`

`no_std`

Daachorse has no dependency on

(but requires a global allocator with the `std`

crate).`alloc`

## CLI

This repository contains a command-line interface named

for searching patterns in text
files.`daacfind`

`%` cat `.``/`pat`.`txt
`fn`
`const` `fn`
`pub` `fn`
`unsafe` `fn`
`%` find `.` `-`name `"`*.rs`"` `|` xargs cargo run `-``-`release `-`p daacfind `-``-` `-``-`color`=`auto `-`nf `.``/`pat`.`txt
`...`
`...`
`.``/`src`/`errors`.`rs`:``67``:` `fn` `fmt``(``&``self`, `f``:` `&``mut` `fmt``::`Formatter`)`` ``->` `fmt``::`Result `{`
`.``/`src`/`errors`.`rs`:``81``:` `fn` `fmt``(``&``self`, `f``:` `&``mut` `fmt``::`Formatter`)` `->` `fmt``::`Result `{`
`.``/`src`/`lib`.`rs`:``115``:` `fn` `default``(``)` `->` `Self` `{`
`.``/`src`/`lib`.`rs`:``126``:` `pub` `fn` `base``(``&``self``)` `->` `Option``<``u32``>` `{`
`.``/`src`/`lib`.`rs`:``131``:` `pub` `const` `fn` `check``(``&``self``)` `->` `u8` `{`
`.``/`src`/`lib`.`rs`:``136``:` `pub` `const` `fn` `fail``(``&``self``)` `->` `u32` `{`
`...`
`...`

## License

Licensed under either of

- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

For software under

, follow the license terms of each software.`bench /data`

## Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

#### No runtime deps

#### Features

- alloc