1 unstable release: 0.1.0 (Sep 21, 2021)
Lexmatch
This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text. It can be used to compute a frequency list for a lexicon on a target corpus.
The implementation uses suffix arrays. The text must be plain-text UTF-8 and is limited to 2^32 bytes (about 4GB).
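To illustrate the general idea behind suffix-array matching, here is a minimal sketch. It is not the code lexmatch uses: the function names (build_suffix_array, find_all, frequency_list) are hypothetical, the construction is the naive sort-all-suffixes approach, and matching is plain substring matching with no attention to word boundaries, which may differ from what the tool does. Offsets are stored as u32, which shows why a limit of 2^32 bytes arises naturally with this kind of index.

```rust
use std::collections::BTreeMap;

/// Build a suffix array by sorting all suffix start offsets.
/// Naive construction for illustration only; real tools use faster algorithms.
/// Offsets are u32, so the text may not exceed 2^32 bytes.
fn build_suffix_array(text: &[u8]) -> Vec<u32> {
    assert!(text.len() <= u32::MAX as usize, "text too large for u32 offsets");
    let mut sa: Vec<u32> = (0..text.len() as u32).collect();
    sa.sort_by(|&a, &b| text[a as usize..].cmp(&text[b as usize..]));
    sa
}

/// The suffix starting at `start`, truncated to at most `len` bytes.
fn truncated_suffix(text: &[u8], start: u32, len: usize) -> &[u8] {
    let suffix = &text[start as usize..];
    &suffix[..suffix.len().min(len)]
}

/// Return the byte offsets (ascending) of every occurrence of `pattern`.
/// All suffixes that start with `pattern` are adjacent in the suffix array,
/// so two binary searches delimit the range of matches.
fn find_all(text: &[u8], sa: &[u32], pattern: &[u8]) -> Vec<u32> {
    if pattern.is_empty() {
        return Vec::new();
    }
    let lo = sa.partition_point(|&i| truncated_suffix(text, i, pattern.len()) < pattern);
    let hi = sa.partition_point(|&i| truncated_suffix(text, i, pattern.len()) <= pattern);
    let mut hits = sa[lo..hi].to_vec();
    hits.sort_unstable();
    hits
}

/// Count the occurrences of each lexicon entry in the text.
fn frequency_list(text: &str, lexicon: &[&str]) -> BTreeMap<String, usize> {
    let bytes = text.as_bytes();
    let sa = build_suffix_array(bytes);
    lexicon
        .iter()
        .map(|entry| (entry.to_string(), find_all(bytes, &sa, entry.as_bytes()).len()))
        .collect()
}
```

The suffix array is built once per text, after which each lexicon entry costs only two binary searches. That is what makes this approach attractive when a large lexicon is matched against a single corpus.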
Installation
You can build and install the latest stable release using Rust's package manager:
cargo install lexmatch
or if you want the development version after cloning this repository:
cargo install --path .
No cargo/rust on your system yet? Do sudo apt install cargo on Debian/Ubuntu based systems, brew install rust on mac, or use rustup.
Usage
See lexmatch --help.
Simple example:
$ lexmatch --lexicon lexicon.lst --text corpus.txt
The lexicon must be plain-text UTF-8 containing one entry per line. An entry need not be a single word and is not constrained in length.
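As a rough sketch of what the one-entry-per-line format implies, a lexicon could be read like this. The function name parse_lexicon is hypothetical, and skipping empty lines is an assumption of this sketch, not documented behaviour of lexmatch.

```rust
/// Split lexicon contents into entries, one per line.
/// Assumption: empty lines are skipped. Entries may contain spaces,
/// so a multi-word phrase stays a single entry.
fn parse_lexicon(contents: &str) -> Vec<&str> {
    contents.lines().filter(|line| !line.is_empty()).collect()
}
```

Given the contents "cat\nthe mat\n", this yields two entries, "cat" and "the mat".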