#nlp #split #words #string #english #wikipedia #frequencies

bin+lib untanglr

Probabilistically split concatenated words using NLP based on English Wikipedia unigram frequencies

2 stable releases

1.1.0 Sep 2, 2022
1.0.0 Jan 31, 2022
0.6.0 Jul 29, 2021
0.5.0 Jul 28, 2021
0.3.0 Jul 22, 2021

#850 in Text processing

MIT license

4MB
139 lines

Untanglr

Untanglr takes in some mangled words and makes sense of them so you don't have to. It goes through the input and splits it probabilistically into words. The crate includes both a bin.rs and a lib.rs, so it can be used as a command line utility or as a library in your own code.
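
To illustrate what "splitting probabilistically" can look like, here is a minimal sketch of the wordninja-style approach: each word gets a cost derived from its frequency rank (assuming Zipf's law), and dynamic programming picks the split with the lowest total cost. This is not the crate's actual code. The names build_costs, split_words and UNKNOWN_COST are hypothetical, and a real model would use the full Wikipedia unigram list rather than a handful of words.

```rust
use std::collections::HashMap;

/// Build a cost table from a word list ordered from most to least frequent.
/// Assuming Zipf's law, the word at 0-based rank i gets cost ln((i + 1) * ln(n)),
/// so frequent words are cheap and rare words are expensive.
/// (Hypothetical helper; use a list of at least 3 words so costs stay positive.)
fn build_costs(words_by_frequency: &[&str]) -> HashMap<String, f64> {
    let n = words_by_frequency.len() as f64;
    words_by_frequency
        .iter()
        .enumerate()
        .map(|(rank, word)| (word.to_string(), ((rank as f64 + 1.0) * n.ln()).ln()))
        .collect()
}

/// Split `input` into the sequence of words with the lowest total cost.
fn split_words(input: &str, costs: &HashMap<String, f64>) -> Vec<String> {
    // Hypothetical penalty for chunks that are not in the word list.
    const UNKNOWN_COST: f64 = 1e9;

    let chars: Vec<char> = input.chars().collect();
    let max_len = costs.keys().map(|w| w.chars().count()).max().unwrap_or(1);

    // best[i] = (lowest cost of splitting the first i chars, length of the last word)
    let mut best: Vec<(f64, usize)> = vec![(0.0, 0)];
    for i in 1..=chars.len() {
        let mut candidate = (f64::INFINITY, 1usize);
        for len in 1..=max_len.min(i) {
            let word: String = chars[i - len..i].iter().collect();
            let word_cost = costs.get(&word).copied().unwrap_or(UNKNOWN_COST);
            let total = best[i - len].0 + word_cost;
            if total < candidate.0 {
                candidate = (total, len);
            }
        }
        best.push(candidate);
    }

    // Walk backwards through the table to recover the chosen words.
    let mut words = Vec::new();
    let mut i = chars.len();
    while i > 0 {
        let len = best[i].1;
        words.push(chars[i - len..i].iter().collect::<String>());
        i -= len;
    }
    words.reverse();
    words
}
```

With a toy list such as ["the", "over", "dog", "fox", "quick", "brown", "lazy", "jumped", "hello", "world"] passed to build_costs, calling split_words("helloworld", &costs) yields ["hello", "world"].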

Usage

Pass the tangled words as a CLI argument:

$ untanglr thequickbrownfoxjumpedoverthelazydog
the quick brown fox jumped over the lazy dog

Or use it in your projects:

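extern crate untanglr;

fn main() {
    // Build the language model once and reuse it for every input.
    let lm = untanglr::LanguageModel::new();
    // Print the result of untangling the input using its Debug formatting.
    println!("{:?}", lm.untangle("helloworld"));
}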

Installation

If untanglr would be useful on your machine, you can install it. Make sure cargo is installed, then run:

$ cargo install untanglr

Note: Don't be discouraged if this project hasn't been updated in a while. I will address potential issues, but the crate does not need regular updates.

Credits

I developed this project around Derek Anderson's wordninja Python implementation, as a way to practice Rust while producing something useful.
