14 releases (4 stable)
Version | Released |
---|---|
1.0.3 (new) | Feb 19, 2025 |
1.0.2 | Feb 18, 2025 |
0.3.0 | Oct 25, 2024 |
0.1.8 | Feb 12, 2023 |
0.1.1 | Sep 25, 2022 |
#165 in Text processing
268 downloads per month
140KB, 1.5K SLoC
YAKE (Yet Another Keyword Extractor)

Yake is a statistical keyword extractor. It weighs several factors such as acronyms, position in paragraph, capitalization, how many sentences the keyword appears in, stopwords, punctuation and more.
How it works
For Yake, a ✨keyphrase✨ is an n-gram (1-, 2- or 3-gram) that neither starts nor ends with a stopword, contains no numbers or punctuation, and has no terms that are too long or too short, among other filters.
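The candidate rules above can be sketched roughly as follows. This is only an illustration of the filtering idea, not the crate's implementation: `candidate_ngrams` is a hypothetical helper, and the length bounds (3 to 20 characters) and the decision to allow stopwords inside a longer n-gram are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Returns candidate n-grams (1 to `max_n` tokens) from one sentence.
/// Hypothetical helper for illustration; not part of the yake_rust API.
fn candidate_ngrams(tokens: &[&str], stopwords: &HashSet<&str>, max_n: usize) -> Vec<String> {
    let is_stop = |t: &str| stopwords.contains(t.to_lowercase().as_str());
    // A clean token has letters only, so digits and punctuation disqualify it.
    let is_clean = |t: &str| !t.is_empty() && t.chars().all(char::is_alphabetic);
    // Assumed length bounds; the real limits may differ.
    let good_len = |t: &str| (3..=20).contains(&t.chars().count());

    let mut out = Vec::new();
    for n in 1..=max_n {
        for window in tokens.windows(n) {
            // A candidate may not start or end with a stopword.
            if is_stop(window[0]) || is_stop(window[n - 1]) {
                continue;
            }
            // Inner stopwords are tolerated here; other tokens must pass the length check.
            if window.iter().all(|&t| is_clean(t) && (is_stop(t) || good_len(t))) {
                out.push(window.join(" "));
            }
        }
    }
    out
}
```

With the stopwords `of` and `in`, the tokens `Keyword extraction of documents in 2024` yield candidates such as `Keyword extraction` and `extraction of documents`, while anything touching `2024` is dropped.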
The input text is split into sentences and terms via the segtok crate. Yake assigns an importance score to each term in the text.
In the end, the most important terms:
- occur more frequently
- occur mostly at the beginning of the text
- occur in many different sentences
- tend to be Capitalized or UPPERCASED
- tend to have the same neighbouring terms
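For intuition on how these signals might combine, here is a minimal sketch of the scoring formula published for the original YAKE algorithm. It assumes this crate follows that formula; the struct, field and function names are hypothetical, and the computation of each feature value is left out.

```rust
/// Per-term feature values. Names are illustrative, not the crate's types.
struct TermFeatures {
    casing: f64,      // higher when the term is often Capitalized or UPPERCASED
    position: f64,    // lower when the term occurs near the start of the text
    frequency: f64,   // normalized term frequency
    relatedness: f64, // higher when the term has many different neighbours
    sentences: f64,   // share of sentences that contain the term
}

/// Term score as published for YAKE; a LOWER score means a MORE important term.
fn term_score(f: &TermFeatures) -> f64 {
    (f.relatedness * f.position)
        / (f.casing + f.frequency / f.relatedness + f.sentences / f.relatedness)
}

/// Keyphrase score from its member-term scores and how often the keyphrase occurs.
fn keyphrase_score(term_scores: &[f64], keyphrase_frequency: f64) -> f64 {
    let product: f64 = term_scores.iter().product();
    let sum: f64 = term_scores.iter().sum();
    product / (keyphrase_frequency * (1.0 + sum))
}
```

Note the direction: frequent, early, widely spread, capitalized terms with stable neighbours end up with small scores, so in this formulation ranking "most important first" means sorting scores in ascending order.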
✨Keyphrases✨ are ranked in order of importance (most important first).
Duplicates are then detected by Levenshtein distance and removed.
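The deduplication step can be pictured with the sketch below. It is an assumption-laden illustration rather than the crate's code: `dedup_ranked` and its `threshold` parameter are hypothetical, and similarity is taken here as one minus the Levenshtein distance divided by the longer string's length.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Walks keyphrases in ranked order and keeps each one unless it is too
/// similar to a keyphrase already kept. `threshold` is a hypothetical
/// similarity cut-off between 0.0 and 1.0.
fn dedup_ranked(ranked: &[&str], threshold: f64) -> Vec<String> {
    let mut kept: Vec<String> = Vec::new();
    for &cand in ranked {
        let is_dup = kept.iter().any(|k| {
            let longest = k.chars().count().max(cand.chars().count()).max(1);
            let similarity = 1.0 - levenshtein(k, cand) as f64 / longest as f64;
            similarity >= threshold
        });
        if !is_dup {
            kept.push(cand.to_string());
        }
    }
    kept
}
```

Because the input is already ranked, the better-ranked spelling of a near-duplicate pair is the one that survives.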
Example
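use yake_rust::{get_n_best, Config, StopWords};

fn main() {
    // Text to analyse, embedded at compile time.
    let text = include_str!("input.txt");
    // Consider keyphrases of up to three words; keep the other defaults.
    let config = Config { ngrams: 3, ..Config::default() };
    // Load the predefined English stopword list.
    let ignored = StopWords::predefined("en").unwrap();
    // Extract the ten most important keyphrases.
    let keywords = get_n_best(10, &text, &ignored, &config);
    println!("{:?}", keywords);
}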
Features
By default, stopwords for all languages are included. However, you can choose to include only specific ones:
[dependencies]
yake-rust = { version = "*", default-features = false, features = ["en", "de"] }
Dependencies
~4–6MB
~98K SLoC