#nlp #extractor #keyword

yake-rust

Yake (Yet Another Keyword Extractor) in Rust

14 releases (4 stable)

new 1.0.3 Feb 19, 2025
1.0.2 Feb 18, 2025
0.3.0 Oct 25, 2024
0.1.8 Feb 12, 2023
0.1.1 Sep 25, 2022

#165 in Text processing

268 downloads per month

MIT license

140KB
1.5K SLoC

YAKE (Yet Another Keyword Extractor)

Yake is a statistical keyword extractor. It weighs several factors such as acronyms, position in paragraph, capitalization, how many sentences the keyword appears in, stopwords, punctuation and more.

How it works

For Yake, a ✨keyphrase✨ is an n-gram (a 1-, 2-, or 3-gram) that neither starts nor ends with a stopword, contains no numbers or punctuation, includes no overly long or overly short terms, and so on.
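
To make the definition concrete, here is a minimal sketch of candidate filtering using only the standard library. It is not the crate's implementation: `candidate_ngrams` and `is_clean_term` are hypothetical helpers, the stopword set is a toy one, and the term-length limits are left out because their thresholds are not given here.

```rust
use std::collections::HashSet;

/// A term is "clean" when it is non-empty and purely alphabetic
/// (no digits, no punctuation). Hypothetical helper, not a yake-rust function.
fn is_clean_term(term: &str) -> bool {
    !term.is_empty() && term.chars().all(|c| c.is_alphabetic())
}

/// Collects all 1..=max_n grams of one sentence that neither start nor end
/// with a stopword and contain only clean terms.
fn candidate_ngrams(tokens: &[&str], stopwords: &HashSet<&str>, max_n: usize) -> Vec<String> {
    let mut candidates = Vec::new();
    for n in 1..=max_n {
        for window in tokens.windows(n) {
            let first = window[0].to_lowercase();
            let last = window[n - 1].to_lowercase();
            // Reject n-grams with a stopword at either edge.
            if stopwords.contains(first.as_str()) || stopwords.contains(last.as_str()) {
                continue;
            }
            // Reject n-grams containing numbers or punctuation.
            if !window.iter().all(|t| is_clean_term(t)) {
                continue;
            }
            candidates.push(window.join(" "));
        }
    }
    candidates
}

fn main() {
    let stopwords: HashSet<&str> = ["the", "of", "is"].iter().copied().collect();
    let tokens = ["the", "state", "of", "Rust", "2024"];
    for candidate in candidate_ngrams(&tokens, &stopwords, 3) {
        println!("{candidate}");
    }
}
```

Note that a stopword may still appear in the middle of a candidate ("state of Rust"); only the edges are checked.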

The input text is split into sentences and terms via the segtok crate. Yake assigns an importance score to each term in the text.

In the end, the most important terms:

  • occur more frequently
  • occur mostly at the beginning of the text
  • occur in many different sentences
  • tend to be Capitalized or UPPERCASED
  • tend to have the same neighbour terms
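
The first four signals above can be gathered in a single pass over the tokenized sentences. The sketch below is only an illustration: `TermStats` and `collect_stats` are hypothetical names, not yake-rust types, and skipping the capital at the start of a sentence is an assumption. How the crate combines these signals into a single score is not shown.

```rust
use std::collections::{HashMap, HashSet};

/// Toy per-term statistics (hypothetical struct, not part of yake-rust).
#[derive(Debug, Default, PartialEq)]
struct TermStats {
    frequency: usize,      // total occurrences in the text
    first_sentence: usize, // index of the first sentence containing the term
    sentences: usize,      // number of distinct sentences containing the term
    cased: usize,          // occurrences starting with an uppercase letter
}

/// Gathers statistics for every lowercased term in pre-tokenized sentences.
fn collect_stats(sentences: &[Vec<&str>]) -> HashMap<String, TermStats> {
    let mut stats: HashMap<String, TermStats> = HashMap::new();
    for (idx, sentence) in sentences.iter().enumerate() {
        let mut seen: HashSet<String> = HashSet::new();
        for (pos, token) in sentence.iter().enumerate() {
            let key = token.to_lowercase();
            let entry = stats.entry(key.clone()).or_insert_with(|| TermStats {
                first_sentence: idx,
                ..TermStats::default()
            });
            entry.frequency += 1;
            // Assumption: a capital at the start of a sentence carries no signal.
            if pos > 0 && token.chars().next().map_or(false, char::is_uppercase) {
                entry.cased += 1;
            }
            // Count each sentence at most once per term.
            if seen.insert(key) {
                entry.sentences += 1;
            }
        }
    }
    stats
}

fn main() {
    let sentences = vec![vec!["Rust", "is", "fast"], vec!["We", "love", "Rust"]];
    let stats = collect_stats(&sentences);
    println!("{:?}", stats["rust"]);
}
```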

✨Keyphrases✨ are ranked in order of importance (most important first).

Duplicates are then detected by Levenshtein distance and removed.
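
As an illustration of this deduplication step, the sketch below walks a ranked list and drops any keyphrase that is too similar to one already kept. It is a sketch under assumptions, not the crate's code: `levenshtein`, `similarity`, and `dedup_ranked` are hypothetical helpers, and both the normalization by the longer string and the 0.9 threshold are assumptions.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + if ca == cb { 0 } else { 1 };
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Similarity in [0, 1]: 1.0 means identical strings.
fn similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

/// Keeps the first (most important) keyphrase of every group of near-duplicates.
fn dedup_ranked(ranked: &[&str], threshold: f64) -> Vec<String> {
    let mut kept: Vec<String> = Vec::new();
    for candidate in ranked {
        let is_duplicate = kept.iter().any(|k| similarity(k, candidate) > threshold);
        if !is_duplicate {
            kept.push(candidate.to_string());
        }
    }
    kept
}

fn main() {
    let ranked = ["machine learning", "machine learnings", "rust"];
    println!("{:?}", dedup_ranked(&ranked, 0.9));
}
```

Because the input is already ordered by importance, the better-ranked variant of each near-duplicate pair is the one that survives.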

Example

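use yake_rust::{get_n_best, Config, StopWords};

fn main() {
    // Embed the text to analyse at compile time (path is relative to this source file).
    let text = include_str!("input.txt");

    // Consider keyphrases of up to three words; keep every other default.
    let config = Config { ngrams: 3, ..Config::default() };
    // Load the bundled English stopword list.
    let ignored = StopWords::predefined("en").unwrap();
    // Extract the 10 most important keyphrases, most important first.
    let keywords = get_n_best(10, &text, &ignored, &config);

    println!("{:?}", keywords);
}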

Features

By default, stopwords for all languages are included. However, you can choose to include only specific ones:

[dependencies]
yake-rust = { version = "*", default-features = false, features = ["en", "de"] }

Dependencies

~4–6MB
~98K SLoC