#fst #english #head #extract #speech #part #info

bin+lib wiktionary-part-of-speech-extract

English Wiktionary parsed for part-of-speech info and placed into a precompiled FST

3 releases

0.1.2 Aug 8, 2021
0.1.1 May 18, 2021
0.1.0 Apr 2, 2021

#799 in Audio


Used in 2 crates (via layered-part-of-speech)

MIT/Apache

2MB
5.5K SLoC

F* 5K SLoC // 0.2% comments
Rust 483 SLoC // 0.0% comments

wiktionary-part-of-speech-extract

./sample.xml is only the head of the full Wikimedia enwiktionary-20210320-pages-articles.xml download.

The generator itself is meant to parse the complete dump; the sample only allows a quick test run:

cargo run ./sample.xml
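
To give a feel for the kind of extraction involved, here is a minimal, standard-library-only sketch. It is not the crate's actual parser. It assumes that a page's wikitext marks the language with a level-2 heading such as ==English== and parts of speech with deeper headings such as ===Noun===. The function name and the short list of headings are hypothetical and chosen only for illustration.

```rust
/// Hypothetical helper (not part of this crate): collect the part-of-speech
/// headings that appear inside the `==English==` section of one page's wikitext.
fn english_pos_headings(wikitext: &str) -> Vec<String> {
    // Illustrative subset of headings; a real dump contains many more.
    const KNOWN: [&str; 4] = ["Noun", "Verb", "Adjective", "Adverb"];

    let mut in_english = false;
    let mut found: Vec<String> = Vec::new();

    for line in wikitext.lines() {
        let line = line.trim();
        // Only heading lines such as `==English==` or `===Noun===` matter here.
        if !(line.starts_with("==") && line.ends_with("==")) {
            continue;
        }
        // Heading level is the number of leading '=' characters.
        let level = line.chars().take_while(|&c| c == '=').count();
        let name = line.trim_matches('=').trim();

        if level == 2 {
            // A level-2 heading starts a new language section.
            in_english = name == "English";
        } else if in_english
            && KNOWN.iter().any(|k| *k == name)
            && !found.iter().any(|f| f.as_str() == name)
        {
            // Record each part of speech once, in order of first appearance.
            found.push(name.to_string());
        }
    }
    found
}
```

Called on a page body containing ==English== followed by ===Noun=== and ===Verb===, the helper returns those two names and ignores headings under other language sections.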

lib.rs:

cargo run regenerate --release enwiktionary-pages-*.xml # regenerate "words.fst" binary
cargo publish # publish lib including "words.fst" binary

Usage

use wiktionary_part_of_speech_extract::{ENGLISH_TAG_LOOKUP, TagSet, Tag};

assert_eq!(Some(TagSet::of(&[Tag::Noun, Tag::Verb])), ENGLISH_TAG_LOOKUP.get("harbor"));
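
One plausible way to store a set of tags per word in an FST is to pack the set into a single integer, since FST map implementations such as the fst crate associate each key with a u64 value. The sketch below illustrates that idea only. It is an assumption about the general approach, not this crate's actual representation, and PosTag and PosSet are hypothetical stand-ins for the real Tag and TagSet types.

```rust
/// Hypothetical tag enum; the real crate's `Tag` may have different variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum PosTag {
    Noun = 0,
    Verb = 1,
    Adjective = 2,
    Adverb = 3,
}

/// Hypothetical bitset with one bit per tag, small enough to be stored
/// as the u64 value attached to a word in an FST map.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
struct PosSet(u64);

impl PosSet {
    /// Build a set from a slice of tags; order and duplicates do not matter.
    fn of(tags: &[PosTag]) -> Self {
        PosSet(tags.iter().fold(0u64, |bits, &t| bits | (1u64 << (t as u8))))
    }

    /// Check whether the set includes the given tag.
    fn contains(self, tag: PosTag) -> bool {
        (self.0 & (1u64 << (tag as u8))) != 0
    }
}
```

Because the set is just a bit pattern, two sets built from the same tags in a different order compare equal, which matches how the usage example above compares an expected set against the lookup result.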

Dependencies

~1.6–2.9MB
~33K SLoC