#trie #language #index #node #reading #reader #string

nutrimatic

Tools for reading Nutrimatic (https://nutrimatic.org) index files

2 releases

0.1.1 Apr 2, 2021
0.1.0 Apr 1, 2021

#1735 in Text processing

MIT/Apache

19KB
325 lines

Nutrimatic index file reader for Rust

This is a Rust library for reading Nutrimatic index files, which store language frequency data in a trie.


lib.rs:

An API for reading Nutrimatic index files.

See the Nutrimatic source code for a full description of the file format and for instructions on creating an index file.

An index file is passed in as a &[u8] holding the contents of the file. Typically this slice is created by memory-mapping a file on disk, though reading the index fully into memory also works if it fits.
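As a minimal sketch of the read-into-memory option, the helper below loads a whole file into a Vec<u8> using only the standard library. The name load_index_bytes is hypothetical and not part of this crate; memory-mapping would need an external crate and is not shown here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads an entire index file into memory.
///
/// Hypothetical helper for illustration; it is not provided by the
/// nutrimatic crate. It suits indexes small enough to fit in RAM.
fn load_index_bytes(path: &Path) -> io::Result<Vec<u8>> {
    // fs::read allocates a buffer sized to the file and fills it.
    fs::read(path)
}
```

The returned buffer can then be borrowed as a slice and handed to the reader, for example `let root = nutrimatic::Node::new(&bytes);`, where `bytes` must outlive every node derived from it.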

An index file describes a trie of strings; edges are labeled with characters (ASCII space, digits, and letters) and each node stores the total frequency in some corpus of all phrases starting with the sequence of characters leading up to the node.
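To make the data model concrete, here is a small in-memory sketch of such a trie. It is purely illustrative: the real index is a compact serialized format, and the names TrieNode, insert, and prefix_freq are assumptions made for this example, not part of the crate's API.

```rust
use std::collections::BTreeMap;

/// Conceptual model of one trie node (hypothetical; not the on-disk layout).
#[derive(Default)]
struct TrieNode {
    /// Total frequency of all phrases that start with the prefix
    /// leading to this node.
    freq: u64,
    /// Outgoing edges, keyed by the edge's character (as an ASCII byte:
    /// space, digit, or letter). BTreeMap keeps them in sorted order.
    children: BTreeMap<u8, TrieNode>,
}

impl TrieNode {
    /// Adds `freq` occurrences of `phrase`, updating every prefix node
    /// along the path, including the root (the empty prefix).
    fn insert(&mut self, phrase: &str, freq: u64) {
        self.freq += freq;
        let mut node = self;
        for &b in phrase.as_bytes() {
            node = node.children.entry(b).or_default();
            node.freq += freq;
        }
    }

    /// Returns the total frequency of phrases starting with `prefix`,
    /// or None if no stored phrase has that prefix.
    fn prefix_freq(&self, prefix: &str) -> Option<u64> {
        let mut node = self;
        for b in prefix.bytes() {
            node = node.children.get(&b)?;
        }
        Some(node.freq)
    }
}
```

In this sketch, inserting "ru " with frequency 17 and "st " with frequency 18 gives the node reached by "r" a frequency of 17, while the root accumulates 35. A trailing space marks the word boundary.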

This library performs no consistency checking of index files. Using an invalid file can cause arbitrary panics or garbage results (but no memory unsafety), so avoid doing so.

Examples

use nutrimatic::Node;

// Collect all phrases in the trie in alphabetical order along with their
// frequencies.
fn collect(node: &Node, word: &mut String, out: &mut Vec<(String, u64)>) {
    for child in &node.children() {
        // The space indicates that this transition corresponds to a word
        // boundary.
        if child.ch() == b' ' {
            out.push((word.clone(), child.freq()));
        }
        word.push(child.ch() as char);
        collect(&child, word, out);
        word.pop();
    }
}

fn main() {
    // This buffer describes a trie containing the words "ru" and "st"; a
    // trie would normally be generated ahead of time by external tools. The
    // byte values are written a bit oddly to hint at each one's purpose in
    // the serialization.
    let buf: &[u8] = &[
        ' ' as u8, 17, 0x00 | 1,
        'u' as u8, 17, 0, 0x80 | 1,
        ' ' as u8, 18, 0x00 | 1,
        't' as u8, 18, 0, 0x80 | 1,
        'r' as u8, 17, 7, 's' as u8, 18, 0, 0x80 | 2,
    ];

    let root = Node::new(buf);

    let mut words = vec![];
    collect(&root, &mut String::new(), &mut words);
    assert_eq!(words, vec![("ru".to_owned(), 17), ("st".to_owned(), 18)]);
}

Dependencies

~120KB