2 releases

new 0.6.1	May 4, 2025
0.6.0	May 3, 2025

GPL-3.0 license

Sophia NLU Engine (cicero-sophia)

High-performance NLU (natural language understanding) engine built in Rust for speed, accuracy, and privacy.

Features

Core Capabilities

Industry-leading vocabulary with 914,000 (full) or 145,000 (lite) words
Sophisticated categorization system spanning 8,700+ hierarchical categories, allowing for easy word-to-action mapping
Advanced language processing including POS tagging, anaphora resolution, and named entity recognition
Intelligent phrase parsing with automated spelling correction
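
To illustrate what word-to-action mapping over hierarchical categories could look like, here is a minimal sketch. It is not part of Sophia's API: the `ActionMap` type, its methods, and the action names are hypothetical, and it only assumes that categories are identified by slash-separated paths such as `verbs/action/travel/depart` (the form used in Example 4 below). An action registered on a category applies to everything beneath it, and the most specific match wins.

```rust
use std::collections::HashMap;

/// Hypothetical action table keyed by category path (not part of Sophia's API).
#[derive(Default)]
pub struct ActionMap {
    actions: HashMap<String, String>,
}

impl ActionMap {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register an action for a category path and everything beneath it.
    pub fn register(&mut self, category: &str, action: &str) {
        self.actions.insert(category.to_string(), action.to_string());
    }

    /// Return the action of the most specific registered ancestor of `fqn`,
    /// or `None` if no ancestor (including `fqn` itself) is registered.
    pub fn resolve(&self, fqn: &str) -> Option<&str> {
        let mut path = fqn;
        loop {
            if let Some(action) = self.actions.get(path) {
                return Some(action.as_str());
            }
            // Drop the last path segment and retry with the parent category.
            match path.rfind('/') {
                Some(idx) => path = &path[..idx],
                None => return None,
            }
        }
    }
}
```

In an application, the path passed to `resolve` would come from the category of a recognized word, so a handler registered on `verbs/action/travel` would also cover `verbs/action/travel/depart`.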

Performance

Process ~25,000 words per second on a single thread
Lightweight deployment: Single 79MB (lite) or 177MB (full) data store
Zero external dependencies or API calls required
Privacy-focused with all processing done locally
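
As a rough capacity-planning aid, the throughput figure above can be turned into a time estimate. The sketch below is plain arithmetic, not a benchmark: it treats ~25,000 words per second as a constant, whereas real throughput will depend on hardware and input. The names `WORDS_PER_SECOND` and `estimate_duration` are introduced here for illustration only.

```rust
use std::time::Duration;

/// Approximate single-thread throughput quoted in this README.
pub const WORDS_PER_SECOND: f64 = 25_000.0;

/// Estimate how long a single thread needs to process `word_count` words,
/// assuming the quoted throughput holds for the whole input.
pub fn estimate_duration(word_count: u64) -> Duration {
    Duration::from_secs_f64(word_count as f64 / WORDS_PER_SECOND)
}
```

Under this assumption, a corpus of one million words would take about 40 seconds on one thread.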

GitHub: https://github.com/cicero-ai/cicero/

License

Sophia uses a dual-license model: it is free and open source for individual use under the GPLv3 license, while commercial use requires a premium license. For full details, including an online demo, please visit: https://cicero.sh/sophia/.

Installation

Add cicero-sophia to your project by including it in your Cargo.toml:

[dependencies]
cicero-sophia = "0.6.1"

Vocabulary Data Store

To use Sophia, you must obtain the vocabulary data store, which is available free of charge. Visit https://cicero.sh/, register for a free account, and download the vocabulary data store from the member's area.
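
Because the data store is downloaded separately, it can help to verify that it is in place before initializing the engine. The following is a minimal sketch using only the Rust standard library. The helper name `check_datadir` is hypothetical, and the check is deliberately generic (directory exists and is not empty), since the file layout of the data store is not described here.

```rust
use std::fs;
use std::path::Path;

/// Check that `datadir` exists, is a directory, and contains at least one entry.
/// This does not validate the contents of the data store itself.
pub fn check_datadir(datadir: &str) -> Result<(), String> {
    let path = Path::new(datadir);
    if !path.is_dir() {
        return Err(format!("vocabulary data directory not found: {}", datadir));
    }
    let mut entries = fs::read_dir(path)
        .map_err(|e| format!("cannot read {}: {}", datadir, e))?;
    if entries.next().is_none() {
        return Err(format!("vocabulary data directory is empty: {}", datadir));
    }
    Ok(())
}
```

Calling `check_datadir("./vocab_data")` before `Sophia::new` in the examples below would surface a missing download as a readable message rather than an initialization error.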

Usage

Example 1: Tokenizing Text

use sophia::{Sophia, Error};

fn main() -> Result<(), Error> {
    // Initialize Sophia
    let datadir = "./vocab_data";
    let sophia = Sophia::new(datadir, "en")?;

    // Tokenize the input text
    let output = sophia.tokenize("The quick brown fox jumps over the lazy dog")?;

    // Print individual tokens
    println!("Individual Tokens:");
    for token in output.iter() {
        println!("  Word: {} POS: {}", token.word, token.pos);
    }

    // Print MWEs
    println!("\nMulti-Word Entities (MWEs):");
    for token in output.mwe() {
        println!("  Word: {} POS: {}", token.word, token.pos);
    }

    Ok(())
}

Example 2: Interpreting Text


use sophia::{Sophia, Error};

fn main() -> Result<(), Error> {
    // Initialize Sophia
    let datadir = "./vocab_data";
    let sophia = Sophia::new(datadir, "en")?;

    // Interpret the input text
    let output = sophia.interpret("The quick brown fox jumps over the lazy dog")?;

    // Print phrases
    println!("Phrases:");
    for phrase in output.phrases.iter() {
        println!("  {:?}", phrase);
    }

    // Print individual tokens
    println!("\nIndividual Tokens:");
    for token in output.tokens.iter() {
        println!("  Word: {} POS: {}", token.word, token.pos);
    }

    Ok(())
}

Example 3: Retrieve Individual Word / Token


use sophia::{Sophia, Error};

fn main() -> Result<(), Error> {
    // Initialize Sophia
    let datadir = "./vocab_data";
    let sophia = Sophia::new(datadir, "en")?;

    // Look up a word by its text
    let token = sophia.get_word("future").unwrap();
    println!("Got word {}, id {}, pos {}", token.word, token.index, token.pos);

    // Look up a specific token by its numeric id
    let token = sophia.get_token(82251).unwrap();
    println!("Got word {}, id {}, pos {}", token.word, token.index, token.pos);

    Ok(())
}

Example 4: Retrieve Category


use sophia::{Sophia, Error};

fn main() -> Result<(), Error> {
    // Initialize Sophia
    let datadir = "./vocab_data";
    let sophia = Sophia::new(datadir, "en")?;

    // Get category
    let cat = sophia.get_category("verbs/action/travel/depart").unwrap();
    println!("name {}", cat.name);
    println!("fqn: {}", cat.fqn);
    println!("word ids: {:?}", cat.words);

    Ok(())
}

Contact

For all inquiries, please complete the contact form at: https://cicero.sh/contact
