33 releases (15 breaking)

0.21.0 Jan 23, 2023
0.19.1 Jan 8, 2023
0.19.0 Dec 20, 2022
0.18.0 Oct 27, 2022
0.1.0 Feb 25, 2020

#425 in Text processing

381 downloads per month
Used in 6 crates (3 directly)

MIT license

26KB
264 lines

Lindera tokenizer for Tantivy

License: MIT
Join the chat at https://gitter.im/lindera-morphology/lindera

Lindera Tokenizer for Tantivy.

Usage

Make sure the dictionaries you need are enabled as Lindera features in Cargo.toml. The following example enables IPADIC.

[dependencies]
lindera-tantivy = { version = "0.12.0", features = ["ipadic"] }

Available dictionary features:
  • ipadic: Japanese dictionary
  • unidic: Japanese dictionary
  • ko-dic: Korean dictionary
  • cc-cedict: Chinese dictionary

Basic example

use tantivy::{
    collector::TopDocs,
    doc,
    query::QueryParser,
    schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions},
    Index,
};

use lindera_tantivy::tokenizer::LinderaTokenizer;

fn main() -> tantivy::Result<()> {
    // create schema builder
    let mut schema_builder = Schema::builder();

    // add id field
    let id = schema_builder.add_text_field(
        "id",
        TextOptions::default()
            .set_indexing_options(
                TextFieldIndexing::default()
                    .set_tokenizer("raw")
                    .set_index_option(IndexRecordOption::Basic),
            )
            .set_stored(),
    );

    // add title field
    let title = schema_builder.add_text_field(
        "title",
        TextOptions::default()
            .set_indexing_options(
                TextFieldIndexing::default()
                    .set_tokenizer("lang_ja")
                    .set_index_option(IndexRecordOption::WithFreqsAndPositions),
            )
            .set_stored(),
    );

    // add body field
    let body = schema_builder.add_text_field(
        "body",
        TextOptions::default()
            .set_indexing_options(
                TextFieldIndexing::default()
                    .set_tokenizer("lang_ja")
                    .set_index_option(IndexRecordOption::WithFreqsAndPositions),
            )
            .set_stored(),
    );

    // build schema
    let schema = schema_builder.build();

    // create index in memory
    let index = Index::create_in_ram(schema.clone());

    // register Lindera tokenizer
    index
        .tokenizers()
        .register("lang_ja", LinderaTokenizer::default());

    // create index writer
    let mut index_writer = index.writer(50_000_000)?;

    // add document 1 (Narita International Airport)
    index_writer.add_document(doc!(
        id => "1",
        title => "成田国際空港",
        body => "成田国際空港(なりたこくさいくうこう、英: Narita International Airport)は、千葉県成田市南東部から芝山町北部にかけて建設された日本最大の国際拠点空港である。首都圏東部(東京の東60km)に位置している。空港コードはNRT。"
    ))?;

    // add document 2 (Tokyo International Airport)
    index_writer.add_document(doc!(
        id => "2",
        title => "東京国際空港",
        body => "東京国際空港(とうきょうこくさいくうこう、英語: Tokyo International Airport)は、東京都大田区にある日本最大の空港。通称は羽田空港(はねだくうこう、英語: Haneda Airport)であり、単に「羽田」と呼ばれる場合もある。空港コードはHND。"
    ))?;

    // add document 3 (Kansai International Airport)
    index_writer.add_document(doc!(
        id => "3",
        title => "関西国際空港",
        body => "関西国際空港(かんさいこくさいくうこう、英: Kansai International Airport)は大阪市の南西35㎞に位置する西日本の国際的な玄関口であり、関西三空港の一つとして大阪国際空港(伊丹空港)、神戸空港とともに関西エアポート株式会社によって一体運営が行われている。"
    ))?;

    // commit
    index_writer.commit()?;

    // create reader
    let reader = index.reader()?;

    // create searcher
    let searcher = reader.searcher();

    // create query parser
    let query_parser = QueryParser::for_index(&index, vec![title, body]);

    // parse query
    let query_str = "東京";
    let query = query_parser.parse_query(query_str)?;
    println!("Query String: {}", query_str);

    // search
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("Search Result:");
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("{}", schema.to_json(&retrieved_doc));
    }

    Ok(())
}
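
The example above refers to tokenizers by name: the id field asks for "raw", the title and body fields ask for "lang_ja", and the index registers a LinderaTokenizer under "lang_ja". The following std-only sketch illustrates that lookup-by-name pattern and why the choice of tokenizer matters for Japanese text. Everything in it is hypothetical and for illustration only: `TokenizeFn`, `raw`, `toy_ja`, `tokenize`, and `build_registry` are not part of tantivy or Lindera, and the greedy longest-match word list is a stand-in, not how Lindera analyzes text.

```rust
use std::collections::HashMap;

// A tokenizer is modeled here as a plain function from text to tokens.
type TokenizeFn = fn(&str) -> Vec<String>;

// Stand-in for a "raw" tokenizer: the whole input becomes a single token.
fn raw(text: &str) -> Vec<String> {
    vec![text.to_string()]
}

// Stand-in for a dictionary-based tokenizer (assumption: a tiny hard-coded
// word list with greedy longest match, falling back to single characters).
fn toy_ja(text: &str) -> Vec<String> {
    const WORDS: [&str; 5] = ["東京", "成田", "関西", "国際", "空港"];
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut end = i + 1; // default: emit one character
        // Try the longest candidate starting at `i` first.
        for j in (i + 1..=chars.len()).rev() {
            let candidate: String = chars[i..j].iter().collect();
            if WORDS.contains(&candidate.as_str()) {
                end = j;
                break;
            }
        }
        tokens.push(chars[i..end].iter().collect());
        i = end;
    }
    tokens
}

// Hypothetical lookup: find the tokenizer registered under `name` and run it.
// Returns None when nothing is registered under that name.
fn tokenize(
    registry: &HashMap<&str, TokenizeFn>,
    name: &str,
    text: &str,
) -> Option<Vec<String>> {
    registry.get(name).map(|f| f(text))
}

// Hypothetical registry holding the two names used in the example above.
fn build_registry() -> HashMap<&'static str, TokenizeFn> {
    let mut registry: HashMap<&'static str, TokenizeFn> = HashMap::new();
    registry.insert("raw", raw);
    registry.insert("lang_ja", toy_ja);
    registry
}

fn main() {
    let registry = build_registry();
    // One token for the whole title: a query for a single word cannot match it.
    println!("{:?}", tokenize(&registry, "raw", "東京国際空港"));
    // Split into words: a query for one of the words can match.
    println!("{:?}", tokenize(&registry, "lang_ja", "東京国際空港"));
    // A misspelled name is simply not found in this sketch.
    println!("{:?}", tokenize(&registry, "lang_jp", "東京国際空港"));
}
```

In this sketch, the name given to a field and the name used at registration must match exactly; the misspelled "lang_jp" lookup returns None. How tantivy itself reports an unregistered tokenizer name is not covered here.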

API reference

The API reference is available at the following URL:

Dependencies

~34–71MB
~1M SLoC