#tantivy #pinyin

tantivy-pinyin

As the name suggests, this is a pinyin tokenizer for tantivy.

4 releases

Uses new Rust 2021

0.1.3 Jun 20, 2022
0.1.2 Jun 20, 2022
0.1.1 Jun 20, 2022
0.1.0 Jun 20, 2022

#76 in Internationalization (i18n)

94 downloads per month

MIT license

15KB
133 lines


Usage

Add the dependency to your Cargo.toml:

tantivy_pinyin = "0.1.0"

The following example indexes one document with the pinyin tokenizer and then queries it:

use tantivy::collector::{Count, TopDocs};
use tantivy::query::TermQuery;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};
use tantivy::tokenizer::{PreTokenizedString, Token, Tokenizer};
use tempfile::TempDir;

use tantivy_pinyin::PinyinTokenizer;

fn pre_tokenize_text(text: &str) -> Vec<Token> {
  let mut token_stream = PinyinTokenizer.token_stream(text);
  let mut tokens = vec![];
  while token_stream.advance() {
      tokens.push(token_stream.token().clone());
  }
  tokens
}


pub fn main() -> tantivy::Result<()> {
  let index_path = TempDir::new()?;

  let mut schema_builder = Schema::builder();

  schema_builder.add_text_field("title", TEXT | STORED);
  schema_builder.add_text_field("body", TEXT);

  let schema = schema_builder.build();

  let index = Index::create_in_dir(&index_path, schema.clone())?;

  let mut index_writer = index.writer(50_000_000)?;

  // Look up the field handles; they are needed to build the document
  // and to construct query terms.
  let title = schema.get_field("title").unwrap();
  let body = schema.get_field("body").unwrap();

  let title_text = "大多数知识,不需要我们记住";
  let body_text = "大多数知识,只需要认知即可";

  // Build a `PreTokenizedString` for each field. It holds the original text
  // together with the tokens produced by `PinyinTokenizer`.
  let title_tok = PreTokenizedString {
      text: String::from(title_text),
      tokens: pre_tokenize_text(title_text),
  };

  println!(
      "Original text: \"{}\" and tokens: {:?}",
      title_tok.text, title_tok.tokens
  );

  let body_tok = PreTokenizedString {
      text: String::from(body_text),
      tokens: pre_tokenize_text(body_text),
  };

  // Now let's create a document from the two `PreTokenizedString` values
  let pinyin_doc = doc!(title => title_tok, body => body_tok);

  // ... and add it to the IndexWriter
  index_writer.add_document(pinyin_doc)?;

  // Let's commit changes
  index_writer.commit()?;

  // ... and now is the time to query our index

  let reader = index
      .reader_builder()
      .reload_policy(ReloadPolicy::OnCommit)
      .try_into()?;

  let searcher = reader.searcher();

  // Search the body field for the pinyin token "xin" with a TermQuery.
  // With `PreTokenizedString` the tokens are indexed exactly as supplied, so
  // a document matches only if the term equals one of its tokens exactly.
  let query = TermQuery::new(
      //Term::from_field_text(title, "liu"),
      Term::from_field_text(body, "xin"),
      IndexRecordOption::Basic,
  );

  let (top_docs, count) = searcher.search(&query, &(TopDocs::with_limit(2), Count))?;

  println!("Found {} documents", count);

  // Now let's print out the results.
  // Note that the tokens are not stored along with the original text
  // in the document store
  for (_score, doc_address) in top_docs {
      let retrieved_doc = searcher.doc(doc_address)?;
      println!("Document: {}", schema.to_json(&retrieved_doc));
  }

  Ok(())
}
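
To make the matching behaviour concrete, here is a minimal, dependency-free sketch of what a pinyin tokenizer conceptually does: it emits one token per Chinese character, and a term lookup succeeds only when the queried string equals one of those tokens. The lookup table `toy_pinyin`, the `SketchToken` struct, and both functions are hypothetical stand-ins written for illustration. They are not the API of tantivy-pinyin, which may treat tones, polyphonic characters, and stop words differently.

```rust
/// Hypothetical stand-in for a tokenizer token (not tantivy's `Token` type).
#[derive(Debug, Clone, PartialEq)]
struct SketchToken {
    /// Byte offset of the source character in the original text.
    offset_from: usize,
    /// Byte offset just past the source character.
    offset_to: usize,
    /// Index of this token among the emitted tokens.
    position: usize,
    /// Toneless pinyin of the source character.
    text: String,
}

/// Toy lookup table covering only the characters used below
/// (an assumption, not the crate's real dictionary).
fn toy_pinyin(c: char) -> Option<&'static str> {
    match c {
        '大' => Some("da"),
        '多' => Some("duo"),
        '数' => Some("shu"),
        '知' => Some("zhi"),
        '识' => Some("shi"),
        _ => None,
    }
}

/// Emit one token per character that has an entry in the toy table;
/// punctuation and unknown characters are skipped.
fn sketch_tokenize(text: &str) -> Vec<SketchToken> {
    let mut tokens = Vec::new();
    for (offset, c) in text.char_indices() {
        if let Some(py) = toy_pinyin(c) {
            let position = tokens.len();
            tokens.push(SketchToken {
                offset_from: offset,
                offset_to: offset + c.len_utf8(),
                position,
                text: py.to_string(),
            });
        }
    }
    tokens
}

/// A term matches only if it equals an emitted token exactly.
fn contains_term(tokens: &[SketchToken], term: &str) -> bool {
    tokens.iter().any(|t| t.text == term)
}

fn main() {
    let tokens = sketch_tokenize("大多数知识,");
    for t in &tokens {
        println!("{:?}", t);
    }
    println!("zhi: {}", contains_term(&tokens, "zhi"));
    println!("xin: {}", contains_term(&tokens, "xin"));
}
```

In this toy model a lookup for "zhi" succeeds because 知 was tokenized to that string, while "xin" fails because no character in the input maps to it. The same reasoning applies when choosing the term for the `TermQuery` in the example above.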

Features

stop_words: Chinese stop words
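
This README does not show how the stop-word feature is applied. As a rough illustration only, a stop-word filter usually drops any token found in a fixed set before the tokens reach the index. The stop list and the `remove_stop_words` helper below are hypothetical. The crate's actual list, and whether it filters on Chinese characters or on pinyin, are not stated here.

```rust
use std::collections::HashSet;

/// Hypothetical helper: keep only the tokens that are not in the stop list.
fn remove_stop_words<'a>(tokens: &[&'a str], stop_words: &HashSet<&str>) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|t| !stop_words.contains(t))
        .collect()
}

fn main() {
    // A tiny illustrative stop list of common Chinese function words.
    let stop_words: HashSet<&str> = ["的", "了", "是"].iter().copied().collect();
    let tokens = ["这", "是", "我", "的", "书"];
    let kept = remove_stop_words(&tokens, &stop_words);
    println!("{:?}", kept);
}
```

Filtering before indexing keeps very frequent function words out of the index, so they neither take up space nor match queries.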

Test

cargo test

Postscript

This is a small project. If it has helped you, please give it a star to encourage the author.

Dependencies

~17MB
~272K SLoC