3 releases

Uses new Rust 2024

0.5.2 Nov 4, 2025
0.5.1 Oct 8, 2025
0.5.0 Sep 17, 2025

#1644 in Machine learning


203 downloads per month
Used in gtars

MIT license

165KB
3.5K SLoC

gtars-tokenizers

Wrapper around gtars-overlaprs for producing tokens for machine learning models.

Purpose

This module wraps the core overlap infrastructure from gtars-overlaprs to convert genomic regions into vocabulary tokens. It is designed for machine learning applications that need to represent genomic intervals as discrete tokens.

Design Philosophy

All overlap computation is delegated to gtars-overlaprs. This module focuses on:

  • Token vocabulary management
  • Encoding/decoding strategies
  • Integration with ML frameworks (HuggingFace, etc.)
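To illustrate the vocabulary-management idea in isolation, here is a minimal std-only sketch of a bidirectional mapping between region strings and integer token ids. The `Vocab` type and its methods are hypothetical, for illustration only; they are not the gtars-tokenizers API.

```rust
use std::collections::HashMap;

// Hypothetical toy vocabulary: maps region strings (e.g. "chr1:100-200")
// to integer token ids and back. Not the gtars-tokenizers API.
struct Vocab {
    to_id: HashMap<String, u32>,
    to_region: Vec<String>,
}

impl Vocab {
    fn new() -> Self {
        Vocab { to_id: HashMap::new(), to_region: Vec::new() }
    }

    // Add a region to the vocabulary, returning its (possibly existing) id.
    fn add(&mut self, region: &str) -> u32 {
        if let Some(&id) = self.to_id.get(region) {
            return id;
        }
        let id = self.to_region.len() as u32;
        self.to_id.insert(region.to_string(), id);
        self.to_region.push(region.to_string());
        id
    }

    // Encode: region string -> token id (if known).
    fn encode(&self, region: &str) -> Option<u32> {
        self.to_id.get(region).copied()
    }

    // Decode: token id -> region string (if in range).
    fn decode(&self, id: u32) -> Option<&str> {
        self.to_region.get(id as usize).map(String::as_str)
    }
}

fn main() {
    let mut vocab = Vocab::new();
    let id = vocab.add("chr1:100-200");
    assert_eq!(vocab.encode("chr1:100-200"), Some(id));
    assert_eq!(vocab.decode(id), Some("chr1:100-200"));
}
```

Keeping the id-to-region direction as a plain `Vec` makes decoding O(1) and gives token ids a stable insertion order, which is what ML vocabularies typically require.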

Use Cases

  • Transformer Models: Convert genomic regions to token sequences
  • Feature Extraction: Represent intervals as discrete features for ML
  • Language Model Input: Prepare genomic data for NLP-based models

Main Components

  • Tokenizer: Maps regions to vocabulary tokens using overlap detection
  • Universe: Vocabulary of genomic regions (peaks/intervals)
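Conceptually, tokenization reduces to overlap lookup against the universe: each query region is assigned the token ids of the vocabulary intervals it overlaps. The sketch below uses toy types (not the crate's `Tokenizer`/`Universe` implementation), and a linear scan stands in for the overlap detection that the real crate delegates to gtars-overlaprs.

```rust
// Toy half-open interval (start, end) on a single chromosome. Hypothetical.
#[derive(Clone, Copy)]
struct Iv { start: u32, end: u32 }

// Toy universe: ordered list of vocabulary intervals; the index is the token id.
// Real overlap computation is delegated to gtars-overlaprs; a linear scan
// stands in for it here.
fn tokenize(universe: &[Iv], query: Iv) -> Vec<usize> {
    universe
        .iter()
        .enumerate()
        // Half-open overlap test: [a, b) and [c, d) overlap iff a < d && c < b.
        .filter(|(_, iv)| iv.start < query.end && query.start < iv.end)
        .map(|(id, _)| id)
        .collect()
}

fn main() {
    let universe = vec![Iv { start: 0, end: 150 }, Iv { start: 180, end: 300 }];
    // A query spanning 100..200 overlaps both vocabulary intervals.
    let tokens = tokenize(&universe, Iv { start: 100, end: 200 });
    assert_eq!(tokens, vec![0, 1]);
}
```

In practice the universe would be indexed (e.g. with an interval tree) so each lookup is logarithmic rather than linear, which is the kind of work gtars-overlaprs handles.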

Example

use std::path::Path;
use gtars_tokenizers::Tokenizer;
use gtars_core::models::Region;

// Build a tokenizer whose vocabulary (universe) is defined by a BED file of peaks.
let tokenizer = Tokenizer::from_bed(Path::new("../tests/data/tokenizers/peaks.bed")).unwrap();

// A query region to tokenize.
let regions = vec![Region {
    chr: "chr1".to_string(),
    start: 100,
    end: 200,
    rest: None,
}];

// Map the region onto overlapping vocabulary tokens.
let tokens = tokenizer.tokenize(&regions);

Dependencies

~3.5–8.5MB
~151K SLoC