#genomics #bioinformatics #agc

ragc-core

Core compression and decompression algorithms for the AGC genome compression format

2 releases

0.1.1 Jan 6, 2026
0.1.0 Nov 5, 2025


Used in ragc-cli

MIT license

1MB
21K SLoC

This crate implements the complete AGC compression pipeline with full C++ AGC format compatibility. Archives created by this library can be read by the C++ implementation and vice versa.

Features

  • Compression - Create AGC archives from FASTA files
  • Decompression - Extract genomes from AGC archives
  • C++ Compatibility - Bidirectional format interoperability
  • Multi-sample support - Handle multiple genomes in one archive
  • LZ differential encoding - Efficient encoding against reference sequences
  • ZSTD compression - High-ratio compression of segments
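
To make the idea of LZ differential encoding against a reference concrete, here is a toy sketch. It is not the ragc-core implementation or the AGC on-disk token layout: `Token`, `encode`, and `decode` are hypothetical names, the search is brute force, and ASCII bases are used instead of the numeric encoding for readability.

```rust
/// One token of a toy reference-based LZ encoding (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
enum Token {
    /// Copy `len` bases from the reference starting at `ref_pos`.
    Match { ref_pos: usize, len: usize },
    /// A base that is stored verbatim.
    Literal(u8),
}

/// Greedy encoder: at each target position take the longest reference match,
/// falling back to a literal when it is shorter than `min_match_len`.
/// The search is brute force (quadratic); practical encoders index the reference.
fn encode(reference: &[u8], target: &[u8], min_match_len: usize) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < target.len() {
        let mut best = (0usize, 0usize); // (ref_pos, len)
        for start in 0..reference.len() {
            let len = reference[start..]
                .iter()
                .zip(&target[i..])
                .take_while(|(r, t)| r == t)
                .count();
            if len > best.1 {
                best = (start, len);
            }
        }
        // `.max(1)` guarantees progress even if `min_match_len` is 0.
        if best.1 >= min_match_len.max(1) {
            tokens.push(Token::Match { ref_pos: best.0, len: best.1 });
            i += best.1;
        } else {
            tokens.push(Token::Literal(target[i]));
            i += 1;
        }
    }
    tokens
}

/// Rebuild the target from the reference and the token stream.
fn decode(reference: &[u8], tokens: &[Token]) -> Vec<u8> {
    let mut out = Vec::new();
    for token in tokens {
        match token {
            Token::Match { ref_pos, len } => {
                out.extend_from_slice(&reference[*ref_pos..*ref_pos + *len])
            }
            Token::Literal(base) => out.push(*base),
        }
    }
    out
}
```

With the reference `ACGTACGTTT`, the target `ACGTAGGTTT`, and a minimum match length of 3, `encode` yields a 5-base match, one literal `G`, and a 4-base match. A target that differs from the reference by a single base therefore costs three tokens instead of ten bases.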

Examples

Compressing genomes

use ragc_core::{Compressor, CompressorConfig};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a compressor that writes to output.agc
    let config = CompressorConfig::default();
    let mut compressor = Compressor::new("output.agc", config)?;

    // Add FASTA files (sample name, path)
    compressor.add_fasta_file("sample1", Path::new("genome1.fasta"))?;
    compressor.add_fasta_file("sample2", Path::new("genome2.fasta"))?;

    // Finalize the archive
    compressor.finalize()?;
    Ok(())
}

Decompressing genomes

use ragc_core::{Decompressor, DecompressorConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open an archive
    let config = DecompressorConfig::default();
    let mut decompressor = Decompressor::open("archive.agc", config)?;

    // List available samples
    let samples = decompressor.list_samples();
    println!("Found {} samples", samples.len());

    // Extract a sample
    let contigs = decompressor.get_sample("sample1")?;
    for (name, sequence) in contigs {
        // sequence is Vec<u8> with numeric encoding (A=0, C=1, G=2, T=3)
        println!(">{} ({} bases)", name, sequence.len());
    }
    Ok(())
}
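
Because extracted sequences use the numeric encoding (A=0, C=1, G=2, T=3), they need to be mapped back to letters before being written as FASTA. The helpers below are a minimal sketch of that step. `decode_bases` and `to_fasta_record` are hypothetical names, not part of the ragc-core API, and rendering any code above 3 as `N` is an assumption of this sketch.

```rust
/// Map numeric base codes back to ASCII nucleotides.
/// Codes 0..=3 follow the A/C/G/T mapping described above;
/// treating every other value as 'N' is an assumption of this sketch.
fn decode_bases(codes: &[u8]) -> String {
    codes
        .iter()
        .map(|&code| match code {
            0 => 'A',
            1 => 'C',
            2 => 'G',
            3 => 'T',
            _ => 'N',
        })
        .collect()
}

/// Format one contig as a FASTA record with fixed-width sequence lines.
fn to_fasta_record(name: &str, codes: &[u8], line_width: usize) -> String {
    let bases = decode_bases(codes);
    let mut record = format!(">{}\n", name);
    // All characters are ASCII, so chunking by bytes cannot split a character.
    // `.max(1)` avoids a panic if a width of 0 is passed.
    for line in bases.as_bytes().chunks(line_width.max(1)) {
        record.push_str(std::str::from_utf8(line).expect("ASCII bases"));
        record.push('\n');
    }
    record
}
```

Inside the extraction loop, something like `print!("{}", to_fasta_record(&name, &sequence, 60))` would then emit a conventional FASTA record, assuming `name` is string-like as the example suggests.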

Working with k-mers

use ragc_core::{Kmer, KmerMode};

// Create a canonical k-mer
let mut kmer = Kmer::new(21, KmerMode::Canonical);

// Insert bases (0=A, 1=C, 2=G, 3=T)
kmer.insert(0); // A
kmer.insert(1); // C
kmer.insert(2); // G

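// With only 3 of 21 bases inserted the k-mer is not yet full;
// keep inserting until is_full() returns true.
if kmer.is_full() {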
    let value = kmer.data();
    println!("K-mer value: {}", value);
}
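
For readers unfamiliar with canonical k-mers, the following sketch shows one common way to maintain them: pack each base into 2 bits, track the forward and reverse-complement encodings side by side, and report the smaller of the two. `RollingKmer` is a hypothetical type written for this illustration. It is not the `Kmer` type from ragc-core, and the crate's internal representation may differ.

```rust
/// Toy rolling k-mer over 2-bit base codes (A=0, C=1, G=2, T=3).
/// Hypothetical type for illustration; not the ragc_core `Kmer` API.
struct RollingKmer {
    k: u32,
    filled: u32,
    forward: u64,
    reverse: u64,
    mask: u64,
}

impl RollingKmer {
    /// `k` must be in 1..=32 so that a k-mer fits into a u64.
    fn new(k: u32) -> Self {
        assert!((1..=32).contains(&k), "k must be between 1 and 32");
        let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
        RollingKmer { k, filled: 0, forward: 0, reverse: 0, mask }
    }

    /// Push one base code (0..=3); the oldest base drops out once k bases are held.
    fn insert(&mut self, base: u8) {
        let base = u64::from(base & 3);
        // Forward strand: shift left and append the new base at the low end.
        self.forward = ((self.forward << 2) | base) & self.mask;
        // Reverse strand: shift right and place the complement (3 - base) at the high end.
        self.reverse = (self.reverse >> 2) | ((3 - base) << (2 * (self.k - 1)));
        if self.filled < self.k {
            self.filled += 1;
        }
    }

    /// True once at least k bases have been inserted.
    fn is_full(&self) -> bool {
        self.filled == self.k
    }

    /// Canonical value: the smaller of the forward and reverse-complement encodings.
    fn canonical(&self) -> Option<u64> {
        self.is_full().then(|| self.forward.min(self.reverse))
    }
}
```

With `k = 3`, inserting A, C, G gives the forward encoding `ACG` and the reverse complement `CGT`, so `canonical()` returns the encoding of `ACG`. Inserting T next slides the window to `CGT`, whose reverse complement is `ACG`, so the canonical value is unchanged. That strand independence is the reason for using canonical k-mers.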

Custom compression settings

use ragc_core::CompressorConfig;

let config = CompressorConfig {
    kmer_length: 25,        // Use 25-mers instead of the default 21
    segment_size: 2000,     // Larger segments
    min_match_len: 20,      // Minimum LZ match length
    verbosity: 2,           // More verbose output
    ..CompressorConfig::default() // Keep defaults for any remaining fields
};

Archive Format

The AGC format organizes data into streams:

  • file_type_info - Version and producer metadata
  • params - Compression parameters (k-mer length, segment size)
  • splitters - Singleton k-mers used for segmentation (future)
  • seg-NN or seg_dNN - Compressed genome segments
  • collection - Sample and contig metadata

Compatibility

This implementation is tested for compatibility with C++ AGC:

  • Archives created by ragc can be read by C++ AGC
  • Archives created by C++ AGC can be read by ragc
  • Format version 3.0 support
  • SHA256-verified roundtrip testing

Dependencies

~7–14MB
~255K SLoC