#genomics #bioinformatics #data-analysis #methylation #bisulfite #epigenetics

bsxplorer2

A high-performance library for bisulfite sequencing data analysis and DNA methylation research

7 releases

0.2.1 May 28, 2025
0.2.0 May 23, 2025
0.1.1 Mar 28, 2025
0.1.1-post1 Apr 4, 2025

#153 in Biology

433 downloads per month
Used in bsxplorer-ci

Custom license

410KB
9K SLoC

BSXplorer2: Accelerating DNA Methylation Analysis

A cutting-edge, high-performance toolkit built in Rust for bisulfite sequencing data analysis and DNA methylation research.

Overview

BSXplorer2 is designed from the ground up for speed and efficiency, so that researchers and developers can process and analyze large-scale bisulfite sequencing datasets. Built in Rust and integrated with the Polars and Arrow data processing libraries, it provides a scalable toolkit for identifying differentially methylated regions (DMRs), calculating methylation statistics, and handling various report formats.

Whether you prefer command-line tools for quick analyses or a programmatic interface for complex pipelines, BSXplorer2 offers flexible access through its console binary and Python bindings.

For details and usage examples, please refer to the documentation.

Features

Core Capabilities for High-Impact Research

  • Blazing Fast Data Handling: Process massive datasets efficiently using memory-optimized data structures and native parallelization.
  • Comprehensive Report Support: Seamlessly work with Bismark, CGmap, BedGraph, and Coverage formats, plus our high-performance BSX format.
  • Context-Aware Analysis: Drill down into CG, CHG, and CHH methylation patterns.
  • Advanced DMR Detection: Pinpoint differentially methylated regions using cutting-edge segmentation and statistical methods.
  • Robust Statistics: Calculate detailed methylation statistics, coverage distributions, and apply sophisticated statistical tests.
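
As a rough illustration of the report handling, context separation, and statistics listed above, the following Python sketch parses a few Bismark-style cytosine report lines and computes a weighted methylation level (methylated reads divided by total reads) per context. It uses only the standard library and is not the BSXplorer2 or bsx2 API: the function names, dictionary keys, and example rows are made up for illustration, and the column order is assumed to follow the usual Bismark cytosine report layout.

```python
from collections import defaultdict


def parse_cytosine_report_line(line):
    """Parse one tab-separated line of a Bismark-style cytosine report.

    Assumed column order: chromosome, position, strand, methylated count,
    unmethylated count, context, trinucleotide.
    """
    chrom, pos, strand, count_m, count_um, context, _trinuc = (
        line.rstrip("\n").split("\t")
    )
    return {
        "chr": chrom,
        "position": int(pos),
        "strand": strand,
        "count_m": int(count_m),
        "count_um": int(count_um),
        "context": context,
    }


def weighted_methylation_by_context(records):
    """Return sum(methylated) / sum(total reads) for each covered context."""
    methylated = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        methylated[rec["context"]] += rec["count_m"]
        total[rec["context"]] += rec["count_m"] + rec["count_um"]
    return {ctx: methylated[ctx] / total[ctx] for ctx in total if total[ctx] > 0}


# Hypothetical example rows, not real data.
EXAMPLE_LINES = [
    "chr1\t101\t+\t8\t2\tCG\tCGA",
    "chr1\t102\t-\t6\t4\tCG\tCGT",
    "chr1\t150\t+\t1\t9\tCHG\tCAG",
    "chr1\t175\t+\t0\t10\tCHH\tCAA",
]

records = [parse_cytosine_report_line(line) for line in EXAMPLE_LINES]
levels = weighted_methylation_by_context(records)
print(levels)
```

Weighting by read counts, rather than averaging per-site ratios, keeps low-coverage sites from dominating the summary; whether BSXplorer2 reports this exact statistic is not stated here.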

Engineered for Performance

  • Rust Native Speed: Built on Rust for maximum performance and reliability.
  • Polars & Arrow Integration: Leverage column-oriented processing for speed and memory efficiency.
  • Parallel Execution: Utilize multi-core processors effectively with Rayon.

🤝 User-Friendly & Accessible

  • Intuitive Console App: Perform common tasks easily with the bsxplorer command-line tool.
  • Flexible Python API: Build custom analysis workflows using the bsx2 Python library.
  • Detailed Documentation: Get started quickly with clear guides and examples.

Components

BSXplorer2 is composed of three main parts:

Core Rust Library

The heart of BSXplorer2, containing all the core data structures, algorithms, and file format implementations. Designed for high performance and low-level control. Explore the Rust source code: @src

Python Wrapper (bsx2)

Provides idiomatic Python bindings to the core Rust library using PyO3. Enables seamless integration with the Python data science ecosystem (Polars, NumPy, SciPy, Matplotlib, Plotly, Pydantic). Ideal for building complex analysis pipelines and interactive data exploration in Jupyter notebooks. Find the Python package here: @python

Console Application (bsxplorer)

A standalone command-line tool built on the Rust library. Offers convenient commands for file format conversion, DMR calling, validation, and more. Perfect for scripting and integrating into existing bioinformatics workflows without writing Rust or Python code. Check out the console source and commands: @console

Installation

For the Console Application (bsxplorer)

Install the console binary directly using Cargo:

cargo install --locked bsxplorer-ci

Ensure your Cargo bin directory is in your system's PATH.
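
One way to confirm the PATH setup from a script is sketched below in Python. It assumes the installed binary is named bsxplorer, as this README calls the command-line tool; the helper name find_bsxplorer is made up for illustration.

```python
import shutil


def find_bsxplorer(binary_name="bsxplorer"):
    """Return the full path of the binary if it is on PATH, otherwise None."""
    return shutil.which(binary_name)


location = find_bsxplorer()
if location is None:
    # The binary name is an assumption taken from this README.
    print("bsxplorer not found; check that Cargo's bin directory is on PATH")
else:
    print(f"bsxplorer found at {location}")
```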

For the Python Library (bsx2)

🚧 WIP: the Python library is under active development. 🚧

Usage

Dive into analyzing your methylation data using the bsxplorer console application or the bsx2 Python library.

  • Console Usage: Get detailed help and command examples in the Console Application README.
  • Python Usage: Explore the bsx2 package documentation (coming soon!) and examples within the @python directory to use the Python API.

BSX Format (Arrow IPC File Format)

BSXplorer2 introduces the BSX file format, built on the Apache Arrow Interprocess Communication (IPC) file format. This foundation gives the format the following properties for methylation data processing:

Performance Benefits

  • Memory Efficiency: Column-oriented storage significantly reduces memory footprint compared to row-based formats.
  • Zero-Copy Reading: Data can be accessed in memory without expensive copying, boosting speed.
  • Parallel Processing: Designed for concurrent access, perfectly complementing multi-threaded operations.
  • Vectorized Operations: Enables leveraging modern CPU instructions (SIMD) for faster calculations.
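
The column-oriented and zero-copy ideas above can be sketched with Python's standard library, where typed arrays stand in for Arrow buffers. This is only an analogy under stated assumptions: the column names and values are hypothetical, and real BSX files are read through Arrow, not through the array module.

```python
from array import array

# One typed, contiguous buffer per column (hypothetical column names).
positions = array("I", [101, 102, 150, 175, 210, 260])
count_m = array("H", [8, 6, 1, 0, 5, 7])
count_total = array("H", [10, 10, 10, 10, 10, 10])

# Slicing a memoryview creates a window onto the same buffer, not a copy.
window = memoryview(count_m)[2:5]

# A column-wise pass reads only the two buffers it needs.
density = [m / t for m, t in zip(count_m, count_total)]

# A write through the original buffer is visible through the view,
# which shows that no data was copied when the window was created.
count_m[2] = 9
shares_memory = window[0] == 9
print(window.tolist(), shares_memory)
```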

Compression Capabilities

  • Flexible Compression: Supports LZ4 (optimized for speed) and ZSTD (optimized for compression ratio).
  • Column-Specific: Compression is applied per column, adapting to different data types.
  • Efficient Decompression: Only necessary columns are decompressed, minimizing overhead.
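
The effect of per-column compression can be illustrated with a small Python sketch. It uses zlib from the standard library as a stand-in for LZ4 or ZSTD, and a plain dictionary of byte strings as a stand-in for the file layout; neither reflects the actual Arrow IPC encoding, and the column names are hypothetical.

```python
import struct
import zlib


def compress_columns(columns):
    """Compress each column independently (name -> list of unsigned ints)."""
    blobs = {}
    for name, values in columns.items():
        raw = struct.pack(f"<{len(values)}I", *values)
        blobs[name] = zlib.compress(raw)  # zlib stands in for LZ4/ZSTD
    return blobs


def read_column(blobs, name):
    """Decompress only the requested column; other columns stay untouched."""
    raw = zlib.decompress(blobs[name])
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


columns = {
    "position": list(range(1000, 1400, 4)),
    "count_m": [3] * 100,
    "count_total": [10] * 100,
}

blobs = compress_columns(columns)
count_m = read_column(blobs, "count_m")
sizes = {name: len(blob) for name, blob in blobs.items()}
print(sizes)
```

Because each column is compressed on its own, a query that needs only count_m pays no decompression cost for position or count_total.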

Data Organization

  • Efficient Categorical Encoding: Methylation contexts (CG, CHG, CHH) and strands are stored as efficient categorical types, not verbose strings.
  • Batched Storage: Data is chunked into logical batches for efficient processing in memory.
  • Type-Aware: Data types (integers, floats, booleans) are stored in optimized binary representations.
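
A minimal Python sketch of categorical encoding and batching follows. The code-to-context mapping, the batch size, and the function names are assumptions made for illustration and say nothing about how BSX files assign codes or size their batches.

```python
from itertools import islice

# Hypothetical mapping: each context label gets a small integer code.
CONTEXTS = ("CG", "CHG", "CHH")
CONTEXT_CODES = {name: code for code, name in enumerate(CONTEXTS)}


def encode_contexts(labels):
    """Store one byte per site instead of a repeated string."""
    return bytes(CONTEXT_CODES[label] for label in labels)


def decode_contexts(codes):
    """Map integer codes back to their context labels."""
    return [CONTEXTS[code] for code in codes]


def iter_batches(rows, batch_size):
    """Yield consecutive chunks of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


labels = ["CG", "CHH", "CHG", "CG", "CHH", "CHH", "CG"]
codes = encode_contexts(labels)
batches = list(iter_batches(codes, 3))
print(list(codes), batches)
```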

Integration Advantages

  • Cross-Platform: Works consistently across various operating systems.
  • Language Interoperability: Accessible from any language with robust Arrow bindings (Python, R, Java, etc.).
  • Schema Enforcement: Strict schema ensures data integrity and prevents format ambiguities.
  • Rich Metadata: Supports embedding custom metadata for better data tracking and provenance.
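
To make the schema enforcement point concrete, here is a small Python sketch of a record check against a fixed schema. The column names and types are hypothetical and are not the actual BSX schema; in an Arrow-based file the schema is stored with the data and checked by the Arrow library rather than by hand-written code like this.

```python
# Hypothetical schema: column name -> expected Python type.
EXPECTED_SCHEMA = {
    "chr": str,
    "position": int,
    "strand": str,
    "context": str,
    "count_m": int,
    "count_total": int,
}


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for column: {name}")
    for name in record:
        if name not in schema:
            problems.append(f"unexpected column: {name}")
    return problems


good = {"chr": "chr1", "position": 101, "strand": "+",
        "context": "CG", "count_m": 8, "count_total": 10}
bad = {"chr": "chr1", "position": "101", "strand": "+",
       "context": "CG", "count_m": 8}

print(validate_record(good))
print(validate_record(bad))
```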

The BSX format is purpose-built for methylation data, providing the optimal storage solution for BSXplorer2's high-performance analytical tasks.

Roadmap

BSXplorer2 is under active development. The roadmap includes:

  • High-performance file format support (BSX, Bismark, CGmap, BedGraph, Coverage) including reading, writing, conversion, validation, and sorting.
  • Efficient indexing and region-based querying for BSX files.
  • Core DMR identification algorithm implementation.
  • Basic methylation statistics calculation.
  • Enhanced visualization tools within the Python library.
  • Tighter integration and utilities for genomic annotation data (genes, regulatory elements).
  • Exploration of a web-based interactive analysis interface.
  • Expansion of statistical methods for sophisticated differential methylation analysis.
  • Metagene profile generation.

Contributions and feature requests are welcome!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

  • The total variation segmentation algorithm is inspired by the work of Laurent Condat.
  • Statistical implementations draw upon established techniques from bioinformatics literature.
  • We gratefully acknowledge the contributions of community-developed libraries, including bio-types, polars, pyo3, and rayon, which are integral to BSXplorer2.

Created by shitohana - Empowering your DNA methylation research with speed and precision.

Dependencies

~54–87MB
~1.5M SLoC