Version 0.1.2, released Mar 17, 2025 (1 unstable release). Uses Rust 2024.

TSG - Transcript Segment Graph
TSG is a Rust library and command-line tool for creating, manipulating, and analyzing transcript segment graphs. It provides a comprehensive framework for modeling segmented transcript data, analyzing non-linear splicing events, and working with genomic structural variants.
Features
- Parse and write TSG format files
- Build and manipulate transcript segment graphs
- Support for multiple graphs within a single file
- Analyze paths and connectivity between transcript segments
- Support for various element types: nodes, edges, groups, and chains
- Export graphs to DOT format for visualization
- Traverse the graph to identify valid transcript paths
- Read identity tracking to ensure biological validity
- Build graphs from chains and validate path traversals
- Support for genomic coordinates with strand information
- Support for read evidence with types
- Inter-graph links for fusion events and other cross-graph relationships
Installation
Library
Add this to your Cargo.toml:
[dependencies]
tsg = "0.1.0"
Command-line Tool
Install the CLI tool:
cargo install tsg
Library Usage
Loading a TSG file
use tsg::graph::TSGraph;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load graph from a TSG file
    let graph = TSGraph::from_file("path/to/file.tsg")?;

    // Access graph elements
    println!("Number of graphs: {}", graph.get_graphs().len());
    println!("Number of nodes: {}", graph.get_nodes().len());
    println!("Number of edges: {}", graph.get_edges().len());

    // Export to DOT format for visualization
    let dot = graph.to_dot()?;
    std::fs::write("graph.dot", dot)?;

    // Write the graph back out as a TSG file
    graph.write_to_file("output.tsg")?;
    Ok(())
}
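For readers unfamiliar with DOT, the following is a minimal standalone sketch of the kind of text such an export produces. It does not use the tsg crate and is not its implementation: the `edges_to_dot` function and its tuple layout are hypothetical, and the real output of `to_dot` may carry different attributes and layout.

```rust
/// Render a list of labelled, directed edges as a Graphviz DOT digraph.
/// Hypothetical helper for illustration; not part of the tsg crate.
/// Each tuple is (source node ID, target node ID, edge label).
fn edges_to_dot(name: &str, edges: &[(&str, &str, &str)]) -> String {
    let mut out = format!("digraph \"{}\" {{\n", name);
    for (from, to, label) in edges {
        // Quote identifiers because TSG IDs such as `gene_a:n1` contain a colon.
        out.push_str(&format!(
            "    \"{}\" -> \"{}\" [label=\"{}\"];\n",
            from, to, label
        ));
    }
    out.push_str("}\n");
    out
}
```

A DOT file can then be rendered with Graphviz, for example with `dot -Tpng graph.dot -o graph.png`.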
Working with Multiple Graphs
use tsg::graph::{TSGraph, NodeData, EdgeData};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut graph = TSGraph::new();

    // Define multiple graphs
    graph.add_graph("gene_a", Some("BRCA1 transcripts"))?;
    graph.add_graph("gene_b", Some("BRCA2 transcripts"))?;

    // Add nodes to different graphs
    let node1 = NodeData {
        id: "gene_a:n1".into(),
        reference_id: "chr17".into(),
        ..Default::default()
    };
    let node2 = NodeData {
        id: "gene_b:n1".into(),
        reference_id: "chr13".into(),
        ..Default::default()
    };
    graph.add_node(node1)?;
    graph.add_node(node2)?;
    // gene_a:n2 and gene_a:n3, referenced below, would be added the same way

    // Add edges within each graph
    let edge1 = EdgeData {
        id: "gene_a:e1".into(),
        ..Default::default()
    };
    graph.add_edge("gene_a:n1".into(), "gene_a:n2".into(), edge1)?;

    // Add inter-graph link (e.g., for fusion transcript)
    graph.add_link("fusion1", "gene_a:n3", "gene_b:n1", "fusion", None)?;

    // Write to file
    graph.write_to_file("multi_graph.tsg")?;
    Ok(())
}
Building Graphs from Chains
use tsg::graph::{TSGraph, Group};
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create chains for different graphs
    let chains = vec![
        Group::Chain {
            id: "gene_a:chain1".into(),
            elements: vec!["gene_a:n1".into(), "gene_a:e1".into(), "gene_a:n2".into()],
            attributes: HashMap::new(),
        },
        Group::Chain {
            id: "gene_b:chain1".into(),
            elements: vec!["gene_b:n1".into(), "gene_b:e1".into(), "gene_b:n2".into()],
            attributes: HashMap::new(),
        },
    ];

    // Build graphs from chains
    let graph = TSGraph::from_chains(chains)?;

    // Write to file
    graph.write_to_file("output.tsg")?;
    Ok(())
}
Finding Valid Paths Through Specific Graphs
use tsg::graph::TSGraph;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let graph = TSGraph::from_file("transcript.tsg")?;

    // Find all valid paths through a specific graph
    let paths = graph.traverse_graph("gene_a")?;
    for (i, path) in paths.iter().enumerate() {
        println!("Path {}: {}", i + 1, path);
    }
    Ok(())
}
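To give a sense of what enumerating paths involves, here is a minimal standalone sketch of a depth-first walk from source nodes to sink nodes. It is not the crate's `traverse_graph`: it ignores read identity tracking, assumes the graph is acyclic, and the `enumerate_paths` function and its edge-list input are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Enumerate every path from a source node (no incoming edges) to a sink
/// node (no outgoing edges). Assumes the edge list describes an acyclic graph.
/// Hypothetical helper for illustration; not part of the tsg crate.
fn enumerate_paths(edges: &[(&str, &str)]) -> Vec<Vec<String>> {
    // Successor lists per node; every node gets an entry, sinks get an empty list.
    let mut successors: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    let mut has_incoming: BTreeSet<&str> = BTreeSet::new();
    for &(from, to) in edges {
        successors.entry(from).or_default().push(to);
        successors.entry(to).or_default();
        has_incoming.insert(to);
    }

    // Depth-first walk that records the current path whenever it reaches a sink.
    fn walk<'a>(
        node: &'a str,
        successors: &BTreeMap<&'a str, Vec<&'a str>>,
        current: &mut Vec<&'a str>,
        paths: &mut Vec<Vec<String>>,
    ) {
        current.push(node);
        let next = &successors[node];
        if next.is_empty() {
            paths.push(current.iter().map(|s| s.to_string()).collect());
        } else {
            for &n in next {
                walk(n, successors, current, paths);
            }
        }
        current.pop();
    }

    // Start one walk from each node that has no incoming edge.
    let sources: Vec<&str> = successors
        .keys()
        .copied()
        .filter(|n| !has_incoming.contains(n))
        .collect();
    let mut paths = Vec::new();
    for source in sources {
        walk(source, &successors, &mut Vec::new(), &mut paths);
    }
    paths
}
```

With edges n1→n2, n2→n3, and n1→n3 (an exon-skipping shape), the walk yields two paths: one through n2 and one that skips it.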
CLI Usage
The TSG command-line tool provides a convenient interface for common operations:
# Display help
tsg --help
# Parse and validate a TSG file
tsg validate path/to/file.tsg
# List all graphs in a TSG file
tsg list-graphs path/to/file.tsg
# Convert a specific graph to DOT format for visualization
tsg dot --graph=gene_a path/to/file.tsg > gene_a.dot
# Extract statistics from a TSG file
tsg stats path/to/file.tsg
# Find all paths through a specific graph
tsg paths --graph=gene_a path/to/file.tsg
# Find all inter-graph links
tsg links path/to/file.tsg
TSG File Format
The TSG format is a tab-delimited text format representing transcript assemblies as graphs. It supports multiple independent graphs within a single file.
Multi-Graph Support
TSG supports multiple graphs within a single file using a graph namespace approach. Each element in the file can be associated with a specific graph using a graph ID prefix:
graph_id:element_id
For example, gene_a:n1 refers to node n1 in the graph identified as "gene_a".
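The prefix convention can be illustrated with a small standalone sketch that splits an identifier at its first colon. The `split_element_id` helper is hypothetical and not part of the tsg crate; how the crate itself treats identifiers without a prefix is not stated here, so this sketch simply rejects them.

```rust
/// Split a namespaced TSG identifier into (graph_id, element_id).
/// Returns None when there is no `:` separator or either part is empty.
/// Hypothetical helper for illustration; not part of the tsg crate.
fn split_element_id(id: &str) -> Option<(&str, &str)> {
    // Split at the first colon only; anything after it is the element ID.
    let (graph_id, element_id) = id.split_once(':')?;
    if graph_id.is_empty() || element_id.is_empty() {
        return None;
    }
    Some((graph_id, element_id))
}
```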
Record Types
Each line in a TSG file starts with a letter denoting the record type:
- H - Header information (including graph definitions)
- N - Node definition (exon or transcript segment)
- E - Edge definition (splice junction or structural variant)
- U - Unordered group (set of elements)
- P - Path (ordered traversal through the graph)
- C - Chain (alternating nodes and edges)
- A - Attribute for any element (metadata)
- L - Inter-graph link (connections between different graphs)
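As a rough illustration of how a reader might dispatch on these letters, the sketch below classifies a line by its first field. The `RecordType` enum and `record_type` function are hypothetical names, not the crate's parser; lines starting with `#` are treated as comments here because the example file in this document uses them that way.

```rust
/// The record kinds listed above. Hypothetical type for illustration;
/// not part of the tsg crate.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum RecordType {
    Header,
    Node,
    Edge,
    UnorderedGroup,
    Path,
    Chain,
    Attribute,
    Link,
}

/// Classify a line by its leading field. Returns None for blank lines,
/// `#` comment lines, and unknown record letters.
fn record_type(line: &str) -> Option<RecordType> {
    // The first whitespace-separated field holds the record letter.
    let tag = line.split_whitespace().next()?;
    match tag {
        "H" => Some(RecordType::Header),
        "N" => Some(RecordType::Node),
        "E" => Some(RecordType::Edge),
        "U" => Some(RecordType::UnorderedGroup),
        "P" => Some(RecordType::Path),
        "C" => Some(RecordType::Chain),
        "A" => Some(RecordType::Attribute),
        "L" => Some(RecordType::Link),
        _ => None,
    }
}
```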
Conceptual Model
In the TSG model:
- Graphs (G) represent independent transcript graphs, each with its own set of nodes and edges.
- Chains (C) are used to build each graph's structure.
- Paths (P) are traversals through the constructed graphs.
- Links (L) establish relationships between elements in different graphs.
This distinction is important: chains define what each graph is, paths define ways to traverse each graph, and links define relationships between graphs.
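The way a chain defines graph structure can be sketched as follows: each consecutive node, edge, node triple in the chain contributes one edge to the graph. This is a standalone illustration under the assumption that a well-formed chain starts and ends with a node and alternates strictly; `chain_to_triples` is a hypothetical helper, not the crate's `from_chains`.

```rust
/// Expand a chain (node, edge, node, edge, ..., node) into
/// (source_node, edge_id, target_node) triples.
/// Returns None if the chain is empty or has an even number of elements.
/// Hypothetical helper for illustration; not part of the tsg crate.
fn chain_to_triples<'a>(elements: &[&'a str]) -> Option<Vec<(&'a str, &'a str, &'a str)>> {
    // A chain that starts and ends with a node has odd length.
    if elements.len() % 2 == 0 {
        return None;
    }
    let mut triples = Vec::new();
    // Step by two so each iteration starts at a node position.
    for i in (0..elements.len() - 1).step_by(2) {
        triples.push((elements[i], elements[i + 1], elements[i + 2]));
    }
    Some(triples)
}
```

A path, by contrast, would be checked against edges that already exist rather than creating them.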
Example with Multiple Graphs
# File header
H TSG 1.0
H reference GRCh38
# Graph definitions
H graph gene_a BRCA1 transcripts
H graph gene_b BRCA2 transcripts
# Nodes for gene_a
N gene_a:n1 chr17:+:41196312-41196402 read1:SO,read2:SO ACGTACGT
N gene_a:n2 chr17:+:41199660-41199720 read2:IN,read3:IN TGCATGCA
N gene_a:n3 chr17:+:41203080-41203134 read1:SI,read2:SI CTGACTGA
# Nodes for gene_b
N gene_b:n1 chr13:+:32315480-32315652 read4:SO,read5:SO GATTACA
N gene_b:n2 chr13:+:32316528-32316800 read4:IN,read5:IN TACGATCG
N gene_b:n3 chr13:+:32319077-32319325 read4:SI,read5:SI CGTACGTA
# Edges for gene_a
E gene_a:e1 gene_a:n1 gene_a:n2 chr17,chr17,41196402,41199660,splice
E gene_a:e2 gene_a:n2 gene_a:n3 chr17,chr17,41199720,41203080,splice
# Edges for gene_b
E gene_b:e1 gene_b:n1 gene_b:n2 chr13,chr13,32315652,32316528,splice
E gene_b:e2 gene_b:n2 gene_b:n3 chr13,chr13,32316800,32319077,splice
# Chains for gene_a
C gene_a:chain1 gene_a:n1 gene_a:e1 gene_a:n2 gene_a:e2 gene_a:n3
# Chains for gene_b
C gene_b:chain1 gene_b:n1 gene_b:e1 gene_b:n2 gene_b:e2 gene_b:n3
# Paths for gene_a
P gene_a:transcript1 gene_a:n1+ gene_a:e1+ gene_a:n2+ gene_a:e2+ gene_a:n3+
# Paths for gene_b
P gene_b:transcript1 gene_b:n1+ gene_b:e1+ gene_b:n2+ gene_b:e2+ gene_b:n3+
# Inter-graph link (e.g., for a fusion transcript)
L fusion1 gene_a:n3 gene_b:n1 fusion type:Z:chromosomal
# Attributes
A N gene_a:n1 expression:f:10.5
A P gene_a:transcript1 tpm:f:8.2
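The node records in the example carry a location field of the shape `reference:strand:start-end`. The sketch below parses that field on its own. It is an illustration under assumptions: the `Interval` struct and `parse_interval` function are hypothetical, only `+` and `-` are accepted as strands, reference names are assumed not to contain `:`, and nothing is assumed about whether coordinates are 0-based or 1-based.

```rust
/// A parsed node location. Hypothetical type for illustration;
/// not part of the tsg crate.
#[derive(Debug, PartialEq, Eq)]
struct Interval {
    reference: String,
    strand: char,
    start: u64,
    end: u64,
}

/// Parse a field such as `chr17:+:41196312-41196402`.
/// Returns None if the field does not have exactly three `:`-separated
/// parts, the strand is not `+` or `-`, or the coordinates are not integers.
fn parse_interval(field: &str) -> Option<Interval> {
    let mut parts = field.split(':');
    let reference = parts.next()?;
    let strand = parts.next()?;
    let range = parts.next()?;
    if parts.next().is_some() {
        return None;
    }
    let strand = match strand {
        "+" => '+',
        "-" => '-',
        _ => return None,
    };
    let (start, end) = range.split_once('-')?;
    Some(Interval {
        reference: reference.to_string(),
        strand,
        start: start.parse().ok()?,
        end: end.parse().ok()?,
    })
}
```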
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License