6 releases

0.2.6	Nov 25, 2024
0.2.5	~~Nov 25, 2024~~
0.2.4	Sep 7, 2024
0.2.3	Apr 30, 2024
0.2.1	Mar 28, 2024

#124 in Biology

MIT license

780KB
11K SLoC

RIGHOR

This package, based on IGoR, is meant to learn models of V(D)J recombination.

It can:

generate sequences
evaluate sequences (infer the most likely recombination scenarios)
compute "pgen"

It's probably easier to use the companion python package (pip install righor), but working in Rust directly should also be viable.

How to use the python package:

Load a model:

import righor
import matplotlib.pyplot as plt
import seaborn
import pandas as pd
from tqdm.notebook import tqdm
from collections import Counter
import numpy as np


igor_model = righor.load_model("human", "trb")

# alternatively, you can load a model from igor files
# igor_model = righor.load_model_from_files(params.txt, marginals.txt, anchor_v.csv, anchor_j.csv)

Generate sequences fast:

# Create a generator object
generator = igor_model.generator(seed=42) # or igor_model.generator() to run it without a seed

# Generate 10'000 functional sequences (not out-of-frame, no stop codons, right boundaries)
for _ in tqdm(range(10000)):
    # generate_without_errors ignore Igor error model, use "generate" if this is needed
    sequence = generator.generate_without_errors(functional=True)
    if "IGH" in sequence.cdr3_aa:
        print("TRB CDR3 containing \"IGH\":", sequence.cdr3_aa)

# Generate one sequence with a particular V/J genes family
V_genes = righor.genes_matching("TRBV5", igor_model) # return all the V genes that match TRBV5
J_genes = righor.genes_matching("TRBJ", igor_model) # all the J genes
generator = igor_model.generator(seed=42, available_v=V_genes, available_j=J_genes)
generation_result = generator.generate_without_errors(functional=True)
print("Result:")
print(generation_result)
print("Explicit recombination event:")
print(generation_result.recombination_event)

Evaluate a given sequence:

## Evaluate a given sequence

my_sequence = "ACCCTCCAGTCTGCCAGGCCCTCACATACCTCTCAGTACCTCTGTGCCAGCAGTGAGGACAGGGACGTCACTGAAGCTTTCTTTGGACAAGGCACC"

# evaluate the sequence
result_inference = igor_model.evaluate(my_sequence)

# Most likely scenario
best_event = result_inference.best_event

print(f"Probability that this specific event chain created the sequence: {best_event.likelihood / result_inference.likelihood:.2f}.")
print(f"Reconstructed sequence (without errors):", best_event.reconstructed_sequence)
print(f"Pgen: {result_inference.pgen:.1e}")

Infer a model:

# Inference of a model
# use a very small number of sequences to keep short (takes ~30s)

# here we just generate the sequences needed
generator = igor_model.generator()
example_seq = generator.generate(False)
sequences = [generator.generate(False).full_seq for _ in range(500)]

# define parameters for the alignment and the inference (also possible for the evaluation)
align_params = righor.AlignmentParameters()
align_params.left_v_cutoff = 70
infer_params = righor.InferenceParameters()

# generate an uniform model as a starting point
# (it's generally *much* faster to start from an already inferred model)
model = igor_model.copy()
model.p_ins_vd = np.ones(model.p_ins_vd.shape)
model.error_rate = 0

# align multiple sequences at once
aligned_sequences = model.align_all_sequences(sequences, align_params)

# multiple round of expectation-maximization to infer the model
models = {}
model = igor_model.uniform()
model.error_rate = 0
models[0] = model
for ii in tqdm(range(35)):
    models[ii+1] = models[ii].copy()
    models[ii+1].infer(aligned_sequences, infer_params)

Visualize and save the model

# visualisation of the results
fig = righor.plot_vdj(*[models[ii] for ii in [10, 2, 1, 0]] + [igor_model],
            plots_kws=[{'label':f'Round #{ii}', 'alpha':0.8} for ii in [10,2, 1, 0]] + [{'label':f'og'}] )
# save the model in the Igor format
# will return an error if the directory already exists
models[10].save_model('test_save')
# load the model
igor_model = righor.vdj.Model.load_model_from_files('test_save/model_params.txt',
                                          'test_save/model_marginals.txt',
                                          'test_save/V_gene_CDR3_anchors.csv',
                                          'test_save/J_gene_CDR3_anchors.csv')

# save the model in json format (one file)
models[10].save_json('test_save.json')
# load the model in json
igor_model = righor.vdj.Model.load_json('test_save.json')

Extra stuff:

Main differences with IGoR:

"dynamic programming" method, instead of summing over all events we first pre-compute over sum of events. This means that we can run it with undefined nucleotides like N (at least in theory, I need to add full support for these).
The D gene alignment is less constrained
can measure pgen for amino-acid sequences (like olga)
more error models (and more flexible, better for IGH)

Limitations:

Need to get rid of any primers/ends on the V gene side before running it
The reads need to be long enough to fully cover the CDR3 (even when it's particularly long)

Programming stuff:

There's a wasm version for web use.
python version is in a different crate now.
to add a model permanently, add it to "models.json". First model in a category is the default model. Each field is one independant model. The elements in chain and species should always be lower-case.
ambiguous nucleotide with "errors", the pgen won't work very well probably.

Dependencies

~20–33MB
~545K SLoC