#sequence #model #sequencing #generate #evaluate #package #recombination

bin+lib righor

Righor creates model of Ig/TCR sequences from sequencing data

6 releases

0.2.6 Nov 25, 2024
0.2.5 Nov 25, 2024
0.2.4 Sep 7, 2024
0.2.3 Apr 30, 2024
0.2.1 Mar 28, 2024

#94 in Biology

Download history 155/week @ 2024-09-02 15/week @ 2024-09-09 26/week @ 2024-09-16 16/week @ 2024-09-23 15/week @ 2024-09-30 439/week @ 2024-11-25 104/week @ 2024-12-02 68/week @ 2024-12-09 35/week @ 2024-12-16

646 downloads per month

MIT license

780KB
11K SLoC

RIGHOR

This package, based on IGoR, is meant to learn models of V(D)J recombination.

It can:

  • generate sequences
  • evaluate sequences (infer the most likely recombination scenarios)
  • compute "pgen"

It's probably easier to use the companion python package (pip install righor), but working in Rust directly should also be viable.

How to use the python package:

Load a model:

import righor
import matplotlib.pyplot as plt
import seaborn
import pandas as pd
from tqdm.notebook import tqdm
from collections import Counter
import numpy as np


igor_model = righor.load_model("human", "trb")

# alternatively, you can load a model from igor files
# igor_model = righor.load_model_from_files(params.txt, marginals.txt, anchor_v.csv, anchor_j.csv)

Generate sequences fast:

# Create a generator object
generator = igor_model.generator(seed=42) # or igor_model.generator() to run it without a seed

# Generate 10'000 functional sequences (not out-of-frame, no stop codons, right boundaries)
for _ in tqdm(range(10000)):
    # generate_without_errors ignore Igor error model, use "generate" if this is needed
    sequence = generator.generate_without_errors(functional=True)
    if "IGH" in sequence.cdr3_aa:
        print("TRB CDR3 containing \"IGH\":", sequence.cdr3_aa)

# Generate one sequence with a particular V/J genes family
V_genes = righor.genes_matching("TRBV5", igor_model) # return all the V genes that match TRBV5
J_genes = righor.genes_matching("TRBJ", igor_model) # all the J genes
generator = igor_model.generator(seed=42, available_v=V_genes, available_j=J_genes)
generation_result = generator.generate_without_errors(functional=True)
print("Result:")
print(generation_result)
print("Explicit recombination event:")
print(generation_result.recombination_event)

Evaluate a given sequence:

## Evaluate a given sequence

my_sequence = "ACCCTCCAGTCTGCCAGGCCCTCACATACCTCTCAGTACCTCTGTGCCAGCAGTGAGGACAGGGACGTCACTGAAGCTTTCTTTGGACAAGGCACC"

# evaluate the sequence
result_inference = igor_model.evaluate(my_sequence)

# Most likely scenario
best_event = result_inference.best_event

print(f"Probability that this specific event chain created the sequence: {best_event.likelihood / result_inference.likelihood:.2f}.")
print(f"Reconstructed sequence (without errors):", best_event.reconstructed_sequence)
print(f"Pgen: {result_inference.pgen:.1e}")

Infer a model:

# Inference of a model
# use a very small number of sequences to keep short (takes ~30s)

# here we just generate the sequences needed
generator = igor_model.generator()
example_seq = generator.generate(False)
sequences = [generator.generate(False).full_seq for _ in range(500)]

# define parameters for the alignment and the inference (also possible for the evaluation)
align_params = righor.AlignmentParameters()
align_params.left_v_cutoff = 70
infer_params = righor.InferenceParameters()

# generate an uniform model as a starting point
# (it's generally *much* faster to start from an already inferred model)
model = igor_model.copy()
model.p_ins_vd = np.ones(model.p_ins_vd.shape)
model.error_rate = 0

# align multiple sequences at once
aligned_sequences = model.align_all_sequences(sequences, align_params)

# multiple round of expectation-maximization to infer the model
models = {}
model = igor_model.uniform()
model.error_rate = 0
models[0] = model
for ii in tqdm(range(35)):
    models[ii+1] = models[ii].copy()
    models[ii+1].infer(aligned_sequences, infer_params)

Visualize and save the model

# visualisation of the results
fig = righor.plot_vdj(*[models[ii] for ii in [10, 2, 1, 0]] + [igor_model],
            plots_kws=[{'label':f'Round #{ii}', 'alpha':0.8} for ii in [10,2, 1, 0]] + [{'label':f'og'}] )
# save the model in the Igor format
# will return an error if the directory already exists
models[10].save_model('test_save')
# load the model
igor_model = righor.vdj.Model.load_model_from_files('test_save/model_params.txt',
                                          'test_save/model_marginals.txt',
                                          'test_save/V_gene_CDR3_anchors.csv',
                                          'test_save/J_gene_CDR3_anchors.csv')

# save the model in json format (one file)
models[10].save_json('test_save.json')
# load the model in json
igor_model = righor.vdj.Model.load_json('test_save.json')

Extra stuff:

Main differences with IGoR:

  • "dynamic programming" method, instead of summing over all events we first pre-compute over sum of events. This means that we can run it with undefined nucleotides like N (at least in theory, I need to add full support for these).
  • The D gene alignment is less constrained
  • can measure pgen for amino-acid sequences (like olga)
  • more error models (and more flexible, better for IGH)

Limitations:

  • Need to get rid of any primers/ends on the V gene side before running it
  • The reads need to be long enough to fully cover the CDR3 (even when it's particularly long)

Programming stuff:

  • There's a wasm version for web use.
  • python version is in a different crate now.
  • to add a model permanently, add it to "models.json". First model in a category is the default model. Each field is one independant model. The elements in chain and species should always be lower-case.
  • ambiguous nucleotide with "errors", the pgen won't work very well probably.

Dependencies

~19–32MB
~508K SLoC