Coordinates upon a molecule.
A coordinate is the fundamental unit for describing a location within a genome. Coordinates point to a single location within a contiguous molecule (typically a nucleic acid molecule, such as DNA or RNA, or a protein) and are specified at the nucleotide level of abstraction.
Coordinates are comprised of three components:
- The name of the molecule upon which the coordinate sits is known as the contig.
- Each molecule is made of a contiguous series of elements. The offset of the selected element with respect to the starting element of the molecule is known as the position.
- Optionally, if the molecule is stranded, the strand upon which the coordinate sits is known as the strand.
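To make the three components concrete, here is a minimal, standard-library-only sketch. The names SimpleCoordinate, Strand, and describe are hypothetical and exist only for illustration, and the contig:strand:position rendering is an assumption modelled on the interval notation used later in these docs. This is not the crate's actual implementation.

```rust
/// The strand of a coordinate (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Strand {
    Positive,
    Negative,
}

/// A simplified coordinate: a contig name, an optional strand, and a position.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SimpleCoordinate {
    pub contig: String,
    pub strand: Option<Strand>,
    pub position: u64,
}

impl SimpleCoordinate {
    /// Builds a coordinate from its three components.
    pub fn new(contig: &str, strand: Option<Strand>, position: u64) -> Self {
        Self {
            contig: contig.to_string(),
            strand,
            position,
        }
    }

    /// Renders the coordinate as text, omitting the strand when there is none.
    /// The exact format is an assumption made for this sketch.
    pub fn describe(&self) -> String {
        match self.strand {
            Some(Strand::Positive) => format!("{}:+:{}", self.contig, self.position),
            Some(Strand::Negative) => format!("{}:-:{}", self.contig, self.position),
            None => format!("{}:{}", self.contig, self.position),
        }
    }
}
```

Calling SimpleCoordinate::new("seq0", Some(Strand::Positive), 1).describe() yields the text seq0:+:1, while passing None for the strand leaves the strand component out entirely.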
Coordinates, via their positions, can fall within the interbase coordinate system (which is closely related to the 0-based, half-open coordinate system) or the in-base coordinate system (closely related to the 1-based, fully-closed coordinate system). In this crate, the interbase coordinate system is denoted using the interbase/Interbase identifiers, and the in-base coordinate system is denoted using the base/Base identifiers (we didn't like the way in_base/InBase looked).
If you want to learn more about the supported coordinate systems, or if you want to learn why this crate uses the terms that it does (e.g., "in-base" instead of "1-based"), please jump to this section of the docs.
Scope
At present, omics-coordinate is focused almost exclusively on nucleic acid molecules. In the future, however, we expect to expand this to cover proteins as well.
Quickstart
To get started, you'll need to decide if you want to use interbase or in-base coordinates. This decision largely depends on your use case, the consumers of the data, and the context of both (a) where input data is coming from and (b) where output data will be shared. Note that, if you're working with a common bioinformatics file format, the coordinate system is often dictated by the format itself. If you need help deciding which coordinate system to use, you should start by reading the positions section of the docs.
Once you've decided on which coordinate system you'd like to use, you can create coordinates like so:
use omics_coordinate::Coordinate;
use omics_coordinate::system::Base;
use omics_coordinate::system::Interbase;
// An interbase coordinate.
let coordinate = Coordinate::<Interbase>::try_new("seq0", "+", 0)?;
println!("{:#}", coordinate);
// An in-base coordinate.
let coordinate = Coordinate::<Base>::try_new("seq0", "+", 1)?;
println!("{:#}", coordinate);
For convenience, the crate also provides type aliases for the interbase and in-base variants of the relevant concepts. For example, instead of writing Position<Interbase>, you can simply import a zero::Position.
use omics_coordinate::interbase::Coordinate;
let coordinate = Coordinate::try_new("seq0", "+", 0)?;
println!("{:#}", coordinate);
Background
Coordinate systems can be surprisingly hard to find comprehensive, authoritative material for and, thus, have a reputation for being confusing to newcomers to the field. To address this lack of material and to describe how terms are used within this crate, the authors lay out their understanding of the history behind the terminology used in the community and then cover their perspective on what terms are most appropriate to be used within different contexts. Notably, this may not match the worldview of other popular resources or papers out there. In these cases, departures from convention are noted alongside carefully reasoned opinions on why the departure was made.
Biology Primer
Before diving into the coordinate system-specific details, we must first lay some groundwork for terms used within genomics in general. These definitions serve as a quick overview to orient you to the discussion around coordinate systems—if you're interested in more detailed information, you can learn more at https://learngenomics.dev.
- A genome is the complete set of genetic code stored within a cell (learn more).
- Deoxyribonucleic acid, or DNA, is a molecule that warehouses the aforementioned genetic code. In eukaryotic cells, DNA resides in the nucleus of a cell.
  - DNA is stored as a sequence of nucleotides (i.e., A, C, G, and T).
  - DNA is double-stranded, meaning there are two complementary sequences of nucleotides that run antiparallel to each other.
- Ribonucleic acid, or RNA, is a molecule that is transcribed from a particular stretch of DNA.
  - RNA is also stored as a sequence of nucleotides (though, in this case, the nucleotides are A, C, G, and U).
  - RNA is single-stranded, meaning that it represents the transcription of only one of the strands of DNA.
  - RNA generally either (a) serves as a template for the production of a protein or (b) has some functional role in and of itself.
- Proteins are macromolecules that are assembled by translating the nucleotide sequence stored within an RNA molecule into a chain of amino acids. Proteins play a wide variety of roles in the function of a cell.
Though there are exceptions to this rule, the core idea is this: through a series of steps described within the central dogma of molecular biology, genetic code stored within DNA is commonly transcribed to RNA and either (a) the RNA is used as a template to assemble a functional protein through the process of translation [in the case of coding RNA], or (b) that RNA plays some functional role in and of itself [in the case of non-coding RNA].
This crate attempts to provide facilities to effectively describe coordinates within the context of DNA molecules and RNA molecules in the various notations used within the community. We'll start with the most granular concepts (e.g., contigs, positions, and strands) and work our way up to the most broad reaching concepts (e.g., intervals and coordinate systems).
Contigs
Typically, genetic information that constitutes a genome is not stored as a single, contiguous molecule. Instead, genomes are commonly broken up into multiple, contiguous molecules of DNA known as chromosomes. Beyond the chromosomes, other sequences, such as the Epstein–Barr virus, the mitochondrial genome, or decoy sequences, are inserted as contigs within a reference genome to serve various purposes. This broader category of contiguous nucleotide sequences is colloquially referred to as "contigs".
As we learn more about the human genome, new versions, called genome builds, are released that describe the known genetic sequence therein. Each contig contained within a particular genome build is assigned a unique identifier within that build (e.g., chr1 within the hg38 genome build).
Specifying the contiguous molecule upon which a coordinate is located is the
first step in anchoring the coordinate within a genome.
For example, the most recent release (ref) of the human genome at the time of writing has exactly 24 contigs: these represent the 22 autosomes and the X/Y sex chromosomes present in the human genome. Interestingly, earlier versions of the human genome, such as GRCh37 and GRCh38, contain more contigs that represent phenomena such as unplaced sequences (i.e., sequences that we know are located somewhere in the human genome, but we didn't know exactly where when the reference genome was released) and unlocalized sequences (i.e., sequences where we know the chromosome upon which the sequence was located but not the exact position).
Design Considerations
There are no current or planned restrictions on what a contig can be named,
as the crate needs to remain able to support all possible use cases. That
said, the authors may introduce (optional) convenience methods based on
common naming conventions in the future, such as the detection of chr
prefixes, which is a convention for the naming of chromosomes specifically.
Positions
This section lays out a detailed, conceptual model within which we can compare and contrast the two kinds of positions used within genomic coordinate systems: namely, in-base positions and interbase positions. We then cover how these terms relate to commonly used terms in the community (including a "0-based, half-open coordinate system" and a "1-based, fully-closed coordinate system") and how you can use this crate to flexibly represent a spectrum of locations within a genome.
Before we begin, a word of caution—many materials attempt to make the differences between in-base and interbase positions (or the closely related 0-based, half-open and 1-based, fully closed coordinate systems) appear small and unremarkable (e.g., by providing seemingly straightforward formulas to convert between the two). In fact, after a quick scan of these materials, you may even be tempted to view the two systems as simply a difference in accounting and off-by-one hoopla!
In the authors' opinion, not only is this not true, it also doesn't serve you well to think of the coordinate systems as anything less than entirely different universes that must be explicitly and responsibly traversed between. To be clear, we're not suggesting that the existing materials are wrong: often, you can follow the conventions laid out, and, as long as the baked-in assumptions are consistently true for your use case, everything will be well. That said, we endeavour to go further within this crate: to explore the very fabric of these coordinate systems, point out the assumptions made in each coordinate system, and enable you to understand and write code that works across the spectrum of possible position representations.
In-base and Interbase Positions
Positions within a genomic coordinate system can be represented as either in-base positions or interbase positions:
- In-base positions point directly to and fully encapsulate a nucleotide. These types of positions are generally considered to be intuitive from a biological reasoning standpoint and are often used in contexts where data is reported back to a biological audience (e.g., genome browsers and public variant databases). Though we use the term "in-base" exclusively in this document, these types of positions are also sometimes referred to as simply "base" positions in the broader community.
- Interbase positions point to the spaces between nucleotides. These positions are generally considered to be easier to work with computationally for a variety of reasons that will become apparent in the text that follows. It is also possible to unambiguously represent certain types of variation, such as insertions and structural variant breakpoints, using interbase positions. As such, interbase positions are commonly used as the internal representation of positions within bioinformatics tools as well as in situations where the output is meant to be consumed computationally (e.g., APIs).
For example, SAM files, which are intended to be human-readable, use in-base positions to make themselves more easily interpretable and compatible with genomic databases. Their non-human-readable, binary counterparts, known as BAM files, use interbase positions for the reasons described above. The decision on which coordinate system to use was largely based on how the two file types were meant to be consumed (to learn more about what the author of SAM/BAM said about the decision, read the end of this StackExchange answer).
Conceptual Model
Here, we introduce a conceptual model that is useful for comparing and contrasting the two coordinate systems. Under this model, nucleotides and the spaces between them are pulled apart and considered to coexist as independent entities laid out along a discrete axis. Both nucleotides and spaces represent a "slot", and the kind of slot may be distinguished by designating it as a "nucleotide slot" or a "space slot", respectively. Numbered positions are assigned equidistantly at every other slot within either system, but the type of slot where positions are assigned is mutually exclusive between the two systems:
- Numbered positions are assigned to each of the nucleotide slots within the in-base coordinate system.
- Numbered positions are assigned to each of the space slots within the interbase coordinate system.
Importantly, in both systems, only slots with an assigned position can be specified using a position. This has incredibly important implications on what locations can and cannot be expressed within the two coordinate systems.
The diagram below depicts the model applied over a short sequence of seven nucleotides. Each slot has a series of double pipe characters (║) that links a slot with its assigned, numbered position (if it exists) within the in-base and interbase coordinate systems. Note that, though the two position systems are displayed in parallel in the diagram below, that is only so that they can be compared/contrasted more easily. More specifically, they do not interact with each other in any way.
========================== seq0 =========================
•   G   •   A   •   T   •   A   •   T   •   G   •   A   •
║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║
║[--1--]║[--2--]║[--3--]║[--4--]║[--5--]║[--6--]║[--7--]║ In-base Positions
0       1       2       3       4       5       6       7 Interbase Positions
As was alluded to above, reasoning about the in-base coordinate system under this model is relatively straightforward: if one wants to create a position representing the location of the first nucleotide (G), it can be done by simply denoting the numbered position assigned to the same slot as the G nucleotide, which is position 1.
Creating a position that represents the same nucleotide using the interbase coordinate system is more complicated. Recall that (a) no numbered positions are assigned to nucleotide slots within the interbase coordinate system and (b) only numbered slots may be referenced as a position. As such, referring to the first nucleotide using a single, numbered position is impossible. Indeed, in a strict sense, a range of numbered positions must be used to encapsulate even this single nucleotide (🤯), namely the range [0-1] (note that the range of interbase positions is generally considered exclusive, but that does not apply here when the space slots and nucleotide slots are split).
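The relationship described above can be sketched as a small helper: under the conventional numbering shown in the diagram (in-base positions starting at 1, interbase positions starting at 0), the nucleotide at in-base position n sits between interbase positions n - 1 and n. The function name flanking_interbase is hypothetical and is not part of this crate.

```rust
/// Returns the pair of interbase positions that flank the nucleotide found at
/// the given in-base position.
///
/// Assumes the conventional numbering (in-base starts at 1, interbase starts
/// at 0). Returns `None` for 0, which is not a valid in-base position under
/// that convention.
pub fn flanking_interbase(in_base: u64) -> Option<(u64, u64)> {
    if in_base == 0 {
        return None;
    }

    // The space slot before the nucleotide is numbered one less than the
    // nucleotide itself; the space slot after it shares the same number.
    Some((in_base - 1, in_base))
}
```

For the first nucleotide in the diagram (G, in-base position 1), this gives the pair (0, 1), matching the range [0-1] discussed above.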
Starting Position
By convention within the community, interbase positions almost always start at position zero (0) and in-base positions almost always start at position one (1). As far as the authors can tell, this is for three main reasons (please contribute to the docs if you disagree with any of these assertions or know of other reasons):
- History. Biological coordinate systems and databases have historically started with the first entity of a sequence at position 1. Thus, in-base coordinates (which, again, are generally considered to be more suitable for a broader biological audience) tend to follow these same conventions. Because interbase positions effectively capture the space around these entities, a number before one is needed to represent the space before the first entity.
- Intention. This interplay works out well, as interbase coordinates depart from a biologically intuitive model in favor of a more computationally intuitive model. To that end, interbase positions typically mirror programming languages in that counting starts at 0. This suggests that, many times, interbase coordinates are a more natural fit for existing data structures and algorithms.
- Convention. Beyond the reasons above (and, further, not strictly imposed by the definitions of interbase and in-base coordinate systems), the community has evolved to use the starting position of 0 or 1 to allude to the use of interbase and in-base positions, respectively.
Strand
DNA is a double-stranded molecule that stores genetic code. This means that two sequences of complementary nucleotides run antiparallel to each other. Each strand is read in the direction often referred to as 5' to 3', a name that refers to connections within the underlying chemical structure. For example, below is a fictional double-stranded molecule with the name seq0.
---------------- Read this direction --------------->
5'                                                 3'
===================== seq0 (+) ======================
G   A   T   A   T   G   A   A   T   A   T   G   A   G
|   |   |   |   |   |   |   |   |   |   |   |   |   |
C   T   A   T   A   C   T   T   A   T   A   C   T   C
===================== seq0 (-) ======================
3'                                                 5'
<--------------- Read this direction ----------------
In a real-world, biological context, both strands contain genetic information that is important to the function of the cell—though both strands are biologically important, some system of labelling must be introduced to distinguish which of the two strands a genomic coordinate is located on.
To address this, a reference genome selects one of the strands as the positive strand (also called the "sense" strand, the "reference" strand, or the + strand) for each contiguous molecule. This implies that the opposite, complementary strand is the negative strand (also called the "antisense" strand, the "complementary" strand, or the - strand). Notably, reference genomes only specify the nucleotide sequence for the positive strand, as the negative strand's nucleotide sequence may be computed as the reverse complement of the positive strand.
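As a brief illustration of that last point, the sketch below computes a reverse complement for an uppercase DNA sequence. The function name reverse_complement is hypothetical, and rejecting anything other than A, C, G, and T is a simplifying assumption of this example (real data often includes ambiguity codes such as N).

```rust
/// Computes the reverse complement of an uppercase DNA sequence.
///
/// Returns `None` if the sequence contains anything other than `A`, `C`,
/// `G`, or `T` (a simplifying assumption made for this sketch).
pub fn reverse_complement(sequence: &str) -> Option<String> {
    sequence
        .chars()
        // Walk the positive strand backwards...
        .rev()
        // ...and swap each nucleotide for its complementary partner.
        .map(|nucleotide| match nucleotide {
            'A' => Some('T'),
            'T' => Some('A'),
            'C' => Some('G'),
            'G' => Some('C'),
            _ => None,
        })
        // Collecting into `Option<String>` stops at the first invalid character.
        .collect()
}
```

Applied to the positive strand of the fictional seq0 molecule above (GATATGAATATGAG), this produces CTCATATTCATATC, which is the negative strand in the figure read in its own 5' to 3' direction.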
The concept of strandedness is useful when describing the location of a coordinate on a molecule with two strands. Some nucleic acid molecules, such as RNA, are single-stranded molecules: RNA is derived from a particular strand of DNA, but the RNA molecule itself is not considered to be stranded.
Within this crate, a Strand
always refers to the strand of the
coordinate upon a molecule (if the molecule is stranded). If the molecule
upon which the nucleotide(s) sit is not stranded, then no strand should be
specified.
This means that:
- Coordinates that lie upon a DNA molecule must always have a strand. The Strand::Positive and Strand::Negative variants are used to distinguish which strand a coordinate sits upon relative to the strand specified in the reference genome.
- Coordinates that lie upon an RNA molecule have no strand. In particular, the original strand of DNA from which a position on RNA is derived is lost during any conversion from one to the other. If it is of interest, you may keep track of this kind of thing on your own at conversion time.
Intervals
Intervals describe a range of positions upon a contiguous molecule. Generally speaking, you can think of an interval as simply a start coordinate and end coordinate within one of the coordinate systems. Intervals are always closed with respect to their comprising coordinates.
The following figure illustrates this concept using the notation described in the position section of the docs.
========================== seq0 ===========================
•   G   •   A   •   T   •   A   •   T   •   G   •   A   •
║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║
║   1   ║   2   ║   3   ║   4   ║   5   ║   6   ║   7   ║ In-base Positions
0       1       2       3       4       5       6       7 Interbase Positions
===========================================================
┃   ┃                                               ┃   ┃
┃   ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛   ┃ seq0:+:1-7 (In-base interval)
┃                 Both contain "GATATGA"                ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ seq0:+:0-7 (Interbase interval)
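The figure above can be mirrored in code by slicing a sequence with each kind of interval. This is a minimal sketch that assumes an ASCII sequence and the conventional starting positions (1 for in-base, 0 for interbase); the function names slice_in_base and slice_interbase are hypothetical and are not part of this crate.

```rust
/// Extracts the nucleotides covered by a closed in-base interval
/// (`start` and `end` both name nucleotides, counted from 1).
pub fn slice_in_base(sequence: &str, start: usize, end: usize) -> Option<&str> {
    if start == 0 || start > end {
        return None;
    }

    // Nucleotide `n` occupies the byte range `n - 1..n`.
    sequence.get(start - 1..end)
}

/// Extracts the nucleotides enclosed between two interbase positions
/// (`start` and `end` both name spaces between nucleotides, counted from 0).
pub fn slice_interbase(sequence: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }

    // Interbase positions line up directly with byte offsets.
    sequence.get(start..end)
}
```

With the sequence GATATGA, both slice_in_base(seq, 1, 7) and slice_interbase(seq, 0, 7) return the full sequence. Note also that slice_interbase(seq, 3, 3) returns an empty string: an interbase interval can name a location between two nucleotides without covering either of them, which has no equivalent among in-base intervals in this sketch.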
Crate Design
Throughout the crate, you will see references to interbase and in-base
variants of the concepts above. For example, there is a core Position
struct that is defined like so:
pub struct Position<S>
where
    S: System,
{
    // private fields
}
The struct takes a single, generic parameter that is a System. In this design, functionality that is fundamental to both interbase and in-base position types is implemented in the core Position struct. Functionality that is different between the two coordinate systems is implemented through traits (in the case of positions, the Position trait) and exposed through trait-constrained methods (e.g., Position::checked_add).
Note that some concepts, such as Contig and Strand, are coordinate system invariant. As such, they don't take a System generic type parameter.
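For readers unfamiliar with this pattern, the following is a simplified, standalone sketch of a type that is generic over a marker "system". It is not the crate's actual code: the associated constants NAME and FIRST, the get and describe methods, and the Option-returning try_new are all assumptions made for illustration, and the real crate expresses system-specific behaviour through its own traits.

```rust
use std::marker::PhantomData;

/// A marker trait for coordinate systems (simplified for this sketch).
pub trait System {
    /// A human-readable name for the system.
    const NAME: &'static str;
    /// The conventional first position within the system.
    const FIRST: u64;
}

/// Marker type for the interbase coordinate system.
pub struct Interbase;

/// Marker type for the in-base coordinate system.
pub struct Base;

impl System for Interbase {
    const NAME: &'static str = "interbase";
    const FIRST: u64 = 0;
}

impl System for Base {
    const NAME: &'static str = "base";
    const FIRST: u64 = 1;
}

/// A position tagged at the type level with its coordinate system.
pub struct Position<S: System> {
    value: u64,
    system: PhantomData<S>,
}

impl<S: System> Position<S> {
    /// Creates a position, rejecting values below the system's first position.
    pub fn try_new(value: u64) -> Option<Self> {
        if value < S::FIRST {
            return None;
        }

        Some(Self {
            value,
            system: PhantomData,
        })
    }

    /// Returns the underlying numeric value.
    pub fn get(&self) -> u64 {
        self.value
    }

    /// Describes the position together with its system's name.
    pub fn describe(&self) -> String {
        format!("{} position {}", S::NAME, self.value)
    }
}
```

Because the system is part of the type, a Position<Interbase> cannot be passed where a Position<Base> is expected, so mixing the two systems becomes a compile-time error rather than an off-by-one bug discovered at runtime.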
Learning More
In the original writing of these docs, it was difficult to find a single, authoritative source regarding all of the conventions and assumptions that go into coordinate systems. Here are a few links that the authors consulted when writing this crate.
- This blog post from the UCSC genome browser team does a pretty good job explaining the basics of 0-based versus 1-based coordinate systems and why they are used in different contexts.
  - Note that this crate does not follow the conventions UCSC uses for formatting the two coordinate systems differently (e.g., seq0 0 1 for 0-based coordinates and seq1:1-1 for 1-based coordinates). Instead, the two coordinate systems are distinguished by the Rust type system and are serialized similarly (e.g., seq0:+:0-1 for 0-based coordinates and seq0:+:1-1 for 1-based coordinates).
- This blog post also presents the two coordinate systems and gives some details about concrete file formats where each are used.
- This cheat sheet is a popular community resource (though, you should be sure to read the comments!).