Skip to content
Open
3 changes: 0 additions & 3 deletions docs-src/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,6 @@ Interval
DisjointIntervalSequence
========================

.. autoclass:: genome_kit.diseq.IndexDirection
:members:

.. autoclass:: genome_kit.diseq.DisjointIntervalSequence
:special-members:
:members:
Expand Down
218 changes: 194 additions & 24 deletions docs-src/diseq.rst
Original file line number Diff line number Diff line change
Expand Up @@ -51,15 +51,15 @@ A DIS has two aspects:
The following examples illustrate how the coordinate space and segment interact,
using both diagrams and code.

Consider a transcript on the + strand with the following genomic layout:
::
Consider a transcript on the + strand with the following genomic layout::

Genomic Coordinates: 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
DNA Sequence: | A T G C C G C A T G C C G C |
| |<------->| |<------->| |<--->| |<----------->| |<--->| |
5' Exon1 Intron1 Exon2 Intron2 Exon3 3'

Extracting only the exons yields the following disjoint intervals:
::
Extracting only the exons yields the following disjoint intervals::

Genomic Coordinates: 153 154 155 159 160 165 166 167
DNA Sequence: | A T G C A G C |
| |<------->| |<--->| |<--->| |
Expand All @@ -74,15 +74,15 @@ These exon intervals can be represented as :py:class:`~genome_kit.Interval` obje
>>> exon3 = Interval("chr1", "+", 165, 167, "hg38")

To define a segment spanning the full exonic sequence (from the start of Exon1 to
the end of Exon3), the exon intervals are first converted into a DIS coordinate space
::
the end of Exon3), the exon intervals are first converted into a DIS coordinate space::

DIS Coordinates: 0 1 2 3 4 5 6 7
DNA Sequence: | A T G C A G C |
| |<------->| |<--->| |<--->| |
5' Exon1 Exon2 Exon3 3'

The default segment spans the entire coordinate space
::
The default segment spans the entire coordinate space::

DIS Coordinates: 0 1 2 3 4 5 6 7
DNA Sequence: A T G C A G C
|<--------------------->|
Expand Down Expand Up @@ -111,15 +111,15 @@ and an end index of 7::
A DIS can also represent a segment on the strand opposite the coordinate space.
This is useful for modeling the complementary sequence or a binding partner.

Starting from the coordinate space defined above
::
Starting from the coordinate space defined above::

DIS Coordinates: 0 1 2 3 4 5 6 7
DNA Sequence: | A T G C A G C |
| |<----->| |<->| |<->| |
5' Exon1 Exon2 Exon3 3'

The opposite strand shares the same DIS coordinate indices
::
The opposite strand shares the same DIS coordinate indices::

5' Positive strand 3'
DIS Coordinates: | 0 1 2 3 4 5 6 7 |
DNA Sequence (+): | A T G C A G C |
Expand All @@ -131,8 +131,8 @@ The opposite strand shares the same DIS coordinate indices
The DIS coordinate indices are identical on both strands. To obtain the complement
of a given segment, the same start and end indices apply; only the
``on_coordinate_strand`` flag changes. The following shows the full-length segment
on the opposite strand
::
on the opposite strand::

5' Coordinate Strand 3'
DIS Coordinates: | 0 1 2 3 4 5 6 7 |
DNA Sequence (+): | A T G C A G C |
Expand Down Expand Up @@ -171,23 +171,24 @@ segments, since the start and end indices alone do not encode strand information
>>> dis = DisjointIntervalSequence.from_intervals([iv])
>>> assert dis.end5.start < dis.end3.start

Consider a transcript on the negative strand:
::
Consider a transcript on the negative strand::

3' Exon3 Intron2 Exon2 Intron1 Exon1 5'
| |<------->| |<------->| |<--->| |<----------->| |<--->| |
DNA Sequence (-): | G T C A G T C A G T C A G T |
Genomic Coordinates: 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
Negative Strand
Extracting only the exons:
::

Extracting only the exons::

3' Exon3 Exon2 Exon1 5'
| |<------->| |<--->| |<--->| |
DNA Sequence (-): | G T C C A G T |
Genomic Coordinates: 153 154 155 159 160 165 166 167
Negative Strand

Converting these exons into a DIS coordinate space:
::
Converting these exons into a DIS coordinate space::

DIS Coordinates: 0 1 2 3 4 5 6 7
DNA Sequence: | T G A C C T G |
| |<->| |<->| |<----->| |
Expand All @@ -210,8 +211,8 @@ largest index to the 3' end::
>>> dis_neg.coordinate_length
7

A full-length segment on the coordinate strand
::
A full-length segment on the coordinate strand::

DIS Coordinates: 0 1 2 3 4 5 6 7
DNA Sequence: T G A C C T G
|<--------------------->|
Expand All @@ -231,8 +232,8 @@ objects, strand is expressed only as "same strand" or "opposite strand"::
>>> dis_neg.on_coordinate_strand
True

The same coordinate space with an opposite-strand segment
::
The same coordinate space with an opposite-strand segment::

DIS Coordinates: 0 1 2 3 4 5 6 7
DNA Sequence (-): T G A C C T G
-----------------------------------------------------
Expand Down Expand Up @@ -408,6 +409,7 @@ Strand Methods
A DIS segment can sit on either 'virtual' strand independently of the coordinate
intervals. The ``on_coordinate_strand`` property indicates whether the
segment is on the same strand as the coordinate intervals::

On Coordinate Strand: True
Start Index: 1
End Index: 6
Expand All @@ -419,6 +421,8 @@ segment is on the same strand as the coordinate intervals::
DIS Coordinates: 0 1 2 3 4 5 6 7
Opposite Strand

::
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
::
.. code-block:: python


>>> dis.on_coordinate_strand
True
>>> dis.is_same_strand()
Expand Down Expand Up @@ -549,6 +553,56 @@ Negative values contract the segment::
Contracting to exactly zero length is valid, but contracting past
zero raises ``ValueError``.

expand_coord
~~~~~~~~~~~~

``expand`` and ``expand_coord`` both grow the segment, but they do so at
different layers. ``expand`` stretches the segment *within* the existing
coordinate space; the coord intervals are unchanged.
``expand_coord`` instead extends the coordinate
space itself by lengthening the outermost coord intervals, and grows the
segment by the same amount so it tracks the new edges.

This distinction matters when you want flanking genomic context to become
part of the indexed coordinate space rather than just a virtual extension,
such as when getting the DNA of the DIS segment (off-coordinate bases are
N-padded).

For example, when modeling a transcript plus its surrounding promoter and
poly-A regions, ``expand_coord(50)`` adds 50 genomic bases to each end
of the coordinate space, and those bases are now real, indexable, DNA-
backed positions rather than N-padding::

>>> dis.coord_strand
'+'
>>> dis.coordinate_intervals
(Interval("chr1", "+", 1000, 1100, "hg38"),
Interval("chr1", "+", 1200, 1300, "hg38"),
Interval("chr1", "+", 1500, 1600, "hg38"))
>>> dis.coordinate_length
300
>>> dis.start, dis.end
(0, 300)

>>> expanded = dis.expand_coord(50)
>>> expanded.coordinate_intervals
(Interval("chr1", "+", 950, 1100, "hg38"),
Interval("chr1", "+", 1200, 1300, "hg38"),
Interval("chr1", "+", 1500, 1650, "hg38"))
>>> expanded.coordinate_length
400
>>> expanded.start, expanded.end
(0, 400)

Only the outer 5' edge of the first coord interval and the outer 3' edge
of the last coord interval are extended; gaps between coord intervals are
preserved. On the negative strand, ``upstream`` and ``dnstream`` still
refer to the transcript's 5' and 3' ends, so the underlying genomic
adjustments mirror those on the plus strand.

Unlike ``expand``, ``expand_coord`` does not accept negative values —
shrinking the coordinate space is not supported.

Positional Comparisons
======================

Expand Down Expand Up @@ -656,3 +710,119 @@ space and strand apply::
>>> z = DisjointIntervalSequence(coord_ivs, start=50, end=50)
>>> z.within(a)
True

Mapping Between Genomic and DIS Coordinates
===========================================

A DIS lives in two coordinate systems at once: the genomic coordinates of
its underlying intervals, and the spliced DIS index space. ``lower`` and
``lift_interval`` are the two directions of travel between them.

``lower`` projects the segment back to genomic space. ``lift_interval``
projects a genomic interval into the DIS's index space and clips it
against the segment. They are conceptual inverses, but each has to deal
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are they mechanically inverses as well? I remember the old code merged adjacent intervals. this one returns in sorted order?

with a structural mismatch — gaps in genomic space have no representation
in DIS space, and contiguous DIS indices may correspond to two
non-adjacent genomic regions — so neither is a clean bijection.

lower
~~~~~

Because a segment can straddle one or more boundaries between coord
intervals, ``lower`` returns a *list* of genomic
:py:class:`~genome_kit.Interval` objects rather than a single one. The
list is in 5'→3' order with respect to the segment, regardless of the
underlying coord strand or whether the segment is on the coord strand.

A segment that fits inside a single coord interval lowers to a one-element
list. A segment that crosses N coord-interval boundaries lowers to N+1
intervals — the gaps between coord intervals are skipped::

>>> dis = DisjointIntervalSequence(
... [Interval("chr1", "+", 100, 200, "hg38"),
... Interval("chr1", "+", 300, 400, "hg38")],
... start=50, end=150,
... )
>>> dis.lower()
[Interval("chr1", "+", 150, 200, "hg38"),
Interval("chr1", "+", 300, 350, "hg38")]

When the segment extends past the coord space (``start < 0`` or
``end > coordinate_length``), the out-of-bounds portion is **linearly
extrapolated** from the nearest outer edge of the coord intervals. The
returned intervals can therefore have negative starts or ends past the
chromosome length — there is no clipping to chromosome boundaries.

A segment on the opposite strand lowers to genomic intervals on the
opposite strand, listed in *segment* 5'→3' order (which is the reverse of
coord 5'→3' order)::

>>> dis_opp = dis.as_opposite_strand()
>>> dis_opp.lower()
[Interval("chr1", "-", 300, 350, "hg38"),
Interval("chr1", "-", 150, 200, "hg38")]

lift_interval
~~~~~~~~~~~~~

``lift_interval`` is the inverse: it takes a genomic
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
``lift_interval`` is the inverse: it takes a genomic
:py:meth:`~genome_kit.DisjointIntervalSequence.lift_interval` is the inverse: it takes a genomic

and etc below

:py:class:`~genome_kit.Interval` and returns a new DIS whose segment is
the intersection of that interval with this DIS's segment, expressed
in the DIS coordinate space. The input must share the same chromosome, reference
genome, and effective strand as the DIS.

The lift is well-defined even when the input straddles gaps between coord
intervals: positions inside a gap collapse to the boundary index of the
adjacent coord interval, and positions outside the coord space extrapolate
linearly. The result is then clipped against the DIS's segment, so a
genomic interval that overlaps coord space but falls entirely outside the
segment returns ``None``::

>>> dis = DisjointIntervalSequence(
... [Interval("chr1", "+", 100, 200, "hg38"),
... Interval("chr1", "+", 300, 400, "hg38")],
... start=0, end=200,
... )
>>> lifted = dis.lift_interval(Interval("chr1", "+", 320, 360, "hg38"))
>>> lifted.start, lifted.end
(120, 160)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add an example to demonstrate the above mentioned scenario of "The lift is well-defined even when the input straddles gaps between coord
intervals...The result is then clipped against the DIS's segment". Like dis.lift_interval(Interval("chr1", "+", 250, 450, "hg38"))? (if I'm understanding this correctly)

>>> # Falls in a coord-interval gap with no overlap of any coord interval
>>> dis.lift_interval(Interval("chr1", "+", 250, 260, "hg38")) is None
True

Extracting DNA
==============

``dna`` returns the spliced DNA sequence corresponding to the segment as
a single string in 5'→3' order. Internally, the segment is decomposed
via ``lower`` into one or more genomic intervals, each is read from the
reference, and the pieces are concatenated. This returns the spliced sequence
of the transcript: the gaps between coord intervals are dropped so that introns
(or other intervening regions) never appear in the output.

The returned string already accounts for strand: if the segment is on the
opposite strand, the segment's bases are returned, and the bases are
ordered 5'→3' along the segment, not along the genome::
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"along the segment, not along the genome" sounds a bit confusing, I think it's sufficient to say "the bases are ordered 5'→3'"


>>> dis = DisjointIntervalSequence.from_transcript(transcript)
>>> dis.dna()
'ACGTGGTTTCA' # full spliced sequence, 5'→3'

>>> opp = dis.as_opposite_strand()
>>> opp.dna() # reverse complement, 5'→3' along the opposite strand
Comment thread
SophiaPerzan-DG marked this conversation as resolved.
'TGAAACCACGT'

When the segment extends past the coord space — e.g. after ``shift`` or
``expand`` pushed the indices outside ``[0, coordinate_length)`` — those
out-of-coord positions are returned as ``N`` by default. This is in
contrast to ``expand_coord``, which extends the coord space itself so
that new positions are backed by real reference DNA. Set
``allow_outside_coord=False`` to opt out of N-padding and raise an error instead.

For convenience, :py:meth:`Genome.dna <genome_kit.Genome.dna>` accepts a
DIS directly and dispatches to ``DisjointIntervalSequence.dna``, so
either of the following is equivalent::

>>> dis.dna()
>>> genome.dna(dis)
1 change: 1 addition & 0 deletions genome_kit/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@
"Cds",
"CdsTable",
"DataManager",
"DisjointIntervalSequence",
"Exon",
"ExonTable",
"Gene",
Expand Down
Loading
Loading