-
Notifications
You must be signed in to change notification settings - Fork 9
chore: add lower, lift, intersect, dna, expand_coord DIS methods #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
ac64101
4c88f50
9b53737
a305fab
572bc63
4012afc
2e32278
b5d237d
b0785e0
2922725
83c3660
02569ce
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -51,15 +51,15 @@ A DIS has two aspects: | |||||
| The following examples illustrate how the coordinate space and segment interact, | ||||||
| using both diagrams and code. | ||||||
|
|
||||||
| Consider a transcript on the + strand with the following genomic layout: | ||||||
| :: | ||||||
| Consider a transcript on the + strand with the following genomic layout:: | ||||||
|
|
||||||
| Genomic Coordinates: 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 | ||||||
| DNA Sequence: | A T G C C G C A T G C C G C | | ||||||
| | |<------->| |<------->| |<--->| |<----------->| |<--->| | | ||||||
| 5' Exon1 Intron1 Exon2 Intron2 Exon3 3' | ||||||
|
|
||||||
| Extracting only the exons yields the following disjoint intervals: | ||||||
| :: | ||||||
| Extracting only the exons yields the following disjoint intervals:: | ||||||
|
|
||||||
| Genomic Coordinates: 153 154 155 159 160 165 166 167 | ||||||
| DNA Sequence: | A T G C A G C | | ||||||
| | |<------->| |<--->| |<--->| | | ||||||
|
|
@@ -74,15 +74,15 @@ These exon intervals can be represented as :py:class:`~genome_kit.Interval` obje | |||||
| >>> exon3 = Interval("chr1", "+", 165, 167, "hg38") | ||||||
|
|
||||||
| To define a segment spanning the full exonic sequence (from the start of Exon1 to | ||||||
| the end of Exon3), the exon intervals are first converted into a DIS coordinate space | ||||||
| :: | ||||||
| the end of Exon3), the exon intervals are first converted into a DIS coordinate space:: | ||||||
|
|
||||||
| DIS Coordinates: 0 1 2 3 4 5 6 7 | ||||||
| DNA Sequence: | A T G C A G C | | ||||||
| | |<------->| |<--->| |<--->| | | ||||||
| 5' Exon1 Exon2 Exon3 3' | ||||||
|
|
||||||
| The default segment spans the entire coordinate space | ||||||
| :: | ||||||
| The default segment spans the entire coordinate space:: | ||||||
|
|
||||||
| DIS Coordinates: 0 1 2 3 4 5 6 7 | ||||||
| DNA Sequence: A T G C A G C | ||||||
| |<--------------------->| | ||||||
|
|
@@ -111,15 +111,15 @@ and an end index of 7:: | |||||
| A DIS can also represent a segment on the strand opposite the coordinate space. | ||||||
| This is useful for modeling the complementary sequence or a binding partner. | ||||||
|
|
||||||
| Starting from the coordinate space defined above | ||||||
| :: | ||||||
| Starting from the coordinate space defined above:: | ||||||
|
|
||||||
| DIS Coordinates: 0 1 2 3 4 5 6 7 | ||||||
| DNA Sequence: | A T G C A G C | | ||||||
| | |<----->| |<->| |<->| | | ||||||
| 5' Exon1 Exon2 Exon3 3' | ||||||
|
|
||||||
| The opposite strand shares the same DIS coordinate indices | ||||||
| :: | ||||||
| The opposite strand shares the same DIS coordinate indices:: | ||||||
|
|
||||||
| 5' Positive strand 3' | ||||||
| DIS Coordinates: | 0 1 2 3 4 5 6 7 | | ||||||
| DNA Sequence (+): | A T G C A G C | | ||||||
|
|
@@ -131,8 +131,8 @@ The opposite strand shares the same DIS coordinate indices | |||||
| The DIS coordinate indices are identical on both strands. To obtain the complement | ||||||
| of a given segment, the same start and end indices apply; only the | ||||||
| ``on_coordinate_strand`` flag changes. The following shows the full-length segment | ||||||
| on the opposite strand | ||||||
| :: | ||||||
| on the opposite strand:: | ||||||
|
|
||||||
| 5' Coordinate Strand 3' | ||||||
| DIS Coordinates: | 0 1 2 3 4 5 6 7 | | ||||||
| DNA Sequence (+): | A T G C A G C | | ||||||
|
|
@@ -171,23 +171,24 @@ segments, since the start and end indices alone do not encode strand information | |||||
| >>> dis = DisjointIntervalSequence.from_intervals([iv]) | ||||||
| >>> assert dis.end5.start < dis.end3.start | ||||||
|
|
||||||
| Consider a transcript on the negative strand: | ||||||
| :: | ||||||
| Consider a transcript on the negative strand:: | ||||||
|
|
||||||
| 3' Exon3 Intron2 Exon2 Intron1 Exon1 5' | ||||||
| | |<------->| |<------->| |<--->| |<----------->| |<--->| | | ||||||
| DNA Sequence (-): | G T C A G T C A G T C A G T | | ||||||
| Genomic Coordinates: 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 | ||||||
| Negative Strand | ||||||
| Extracting only the exons: | ||||||
| :: | ||||||
|
|
||||||
| Extracting only the exons:: | ||||||
|
|
||||||
| 3' Exon3 Exon2 Exon1 5' | ||||||
| | |<------->| |<--->| |<--->| | | ||||||
| DNA Sequence (-): | G T C C A G T | | ||||||
| Genomic Coordinates: 153 154 155 159 160 165 166 167 | ||||||
| Negative Strand | ||||||
|
|
||||||
| Converting these exons into a DIS coordinate space: | ||||||
| :: | ||||||
| Converting these exons into a DIS coordinate space:: | ||||||
|
|
||||||
| DIS Coordinates: 0 1 2 3 4 5 6 7 | ||||||
| DNA Sequence: | T G A C C T G | | ||||||
| | |<->| |<->| |<----->| | | ||||||
|
|
@@ -210,8 +211,8 @@ largest index to the 3' end:: | |||||
| >>> dis_neg.coordinate_length | ||||||
| 7 | ||||||
|
|
||||||
| A full-length segment on the coordinate strand | ||||||
| :: | ||||||
| A full-length segment on the coordinate strand:: | ||||||
|
|
||||||
| DIS Coordinates: 0 1 2 3 4 5 6 7 | ||||||
| DNA Sequence: T G A C C T G | ||||||
| |<--------------------->| | ||||||
|
|
@@ -231,8 +232,8 @@ objects, strand is expressed only as "same strand" or "opposite strand":: | |||||
| >>> dis_neg.on_coordinate_strand | ||||||
| True | ||||||
|
|
||||||
| The same coordinate space with an opposite-strand segment | ||||||
| :: | ||||||
| The same coordinate space with an opposite-strand segment:: | ||||||
|
|
||||||
| DIS Coordinates: 0 1 2 3 4 5 6 7 | ||||||
| DNA Sequence (-): T G A C C T G | ||||||
| ----------------------------------------------------- | ||||||
|
|
@@ -408,6 +409,7 @@ Strand Methods | |||||
| A DIS segment can sit on either 'virtual' strand independently of the coordinate | ||||||
| intervals. The ``on_coordinate_strand`` property indicates whether the | ||||||
| segment is on the same strand as the coordinate intervals:: | ||||||
|
|
||||||
| On Coordinate Strand: True | ||||||
| Start Index: 1 | ||||||
| End Index: 6 | ||||||
|
|
@@ -419,6 +421,8 @@ segment is on the same strand as the coordinate intervals:: | |||||
| DIS Coordinates: 0 1 2 3 4 5 6 7 | ||||||
| Opposite Strand | ||||||
|
|
||||||
| :: | ||||||
|
|
||||||
| >>> dis.on_coordinate_strand | ||||||
| True | ||||||
| >>> dis.is_same_strand() | ||||||
|
|
@@ -549,6 +553,56 @@ Negative values contract the segment:: | |||||
| Contracting to exactly zero length is valid, but contracting past | ||||||
| zero raises ``ValueError``. | ||||||
|
|
||||||
| expand_coord | ||||||
| ~~~~~~~~~~~~ | ||||||
|
|
||||||
| ``expand`` and ``expand_coord`` both grow the segment, but they do so at | ||||||
| different layers. ``expand`` stretches the segment *within* the existing | ||||||
| coordinate space; the coord intervals are unchanged. | ||||||
| ``expand_coord`` instead extends the coordinate | ||||||
| space itself by lengthening the outermost coord intervals, and grows the | ||||||
| segment by the same amount so it tracks the new edges. | ||||||
|
|
||||||
| This distinction matters when you want flanking genomic context to become | ||||||
| part of the indexed coordinate space rather than just a virtual extension, | ||||||
| such as when getting the DNA of the DIS segment (off-coordinate bases are | ||||||
| N-padded). | ||||||
|
|
||||||
| For example, when modeling a transcript plus its surrounding promoter and | ||||||
| poly-A regions, ``expand_coord(50)`` adds 50 genomic bases to each end | ||||||
| of the coordinate space, and those bases are now real, indexable, DNA- | ||||||
| backed positions rather than N-padding:: | ||||||
|
|
||||||
| >>> dis.coord_strand | ||||||
| '+' | ||||||
| >>> dis.coordinate_intervals | ||||||
| (Interval("chr1", "+", 1000, 1100, "hg38"), | ||||||
| Interval("chr1", "+", 1200, 1300, "hg38"), | ||||||
| Interval("chr1", "+", 1500, 1600, "hg38")) | ||||||
| >>> dis.coordinate_length | ||||||
| 300 | ||||||
| >>> dis.start, dis.end | ||||||
| (0, 300) | ||||||
|
|
||||||
| >>> expanded = dis.expand_coord(50) | ||||||
| >>> expanded.coordinate_intervals | ||||||
| (Interval("chr1", "+", 950, 1100, "hg38"), | ||||||
| Interval("chr1", "+", 1200, 1300, "hg38"), | ||||||
| Interval("chr1", "+", 1500, 1650, "hg38")) | ||||||
| >>> expanded.coordinate_length | ||||||
| 400 | ||||||
| >>> expanded.start, expanded.end | ||||||
| (0, 400) | ||||||
|
|
||||||
| Only the outer 5' edge of the first coord interval and the outer 3' edge | ||||||
| of the last coord interval are extended; gaps between coord intervals are | ||||||
| preserved. On the negative strand, ``upstream`` and ``dnstream`` still | ||||||
| refer to the transcript's 5' and 3' ends, so the underlying genomic | ||||||
| adjustments mirror those on the plus strand. | ||||||
|
|
||||||
| Unlike ``expand``, ``expand_coord`` does not accept negative values — | ||||||
| shrinking the coordinate space is not supported. | ||||||
|
|
||||||
| Positional Comparisons | ||||||
| ====================== | ||||||
|
|
||||||
|
|
@@ -656,3 +710,119 @@ space and strand apply:: | |||||
| >>> z = DisjointIntervalSequence(coord_ivs, start=50, end=50) | ||||||
| >>> z.within(a) | ||||||
| True | ||||||
|
|
||||||
| Mapping Between Genomic and DIS Coordinates | ||||||
| =========================================== | ||||||
|
|
||||||
| A DIS lives in two coordinate systems at once: the genomic coordinates of | ||||||
| its underlying intervals, and the spliced DIS index space. ``lower`` and | ||||||
| ``lift_interval`` are the two directions of travel between them. | ||||||
|
|
||||||
| ``lower`` projects the segment back to genomic space. ``lift_interval`` | ||||||
| projects a genomic interval into the DIS's index space and clips it | ||||||
| against the segment. They are conceptual inverses, but each has to deal | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are they mechanically inverses as well? I remember the old code merged adjacent intervals. this one returns in sorted order? |
||||||
| with a structural mismatch — gaps in genomic space have no representation | ||||||
| in DIS space, and contiguous DIS indices may correspond to two | ||||||
| non-adjacent genomic regions — so neither is a clean bijection. | ||||||
|
|
||||||
| lower | ||||||
| ~~~~~ | ||||||
|
|
||||||
| Because a segment can straddle one or more boundaries between coord | ||||||
| intervals, ``lower`` returns a *list* of genomic | ||||||
| :py:class:`~genome_kit.Interval` objects rather than a single one. The | ||||||
| list is in 5'→3' order with respect to the segment, regardless of the | ||||||
| underlying coord strand or whether the segment is on the coord strand. | ||||||
|
|
||||||
| A segment that fits inside a single coord interval lowers to a one-element | ||||||
| list. A segment that crosses N coord-interval boundaries lowers to N+1 | ||||||
| intervals — the gaps between coord intervals are skipped:: | ||||||
|
|
||||||
| >>> dis = DisjointIntervalSequence( | ||||||
| ... [Interval("chr1", "+", 100, 200, "hg38"), | ||||||
| ... Interval("chr1", "+", 300, 400, "hg38")], | ||||||
| ... start=50, end=150, | ||||||
| ... ) | ||||||
| >>> dis.lower() | ||||||
| [Interval("chr1", "+", 150, 200, "hg38"), | ||||||
| Interval("chr1", "+", 300, 350, "hg38")] | ||||||
|
|
||||||
| When the segment extends past the coord space (``start < 0`` or | ||||||
| ``end > coordinate_length``), the out-of-bounds portion is **linearly | ||||||
| extrapolated** from the nearest outer edge of the coord intervals. The | ||||||
| returned intervals can therefore have negative starts or ends past the | ||||||
| chromosome length — there is no clipping to chromosome boundaries. | ||||||
|
|
||||||
| A segment on the opposite strand lowers to genomic intervals on the | ||||||
| opposite strand, listed in *segment* 5'→3' order (which is the reverse of | ||||||
| coord 5'→3' order):: | ||||||
|
|
||||||
| >>> dis_opp = dis.as_opposite_strand() | ||||||
| >>> dis_opp.lower() | ||||||
| [Interval("chr1", "-", 300, 350, "hg38"), | ||||||
| Interval("chr1", "-", 150, 200, "hg38")] | ||||||
|
|
||||||
| lift_interval | ||||||
| ~~~~~~~~~~~~~ | ||||||
|
|
||||||
| ``lift_interval`` is the inverse: it takes a genomic | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
and etc below |
||||||
| :py:class:`~genome_kit.Interval` and returns a new DIS whose segment is | ||||||
| the intersection of that interval with this DIS's segment, expressed | ||||||
| in the DIS coordinate space. The input must share the same chromosome, reference | ||||||
| genome, and effective strand as the DIS. | ||||||
|
|
||||||
| The lift is well-defined even when the input straddles gaps between coord | ||||||
| intervals: positions inside a gap collapse to the boundary index of the | ||||||
| adjacent coord interval, and positions outside the coord space extrapolate | ||||||
| linearly. The result is then clipped against the DIS's segment, so a | ||||||
| genomic interval that overlaps coord space but falls entirely outside the | ||||||
| segment returns ``None``:: | ||||||
|
|
||||||
| >>> dis = DisjointIntervalSequence( | ||||||
| ... [Interval("chr1", "+", 100, 200, "hg38"), | ||||||
| ... Interval("chr1", "+", 300, 400, "hg38")], | ||||||
| ... start=0, end=200, | ||||||
| ... ) | ||||||
| >>> lifted = dis.lift_interval(Interval("chr1", "+", 320, 360, "hg38")) | ||||||
| >>> lifted.start, lifted.end | ||||||
| (120, 160) | ||||||
|
|
||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we can add an example to demonstrate the above mentioned scenario of "The lift is well-defined even when the input straddles gaps between coord |
||||||
| >>> # Falls in a coord-interval gap with no overlap of any coord interval | ||||||
| >>> dis.lift_interval(Interval("chr1", "+", 250, 260, "hg38")) is None | ||||||
| True | ||||||
|
|
||||||
| Extracting DNA | ||||||
| ============== | ||||||
|
|
||||||
| ``dna`` returns the spliced DNA sequence corresponding to the segment as | ||||||
| a single string in 5'→3' order. Internally, the segment is decomposed | ||||||
| via ``lower`` into one or more genomic intervals, each is read from the | ||||||
| reference, and the pieces are concatenated. This returns the spliced sequence | ||||||
| of the transcript: the gaps between coord intervals are dropped so that introns | ||||||
| (or other intervening regions) never appear in the output. | ||||||
|
|
||||||
| The returned string already accounts for strand: if the segment is on the | ||||||
| opposite strand, the segment's bases are returned, and the bases are | ||||||
| ordered 5'→3' along the segment, not along the genome:: | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "along the segment, not along the genome" sounds a bit confusing, I think it's sufficient to say "the bases are ordered 5'→3'" |
||||||
|
|
||||||
| >>> dis = DisjointIntervalSequence.from_transcript(transcript) | ||||||
| >>> dis.dna() | ||||||
| 'ACGTGGTTTCA' # full spliced sequence, 5'→3' | ||||||
|
|
||||||
| >>> opp = dis.as_opposite_strand() | ||||||
| >>> opp.dna() # reverse complement, 5'→3' along the opposite strand | ||||||
|
SophiaPerzan-DG marked this conversation as resolved.
|
||||||
| 'TGAAACCACGT' | ||||||
|
|
||||||
| When the segment extends past the coord space — e.g. after ``shift`` or | ||||||
| ``expand`` pushed the indices outside ``[0, coordinate_length)`` — those | ||||||
| out-of-coord positions are returned as ``N`` by default. This is in | ||||||
| contrast to ``expand_coord``, which extends the coord space itself so | ||||||
| that new positions are backed by real reference DNA. Set | ||||||
| ``allow_outside_coord=False`` to opt out of N-padding and raise an error instead. | ||||||
|
|
||||||
| For convenience, :py:meth:`Genome.dna <genome_kit.Genome.dna>` accepts a | ||||||
| DIS directly and dispatches to ``DisjointIntervalSequence.dna``, so | ||||||
| either of the following is equivalent:: | ||||||
|
|
||||||
| >>> dis.dna() | ||||||
| >>> genome.dna(dis) | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -71,6 +71,7 @@ | |
| "Cds", | ||
| "CdsTable", | ||
| "DataManager", | ||
| "DisjointIntervalSequence", | ||
| "Exon", | ||
| "ExonTable", | ||
| "Gene", | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.