chore: add lower, lift, intersect, dna, expand_coord DIS methods#229
chore: add lower, lift, intersect, dna, expand_coord DIS methods#229SophiaPerzan-DG wants to merge 12 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends DisjointIntervalSequence (DIS) with additional coordinate/interval manipulation and mapping utilities (including genomic ↔ DIS coordinate conversion and DNA extraction), updates documentation accordingly, and adds tests for the new behaviors. It also removes the prior DIS “index direction” toggle and introduces a Genome.dna(...) convenience path intended to accept a DIS directly.
Changes:
- Added DIS methods:
expand_coord,_lower_coord,lower,genomic_span,_lift_position,lift_interval,intersect, anddna; removed the global index-direction toggle. - Updated
Genome.dnabehavior to dispatch when passed aDisjointIntervalSequence. - Added/expanded unit tests and documentation for the new DIS coordinate mapping and DNA behaviors.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/test_genome.py | Adds coverage for Genome.dna() when called with an Interval vs a DIS. |
| tests/test_diseq.py | Adds extensive tests for DIS coordinate expansion, lowering/lifting, intersection, span, and DNA. |
| genome_kit/genome.py | Changes Genome.dna implementation to dispatch for DIS inputs. |
| genome_kit/diseq.py | Implements new DIS methods (expand/lower/lift/intersect/dna) and removes index-direction configuration. |
| genome_kit/init.py | Exposes DisjointIntervalSequence in __all__. |
| docs-src/diseq.rst | Documents expand_coord, coordinate mapping (lower/lift_interval), and DIS DNA behavior (including Genome.dna(dis) convenience). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This reverts commit 572bc63.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
docs-src/diseq.rst:819
- The docs claim
Genome.dnaaccepts a DIS directly (genome.dna(dis)), butGenomeDNA.__call__only acceptsInterval-typed arguments viaPyAsInterval(seesrc/py_interval.h). Unless there’s an unshown overload, this example will raiseTypeError. Either remove this claim/example or implement DIS dispatch in the DNA API.
For convenience, :py:meth:`Genome.dna <genome_kit.Genome.dna>` accepts a
DIS directly and dispatches to ``DisjointIntervalSequence.dna``, so
either of the following is equivalent::
>>> dis.dna()
>>> genome.dna(dis)
| 5' -> 3' when ``coord`` falls exactly on an internal boundary between two | ||
| adjacent coord intervals — where ``iv_upstream_coord`` is the upstream-most | ||
| coordinate boundary, and ``iv_downstream_coord`` is the downstream-most | ||
| coordinate boundary (regardless of strand). Touching intervals produce |
There was a problem hiding this comment.
I forgot that we don't normalize user input when they create a DIS with 2 adjacent intervals. Shouldn't GK normalize to a single interval internally?
@AliceDG any reason to conserve user input in this scenario?
There was a problem hiding this comment.
Adjacent here means "the next coordinate interval in the list of coordinate intervals" not that they are adjacent in genomic space. A 2-element list would be returned even if there was a gap between two coord intervals (in genomic space)
There was a problem hiding this comment.
Yeah, my concern is not so relevant to this PR but more general. I was referring to "touching" rather than "adjacent".
There was a problem hiding this comment.
Are we talking about the case when user initialize a DIS coord with e.g. (10, 20), (20, 30)? And whether on the DIS coord this need to be treated as:
- coord aligned to genomic intervals
(10, 20), (20, 30) - normalized, so coord aligned to one genomic interval
(10, 30)(which also ensures uniqueness of representation)
There was a problem hiding this comment.
Basically, yes. Currently the implementation would align the DIS to (10, 20), (20, 30), but this should be identical to (10, 30). I can't think of a good reason not to merge adjacent coord Intervals.
| ``dnstream`` bases. The segment is expanded an equal amount in the | ||
| upstream/downstream direction as the coordinate intervals. |
There was a problem hiding this comment.
What happens when the segment isn't adjacent to the coord system edge?
There was a problem hiding this comment.
it still get's expanded. So if the segment is inset 10bp on both 5' and 3' ends and then you expand the coordinate, it will remain so after the expansion. This also means if you have a 0-length interval and expand the coordinate space, the segment will no longer be 0-length after expansion.
There was a problem hiding this comment.
@AliceDG is this the desired behavior?
(this comment is not blocking)
There was a problem hiding this comment.
I don't think this is the desired behaviour, unless start, end represents the interval spanning the full-length coord.
One use case where we need to expand the coord is when we're constructing a DIS coord from an annotated transcript with e.g. inaccurate 3' annotation. In this case we'll create the DIS coord from the transcript first, and then expand it by say (0, 500). Note that I have not introduced any intervals on DIS coord at this point.
Under the proposed API, I imagine user code would look like this for the above use case:
tx = genome.transcripts["foo"]
dis_old_coord = DisjointIntervalSequence.from_transcript(tx) # important: this will default start-end to the full length
dis_new_coord = dis_old.expand_coord(0, 500) # now start-end also got expanded
# now I'm ready to define my interval: (10, 20) on the coord space
# approach 1: making use of `coord_end5, coord_end3`
dis_1 = dis_new_coord.coord_end5.expand(-10, 20)
# approach 2: alternatively, making use of `end5, end3`
dis_2= dis_new_coord.end5.expand(-10, 20)
- In the above use case, it's conceptually easier to think in terms of
coord_end5, coord_end3(and ignoreend5, end3), because we think about manipulating the coordinate when callingexpand_coord. - I don't really see a use case where:
start, endis not full-length & user needs to expandstart, endaccording to the coord expansion.
There was a problem hiding this comment.
@AliceDG How about expand_coord() then only expands start and end when the DIS has a full-length interval, and otherwise leaves the segment as-is?
Also, another thought related to the examples you gave, how would you feel about a function cut() that would let you create/set the DIS segment based on absolute positions within the DIS coord-space? So your example instead could be:
tx = genome.transcripts["foo"]
dis_old_coord = DisjointIntervalSequence.from_transcript(tx)
dis_new_coord = dis_old.expand_coord(0, 500)
dis_1 = dis_new_coord.cut(10, 20) # Equivalent to dis_new_coord.coord_end5.expand(-10, 20)
| DIS Coordinates: 0 1 2 3 4 5 6 7 | ||
| Opposite Strand | ||
|
|
||
| :: |
There was a problem hiding this comment.
| :: | |
| .. code-block:: python |
|
|
||
| ``lower`` projects the segment back to genomic space. ``lift_interval`` | ||
| projects a genomic interval into the DIS's index space and clips it | ||
| against the segment. They are conceptual inverses, but each has to deal |
There was a problem hiding this comment.
are they mechanically inverses as well? I remember the old code merged adjacent intervals. this one returns in sorted order?
| lift_interval | ||
| ~~~~~~~~~~~~~ | ||
|
|
||
| ``lift_interval`` is the inverse: it takes a genomic |
There was a problem hiding this comment.
| ``lift_interval`` is the inverse: it takes a genomic | |
| :py:meth:`~genome_kit.DisjointIntervalSequence.lift_interval` is the inverse: it takes a genomic |
and etc below
|
|
||
| Parameters | ||
| ---------- | ||
| coord : :py:class:`int` |
There was a problem hiding this comment.
| coord : :py:class:`int` | |
| coord |
sphinx can parse out the type annotations, no need to repeat (and possibly have a conflict)
| Args: | ||
| upstream: Bases to add at the coordinate space's 5' end. | ||
| dnstream: Bases to add at the coordinate space's 3' end. | ||
| Defaults to upstream (symmetric). | ||
|
|
||
| Raises: | ||
| ValueError: If either argument is negative. |
There was a problem hiding this comment.
Mixing google style docstrings with numpy ones
| iv0, ivn = coord_ivs[0], coord_ivs[-1] | ||
| chrom = iv0.chromosome | ||
| refg = iv0.reference_genome | ||
| if self.coord_strand == "+": | ||
| if len(coord_ivs) == 1: | ||
| coord_ivs = [Interval(chrom, "+", iv0.start - upstream, iv0.end + dnstream, refg)] | ||
| else: | ||
| coord_ivs[0] = Interval(chrom, "+", iv0.start - upstream, iv0.end, refg) | ||
| coord_ivs[-1] = Interval(chrom, "+", ivn.start, ivn.end + dnstream, refg) | ||
| else: | ||
| if len(coord_ivs) == 1: | ||
| coord_ivs = [Interval(chrom, "-", iv0.start - dnstream, iv0.end + upstream, refg)] | ||
| else: | ||
| coord_ivs[0] = Interval(chrom, "-", iv0.start, iv0.end + upstream, refg) | ||
| coord_ivs[-1] = Interval(chrom, "-", ivn.start - dnstream, ivn.end, refg) |
There was a problem hiding this comment.
is it possible to just use the iv0/ivn.expand function here. Less math (which is the point of gk)
| on_coordinate_strand: bool | ||
|
|
||
|
|
||
| class DisjointIntervalSequence: |
There was a problem hiding this comment.
is a DIS supposed to handle the same protocol as a regular interval? EG, you don't need to if/else on the type. If so, I think maybe there's some missing methods like spanning/midpoint/distance etc
There was a problem hiding this comment.
The current understanding is no, DIS and Interval are separate and not interchangeable, however they do have significant overlap. As of right now, methods like spanning/distance are poorly defined for a DIS (is the returned value on the genomic coord, some combined shared coord, coord-space of just 1 of the DISes?), and other methods like midpoint are lower priority.
| >>> lifted = dis.lift_interval(Interval("chr1", "+", 320, 360, "hg38")) | ||
| >>> lifted.start, lifted.end | ||
| (120, 160) | ||
|
|
There was a problem hiding this comment.
Maybe we can add an example to demonstrate the above mentioned scenario of "The lift is well-defined even when the input straddles gaps between coord
intervals...The result is then clipped against the DIS's segment". Like dis.lift_interval(Interval("chr1", "+", 250, 450, "hg38"))? (if I'm understanding this correctly)
|
|
||
| The returned string already accounts for strand: if the segment is on the | ||
| opposite strand, the segment's bases are returned, and the bases are | ||
| ordered 5'→3' along the segment, not along the genome:: |
There was a problem hiding this comment.
"along the segment, not along the genome" sounds a bit confusing, I think it's sufficient to say "the bases are ordered 5'→3'"
| if on_plus: | ||
| return [upstream.end, iv.start] # indices ordered 5' -> 3' | ||
| else: | ||
| return [upstream.start, iv.end] # indices ordered 5' -> 3' |
There was a problem hiding this comment.
I think it's misleading to say "indices ordered 5' -> 3'" since we're returning primitive integers. Without an associated strand, an integer is a direction-less object. Maybe we can make it clear by saying it's ordered 5'->3' w.r.t. the DIS coord strand?
| # On an internal boundary _lower_coord returns two flanking values; | ||
| # collapse to a single 0-length Interval corresponding to the value of | ||
| # the index that is an Interval.start (we do this because Interval.end | ||
| # is exclusive). |
There was a problem hiding this comment.
Maybe worth adding as a note/caution to the docstr? i.e. depend on the DIS coord strand, this 0-length interval / gap will be either lowered to the upstream or downstream interval boundary.
| start_positions = self._lower_coord(start) | ||
| end_positions = self._lower_coord(end) | ||
| start_pos = start_positions[-1] # Since sorted 5' -> 3', -1 is downstream | ||
| end_pos = end_positions[0] # 0 is the upstream value |
There was a problem hiding this comment.
For this chunk of code (L750 - L785), I wonder if it's more readable if we construct the full length interval Interval(chrom, coord_strand, start_pos, end_pos, refg) and call Interval.intersect between that and all the DIS intervals?
| ------- | ||
| :py:class:`~genome_kit.Interval` | ||
| """ | ||
| ivs = self.lower() |
There was a problem hiding this comment.
nit: lower() does more than finding the two outer-most boundaries (needed by this method). I think this can be made more efficient by only performing the necessary steps (i.e. first half of lower()).
There was a problem hiding this comment.
Can you explain what "necessary/needed" means for this method? I believe genomic_span() may be the function you are describing. If I understand correctly, a 'naive lower' implementation could include introns when a DIS is lowered.
| if pos > ivs[-1].end: | ||
| return coord_len + (pos - ivs[-1].end) | ||
| cumulative = 0 | ||
| for iv in ivs: |
There was a problem hiding this comment.
nit: (just some thoughts, feel free to ignore) for readability, I wonder if it's cleaner to replace this type of for-loops with bisect, since the intervals are always sorted.
There was a problem hiding this comment.
nit: (just some thoughts, feel free to ignore) on the same topic of readability, I wonder if there's a cleaner way to handle the if-else conditions on strand. One trick we used before is to negate integer indices on - strand, which essentially makes operations strand-agnostic, since all indices are naturally ordered 5' -> 3' from -inf to +inf. Of course we need to be careful with boundaries and 0-index-point.
| return cumulative | ||
| cumulative += len(iv) | ||
| assert False, "Position not found in any interval" | ||
| # minus |
There was a problem hiding this comment.
nit: better to add an else here for consistency
There was a problem hiding this comment.
I actually prefer this style of early return, I find it easier to reason about
| f"Interval reference_genome {other.reference_genome!r} does not " | ||
| f"match DIS reference_genome {self.reference_genome!r}" | ||
| ) | ||
| if other.strand != self.strand: |
There was a problem hiding this comment.
I wonder if we need to support this case. If I have a oligo defined on genomic coordinate, and I want to lift its to DIS coord, by design oligo.strand != dis.strand. And I don't think as a user I want to flip the strand, lift, then flip the strand again. What do you think?
| cumulative += len(iv) | ||
| assert False, "Position not found in any interval" | ||
|
|
||
| def lift_interval( |
There was a problem hiding this comment.
My understanding is that this method finds the maximal-compatible interval on DIS by intersecting with DIS coord. I wonder if it's better to use a different method name, especially in our old API lift_interval raises an error if it's not compatible (since it doesn't do the intersection).
also: