Skip to content

kcleal/superintervalsr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SuperIntervals

The R Bioconductor package superintervalsr provides fast interval intersection queries, using a novel superset-lookup approach with batch operations in mind.

Why SuperIntervals?

  • Faster interval intersection queries
  • Low memory overhead
  • Bioconductor friendly: GRanges → IntervalMap → query → back to GRanges

Installation

remotes::install_github("kcleal/superintervalsr")

Core Workflow

1. Create IntervalMap from Your Data

# From vectors (recommended - no build() needed)
imap <- IntervalMap.from_vectors(starts, ends, values)

# From GRanges (Bioconductor integration)
library(GenomicRanges)
gr <- GRanges("chr1", IRanges(starts, ends), gene_id = values)
imap <- IntervalMap(gr, value_column = "gene_id", seqname = "chr1")

# From data.frame
df <- data.frame(start = starts, end = ends, gene = values)
imap <- IntervalMap(df, value_column = "gene")

2. Batch Query Operations (The Fast Way)

# Multiple query regions
query_starts <- c(2000, 7000, 12000, 50000)
query_ends <- c(4000, 9000, 14000, 55000)

# Batch counting - orders of magnitude faster
counts <- count.batch(imap, query_starts, query_ends)
#> [1] 3 2 1 4

# Batch index search - get all overlapping interval indices  
indices_list <- search_idxs.batch(imap, query_starts, query_ends)
#> [[1]]
#> [1] 12 15 23
#> [[2]] 
#> [1] 45 67
#> ... etc

# Use indices to get back to original data
for (i in seq_along(indices_list)) {
  if (length(indices_list[[i]]) > 0) {
    overlapping_features <- values[indices_list[[i]]]
    cat("Query", i, "overlaps:", paste(overlapping_features, collapse = ", "), "\n")
  }
}

3. Single Query Operations (When You Need Just One)

# For occasional single queries
single_count <- count(imap, 4000, 6000)
single_indices <- search_idxs(imap, 4000, 6000)
single_values <- search_values(imap, 4000, 6000)

# Get complete overlap information
overlaps <- search_items(imap, 4000, 6000)
# Returns: list(start = c(...), end = c(...), value = list(...))

Real-World Genomics Example

library(superintervalsr)
library(GenomicRanges)

# Load your gene annotations
genes <- GRanges("chr1", IRanges(c(1000, 5000, 10000), c(4000, 8000, 15000)),
                 gene_id = c("ENSG001", "ENSG002", "ENSG003"),
                 gene_name = c("GENE_A", "GENE_B", "GENE_C"))

# Convert to IntervalMap for fast queries
gene_imap <- IntervalMap(genes, value_column = "gene_id", seqname = "chr1")

# You have thousands of ChIP-seq peaks to annotate
n_peaks <- 5000
peak_starts <- sort(sample(1:20000, n_peaks))
peak_ends <- peak_starts + sample(200:1000, n_peaks, replace = TRUE)

# Batch annotate all peaks (lightning fast!)
peak_gene_counts <- count.batch(gene_imap, peak_starts, peak_ends)
peak_gene_indices <- search_idxs.batch(gene_imap, peak_starts, peak_ends)

# Analyze results
peaks_with_genes <- sum(peak_gene_counts > 0)
cat("Peaks overlapping genes:", peaks_with_genes, "out of", n_peaks, "\n")

# Get detailed annotations for peaks with genes
for (i in which(peak_gene_counts > 0)[1:5]) {  # Show first 5
  overlapping_genes <- genes[peak_gene_indices[[i]]]
  cat("Peak", i, "overlaps:", 
      paste(mcols(overlapping_genes)$gene_name, collapse = ", "), "\n")
}

Performance Comparison

Rscript ./benchmark.R

>>> 4. SUMMARY
----------
Performance vs IRanges (NCLS algorithm):
- Overlap counting: 48.9x faster
- Overlap finding:  10.6x faster

Complete API Reference

Core Data Structure

# Create IntervalMap
IntervalMap()                                     # Empty map
IntervalMap.from_vectors(starts, ends, values)    # From vectors (recommended)
IntervalMap(granges, value_column, seqname)       # From GRanges
IntervalMap(dataframe, start_column, end_column)  # From data.frame

Batch Operations (Use These!)

count.batch(imap, starts, ends)        # Count overlaps for multiple queries
search_idxs.batch(imap, starts, ends)  # Get indices for multiple queries

Single Query Operations

has_overlaps(imap, start, end)    # TRUE/FALSE for any overlap
count(imap, start, end)           # Count overlapping intervals
search_idxs(imap, start, end)     # Get indices of overlapping intervals
search_values(imap, start, end)   # Get values of overlapping intervals
search_keys(imap, start, end)     # Get coordinates of overlapping intervals
search_items(imap, start, end)    # Get complete overlap information (coords + values)
coverage(imap, start, end)        # Coverage statistics

Bioconductor Integration

as_granges(search_items_result, seqname, valuename)  # Convert back to GRanges

Utilities

length(imap)           # Number of intervals
at(imap, index)        # Get interval at specific index
clear(imap)            # Remove all intervals
reserve(imap, n)       # Pre-allocate space for n intervals

Advanced: Manual Interval Addition

# For dynamic construction (less common)
imap <- IntervalMap()
add(imap, 1000, 5000, "interval1")
add(imap, 3000, 8000, "interval2")
build(imap)  # Required before queries when using add()

# Batch operations work the same way
counts <- count.batch(imap, query_starts, query_ends)

Notes

  • End-inclusive intervals: Both start and end positions are included
  • Automatic sorting: Intervals are maintained in sorted order
  • SIMD optimization: Automatic vectorization on modern CPUs
  • Memory efficient: One size_t per interval overhead
  • Thread safe: Read operations can be parallelized
  • Cache friendly: Optimized memory layout for fast iteration

About

Fast interval intersection queries

Topics

Resources

License

Stars

Watchers

Forks

Contributors