Performance issue: identifyOverExpressedGenes extremely slow on Xenium data (~36k cells) #13

@emily-grantham

Package: SpatialCellChat (CellChat v3)
Data: 10x Genomics Xenium, ~36,000 cells, ~5,000 genes
Environment: 32 CPUs, 125 GB RAM, 60 GB disk


Description

I am running SpatialCellChat on a Xenium region of interest (~36,000 cells, ~5,000 genes), and identifyOverExpressedGenes takes several hours to complete, if it completes at all.

I have set up parallel execution with the future package, but it is unclear whether identifyOverExpressedGenes in SpatialCellChat v3 actually dispatches work through future. In CellChat v2, this function uses future-based parallelism, but given the new per-cell resolution infrastructure in v3 and the added MERINGUE/ALRA steps, I am not sure whether that is still the case.


Steps to reproduce

library(SpatialCellChat)
library(patchwork)
options(stringsAsFactors = FALSE)

# Normalize counts (log1p library-size normalization)
# Skip if your matrix is already normalized
counts_norm <- normalizeData(counts)

# Create SpatialCellChat object directly from matrix — no Seurat needed
chat <- createSpatialCellChat(
  object          = counts_norm,       # genes × cells normalized matrix
  meta            = meta,              # data.frame with cell annotations
  group.by        = annotation_col,   # variable holding the name of the annotation column in meta
  datatype        = "spatial",
  coordinates     = spatial.locs,
  spatial.factors = spatial.factors
)

CellChatDB <- CellChatDB.human

# Subset to protein-based signaling categories
CellChatDB.use <- subsetDB(
  CellChatDB,
  search      = c("Secreted Signaling", "ECM-Receptor", "Cell-Cell Contact"),
  non_protein = FALSE
)

# Attach database to the CellChat object
chat@DB <- CellChatDB.use

# Subset to signaling genes only (required step)
chat <- subsetData(chat)
chat <- preProcessing(chat)

# Identify over-expressed ligands / receptors.
chat <- identifyOverExpressedGenes(
  chat,
  selection.method = "moransi"  # extremely slow; the MERINGUE-based run never completed
)

What I have tried

  1. MERINGUE for spatial neighborhood detection — analysis never completed
  2. Moran's I as alternative (spatial.factors parameter) — completed once after several hours, but subsequent steps (computeCommunProb etc.) were also prohibitively slow
  3. Parallelization via future — set using plan("multisession", workers = 30). I see no evidence that the parallelization is being picked up by identifyOverExpressedGenes.
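For reference, a minimal sketch of how the future plan from item 3 can be checked independently of SpatialCellChat. It only confirms that the plan is active and that tasks run in separate R processes; it does not show whether identifyOverExpressedGenes dispatches through future. The worker count of 2 is an illustrative assumption to keep the check light (the run above used 30).

```r
library(future)

# Start background R sessions. 2 workers is an illustrative value;
# the run described above used workers = 30.
plan(multisession, workers = 2)

# Number of workers exposed by the active plan
n_workers <- nbrOfWorkers()

# Launch one trivial future per worker and collect the process IDs
# of the R sessions that evaluated them
futures <- lapply(seq_len(n_workers), function(i) future(Sys.getpid()))
worker_pids <- unique(unlist(lapply(futures, value)))

cat("workers in plan:", n_workers, "\n")
cat("main PID:", Sys.getpid(), "worker PIDs:", worker_pids, "\n")

# Shut the background sessions down again
plan(sequential)
```

If the worker PIDs differ from the main PID, the plan itself works, and any lack of speed-up would then point at the function not using future (or at a serial step dominating the runtime).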

Questions

  1. Does identifyOverExpressedGenes in SpatialCellChat v3 support future-based parallelism? If so, is there anything specific to the spatial/per-cell mode that prevents it from being picked up?
  2. Is sketchData() (available in CellChat v2 for downsampling) compatible with SpatialCellChat v3? This seems like an important workaround for large imaging-based datasets, but it's not mentioned in the v3 tutorial.
  3. For imaging-based data at this scale (tens of thousands of cells from a single Xenium ROI), what is the expected runtime on typical hardware? Is 36k cells within the intended scope of SpatialCellChat v3?
  4. Is there a recommended per-cell type downsampling strategy (e.g., via subset() in Seurat) that is compatible with the spatial coordinate requirements of v3?
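To make question 4 concrete, here is a sketch in base R of the kind of per-cell-type downsampling I have in mind. It caps the number of cells per annotation group and subsets the expression matrix, metadata, and coordinates with the same cell IDs so they stay aligned. The toy inputs, the `celltype` column name, and the `max_per_type` cap are all hypothetical; whether a thinned cell density is acceptable for the spatial neighborhood steps in v3 is exactly what I am unsure about.

```r
set.seed(1)

# Toy stand-ins for the real inputs (hypothetical sizes and labels)
n_cells <- 600
n_genes <- 20
cell_ids <- paste0("cell", seq_len(n_cells))
counts_norm <- matrix(
  runif(n_genes * n_cells), nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), cell_ids)
)
meta <- data.frame(
  celltype = sample(c("A", "B", "C"), n_cells, replace = TRUE,
                    prob = c(0.70, 0.25, 0.05)),
  row.names = cell_ids
)
spatial.locs <- data.frame(x = runif(n_cells), y = runif(n_cells),
                           row.names = cell_ids)

# Hypothetical cap on the number of cells kept per cell type
max_per_type <- 100

# Within each cell type, keep all cells if at or below the cap,
# otherwise draw a random subset of size max_per_type
keep <- unlist(lapply(split(rownames(meta), meta$celltype), function(ids) {
  if (length(ids) <= max_per_type) ids else sample(ids, max_per_type)
}), use.names = FALSE)

# Subset all three inputs with the same IDs so they stay aligned
counts_sub <- counts_norm[, keep, drop = FALSE]
meta_sub   <- meta[keep, , drop = FALSE]
locs_sub   <- spatial.locs[keep, , drop = FALSE]

print(table(meta_sub$celltype))
```

Rare cell types are kept in full, so the cap only reduces the abundant groups.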

Additional context

  • I did not encounter memory errors, suggesting RAM is not the bottleneck; the process appears to be CPU-bound
  • This appears related to the per-cell resolution design of v3 — at 36k cells, the number of cell-cell pairs that need to be evaluated is orders of magnitude larger than in a cluster-level analysis
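As a rough illustration of the scale argument in the last point, the arithmetic below compares the number of ordered cell-cell pairs with the number of ordered group-group pairs. The group count of 20 is a hypothetical value, and the cell-level figure is an upper bound: a method that only considers spatial neighbors would evaluate far fewer pairs.

```r
n_cells  <- 36000  # cells in the Xenium ROI
n_groups <- 20     # hypothetical number of annotated cell groups

# Ordered sender-receiver pairs at each resolution (upper bound for cells)
cell_pairs  <- n_cells^2
group_pairs <- n_groups^2

cat("cell-level pairs: ", format(cell_pairs, big.mark = ","), "\n")
cat("group-level pairs:", format(group_pairs, big.mark = ","), "\n")
cat("ratio:            ", format(cell_pairs / group_pairs, big.mark = ","), "\n")
```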
