Merged
42 commits
f044b14
delete old folders
OlivierCoen Mar 19, 2026
7a739ab
updated all environments (docker + apptainer + conda) of modules usin…
OlivierCoen Mar 19, 2026
f02b605
insist on the experimental state of fetching geo datasets in README.md
OlivierCoen Mar 20, 2026
2adc3b4
sort dataframe for output consistency
OlivierCoen Mar 20, 2026
817f09c
sort list of files given to multiqc
OlivierCoen Mar 20, 2026
7d15085
add wget to geo getdata environment
OlivierCoen Mar 20, 2026
9e67438
fix issue not caught in geo get_data
OlivierCoen Mar 20, 2026
a58b2ef
fix bug introduced in detect_rare_genes.py
OlivierCoen Mar 21, 2026
1ade9db
fix bugs in download_geo_data.R
OlivierCoen Mar 22, 2026
b552b90
prevent modules from publishing failure and warning reason files
OlivierCoen Mar 22, 2026
510992d
update tests
OlivierCoen Mar 22, 2026
cbc28b4
pass linter
OlivierCoen Mar 22, 2026
e9fd713
Merge branch 'nf-core:dev' into dev
OlivierCoen Mar 22, 2026
f3c557d
better handle connection issues in expression atlas getdata
OlivierCoen Mar 22, 2026
89fdc04
update broken links in documentation
OlivierCoen Mar 22, 2026
1564c47
update snapshots
OlivierCoen Mar 23, 2026
acbc1b1
update min version of Nextflow
OlivierCoen Mar 23, 2026
68898d7
add contributors
OlivierCoen Mar 23, 2026
32d543d
update min Nextflow version to 25.10.4
OlivierCoen Mar 23, 2026
18d41b8
act possibility to run nf-test through act
OlivierCoen Mar 24, 2026
35a1081
add pipeline test command to README.md
OlivierCoen Mar 25, 2026
960571d
Update default.nf.test.snap
OlivierCoen Mar 25, 2026
5cfd8bc
add act tests
OlivierCoen Mar 29, 2026
b989a35
fix tests
OlivierCoen Mar 29, 2026
95dd20e
authorize failure and ignore error for expression get data and geo de…
OlivierCoen Mar 30, 2026
d94cb49
fix tests
OlivierCoen Apr 2, 2026
7f81d56
improve testing with act and add README.md
OlivierCoen Apr 2, 2026
68627a8
update checkCounts function
OlivierCoen Apr 2, 2026
237ec2c
fix weird insertions in snapshots
OlivierCoen Apr 2, 2026
b79fde2
add skewness.csv to .nftignore
OlivierCoen Apr 2, 2026
12d33f8
update snapshots
OlivierCoen Apr 3, 2026
f7c4a50
fix reproducibility issue
OlivierCoen Apr 4, 2026
abcda5e
fix skewness issue
OlivierCoen Apr 4, 2026
d876bbd
fix issue with act
OlivierCoen Apr 4, 2026
25e7791
remove a pipeline test
OlivierCoen Apr 4, 2026
5919420
Update nf-test.yml
OlivierCoen Apr 4, 2026
772ca82
update snapshots
OlivierCoen Apr 4, 2026
3a3c759
fix prettier error
OlivierCoen Apr 5, 2026
a946490
update multiqc
OlivierCoen Apr 5, 2026
21489b2
adapt pipeline to new multiqc input syntax
OlivierCoen Apr 5, 2026
d315b9e
remove tests involving geo get data
OlivierCoen Apr 6, 2026
0f2b6af
improve testing with act
OlivierCoen Apr 7, 2026
2 changes: 1 addition & 1 deletion .github/actions/nf-test/action.yml
@@ -68,7 +68,7 @@ runs:
--changed-since HEAD^ \
--verbose \
--tap=test.tap \
--shard ${{ inputs.shard }}/${{ inputs.total_shards }}
--shard ${{ inputs.shard }}/${{ inputs.total_shards }} --debug

# Save the absolute path of the test.tap file to the output
echo "tap_file_path=$(realpath test.tap)" >> $GITHUB_OUTPUT
1 change: 1 addition & 0 deletions .prettierignore
@@ -16,3 +16,4 @@ modules/nf-core/
subworkflows/nf-core/
galaxy/
docs/
tests/act
18 changes: 14 additions & 4 deletions README.md
@@ -50,15 +50,15 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- Get NCBI [GEO](https://www.ncbi.nlm.nih.gov/gds) **microarray** dataset accessions corresponding to the provided species (and optionally keywords)
This is optional and **NOT** run by default. Set `--fetch_geo_accessions` to run it.

#### 2. Download data (see [usage](conf/usage.md#3-provide-your-own-accessions))
#### 2. Download data (see [usage](./conf/usage.md#3-provide-your-own-accessions))

- Download [Expression Atlas](https://www.ebi.ac.uk/gxa/home) data if any
- Download NCBI [GEO](https://www.ncbi.nlm.nih.gov/gds) data if any

> [!NOTE]
> At this point, datasets downloaded from public databases are merged with datasets provided by the user using the `--datasets` parameter. See [usage](conf/usage.md#4-use-your-own-expression-datasets) for more information about local datasets.
> At this point, datasets downloaded from public databases are merged with datasets provided by the user using the `--datasets` parameter. See [usage](./conf/usage.md#4-use-your-own-expression-datasets) for more information about local datasets.

#### 3. ID Mapping (see [usage](conf/usage.md#5-custom-gene-id-mapping--metadata))
#### 3. ID Mapping (see [usage](./conf/usage.md#5-custom-gene-id-mapping--metadata))

- Gene IDs are cleaned
- Map gene IDS to NCBI Entrez Gene IDS (or Ensembl IDs) for standardisation among datasets using [g:Profiler](https://biit.cs.ut.ee/gprofiler/gost) (run by default; optional)
@@ -99,6 +99,14 @@ Base statistics are computed for each gene, platform-wide and for each platform
- Make [`MultiQC`](http://multiqc.info/) report
- Prepare [Dash Plotly](https://dash.plotly.com/) app for further investigation of gene / sample counts

## Test pipeline

You can test the execution of the pipeline locally with:

```bash
nextflow run nf-core/stableexpression -profile test,<docker/apptainer/conda/micromamba/...>
```

## Basic usage

> [!NOTE]
@@ -125,7 +133,7 @@ please refer to the [usage documentation](https://nf-co.re/stableexpression/usag

## Resource allocation

For setting pipeline CPU / memory usage, see [here](docs/configuration.md).
For setting pipeline CPU / memory usage, see [here](./docs/configuration.md).

## Profiles

@@ -150,6 +158,8 @@ nf-core/stableexpression was originally written by Olivier Coen.
We thank the following people for their assistance in the development of this pipeline:

- Rémy Costa
- Shaheen Acheche
- Janine Soares

## Contributions and Support

2 changes: 1 addition & 1 deletion assets/multiqc_config.yml
@@ -16,7 +16,7 @@ run_modules:

disable_version_detection: true

max_table_rows: 5000
max_table_rows: 100000

table_cond_formatting_colours:
- first: "#ffd700"
4 changes: 4 additions & 0 deletions bin/aggregate_results.py
@@ -290,8 +290,10 @@ def search_target_genes(df: pl.DataFrame, target_genes: list[str]) -> list[dict]
)
unique_gene_ids |= set(original_gene_ids)

# putting all unique gene IDs, gene names and original gene IDs into single list
all_unique_gene_ids = [gene for gene in unique_gene_ids if gene is not None]

# formatting all gene IDs found
formated_gene_ids_df = pl.DataFrame({"gene": all_unique_gene_ids}).with_columns(
pl.col("gene")
.map_batches(
@@ -301,6 +303,7 @@ def search_target_genes(df: pl.DataFrame, target_genes: list[str]) -> list[dict]
.alias("formatted_gene")
)

# formatting target genes
formated_target_genes_df = pl.DataFrame({"target_gene": target_genes}).with_columns(
pl.col("target_gene")
.map_batches(
@@ -315,6 +318,7 @@ def search_target_genes(df: pl.DataFrame, target_genes: list[str]) -> list[dict]
formated_target_genes_df, on="formatted_gene", how="inner"
)
.select(["target_gene", "gene"])
.sort("target_gene")
.to_dicts()
)
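The added `.sort("target_gene")` gives the list returned by `search_target_genes` a stable order. Below is a minimal plain-Python sketch of the same match-then-sort idea, without Polars. `normalise` and `match_target_genes` are hypothetical names, the normalisation rule is an assumption (the real formatting function is not visible in this hunk), and the extra tie-break on `gene` is an addition of the sketch, not something the diff does.

```python
from typing import Iterable, Optional


def normalise(gene: str) -> str:
    # Hypothetical normalisation: the pipeline's real formatting rule is not shown in the diff.
    return gene.strip().lower()


def match_target_genes(gene_ids: Iterable[Optional[str]], target_genes: list[str]) -> list[dict]:
    """Pair each target gene with the known gene IDs that share its normalised form."""
    by_key: dict[str, list[str]] = {}
    for gene in gene_ids:
        if gene is not None:  # mirrors the `if gene is not None` filter in the diff
            by_key.setdefault(normalise(gene), []).append(gene)

    matches = [
        {"target_gene": target, "gene": gene}
        for target in target_genes
        for gene in by_key.get(normalise(target), [])
    ]
    # Sorting removes any dependence on set iteration order.
    # The tie-break on "gene" is an extra of this sketch.
    return sorted(matches, key=lambda m: (m["target_gene"], m["gene"]))


matches = match_target_genes({"ENSG01", "Actb", None, "GAPDH"}, ["gapdh", "ACTB", "tp53"])
print(matches)
```

Because the gene IDs are collected in a `set`, their iteration order can change between runs; sorting the final records is what keeps the output comparable across runs.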

7 changes: 6 additions & 1 deletion bin/compute_dataset_statistics.py
@@ -14,6 +14,7 @@
logger = logging.getLogger(__name__)

KEY_TO_OUTFILE = {"skewness": "skewness.txt"}
FLOAT_PRECISION = 6


#####################################################
@@ -39,6 +40,10 @@ def compute_dataset_statistics(df: pl.DataFrame) -> dict:
return dict(skewness=list(skewness))


def format_value(value: float) -> str:
return f"{value:.{FLOAT_PRECISION}f}" if value != 0 else "0"


def export_count_data(stats: dict):
"""
Export dataset statistics to CSV files.
@@ -47,7 +52,7 @@ def export_count_data(stats: dict):
for key, outfile_name in KEY_TO_OUTFILE.items():
logger.info(f"Exporting dataset statistics {key} to: {outfile_name}")
with open(outfile_name, "w") as outfile:
outfile.write(",".join([str(val) for val in stats[key]]))
outfile.write(",".join([format_value(val) for val in stats[key]]))


#####################################################
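To illustrate why `format_value` replaces `str(val)`, here is a small standalone sketch. It reuses the function and constant from the diff; the sample values are made up for illustration. Fixed-precision formatting hides differences in the last few digits of a float, which presumably is what made the skewness output hard to snapshot.

```python
FLOAT_PRECISION = 6  # same constant as in the diff


def format_value(value: float) -> str:
    """Render a float with a fixed number of decimals; an exact zero becomes "0"."""
    return f"{value:.{FLOAT_PRECISION}f}" if value != 0 else "0"


# Illustrative values only (not taken from the pipeline).
raw = [0.1 + 0.2, 0.0, -1.23456789, 1e-9]

unformatted = ",".join(str(v) for v in raw)        # full repr, e.g. 0.30000000000000004
formatted = ",".join(format_value(v) for v in raw)  # rounded to FLOAT_PRECISION decimals

print(unformatted)
print(formatted)
```

Note that only an exact zero is written as `0`; a tiny non-zero value such as `1e-9` is still written as `0.000000`.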
8 changes: 5 additions & 3 deletions bin/compute_gene_statistics.py
@@ -130,9 +130,11 @@ def compute_ratios_null_values(
# the samples showing a low gene count will not be taken into account for the zero count penalty
nb_nulls = df.select(pl.exclude(config.GENE_ID_COLNAME).is_null()).sum_horizontal()

if valid_samples:
found_valid_samples = [sample for sample in valid_samples if sample in df.columns]

if found_valid_samples:
nb_nulls_valid_samples = df.select(
pl.col(valid_samples).is_null()
pl.col(found_valid_samples).is_null()
).sum_horizontal()
else:
nb_nulls_valid_samples = nb_nulls
@@ -143,7 +145,7 @@ def compute_ratios_null_values(
(nb_nulls / nb_samples).alias(
get_colname(config.RATIO_NULLS_COLNAME, platform)
),
(nb_nulls_valid_samples / len(valid_samples)).alias(
(nb_nulls_valid_samples / len(found_valid_samples)).alias(
get_colname(config.RATIO_NULLS_VALID_SAMPLES_COLNAME, platform)
),
)
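The change above restricts `valid_samples` to the columns actually present in the dataframe before counting nulls. A rough plain-Python sketch of that idea follows, with rows as dictionaries instead of a Polars dataframe. `null_ratios` is a hypothetical helper, and the fallback denominator used when no valid sample is found is an assumption of the sketch: the diff itself still divides by `len(found_valid_samples)` in that case.

```python
from typing import Optional


def null_ratios(
    rows: list[dict[str, Optional[float]]],
    sample_columns: list[str],
    valid_samples: list[str],
) -> list[float]:
    """Per-gene ratio of missing values, computed over the valid samples that exist."""
    # Keep only the valid samples that are really columns of this dataset.
    found_valid_samples = [s for s in valid_samples if s in sample_columns]

    # Assumption (not in the diff): with no match, fall back to every sample column
    # for both the null count and the denominator, which avoids dividing by zero.
    # sample_columns itself is assumed to be non-empty.
    selected = found_valid_samples or sample_columns

    return [sum(row[s] is None for s in selected) / len(selected) for row in rows]


rows = [
    {"s1": 1.0, "s2": None, "s3": None},
    {"s1": None, "s2": 2.0, "s3": 4.0},
]
columns = ["s1", "s2", "s3"]

partial = null_ratios(rows, columns, ["s1", "s2", "s9"])  # "s9" is silently ignored
fallback = null_ratios(rows, columns, ["s9"])             # nothing found: use all columns
print(partial, fallback)
```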
9 changes: 6 additions & 3 deletions bin/detect_rare_genes.py
@@ -106,10 +106,13 @@ def main():
.unique()
)

# sorting (for output consistency)
df = df.sort(["total_occurrences_quantile", "gene_id"], descending=[True, False])

# writing total occurrences in a csv before filtering
df.select([config.GENE_ID_COLNAME, "total_occurrences_quantile"]).sort(
"total_occurrences_quantile", descending=True
).write_csv(TOTAL_OCCURRENCES_OUTFILE)
df.select([config.GENE_ID_COLNAME, "total_occurrences_quantile"]).write_csv(
TOTAL_OCCURRENCES_OUTFILE
)

# filtering genes
valid_gene_ids = (
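The new sort orders rows by quantile (descending) and breaks ties on the gene ID (ascending), so rows with equal quantiles no longer depend on whatever order the preceding `.unique()` produced. A minimal plain-Python sketch of the same two-key ordering, with made-up rows and the literal column name `gene_id` assumed for illustration:

```python
rows = [
    {"gene_id": "g3", "total_occurrences_quantile": 0.5},
    {"gene_id": "g1", "total_occurrences_quantile": 0.9},
    {"gene_id": "g2", "total_occurrences_quantile": 0.5},
]


def sort_rows(rows: list[dict]) -> list[dict]:
    # Negating the quantile gives descending order on the first key,
    # while gene_id stays ascending as the tie-breaker.
    return sorted(rows, key=lambda r: (-r["total_occurrences_quantile"], r["gene_id"]))


ordered_ids = [r["gene_id"] for r in sort_rows(rows)]
print(ordered_ids)  # ['g1', 'g2', 'g3']
```

With the single-key sort used before, `g2` and `g3` could appear in either order, which is enough to make a snapshot test flaky.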
4 changes: 4 additions & 0 deletions bin/download_eatlas_data.R
@@ -66,6 +66,10 @@ download_expression_atlas_data_with_retries <- function(accession, max_retries =
warning(w$message)
write("EXPERIMENT NOT FOUND", file = FAILURE_REASON_FILE)
quit(save = "no", status = 0)
} else if (grepl("FTP status was", w$message)) {
warning(w$message)
write("FTP ERROR", file = FAILURE_REASON_FILE)
quit(save = "no", status = 101)
} else {
warning("Unhandled warning: ", w$message)
write("UNKNOWN ERROR", file = FAILURE_REASON_FILE)
45 changes: 35 additions & 10 deletions bin/download_geo_data.R
@@ -379,6 +379,26 @@ get_microarray_counts <- function(platform) {
return(counts)
}

parse_first_line <- function(filename, sep){
tryCatch({
counts <- read.table(filename, header = FALSE, sep = sep, row.names = 1, nrows = 1)
return(counts)
}, error = function(e) {
write_warning(paste("ERROR PARSING FIRST LINE IN", filename))
return(NULL)
})
}

download_file <- function(data_url, filename){
tryCatch({
download.file(data_url, filename, method = "wget", quiet = TRUE)
return("SUCCESS")
}, error = function(e) {
write_warning(paste("ERROR WHILE DOWNLOADING:", filename))
return("FAILURE")
})
}


get_raw_counts_from_url <- function(data_url) {

@@ -399,20 +419,23 @@ get_raw_counts_from_url <- function(data_url) {
}

message(paste("Downloading", filename))
tryCatch({
download.file(data_url, filename, method = "wget", quiet = TRUE)
}, error = function(e) {
write_warning(paste("ERROR WHILE DOWNLOADING:", filename))
return(NULL)
})
download_status <- download_file(data_url, filename)
if (download_status == "FAILURE") {
return(NULL)
}

separator <- NULL
for (sep in c("\t", ",", " ")) {

# parsing the first line to determine the separator and see if there is a header
counts <- read.table(filename, header = FALSE, sep = sep, row.names = 1, nrows = 1)
if (ncol(counts) > 0) {
first_line <- parse_first_line(filename, sep)
if (is.null(first_line)) {
return(NULL)
}

if (ncol(first_line) > 0) {
separator <- sep
if (is.numeric(counts[1, 1])) {
if (is.numeric(first_line[1, 1])) {
has_header <- FALSE
} else {
has_header <- TRUE
@@ -430,7 +453,7 @@ get_raw_counts_from_url <- function(data_url) {
tryCatch({
counts <- read.table(filename, header = has_header, sep = separator, row.names = 1)
}, error = function(e) {
write_warning(paste("ERROR WHILE PARSING", filename, ":", e))
write_warning(paste("ERROR WHILE PARSING", filename))
return(NULL)
})

@@ -793,6 +816,8 @@ main <- function() {
write_warning(paste("UNSUPPORTED PLATFORM:", series$experiment_type))
}
}

message("Done")
}
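The separator and header detection in `get_raw_counts_from_url` tries tab, comma and space in turn on the first line of the downloaded file. The following Python sketch is a loose analogue of that logic, not a port of the R code: `sniff_first_line` and `_is_number` are hypothetical names, and treating the first field as row names (as `read.table(..., row.names = 1)` does) is emulated by requiring at least two fields.

```python
from typing import Optional, Tuple


def _is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def sniff_first_line(first_line: str) -> Optional[Tuple[str, bool]]:
    """Return (separator, has_header), or None when no candidate separator splits the line."""
    for sep in ("\t", ",", " "):
        fields = first_line.rstrip("\n").split(sep)
        # The first field stands in for the row names, so at least one more field is needed.
        if len(fields) > 1:
            # A numeric first value suggests the file starts directly with data.
            has_header = not _is_number(fields[1])
            return sep, has_header
    return None


print(sniff_first_line("gene\tsample1\tsample2\n"))
print(sniff_first_line("ENSG1,12,7\n"))
print(sniff_first_line("nodata\n"))
```

Returning `None` rather than raising mirrors the way the R helpers signal failure to their caller, which then records a warning and moves on to the next dataset.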


13 changes: 8 additions & 5 deletions bin/get_eatlas_accessions.py
@@ -446,6 +446,7 @@ def main():
# getting accessions of selected experiments
selected_accessions = [exp_dict["accession"] for exp_dict in results]

sampling_status = "ok"
if args.random_sampling_size and args.random_sampling_seed:
selected_accession_to_nb_samples = [
{
@@ -469,11 +470,13 @@ def main():
f"Kept {len(selected_accessions)} experiments after random sampling"
)

# writing status to file
# so that the wrapper module can get the status
with open(SAMPLING_QUOTA_OUTFILE, "w") as fout:
sampling_status = "full" if sampling_quota_reached else "ok"
fout.write(sampling_status)
if sampling_quota_reached:
sampling_status = "full"

# writing status to file
# so that the wrapper module can get the status
with open(SAMPLING_QUOTA_OUTFILE, "w") as fout:
fout.write(sampling_status)

# keeping metadata only for selected experiments
selected_experiments = get_metadata_for_selected_experiments(experiments, results)
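As far as the flattened diff lets one tell, the status file is now written on every run, with `"ok"` as the default when random sampling is not requested. A minimal sketch of that control flow, under those assumptions: `write_sampling_status` and the file name are hypothetical, and the boolean arguments stand in for the script's `args.random_sampling_*` checks and `sampling_quota_reached` flag.

```python
import tempfile
from pathlib import Path


def write_sampling_status(outfile: Path, sampling_requested: bool, quota_reached: bool) -> str:
    sampling_status = "ok"  # default, also used when sampling is skipped entirely
    if sampling_requested and quota_reached:
        sampling_status = "full"
    # Written unconditionally, so a wrapper can always read the file.
    outfile.write_text(sampling_status)
    return sampling_status


with tempfile.TemporaryDirectory() as tmp:
    outfile = Path(tmp) / "sampling_quota.txt"  # hypothetical file name
    statuses = [
        write_sampling_status(outfile, requested, reached)
        for requested, reached in [(False, False), (True, False), (True, True)]
    ]
    last_written = outfile.read_text()

print(statuses, last_written)
```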
5 changes: 2 additions & 3 deletions bin/merge_counts.py
@@ -11,7 +11,6 @@

import config
import polars as pl
from tqdm import tqdm

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@@ -41,7 +40,7 @@ def parse_args():

def get_lazyframes(files: list[Path]) -> list[pl.LazyFrame]:
"""Get a list of LazyFrames from a list of files."""
return [pl.scan_parquet(file, low_memory=True) for file in tqdm(files)]
return [pl.scan_parquet(file, low_memory=True) for file in files]


def get_columns(lf: pl.LazyFrame) -> list[str]:
@@ -99,7 +98,7 @@ def collect_all_gene_ids(lfs: list[pl.LazyFrame]) -> pl.DataFrame:
"""
logger.info("Getting the full list of gene IDs")
gene_id_set = set()
for lf in tqdm(lfs):
for lf in lfs:
lf_gene_ids = lf.select(config.GENE_ID_COLNAME).collect().to_series().to_list()
gene_id_set.update(lf_gene_ids)
return pl.DataFrame({config.GENE_ID_COLNAME: sorted(list(gene_id_set))})