Unexpected number of samples in summary output file cohorts.tsv #76

@koszulordie

Description

When running the Intogen pipeline as a workflow in the Seqera Launchpad (bbglabirb/ALP > intogen_cll_richter), the output file cohorts.tsv reports an unexpected number of samples in the SAMPLES column.

I have verified that the parsing step BBGTOOLS:INTOGENPLUS:PARSE:GROUPBY (OpenVariant groupby) works as intended by inspecting the output file of this process: /workspace/nobackup/work/intogen/intogen-richter/work/7e/c0c7be82455ae7407208bd61ceaa82/MASSONI.parsed.tsv.gz. Counting the unique samples in this file with pandas gives the expected number, 1069.
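
The pandas check itself is not shown here. A minimal standard-library sketch of an equivalent count follows, assuming the parsed file is a gzipped, tab-separated table with a header row. The function name `count_unique` and the column name `SAMPLE` are assumptions for illustration; the actual header of MASSONI.parsed.tsv.gz is not given in this issue.

```python
import csv
import gzip


def count_unique(path, column):
    """Count distinct values of a named column in a gzipped TSV with a header row."""
    with gzip.open(path, "rt", newline="") as handle:
        # QUOTE_NONE treats quote characters literally, as `cut` would.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not in header {reader.fieldnames}")
        return len({row[column] for row in reader})


if __name__ == "__main__":
    import sys

    # Usage: python count_unique.py <file.tsv.gz> <column name>
    if len(sys.argv) == 3:
        print(count_unique(sys.argv[1], sys.argv[2]))
```

Selecting the column by header name rather than by position makes the count independent of the column order in the file.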

However, the step BBGTOOLS:INTOGENPLUS:PREPROCESS:VARIANTS_COUNT yields the wrong sample count, as can be seen by running the corresponding bash script:

variants=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | wc -l)
samples=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | cut -f1 | sort -u | wc -l)
echo "MASSONI	CLL	WGS	$variants	$samples" > MASSONI.counts

This yields the wrong count of 24 samples instead of the expected 1069.
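
The script counts distinct values in the first column (`cut -f1`). Whether the sample identifiers are in the first column of the parsed file is not stated in this issue, so the following is only a diagnostic sketch: it reports the number of distinct values per column position, which shows which column produces 24 and which produces 1069. The function name `unique_counts_by_position` is hypothetical, and the file is assumed to be tab-separated with a header row.

```python
import csv
import gzip


def unique_counts_by_position(path):
    """Return (1-based position, header name, distinct value count) per column."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)  # first line is assumed to be the header
        seen = [set() for _ in header]
        for row in reader:
            # Ignore any fields beyond the header width.
            for i, value in enumerate(row[: len(header)]):
                seen[i].add(value)
    # Positions are 1-based so they line up with `cut -fN`.
    return [(i + 1, name, len(values))
            for i, (name, values) in enumerate(zip(header, seen))]


if __name__ == "__main__":
    import sys

    # Usage: python unique_counts.py <file.tsv.gz>
    if len(sys.argv) == 2:
        for position, name, count in unique_counts_by_position(sys.argv[1]):
            print(f"{position}\t{name}\t{count}")
```

If the position holding 1069 distinct values is not 1, the script's `cut -f1` is reading a different column than the one pandas counted.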

Metadata

Labels

bug (Something isn't working), invalid (This doesn't seem right)
