Unexpected number of samples in summary output file cohorts.tsv

When running the Intogen pipeline as a workflow in the Seqera launchpad (`bbglabirb/ALP > intogen_cll_richter`), the output file `cohorts.tsv` reports an inconsistent number of samples in the `SAMPLES` column.

I have checked that the parsing step `BBGTOOLS:INTOGENPLUS:PARSE:GROUPBY (OpenVariant groupby)` works as intended by inspecting the output file of this process: `/workspace/nobackup/work/intogen/intogen-richter/work/7e/c0c7be82455ae7407208bd61ceaa82/MASSONI.parsed.tsv.gz`. Counting the unique samples with pandas gives the expected number of samples, 1069.
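For reference, the check above can be reproduced without pandas. This is a minimal sketch that selects the sample column by name rather than by position. The column name `SAMPLE` and the function name `count_unique_by_name` are assumptions for illustration; the actual header of `MASSONI.parsed.tsv.gz` should be used instead.

```python
import csv
import gzip


def count_unique_by_name(path, column="SAMPLE"):
    """Count distinct values in a named column of a gzipped, tab-separated file.

    The default column name "SAMPLE" is an assumption; adjust it to the real header.
    """
    with gzip.open(path, "rt", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the fields untouched.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in header {reader.fieldnames}")
        return len({row[column] for row in reader})
```

Calling `count_unique_by_name("MASSONI.parsed.tsv.gz")` should agree with the pandas count if the column name matches.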

However, the step `BBGTOOLS:INTOGENPLUS:PREPROCESS:VARIANTS_COUNT` yields the wrong sample count, as can be seen by running the corresponding bash script:

```
variants=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | wc -l)
samples=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | cut -f1 | sort -u | wc -l)
echo "MASSONI	CLL	WGS	$variants	$samples" > MASSONI.counts
```
This yields 24 samples instead of the expected 1069.
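Since the script takes the first field (`cut -f1`) while the pandas check counts samples directly, one thing worth ruling out is that the first column of the parsed file is not the sample identifier. That is only a hypothesis, not something confirmed here. A minimal sketch for checking it lists the number of distinct values per column position, so it is easy to see which column gives 24 and which gives 1069. The function name `distinct_per_column` is hypothetical.

```python
import gzip


def distinct_per_column(path):
    """Return (1-based position, header name, distinct value count) for each column.

    Positions match the numbering used by `cut -f`.
    Note: all distinct values are held in memory, which may be heavy for large files.
    """
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        seen = [set() for _ in header]
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            for values, field in zip(seen, fields):
                values.add(field)
    return [
        (position, name, len(values))
        for position, (name, values) in enumerate(zip(header, seen), start=1)
    ]


if __name__ == "__main__":
    for position, name, count in distinct_per_column("MASSONI.parsed.tsv.gz"):
        print(f"{position}\t{name}\t{count}")
```

If the sample column turns out not to be in position 1, the `cut -f1` in the counting script would be counting a different field.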
