When running the Intogen pipeline as a workflow in the Seqera launchpad (bbglabirb/ALP > intogen_cll_richter), the output file cohorts.tsv reports an inconsistent number of samples in the SAMPLES column.
I have checked that the parsing step BBGTOOLS:INTOGENPLUS:PARSE:GROUPBY (OpenVariant groupby) works as intended by inspecting the output file of this process: /workspace/nobackup/work/intogen/intogen-richter/work/7e/c0c7be82455ae7407208bd61ceaa82/MASSONI.parsed.tsv.gz. Counting the unique samples in that file with pandas gives the expected number, 1069.
However, the step BBGTOOLS:INTOGENPLUS:PREPROCESS:VARIANTS_COUNT yields the wrong sample count, as can be seen by running the corresponding bash script:
variants=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | wc -l)
samples=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | cut -f1 | sort -u | wc -l)
echo "MASSONI CLL WGS $variants $samples" > MASSONI.counts
This yields a count of 24 samples instead of the expected 1069.
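A possible explanation, not verified here: the script takes the sample IDs from the first column (cut -f1), and 24 happens to be the number of distinct human chromosomes (1 to 22, X, Y), so the first column of the parsed file may hold something other than the sample ID. The sketch below counts unique values in a column looked up by its header name instead of by position. The function name count_unique_by_header and the column name SAMPLE are assumptions for illustration; the actual header of MASSONI.parsed.tsv.gz should be checked first.

```shell
#!/usr/bin/env bash
# Count unique values in a named column of a gzipped, tab-separated file.
# The column is located from the header line rather than a fixed position.
# Hypothetical helper, not part of the Intogen pipeline.
#   $1: path to the .tsv.gz file
#   $2: header name of the column (e.g. SAMPLE, which is an assumption)
count_unique_by_header() {
    gzip -dc "$1" | awk -F'\t' -v col="$2" '
        NR == 1 {
            # Find the index of the requested column in the header.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        # Count each distinct value once.
        !seen[$idx]++ { n++ }
        END { if (idx) print n + 0 }
    '
}

# Usage (paths and column name as assumed above):
#   gzip -dc MASSONI.parsed.tsv.gz | head -n1 | tr '\t' '\n' | cat -n   # list columns
#   count_unique_by_header MASSONI.parsed.tsv.gz SAMPLE
```

If the count by header name matches the pandas result while cut -f1 does not, the mismatch would point to the column order in the parsed file rather than to the parsing step itself.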