Keep allele frequency NaN when VCF lacks allele counts (#407)#1076
Merged
Conversation
A heterozygous SNP whose VCF record has a genotype (GT=0/1) but no allele-count FORMAT field (AD/AO/RO/DP4/CLCAD2/Strelka tiers) and no AF has an *unknown* alternate-allele frequency. read_vcf computed alt_freq as NaN correctly, but then fillna(0.0) over columns[6:] coerced it to 0.0. Because the site is still kept as heterozygous (zygosity from GT), its mirrored BAF was pinned to exactly 0, producing baf=0 across whole segments -- the symptom reported in #407 (e.g. VarScan2/GATK VCFs lacking per-sample counts). Leave alt_freq / n_alt_freq as NaN (exclude them from the 0.0 fill) so the np.nanmedian-based BAF aggregation skips count-less sites instead of dragging the result to 0. Guard baf_by_ranges' per-bin summary so an all-NaN bin returns NaN directly rather than emitting numpy's "All-NaN slice" warning. VCFs that carry allele counts are unaffected (alt_freq is never NaN there), so numerical output is unchanged for well-formed inputs; only the buggy baf=0 sites now correctly become NaN/excluded. Adds test_read_vcf_het_no_alt_count with a GT-only paired fixture, and a troubleshooting section in doc/baf.rst. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1076 +/- ##
==========================================
+ Coverage 67.73% 67.80% +0.06%
==========================================
Files 74 74
Lines 7674 7684 +10
Branches 1363 1365 +2
==========================================
+ Hits 5198 5210 +12
+ Misses 2035 2034 -1
+ Partials 441 440 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #407.
Symptom
.cnsoutput hasbafof exactly 0 across (nearly) every segment, even though the load log reports many records "kept heterozygous". Reported with VarScan2 and GATK HaplotypeCaller VCFs that carry genotypes but no per-sample allele counts.Root cause
CNVkit derives a SNP's allele frequency from per-sample allele-count FORMAT fields (
AD,AO/RO,DP4,CLCAD2, Strelka tier counts), or INFOAFfor unpaired VCFs — not fromGTalone. When a record hasGT=0/1(→ heterozygous) but none of those count fields,read_vcfcorrectly computesalt_freqasNaN… but then this line coerced it to0.0:The site is still kept as heterozygous (zygosity from
GT), so_mirrored_bafturns its fake0.0frequency into a mirrored BAF of exactly0, pinning the whole segment's BAF to 0.This is a "missing vs. zero" conflation:
alt_freq=0.0means "0% alt reads" (a real hom-ref observation), whileNaNmeans "no read-count evidence". Because BAF aggregation usesnp.nanmedian, a trueNaNis excluded but a fake0.0is averaged in.Fix
skgenome/tabio/vcfio.py: excludealt_freq/n_alt_freqfrom thefillna(0.0)so unknown frequencies stayNaNand are skipped by thenp.nanmedianBAF aggregation. (Genotype/depth/count columns are still filled with 0 as before.)cnvlib/vary.py: guardbaf_by_ranges' per-bin summary so a bin whose het SNPs are allNaNreturnsNaNdirectly, instead of emitting numpy's "All-NaN slice"RuntimeWarning.Clinical-impact note
VCFs that carry allele counts are unaffected —
alt_freqis neverNaNfrom missing counts there, so thefillnanever touched it, and numerical output is identical. Only the previously-buggybaf=0sites now correctly becomeNaN/excluded. Affected segments'baf(and downstreamcn1/cn2incall) change from a spurious0to either a real value (from sites that do have counts) orNaN(if none do).Testing
test_read_vcf_het_no_alt_count(TDD: fails before, passes after) with aGT-only paired fixturetest/formats/vcf-het-no-altcount.vcf, assertingalt_freqisNaNand the per-segment BAF isNaNrather than0.mypy(vcfio + vary) andruffclean.Docs
Adds a "
bafis 0 across (nearly) every segment" troubleshooting subsection todoc/baf.rstlisting the allele-count fields CNVkit reads and the new graceful-degradation behavior.🤖 Generated with Claude Code