Skip to content

Keep allele frequency NaN when VCF lacks allele counts (#407)#1076

Merged
etal merged 1 commit into
masterfrom
fix-407-baf-zero-missing-altcount
May 24, 2026
Merged

Keep allele frequency NaN when VCF lacks allele counts (#407)#1076
etal merged 1 commit into
masterfrom
fix-407-baf-zero-missing-altcount

Conversation

@etal
Copy link
Copy Markdown
Owner

@etal etal commented May 24, 2026

Fixes #407.

Symptom

.cns output has baf of exactly 0 across (nearly) every segment, even though the load log reports many records "kept heterozygous". Reported with VarScan2 and GATK HaplotypeCaller VCFs that carry genotypes but no per-sample allele counts.

Root cause

CNVkit derives a SNP's allele frequency from per-sample allele-count FORMAT fields (AD, AO/RO, DP4, CLCAD2, Strelka tier counts), or INFO AF for unpaired VCFs — not from GT alone. When a record has GT=0/1 (→ heterozygous) but none of those count fields, read_vcf correctly computes alt_freq as NaN… but then this line coerced it to 0.0:

table = table.fillna({col: 0.0 for col in table.columns[6:]})  # includes alt_freq!

The site is still kept as heterozygous (zygosity from GT), so _mirrored_baf turns its fake 0.0 frequency into a mirrored BAF of exactly 0, pinning the whole segment's BAF to 0.

This is a "missing vs. zero" conflation: alt_freq=0.0 means "0% alt reads" (a real hom-ref observation), while NaN means "no read-count evidence". Because BAF aggregation uses np.nanmedian, a true NaN is excluded but a fake 0.0 is averaged in.

Fix

  • skgenome/tabio/vcfio.py: exclude alt_freq/n_alt_freq from the fillna(0.0) so unknown frequencies stay NaN and are skipped by the np.nanmedian BAF aggregation. (Genotype/depth/count columns are still filled with 0 as before.)
  • cnvlib/vary.py: guard baf_by_ranges' per-bin summary so a bin whose het SNPs are all NaN returns NaN directly, instead of emitting numpy's "All-NaN slice" RuntimeWarning.

Clinical-impact note

VCFs that carry allele counts are unaffectedalt_freq is never NaN from missing counts there, so the fillna never touched it, and numerical output is identical. Only the previously-buggy baf=0 sites now correctly become NaN/excluded. Affected segments' baf (and downstream cn1/cn2 in call) change from a spurious 0 to either a real value (from sites that do have counts) or NaN (if none do).

Testing

  • New test_read_vcf_het_no_alt_count (TDD: fails before, passes after) with a GT-only paired fixture test/formats/vcf-het-no-altcount.vcf, asserting alt_freq is NaN and the per-segment BAF is NaN rather than 0.
  • Full unit suite: 313 passed.
  • mypy (vcfio + vary) and ruff clean.

Docs

Adds a "baf is 0 across (nearly) every segment" troubleshooting subsection to doc/baf.rst listing the allele-count fields CNVkit reads and the new graceful-degradation behavior.

🤖 Generated with Claude Code

A heterozygous SNP whose VCF record has a genotype (GT=0/1) but no
allele-count FORMAT field (AD/AO/RO/DP4/CLCAD2/Strelka tiers) and no AF has
an *unknown* alternate-allele frequency. read_vcf computed alt_freq as NaN
correctly, but then fillna(0.0) over columns[6:] coerced it to 0.0. Because
the site is still kept as heterozygous (zygosity from GT), its mirrored BAF
was pinned to exactly 0, producing baf=0 across whole segments -- the
symptom reported in #407 (e.g. VarScan2/GATK VCFs lacking per-sample counts).

Leave alt_freq / n_alt_freq as NaN (exclude them from the 0.0 fill) so the
np.nanmedian-based BAF aggregation skips count-less sites instead of dragging
the result to 0. Guard baf_by_ranges' per-bin summary so an all-NaN bin
returns NaN directly rather than emitting numpy's "All-NaN slice" warning.

VCFs that carry allele counts are unaffected (alt_freq is never NaN there),
so numerical output is unchanged for well-formed inputs; only the buggy
baf=0 sites now correctly become NaN/excluded.

Adds test_read_vcf_het_no_alt_count with a GT-only paired fixture, and a
troubleshooting section in doc/baf.rst.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.80%. Comparing base (12000f7) to head (2af42dd).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
+ Coverage   67.73%   67.80%   +0.06%     
==========================================
  Files          74       74              
  Lines        7674     7684      +10     
  Branches     1363     1365       +2     
==========================================
+ Hits         5198     5210      +12     
+ Misses       2035     2034       -1     
+ Partials      441      440       -1     
Flag Coverage Δ
unittests 67.80% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@etal etal merged commit 2338d14 into master May 24, 2026
12 of 13 checks passed
@etal etal deleted the fix-407-baf-zero-missing-altcount branch May 24, 2026 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BAF doesn't seem to be correct

1 participant