MarcusFFFFFF/dna-null-framework

DNA Null Framework

A per-sequence empirical null-distribution framework for detecting composition-independent long-range correlations in DNA sequences. Combines six parallel correlation methods (DFA, power spectrum, wavelet multifractal, recurrence quantification, excess entropy, reverse-complement symmetry) with dual null calibration: standard mononucleotide shuffle and Altschul–Erickson dinucleotide-preserving Euler shuffle.

Status: Research-grade, work in progress. Methods validated on π-derived null sequences and classical references (intergenic chromatin, telomeric repeats). Pilot applications to neurodevelopmental disorder gene panels are exploratory.

Preprint: Forthcoming on bioRxiv — DOI will appear here when posted.


What this does

Many DNA sequence statistics — Hurst exponents, 1/f^β spectra, multifractal widths, k-mer asymmetries — can flag "long-range correlations" that on closer inspection are driven by local nucleotide composition (GC bias, CpG depletion, codon usage). This framework separates the two:

  • Mono null shuffles preserve only single-base composition. Signals that survive are independent of base frequencies.
  • Di null uses an Altschul–Erickson Euler shuffle to preserve all 16 dinucleotide frequencies exactly. Signals that survive are independent of 2-mer composition, including CpG depletion.
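The two null modes can be illustrated with a minimal, standard-library sketch. It is not the code in src/; the function names (mono_shuffle, di_shuffle, dinuc_counts) are hypothetical, and the dinucleotide shuffle follows the usual textbook description of the Altschul–Erickson procedure: pick a random "last edge" per base so that the last edges form a tree rooted at the final base, shuffle the remaining edges, then walk the resulting Euler path.

```python
import random
from collections import Counter


def mono_shuffle(seq, rng):
    """Permute bases; preserves only single-base composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def dinuc_counts(seq):
    """Count overlapping dinucleotides."""
    return Counter(zip(seq, seq[1:]))


def _reaches(v, final, last_edge):
    """Follow last edges from v; True if the walk ends at the final base."""
    seen = set()
    while v != final:
        if v in seen:
            return False
        seen.add(v)
        v = last_edge[v]
    return True


def di_shuffle(seq, rng):
    """Euler-path shuffle preserving every dinucleotide count exactly."""
    if len(seq) < 3:
        return seq
    # Successor lists: one edge per adjacent pair in the sequence.
    succ = {}
    for a, b in zip(seq, seq[1:]):
        succ.setdefault(a, []).append(b)
    final = seq[-1]
    # Resample last edges until they form a tree rooted at the final base.
    while True:
        last_edge = {v: rng.choice(e) for v, e in succ.items() if v != final}
        if all(_reaches(v, final, last_edge) for v in last_edge):
            break
    # Shuffle the other edges and put each chosen last edge at the end.
    order = {}
    for v, edges in succ.items():
        rest = list(edges)
        if v != final:
            rest.remove(last_edge[v])
        rng.shuffle(rest)
        if v != final:
            rest.append(last_edge[v])
        order[v] = rest
    # Walk the Euler path from the original first base.
    out = [seq[0]]
    pos = dict.fromkeys(order, 0)
    for _ in range(len(seq) - 1):
        cur = out[-1]
        out.append(order[cur][pos[cur]])
        pos[cur] += 1
    return "".join(out)


rng = random.Random(1)
seq = "ACGTTGCAACGGTCAGTTTACGCGATATCCGGA"
shuffled = di_shuffle(seq, rng)
print(shuffled, dinuc_counts(shuffled) == dinuc_counts(seq))
```

The dinucleotide shuffle keeps the first and last base fixed, which is a property of the Euler-path construction rather than a tunable choice.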

For each input sequence, the framework computes a Z-score for every metric against N=100 shuffled controls under each null mode. Saturated values (|Z|≥20) are flagged separately because, with N=100, the empirical null standard deviation is estimated too noisily to support finer distinctions.
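The Z-score calibration described above can be sketched as follows. This is an illustrative sketch, not the framework's implementation: empirical_z, adjacent_repeats, and the parameter names are hypothetical, the toy metric stands in for the real correlation metrics, and only the N=100 and |Z|≥20 values come from the text.

```python
import random
import statistics


def mono_shuffle(seq, rng):
    """Permute bases; preserves only single-base composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def adjacent_repeats(seq):
    """Toy metric (hypothetical): count positions where a base equals its successor."""
    return sum(a == b for a, b in zip(seq, seq[1:]))


def empirical_z(seq, metric, shuffle, n_null=100, z_cap=20.0, seed=0):
    """Z-score of metric(seq) against n_null shuffled controls of the same sequence."""
    rng = random.Random(seed)
    observed = metric(seq)
    null = [metric(shuffle(seq, rng)) for _ in range(n_null)]
    mu = statistics.fmean(null)
    sd = statistics.stdev(null)
    if sd == 0:
        raise ValueError("null distribution has zero variance")
    z = (observed - mu) / sd
    # Flag rather than interpret very large values: sd comes from only n_null draws.
    return {"z": z, "saturated": abs(z) >= z_cap}


# A strongly structured toy sequence: four homopolymer blocks.
blocky = "A" * 100 + "C" * 100 + "G" * 100 + "T" * 100
result = empirical_z(blocky, adjacent_repeats, mono_shuffle)
print(result)
```

Because the null is rebuilt from each input sequence, the Z-score is specific to that sequence's own composition and length.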

Validation

The framework is validated on a sequence with known statistical properties: the first 10⁶ digits of π, with digits {0,1,2,3} mapped to {A,C,G,T}. Under both null modes all twelve metrics return |Z|<3 on this sequence, confirming the method correctly identifies a numerically generated null as null.
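The digit-to-base mapping can be sketched as below, under one loudly stated assumption: the text does not say how digits 4 to 9 are handled, so this sketch simply discards them (a base-4 expansion of π would be another reading). The names PI_PREFIX, DIGIT_TO_BASE, and digits_to_dna are hypothetical, and a short hard-coded prefix stands in for the 10⁶ digits used in the validation.

```python
# First 32 decimal digits of pi, decimal point removed.
PI_PREFIX = "31415926535897932384626433832795"

# Mapping stated in the text: {0,1,2,3} -> {A,C,G,T}.
DIGIT_TO_BASE = {"0": "A", "1": "C", "2": "G", "3": "T"}


def digits_to_dna(digits):
    """Map digits 0-3 to bases; ASSUMPTION: digits 4-9 are skipped."""
    return "".join(DIGIT_TO_BASE[d] for d in digits if d in DIGIT_TO_BASE)


print(digits_to_dna(PI_PREFIX))
```

For longer inputs, the digit string could come from an arbitrary-precision library such as mpmath, which is listed among the requirements.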

The classical Peng et al. (1992) long-range correlations in intergenic chromatin are reproduced cleanly under both null modes, anchoring the framework against established findings.

Quick start

Requires Python 3.9+ with numpy, mpmath, and matplotlib.

git clone https://github.com/MarcusFFFFFF/dna-null-framework.git
cd dna-null-framework
pip install -r requirements.txt

# Download reference sequences from NCBI RefSeq
bash scripts/download_sequences.sh

# Run analysis with both null modes
python3 src/dna_unified_v11.py 100 --null-mode mono
python3 src/dna_unified_v11.py 100 --null-mode di

# Diagnostic pass: repeat-screening, FDR, power analysis
python3 src/dna_unified_v12.py

# Generate publication figures
python3 scripts/generate_figures_v2.py

Outputs are written to ~/dna_analysis/ by default. See docs/methodology.md for details.

What's in the repository

src/
  dna_unified_v8.py    Core analysis engine (six methods)
  dna_unified_v11.py   Empirical null wrapper, dual-mode
  dna_unified_v12.py   Diagnostics: repeats, FDR, power, length
scripts/
  download_sequences.sh   NCBI RefSeq fetch
  generate_figures_v2.py  Publication figures
data/
  accessions.txt          NCBI accession list (sequences fetched at runtime)
results/                  JSON outputs from runs
figures/                  PNG figures
docs/
  methodology.md          Detailed methods
  preprint_outline.md     Manuscript skeleton

Honest limitations

  • Pilot panel sizes (n=6 per group) yield only 40–50% power to detect medium effect sizes. Findings are suggestive, not confirmatory.
  • TBP in the housekeeping panel contains a polyQ (CAG/CAA) repeat that confounds RQA-based comparisons. Repeat-screening (dna_unified_v12.py) identifies this and similar issues.
  • Length-matching across functional groups is imperfect. Several metrics have known length-dependent finite-size biases.
  • mRNA-only analysis omits intronic and regulatory regions where structural signals may concentrate.

These limitations are flagged in the preprint Discussion. They are not failures of the framework — they are properties of any pilot study.
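The polyQ confound noted for TBP can be illustrated with a small repeat screen. This is a hedged sketch, not what dna_unified_v12.py does: find_codon_runs and its min_units threshold are hypothetical, and the scan is not aware of the coding frame.

```python
import re


def find_codon_runs(seq, codons=("CAG", "CAA"), min_units=8):
    """Return (start, end, n_units) for tandem runs of the given codons.

    ASSUMPTION: a run of at least min_units consecutive units counts as a repeat.
    """
    pattern = re.compile("(?:%s){%d,}" % ("|".join(codons), min_units))
    return [
        (m.start(), m.end(), (m.end() - m.start()) // 3)
        for m in pattern.finditer(seq.upper())
    ]


# Toy sequence with a 12-unit CAG/CAA tract embedded in flanking bases.
toy = "ATGGCC" + "CAG" * 10 + "CAA" * 2 + "GGTTAA"
print(find_codon_runs(toy))
```

Sequences with a flagged tract could then be masked or reported separately before recurrence-based metrics are compared across panels.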

How to cite

If you use this framework, please cite the archived release via Zenodo:

Frenell, M. (2026). DNA Null Framework (v0.1.0). Zenodo.
https://doi.org/10.5281/zenodo.20283245

BibTeX:

@software{frenell_dna_null_2026,
  author    = {Frenell, Marcus},
  title     = {{DNA Null Framework}},
  month     = may,
  year      = 2026,
  publisher = {Zenodo},
  version   = {v0.1.0},
  doi       = {10.5281/zenodo.20283245},
  url       = {https://doi.org/10.5281/zenodo.20283245}
}

A preprint citation will be added once the bioRxiv version is posted.

License

MIT License; see the LICENSE file. You are free to use, modify, and distribute the code, including for commercial purposes, provided the copyright and permission notice is retained as the MIT License requires. Citation is requested but not legally required.

Contributing

Issues and pull requests are welcome. The framework is research-grade, and there are many opportunities to extend it:

  • Higher-order shuffle (Markov-order-k preserving) for trinucleotide and higher null models
  • Length-matched control generation
  • Larger reference panels for power
  • Integration with established bioinformatics packages
  • Performance optimization for genome-scale analysis

If you would like to collaborate on a specific biological application, open an issue describing the target system and we can discuss.

Contact

Marcus Frenell — marcusfrenell@gmail.com

Independent research, Stockholm, Sweden.

About

Per-sequence empirical null framework for DNA correlation analysis with mono- and dinucleotide shuffle calibration. Includes π-validation, six parallel correlation methods, and repeat-screening diagnostics.
