Evolutionary Conservation of SINE and LINE-1 Depletion at NDD Gene Promoters Across Primates and Mammals
Can Sevilmiş | Department of Molecular Biology and Genetics, Bahçeşehir University, Istanbul, Turkey
Repository: https://github.com/Bilmem2/ndd-te-conservation
Related work: Sevilmiş, C. (2026). LINE-1 Depletion at Promoters of Neurodevelopmental Disorder Genes: A Genome-Wide Analysis. Preprints. https://doi.org/10.20944/preprints202604.0715.v1
This repository contains all analysis scripts, processed gene lists, statistical results, and figures for the above manuscript. Raw genome assemblies, RepeatMasker annotations, GTF files, and the hg38 reference FASTA are not included due to size constraints; download instructions are provided below.
.
├── scripts/
│ ├── 01_prepare_gene_lists.py # Curate NDD, BroadNDD, and Housekeeping gene sets
│ ├── 02_rmsk_to_bed.sh # Extract Alu/LINE-1 BED files from RepeatMasker
│ ├── 03_get_promoters.sh # Extract TSS ± 2 kb promoter windows from GTF
│ ├── 04_split_promoters.sh # Split promoter BEDs by gene category
│ ├── 05_intersect.sh # BEDTools intersect: TE count per promoter
│ ├── 06_statistics.py # Mann-Whitney U, effect sizes (preliminary)
│ ├── 07_pli_correlation.py # pLI vs Alu frequency correlation (exploratory)
│ ├── 08_encode_overlap_v2.py # ENCODE CTCF/DNase overlap analysis (exploratory)
│ ├── 09_window_sensitivity.py # Window size sensitivity analysis (± 0.5–3 kb)
│ ├── 10_cross_disease.py # Cross-disease specificity analysis (ClinVar)
│ ├── 11_stats_updated.py # Final statistics across all 7 species
│ ├── 12_figures_final.py # Generate all manuscript figures (Fig 1–5)
│ └── 13_ortholog_analysis.py # Ortholog-validated replication (Ensembl BioMart)
│
├── data/
│ ├── gene_lists/
│ │ ├── HighConfNDD_genes.txt # SFARI Tier 1+2 ∪ ClinGen Epilepsy Definitive/Strong (n=1020)
│ │ ├── Housekeeping_genes.txt # HRT Atlas ∩ brain TPM≥1, NDD-free (n=1679)
│ │ ├── Cardiovascular_genes.txt # ClinVar P/LP cardiovascular genes
│ │ └── Mendelian_genes.txt # ClinVar P/LP broad Mendelian genes
│ │
│ └── orthologs/
│ ├── ensembl_to_symbol.tsv # Ensembl ID ↔ HGNC symbol mapping
│ └── symbol_to_ensembl.csv # HGNC symbol ↔ Ensembl ID mapping
│
├── results/
│ ├── statistics_final.csv # Main results: all species, symbol-based matching
│ ├── statistics_ortholog.csv # Validation: Ensembl BioMart 1:1 ortholog-based matching
│ ├── null_model_new.csv # Permutation test results (n=10,000)
│ ├── encode_overlap_v2.csv # ENCODE overlap results (exploratory)
│ ├── hg38/ # Human
│ │ ├── HighConfNDD_Alu.bed
│ │ ├── Housekeeping_Alu.bed
│ │ ├── HighConfNDD_LINE1.bed
│ │ └── Housekeeping_LINE1.bed
│ ├── ponAbe3/ # Orangutan
│ │ ├── HighConfNDD_Alu.bed
│ │ ├── Housekeeping_Alu.bed
│ │ ├── HighConfNDD_LINE1.bed
│ │ └── Housekeeping_LINE1.bed
│ ├── nomLeu3/ # Gibbon
│ │ ├── HighConfNDD_Alu.bed
│ │ ├── Housekeeping_Alu.bed
│ │ ├── HighConfNDD_LINE1.bed
│ │ └── Housekeeping_LINE1.bed
│ ├── rheMac10/ # Macaque
│ │ ├── HighConfNDD_Alu.bed
│ │ ├── Housekeeping_Alu.bed
│ │ ├── HighConfNDD_LINE1.bed
│ │ └── Housekeeping_LINE1.bed
│ ├── calJac4/ # Marmoset
│ │ ├── HighConfNDD_Alu.bed
│ │ ├── Housekeeping_Alu.bed
│ │ ├── HighConfNDD_LINE1.bed
│ │ └── Housekeeping_LINE1.bed
│ ├── mm10/ # Mouse (B1/B2 SINE analogs of primate Alu + LINE-1)
│ │ ├── HighConfNDD_B1B2.bed
│ │ ├── Housekeeping_B1B2.bed
│ │ ├── HighConfNDD_LINE1.bed
│ │ └── Housekeeping_LINE1.bed
│ ├── canFam4/ # Dog (LINE-1 only; no Alu in dog)
│ │ ├── HighConfNDD_LINE1.bed
│ │ └── Housekeeping_LINE1.bed
│ ├── sensitivity/
│ │ ├── HighConfNDD_w500.bed
│ │ ├── HighConfNDD_w1000.bed
│ │ ├── HighConfNDD_w2000.bed
│ │ ├── HighConfNDD_w3000.bed
│ │ ├── Housekeeping_w500.bed
│ │ ├── Housekeeping_w1000.bed
│ │ ├── Housekeeping_w2000.bed
│ │ ├── Housekeeping_w3000.bed
│ │ └── window_sensitivity.csv
│ └── cross_disease/
│ ├── Cardiovascular_promoters.bed
│ ├── Mendelian_promoters.bed
│ └── cross_disease_results.csv
│
└── figures/
├── Fig1_Alu_Primates.pdf/.png # Alu depletion across 5 primates
├── Fig2_LINE1_Mammals.pdf/.png # LINE-1 depletion across 7 mammals
├── Fig3_Heatmap.pdf/.png # Significance heatmap (species × TE class)
├── Fig4_NullModel.pdf/.png # Permutation null model validation
└── Fig5_CpG.pdf/.png # CpG island confounder analysis
Raw data files must be downloaded separately. The following sources were used:
| Species | Assembly | Source |
|---|---|---|
| Human | hg38, GENCODE v47 | https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/ |
| Orangutan | ponAbe3, Ensembl 112 | https://ftp.ensembl.org/pub/release-112/gtf/pongo_abelii/ |
| Gibbon | nomLeu3, Ensembl 112 | https://ftp.ensembl.org/pub/release-112/gtf/nomascus_leucogenys/ |
| Macaque | rheMac10, Ensembl 112 | https://ftp.ensembl.org/pub/release-112/gtf/macaca_mulatta/ |
| Marmoset | calJac4, Ensembl 112 | https://ftp.ensembl.org/pub/release-112/gtf/callithrix_jacchus/ |
| Mouse | mm10, GENCODE vM25 | https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/ |
| Dog | canFam6, Ensembl 112 | https://ftp.ensembl.org/pub/release-112/gtf/canis_lupus_familiaris/ |
RepeatMasker annotations were retrieved from the UCSC Genome Browser for all seven species:
https://hgdownload.soe.ucsc.edu/goldenPath/{assembly}/database/rmsk.txt.gz
Replace {assembly} with: hg38, ponAbe3, nomLeu3, rheMac10, calJac4, mm10, canFam6
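As a minimal sketch of the substitution described above, the snippet below fills the {assembly} placeholder for each of the seven assemblies. The names RMSK_URL_TEMPLATE, ASSEMBLIES, and rmsk_urls are illustrative and are not taken from the repository's scripts.

```python
# Illustrative only: expand the UCSC RepeatMasker URL template per assembly.
# The template and the assembly list are copied from the text above;
# the function and variable names are hypothetical.
RMSK_URL_TEMPLATE = (
    "https://hgdownload.soe.ucsc.edu/goldenPath/{assembly}/database/rmsk.txt.gz"
)
ASSEMBLIES = ["hg38", "ponAbe3", "nomLeu3", "rheMac10", "calJac4", "mm10", "canFam6"]


def rmsk_urls(assemblies=ASSEMBLIES):
    """Return a {assembly: url} mapping by filling the {assembly} placeholder."""
    return {name: RMSK_URL_TEMPLATE.format(assembly=name) for name in assemblies}


if __name__ == "__main__":
    for assembly, url in rmsk_urls().items():
        print(assembly, url)
```

The resulting URLs can be passed to any download tool (for example wget or curl).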
Note on the dog assembly folder. The data/canFam4/ and results/canFam4/ directories carry this name for historical reasons. The dog data they contain is in fact ROS_Cfam_1.0 (UCSC canFam6): the GTF is from Ensembl release 112 and the RepeatMasker track from UCSC canFam6. The two are coordinate-compatible (they differ only by the chr prefix, which the promoter script adds), so the folder name is cosmetic only and is deliberately kept as-is to preserve every path reference in the scripts and committed results.
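The chr-prefix harmonisation mentioned above could be sketched as follows. This is an assumption about what such a step might look like, not the promoter script's actual code; the function name to_ucsc_chrom is hypothetical, and only primary chromosome names are handled (unplaced scaffolds would need an explicit lookup table).

```python
def to_ucsc_chrom(ensembl_name: str) -> str:
    """Map an Ensembl-style sequence name to a UCSC-style one.

    Assumption: only primary chromosomes (e.g. 1, 2, X) and the
    mitochondrial genome (MT) are converted; scaffold names are
    simply prefixed and may not match UCSC naming.
    """
    if ensembl_name.startswith("chr"):
        return ensembl_name  # already UCSC-style, leave untouched
    if ensembl_name == "MT":
        return "chrM"  # UCSC calls the mitochondrial genome chrM
    return "chr" + ensembl_name


if __name__ == "__main__":
    for name in ["1", "X", "MT", "chr5"]:
        print(name, "->", to_ucsc_chrom(name))
```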
The 1:1 ortholog tables in data/orthologs/ — ensembl_to_symbol.tsv,
symbol_to_ensembl.csv, and the per-species <assembly>_raw.tsv files consumed by
13_ortholog_analysis.py — were generated from
Ensembl BioMart (release 112) and are committed directly to the repository. There is
no automated fetch script on purpose: the BioMart service changes between releases, so
shipping the exact tables used here is what makes the ortholog validation reproducible.
| Dataset | Source |
|---|---|
| SFARI Gene 2.0 (Tier 1+2) | https://sfari.org/resource/sfari-gene |
| ClinGen Epilepsy GCEP | https://clinicalgenome.org |
| Human Phenotype Ontology | https://hpo.jax.org |
| HRT Atlas v1.0 | https://www.housekeeping.unicamp.br |
| GTEx brain expression | https://gtexportal.org |
| gnomAD v4.1 constraint | https://gnomad.broadinstitute.org |
| ClinVar variant summary | https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/ |
| hg38 CpG islands | https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/cpgIslandExt.txt.gz |
| hg38 reference genome | https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz |
Python dependencies are pinned in environment.yml (conda environment name bio_master):
conda env create -f environment.yml
conda activate bio_master
This provides Python 3.10 with pandas, numpy, scipy, matplotlib, seaborn, tqdm, and requests.
System dependency (install separately):
- BEDTools ≥ 2.31 must be on your PATH; it is used by scripts 05, 08–10, and 12. Install it via your package manager or with conda install -c bioconda bedtools.
RepeatMasker itself is not required. The pipeline parses pre-computed UCSC RepeatMasker tracks (rmsk.txt.gz); it never runs RepeatMasker locally.
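To illustrate what parsing a pre-computed rmsk.txt.gz track involves, here is a minimal Python sketch that converts rows of the UCSC rmsk table into BED6 records for selected repeat families. The column positions follow the UCSC rmsk table schema; the function names and the choice to store swScore in the BED score column are assumptions, and the repository's own 02_rmsk_to_bed.sh may do this differently.

```python
import gzip

# 0-based column positions in the UCSC rmsk table:
#   1 swScore, 5 genoName, 6 genoStart, 7 genoEnd,
#   9 strand, 10 repName, 11 repClass, 12 repFamily
# genoStart/genoEnd are already 0-based half-open, like BED.


def rmsk_to_bed(lines, families=("Alu", "L1")):
    """Yield BED6 tuples for rows whose repFamily is in `families`."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip malformed or truncated rows
        if fields[12] in families:
            yield (
                fields[5],        # chrom
                int(fields[6]),   # start
                int(fields[7]),   # end
                fields[10],       # name (repName, e.g. AluY)
                fields[1],        # score (swScore; an illustrative choice)
                fields[9],        # strand
            )


def read_rmsk(path, families=("Alu", "L1")):
    """Stream BED6 tuples from a gzip-compressed rmsk.txt.gz file."""
    with gzip.open(path, "rt") as handle:
        yield from rmsk_to_bed(handle, families)
```

Calling read_rmsk("rmsk.txt.gz") on a downloaded track streams the matching repeats without loading the whole file into memory.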
All scripts resolve their paths relative to the repository root (via BASH_SOURCE
in the shell scripts and Path(__file__) in the Python scripts), so the pipeline runs
from a fresh clone on any machine — no path edits required.
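A minimal sketch of this kind of root-relative path resolution in Python is shown below. The helper name repo_root and the example paths are hypothetical; the sketch assumes only that the calling script sits one directory below the repository root (in scripts/).

```python
from pathlib import Path


def repo_root(script_path, levels_up: int = 1) -> Path:
    """Return the repository root for a script stored in scripts/.

    parents[0] is the scripts/ directory itself, so parents[1]
    (levels_up=1) is the directory that contains scripts/.
    """
    return Path(script_path).resolve().parents[levels_up]


# Inside scripts/<some_script>.py one would typically write:
#     ROOT = repo_root(__file__)
#     gene_list = ROOT / "data" / "gene_lists" / "HighConfNDD_genes.txt"
# so the same code works regardless of the current working directory.
```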
# 1. Clone
git clone https://github.com/Bilmem2/ndd-te-conservation.git
cd ndd-te-conservation
# 2. Create and activate the conda environment
conda env create -f environment.yml
conda activate bio_master
# 3. Download raw genomes, GTF, and RepeatMasker tracks (not in the repo)
bash scripts/00_download_data.sh
# 4. Run the analysis in the order below.
The committed results/, data/gene_lists/, and data/orthologs/ mean you can
regenerate all statistics and figures (steps 6–10 below) without the multi-GB
downloads. 00_download_data.sh and steps 2–5 are only needed to rebuild the
intermediate promoter/TE BED files from scratch.
# 1. Prepare gene lists. This step requires the raw source files
# (SFARI, ClinGen, HPO, HRT Atlas, GTEx) placed in data/sources/. The outputs
# are already committed under data/gene_lists/, so this step can be skipped.
python scripts/01_prepare_gene_lists.py
# 2. Extract Alu, LINE-1, and mouse B1/B2 BED files from RepeatMasker annotations
bash scripts/02_rmsk_to_bed.sh
# 3. Extract TSS ± 2 kb promoter windows from GTF files
bash scripts/03_get_promoters.sh
# 4. Split promoter BEDs by gene category (HighConfNDD, BroadNDD, Housekeeping)
bash scripts/04_split_promoters.sh
# 5. Count TE overlaps per promoter window (BEDTools intersect)
bash scripts/05_intersect.sh
# 6. Window sensitivity analysis (± 0.5, 1, 2, 3 kb)
python scripts/09_window_sensitivity.py
# 7. Cross-disease specificity analysis (ClinVar)
python scripts/10_cross_disease.py
# 8. Final statistics across all 7 species (including mouse B1/B2)
python scripts/11_stats_updated.py
# 9. Generate manuscript figures (Fig 1–5)
python scripts/12_figures_final.py
# 10. Ortholog-validated replication (Ensembl BioMart 1:1)
python scripts/13_ortholog_analysis.py
Scripts 06–08 (the preliminary 06_statistics.py, 07_pli_correlation.py, and 08_encode_overlap_v2.py) contain exploratory analyses that are not included in the final manuscript but are retained for completeness.
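For readers who want to see what steps 3 and 5 compute conceptually, the following pure-Python sketch derives a strand-aware TSS ± 2 kb window from GTF-style gene coordinates and counts the repeats overlapping it. It is a sketch under stated assumptions, not the repository's implementation: the function names are hypothetical, the window is taken as the half-open interval [TSS − flank, TSS + flank), and the counting rule (any overlap of at least 1 bp) mirrors the behaviour of bedtools intersect -c rather than the exact options used in 05_intersect.sh.

```python
from typing import Iterable, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open like BED


def tss_from_gtf(start_1based: int, end_1based: int, strand: str) -> int:
    """Return the 0-based TSS position of a gene given GTF coordinates.

    GTF coordinates are 1-based and inclusive. The TSS is the gene start
    on the + strand and the gene end on the - strand.
    """
    return (start_1based if strand == "+" else end_1based) - 1


def promoter_window(chrom: str, tss: int, flank: int = 2000) -> Interval:
    """Return [tss - flank, tss + flank), clipped at the chromosome start.

    Assumption: a symmetric window, so strand only affects which gene end
    is used as the TSS.
    """
    return (chrom, max(0, tss - flank), tss + flank)


def count_overlaps(window: Interval, repeats: Iterable[Interval]) -> int:
    """Count repeats sharing at least 1 bp with the window."""
    chrom, w_start, w_end = window
    return sum(
        1
        for r_chrom, r_start, r_end in repeats
        if r_chrom == chrom and r_start < w_end and r_end > w_start
    )


if __name__ == "__main__":
    # Toy coordinates, not real annotations.
    window = promoter_window("chr1", tss_from_gtf(5001, 9000, "+"))
    repeats = [("chr1", 2900, 3001), ("chr1", 4000, 4300), ("chr1", 7000, 7100)]
    print(window, count_overlaps(window, repeats))
```

This linear scan is only meant to make the logic explicit; for genome-scale inputs BEDTools (or an interval tree) is the practical choice. Changing flank to 500, 1000, or 3000 corresponds to the window sizes examined in the sensitivity analysis.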