This repository provides a minimal, reviewer-ready pipeline to reproduce:
- Noise removal / batch correction (edgeR TMM; outlier clusters removed)
- LOESS smoothing → one curve per gene (average technical → average biological)
- PCA on cleaned counts (optional FC ≥ 1.5 filter; off by default)
The code is RStudio-friendly and runnable from the command line.
- input/ # <- Input data
- scripts/ # 01_noise_removal.R, 02_loess.R, 03_pca.R
- tables/ # outputs: cleaned tables, LOESS tables, PCA scores
- figures/ # outputs: PCA plot
- legacy/ # archived older scripts (not used by the pipeline)
- R (≥ 4.1 recommended)
- Install these packages once:
1. openxlsx
2. edgeR
3. stringr
4. ggrepel
4. readr
5. tidyverse
In a terminal:
git clone https://github.com/umbibio/Babesia_time_course.git
From the repository root (Babesia_time_course):
Rscript scripts/01_noise_removal.R
-
tables/raw_counts_normal_growth_clean.xlsx — cleaned counts (first col GeneName)
-
tables/experiment_design_clean.xlsx — cleaned design with Batch (clusters 3 & 4 removed)
-
Filters low expression (CPM > 2 in ≥ 3 samples)
-
edgeR TMM normalization → logCPM
-
Hierarchical clustering on samples → cut into 4 clusters → drop clusters 3 & 4
-
Standardizes gene column to GeneName
Rscript scripts/02_loess.R
-
tables/BdAllData.xlsx with columns:
-
Gene, maxTime, minTime, maxValue, minValue, FoldChange
-
BdyValues1..50 (smoothed curve values)
-
BdNormyValues1..50 (same curve, normalized to sum = 100)
-
If some genes had < 3 usable timepoints after cleaning:
-
Averages technical replicates within each biological replicate (BE1+BE2 → BE, CK1+CK2 → CK) per timepoint, using available columns
-
Averages biological replicates (BE vs CK) per timepoint, using available values
-
Fits one curve per gene with adaptive LOESS (more stable with few points);
-
falls back to linear interpolation when needed
-
Evaluates each curve at 50 points; keeps legacy column names for compatibility
-
tables/loess_skipped_genes.tsv (gene ID, reason, times used)
Rscript scripts/03_pca.R
-
tables/pca_scores.tsv — PC1/PC2 for each timepoint (after averaging reps)
-
figures/pca_scores.pdf — publication-ready PCA plot
-
edgeR TMM → logCPM on cleaned counts
-
Averages all replicates at each timepoint (BE/CK; 1/2)
-
Optional amplitude filter across time (set min_fold_change <- 1.5 in the script to enable)
-
PCA on timepoints (rows) × genes (columns)
-
Open the repo in RStudio.
-
Source & run, in order:
-
scripts/01_noise_removal.R
-
scripts/02_loess.R
-
scripts/03_pca.R
-
Inspect outputs under tables/ and figures/.
-
Sample names: BdC9-<BE|CK>- (e.g., BdC9-BE1-6)
-
Time is parsed from the trailing integer (hours).
-
LOESS uses only available timepoints per gene after cleaning.
-
A minimum of 3 usable timepoints is required to fit a curve.
-
Seeds are set where applicable for reproducibility.
This repository is licensed under the MIT License.