The web contains vast amounts of educational content, but its quality varies enormously. The FineWeb-Edu dataset assigns quality scores to nearly 10 billion tokens of web text, making it one of the largest educational content corpora available. This project trains a distributed classifier to predict those quality scores — effectively automating the judgment of whether a web document is educationally valuable.
A reliable educational quality predictor has broad real-world impact: it can drive smarter dataset curation for training large language models, power content recommendation systems for students, and help researchers filter high-quality sources at scale.
Why this requires big data and distributed computing: The FineWeb-Edu Sample-10BT subset alone contains 9.67 million documents across 14 Parquet files. Loading, filtering, featurizing (Word2Vec over millions of documents), and training Random Forest models on this data is impractical on a single machine — it would take hours just to read the data, let alone train. Apache Spark on SDSC Expanse allows all of these steps to run in parallel across 32 cores, reducing training time from hours to seconds. Without Spark/Ray, the full pipeline would be computationally infeasible.
Notebook: notebooks/01-exploration.ipynb
We use the FineWeb-Edu Sample-10BT dataset from HuggingFace.
The dataset contains 9,672,101 educational web documents stored in 14 Parquet files.
Data is processed using Apache Spark on SDSC Expanse due to its distributed format and scale.
All work is performed on the SDSC Expanse high-performance computing cluster.
Hardware Configuration:
- CPU Cores: 8
- Memory: 128 GB per node
```python
spark = SparkSession.builder \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "15g") \
    .config("spark.executor.instances", 7) \
    .getOrCreate()
```

Sizing rationale:

- Executor instances = Total Cores − 1 (driver) = 8 − 1 = 7
- Executor memory = (Total Memory − Driver Memory) / Executor Instances = (128 GB − 2 GB) / 7 ≈ 18 GB
- Executor memory was set to 15 GB to leave headroom for Spark and JVM overhead and to prevent out-of-memory errors.
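The sizing rule above can be checked with a quick plain-Python sketch (the 3 GB headroom figure is simply the gap between the ≈18 GB raw share and the 15 GB chosen, not a Spark-mandated constant):

```python
def size_executors(total_cores, total_mem_gb, driver_mem_gb, headroom_gb=3):
    # Reserve one core for the driver; split the remaining memory evenly.
    instances = total_cores - 1
    raw_share = (total_mem_gb - driver_mem_gb) / instances
    # Subtract headroom for Spark/JVM overhead (illustrative margin).
    return instances, raw_share, int(raw_share) - headroom_gb

instances, raw_share, chosen = size_executors(8, 128, 2)
print(instances, raw_share, chosen)  # 7 18.0 15
```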
Below is the DataFrame output showing multiple executors active during data loading:
Notebook: notebooks/01-exploration.ipynb
The dataset contains 9,672,101 documents in the FineWeb-Edu Sample-10BT subset.
| Column | Type | Description | Scale / Distribution |
|---|---|---|---|
| `id` | string | Document ID | Unique identifier |
| `text` | string | Document content | Variable length |
| `url` | string | Source URL | Diverse domains |
| `language` | string | Language | Predominantly English |
| `language_score` | float | Language confidence | 0–1 (mostly >0.95) |
| `token_count` | integer | Token count | 1–170K, median ~629 |
| `score` | float | Quality score | 2.5–5.34, median ~2.9 |
| `int_score` | integer | Binned quality score | Buckets 3–5 (mostly 3) |
- Categorical variables: `language`, `url`
- Continuous variables: `language_score`, `token_count`, `score`
- Identifier column: `id`
- Target variable (planned): `score` / `int_score` (educational quality)
- No null values detected in core columns
- No duplicate records based on `id`
- Dataset is clean and suitable for modeling
- Token count is highly right-skewed with median ~629 and mean ~1031 tokens, extending up to 170K tokens. This indicates high variability in document length.
- Quality score is moderately right-skewed, ranging from 2.5 to 5.34 with median ~2.9 and mean ~3.0. Most documents fall in quality bucket 3, with fewer in bucket 5.
- Language distribution is strongly dominated by English content.
- Domain distribution shows exceptional diversity with over 2 million unique sources. Top domains (Wikipedia, Britannica) each represent <1% of the total dataset.
The following Spark DataFrame operations were used:
- `df.count()` — total observation count
- `df.printSchema()` — schema inspection
- `df.describe().show()` — summary statistics
- `df.groupBy().agg()` — aggregations by quality bucket
- `df.select().distinct().count()` — unique domain counts
- `df.dropDuplicates()` — duplicate detection
The majority of documents fall into quality bucket 3 (86.7%), followed by bucket 4 (13.2%), with very few in bucket 5 (0.08%). This confirms a severe class imbalance that must be considered during model training.
Unique Domains: 2,088,546
The dataset exhibits exceptional domain diversity with over 2 million unique sources. Top domains include educational websites (Wikipedia, Britannica), science news (Phys.org, ScienceDaily), and academic resources. Even the top domain (Wikipedia) represents <1% of total documents, confirming no single source dominates the dataset.
Note: While the full dataset contains 9.6M documents, the visualizations below utilize a random 2% sample (~193K documents) to maintain performance while accurately representing the data distribution.
The token count distribution is heavily right-skewed. The median document length is ~629 tokens, while the mean (~1031 tokens) is larger due to a long tail extending up to 170K tokens. This indicates high variability in document length and suggests potential filtering or normalization during preprocessing.
The quality score ranges from approximately 2.5 to 5.34, with a median of ~2.9 and mean ~3.0. The distribution is moderately right-skewed, with most documents concentrated in the lower-to-mid score range (int_score = 3). Very high-quality documents (score bucket 5) are relatively rare.
Notebook: notebooks/02-preprocessing.ipynb
Preprocessing was performed on SDSC Expanse with the following Spark configuration:
```python
spark = SparkSession.builder \
    .appName("FineWeb_Preprocessing_Pipeline") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.instances", "31") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.memory", "3g") \
    .getOrCreate()
```

- Total cores: 32 (1 driver + 31 executors)
- Total memory: ~128 GB
- Executor memory justification: (128 GB − 2 GB) / (31 × 1.1 overhead) ≈ 3 GB — actual usage ~104 GB, safely within 128 GB
Documents were filtered using Spark's `df.filter()` with the following criteria:

- Text is not null and longer than 200 characters
- `token_count` ≥ 50
- `language` == `"en"`
- `language_score` ≥ 0.9
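For clarity, the same criteria expressed as a plain-Python predicate (a sketch only; the actual filtering runs through Spark's `df.filter()` on the cluster):

```python
def passes_filters(row):
    # `row` is assumed to be a dict keyed by the dataset's column names.
    return (
        row.get("text") is not None
        and len(row["text"]) > 200
        and row["token_count"] >= 50
        and row["language"] == "en"
        and row["language_score"] >= 0.9
    )

doc = {"text": "x" * 300, "token_count": 120,
       "language": "en", "language_score": 0.97}
print(passes_filters(doc))                        # True
print(passes_filters({**doc, "language": "fr"}))  # False
```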
Duplicates were checked by comparing total count vs. distinct id count. No duplicates were found.
To address class imbalance, stratified sampling was applied using sampleBy:
| int_score | Sample Fraction |
|---|---|
| 3 | 5% |
| 4 | 30% |
| 5 | 100% |
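A back-of-the-envelope check of what these fractions do to the class balance, using the bucket shares reported in the exploration section (86.7% / 13.2% / 0.08%):

```python
# Pre-sampling bucket shares (exploration section) and sampleBy fractions.
shares = {3: 0.867, 4: 0.132, 5: 0.0008}
fractions = {3: 0.05, 4: 0.30, 5: 1.00}

# Relative mass of each bucket after stratified sampling, renormalized.
sampled = {k: shares[k] * fractions[k] for k in shares}
total = sum(sampled.values())
post = {k: sampled[k] / total for k in sampled}
print({k: round(v, 3) for k, v in post.items()})
# Buckets 3 and 4 end up roughly balanced (~52% / ~47%); bucket 5 stays rare (~1%).
```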
Two new columns were added using Spark SQL functions:
- `domain` — extracted from the URL using `regexp_extract`
- `length_bucket` — categorizes documents by `token_count`:
  - Short: < 500 tokens
  - Medium: 500–2000 tokens
  - Long: > 2000 tokens
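A plain-Python sketch of both derived columns (the URL pattern here is an assumption for illustration; the notebook's actual `regexp_extract` pattern may differ):

```python
import re

def extract_domain(url):
    # Illustrative equivalent of the regexp_extract step.
    m = re.match(r"https?://([^/]+)", url)
    return m.group(1) if m else None

def length_bucket(token_count):
    # Bucket boundaries taken from the list above.
    if token_count < 500:
        return "short"
    if token_count <= 2000:
        return "medium"
    return "long"

print(extract_domain("https://en.wikipedia.org/wiki/Education"))  # en.wikipedia.org
print(length_bucket(629))                                         # medium
```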
The target variable was cast to integer:
```python
df = df.withColumn("label", F.col("int_score").cast(IntegerType()))
```

The preprocessed DataFrame was saved to `data/interim/fineweb_preprocessed.parquet`.
Notebook: notebooks/03-model-1.ipynb
Training was performed on SDSC Expanse using Ray + RayDP:
```python
spark = raydp.init_spark(
    app_name="FineWeb_Spark_on_Ray",
    num_executors=31,
    executor_cores=1,
    executor_memory="4GB",
)
```

A 20% random sample (~142,415 rows) of the preprocessed data was used, split 60/20/20 into train, validation, and test sets.
| Stage | Description |
|---|---|
| `RegexTokenizer` | Splits text into lowercase tokens on non-word characters |
| `StopWordsRemover` | Removes common English stop words |
| `Word2Vec` | 10-dimensional embeddings; `minCount=500` |
| `Imputer` | Fills missing `token_count` with the column mean |
| `VectorAssembler` | Wraps `token_count` into a vector |
| `StandardScaler` | Standardizes `token_count` to zero mean and unit variance |
| `VectorAssembler` | Combines text embeddings + scaled `token_count` into `features` |
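The last assembler yields an 11-dimensional vector: 10 Word2Vec dimensions plus the scaled token count. A plain-Python sketch of that final step (the embedding values and the std figure are made up for illustration; the mean comes from the exploration section):

```python
def assemble_features(embedding, token_count, mean, std):
    # StandardScaler step: zero mean, unit variance for token_count...
    scaled = (token_count - mean) / std
    # ...then the final VectorAssembler appends it to the text embedding.
    return embedding + [scaled]

embedding = [0.1] * 10  # hypothetical 10-D Word2Vec output
features = assemble_features(embedding, 629, mean=1031.0, std=2000.0)
print(len(features))  # 11
```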
Two RandomForestClassifier models were compared:
```python
# Model 1
rf1 = RandomForestClassifier(numTrees=5, maxDepth=2)

# Model 2
rf2 = RandomForestClassifier(numTrees=20, maxDepth=8)
```

| Model | Train | Val | Test | Train–Test Gap |
|---|---|---|---|---|
| RF (numTrees=5, maxDepth=2) | 0.6149 | 0.6101 | 0.6047 | 0.0102 |
| RF (numTrees=20, maxDepth=8) | 0.6980 | 0.6807 | 0.6837 | 0.0143 |
Best model: RandomForestClassifier (numTrees=20, maxDepth=8) — saved to models/.
- Model 1 sits toward underfitting — the 1.0 pp train–test gap shows minimal overfitting, but `maxDepth=2` limits the model to simple decision boundaries.
- Model 2 shows mild overfitting (1.4 pp gap) but delivers +7.9 pp test accuracy, sitting in a better position on the bias–variance tradeoff.
- Word2Vec embeddings + token count carry real signal, but richer features (TF-IDF, sentence count, punctuation density) could further improve performance.
- Distributed computing was essential: Model 1 trained in ~4 s and Model 2 in ~71 s across 31 Ray executors — infeasible on a single machine at this data scale.
Notebook: notebooks/04-model-2-pca.ipynb
Training was performed on SDSC Expanse using Ray + RayDP:
```python
spark = raydp.init_spark(
    app_name="FineWeb_Spark_on_Ray",
    num_executors=31,
    executor_cores=1,
    executor_memory="4GB",
)
```

A 20% random sample (~142,623 rows) of the preprocessed data was used, split 60/20/20 into train, validation, and test sets.
| Stage | Description |
|---|---|
| `RegexTokenizer` | Splits text into lowercase tokens on non-word characters |
| `StopWordsRemover` | Removes common English stop words |
| `Word2Vec` | 10-dimensional embeddings; `minCount=500` |
| `Imputer` | Fills missing `token_count` with the column mean |
| `VectorAssembler` | Wraps `token_count` into a vector |
| `StandardScaler` | Standardizes `token_count` to zero mean and unit variance |
| `VectorAssembler` | Combines text embeddings + scaled `token_count` into `features` |
Two RandomForestClassifier models were compared:
```python
# Model 1
rf1 = RandomForestClassifier(numTrees=20, maxDepth=6)

# Model 2
rf2 = RandomForestClassifier(numTrees=80, maxDepth=10)
```

| Model | Train | Val | Test | Train–Test Gap |
|---|---|---|---|---|
| RF (numTrees=20, maxDepth=6) | 0.6381 | 0.6355 | 0.6322 | 0.0059 |
| RF (numTrees=80, maxDepth=10) | 0.6698 | 0.6439 | 0.6434 | 0.0264 |
Best model: RandomForestClassifier (numTrees=80, maxDepth=10) — saved to models/.
- Model 1 sits near mild underfitting (small gap, limited capacity).
- Model 2 shows mild overfitting (2.64 pp gap) but is well-controlled and delivers the best accuracy.
- Reducing 11 → 5 dimensions costs ~4.0 pp accuracy but cuts training time from ~13 min to under 1 min.
- Distributed computing was essential: Model 1 trained in ~5.5 s and Model 2 in ~58.6 s across 31 Ray executors — infeasible on a single machine at this data scale.
Model 1 — No PCA (03-model-1.ipynb)
Trained on the full 11-D feature vector. Best configuration: RF (numTrees=20, maxDepth=8) → 68.37% test accuracy, 1.02 pp train–test gap. Good generalization; the pipeline signal (Word2Vec + token count) is confirmed.
Model 2 — PCA k=5 (this notebook)
Trained on 5 PCA components. Best configuration: RF (numTrees=80, maxDepth=10) → 64.34% test accuracy, 2.64 pp gap. PCA cuts training time dramatically but loses ~4.0 pp of accuracy due to aggressive compression.
Key finding: PCA (k=5) is too aggressive here — the lost variance carries real discriminative signal for the Random Forest.
What can be done to improve it?
- Use a larger `k` (8–10) or inspect the explained-variance curve to pick the elbow
- Enrich the input features before PCA (TF-IDF, stylistic features) to make compression less lossy
- Try `GBTClassifier`, which may extract more signal from the reduced representation
- Include `label=5` and train a proper 3-class model
- Scale to a larger data fraction to reduce variance across all metrics
Michael: Team Leader / Project Manager / Coder and Writer: Contribution
Led the project by organizing tasks, coordinating the team, and ensuring deadlines were met. Contributed to both the coding and the project write-up, supporting the overall implementation and documentation. Regularly asked the team for feedback on the documentation to improve clarity and quality.
Sopan: Coder and Writer: Contribution
Contributed to both the coding and the project write-up. Assisted with implementation, debugging, and documenting the project. Collaborated with the team to support progress and ensure completion of assigned tasks.
Justin: Coder and Writer: Contribution
Contributed to both the coding and write-up of the project. Supported the team leader by helping implement key components, debugging issues, and keeping progress on track. Actively checked in with teammates and asked if they needed help, providing support where needed. Assisted in writing and refining the final report and collaborated closely with the team.
```shell
git clone https://github.com/mn-cs/fineweb-spark
cd fineweb-spark
```
- Collaboration Guide - Team workflow and git instructions







