FineWeb-Edu Large-Scale Analysis with Apache Spark

Introduction

The web contains vast amounts of educational content, but its quality varies enormously. The FineWeb-Edu dataset assigns quality scores to nearly 10 billion tokens of web text, making it one of the largest educational content corpora available. This project trains a distributed classifier to predict those quality scores — effectively automating the judgment of whether a web document is educationally valuable.

A reliable educational quality predictor has broad real-world impact: it can drive smarter dataset curation for training large language models, power content recommendation systems for students, and help researchers filter high-quality sources at scale.

Why this requires big data and distributed computing: The FineWeb-Edu Sample-10BT subset alone contains 9.67 million documents across 14 Parquet files. Loading, filtering, featurizing (Word2Vec over the document text), and training Random Forest models on this data is impractical on a single machine: reading the data alone would take hours, before any training begins. Apache Spark on SDSC Expanse runs all of these steps in parallel across up to 32 cores, so the Random Forest models reported below train in seconds to about a minute. Without Spark and Ray, the full pipeline would be computationally infeasible.

Notebook: notebooks/01-exploration.ipynb

Dataset

We use the FineWeb-Edu Sample-10BT dataset from HuggingFace.

The dataset contains 9,672,101 educational web documents stored in 14 Parquet files.
Data is processed using Apache Spark on SDSC Expanse due to its distributed format and scale.

SDSC Expanse Setup

All work is performed on the SDSC Expanse high-performance computing cluster.

Hardware Configuration:

CPU Cores: 8
Memory: 128 GB per node

SparkSession Configuration

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "15g") \
    .config("spark.executor.instances", 7) \
    .getOrCreate()

Justification

Executor instances = Total Cores - 1 = 8 - 1 = 7

Executor memory = (Total Memory - Driver Memory) / Executor Instances
= (128GB - 2GB) / 7 ≈ 18GB

Executor memory was set to 15GB to allow memory overhead for Spark and JVM processes and to prevent out-of-memory errors.
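
The sizing arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch of the rule of thumb stated in this section; the function name `executor_plan` is hypothetical and not part of the project code.

```python
def executor_plan(total_cores, total_memory_gb, driver_memory_gb):
    """Return (executor instances, per-executor memory ceiling in GB)."""
    executors = total_cores - 1  # one core is reserved for the driver
    ceiling_gb = (total_memory_gb - driver_memory_gb) / executors
    return executors, ceiling_gb


executors, ceiling_gb = executor_plan(total_cores=8, total_memory_gb=128, driver_memory_gb=2)
print(executors, ceiling_gb)  # 7 18.0
```

Choosing 15 GB for spark.executor.memory leaves about 3 GB per executor below that ceiling for Spark and JVM overhead.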

Spark UI Screenshot

Below is the DataFrame output showing multiple executors active during data loading:

Data Exploration

Notebook: notebooks/01-exploration.ipynb

Number of Observations

The dataset contains 9,672,101 documents in the FineWeb-Edu Sample-10BT subset.

Column Descriptions

Column	Type	Description	Scale / Distribution
`id`	string	Document ID	Unique identifier
`text`	string	Document content	Variable length
`url`	string	Source URL	Diverse domains
`language`	string	Language	Predominantly English
`language_score`	float	Language confidence	0–1 (mostly >0.95)
`token_count`	integer	Token count	1–170K, median ~629
`score`	float	Quality score	2.5–5.34, median ~2.9
`int_score`	integer	Binned quality score	Buckets 3–5 (mostly 3)

Variable Types

Categorical variables: language, url
Continuous variables: language_score, token_count, score
Identifier column: id
Target variable (planned): score / int_score (educational quality)

Missing and Duplicate Values

No null values detected in core columns
No duplicate records based on id
Dataset is clean and suitable for modeling

Distribution Insights

Token count is highly right-skewed with median ~629 and mean ~1031 tokens, extending up to 170K tokens. This indicates high variability in document length.
Quality score is moderately right-skewed, ranging from 2.5 to 5.34 with median ~2.9 and mean ~3.0. Most documents fall in quality bucket 3, with fewer in bucket 5.
Language distribution is strongly dominated by English content.
Domain distribution shows exceptional diversity with over 2 million unique sources. Top domains (Wikipedia, Britannica) each represent <1% of the total dataset.

Spark Operations Used

The following Spark DataFrame operations were used:

df.count() — Total observation count
df.printSchema() — Schema inspection
df.describe().show() — Summary statistics
df.groupBy().agg() — Aggregations by quality bucket
df.select().distinct().count() — Unique domain counts
df.dropDuplicates() — Duplicate detection

Data Visualizations

1. Documents per Quality Bucket (Bar Chart)

The majority of documents fall into quality bucket 3 (86.7%), followed by bucket 4 (13.2%), with very few in bucket 5 (0.08%). This confirms a severe class imbalance that must be considered during model training.

2. Top Domains (Bar Chart)

Unique Domains: 2,088,546

The dataset exhibits exceptional domain diversity with over 2 million unique sources. Top domains include educational websites (Wikipedia, Britannica), science news (Phys.org, ScienceDaily), and academic resources. Even the top domain (Wikipedia) represents <1% of total documents, confirming no single source dominates the dataset.

3. Token Count Distribution (Histogram)

Note: While the full dataset contains 9.67M documents, the histograms below use a random 2% sample (~193K documents) to maintain performance while accurately representing the data distribution.

The token count distribution is heavily right-skewed. The median document length is ~629 tokens, while the mean (~1031 tokens) is larger due to a long tail extending up to 170K tokens. This indicates high variability in document length and suggests potential filtering or normalization during preprocessing.

4. Quality Score Distribution (Histogram)

The quality score ranges from approximately 2.5 to 5.34, with a median of ~2.9 and mean ~3.0. The distribution is moderately right-skewed, with most documents concentrated in the lower-to-mid score range (int_score = 3). Very high-quality documents (score bucket 5) are relatively rare.

Preprocessing

Notebook: notebooks/02-preprocessing.ipynb

Distributed Computing Setup

Preprocessing was performed on SDSC Expanse with the following Spark configuration:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("FineWeb_Preprocessing_Pipeline") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.instances", "31") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.memory", "3g") \
    .getOrCreate()

Total cores: 32 (1 driver + 31 executors)
Total memory: ~128 GB
Executor memory justification: (128g − 2g) / (31 × 1.1 overhead) ≈ 3.7g, rounded down to 3g. The resulting footprint is 31 × 3g × 1.1 + 2g ≈ 104 GB, safely within 128 GB
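
The memory budget in this justification can be checked with a short calculation. The sketch below assumes the 1.1 overhead factor quoted above; both function names are hypothetical.

```python
def executor_memory_ceiling_gb(total_gb, driver_gb, executors, overhead=1.1):
    """Largest per-executor heap that fits once overhead is included."""
    return (total_gb - driver_gb) / (executors * overhead)


def total_footprint_gb(executor_gb, driver_gb, executors, overhead=1.1):
    """Memory actually requested: executors with overhead, plus the driver."""
    return executors * executor_gb * overhead + driver_gb


ceiling = executor_memory_ceiling_gb(total_gb=128, driver_gb=2, executors=31)
footprint = total_footprint_gb(executor_gb=3, driver_gb=2, executors=31)
print(f"ceiling = {ceiling:.2f} GB, footprint = {footprint:.1f} GB")
# ceiling = 3.70 GB, footprint = 104.3 GB
```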

Spark Executors

Preprocessing Pipeline

1. Filtering Invalid Documents

Documents were filtered using Spark's df.filter() with the following criteria:

Text is not null and longer than 200 characters
token_count ≥ 50
language == "en"
language_score ≥ 0.9
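
A minimal sketch of this filter, written as a Spark SQL expression string of the kind `df.filter()` accepts. It assumes `df` has the columns described under Data Exploration. The names `VALID_DOCUMENT_FILTER` and `filter_valid_documents` are hypothetical.

```python
# The criteria above combined into one Spark SQL predicate.
VALID_DOCUMENT_FILTER = " AND ".join([
    "text IS NOT NULL",
    "length(text) > 200",        # longer than 200 characters
    "token_count >= 50",
    "language = 'en'",
    "language_score >= 0.9",
])


def filter_valid_documents(df):
    """Keep only rows of a Spark DataFrame that satisfy every criterion."""
    return df.filter(VALID_DOCUMENT_FILTER)
```

Keeping the predicate in a single string makes it easy to log alongside the row counts before and after filtering.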

2. Duplicate Check

Duplicates were checked by comparing total count vs. distinct id count. No duplicates were found.

3. Stratified Sampling

To address class imbalance, stratified sampling was applied using sampleBy:

int_score	Sample Fraction
3	5%
4	30%
5	100%
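
The sketch below shows how these fractions could be passed to `sampleBy`, together with a small calculation of the class mix they should produce. The seed value and function names are assumptions. The input shares are the bucket percentages from the exploration section, measured before filtering, so the result is only approximate.

```python
# Per-class sampling fractions from the table above.
FRACTIONS = {3: 0.05, 4: 0.30, 5: 1.00}


def stratified_sample(df, seed=42):
    """Sample each int_score bucket of a Spark DataFrame at its own rate.

    The seed value is an assumption; the source does not state one.
    """
    return df.sampleBy("int_score", fractions=FRACTIONS, seed=seed)


def expected_class_shares(original_shares, fractions=FRACTIONS):
    """Expected class mix after sampling, given the mix before sampling."""
    kept = {label: share * fractions[label] for label, share in original_shares.items()}
    total = sum(kept.values())
    return {label: value / total for label, value in kept.items()}


# Bucket shares reported in the exploration section (before filtering).
shares = expected_class_shares({3: 0.867, 4: 0.132, 5: 0.0008})
print({label: round(share, 3) for label, share in shares.items()})
```

Under these assumptions, buckets 3 and 4 end up at roughly 52% and 47% of the sample, while bucket 5 remains near 1% even when every such document is kept.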

4. Feature Engineering

Two new columns were added using Spark SQL functions:

domain — extracted from the URL using regexp_extract
length_bucket — categorizes documents by token_count:
- Short: < 500 tokens
- Medium: 500–2000 tokens
- Long: > 2000 tokens
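
A hedged sketch of both columns, expressed through `selectExpr` with Spark SQL strings rather than `pyspark.sql.functions` calls. The URL pattern and the exact bucket label strings are assumptions, since the source does not show them.

```python
import re

# Assumed pattern: the host is everything between the scheme and the next slash.
DOMAIN_PATTERN = r"^https?://([^/]+)"

# Boundaries follow the list above: < 500, 500 to 2000, > 2000 tokens.
LENGTH_BUCKET_SQL = (
    "CASE WHEN token_count < 500 THEN 'Short' "
    "WHEN token_count <= 2000 THEN 'Medium' "
    "ELSE 'Long' END"
)


def add_features(df):
    """Add domain and length_bucket columns to a Spark DataFrame."""
    return df.selectExpr(
        "*",
        f"regexp_extract(url, '{DOMAIN_PATTERN}', 1) AS domain",
        f"{LENGTH_BUCKET_SQL} AS length_bucket",
    )


# The same pattern can be checked locally with Python's re module.
print(re.match(DOMAIN_PATTERN, "https://en.wikipedia.org/wiki/Apache_Spark").group(1))
# en.wikipedia.org
```

Because rows with fewer than 50 tokens or a null token_count were filtered out earlier, the ELSE branch only ever sees documents above 2000 tokens.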

5. Label Column

The target variable was cast to integer:

df = df.withColumn("label", F.col("int_score").cast(IntegerType()))

6. Save

The preprocessed DataFrame was saved to data/interim/fineweb_preprocessed.parquet.

Model 1 — Random Forest

Notebook: notebooks/03-model-1.ipynb

Methods

Distributed Computing Setup

Training was performed on SDSC Expanse using Ray + RayDP:

import raydp

spark = raydp.init_spark(
    app_name="FineWeb_Spark_on_Ray",
    num_executors=31,
    executor_cores=1,
    executor_memory="4GB",
)

Data

A 20% random sample (~142,415 rows) of the preprocessed data was used, split 60 / 20 / 20 into train, validation, and test sets.
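
A minimal sketch of the sampling and split, assuming `df` is the preprocessed Spark DataFrame. The seed and function names are hypothetical, and `randomSplit` produces sizes that only approximate the requested weights.

```python
def sample_and_split(df, fraction=0.2, weights=(0.6, 0.2, 0.2), seed=42):
    """Draw a random sample of a Spark DataFrame, then split it three ways."""
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)
    train, val, test = sampled.randomSplit(list(weights), seed=seed)
    return train, val, test


def expected_split_sizes(rows, weights=(0.6, 0.2, 0.2)):
    """Approximate row counts per split; actual sizes vary slightly."""
    return [round(rows * w) for w in weights]


print(expected_split_sizes(142_415))  # [85449, 28483, 28483]
```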

Feature Engineering Pipeline

Stage	Description
`RegexTokenizer`	Splits text into lowercase tokens on non-word characters
`StopWordsRemover`	Removes common English stop words
`Word2Vec`	10-dimensional embeddings; `minCount=500`
`Imputer`	Fills missing `token_count` with the column mean
`VectorAssembler`	Wraps `token_count` into a vector
`StandardScaler`	Standardizes `token_count` to zero mean and unit variance
`VectorAssembler`	Combines text embeddings + scaled `token_count` into `features`
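
The stages in this table could be assembled into a `pyspark.ml` Pipeline roughly as follows. This is a sketch under assumptions: the intermediate column names are invented for illustration, and only `vectorSize=10` and `minCount=500` come from the table.

```python
EMBEDDING_DIM = 10   # Word2Vec vector size from the table
MIN_COUNT = 500      # Word2Vec minCount from the table
FEATURE_DIM = EMBEDDING_DIM + 1  # embeddings plus scaled token_count


def build_feature_pipeline():
    """Assemble the seven stages as a Pipeline (column names are assumed)."""
    # Imported here so the constants above are usable without a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        Imputer, RegexTokenizer, StandardScaler,
        StopWordsRemover, VectorAssembler, Word2Vec,
    )

    stages = [
        # Splits on non-word characters; lowercasing is the default.
        RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens"),
        Word2Vec(inputCol="filtered_tokens", outputCol="text_vec",
                 vectorSize=EMBEDDING_DIM, minCount=MIN_COUNT),
        # Imputer expects a numeric column; on older Spark versions
        # token_count may need a cast to double first.
        Imputer(inputCols=["token_count"], outputCols=["token_count_filled"],
                strategy="mean"),
        VectorAssembler(inputCols=["token_count_filled"],
                        outputCol="token_count_vec"),
        StandardScaler(inputCol="token_count_vec", outputCol="token_count_scaled",
                       withMean=True, withStd=True),
        VectorAssembler(inputCols=["text_vec", "token_count_scaled"],
                        outputCol="features"),
    ]
    return Pipeline(stages=stages)
```

Fitting such a pipeline on the training split only, then applying the fitted model to the validation and test splits, keeps the Word2Vec vocabulary and scaler statistics from leaking information across splits.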

Models Trained

Two RandomForestClassifier models were compared:

from pyspark.ml.classification import RandomForestClassifier

# Model 1
rf1 = RandomForestClassifier(numTrees=5, maxDepth=2)

# Model 2
rf2 = RandomForestClassifier(numTrees=20, maxDepth=8)

Results

Model	Train	Val	Test	Train–Test Gap
RF (numTrees=5, maxDepth=2)	0.6149	0.6101	0.6047	0.0102
RF (numTrees=20, maxDepth=8)	0.6980	0.6807	0.6837	0.0143

Best model: RandomForestClassifier (numTrees=20, maxDepth=8) — saved to models/.
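
The gap column and the accuracy difference between the two configurations follow directly from the table. A short worked check in Python (the variable and function names are illustrative only):

```python
results = {
    "RF (numTrees=5, maxDepth=2)": {"train": 0.6149, "val": 0.6101, "test": 0.6047},
    "RF (numTrees=20, maxDepth=8)": {"train": 0.6980, "val": 0.6807, "test": 0.6837},
}


def gap_pp(metrics):
    """Train minus test accuracy, in percentage points."""
    return round((metrics["train"] - metrics["test"]) * 100, 2)


for name, metrics in results.items():
    print(f"{name}: gap = {gap_pp(metrics)} pp")

small, large = results.values()
gain_pp = round((large["test"] - small["test"]) * 100, 2)
print(f"test accuracy gain = {gain_pp} pp")
```

This gives gaps of 1.02 pp and 1.43 pp, and a test accuracy gain of 7.9 pp for the larger forest.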

Discussion

Model 1 sits toward underfitting — the 1.0 pp train–test gap shows minimal overfitting, but maxDepth=2 limits the model to simple decision boundaries.
Model 2 shows mild overfitting (1.4 pp gap) but delivers +7.9 pp test accuracy, sitting in a better position on the bias–variance tradeoff.
Word2Vec embeddings + token count carry real signal, but richer features (TF-IDF, sentence count, punctuation density) could further improve performance.
Distributed computing was essential: Model 1 trained in ~4 s and Model 2 in ~71 s across 31 Ray executors — infeasible on a single machine at this data scale.

Prediction Analysis

Model 2 — Random Forest PCA

Notebook: notebooks/04-model-2-pca.ipynb

Methods

Distributed Computing Setup

Training was performed on SDSC Expanse using Ray + RayDP:

import raydp

spark = raydp.init_spark(
    app_name="FineWeb_Spark_on_Ray",
    num_executors=31,
    executor_cores=1,
    executor_memory="4GB",
)

Data

A 20% random sample (~142,623 rows) of the preprocessed data was used, split 60 / 20 / 20 into train, validation, and test sets.

Feature Engineering Pipeline

Stage	Description
`RegexTokenizer`	Splits text into lowercase tokens on non-word characters
`StopWordsRemover`	Removes common English stop words
`Word2Vec`	10-dimensional embeddings; `minCount=500`
`Imputer`	Fills missing `token_count` with the column mean
`VectorAssembler`	Wraps `token_count` into a vector
`StandardScaler`	Standardizes `token_count` to zero mean and unit variance
`VectorAssembler`	Combines text embeddings + scaled `token_count` into `features`
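
The table stops at the 11-dimensional `features` vector and does not list the PCA step that gives this model its name. Assuming the reduction to k=5 reported in the Conclusion is applied to `features` before the classifier, the extra stages might look like the sketch below. The output column name and function name are hypothetical.

```python
PCA_COMPONENTS = 5  # k reported in the Conclusion


def build_pca_forest(num_trees=80, max_depth=10):
    """Return [PCA, RandomForestClassifier] stages reading from `features`."""
    # Imported lazily so this snippet loads without a Spark installation.
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import PCA

    pca = PCA(k=PCA_COMPONENTS, inputCol="features", outputCol="pca_features")
    forest = RandomForestClassifier(
        featuresCol="pca_features", labelCol="label",
        numTrees=num_trees, maxDepth=max_depth,
    )
    return [pca, forest]
```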

Models Trained

Two RandomForestClassifier models were compared:

from pyspark.ml.classification import RandomForestClassifier

# Model 1
rf1 = RandomForestClassifier(numTrees=20, maxDepth=6)

# Model 2
rf2 = RandomForestClassifier(numTrees=80, maxDepth=10)

Results

Model	Train	Val	Test	Train–Test Gap
RF (numTrees=20, maxDepth=6)	0.6381	0.6355	0.6322	0.0059
RF (numTrees=80, maxDepth=10)	0.6698	0.6439	0.6434	0.0264

Best model: RandomForestClassifier (numTrees=80, maxDepth=10) — saved to models/.

Discussion

Model 1 sits near mild underfitting (small gap, limited capacity).
Model 2 shows mild overfitting (2.64 pp gap) but is well-controlled and delivers the best accuracy.
Reducing 11 → 5 dimensions costs ~4.0 pp accuracy but cuts training time from ~13 min to under 1 min.
Distributed computing was essential: Model 1 trained in ~5.5 s and Model 2 in ~58.6 s across 31 Ray executors — infeasible on a single machine at this data scale.

Prediction Analysis

Conclusion

Model 1 — No PCA (03-model-1.ipynb)

Trained on the full 11-D feature vector. Best configuration: RF (numTrees=20, maxDepth=8) → 68.37% test accuracy, 1.43 pp train–test gap. Good generalisation; the pipeline signal (Word2Vec + token count) is confirmed.

Model 2 — PCA k=5 (04-model-2-pca.ipynb)

Trained on 5 PCA components. Best configuration: RF (numTrees=80, maxDepth=10) → 64.34% test accuracy, 2.64 pp gap. PCA cuts training time dramatically but loses ~4.0 pp of accuracy due to aggressive compression.

Key finding: PCA (k=5) is too aggressive here — the lost variance carries real discriminative signal for the Random Forest.

What can be done to improve it?

Use a larger k (8–10) or inspect the explained-variance curve to pick the elbow
Enrich the input features before PCA (TF-IDF, stylistic features) to make compression less lossy
Try GBTClassifier which may extract more signal from the reduced representation
Include label=5 and train a proper 3-class model
Scale to a larger data fraction to reduce variance across all metrics
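
For the first suggestion, one simple rule is to pick the smallest k whose cumulative explained variance reaches a chosen threshold. The sketch below uses made-up variance ratios purely for illustration. With a fitted `pyspark.ml` PCAModel the input would come from `pca_model.explainedVariance.toArray()`. The function name and the threshold are assumptions.

```python
from itertools import accumulate


def smallest_k(explained_variance, threshold=0.95):
    """Smallest number of components whose cumulative variance reaches threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance), start=1):
        if cumulative >= threshold:
            return k
    return len(explained_variance)


# Made-up ratios for illustration only, not measured on this dataset.
example = [0.40, 0.20, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02]
print(smallest_k(example, threshold=0.90))  # 5
```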

Statement of Collaboration

Michael: Team Leader / Project Manager / Coder and Writer: Contribution
Led the project by organizing tasks, coordinating the team, and ensuring deadlines were met. Contributed to both the coding and the project write-up, supporting the overall implementation and documentation. Regularly asked the team for feedback on the documentation to improve clarity and quality.

Sopan: Coder and Writer: Contribution
Contributed to both the coding and the project write-up. Assisted with implementation, debugging, and documenting the project. Collaborated with the team to support progress and ensure completion of assigned tasks.

Justin: Coder and Writer: Contribution
Contributed to both the coding and write-up of the project. Supported the team leader by helping implement key components, debugging issues, and keeping progress on track. Actively checked in with teammates and asked if they needed help, providing support where needed. Assisted in writing and refining the final report and collaborated closely with the team.

Quick Setup

git clone https://github.com/mn-cs/fineweb-spark
cd fineweb-spark

Documentation

Collaboration Guide - Team workflow and git instructions
