
Commit 10f517b

feat: ML module, benchmarks, GitHub Releases model storage

- Add badwords.ml with ToxicityPredictor (badwords-py[ml])
- Model path: BADWORDS_ML_PATH, ml/models (dev), cache, GitHub Releases
- Benchmarks: rule-based + ML comparison vs glin-profanity
- Throughput in benchmark output
- make ml-package for GitHub Release asset
- make bench-compare for BadWords vs glin

Made-with: Cursor

1 parent e98ab2c

16 files changed: 1158 additions & 1 deletion

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@ __pycache__/
 .ruff_cache
 .mypy_cache
 dist
+release/
+ml/data/
+ml/models/
+badwords-ml-model.zip
 *.egg-info
 .idea
 t.py

Makefile

Lines changed: 34 additions & 1 deletion
@@ -1,4 +1,4 @@
-.PHONY: develop build test test-rust test-python test-wasm bench bench-rust bench-python wasm wasm-nodejs npm-publish lang-packages npm-publish-languages
+.PHONY: develop build test test-rust test-python test-wasm bench bench-rust bench-python bench-compare wasm wasm-nodejs npm-publish lang-packages npm-publish-languages
 
 develop:
 	cd python && maturin develop
@@ -26,6 +26,11 @@ test-wasm:
 
 bench: bench-rust bench-python
 
+bench-compare:
+	@echo "BadWords vs glin-profanity (requires: pip install glin-profanity)"
+	@if [ -d .venv ]; then .venv/bin/python scripts/bench_compare.py; \
+	else python3 scripts/bench_compare.py; fi
+
 bench-rust:
 	cargo bench -p badwords-core
 
@@ -49,3 +54,31 @@ lang-packages:
 
 npm-publish-languages:
 	cd js/languages && npm publish --access public
+
+# ML training (requires: pip install -r ml/requirements.txt)
+ml-prepare:
+	cd ml && python prepare_data.py --preset multilingual
+
+# Full dataset (~600k samples, ~8-10h training with xlm-roberta)
+ml-prepare-full:
+	cd ml && python prepare_data.py --preset multilingual --max-total 600000
+
+# Max dataset (no cap, ~1M+ samples, ~15-20h)
+ml-prepare-max:
+	cd ml && python prepare_data.py --preset multilingual
+
+ml-train:
+	cd ml && python train.py
+
+ml-test:
+	cd ml && python test_inference.py
+
+# Quantize model: 500MB -> ~135MB
+ml-quantize:
+	cd ml && python quantize_model.py
+
+# Package ML model for GitHub Release (upload as badwords-ml-model.zip)
+ml-package:
+	@if [ ! -f ml/models/model.onnx ]; then echo "Run ml-train and ml-quantize first"; exit 1; fi
+	(cd ml/models && zip -r ../../badwords-ml-model.zip . -x "checkpoints/*" -x "checkpoints/*/*")
+	@echo "Created badwords-ml-model.zip — upload to GitHub Release"
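The new bench-compare target delegates to scripts/bench_compare.py, which is not shown in this commit view. A minimal timing harness of the same shape, with a stand-in checker (`bench`, `report`, and `contains_bad` are illustrative; the real script would time badwords against glin-profanity):

```python
import time

def bench(fn, *args, iters=1000):
    """Time fn over iters calls; return mean seconds per call."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

def report(name, per_call):
    """Print latency and implied throughput, like the benchmark output."""
    print(f"{name}: {per_call * 1e6:.1f} µs/call ({1 / per_call:,.0f}/s)")

# Stand-in for a profanity check; the real script would call
# badwords.filter_text and glin-profanity's checker on the same inputs.
contains_bad = lambda text: "badword" in text.lower()

report("clean text", bench(contains_bad, "a perfectly clean sentence"))
report("bad word", bench(contains_bad, "contains a BadWord here"))
```

Reporting both µs/call and calls/s matches the "throughput in benchmark output" item from the commit message.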

README.md

Lines changed: 46 additions & 0 deletions
@@ -29,6 +29,7 @@ with multilingual support and evasion detection.**
 
 [Installation](#-installation)
 [Quick Start](#-quick-start)
+[Benchmarks](#-benchmarks)
 [Supported Languages](#-supported-languages)
 [Evasion Detection](#-advanced-evasion-detection)
 [Documentation](https://badwords.flacsy.dev)
@@ -96,6 +97,51 @@ print(clean_text)  # "Some very *** text here"
 
 ---
 
+## ⏱ Benchmarks
+
+| CPU | GPU | RAM | OS |
+|-----|-----|-----|----|
+| Intel® Core™ i7-10700KF × 16 (x86_64) | NVIDIA GeForce RTX™ 3070 | 64 GB DDR4 3200 MHz | Ubuntu 24.04.2 LTS |
+
+Rule-based matching (en+ru, `match_threshold=1.0`). Run: `make bench`
+
+| Scenario | Rust (badwords-core) | Python (badwords-py) |
+|----------|----------------------|----------------------|
+| Clean text (no match) | ~7.6 µs (~130 K/s) | ~7.7 µs (~130 K/s) |
+| Bad word (match) | ~3.1 µs (~320 K/s) | ~2.7 µs (~370 K/s) |
+| Censor (replace) | ~2.8 µs (~360 K/s) | ~2.5 µs (~400 K/s) |
+| 5 texts batch | ~15 µs (~330 K/s) | ~16 µs (~310 K/s) |
+
+*Python calls the Rust core via PyO3, so the overhead is minimal.*
+
+### vs glin-profanity
+
+Rule-based mode, en+ru. Run: `make bench-compare` (requires `pip install glin-profanity`).
+
+| Scenario | BadWords | glin-profanity |
+|----------|----------|----------------|
+| Clean text | ~7 µs (~140 K/s) | ~4.4 ms (~230/s) |
+| Bad word | ~1.3 µs (~770 K/s) | ~0.2 ms (~5 K/s) |
+| Censor | ~1.8 µs (~560 K/s) | ~1.4 ms (~700/s) |
+| 5 texts batch | ~16 µs (~310 K/s) | ~10 ms (~500/s) |
+
+*BadWords is ~100–600× faster (Rust core vs pure Python).*
+
+### ML mode
+
+`pip install glin-profanity[ml]`, then `make bench-compare`. 100 iterations each.
+
+| Scenario | BadWords ML (ONNX) | glin transformer |
+|----------|--------------------|------------------|
+| Clean text (43 chars) | ~6.5 ms (~150/s) | ~27 ms (~37/s) |
+| Bad word (8 chars) | ~4.6 ms (~220/s) | ~21 ms (~47/s) |
+| 5 texts batch (82 chars) | ~24 ms (~210/s) | ~107 ms (~47/s) |
+
+*BadWords ML (XLM-RoBERTa) is ~3–4× faster than the glin transformer.*
+
+---
+
 ## 🛠 Methods & API
 
 ### `filter_text(text, match_threshold=1.0, replace_character=None)`
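The throughput columns in these benchmark tables are just the reciprocal of the per-call latency. A quick sanity check against two entries from the rule-based table:

```python
def throughput_per_s(latency_us: float) -> float:
    """Calls per second implied by a mean per-call latency in microseconds."""
    return 1e6 / latency_us

# Entries from the rule-based table: 7.6 µs -> ~130 K/s, 3.1 µs -> ~320 K/s.
assert round(throughput_per_s(7.6) / 1000) == 132  # ~130 K/s, clean text
assert round(throughput_per_s(3.1) / 1000) == 323  # ~320 K/s, bad word match
```

The same conversion explains the ML rows, where millisecond latencies drop throughput to the hundreds per second.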

ml/README.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+# ML Training Pipeline
+
+Data preparation and model training for `badwords-py[ml]`.
+
+## Setup
+
+```bash
+cd ml
+pip install -r requirements.txt
+```
+
+### CUDA (GPU)
+
+Install PyTorch with CUDA **before** the other deps:
+
+```bash
+# CUDA 12.4 (or cu121 for older drivers)
+pip install torch --index-url https://download.pytorch.org/whl/cu124
+```
+
+Check: `python -c "import torch; print(torch.cuda.is_available())"` should print `True`.
+
+The trainer uses the GPU automatically when available.
+
+## Usage
+
+### 1. Prepare data
+
+```bash
+# Quick (~30k): EN + RU + 9 languages
+python prepare_data.py --preset multilingual
+
+# Full (~600k): SetFit + civil_comments + paradetox
+python prepare_data.py --preset multilingual --max-total 600000
+
+# Max (~1M+): all data, no cap
+python prepare_data.py --preset multilingual
+
+# English only
+python prepare_data.py --preset toxic_conversations --max-samples 200000
+
+# Single dataset
+python prepare_data.py --preset single --dataset SetFit/toxic_conversations
+```
+
+### 2. Train model
+
+```bash
+# Default: xlm-roberta (best quality), then quantize
+python train.py --epochs 2 --batch-size 8
+
+# Lighter: distilbert (faster training)
+python train.py --model distilbert-base-multilingual-cased --epochs 2 --batch-size 32
+```
+
+Output: `models/` (ONNX + tokenizer)
+
+### 3. Quantize (~4x smaller, recommended)
+
+```bash
+python quantize_model.py
+# xlm-roberta: 500MB -> ~135MB
+# distilbert: 250MB -> ~65MB
+```
+
+## Datasets
+
+| Preset | Sources | Languages | Size |
+|--------|---------|-----------|------|
+| `multilingual` | SetFit + paradetox + ru_paradetox + multilingual_paradetox | EN, RU, UK, DE, ES, AR, ZH, HI, AM | 30k+ |
+| `toxic_conversations` | SetFit/toxic_conversations | EN | 1.8M |
+| `civil_comments` | google/civil_comments | EN | 2M |
+| `paradetox` | s-nlp/paradetox | EN | 20k |
+| `ru_paradetox` | s-nlp/ru_paradetox | RU | 12k |
+
+## Model
+
+- **Default:** `xlm-roberta-base` (best quality, ~135MB after quantize)
+- **Lighter:** `distilbert-base-multilingual-cased` (~65MB after quantize, faster training)
+- Task: binary classification (offensive probability)
+- Output: ONNX for inference