Research project comparing multiple approaches to logical fallacy identification in online news comments. Built on the CoCoLoFa dataset.
Both tasks are evaluated across all approaches:
- detection - binary: does the comment contain a logical fallacy? (
yes/no) - classification - multiclass: which of 9 labels applies? (8 fallacy types +
none)
| Approach | Directory | Description |
|---|---|---|
| Zero-shot prompting | baseline/ |
Prompt an instruction-tuned LLM directly |
| Classic encoder SFT | baseline/classic_models/ |
Fine-tune RoBERTa / ModernBERT classifiers |
| Curriculum learning | curriculum_learning/ |
Encoder SFT staged easy->hard by annotator agreement |
| LoRA SFT of LLMs | lora/ |
Parameter-efficient fine-tuning of causal LLMs |
| Dynamic (retrieval) prompting | prompting/dynamic_prompting/ |
Retrieval-augmented few-shot prompting |
| DSPy prompt optimization | prompting/optimization/ |
Automatically optimize prompts (COPRO / MIPROv2) |
The dataset is a list of news articles, each containing crowd-written
comments. Every comment carries a fallacy field with one of:
appeal to authority appeal to majority appeal to nature
appeal to tradition appeal to worse problems
false dilemma hasty generalization slippery slope none
All experiments flatten articles -> comments and map the fallacy field to a
task label:
- classification uses the raw label (all 9 classes).
- detection maps
none->noand everything else ->yes.
none dominates the dataset, so class imbalance is a recurring concern
(motivates the balanced_knn retrieval strategy and the curriculum schedule).
Most experiments share the same conventions so results are comparable:
- Output format (generative models). The model is asked to reason briefly
and produce its answer inside
\boxed{...}(e.g.\boxed{yes}or\boxed{slippery slope}). A regex parser extracts the last valid boxed label. Unparseable outputs are recorded asunknown. - Metrics. All experiments report accuracy and macro precision / recall /
F1 via
scikit-learn, plus a fullclassification_report. - Models evaluated. Generative:
google/gemma-4-E2B-it,google/gemma-4-E4B-it,mistralai/Ministral-3-8B-Instruct-2512-BF16,mistralai/Ministral-3-8B-Reasoning-2512,Qwen/Qwen3.5-9B. Encoder:FacebookAI/roberta-base,answerdotai/ModernBERT-base. - Reasoning toggle. Several scripts expose
--enable-reasoning/--reasoning(and a larger token budget) for thinking model variants.
Generative experiments can run in two ways:
- Local Hugging Face
transformers- the model is loaded directly on a GPU (baseline/prompting.py,lora/sft_lora.py). - OpenAI-compatible HTTP endpoint - either a local
vLLM server or the OpenRouter
API ,dynamic-prompting CLI, DSPy
optimizers). Base URL is configured via
LLM_URL/--base-url.
Encoder and LoRA training assume a CUDA GPU. Most long-running jobs are designed as SLURM array jobs on an HPC cluster, one array index per model/task (and sometimes reasoning/context) combination.
The repository is a uv workspace. The root
package and prompting/dynamic_prompting share one environment.
prompting/optimization is a standalone package with its own dependencies.
Requires Python >= 3.12 and uv.
# Root workspace (baseline, lora, curriculum_learning, dynamic_prompting)
uv sync
uv sync --package dynamic-promptingA pinned pip alternative is available in requirements.txt.
The DSPy optimization package has a separate environment:
cd prompting/optimization
uv syncprompting/dynamic_prompting/.env.example- copy to.env. SetsDB_URL(Qdrant) andLLM_URL(the LLM endpoint).prompting/optimization/.env.example- copy to.env. SetsOPENAI_API_KEYandHF_TOKEN.OPENROUTER_API_KEYis read by the OpenRouter-backed scripts.
The dynamic-prompting retrieval store can run as a local container:
cd prompting/dynamic_prompting
docker compose up -d # Qdrant on localhost:6333 (REST) / 6334 (gRPC)Alternatively the CLI supports an embedded file-backed Qdrant via --db-path
(no server needed).
dataset/ contains train.json, dev.json, and test.json. See
dataset/README.md for the full CoCoLoFa description, statistics, schema, and
citation.
Prompt a model with only the comment text (no article context, no few-shot examples) and parse the boxed answer.
Local, GPU-based (baseline/prompting.py):
uv run baseline/prompting.py dataset/test.json \
--model gemma-4-e2b \
--task detection \
--output results/gemma-4-e2b_detection_results.json
# add --enable-reasoning --max-new-tokens 4000 for thinking variantsBatch all model/task/reasoning combinations on SLURM:
scripts/run_prompting_baseline.sh.
Supervised fine-tuning of encoder classifiers
(AutoModelForSequenceClassification) for binary or multiclass.
Optionally prepends the parent article as context.
uv run python baseline/classic_models/sft_model.py \
--model-name FacebookAI/roberta-base \
--task multiclass \
--epochs 5 --batch-size 16 --lr 2e-5
# add --use-context to prepend the article textSLURM arrays: train.sh (no context) and train_ctx.sh (with context) sweep
RoBERTa/ModernBERT x binary/multiclass. Evaluation and confusion-matrix plots:
baseline/classic_models/evaluate/evaluate.py (driver: evaluate.sh).
Encoder SFT where training proceeds through three cumulative stages ordered by inter-annotator-agreement difficulty. Clear-cut classes first, ambiguous classes last. Logits for not-yet-active classes are masked during each stage.
uv run python curriculum_learning/cl_train.py \
--model-name roberta \
--task multiclass \
--epochs-per-stage 5 --batch-size 16 --lr 2e-5
# add --use-context to prepend the article textSLURM array (train.sh) sweeps roberta/modernbert x binary/multiclass x context.
Evaluation: curriculum_learning/evaluate/evaluate.py (driver: evaluate.sh).
Parameter-efficient fine-tuning of causal LLMs (Gemma / Ministral / Qwen) to
produce the boxed answer format, then evaluate on the test set. Supports both
tasks and a --reasoning mode.
uv run python lora/sft_lora.py \
dataset/train.json dataset/dev.json \
results/<model>_<task>_sft_results.json \
--model google/gemma-4-E2B-it \
--task detection \
--test_file dataset/test.json \
--num_train_epochs 3 --lora_r 16 --lora_alpha 16 --lora_dropout 0.05SLURM scripts:
lora/run_sft.sh- train + evaluate (instruct models).lora/run_sft_reasoning.sh- train + evaluate with--reasoning.lora/run_eval.sh- evaluate only, reusing a saved adapter (--skip-training).
Retrieval-augmented few-shot prompting. Training comments are embedded with a
sentence-transformer and stored in Qdrant. At inference time the k most
relevant examples are retrieved and inserted as few-shot demonstrations before
the query comment.
Retrieval strategies:
knn- nearest neighbours over comment-text embeddings.balanced_knn- class-aware retrieval that counteracts thenone-heavy imbalance. Even fallacious/non-fallacious split for detection, one query per class for classification.
The package exposes a CLI with two subcommands:
cd prompting/dynamic_prompting
# 1. Embed the training data into the vector store
uv run python -m dynamic_prompting setup ../../dataset/train.json \
--embedding-model sentence-transformers/all-mpnet-base-v2 \
--db-url http://localhost:6333 # or --db-path <local_dir>
# 2. Evaluate with a chosen retrieval strategy against an LLM endpoint
uv run python -m dynamic_prompting eval ../../dataset/test.json \
--model google/gemma-4-e2b-it \
--task detection \
--strategy balanced_knn \
--k 5 \
--output results/detection_gemma_balanced_knn.jsonHelper scripts:
scripts/run_dynamic_prompting_eval.sh- HPC/SLURM run: embeds the data, starts a local vLLM server, evaluates all strategies.scripts/run_local_openrouter.sh- local run: brings up Qdrant via Docker, evaluates all model/task/strategy combinations through OpenRouter.
Automatic prompt optimization with DSPy, evaluated against a vLLM endpoint:
- COPRO (
optimize_copro) - zero-shot instruction optimization (hill-climbing over candidate instructions). - MIPROv2 (
optimize_mipro) - joint instruction + few-shot demonstration optimization via Bayesian search.
cd prompting/optimization
uv run python -m optimization.optimize_mipro \
--train-data ../../dataset/train.json \
--dev-data ../../dataset/dev.json \
--task detection \
--model google/gemma-4-e4b-it \
--base-url http://127.0.0.1:8000/v1 \
--output-dir results/optimization/mipro/detection_gemma-e4bSLURM scripts scripts/run_copro.sh and scripts/run_mipro.sh start a vLLM
server and run the optimizer for 2 models x 2 tasks. Each run saves the compiled
program, the optimized instruction, and an eval_results.json.
Evaluation outputs are committed as JSON:
results/- LoRA SFT runs (<model>_<task>_sft[_reasoning]_results.json) and the prompting baselines underresults/baseline/prompting/.prompting/dynamic_prompting/results/- dynamic-prompting runs, named<task>_<model>_<strategy>[_<reasoning>].
All result files share a common schema:
{
"metrics": { "accuracy": 0.0, "precision_macro": 0.0, "recall_macro": 0.0, "f1_macro": 0.0 },
"results": [
{ "comment": "...", "true_label": "...", "predicted_label": "...", "raw_response": "..." }
]
}Some runs additionally include metadata / classification_report and, for
dynamic prompting, the retrieved_ids used for each prediction.