# CHiPSAL 2026 Subtask A: Hate Speech Detection in Nepali Memes

This repository contains the solution for Subtask A of the CHiPSAL 2026 Shared Task: hate speech detection in Nepali-only memes.

**Objective:** Detect the presence of hate speech in monolingual Nepali memes.

- Label 0: Non-Hate
- Label 1: Hate
- Evaluation metric: macro F1-score
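Macro F1 averages the per-class F1 scores, so Hate and Non-Hate count equally even under label imbalance. A minimal sketch of the metric with scikit-learn (the label arrays are illustrative):

```python
from sklearn.metrics import f1_score

# Illustrative labels only: 0 = Non-Hate, 1 = Hate
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# average="macro" takes the unweighted mean of per-class F1,
# so the minority class matters as much as the majority class.
print(f1_score(y_true, y_pred, average="macro"))
```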
## Project Structure

```
ChipsalRepo/
├── config.py                 # Configuration and hyperparameters
├── requirements.txt          # Python dependencies
├── train_simple.py           # Quick-start training script
├── train.csv                 # Training data (index, label, text)
├── index_label_train.csv     # Training labels
├── OCR_Dataset_Image.csv     # Pre-extracted OCR text
│
├── src/
│   ├── __init__.py
│   ├── data_exploration.py   # Data analysis and visualization
│   ├── ocr_extraction.py     # OCR text extraction using EasyOCR
│   ├── dataset.py            # PyTorch Dataset classes
│   ├── models.py             # Model architectures (Text/Image/Multimodal)
│   ├── train.py              # Full training pipeline with K-Fold CV
│   └── inference.py          # Prediction and submission generation
│
├── train/
│   └── train_images/         # Training meme images
│
├── eval/                     # Evaluation data (download from competition)
│   └── eval_images/
│
├── test/                     # Test data (download from competition)
│   └── test_images/
│
├── data/                     # Processed data
│   └── ocr_train_extracted.csv
│
└── outputs/
    ├── models/               # Saved model checkpoints
    ├── submissions/          # Generated submission files
    └── logs/                 # Training logs
```
## Setup

Install the Python dependencies:

```bash
pip install -r requirements.txt
```

Download the images from the competition links and place them as follows:

- Training: `train/train_images/`
- Evaluation: `eval/eval_images/`
- Test: `test/test_images/`
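An optional sanity check that the expected directories are in place (this helper script is not part of the repo; paths follow the layout above):

```python
from pathlib import Path

# Directories expected by the training and inference scripts
for split in ["train/train_images", "eval/eval_images", "test/test_images"]:
    p = Path(split)
    n = len(list(p.glob("*"))) if p.exists() else 0
    print(f"{split}: {'OK' if p.exists() else 'MISSING'} ({n} files)")
```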
## Quick Start

```bash
python train_simple.py
```

This will:

- Load the training data
- Train a multimodal model (XLM-RoBERTa + ResNet50; a fusion sketch follows this list)
- Save the best model to `outputs/models/`
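For orientation, here is a minimal sketch of the concat-fusion idea behind the multimodal model. The class name and dimensions are illustrative, not the repo's actual `src/models.py`:

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import AutoModel

class ConcatFusionModel(nn.Module):
    """Illustrative text+image classifier: XLM-RoBERTa + ResNet50, concatenated."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("xlm-roberta-base")    # 768-d
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])    # 2048-d
        self.classifier = nn.Linear(768 + 2048, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_feat = text_out.last_hidden_state[:, 0]             # first-token embedding
        img_feat = self.image_encoder(pixel_values).flatten(1)   # global-pooled features
        return self.classifier(torch.cat([text_feat, img_feat], dim=1))
```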
## Full Training Pipeline

```bash
python src/train.py
```

This provides:

- 5-fold stratified cross-validation (see the sketch after this list)
- Early stopping
- Class weighting for imbalance
- Comprehensive metrics logging
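A minimal sketch of stratified splitting plus balanced class weights, assuming `train.csv` has the `label` column described above (the real pipeline lives in `src/train.py`):

```python
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

df = pd.read_csv("train.csv")  # columns per this README: index, label, text

# Balanced weights up-weight the minority class in the loss.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=df["label"])
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(df, df["label"])):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val rows")
```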
## Inference

```bash
python src/inference.py
```

This creates:

- `predictions.csv`: Submission file
- `submission.zip`: Ready for upload to Codabench
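A minimal sketch of the packaging step, assuming predicted labels are already in hand (column names follow the submission format shown later in this README; the example rows are hypothetical):

```python
import zipfile
import pandas as pd

# Hypothetical predictions; real ones come from the trained model.
preds = pd.DataFrame({"index": ["12345.jpg", "15001.jpg"], "label": [0, 1]})
preds.to_csv("predictions.csv", index=False)

# Codabench expects the CSV inside a zip archive.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("predictions.csv")
```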
## Model Options

### Text-Only

- Uses XLM-RoBERTa / mBERT / MuRIL
- Good for memes with extracted OCR text
### Image-Only

- Uses ResNet50 / EfficientNet / ViT
- Captures visual features and layout
### Multimodal

- Combines text and image features
- Fusion types: concat, attention, gated (a gated-fusion sketch follows)
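Gated fusion lets the model learn, per example, how much to trust each modality. A minimal sketch (module name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion: a learned gate blends text and image features."""

    def __init__(self, text_dim: int = 768, image_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        self.gate = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid())

    def forward(self, text_feat, image_feat):
        t = self.text_proj(text_feat)
        v = self.image_proj(image_feat)
        g = self.gate(torch.cat([t, v], dim=1))  # values in (0, 1), one per dimension
        return g * t + (1 - g) * v               # convex blend of the two modalities
```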
### CLIP-Based

- Uses pre-trained CLIP for joint understanding
- Best for zero-shot transfer
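A minimal zero-shot sketch using Hugging Face's CLIP classes (the prompts and file path are illustrative; a multilingual CLIP variant may suit Nepali text better than the base English model):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("train/train_images/12345.jpg")  # path per the README layout
prompts = ["a hateful meme", "a harmless meme"]     # illustrative label prompts

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(prompts, probs[0].tolist())))
```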
## Expected Results

| Model | Val F1 (Macro) |
|---|---|
| Text-only (XLM-RoBERTa) | ~0.65 |
| Image-only (ResNet50) | ~0.58 |
| Multimodal (Concat) | ~0.70 |
| Multimodal (Attention) | ~0.72 |
Note: Results may vary based on hyperparameters and random seeds.
## Submission Format

Create `predictions.csv`:

```csv
index,label
12345.jpg,0
15001.jpg,1
20524.jpg,1
```

Zip and submit to Codabench:

```bash
zip submission.zip predictions.csv
```

## Configuration

Edit `config.py` to customize (an illustrative excerpt follows the list):
- Model architecture
- Training hyperparameters
- Data augmentation
- Paths
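A hypothetical excerpt showing the kinds of fields `config.py` typically holds; the names and values here are illustrative, not the repo's actual settings:

```python
# config.py (illustrative excerpt)
MODEL_NAME = "xlm-roberta-base"   # text backbone
IMAGE_BACKBONE = "resnet50"       # image backbone
FUSION = "attention"              # concat | attention | gated

MAX_LENGTH = 128                  # OCR-text token limit
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
EPOCHS = 10
N_FOLDS = 5
SEED = 42

TRAIN_IMAGES_DIR = "train/train_images"
OUTPUT_DIR = "outputs"
```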
## Key Features

- **OCR Extraction**: EasyOCR for Nepali text (see the sketch after this list)
- **Data Augmentation**: Albumentations for images
- **Class Imbalance**: Weighted loss function
- **Mixed Precision**: Faster training with AMP
- **Early Stopping**: Prevent overfitting
- **Ensemble**: Combine K-fold predictions
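A minimal OCR sketch with EasyOCR's Nepali model, along the lines of what `src/ocr_extraction.py` does (the exact behavior of that script may differ; the image path is illustrative):

```python
import easyocr

# 'ne' = Nepali, 'en' = English; EasyOCR downloads the models on first use.
reader = easyocr.Reader(["ne", "en"])

# detail=0 returns just the recognized strings, not bounding boxes.
texts = reader.readtext("train/train_images/12345.jpg", detail=0)
print(" ".join(texts))
```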
## Contributing

- Fork the repository
- Create your feature branch
- Submit a pull request
## Citation

If you use this code, please cite:

```bibtex
@inproceedings{thapa2025nememe,
  title={NeMeme: A Multimodal Prompt-based Framework for Analyzing Code-Mixed and Low-Resource Memes},
  author={Thapa, S. and Veeramani, H. and others},
  booktitle={ICWSM 2025},
  year={2025}
}
```
## Contact

For questions about the competition:

- Contact: rauniyark11@gmail.com
- GitHub: https://github.com/therealthapa/chipsal26-memes

Good luck with CHiPSAL 2026! 🇳🇵