This project contains a Python script for classifying images as either indoor or outdoor scenes using a pre-trained CLIP model and a lightweight few-shot, prototype-based method. It was built in response to a practical classification task using real-world, unlabeled image data.
The classifier leverages OpenAI’s CLIP (clip-vit-base-patch32) as a visual encoder and classifies new images by comparing their embeddings to representative class prototypes created from a small set of hand-picked example images (few-shot learning).
It is:
- ✅ Training-free (no fine-tuning needed)
- ⚡ Fast, with support for CPU
- 🔧 Modular, and easy to improve further
- Python 3.8+
- Git
- Dependencies listed in `requirements.txt`
Follow these steps to set up and run the classifier on your own machine.
First, clone this repository to your local machine using Git:
```bash
git clone https://github.com/DrUkachi/inside-outside-classification.git
```

Change your current directory to the newly cloned project folder:

```bash
cd inside-outside-classification
```

Install the required Python libraries using the requirements.txt file:

```bash
pip install -r requirements.txt
```

This installs:
- PyTorch
- HuggingFace Transformers
- Pillow (for image handling)
- tqdm (for progress bars)
Before running the script, make sure your image data is placed in the data/ directory as ZIP files. The expected project structure should be:
```
project-root/
├── data/
│   ├── few_shot.zip
│   ├── validation.zip
│   └── unlabeled.zip
│
├── .gitignore
├── README.md
├── classify.py
├── experiment.ipynb
├── few_shot_images.json
└── requirements.txt
```
The script is designed to automatically unzip the few_shot.zip, validation.zip, and unlabeled.zip files into the root directory the first time you run it.
It will create the following folders:
- few_shot/
- validation/
- unlabeled/
💡 The script will only unzip the files if the corresponding folders don’t already exist, so you can safely rerun it without duplicating data.
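A minimal sketch of this unzip-on-first-run behavior (the helper name `ensure_extracted` is illustrative; the actual logic lives in `classify.py`, and the layout inside the ZIPs may differ):

```python
import zipfile
from pathlib import Path

def ensure_extracted(zip_path: str, target_dir: str) -> None:
    """Extract zip_path into target_dir only if the folder does not already exist."""
    target = Path(target_dir)
    if target.exists():
        return  # already extracted; rerunning is a no-op
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)

# Unpack the three archives described above into the project root.
for name in ("few_shot", "validation", "unlabeled"):
    ensure_extracted(f"data/{name}.zip", name)
```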
You can run the script in two primary modes: classify or validate.
Use this command to classify all images inside the unlabeled/ directory.
```bash
python classify.py --mode classify --folder ./unlabeled
```

Use this command to run the classifier on the validation/ set and check its accuracy. This mode assumes the filenames in the validation set contain ground-truth labels (e.g., indoor_image_1.jpg).

```bash
python classify.py --mode validate --folder ./validation
```

After running, all processed images will be sorted into one of three directories: classified/indoor/, classified/outdoor/, or classified/review/ for ambiguous cases.
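For the validate mode, accuracy can be computed roughly as sketched below; `label_from_filename` and the `classify_image` callable are hypothetical stand-ins, not the actual functions in `classify.py`, and the real file extensions may vary:

```python
from pathlib import Path

def label_from_filename(path: Path) -> str:
    """Derive the ground-truth label from a filename prefix such as indoor_image_1.jpg."""
    return "indoor" if path.name.lower().startswith("indoor") else "outdoor"

def validate(folder: str, classify_image) -> float:
    """Return the accuracy of classify_image over all .jpg files in folder."""
    images = sorted(Path(folder).glob("*.jpg"))
    correct = sum(classify_image(p) == label_from_filename(p) for p in images)
    return correct / len(images) if images else 0.0
```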
- Uses `openai/clip-vit-base-patch32` from HuggingFace
- Embeddings are extracted for both:
  - Text prompts (e.g., “a photo taken indoors”)
  - Few-shot example images
- For each image:
  - Get its CLIP embedding
  - Compare it to the averaged prompt embeddings and few-shot image embeddings
  - Compute similarity to both classes
  - 🔄 If the top two scores are close (within 0.05), the image is sent to review
  - Otherwise, assign to the class with the highest similarity (a minimal sketch of this pipeline follows below)
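The scoring step can be sketched roughly as follows. This is a minimal illustration assuming the HuggingFace CLIP API named above; the helper names (`embed_images`, `embed_texts`, `build_prototype`, `similarity_to`) and the exact prompts, pooling, and weighting are assumptions, not the actual code in `classify.py`.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Return unit-normalized CLIP embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_texts(prompts):
    """Return unit-normalized CLIP embeddings for a list of text prompts."""
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def build_prototype(prompts, example_paths):
    """Average prompt and few-shot image embeddings into one class prototype."""
    feats = torch.cat([embed_texts(prompts), embed_images(example_paths)])
    proto = feats.mean(dim=0)
    return proto / proto.norm()

def similarity_to(proto, image_path):
    """Cosine similarity between a new image and a class prototype."""
    return float(embed_images([image_path])[0] @ proto)
```

With one prototype per class (e.g., built from prompts like “a photo taken indoors” plus the indoor few-shot examples), classification reduces to comparing `similarity_to(indoor_proto, img)` against `similarity_to(outdoor_proto, img)`.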
When the model is unsure (i.e., the similarity difference between the top two classes is small), the image is routed to a review/ folder for manual inspection.
🧠 Why 0.05? A 5% margin was selected as a practical threshold for ambiguity. It captures borderline cases where CLIP's semantic similarity doesn't clearly favor one class. This value can be tuned depending on tolerance for false positives or the capacity for human review.
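Put differently, the routing rule is a single comparison of the two class similarities against the margin. A sketch, where the names `MARGIN` and `assign` are illustrative:

```python
MARGIN = 0.05  # tunable ambiguity threshold discussed above

def assign(sim_indoor: float, sim_outdoor: float) -> str:
    """Return the destination folder for an image given its two class similarities."""
    if abs(sim_indoor - sim_outdoor) < MARGIN:
        return "review"  # too close to call; send for manual inspection
    return "indoor" if sim_indoor > sim_outdoor else "outdoor"
```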
- No model training required
- Strong generalization from a few visual examples
- Prompt-based reasoning makes it adaptable to other classes
| Category | Example | Insight |
|---|---|---|
| Ambiguous scenes | `190881191_*.jpg` | Even humans disagree; routed to `review/` folder |
| Roofed or car interiors | `219636488_*.jpg`, `70939958_*.jpg` | Challenging without contextual metadata |
| Clear misclassifications | `79869777_*.jpg`, `227589596_*.jpg` | Could benefit from a secondary model (e.g., object detection) |
| Unexplainable predictions | `253900795_*.jpg` | Explaining CLIP decisions is non-trivial; visual interpretability tools could help |
| Prompt sensitivity | `50587842_*.jpg`, `56540294_*.jpg` | Slight changes in text can impact results; consider dynamic prompt ensembling |
| Environmental cues | `99454779_*.jpg` | Brightness, lighting, and framing may bias CLIP's perception |
- **Scene-Based Inference Engine**: Use models like Places365 to classify contextually confusing cases (e.g., parking lots, stadiums).
- **Explainability Tools**: Add SHAP, Grad-CAM, or embedding heatmaps to interpret classification decisions.
- **Prompt Augmentation & Tuning**: Dynamically improve text prompts using automated selection or fine-tuned language prompts.
- `classify.py`: Main classification and validation script
- `README.md`: Full guide and technical report (this file)
- Folder structure with `few_shot/`, `unlabeled/`, `validation/`, and `classified/` directories
This solution demonstrates:
- ✅ Effective use of large pre-trained vision-language models
- ✅ Lightweight and reproducible code
- ✅ Clear handling of edge cases
- ✅ Review mechanism for ambiguous images
- ✅ Good modularity for future extensions