Support for clustering detections#818
Merged
mihow merged 47 commits into deployments/ood.antenna.insectai.org on May 7, 2025
Base automatically changed from feat/save-classification-features to deployments/ood.antenna.insectai.org on April 30, 2025, 01:55
mihow reviewed Apr 30, 2025
mihow reviewed May 6, 2025
        if not Algorithm.objects.filter(key=feature_extraction_algorithm).exists():
            raise ValueError(f"Invalid feature extraction algorithm key: {feature_extraction_algorithm}")
    else:
        # Fallback to the most used feature extraction algorithm in this collection
mihow (Collaborator): Thanks for implementing this fallback to most used feature extractor!
mihow reviewed May 6, 2025
    ]


    class ClusterDetectionsSerializer(serializers.Serializer):
mihow (Collaborator): I feel much better about using a serializer for the params! thank you 🙏
mihow approved these changes May 7, 2025
mihow (Collaborator): Well done @mohamedelabbas1996!! Merging!
Summary
This PR introduces clustering support for detections in a source image collection, based on precomputed feature vectors (features_2048) stored in Classification records. The clustering helps organize detections into potential taxonomic groups, especially for unknown species.
List of Changes
Integrated agglomerative clustering logic using the features_2048 field stored on the Classification model.
Added a new job type (DetectionClusteringJob) to cluster detections in a source image collection.
Added logic to:
  Select a consistent feature_extraction_algorithm across detections.
  Allow the user to explicitly pass an algorithm, with fallback to the most used one.
  Create new Taxon entries per cluster and automatically assign them to Occurrence determinations.
Added admin action on source image collections to trigger clustering.
Added a cluster_detections action to the SourceImageCollectionViewSet to trigger clustering via API.
Added an unknown_species boolean field to the Taxon model to flag automatically generated taxa.
Related Issues
#774
Detailed Description
This PR adds clustering support for detections within a source image collection using existing visual feature vectors. The goal is to help researchers group similar-looking insect detections together, especially in collections where many specimens may not match known species. This is particularly useful in the context of the Panama trip, where a large number of insect images were collected with limited labeling. By clustering detections based on visual similarity, we can automatically group potential unknown species and assign them temporary taxa, making it easier for experts to review and identify patterns. The clustering job can be triggered from the admin interface or API.
The clustering works as follows:
First, Detection objects are filtered by the selected collection. Only detections that have at least one Classification containing a feature vector (features_2048) are considered. If the user provides a feature_extraction_algorithm, only features generated by that algorithm are used. If not, we select the most commonly used algorithm in the collection. Additionally, detections are filtered by an out-of-distribution (OOD) threshold: only those with an associated occurrence whose determination_ood_score exceeds the specified threshold are included.
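The selection rules above can be sketched in plain Python. This is a simplified illustration, not the PR's Django query: the `FeatureRecord` dataclass, its field names, and the `select_features` function are hypothetical stand-ins for the Detection, Classification, and Occurrence rows, and the key check here only looks at the records passed in rather than at the Algorithm table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeatureRecord:
    """Hypothetical flattened view of one classification of a detection."""

    detection_id: int
    algorithm_key: str
    features: Optional[list]  # features_2048 vector, or None if not extracted
    ood_score: float  # determination_ood_score of the associated occurrence


def select_features(records, ood_threshold=0.0, algorithm_key=None):
    """Return (algorithm_key, {detection_id: features}) for clustering."""
    # Only classifications that actually carry a feature vector qualify.
    with_features = [r for r in records if r.features is not None]
    if not with_features:
        raise ValueError("No classifications with feature vectors found")

    if algorithm_key is None:
        # Fallback: the most commonly used algorithm in the collection.
        counts = Counter(r.algorithm_key for r in with_features)
        algorithm_key = counts.most_common(1)[0][0]
    elif all(r.algorithm_key != algorithm_key for r in with_features):
        # Stand-in for the PR's database lookup of the algorithm key.
        raise ValueError(
            f"Invalid feature extraction algorithm key: {algorithm_key}"
        )

    # Keep one consistent algorithm and only scores above the OOD threshold.
    selected = {
        r.detection_id: r.features
        for r in with_features
        if r.algorithm_key == algorithm_key and r.ood_score > ood_threshold
    }
    return algorithm_key, selected
```

Restricting the result to a single algorithm key matters because feature vectors from different extractors are not comparable in the same space.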
Once the set of valid detections is prepared, we apply PCA dimensionality reduction to the feature vectors (default: 384 components). The reduced vectors are then passed to the selected clustering algorithm (currently, agglomerative clustering is supported). The algorithm groups similar detections into clusters based on the provided clustering parameters (e.g., distance_threshold).
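To illustrate how a distance_threshold determines the resulting clusters, here is a naive agglomerative clustering routine in pure Python. It is a teaching sketch under stated assumptions, not the PR's implementation: it omits the PCA step, uses Euclidean distance with average linkage (the PR description does not state which metric or linkage is used), and the function name `agglomerative_labels` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_labels(vectors, distance_threshold=0.5):
    """Merge the closest clusters until none are closer than the threshold.

    Returns one integer cluster label per input vector.
    """
    # Start with every vector in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(vectors))]

    def average_linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Stop once the closest pair is at or beyond the threshold.
        if best[0] >= distance_threshold:
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Convert the cluster membership lists into per-vector labels.
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

A smaller threshold yields more, tighter clusters; a larger one merges more detections into each group. This naive version recomputes all pairwise linkages on every merge, so it is only suitable for small inputs.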
After clustering:
A new Taxon is created for each cluster, marked with the unknown_species=True flag.
A new Classification is created for each detection, pointing to the corresponding cluster taxon and assigned a score=1.0. This score ensures that the associated Occurrence will automatically update its determination to use the new classification.
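The bookkeeping after clustering might look like the sketch below. It uses plain dataclasses rather than the Django models. The `Taxon` and `Classification` classes defined here, the cluster naming scheme, and the rule that a determination follows the highest-scoring classification are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Taxon:
    name: str
    unknown_species: bool = False


@dataclass
class Classification:
    detection_id: int
    taxon: Taxon
    score: float


def assign_cluster_taxa(detection_ids, labels, name_prefix="Cluster"):
    """Create one unknown-species taxon per cluster and classify each detection."""
    taxa = {}
    classifications = []
    for detection_id, label in zip(detection_ids, labels):
        if label not in taxa:
            # Hypothetical naming scheme; the PR does not specify one.
            taxa[label] = Taxon(name=f"{name_prefix} {label}", unknown_species=True)
        # score=1.0 so the cluster classification outranks earlier predictions.
        classifications.append(
            Classification(detection_id, taxa[label], score=1.0)
        )
    return taxa, classifications


def best_classification(classifications):
    """Assumed determination rule: the highest-scoring classification wins."""
    return max(classifications, key=lambda c: c.score)
```

Under that assumed rule, a detection previously classified with a score below 1.0 would have its determination replaced by the cluster taxon.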
The cluster_detections action accepts several parameters:
ood_threshold (float, default: 0.0): Filters out detections that have already been confidently classified. Only detections whose out-of-distribution (OOD) score exceeds this value are included in the clustering.
feature_extraction_algorithm (string, optional): The key of the feature extraction algorithm to use. If omitted, the most commonly used algorithm within the collection is selected automatically.
algorithm (string, default: "agglomerative"): The clustering algorithm to use. Currently only "agglomerative" is supported.
algorithm_kwargs (dict, default: {"distance_threshold": 0.5}): Extra configuration for the clustering algorithm, such as the distance threshold for agglomerative clustering.
pca (dict, default: {"n_components": 384}): Dimensionality reduction settings applied before clustering.
Sample request:
POST /api/collections/42/cluster_detections/
Content-Type: application/json

{
  "ood_threshold": 0.4,
  "feature_extraction_algorithm": "resnet18",
  "algorithm": "agglomerative",
  "algorithm_kwargs": {
    "distance_threshold": 0.45
  },
  "pca": {
    "n_components": 256
  }
}
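As a rough sketch of how a client could assemble this request body, the helper below fills in the documented defaults and performs basic validation. It is a plain-Python stand-in, not the ClusterDetectionsSerializer from this PR, and the helper name `build_cluster_params` is hypothetical.

```python
import json

# Defaults as documented in the parameter list for cluster_detections.
DEFAULTS = {
    "ood_threshold": 0.0,
    "feature_extraction_algorithm": None,
    "algorithm": "agglomerative",
    "algorithm_kwargs": {"distance_threshold": 0.5},
    "pca": {"n_components": 384},
}
SUPPORTED_ALGORITHMS = {"agglomerative"}


def build_cluster_params(**overrides):
    """Merge caller overrides into the defaults and validate the result."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")

    # Deep-copy via JSON so callers never mutate the shared defaults.
    params = json.loads(json.dumps(DEFAULTS))
    params.update(overrides)

    if params["algorithm"] not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"Unsupported algorithm: {params['algorithm']}")
    if not isinstance(params["ood_threshold"], (int, float)):
        raise ValueError("ood_threshold must be a number")

    # Optional parameter: omit it so the server can fall back to the
    # most commonly used algorithm in the collection.
    if params["feature_extraction_algorithm"] is None:
        del params["feature_extraction_algorithm"]
    return params
```

The returned dictionary can be passed through json.dumps and sent as the POST body to the cluster_detections endpoint shown in the sample request. Authentication and the host name depend on the deployment and are not covered here.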
How to Test the Changes
1. Extract features for the target collection by running a pipeline that supports feature extraction (e.g. the Global pipeline, or Panama Plus).
2. Go to the Django admin panel and trigger the "cluster detections" action on the collection.
3. Confirm that:
   a. A clustering job is created and runs successfully.
   b. New taxa are created for clusters.
   c. Associated occurrences receive updated determinations.
   d. Results can be reviewed via the occurrences and taxa views.
Screenshots
Checklist