Hi, have you ever tested the zero-shot classification accuracy on ScanNet200? That is, instead of the class-agnostic mask proposals predicted by Mask3D, feed the ground-truth instance masks into your mask feature computation module to get the mask features, take their dot product with the text embeddings from the CLIP text encoder, and pick the class with the highest score as the predicted label for each ground-truth instance. I'm just wondering how accurate CLIP is in this setting.
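To make the evaluation I have in mind concrete, here is a minimal NumPy sketch. It is not taken from your codebase: the function name `zero_shot_accuracy` and the array shapes are my own assumptions, and I assume the mask features and text embeddings already live in the same CLIP embedding space. I also L2-normalize both sides (so the dot product becomes cosine similarity), which is an assumption on my part about how the scores would be compared.

```python
import numpy as np


def zero_shot_accuracy(mask_features, text_embeddings, gt_labels):
    """Hypothetical helper: classify ground-truth instances by CLIP similarity.

    mask_features:   (num_instances, dim) features for ground-truth masks
    text_embeddings: (num_classes, dim) CLIP text embeddings, one per class
    gt_labels:       (num_instances,) ground-truth class indices
    """
    mask_features = np.asarray(mask_features, dtype=np.float64)
    text_embeddings = np.asarray(text_embeddings, dtype=np.float64)
    gt_labels = np.asarray(gt_labels)

    # Assumption: normalize so the dot product is a cosine similarity.
    m = mask_features / np.linalg.norm(mask_features, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)

    # Similarity of every instance to every class prompt.
    scores = m @ t.T  # (num_instances, num_classes)

    # Predicted label = class with the highest similarity.
    pred_labels = scores.argmax(axis=1)

    # Fraction of ground-truth instances labelled correctly.
    accuracy = float((pred_labels == gt_labels).mean())
    return pred_labels, accuracy
```

With the ground-truth masks as input, any remaining error would come from the mask features and the CLIP text matching rather than from the Mask3D proposals, which is the number I am curious about.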