DistillCLIP: Knowledge Distillation of Contrastive Language-Image Pretrained Models
Date of Award
7-1-2023
Document Type
Thesis
Degree Name
Master of Science in Computer Science
First Advisor
Raphael B. Alampay, PhD
Patricia Angela R. Abu, PhD
Abstract
Despite CLIP’s strong performance on vision-language tasks, its size limits deployment in low-resource environments. We propose a knowledge distillation scheme that compresses a teacher CLIP into a smaller student model we term DistillCLIP. Our framework distills both intra-modal and inter-modal similarity maps within and between image and text embeddings. DistillCLIP is 43.69% the size of CLIP and requires 82.43% of its FLOPs. We show that DistillCLIP’s ability to retain teacher performance on zero-shot transfer tasks may depend on the semantic granularity of class labels, preserving only 63.81% of teacher accuracy on average. Meanwhile, DistillCLIP’s linear-probe performance matches, and on some datasets surpasses, that of the teacher CLIP, with an average retention rate of 100.53%. However, DistillCLIP retains only 12.28% of teacher accuracy on average on distribution-shift datasets. We also demonstrate that DistillCLIP preserves 99.34% of teacher accuracy on video accident recognition in dashcam videos.
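To make the distillation objective concrete, the following is a minimal sketch of what matching intra-modal and inter-modal similarity maps could look like. It assumes PyTorch, cosine-similarity maps, an MSE matching objective, and equal weighting of the three terms; these are illustrative assumptions, not the exact formulation used in the thesis.

```python
# Hypothetical sketch of similarity-map distillation (not the thesis's exact loss).
import torch
import torch.nn.functional as F


def similarity_map(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity map between two batches of embeddings, shape (batch, batch)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.t()


def distillation_loss(t_img: torch.Tensor, t_txt: torch.Tensor,
                      s_img: torch.Tensor, s_txt: torch.Tensor) -> torch.Tensor:
    """Match the student's similarity maps to the teacher's.

    t_img / t_txt: teacher image and text embeddings for a batch.
    s_img / s_txt: student image and text embeddings for the same batch.
    """
    loss = torch.tensor(0.0, device=s_img.device)
    # Intra-modal terms: image-image and text-text similarity maps.
    loss = loss + F.mse_loss(similarity_map(s_img, s_img), similarity_map(t_img, t_img))
    loss = loss + F.mse_loss(similarity_map(s_txt, s_txt), similarity_map(t_txt, t_txt))
    # Inter-modal term: image-text similarity map.
    loss = loss + F.mse_loss(similarity_map(s_img, s_txt), similarity_map(t_img, t_txt))
    return loss
```

In this sketch the teacher's maps act as soft targets, so the student learns the relational structure of the teacher's embedding space rather than reproducing individual embeddings.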
Recommended Citation
Ramos, Patrick John C. (2023). DistillCLIP: Knowledge Distillation of Contrastive Language-Image Pretrained Models. Archīum.ATENEO.
https://archium.ateneo.edu/theses-dissertations/872
