Introduction
Given the vast number of models that excel at zero-shot classification, identifying common objects like dogs, cars, and stop signs can be viewed as a mostly solved problem. Identifying less common or rare objects, however, is still an active field of research, because this is a scenario where large, manually annotated datasets are unavailable. In these cases, it is often unrealistic to expect people to engage in the laborious task of collecting large image datasets, so a solution that relies on only a few annotated examples is imperative. A key example is healthcare, where professionals might need to classify image scans of rare diseases. Here, large datasets are scarce, expensive, and complex to create.
Before diving in, a few definitions might be helpful.
Zero-shot, one-shot, and few-shot learning are techniques that allow a machine learning model to make predictions for new classes with limited labeled data. The choice of technique depends on the specific problem and the amount of labeled data available for new categories or labels (classes).
Zero-shot learning: There is no labeled data available for new classes. The algorithm makes predictions about new classes by using prior knowledge about the relationships that exist between classes it already knows.
One-shot learning: A new class has one labeled example. The algorithm makes predictions based on the single example.
Few-shot learning: The goal is to make predictions for new classes based on a few examples of labeled data.
Few-shot learning, an approach focused on learning from only a few examples, is designed for situations where labeled data is scarce and hard to create. Training a decent image classifier often requires a large amount of training data, especially for classical convolutional neural networks. You can imagine how hard the problem becomes when there are only a handful of labeled images (often fewer than five) to train with.
With the advent of visual language models (VLMs), large models that connect visual and textual data, few-shot classification has become more tractable. These models have learned features and invariances from huge quantities of internet data, along with connections between visual features and textual descriptors. This makes VLMs an ideal basis to finetune or leverage for downstream classification tasks when only a small amount of labeled data is provided. Deploying such a system efficiently would make a few-shot classification solution far less costly and more appealing to our customers.
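As a concrete illustration of how a VLM connects visual features to textual descriptors, here is a minimal zero-shot classification sketch using a pretrained CLIP checkpoint via the transformers library. The checkpoint name, class prompts, and image path are illustrative placeholders, not part of the system described in this post.

```python
# Minimal sketch: zero-shot classification with a pretrained CLIP model.
# Assumes the `transformers` and `Pillow` packages; the checkpoint name,
# class prompts, and image path are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a car", "a photo of a stop sign"]
image = Image.open("example.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

The few-shot approaches discussed below build directly on these image and text embeddings rather than training a classifier from scratch.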
We’ve paired up with students from the University of Toronto’s Engineering Science (Machine Intelligence) program for half of the 2023 Fall semester to take a first step toward productionizing a few-shot learning system.
Adapting to New Examples
Even though VLMs achieve very impressive results on standard benchmarks, they usually perform well on unseen domains only with further training. One approach is to finetune the model on the new examples. Full finetuning entails retraining all parameters of a pre-trained model on a new task-specific dataset. While this method can achieve strong performance, it has a few shortcomings. Primarily, it requires substantial computational resources and time, and it may lead to overfitting if the task-specific dataset is small. This can result in the model failing to generalize well to unseen data.
The adapter method, first popularized by the CLIP-adapter for the CLIP model, has been developed to mitigate these issues. In contrast to full finetuning, the adapter method only adjusts a small number of parameters in the model. This method involves inserting small adapter modules into the model’s architecture, which are then fine-tuned while the original model parameters remain frozen. This approach significantly reduces the computational cost and overfitting risk associated with full finetuning while allowing the model to adapt effectively to new tasks.
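To make the idea concrete, here is a minimal sketch of an adapter in the spirit of the CLIP-adapter: a small bottleneck MLP sits on top of the frozen image encoder's features, and its output is blended with the original feature via a residual ratio. The dimensions, reduction factor, and ratio below are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Small bottleneck MLP applied to frozen CLIP image features
    (in the spirit of CLIP-Adapter); only these weights are trained."""
    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio  # residual blend between adapted and original features
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        adapted = self.net(feats)
        return self.ratio * adapted + (1.0 - self.ratio) * feats

# The CLIP backbone stays frozen; only the adapter's ~130k parameters are updated,
# which is what keeps fine-tuning cheap and reduces overfitting on tiny datasets.
adapter = FeatureAdapter(dim=512)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
```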
The TIP Adapter is an advanced approach that further improves upon the CLIP-adapter. TIP Adapters provide a training-free framework for a few-shot learning system, meaning no finetuning is needed (there is also a variant that uses additional fine-tuning and is more efficient than the CLIP-adapter). The system leverages a key-value (KV) cache in which the CLIP embeddings of the labeled examples are the keys and their labels, converted to one-hot vectors, are the values. This can easily be extended into a scalable service for a high volume of distinct image classification tasks.
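The cache-based logic is simple enough to sketch in a few lines. Below is an illustrative, training-free version: normalized CLIP embeddings of the support images act as keys, one-hot labels act as values, and a query image is classified by blending the cache affinities with CLIP's usual text-based zero-shot logits. The hyperparameters alpha and beta follow the spirit of the TIP-Adapter formulation, but the specific values here are placeholders.

```python
import torch

def tip_adapter_logits(
    query_feats: torch.Tensor,   # (B, D) L2-normalized CLIP image features
    cache_keys: torch.Tensor,    # (N, D) L2-normalized features of the few-shot examples
    cache_values: torch.Tensor,  # (N, C) one-hot labels of those examples
    clip_logits: torch.Tensor,   # (B, C) standard zero-shot logits from text prompts
    alpha: float = 1.0,          # weight of the cache term (placeholder value)
    beta: float = 5.5,           # sharpness of the affinity (placeholder value)
) -> torch.Tensor:
    # Affinity between each query and every cached key (cosine similarity).
    affinity = query_feats @ cache_keys.t()                             # (B, N)
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values   # (B, C)
    # Training-free prediction: few-shot cache knowledge + CLIP's zero-shot prior.
    return clip_logits + alpha * cache_logits
```

Because the "training" step is just building the cache from a handful of embeddings, adapting to a new classification task amounts to writing a few tensors into storage.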
Scaling to Production
Building on this approach, the University of Toronto Engineering Science team designed a system that can be deployed as a single container using FastAPI, Redis, and Docker. Out of the box, it can support up to 10 million uniquely trained class instances. On top of that, thanks to the adapter method, the time needed for fine-tuning is reduced to the order of tens of seconds.
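We won't reproduce the team's exact service here, but the sketch below illustrates the general shape of such a design: a FastAPI endpoint looks up a per-task cache in Redis and runs the training-free prediction from the earlier sketch. The route name, Redis key layout, serialization, and the embed_image and clip_text_logits helpers are all assumptions for illustration, not the team's actual implementation.

```python
# Illustrative sketch only: a FastAPI service that stores per-task TIP-style
# caches in Redis. Route names, key layout, and serialization are assumptions.
import io
import torch
import redis
from fastapi import FastAPI, UploadFile

app = FastAPI()
store = redis.Redis(host="localhost", port=6379)

def load_cache(task_id: str):
    """Deserialize the cached keys/values tensors for one classification task."""
    keys = torch.load(io.BytesIO(store.get(f"{task_id}:keys")))
    values = torch.load(io.BytesIO(store.get(f"{task_id}:values")))
    return keys, values

@app.post("/classify/{task_id}")
async def classify(task_id: str, image: UploadFile):
    # embed_image and clip_text_logits are hypothetical helpers wrapping the
    # frozen CLIP encoders; tip_adapter_logits is the function sketched above.
    feats = embed_image(await image.read())
    keys, values = load_cache(task_id)
    logits = tip_adapter_logits(feats, keys, values, clip_text_logits(feats))
    return {"class_index": int(logits.argmax(dim=-1))}
```

Keeping only the tiny per-task caches in Redis, while the CLIP backbone is shared across all tasks, is what makes serving millions of distinct class instances from a single container plausible.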
Their final deliverable can be found in this GitHub repository.
What is next?
The team has identified a few directions:
Different base model: CLIP has many variants and is certainly not the only VLM out there. However, switching the base model may involve a tradeoff between model size (and thus serving costs) and accuracy.
Data augmentation: Methods like cropping, rotations, and re-coloring may help synthetically increase the number of examples available for training (see the sketch after this list).
Promising possibilities from Large Language Models (LLMs): LLMs have decent zero-shot capabilities (no extra training) and emergent few-shot capabilities. Could LLMs be used more widely in few-shot production systems? Time will tell.
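For the augmentation direction, a minimal sketch of the idea is below: random crops, rotations, and color jitter expand a handful of labeled images into many training variants. The specific transforms and magnitudes are illustrative choices.

```python
# Illustrative augmentation pipeline; transform choices and magnitudes are
# placeholders, not a tuned configuration.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
])

# Each call on the same PIL image yields a different synthetic training example:
# augmented = [augment(pil_image) for _ in range(16)]
```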
The UofT team comprises Arthur Allshire, Chase McDougall, Christopher Mountain, Ritvik Singh, Sameer Bharatia, and Vatsal Bagri.