Text-based visual descriptors—ranging from simple class names to more descriptive phrases—are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors span a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics—Global Alignment and CLIP Similarity—that move beyond accuracy. These metrics shed light on how different descriptor generation strategies interact with foundation model properties, offering a more principled way to study descriptor effectiveness than accuracy alone.
Descriptor quality is typically evaluated by downstream classification accuracy. While informative, this metric offers limited insight into why certain descriptors succeed or fail, and provides little guidance for descriptor discovery or refinement. Moreover, higher accuracy does not necessarily imply more meaningful descriptors. As shown above, random descriptor generation can outperform zero-shot LLM-based methods such as the original Classification by Description approach, and even an iterative refinement algorithm like ESCHER. Even though the semantic quality of the random descriptors is much worse, they still achieve higher accuracy. Clearly, there is a need for more principled ways to assess descriptor quality beyond accuracy.
In this work, we propose a novel approach to assessing descriptor quality by probing the relationship between textual descriptors and the underlying VLM. We design two new metrics:
1) Global Alignment: Measures the representational capacity of a set of descriptors.
2) CLIP Similarity: Measures how well a set of descriptors aligns with the VLM's pre-training data.
Classification by description using VLMs can be viewed as a semantic projection, where images are mapped to similarity scores over a set of textual descriptors. This forms a new representation space, with each dimension corresponding to a specific concept. Conceptually, this resembles principal component analysis, except that the basis directions are expressed in natural language.
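To make this view concrete, here is a minimal sketch of such a semantic projection, assuming the Hugging Face transformers CLIP API and a couple of placeholder classes and descriptors (not the actual prompts used in our experiments):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder descriptors per class, purely for illustration.
descriptors = {
    "golden retriever": ["a dog with long golden fur", "a dog with floppy ears"],
    "tabby cat": ["a cat with striped fur", "a small feline with pointed ears"],
}

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

flat_descriptors = [d for ds in descriptors.values() for d in ds]
image = Image.open("example.jpg")  # any test image

with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=flat_descriptors, return_tensors="pt", padding=True))
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))

# Project the image onto the descriptor "axes": each coordinate is the
# image's similarity to one natural-language concept.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
scores = (img_emb @ text_emb.T).squeeze(0)  # shape: (num_descriptors,)

# Averaging the coordinates belonging to each class gives the prediction.
class_scores, offset = {}, 0
for cls, ds in descriptors.items():
    class_scores[cls] = scores[offset:offset + len(ds)].mean().item()
    offset += len(ds)
print(max(class_scores, key=class_scores.get))
```

The vector `scores` is the descriptor-space representation referred to above; the analyses that follow operate on representations of this form rather than on the raw CLIP embeddings.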
Global Alignment evaluates the quality of this descriptor space using metrics from prior work on representational alignment. We measure the Mutual-KNN alignment between the visual descriptor space and a strong reference image embedding space. A high alignment value indicates that the descriptors capture meaningful visual structure, which likely corresponds to high semantic quality.
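A minimal sketch of how such a mutual k-NN score can be computed is shown below; the function name and the use of cosine similarity in both spaces are choices made here for illustration, with `descriptor_repr` holding each image's similarity scores over the descriptors and `reference_repr` holding embeddings of the same images from a strong reference encoder.

```python
import torch

def mutual_knn_alignment(descriptor_repr: torch.Tensor,
                         reference_repr: torch.Tensor,
                         k: int = 10) -> float:
    """Average fraction of k-nearest neighbors shared by two
    representation spaces of the same N images (higher = better aligned)."""
    def knn(x: torch.Tensor) -> torch.Tensor:
        x = x / x.norm(dim=-1, keepdim=True)
        sim = x @ x.T
        sim.fill_diagonal_(float("-inf"))   # exclude each image itself
        return sim.topk(k, dim=-1).indices  # (N, k) neighbor indices

    nn_desc, nn_ref = knn(descriptor_repr), knn(reference_repr)
    overlaps = [
        len(set(nn_desc[i].tolist()) & set(nn_ref[i].tolist())) / k
        for i in range(descriptor_repr.shape[0])
    ]
    return sum(overlaps) / len(overlaps)

# Toy usage: 100 images, a 50-descriptor space vs. a 768-dim reference space.
score = mutual_knn_alignment(torch.randn(100, 50), torch.randn(100, 768), k=10)
```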
Due to the imbalanced nature of CLIP’s training data, semantically similar text descriptors may not yield similar downstream performance. To analyze this, we propose CLIP Similarity, a metric that measures how well descriptor candidates align with CLIP's pre-training data.
We define two statistics for each descriptor \( d \), based on its top-\( k \) nearest captions \( \{c_1, \dots, c_k\} \) retrieved by cosine similarity in CLIP’s text embedding space. Let \( \text{sim}(a, b) \) denote cosine similarity between two L2-normalized embeddings.
These two metrics capture how frequently a text descriptor appears in CLIP’s training set and how well it aligns with visual content. This provides a proxy for how compatible a descriptor is with CLIP’s vision-language representation space. High-quality descriptors should not only distinguish between image classes but also align with the inductive biases learned by VLMs. Such descriptors are more likely to yield better downstream performance.
CLIP Similarity Score: the average of the per-descriptor similarity statistic across the full set of descriptors.
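One way to write this down under the notation above is sketched here; the symbol \( \mathcal{D} \) for the full descriptor set, the map \( v(c_i) \) to the training image paired with caption \( c_i \), and the choice of averaging descriptor-to-image similarities are illustrative assumptions rather than exact definitions.

\[
\text{SimScore}(d) = \frac{1}{k} \sum_{i=1}^{k} \text{sim}\bigl(d,\, v(c_i)\bigr),
\qquad
\text{CLIP Similarity}(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \text{SimScore}(d).
\]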
To build intuition about CLIP’s priors, we first analyze the relationship between descriptor frequency and CLIP similarity. As shown below, we observe a consistent negative correlation between frequency and image-text similarity—a somewhat counterintuitive trend. One possible explanation is that frequent descriptors tend to be coarser and more ambiguous (e.g., “animal”), making it harder for CLIP to associate them with a consistent visual pattern. In contrast, rarer descriptors often refer to more specific, visually grounded concepts that CLIP can align with more reliably.
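A sketch of this kind of analysis is given below; the frequency proxy (a count of near-duplicate captions above a similarity threshold), the visual-alignment proxy (mean similarity to the images paired with the top-\( k \) captions), and the use of Spearman correlation are all assumptions of this sketch, and the random tensors stand in for real CLIP embeddings of descriptors and pre-training pairs.

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

# Stand-ins for precomputed, L2-normalized CLIP embeddings:
#   desc_emb    (D, dim): text embeddings of candidate descriptors
#   caption_emb (N, dim): text embeddings of pre-training captions
#   image_emb   (N, dim): image embeddings paired with those captions
D, N, dim, k, tau = 200, 10_000, 512, 50, 0.1  # tau is illustrative; tune for real embeddings
desc_emb = F.normalize(torch.randn(D, dim), dim=-1)
caption_emb = F.normalize(torch.randn(N, dim), dim=-1)
image_emb = F.normalize(torch.randn(N, dim), dim=-1)

text_sim = desc_emb @ caption_emb.T        # (D, N) descriptor-caption similarities
topk = text_sim.topk(k, dim=-1)

# Frequency proxy: how many captions are close to the descriptor in text space.
frequency = (text_sim >= tau).sum(dim=-1).float()

# Visual-alignment proxy: mean similarity to the images paired with the
# descriptor's top-k retrieved captions.
image_sim = torch.stack(
    [desc_emb[i] @ image_emb[topk.indices[i]].T for i in range(D)])
visual_alignment = image_sim.mean(dim=-1)

rho, _ = spearmanr(frequency.numpy(), visual_alignment.numpy())
print(f"Spearman correlation (frequency vs. image-text similarity): {rho:.3f}")
```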
We use CLIP Similarity to evaluate whether descriptor refinement moves toward CLIP's training distribution. The figure above shows how descriptor quality evolves over iterations of ESCHER. Across all datasets, CLIP Similarity generally increases over ESCHER iterations even when accuracy does not, suggesting that descriptors become increasingly fine-grained and better aligned with CLIP's representation space.