Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor

1 Cornell University, 2 University of Texas at Austin
VisCon @ CVPR 2025
Concept figure

Achieving high accuracy does not guarantee that a set of descriptors is "good". Other factors like interpretability may suffer. Global Alignment and CLIP Similarity can serve as new metrics for evaluating and understanding different sets of visual descriptors.

Abstract

Text-based visual descriptors—ranging from simple class names to more descriptive phrases—are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics—Global Alignment and CLIP Similarity—that move beyond accuracy. These metrics allow us to shed light on how different descriptor generation strategies interact with foundation model properties, offering insights into ways of studying descriptor effectiveness beyond accuracy evaluations.

Which descriptor set is best?

  • Class Name Prompt: Uses the standard zero-shot image classification format “An image of a {class name}”.
  • CBD Concepts: LLM-generated descriptors from the original Classification by Description framework.
  • ESCHER: Iteratively refined class-specific descriptors using the ESCHER algorithm.
  • DCLIP: Inspired by WaffleCLIP; combines class names with randomly sampled global descriptors.
  • WaffleCLIP: Appends randomized tokens to general concepts and class names, e.g., “An image of a {concept}: {class name}, which has !32d, #tjli, ^fs0.”
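To make these formats concrete, below is a minimal sketch of how such prompt sets could be constructed. The class names, concept list, and random-token generator are illustrative placeholders, not the exact prompts or code used by each method.

```python
import random
import string

def class_name_prompts(class_names):
    # Standard zero-shot prompt: one template per class.
    return {c: [f"An image of a {c}"] for c in class_names}

def waffleclip_style_prompts(class_names, concepts, n_tokens=3, token_len=4):
    # WaffleCLIP-style: append random character tokens to a general concept
    # and the class name (illustrative sketch, not the official implementation).
    def rand_token():
        alphabet = string.ascii_lowercase + string.digits + "!#^"
        return "".join(random.choices(alphabet, k=token_len))
    prompts = {}
    for c in class_names:
        concept = random.choice(concepts)
        noise = ", ".join(rand_token() for _ in range(n_tokens))
        prompts[c] = [f"An image of a {concept}: {c}, which has {noise}."]
    return prompts

# Hypothetical usage
classes = ["blue jay", "cardinal"]
print(class_name_prompts(classes)["blue jay"])
print(waffleclip_style_prompts(classes, concepts=["bird"])["blue jay"])
```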
Concept figure

Descriptor quality is typically evaluated by downstream classification accuracy. While informative, this metric offers limited insight into why certain descriptors succeed or fail, and provides little guidance for descriptor discovery or refinement. Moreover, higher accuracy does not necessarily imply more meaningful descriptors. As shown above, random descriptor generation methods can outperform zero-shot LLM-based methods such as the original Classification by Description approach, and even an iterative refinement algorithm like ESCHER. Even though the semantic quality of the random descriptors is far worse, they still achieve higher accuracy. Clearly, there is a need for more principled methods of assessing descriptor quality beyond accuracy.

Contributions

In this work, we propose a novel approach to assessing descriptor quality by probing the relationship between textual descriptors and the underlying VLM. We design two new metrics:

1) Global Alignment: Measures the representational capacity of a set of descriptors.

2) CLIP Similarity: Measures how well a set of descriptors aligns with the VLM's pre-training data.

Global Alignment

Classification by description using VLMs can be viewed as a semantic projection, where images are mapped to similarity scores over a set of textual descriptors. This forms a new representation space, with each dimension corresponding to a specific concept. Conceptually, this resembles principal component analysis, where we aim to find a set of basis directions expressed in natural language.
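As a rough illustration, the snippet below sketches this projection with the open_clip library; the checkpoint name and descriptor list are assumptions for the example, not the exact setup used in the paper.

```python
import torch
import open_clip

# Load a CLIP model (this specific checkpoint is an assumption for the example).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Hypothetical descriptor set; each one becomes a dimension of the new space.
descriptors = ["a hooked beak", "webbed feet", "striped plumage"]

@torch.no_grad()
def semantic_projection(images):
    """Project a batch of preprocessed images into the descriptor space:
    each output dimension is the cosine similarity to one descriptor."""
    text = tokenizer(descriptors)
    t = model.encode_text(text)
    t = t / t.norm(dim=-1, keepdim=True)
    v = model.encode_image(images)
    v = v / v.norm(dim=-1, keepdim=True)
    return v @ t.T   # shape: (batch_size, num_descriptors)
```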

Global Alignment evaluates the quality of this descriptor space using metrics from prior work on representational alignment. We measure the mutual-KNN alignment between the visual descriptor space and a strong reference image embedding space. A high alignment value indicates that the descriptors capture meaningful visual structure, which likely corresponds to high semantic quality.

Semantic Projection

Global Alignment Method
Our alignment metric compares neighborhood structure between the projections of a dataset into descriptor and image embedding spaces.
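The following is a minimal sketch of a mutual-k-NN alignment score of this kind, assuming the descriptor-space projections and reference image embeddings are already available as arrays; the choice of k and the cosine-similarity neighborhoods are illustrative assumptions.

```python
import numpy as np

def knn_sets(X, k):
    """Indices of the k nearest neighbors of each row of X (cosine similarity),
    excluding the point itself."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(-sims, axis=1)[:, :k]

def mutual_knn_alignment(desc_space, ref_space, k=10):
    """Average overlap between k-NN sets computed in the descriptor
    projection space and in a reference image embedding space."""
    nn_desc = knn_sets(desc_space, k)
    nn_ref = knn_sets(ref_space, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_desc, nn_ref)]
    return float(np.mean(overlaps))

# Hypothetical usage: 1000 images, 50 descriptors, 768-dim reference embeddings.
# desc_space = semantic_projection(images).cpu().numpy()   # (1000, 50)
# ref_space = reference_image_embeddings                   # (1000, 768)
# print(mutual_knn_alignment(desc_space, ref_space, k=10))
```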

Global Alignment Results

We compute alignment and accuracy across three different datasets commonly used in Classification by Description frameworks. While there is no clear trend in accuracy across the different sets of descriptors, alignment reveals a clear trend in representation quality across the different types of descriptor sets.
Alignment vs. Accuracy on NABirds, CIFAR100, and CUB

CLIP Similarity

CLIP Similarity Metric
CLIP Similarity assesses how well descriptors align with visual content by retrieving related image-text pairs from CLIP’s pre-training dataset and averaging their similarity scores.

Due to the imbalanced nature of CLIP’s training data, semantically similar text descriptors may not yield similar downstream performance. To analyze this, we propose CLIP Similarity, a metric that measures how well descriptor candidates align with CLIP's pre-training data.


CLIP Similarity Method
Diagram of how the CLIP Similarity metric is computed given a set of descriptors and CLIP's pre-training data.

We define two statistics for each descriptor \( d \), based on its top-\( k \) nearest captions \( \{c_1, \dots, c_k\} \) retrieved by cosine similarity in CLIP’s text embedding space. Let \( \text{sim}(a, b) \) denote cosine similarity between two L2-normalized embeddings.

  • Frequency: number of captions above a similarity threshold \( \tau \):
    \[ \mathcal{I}_d = \left\{ i \in [k] \mid \text{sim}(d, c_i) > \tau \right\}, \quad \text{Freq}(d) = |\mathcal{I}_d| \]
  • Similarity: average similarity between each matched caption \( c_i \) and its paired image \( I_i \):
    \[ \text{Sim}(d) = \frac{1}{|\mathcal{I}_d|} \sum_{i \in \mathcal{I}_d} \text{sim}(c_i, I_i) \]

These two metrics capture how frequently a text descriptor appears in CLIP’s training set and how well it aligns with visual content. This provides a proxy for how compatible a descriptor is with CLIP’s vision-language representation space. High-quality descriptors should not only distinguish between image classes but also align with the inductive biases learned by VLMs. Such descriptors are more likely to yield better downstream performance.

CLIP Similarity Score: Average similarity across all descriptors:

\[ \text{CLIP Similarity} = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \text{Sim}(d) \]
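Assuming caption and image embeddings of the pre-training pairs have been precomputed and L2-normalized, the metric can be sketched as follows; the values of k and τ, and the choice to skip descriptors with no matched captions, are illustrative assumptions.

```python
import numpy as np

def clip_similarity(desc_emb, cap_emb, img_emb, k=100, tau=0.7):
    """desc_emb: (D, d) descriptor text embeddings
    cap_emb:  (N, d) pre-training caption embeddings
    img_emb:  (N, d) embeddings of the images paired with those captions
    All embeddings are assumed L2-normalized."""
    sims_to_caps = desc_emb @ cap_emb.T              # (D, N) descriptor-caption cosine sims
    per_descriptor = []
    for row in sims_to_caps:
        topk = np.argsort(-row)[:k]                  # top-k nearest captions
        matched = topk[row[topk] > tau]              # I_d: captions above threshold tau
        freq = len(matched)                          # Freq(d)
        if freq == 0:
            # Assumption: descriptors with no matched captions are skipped.
            continue
        # Sim(d): mean cosine similarity between matched captions and their images.
        sim_d = np.mean(np.sum(cap_emb[matched] * img_emb[matched], axis=1))
        per_descriptor.append(sim_d)
    # CLIP Similarity: average Sim(d) over the descriptor set.
    return float(np.mean(per_descriptor))
```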

CLIP Similarity Results

To build intuition about CLIP’s priors, we first analyze the relationship between descriptor frequency and CLIP similarity. As shown below, we observe a consistent negative correlation between frequency and image-text similarity—a somewhat counterintuitive trend. One possible explanation is that frequent descriptors tend to be coarser and more ambiguous (e.g., “animal”), making it harder for CLIP to associate them with a consistent visual pattern. In contrast, rarer descriptors often refer to more specific, visually grounded concepts that CLIP can align with more reliably.


Does an iterative algorithm like ESCHER converge towards descriptors that align with CLIP's vision-language priors?

CLIP Similarity vs. Accuracy across ESCHER iterations on CIFAR100, CUB, and NABirds

We use CLIP Similarity to evaluate whether descriptor refinement moves descriptors toward CLIP's training distribution. The figures above show how descriptor quality evolves over ESCHER iterations. Across all datasets, CLIP Similarity generally increases over iterations even when accuracy does not, suggesting that the descriptors become increasingly fine-grained and better aligned with CLIP's representation space.

Poster

BibTeX
