Text-based visual descriptors—ranging from simple class names to more descriptive phrases—are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors span a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics—Global Alignment and CLIP Similarity—that move beyond accuracy. These metrics shed light on how different descriptor generation strategies interact with foundation model properties, offering a more principled way to study descriptor effectiveness than accuracy alone.
Descriptor quality is typically evaluated by downstream classification accuracy. While informative, this metric offers limited insight into why certain descriptors succeed or fail, and provides little guidance for descriptor discovery or refinement. Moreover, higher accuracy does not necessarily imply more meaningful descriptors. As shown above, random descriptor generation can outperform zero-shot LLM-based methods such as the original Classification by Description approach, and even an iterative refinement algorithm like ESCHER. Even though the semantic quality of the random descriptors is much worse, they still achieve higher accuracy. Clearly, there is a need for more principled ways to assess descriptor quality beyond accuracy.
In this work, we propose a novel approach to assessing descriptor quality by probing the relationship between textual descriptors and the underlying VLM. We design two new metrics:
1) Global Alignment: Measures the representational capacity of a set of descriptors.
2) CLIP Similarity: Measures how well a set of descriptors aligns with the VLM's pre-training data.
Classification by description using VLMs can be viewed as a semantic projection, where images are mapped to similarity scores over a set of textual descriptors. This forms a new representation space, with each dimension corresponding to a specific concept. Conceptually, this resembles principal component analysis, except that the basis directions are expressed in natural language.
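To make this view concrete, here is a minimal sketch of such a semantic projection, assuming the Hugging Face transformers CLIP API and a couple of placeholder classes and descriptors (not the actual prompts used in our experiments):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder descriptors per class, purely for illustration.
descriptors = {
    "golden retriever": ["a dog with long golden fur", "a dog with floppy ears"],
    "tabby cat": ["a cat with striped fur", "a small feline with pointed ears"],
}

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

flat_descriptors = [d for ds in descriptors.values() for d in ds]
image = Image.open("example.jpg")  # any test image

with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=flat_descriptors, return_tensors="pt", padding=True))
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))

# Project the image onto the descriptor "axes": each coordinate is the
# image's similarity to one natural-language concept.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
scores = (img_emb @ text_emb.T).squeeze(0)  # shape: (num_descriptors,)

# Averaging the coordinates belonging to each class gives the prediction.
class_scores, offset = {}, 0
for cls, ds in descriptors.items():
    class_scores[cls] = scores[offset:offset + len(ds)].mean().item()
    offset += len(ds)
print(max(class_scores, key=class_scores.get))
```

The vector `scores` is the descriptor-space representation referred to above; the analyses that follow operate on representations of this form rather than on the raw CLIP embeddings.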
Global Alignment evaluates the quality of this descriptor space using metrics from prior work on representational alignment. We measure the Mutual-KNN alignment between the visual descriptor space and a strong reference image embedding space. A high alignment value indicates that the descriptors capture meaningful visual structure, which likely corresponds to high semantic quality.
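A minimal sketch of how such a mutual k-NN score can be computed is shown below; the function name and the use of cosine similarity in both spaces are choices made here for illustration, with `descriptor_repr` holding each image's similarity scores over the descriptors and `reference_repr` holding embeddings of the same images from a strong reference encoder.

```python
import torch

def mutual_knn_alignment(descriptor_repr: torch.Tensor,
                         reference_repr: torch.Tensor,
                         k: int = 10) -> float:
    """Average fraction of k-nearest neighbors shared by two
    representation spaces of the same N images (higher = better aligned)."""
    def knn(x: torch.Tensor) -> torch.Tensor:
        x = x / x.norm(dim=-1, keepdim=True)
        sim = x @ x.T
        sim.fill_diagonal_(float("-inf"))   # exclude each image itself
        return sim.topk(k, dim=-1).indices  # (N, k) neighbor indices

    nn_desc, nn_ref = knn(descriptor_repr), knn(reference_repr)
    overlaps = [
        len(set(nn_desc[i].tolist()) & set(nn_ref[i].tolist())) / k
        for i in range(descriptor_repr.shape[0])
    ]
    return sum(overlaps) / len(overlaps)

# Toy usage: 100 images, a 50-descriptor space vs. a 768-dim reference space.
score = mutual_knn_alignment(torch.randn(100, 50), torch.randn(100, 768), k=10)
```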
Due to the imbalanced nature of CLIP’s training data, semantically similar text descriptors may not yield similar downstream performance. To analyze this, we propose CLIP Similarity, a metric that measures how well descriptor candidates align with CLIP's pre-training data.
We define two statistics for each descriptor \( d \), based on its top-\( k \) nearest captions \( \{c_1, \dots, c_k\} \) retrieved by cosine similarity in CLIP’s text embedding space. Let \( \text{sim}(a, b) \) denote cosine similarity between two L2-normalized embeddings.
These two metrics capture how frequently a text descriptor appears in CLIP’s training set and how well it aligns with visual content. This provides a proxy for how compatible a descriptor is with CLIP’s vision-language representation space. High-quality descriptors should not only distinguish between image classes but also align with the inductive biases learned by VLMs. Such descriptors are more likely to yield better downstream performance.
CLIP Similarity Score: the average of the per-descriptor similarity statistic across the full set of descriptors.
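One way to write this down under the notation above is sketched here; the symbol \( \mathcal{D} \) for the full descriptor set, the map \( v(c_i) \) to the training image paired with caption \( c_i \), and the choice of averaging descriptor-to-image similarities are illustrative assumptions rather than exact definitions.

\[
\text{SimScore}(d) = \frac{1}{k} \sum_{i=1}^{k} \text{sim}\bigl(d,\, v(c_i)\bigr),
\qquad
\text{CLIP Similarity}(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \text{SimScore}(d).
\]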
To build intuition about CLIP’s priors, we first analyze the relationship between descriptor frequency and CLIP similarity. As shown below, we observe a consistent negative correlation between frequency and image-text similarity—a somewhat counterintuitive trend. One possible explanation is that frequent descriptors tend to be coarser and more ambiguous (e.g., “animal”), making it harder for CLIP to associate them with a consistent visual pattern. In contrast, rarer descriptors often refer to more specific, visually grounded concepts that CLIP can align with more reliably.
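A sketch of this kind of analysis is given below; the frequency proxy (a count of near-duplicate captions above a similarity threshold), the visual-alignment proxy (mean similarity to the images paired with the top-\( k \) captions), and the use of Spearman correlation are all assumptions of this sketch, and the random tensors stand in for real CLIP embeddings of descriptors and pre-training pairs.

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

# Stand-ins for precomputed, L2-normalized CLIP embeddings:
#   desc_emb    (D, dim): text embeddings of candidate descriptors
#   caption_emb (N, dim): text embeddings of pre-training captions
#   image_emb   (N, dim): image embeddings paired with those captions
D, N, dim, k, tau = 200, 10_000, 512, 50, 0.1  # tau is illustrative; tune for real embeddings
desc_emb = F.normalize(torch.randn(D, dim), dim=-1)
caption_emb = F.normalize(torch.randn(N, dim), dim=-1)
image_emb = F.normalize(torch.randn(N, dim), dim=-1)

text_sim = desc_emb @ caption_emb.T        # (D, N) descriptor-caption similarities
topk = text_sim.topk(k, dim=-1)

# Frequency proxy: how many captions are close to the descriptor in text space.
frequency = (text_sim >= tau).sum(dim=-1).float()

# Visual-alignment proxy: mean similarity to the images paired with the
# descriptor's top-k retrieved captions.
image_sim = torch.stack(
    [desc_emb[i] @ image_emb[topk.indices[i]].T for i in range(D)])
visual_alignment = image_sim.mean(dim=-1)

rho, _ = spearmanr(frequency.numpy(), visual_alignment.numpy())
print(f"Spearman correlation (frequency vs. image-text similarity): {rho:.3f}")
```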
We use CLIP Similarity to evaluate whether descriptor refinement moves toward CLIP's training distribution. The figure above shows how descriptor quality evolves over iterations of ESCHER. Across all datasets, CLIP Similarity generally increases over ESCHER iterations even when accuracy does not, suggesting that descriptors become increasingly fine-grained and better aligned with CLIP's representation space.