Identifying interpretable subspaces in image representations
An interpretability framework for explaining features of image representations using contrasting concepts and captions.
We propose Automatic Feature Explanation using Contrasting Concepts (FALCON), an interpretability framework to explain features of image representations. For a target feature, FALCON captions its highly activating cropped images using a large captioning dataset (such as LAION-400m) and a pre-trained vision-language model such as CLIP. Each word in the captions is scored and ranked, yielding a small number of shared, human-understandable concepts that closely describe the target feature. FALCON also applies contrastive interpretation using low-activating (counterfactual) images to eliminate spurious concepts. While many existing approaches interpret features independently, we observe in state-of-the-art self-supervised and supervised models that less than 20% of the representation space can be explained by individual features. We show that features in larger spaces become more interpretable when studied in groups and can be explained with high-order scoring concepts through FALCON. We discuss how extracted concepts can be used to explain and debug failures in downstream tasks. Finally, we present a technique to transfer concepts from one (explainable) representation space to another unseen representation space by learning a simple linear transformation. Code is available at https://github.com/NehaKalibhat/falcon-explain.
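The concept-transfer step mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes the "simple linear transformation" is an affine map fitted by ordinary least squares on paired representations of the same images; the abstract does not state the fitting procedure or the direction of the map, so both are assumptions here. The function names `fit_linear_map` and `apply_linear_map` and the synthetic data are hypothetical and are not taken from the FALCON code base.

```python
import numpy as np


def fit_linear_map(source, target):
    """Fit an affine map by least squares so that [source, 1] @ weights ~ target.

    Assumption: ordinary least squares stands in for whatever fitting
    procedure the authors actually use.
    """
    ones = np.ones((source.shape[0], 1))
    augmented = np.hstack([source, ones])  # append a bias column
    weights, *_ = np.linalg.lstsq(augmented, target, rcond=None)
    return weights


def apply_linear_map(source, weights):
    """Project source-space representations into the target space."""
    ones = np.ones((source.shape[0], 1))
    return np.hstack([source, ones]) @ weights


# Synthetic stand-ins for paired representations of the same 200 images.
rng = np.random.default_rng(0)
unseen = rng.normal(size=(200, 8))        # hypothetical unseen space (8-d)
true_map = rng.normal(size=(8, 5))
explained = unseen @ true_map + 0.5       # hypothetical explainable space (5-d)

weights = fit_linear_map(unseen, explained)
mapped = apply_linear_map(unseen, weights)
max_error = float(np.max(np.abs(mapped - explained)))
```

Once images from the unseen space are projected this way, concepts already attached to features of the explainable space could in principle be read off the projected features. On real representations the fit would be approximate, unlike this noise-free toy data.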
Latest publications
Topological representations of local explanations
A topology-based framework for comparing and understanding local explainability methods in machine learning.
ICML
Towards ground truth explainability on tabular data
Using copulas to generate synthetic tabular data with ground truth explanations for enhanced interpretability of AI models.
ICML
Dynamic guardian models: realtime content moderation with user-defined policies
Specialized classifiers that evaluate text based on predefined trustworthiness objectives.
ICML