SAEfarer: exploring text classification models with sparse autoencoders
Leveraging SAEs to analyze the behavior of text classification LMs.
As language models (LMs) rise in prominence, there is interest in making them more transparent in order to better understand and steer their internal behavior. Recent work in interpreting LMs has focused on using sparse autoencoders (SAEs) to break down neuron activations at a given layer in the LM into human-understandable features, where each feature represents a single concept that the model knows. In this paper, we present initial work on leveraging SAEs to analyze the behavior of text classification LMs. We present techniques for exploring the relationships between the features extracted by the SAE and the model’s predictions and errors. We integrate these techniques into SAEfarer, an open-source prototype visual analytics tool for analyzing text classification models.
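As a rough illustration of the idea, the sketch below encodes layer activations into SAE features and then ranks features by how often the classifier is wrong when each feature is active. It is not SAEfarer's implementation: the function names (`sae_encode`, `feature_error_rates`), the ReLU encoder form, the activity threshold, and the hand-picked toy weights and labels are all assumptions made for this example. A real SAE would be trained on the model's activations.

```python
def sae_encode(x, W_enc, b_enc):
    """Encode one activation vector into feature activations.

    Assumes a common SAE encoder form: ReLU(W_enc @ x + b_enc).
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def feature_error_rates(features, y_true, y_pred, threshold=0.0):
    """Rank features by classifier error rate on examples where they fire.

    Returns (feature_index, n_active_examples, error_rate) tuples,
    highest error rate first. Features that never fire are skipped.
    """
    ranked = []
    for j in range(len(features[0])):
        active = [i for i, f in enumerate(features) if f[j] > threshold]
        if not active:
            continue
        errors = sum(1 for i in active if y_true[i] != y_pred[i])
        ranked.append((j, len(active), errors / len(active)))
    return sorted(ranked, key=lambda t: t[2], reverse=True)


# Toy data (hypothetical): four 2-d activation vectors, three SAE features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
activations = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [3.0, 0.0]]
y_true = [0, 1, 1, 0]  # gold labels
y_pred = [0, 1, 0, 1]  # classifier predictions (last two are wrong)

features = [sae_encode(x, W_enc, b_enc) for x in activations]
ranked = feature_error_rates(features, y_true, y_pred)
for idx, n_active, rate in ranked:
    print(f"feature {idx}: active on {n_active} example(s), error rate {rate:.2f}")
```

A ranking like this only flags features that co-occur with errors. It does not show that a feature causes them, so in practice one would still inspect the examples on which a flagged feature fires.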