TCAV: Interpretability Beyond Feature Attribution

An overview of Google AI’s model interpretability technique, which explains models in terms of human-friendly concepts.

How convolutional neural networks see the world

It’s not enough to know if a model works, we need to know how it works: Sundar Pichai

The emphasis today is slowly moving from model predictions alone towards model interpretability. The real essence of interpretability, however, should be to make machine learning models more understandable to humans, especially to those who don’t know much about machine learning. Machine learning is a very powerful tool, and with such power comes a responsibility to ensure that values like fairness are well reflected in the models. It is also important to ensure that AI models do not reinforce the bias that exists in the real world. To tackle such issues, Google AI researchers are working on a solution called TCAV (Testing with Concept Activation Vectors) to understand what signals neural network models use for their predictions.


Google Keynote (Google I/O’19)

In his keynote address at Google I/O 2019, Sundar Pichai talked about how Google is trying to build a more helpful Google for everyone, which includes building AI for everyone. He reiterated that bias in machine learning is a matter of concern and that the stakes are even higher when it comes to AI. In a bid to make AI more responsible and transparent, he discussed the TCAV methodology. In this article, I’d like to give an overview of TCAV and how it intends to address the issue of bias and fairness. The article will be light on math, so if you want a deeper look, you can read the original research paper or visit TCAV’s GitHub repository.

Need for Another Interpretability Technique

In the ML realm, there are mainly three kinds of interpretability techniques:

Types of Interpretability Techniques

Mostly, you are given a model that is the product of years of engineering and expertise, and you can neither change its architecture nor retrain it. So how do you go about interpreting a model about which you have no clue? TCAV is a technique that aims to handle such scenarios.

Most machine learning models are designed to operate on low-level features like edges and lines in a picture or, say, the colour of a single pixel. This is very different from the high-level concepts more familiar to humans, like stripes on a zebra. For instance, if you have an image, then every pixel of that image is an input feature. Although it is possible to look at every single pixel and read off its numerical value, these values make no sense to humans. We wouldn’t say that the 5th pixel of this image has a value of 28; as humans, we say that there is a blue river in the picture. TCAV tries to overcome this issue.

Also, typical interpretability methods require you to have one particular image that you are interested in understanding. TCAV gives an explanation that is generally true for a class of interest, beyond one image (global explanation).

TCAV approach

Let’s say we have a model that is trained to detect zebras in images. We would want to know which variables actually played a role in deciding whether an image showed a zebra or not. TCAV can help us understand whether the concept of stripes was important to the model’s prediction, and in this case it was.

TCAV shows that stripes are an important ‘concept’ when deciding if an image contains a zebra or not

Similarly, consider a classifier trained on images of doctors. If the training data consisted mostly of males wearing white coats and stethoscopes, the model would assume that being male with a white coat was an important factor in being a doctor. How would this help us? Well, it would bring out the bias in the training data, which has fewer images of females, and we could easily rectify that.

TCAV shows that being male is an important ‘concept’ when deciding if an image shows a doctor or not

So what is TCAV?

Testing with Concept Activation Vectors (TCAV) is a new interpretability initiative from the Google AI team. Concept Activation Vectors (CAVs) provide an interpretation of a neural net’s internal state in terms of human-friendly concepts. TCAV uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result; for example, how sensitive a prediction of “zebra” is to the presence of stripes.

The team, led by Been Kim along with Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas and Rory Sayres, aims to make humans empowered by machine learning, not overwhelmed by it. Here is what Been thinks about interpretability.


TCAV essentially learns ‘concepts’ from examples. For instance, TCAV needs a couple of examples of ‘female’ and of something ‘not female’ to learn a “gender” concept. The goal of TCAV is to determine how important a concept (e.g., gender, race) was for a prediction in a trained model, even if the concept was not part of the training.

Continuing with the ‘Zebra Classifier’, consider that the neural network consists of inputs x ∈ R^n and a feedforward layer l with m neurons, such that input inference and its layer l activations can be seen as a function f_l : R^n → R^m.

Testing with Concept Activation Vectors
  • Defining a concept of interest

Given a set of examples that represent the concept (e.g., stripes) (a), or an independent data set with the concept labelled (b), and a trained network (c), TCAV can quantify the model’s sensitivity to the concept for that class.

  • Finding Concept Activation Vectors (CAVs)

Now, we need to find a vector in the space of activations of layer l that represents this concept. CAVs are learned by training a linear classifier to distinguish between the activations produced by a concept’s examples and those produced by random counterexamples, at any layer (d). We then define a “concept activation vector” (or CAV) as the normal to the hyperplane that separates examples without the concept from examples with the concept in the model’s activations.

  • Calculating Directional Derivatives

For the class of interest (zebras), TCAV uses the directional derivative S_{C,k,l}(x) to quantify conceptual sensitivity (e). Here C is the concept, k the class and l the layer, so S_{C,k,l}(x) quantitatively measures the sensitivity of the model’s predictions to a concept at any model layer.
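
To make the last two steps concrete, here is a minimal NumPy sketch on synthetic data. It is not the code from the TCAV repository, and everything in it is an assumption made for illustration: the layer activations are random vectors, the ‘stripes’ direction is planted by hand in the first neuron, `zebra_logit` is a made-up stand-in for the part of the network above layer l, and the function names `train_cav`, `conceptual_sensitivity` and `tcav_score` are hypothetical. The CAV is taken as the unit normal of a small logistic-regression boundary, S_{C,k,l}(x) is approximated with a finite difference along the CAV, and the score is the fraction of class examples with positive sensitivity, which is how the original paper summarises sensitivity over a whole class.

```python
import numpy as np


def train_cav(concept_acts, random_acts, lr=0.1, steps=500):
    """Fit a logistic-regression boundary between concept and random
    activations and return its unit normal vector (the CAV)."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept)
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on the weights
        b -= lr * np.mean(p - y)                # gradient step on the bias
    return w / np.linalg.norm(w)


def conceptual_sensitivity(logit_fn, act, cav, eps=1e-4):
    """Central finite-difference estimate of the directional derivative
    of the class logit along the CAV, evaluated at one activation."""
    return (logit_fn(act + eps * cav) - logit_fn(act - eps * cav)) / (2 * eps)


def tcav_score(logit_fn, class_acts, cav):
    """Fraction of class examples whose sensitivity to the concept is positive."""
    s = np.array([conceptual_sensitivity(logit_fn, a, cav) for a in class_acts])
    return float(np.mean(s > 0))


# --- Synthetic stand-ins (assumptions, not real network activations) ---
rng = np.random.default_rng(0)
m = 8                                        # neurons in the toy layer l
concept_acts = rng.normal(size=(100, m))
concept_acts[:, 0] += 3.0                    # 'stripes' images excite neuron 0
random_acts = rng.normal(size=(100, m))      # random counterexamples


def zebra_logit(act):
    """Toy class logit: depends mostly on neuron 0, weakly on the rest."""
    return 2.0 * act[0] + 0.1 * np.sum(np.tanh(act[1:]))


zebra_acts = rng.normal(size=(50, m))        # activations of 'zebra' images
cav = train_cav(concept_acts, random_acts)
score = tcav_score(zebra_logit, zebra_acts, cav)
print("CAV component along the planted direction:", round(float(cav[0]), 3))
print("TCAV score for 'stripes' on the zebra class:", score)
```

Because the toy logit is built to depend mostly on the planted direction, the score should come out at or close to 1. With a real model you would replace the synthetic arrays with activations recorded at layer l and use the network’s true gradient instead of a finite difference.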

Here is a step by step guide on using TCAV in your workflow:

Insights and biases

TCAV was applied to two widely used image classification models: InceptionV3 and GoogLeNet.


While the results show the importance of the red concept for fire engines, some results also confirmed an inherent bias in the models towards gender and race, even though the models were not explicitly trained on these categories. For example:

  • The ping-pong ball and rugby ball classes were highly correlated with a particular race.
  • The arms concept was more important for predicting the dumbbell class than other concepts.


TCAV is a step toward creating a human-friendly linear interpretation of the internal state of a deep learning model, so that questions about model decisions may be answered in terms of natural high-level concepts.
