A Conjectured Solution to the Adversarial Examples Problem

I’ve been thinking about the adversarial examples question after talking to the RBC Research Institute in Toronto, and after watching this hilarious Silicon Valley clip. In the clip, the character Jian Yang develops what seems to be a universal image classifier. He points his smart-phone to a picture of a hot dog, and then the app correctly classifies the image as a hot dog. His friends then ask him to classify a pizza, and then, he points the camera to the pizza, and after some suspense, the app outputs “not hot dog.”

If one were to compare this app to state of the art image classifiers, one may think that this is not impressive, because it only outputs two categories of things. However, if Jian Yang’s classifier is robust to adversarial examples, in this post I would suggest that his app is the best image classifier in existence.  I also propose an architecture that Jian Yang may have used to construct such a robust classifier. The main purpose of this post is to have my readers point out areas of research that may be related to this line of thinking, and to inspire anyone to use this idea in their own research and engineering work.

The Approach: Train Multiple Neural Network Binary Classifiers each with Different Datasets

In Warde-Farley et al., the authors describe the adversarial examples problem. The idea behind adversarial examples for image classifiers is that an image which is clearly of one category, say, a cat, can be perturbed slightly, resulting in it being categorized as a different image, like, say, a dog. Sometimes these perturbations are so small that they are imperceptible to the human eye, and yet the neural network mis-classifies the image with very high confidence. For a less technical introduction to the topic, see this Open AI post.

To me, the most remarkable thing about adversarial examples is that they work on different neural networks with different architectures.

However, they work on different neural networks trained using the same dataset.

I suggest that engineers train many neural network classifiers, each using different datasets, and then “majority vote their outputs.”

Concretely, the way to do this to train a “hot dog, not  hot dog” classier would be to get a huge number of hot dog and not hot dog images, and then divide the images into N different sets, each with an similar number of hot dog and not-hot dog images. Then, different neural networks are trained to fit each of these data-sets. A sample image (which may be a hot dog or not hot dog), should then be fed through each of these classifiers. The output of the ensemble of these classifiers should be observed and the majority of “hot dog” or “not hot dog” should be selected as the final output.

Why this Might Work

My conjecture is that the reason why neural networks work on different architectures is that they are trained using the same dataset. I suspect that a whole collection of heuristics that train a network using randomized methods drawing from this dataset will, in a “law of large numbers” sense, converge to a property where they are tricked by an adversarial example. Specifically, as Warde-Farley et al. describe, in practice there are usually only a few vectors that you can use to perturb an example image to push it to another category of image, and this vector is similar across different architectures of neural networks. Let’s call this vector the optimally perturbing vector associated with the dataset.  I conjecture that this is a property of the distribution from which the images are drawn, which, when I think about it, is pretty much equivalent to the dataset itself.

If this is true, if one trains multiple networks using different datasets, I conjecture that different adversarial examples will work on each different neural network. However, I suspect that it would be hard to find an adversarial example that tricks the majority of these neural networks simultaneously. If the optimally perturbing vector associated with each dataset is different, adding these all up will cause their contributions to cancel each other out.

This of course needs empirical verification. Also, it is not clear to me whether the combined network that “majority functions” the output would be robust to a unique adversarial example.

Error Control Coding Analogy

To me, this seems like a way to capture the notion of a repetition code in machine learning. Each independently trained network in this architecture is analogous to a symbol in an error control code. I am still trying to think of an appropriate analogy to other types of coding techniques, like computing parity, as is done in low-density parity-check codes.

More than this, I suspect this would be a good technique to increase the accuracy of image classifiers even for typical images. Of course, this also needs empirical verification.

So, has anyone tried this method before? If so, please let me know! From what I can tell ensemble methods (which seem related to this idea) specifically try to train networks using the same dataset, and thus are different from this approach.

Advertisements

Classifying Untypical Examples of Adjective-Noun Pairs: A Speculative Machine Learning Application

One of the most commonly studied machine learning tasks is that of object recognition, which I have discussed in this blog, and a task that has been widely covered in the media. The task of object recognition is this: given an image of an object, say what nouns are in the image.  In this post I am going to argue that much more engineering emphasis should be placed on making “adjective” classifiers,.
Currently in the literature there is a lot of work on using the attributes of an image to classify the object in the image. This is a suitable goal and in some cases this approach performs very well. The intention of this post, however, is not to use identifying attributes towards the goal of classifying objects accurately, but rather to argue that identifying attributes is a worthy goal in and of itself.  
I also propose a different metric by which image classifiers should be measured: in addition to being measured according to their performance on a typical data-set, they should also be measured on adversarially designed novel examples. The key idea is that a classifier is good not only because of high accuracy for  a lot of typical image categories, but also because it can classify things into a large number of categories.
Another goal of this post is to get readers of the blog to point out areas of current research that I may be overlooking to appropriately focus my research. [Update: a friend of mine pointed out that this discussion may be related to a classic intellectual debate between computer vision and machine learning researchers, so I am going to work on clarifying the historical debate surrounding topics like this and discuss it as I learn more].
I will argue as well that engineering work on this problem may have a practical benefit that is enormous relative to the likely work required to do this. Moreover, I speculate that this framework will allow for generative neural nets to mimic a technique used in human creativity.
This line of thinking is actually inspired by a tool I use in language acquisition.
Using Creativity in Language Acquisition
One big hobby of mine is learning languages. In my experience, the task of becoming reasonably fluent in a language boils down to the ability to modify nouns easily. After the first few months of studying a language, I’ve found that I have enough “adjectives” and “adjectival phrases” to express any idea, however sloppily I may do it. That’s because most words can boil down to a combination of simpler words. One of the easiest ways to do this is by learning a lot of adjectives and a lot of nouns.
For example, consider the following language lesson, with a list of a bunch of adjectives and a bunch of nouns with which the nouns can be associated:
Adjectives
fierce
fluffy
furry
slobbery
shiny
spiky
frosty
sparkly
snowy
sleek
Nouns
frog
snake
human
dog
bicycle
porcupine
grape
palm tree
car
camera
If an English learner learned all 10 of the adjectives, and all 10 of the nouns in this list,  by creatively putting these together they can express 100 new concepts. They could describe some very typical concepts , like a “spiky porcupine,” “fierce dog,” or “sleek car,” but would also be able to describe very untypical but still conceivably possible things, like a “fluffy snake,”  “slobbery porcupine” or “furry bicycle.”  Note that the adjectives are chosen because they are properties of objects that can vary without actually changing what the object is. Note that most physical objects can be assigned these adjectives without losing their essential properties. Note of course that it may require something very imaginative happening with the object, like having a snarling mouth if it were a “fierce camera.”
Moreover, I think any human would very easily be able to recognize a “fluffy snake” if they saw one, and would probably note as well the novelty of it. This is easy  to do. As well, from my experience, imagining such things is a fun and useful tool for acquiring language.
I conjecture that recognizing novel types of objects like a “fluffy bicycle” would also be easy to do with existing machine learning algorithms, and very interesting. The main idea is simply: get well trained object classifiers and then put them together with attribute classifiers. Before going into detail, I will discuss some related image classification techniques used in much of the academic literature.
The Standard Image Classification Problem:
A standard engineering task is to classify images from a standard training set. One of the proto-typical examples of this is the MNIST Dataset, which is a collection of images of handwritten numbers and their correct classification (i.e. a vector representing whether the number is a 0, 1, 2, … 9).  Another example is the standard ImageNet Database which contains an enormous collection of images labelled by the nouns that are contained within them. To understand how this database was created, see this excellent and very accessible Ted Talk by Stanford Professor Fei-Fei Li. A standard goal associated with these datasets is to use the test data to train a classification function that then gets high classification accuracy on an associated test data set.
Adjective, or “Attribute” Classification
A goal related to the task of object classification is adjective classification: given an image, describe what adjectives describe the objects in that image. This is related to a concept in machine learning called attribute classification, which is often used to help in image recognition.
There is actually quite a bit of work in the literature on attribute classification. For example, the Sun Attributes Database contains images of landscapes labelled with various attributes typically associated with landscapes, including “foliage,” “biking,” and “no horizon.”
In Lampert et al., the authors show that certain animals can be associated with their attributes, and that learning these attributes can allow for better classification accuracy for the animals themselves. For example, an image of a bird can be classified as a bird if it has its essential attributes, like “having a beak,” “having feathers,” and “having wings.”
 Moreover, Russakovsky et al.,  discuss learning attributes in images, specifically with the goal of finding relationships between attributes and nouns. For example, their technique showed that there was a strong association between being a “marigold” and being a “flower.”
Human Imagination is Different: We can separate “noun” from “commonly associated descriptive properties
Training a neural network to recognize when an object has the essential attributes of an object can be used for image classification. But I would argue that human imagination and ingenuity lies in part in the ability to separate objects from their non-essential attributes, and vary both the objects and adjective in our imaginations freely. The examples of “fluffy snake,”  “slobbery porcupine” or “furry bicycle” of the previous section are examples of this. Moreover, human imagination seems more or less unlimited in this respect.
This naturally suggests a main engineering task: create a “classifier” that can output the “object” in the image and “adjectives describing the object” and in particular the unique or distinguishing adjectives. As an example, should such an algorithm take in as input an image of a watermelon covered in sparkles, the output should “sparkly watermelon.” The property of the watermelon being “sparkly” is the key attribute of such an image, because of the uniqueness of the sparkly attribute compared to other possible attributes of the watermelon like “greenness” or “juiciness”.
We could call such a classifier an “adjective-noun classifier.”
A Possible Simple Way To Implement Such a Classifier
To implement the adjective-noun classifier, I suggest two separate neural networks be trained: one to recognize objects and the other to recognize adjectives describing the objects. There is plenty of work on object classifiers but relatively very little work on adjective classifiers, (although the ImageNet Database suggests that it may only be temporary that only “nouns” are included as labels of its images).
To create the adjective-noun classifier, I suggest the image be fed into an adjective classifier and an object classifier. The likely adjective that is most novel should be output along with the most likely object type. To find the most novel adjective, I suggest consulting a corpus of English text and finding the least common adjective that is still output by the adjective classifier.
Testing the Classifier
This discussion also suggests a novel method of testing an adjective-noun classifier: by having humans creatively generate never-before imagined adjective-noun pairs, creating a work of art representing an object with this adjective,  and then seeing if the classifier can come up with the original adjective-noun description. For example, an artist could conceive of a “sleek watermelon,”  use computer graphics to create an image of a watermelon with fashionable and aerodynmic lines on its outer surface, and then input this image into the adjective-noun classifier. A good classifier should accurately come up with the “sleek watermelon” description as one of its possible descriptions.
This test, to me, is a good test of of how sharply a classifier tunes in on the essential properties of an object. If this is done well, it should accurately classify the object even when non-essential properties are exchanged with other properties.
The Benefit: Huge Returns for Training Labor Costs
I personally think that this should be a primary objective of the machine learning field. The reason for this is twofold:
1. It seems totally doable.
2. The possible benefit seems enormous because of the multiplicative growth in number of things that can be described.
To understand what I mean by the second point, I will define a property of two adjectives. We can consider two adjectives “orthogonal” if it is possible that an object can simultaneously have both these characteristics. So, for example, the adjectives “red” and “spiky” are orthogonal: all objects I can conceive of as being spiky can also be red. Non-orthogonal adjectives may be “bright” and “dim” although even this may not be true. The point is that it seems like the set of classifications that can be given to an image grows exponentially in the number of orthogonal adjectives that our adjective classifier can handle. This informs an obvious engineering goal: engineer a lot of orthogonal adjective classifiers. Each new classifier (say, a thorniness classifier) will expand the set of things our system can classify exponentially.
Another Possible Application: Generating Creative Images Described by Unique Adjective-Noun Pairs
More speculatively, I suspect that training an “object recognizer” and separately an “adjective recognizer” may be useful in generating creative images. By training the inputs to such a classifier, and using an optimization heuristic to optimize an objective function that reaches a minimum when the object classification is 100% being a particular noun and 100% the surprising adjective, we may be able to construct creative images. For example, consider starting with an “apple” image, and then applying gradient descent. Such a heuristic may result in the apple having frost-like textures on its surface, and from this we may be able to obtain an image of an apple that is “frozen.”
Obviously this is entirely speculative, and actually engineering such a thing may be quite an undertaking. I am curious if there are examples of something like this in the literature. Such a generative neural network architecture may then be in a sense producing imaginative works by simulating the human process of imagining novel adjective-noun pairs and then generating images that fit this category.
Summary
In this post I proposed an engineering goal inspired from linguistics: a noun-adjective classifier. I also proposed an architecture to produce such a classifier. As well, I proposed that the metric used to test such a classifier is how it handles a broad set of novel examples of adjective-noun pairs. I suggested that this was a good way to test how well a classifier hones in on precisely the key attributes of an object in its classification.  I speculated that such an architecture may be modified into a generative neural network as a way to automatically generate creative images, in a sense simulating the aspect of human creativity that allows humans to imagine objects with properties borrowed from other objects.

The “blogs I like”

On the side of my WordPress page I am compiling a collection of “blogs I like.” If you are in the mathematics and computer science world, it should come as no surprise that some of these entries are there, including Scott Aaronson’s Shtetl-Optimized and the blog of Terry Tao. In Terry Tao’s blog, I particularly recommend There’s more to mathematics than rigour and proofs. During my PhD I read this post, and I could not put into better words the importance of mathematical rigour than Terry did with this post:

“The point of rigour is not to destroy all intuition; instead, it should be used to destroy bad intuition while clarifying and elevating good intuition. It is only with a combination of both rigorous formalism and good intuition that one can tackle complex mathematical problems; one needs the former to correctly deal with the fine details, and the latter to correctly deal with the big picture. ”

During my PhD I went deep into mathematical rigour for my paper on the energy complexity of LDPC codes. It was through painstaking mathematical rigour that I realized a previous definition I had for what was an LDPC decoder circuit was not quite right. So, in a very intense month near the end of my PhD I used this rigour to correct my slightly wrong intuition, and a result of that was me coining the term “zig-zaggable set of bipartitions.” This work lead to what I think is a really fun generalization of Thompson’s zig-zag argument relating minimum bisection width of a circuit’s graph to it’s area, which you can find in the appendix of my paper.

As for other blogs I’ve linked to, there is one which, according to WordPress stats, is getting the most traffic, and that is Sinosplice, written by John Pasden. For Chinese language learners like myself, John Pasden is probably more well known as “John” from Chinesepod, which is, in my opinion, one of the greatest language learning resources ever created. Chinesepod is also, along with my Chinese teacher Hong Laoshi, one of the main reasons why I read and write Chinese fluently. In this blog John writes about the curiosities of living in China as a foreigner and makes many interesting observations about the Chinese language. I absolutely recommend that people check it out.

Earlier today my collaborator and all-round brilliant Carnegie-Mellon professor Pulkit Grover posted a comment that pointed out a connection between my current work on neural network pruning algorithms and the work of Dmitri Chklovskii. This is an absolutely perfect connection and I will address Pulkit’s comments soon. But, from a meta-perspective, I consider my blog a success! Writing about my literature review on a blog post as I go along allowed someone to point out an absolutely relevant and fascinating connection (which I’d probably have to make eventually anyways once the work goes through peer review). My blog is thus demonstrably, as I hoped, a tool for efficient science! I hope this acts as an example for other researchers, including my Kschischang-group colleague Reza Rafie (whose great blog I have also linked to) about a way to do research more efficiently than I did during my PhD. I look forward to reading Reza’s posts. Reza’s most recent post summarizes some of the work of the late Maryam Mirzakhani, who I was devastated to hear died so young. When she won her Field’s Medal, one of her media interviews where she talked about how she integrates her work into her daily life had great influence on how I do my work, even though she is in quite a different field than me.

Another blog post I have linked to is that of my good friend and star software engineer Jamie Wong. As my work has now lead me to be doing a lot of software engineering (i.e. figuring out how to use TensorFlow), Jamie’s blog has been hugely helpful and absolutely fascinating. I recommend people check it out.

I will keep adding links to blogs I find interesting as I continue with my research.

Update: I added a link to Lance Fortnow and Bill Gasarch’s Computational Complexity Blog to my sidebar. This blog was invaluable to me during my PhD research.

 

Optimal brain damage and energy aware pruning

In this post I will elaborate on the argument made in my previous post where I discussed how I am using an energy computation model called Grover’s information friction model to design energy efficient neural network architectures. In this post I will first discuss the current work on making energy efficient neural network architectures, then discuss some of my work on specialized decoding circuits for communication channels, and finally explain how my work suggests that there are huge energy savings possible in modern neural network architectures. The key point will be in recognizing that machine learning classifiers, in fact, are decoding circuits, albeit for a very strange code and channel.

Pruning Neural Networks for Specialized Circuitry

A very commonly used inference technique is using a set of training data to train a neural network classifier. The concept of a classifier is simple: it’s a function that takes some sample input (say, an image) and then outputs a classification of that input (so for example, an image of a cat, once put into a good image classifier should output “cat”). Sometimes a classifier may produce instead a vector of probabilities that the inputs belong to various categories.

Creating good image classifiers is one of the primary challenges of machine learning, and in many ways is one of the most significant engineering challenges of our time. Image classifiers in a sense must emulate what the visual cortex does in the brain.

Modern advances in machine learning have discovered something remarkable: if you use what are called deep convolutional neural networks and use optimization heuristics like gradient descent with back-propagation, and you have a big enough and well chosen data-set, convolutional neural networks start classifying images very well.

Currently, many groups are trying to create specialized neural network hardware. There are two main reasons why now there is so much focus is on making specialized circuitry for neural networks (as argued in a talk at FCRC by Olivier Temam and summarized by Scott Aaronson):

1. Neural networks are starting to work very well. Of course, neural networks have worked quite well for many decades (for example, convolutional neural networks have been used by the post office for zip-code classification since the late 80s). However, much bigger neural networks are being trained to do serious image recognition work.

2. In previous eras, Moore’s Law was still in full swing. Making a specialized circuit for neural network computation didn’t make sense: by the time the long circuit fabrication process was complete, the next generation of general processor would simply simulate the specialized circuitry as efficiently as the specialized circuitry. But, now that Moore’s law is reaching a fundamental physical limit, it makes sense to construct specialized circuits to gain extra computational power over general-purpose CPUs and GPUs.

Of course, modern well-trained neural networks are very big, and big specialized circuits are very expensive. Thus, there has been extensive work on creating sparse neural networks more suitable for circuit implementations.

In the 1990s, Le Cun et al. summarized many techniques used at the time to create more “sparse” neural networks. The reasons for making a network more sparse were often motivated by making the trained network generalize better. Neural networks with too many parameters have a tendency to over-fit the training data and thus not work well in practice. On the other hand, neural networks that are too small lack the expressive power to complete their task. Thus, the goal of pruning neural networks is to find “just the right” size of network in this trade-off.

More recently, techniques adapted from the earlier optimal brain damage research have proven enormously successful. For example, Han et al. use pruning techniques to reduce the size of a widely known convolutional neural network called AlexNet by 35X. The main idea of this approach is to train the network, then remove weights in the network that are close to 0, and then retrain the network with the reduced number of weights in an iterative fashion.  Yang et al. use similar techniques to reduce energy consumption of neural networks using an energy model derived from actual measurements of real circuits. Molchanov et al. propose greedy algorithms to determine which edges to prune in the network and they show that these techniques work well for large image classification tasks. From my perspective, optimally and efficiently pruning neural network architectures is an essential step for implementing specialized neural network circuits.

Much of this work deals with neural network computation on general purpose GPUs and CPUs in which energy consumption is dominated by memory accesses. However, specialized circuits should not necessarily use such an architecture. One possible different architecture for specialized deep learning circuits is a directly-implemented method where each “edge” in the neural network graph is a wire connecting processing nodes. This is one type of architecture that I studied for specialized error control coding circuits.

Specialized Error Control Coding Circuits

Another type of inference circuit is the error control decoder. These are circuits used everywhere in modern life, from cellphones to deep space communication. They infer a sent message when given a noisy version of this message as input. Specialized circuits for  these algorithms are already implemented, and in my thesis I characterized their fundamental energy, time, and reliability trade-offs.

A widely used error control decoding algorithm is called the sum-product algorithm, first used by Gallager for decoding low-density parity-check (LDPC) codes. It was also used by  Pearl for inference in Bayesian networks, and generalized by Kschischang et al., who showed that all of these problems reduce to the problem of marginalizing a function, which can be done computationally efficiently using an algorithm defined on a factor graph of that function. Thus, like neural networks, the sum-product algorithm is an algorithm defined on a graph. In the case of LDPC decoding, this graph is called the “Tanner graph” of the decoder. But the analogy is stronger than that. In fact, viewed in the right way, a neural network classifier, or any classifier for that matter, is a decoder.

The Key Connection: Classifiers Are Decoders

I first heard the argument that classifiers are decoders from Martin Wainwright at a tutorial talk he gave at ISIT 2015. To see this, let’s review the classical communication channel. In such a channel, there is some set of messages, then one of these messages is mapped to a (usually) longer message, sent through a noisy channel (like, say, a transmission wire), and then the receiver has to figure out what the original message was. Here’s a simple text drawing of the classical channel:

Message —-> Channel Coder —-> Channel —-> Decoder

The key point here is that the decoder has to figure out what the original message was. The communication system designer gets to design the channel coding and the method of transmission. The laws of physics, as applied to the transmission technique results in the properties of the channel.

Now let’s look at classifiers in a different way. To make things concrete, suppose our classifier was a “dog, cat, monkey” classifier. The input is an image and the output is the classification of the image as a dog, cat, monkey, or neither. We can think of the set {dog, cat, monkey, neither} as the set of possible “messages” to be sent. Then, the random distribution from which images are drawn in our Universe is the encoder and channel combined. Now, this image is fed into the classifier, and the task is to figure out what the original message was (dog, cat, monkey, or neither). This is exactly the task that the channel decoder had to perform: figure out the original message. The difference is, of course, we don’t know from what distribution “cat images” or “monkey images” are drawn, leading to a bigger challenge. However, unlike with error control coding, the set of messages for many classifiers are rather small: standard ImageNet datasets contain around 1000 image categories, or “messages,” but modern error control codes can have block lengths on the order of a 10^6, meaning that the number of possible messages is on the order of 2^{10^6}, vastly exceeding the number of atoms in the known Universe.

Because classifiers are decoders, this is why I conjecture that the same trade-offs that exist in error control coding circuits will extend naturally to machine learning circuits.

For LDPC decoding, in my thesis I showed that if a graph theoretic property called the minimum bisection width of a Tanner graph of the code is high, then the energy consumption of a directly-instantiated LDPC decoder must also be high. But in fact, applying the exact same arguments of my thesis to a neural network graph implies the exact same thing: high minimum bisection width of the neural network graph implies high circuit complexity for directly instantiated neural networks. And by high, I don’t just mean a big constant factor; these results are asymptotic scaling rules. As with coding circuits, and for the exact same reason, as neural networks get bigger, the difference in energy between neural networks with different energy scaling can be made arbitrarily big. And now, neural networks are getting huge (see, for example, Krizhevsky et al., which trained a neural network with 60 million parameters and 650 thousand neurons). This is why I am designing neural network architectures optimizing for this information-friction parameter.

The energy, time, and reliability trade-off in machine learning algorithms

My PhD thesis showed that there was a fundamental and universal trade-off between time, energy, and reliability in a special, widely used type of inference circuit called a decoder, suggesting that a well-engineered decoder must judiciously trade-off these fundamental parameters. But the human brain is also a well evolved inference machine that must judiciously trade-off time, energy, and reliability. In the research on optimal brain damage for neural networks, the information friction energy measure I used in my thesis never comes into play. But, this is an energy measure for which the human brain must be well adapted in order to survive in our environment, because it is a fundamental practical energy limit in our Universe (see Grover, or Chapter 8 of my thesis). Thus, I am now developing tools to train neural networks to mimic human evolution’s time, energy, and reliability trade-offs. The hypothesis is that tuning these parameters correctly will have multiple advantages:

1. The networks that result may mimic human brain structures more accurately, and thus have a better chance of getting high classification accuracy.

2. These techniques will result in more energy efficient neural network architectures, which is essential for their widespread application. This will be especially important for the creation of specialized machine learning circuits.

So what do I predict will be the result of this research? Currently, IBM’s Watson supercomputer (known for winning on the U.S. television game show Jeopardy!) uses 4000 times more energy than its human competitors. As well, modern machine learning tools often make un-humanlike errors (see, for example, the research on adversarial examples here, here, here, and also the discussion here). Thus I conjecture that this research will lead to the discovery of key tools to tune machine learning architectures to both increase reliability while also reducing energy consumption.