I’ve been thinking about the adversarial examples question after talking to the RBC Research Institute in Toronto, and after watching this hilarious Silicon Valley clip. In the clip, the character Jian Yang develops what seems to be a universal image classifier. He points his smartphone at a picture of a hot dog, and the app correctly classifies the image as a hot dog. His friends then ask him to classify a pizza. He points the camera at the pizza, and after some suspense, the app outputs “not hot dog.”
If one were to compare this app to state-of-the-art image classifiers, one might think that it is not impressive, because it only outputs two categories. However, in this post I suggest that if Jian Yang’s classifier is robust to adversarial examples, his app is the best image classifier in existence. I also propose an architecture that Jian Yang may have used to construct such a robust classifier. The main purpose of this post is to have my readers point out areas of research that may be related to this line of thinking, and to inspire anyone to use this idea in their own research and engineering work.
The Approach: Train Multiple Neural Network Binary Classifiers each with Different Datasets
In Warde-Farley et al., the authors describe the adversarial examples problem. The idea behind adversarial examples for image classifiers is that an image which is clearly of one category, say, a cat, can be perturbed slightly so that it is assigned to a different category, say, a dog. Sometimes these perturbations are so small that they are imperceptible to the human eye, and yet the neural network misclassifies the image with very high confidence. For a less technical introduction to the topic, see this OpenAI post.
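To make the idea of a small perturbation concrete, here is a minimal sketch on a toy linear classifier rather than a neural network. Everything in it is an assumption for illustration: the weights are random instead of trained, the "image" is a random vector, and the perturbation rule (step each pixel against the sign of its weight) is the linear special case of the fast gradient sign method, not something taken from the paper above.

```python
import numpy as np

def predict(w, b, x):
    """Return 1 ("hot dog") if the linear score is positive, else 0."""
    return int(np.dot(w, x) + b > 0)

def perturb(w, x, eps):
    """Move every pixel by at most eps in the direction that lowers the score."""
    return x - eps * np.sign(w)

rng = np.random.default_rng(0)
d = 10_000                   # number of "pixels" (hypothetical)
w = rng.normal(size=d)       # stand-in for trained weights
x = rng.normal(size=d)       # stand-in for a clean image
b = 1.0 - np.dot(w, x)       # bias chosen so the clean score is about 1.0

x_adv = perturb(w, x, eps=0.001)

print(predict(w, b, x), predict(w, b, x_adv))
print(np.max(np.abs(x_adv - x)))  # largest change to any single pixel
```

Each pixel moves by only 0.001, but the score drops by 0.001 times the sum of the absolute weights, which grows with the number of pixels. That is why a change too small to notice per pixel can still flip the label.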
To me, the most remarkable thing about adversarial examples is that they work on different neural networks with different architectures.
However, the networks they transfer across are typically trained using the same dataset.
I suggest that engineers train many neural network classifiers, each using different datasets, and then “majority vote their outputs.”
Concretely, to train a “hot dog, not hot dog” classifier this way, one would collect a huge number of hot dog and not-hot-dog images, and then divide the images into N different sets, each with a similar number of hot dog and not-hot-dog images. Then, a different neural network is trained to fit each of these datasets. A sample image (which may be a hot dog or not a hot dog) is then fed through each of these classifiers. The outputs of the ensemble are collected, and whichever of “hot dog” or “not hot dog” gets the majority of votes is selected as the final output.
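The steps above can be sketched as follows. This is only an illustration under stated assumptions: small logistic-regression models stand in for the neural networks, the "images" are synthetic 20-dimensional vectors, and the helper names (make_data, split_balanced, train_logistic, majority_vote) are made up for this post. N should be odd so that the vote cannot tie.

```python
import numpy as np

def make_data(n, d, rng):
    """Synthetic stand-in for images: label 1 = hot dog, 0 = not hot dog."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + (2 * y[:, None] - 1) * 0.5
    return X, y

def split_balanced(X, y, n_parts, rng):
    """Split the data into n_parts disjoint sets with similar class balance."""
    neg = np.array_split(rng.permutation(np.flatnonzero(y == 0)), n_parts)
    pos = np.array_split(rng.permutation(np.flatnonzero(y == 1)), n_parts)
    parts = [np.concatenate([a, b]) for a, b in zip(neg, pos)]
    return [(X[i], y[i]) for i in parts]

def train_logistic(X, y, steps=500, lr=0.1):
    """Fit one binary classifier by gradient descent (stand-in for a network)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def majority_vote(models, X):
    """Label each sample 1 if more than half of the models vote 1."""
    votes = np.array([(X @ w + b > 0).astype(int) for w, b in models])
    return (2 * votes.sum(axis=0) > len(models)).astype(int)

rng = np.random.default_rng(0)
N = 5                                    # number of classifiers (odd)
X_train, y_train = make_data(3000, 20, rng)
X_test, y_test = make_data(1000, 20, rng)

models = [train_logistic(Xk, yk)
          for Xk, yk in split_balanced(X_train, y_train, N, rng)]
pred = majority_vote(models, X_test)
accuracy = (pred == y_test).mean()
print(accuracy)
```

Note that this sketch only shows the data partitioning and the voting. It says nothing about robustness to adversarial examples, which is the part that would need a real experiment.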
Why this Might Work
My conjecture is that the reason why adversarial examples work across different architectures is that the networks are trained using the same dataset. I suspect that a whole collection of heuristics that train a network using randomized methods drawing from this dataset will, in a “law of large numbers” sense, converge to a property where they are tricked by the same adversarial example. Specifically, as Warde-Farley et al. describe, in practice there are usually only a few vectors that you can use to perturb an example image to push it into another category, and these vectors are similar across different neural network architectures. Let’s call such a vector the optimally perturbing vector associated with the dataset. I conjecture that this is a property of the distribution from which the images are drawn, which, when I think about it, is pretty much equivalent to the dataset itself.
If this is true, and one trains multiple networks using different datasets, I conjecture that different adversarial examples will work on each neural network. However, I suspect that it would be hard to find an adversarial example that tricks the majority of these neural networks simultaneously. If the optimally perturbing vector associated with each dataset is different, adding these all up may cause their contributions to cancel each other out.
This of course needs empirical verification. Also, it is not clear to me whether the combined network that takes the majority of the outputs would be robust to an adversarial example crafted against the combined network itself.
Error Control Coding Analogy
To me, this seems like a way to capture the notion of a repetition code in machine learning. Each independently trained network in this architecture is analogous to a symbol in an error control code. I am still trying to think of an appropriate analogy to other types of coding techniques, like computing parity, as is done in low-density parity-check codes.
More than this, I suspect this would be a good technique to increase the accuracy of image classifiers even for typical images. Of course, this also needs empirical verification.
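To put a number on the repetition-code analogy, here is a small calculation. It rests on a strong assumption that the post has not established: that each of the n networks is wrong on a given input independently, with the same probability p. Under that assumption the majority vote is wrong exactly when more than half of the networks are wrong, which is a binomial tail. The function name majority_error is made up for this sketch, and n is assumed to be odd.

```python
from math import comb

def majority_error(p, n):
    """Probability that more than half of n independent voters are wrong,
    when each one is wrong with probability p (n assumed odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 5, 11):
    print(n, majority_error(0.1, n))
```

For p = 0.1 and n = 3 this gives 3(0.1)^2(0.9) + (0.1)^3 = 0.028, compared to 0.1 for a single network. If the networks' errors are correlated, as they would be if one adversarial example fooled most of them at once, the gain shrinks or disappears, which is exactly the question that needs empirical verification.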
So, has anyone tried this method before? If so, please let me know! From what I can tell, ensemble methods (which seem related to this idea) specifically try to train networks using the same dataset, and thus are different from this approach.