There has been a lot of renewed interest lately in neural networks (NNs) due to their popularity as a model for deep learning architectures (there are non-NN based deep learning approaches based on sum-products networks and support vector machines with deep kernels, among others). Perhaps due to their loose analogy with biological brains, the behavior of neural networks has acquired an almost mystical status. This is compounded by the fact that theoretical analysis of multilayer perceptrons (one of the most common architectures) remains very limited, although the situation is gradually improving. To gain an intuitive understanding of what a learning algorithm does, I usually like to think about its representational power, as this provides insight into what can, if not necessarily what does, happen inside the algorithm to solve a given problem. I will do this here for the case of multilayer perceptrons. By the end of this informal discussion I hope to provide an intuitive picture of the surprisingly simple representations that NNs encode.
I should note at the outset that what I will describe applies only to a very limited subset of neural networks, namely the feedforward architecture known as a multilayer perceptron. There are many other architectures that are capable of very different representations. Furthermore, I will be making certain simplifying assumptions that do not generally hold even for multilayer perceptrons. I find that these assumptions help to substantially simplify the discussion while still capturing the underlying essence of what this type of neural network does. I will try to be explicit about everything.
Let’s begin with the simplest configuration possible: two inputs node wired to a single output node. Our NN looks like this:
The label associated with a node denotes its output value, and the label associated with an edge denotes its weight. The topmost node represents the output of this NN, which is:
In other words, the NN computes a linear combination of the two inputs and , weighted by and respectively, adds an arbitrary bias term and then passes the result through a function , known as the activation function. There are a number of different activation functions in common use and they all typically exhibit a nonlinearity. The sigmoid activation , plotted below, is a common example.
As we shall see momentarily, the nonlinearity of an activation function is what enables neural networks to represent complicated input-output mappings. The linear regime of an activation function can also be exploited by a neural network, but for the sake of simplifying our discussion even further, we will choose an activation function without a linear regime. In other words, will be a simple step function:
This will allow us to reason about the salient features of a neural network without getting bogged down in the details. In particular, let’s consider what our current neural network is capable of. The output node can generate one of two values, and this is determined by a linear weighting of the values of the input nodes. Such a function is a binary linear classifier. As shown below, depending on the values of and , one regime in this two-dimensional input space yields a response of (white) and the other a response of (shaded):
Let’s now add two more output nodes (a neural network can have more than a single output). I will need to introduce a bit of notation to keep track of everything. The weight associated with an edge from the node in the first layer to the node in the second layer will be denoted by . The output of the node in the layer will be denoted by . Thus and .
Every output node in this NN is wired to the same set of input nodes, but the weights are allowed to vary. Below is one possible configuration, where the regions triggering a value of are overlaid and colored in correspondence with the colors of the output nodes:
So far we haven’t really done anything, because we just overlaid the decision boundaries of three linear classifiers without combining them in any meaningful way. Let’s do that now, by feeding the outputs of the top three nodes as inputs into a new node. I will hollow out the nodes in the middle layer to indicate that they are no longer the final output of the NN.
The value of the single output node at the third layer is:
Let’s consider what this means for a moment. Every node in the middle layer is acting as an indicator function, returning or depending on where the input lies in . We are then taking a weighted sum of these indicator functions and feeding it into yet another nonlinearity. The possibilities may seem endless, since we are not placing any restrictions on the weight assignments. In reality characterizing the set of NNs (with the above architecture) that exhibit distinct behaviors does require a little bit of work–see Aside–but the point, as we shall see momentarily, is that we do not need to worry about all such possibilities. One specific choice of assignments already gives the key insight into the representational power of this type of neural network. By setting all weights in the middle layer to , and setting the bias of the middle layer to , the activation function of the output neuron will output whenever the input lies in the intersection of all three half-spaces defined by the decision boundaries, and otherwise. Since there was nothing special about our choice of decision boundaries, we are able to carve out any arbitrary polygon and have the NN fire precisely when the input is inside the polygon (in the general case we set the weights to , where is the number of hyperplanes defining the polygon).
This fact demonstrates both the power and limitation of this type of NN architecture. On the one hand, it is capable of carving out decision boundaries comprised of arbitrary polygons (or more generally polytopes). Creating regions comprised of multiple polygons, even disjoint ones, can be achieved by adding a set of neurons for each polygon and setting the weights of their respective edges to , where is the number of hyperplanes defining the polygon. This explains why, from an expressiveness standpoint, we don’t need to worry about all possible weight combinations, because defining a binary classifier over unions of polygons is all we can do. Any combination of weights that we assign to the middle layer in the above NN will result in a discrete set of values, up to one unique value per region formed by the union or intersection of the half-spaces defined by the decision boundaries, that are inputted to the node. Since the bias can only adjust the threshold at which will fire, then the resulting behavior of any weight assignment is activation over some union of polygons defined by the shaded regions. Thus our restricted treatment, where we only consider weights equal to , already captures the representational power of this NN architecture.
Several caveats merit mention. First, the above says nothing about representational efficiency, only power. A more thoughtful choice of weights, presumably identified by training the NN using backpropagation, can provide a more compact representation comprised of a smaller set of nodes and edges. Second, I oversimplified the discussion by focusing only on polygons. In reality, any intersection of half-spaces is possible, even ones that do not result in bounded regions. Third, and most seriously, feedforward NNs are not restricted to step functions for their activation functions. In particular modern NNs that utilize Rectified Linear Units (ReLUs) most likely exploit their linear regions.
Nonetheless, the above simplified discussion illustrates a limitation of this type of NNs. While they are able to represent any boundary with arbitrary accuracy, this would come at a significant cost, much like the cost of polygonally rendering smoothly curved objects in computer graphics. In principle NNs with sigmoidal activation functions are universal approximators, meaning they can approximate any continuous function with arbitrary accuracy. In practice I suspect that real NNs with a limited number of neurons behave more like my simplified toy models, carving out sharp regions in high-dimensional space, but on a much larger scale. Regardless NNs still provide far more expressive power than most other machine learning techniques and my focus on disguises the fact that even simple decision boundaries, operating in high-dimensional spaces, can be surprisingly powerful.
Before I wrap up, let me highlight one other aspect of NNs that this “union of polygons” perspective helps make clear. It has long been known that an NN with a single hidden layer, i.e. the three-layer architecture discussed here, is equal in representational power to a neural network with arbitrary depth, as long as the hidden layer is made sufficiently wide. Why this is so is obvious in the simplified setting described here, because unions of sets of unions of polygons can be flattened out in terms of unions of the underlying polygons. For example, consider the set of polygons formed by the following 10 boundaries:
We would like to create 8 neurons that correspond to the 8 possible activation patterns formed by the polygons (i.e. fire when input is in none of them (1 case), one of them (3 cases), two of them (3 cases), or any of them (1 case)). In the “deep” case, we can set up a four-layer NN such that the second layer defines the edges, the third layer defines the polygons, and the fourth layer contains the 8 possible activation patterns:
The third layer composes the second layer, by creating neurons that are specific to each closed region. However, we can just as well collapse this into the following three-layer architecture, where each neuron in the third layer “rediscovers” the polygons and how they must be combined to yield a specific activation pattern:
Deeper architectures allow deeper compositions, where more complex polygons are made up of simpler ones, but in principle all this complexity can be collapsed onto one (hidden) layer. There is a difference in representational efficiency however, and the two architectures above illustrate this important point. While the three-layer approach is just as expressive as the four-layer one, it is not as efficient: the three-layer NN has a 2-10-8 configuration, resulting in 100 parameters (20 edges connecting first to second layer plus 80 edges connecting second to third layer), while the four-layer NN, with a 2-10-3-8 configuration, only has 74 parameters. Herein lies the promise of deeper architectures, by enabling the inference of complex models using a relatively small number of parameters. In particular, lower-level features such as the polygons above can be learned once and then reused by higher layers of the network.
That’s it for now. I hope this discussion provided some insight into the workings of neural networks. If you’d like to read more, see the Aside, and I also recommend this blog entry by Christopher Olah which takes a topological view of neural networks.
Update: HN discussion here.