Aside

It is tempting to assume that with the appropriate choice of weights for the edges connecting the second and third layers of the NN discussed in this post, it would be possible to create classifiers that output $1$ over any composite region defined by unions and intersections of the 7 regions shown below.

This is untrue, a fact that can be shown for the three-edge case by brute-force enumeration of all unique NNs of this architecture (I’m assuming fixed weights for edges connecting the 1st and 2nd layers). Because edge weights can vary continuously, this may seem like an impossible task, but in reality we can restrict our attention to a small number of integer weights.

Consider the input to $a_1^{(3)}$, called the pre-activation. By construction, this quantity is constant within each of the 7 regions. For a given threshold value (set by the bias term $b_1^{(2)}$), all regions whose pre-activation values exceed that threshold will cause $a_1^{(3)}$ to fire a $1$, and all others will cause it to fire a $0$. For fixed weights $w_{11}^{(2)}$ through $w_{13}^{(2)}$, changing the value of $b_1^{(2)}$ can result in at most 8 distinct NNs: the 7 regions take at most 7 distinct pre-activation values, and the threshold can fall below all of them, between two consecutive values, or above all of them. Furthermore, the actual pre-activation values do not matter, only their relative rank order, which lets us consider only integer weights. By enumerating all permissible rank orderings and all distinct biases, we can characterize the full set of distinct NNs.
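The enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the figures: the names `REGIONS`, `achievable_labelings`, and `max_abs_weight` are hypothetical, each region is identified by the set of half-spaces covering it, the unit fires only when the pre-activation strictly exceeds the threshold, and the integer weight range $[-3, 3]$ is an assumed search window that is not proven here to cover every rank ordering.

```python
from itertools import product

# Each region is identified by which of the three half-spaces cover it.
# The all-zero pattern is left out: the 7 regions considered here each
# lie inside at least one half-space.
REGIONS = [m for m in product((0, 1), repeat=3) if any(m)]


def achievable_labelings(max_abs_weight=3):
    """Collect every 0/1 labeling of the 7 regions produced by some
    integer weight vector and threshold in the searched range."""
    found = set()
    weights = range(-max_abs_weight, max_abs_weight + 1)
    for w in product(weights, repeat=3):
        # Pre-activation (without bias) of each region: sum of the
        # weights of the half-spaces that cover it.
        pre = [sum(wi * mi for wi, mi in zip(w, m)) for m in REGIONS]
        # One threshold below every value, then one at each distinct
        # value: at most 8 cuts, matching the count in the text.
        cuts = [min(pre) - 1] + sorted(set(pre))
        for t in cuts:
            found.add(tuple(int(p > t) for p in pre))
    return found


if __name__ == "__main__":
    labelings = achievable_labelings()
    print(f"{len(labelings)} of {2 ** len(REGIONS)} labelings found")
```

Because the pre-activations are integers here, placing the threshold exactly at a value and firing on strict inequality is equivalent to placing it anywhere between that value and the next one up. Any labeling missing from the returned set is a candidate for an impossible classifier, subject to the assumed weight range.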

Below is such an enumeration of all distinct NNs with the architecture under consideration. Orange regions correspond to areas of $\mathbb{R}^2$ where the NN will output $1$.

Below is the complement: the set of classifiers that cannot be achieved with this architecture.

It is easy to see why some configurations are impossible. Consider the fourth example above (counting from top-left). Let’s denote the regions covered by only one half-space by $r_1$, $r_2$, and $r_3$. All such regions are inactive, and all regions covered by exactly two half-spaces are active. This means that moving from any $r_i$ into a region that is also covered by a second half-space crosses the threshold, i.e. each half-space must contribute an increase to the pre-activation. But this implies that the pre-activation in the region covered by all three half-spaces must be higher still, hence that region cannot be inactive.
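The same argument can be written as a short chain of inequalities. As shorthand that does not appear elsewhere in the post, write $w_i$ for $w_{1i}^{(2)}$ and $t$ for the threshold set by $b_1^{(2)}$, and assume a region is active exactly when its pre-activation exceeds $t$. For any distinct $i, j$, the fourth configuration requires

$$
w_i \le t \quad \text{(single regions inactive)}, \qquad w_i + w_j > t \quad \text{(double regions active)}.
$$

Subtracting the first from the second gives $w_j > t - w_i \ge 0$, so every weight is strictly positive. Then

$$
w_1 + w_2 + w_3 > w_1 + w_2 > t,
$$

so the region covered by all three half-spaces is forced to be active, contradicting the configuration.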