Okay. Surfaces.
The first thing to understand is how the visual cortex (V1, the first stage) actually looks, and how it relates to the convolutional neural network architecture shown above.
Then, after understanding what V1 does, we'll move on to V2, and then V4 and MT, and finally the inferior temporal cortex (IT) which is akin to the "recognition layer" on the far right of the convolution map.
We talked about primitives. It turns out that V1 is "pre"-primitive. It extracts features from the visual space, which are then assembled into primitives by subsequent areas of the visual cortex. Specifically, the three features of greatest relevance for V1 are ocular dominance, edge orientation, and color.
Just to get a visual for what I'm about to say, first realize that the input from the eyes is combined in the visual cortex. The left half of the visual field (from both eyes) goes to the right cortex, and vice versa. BUT, the input from each eye is kept separate in V1, in the form of ocular dominance columns. Each ocular dominance column processes input from one eye. Here is what ocular dominance columns actually look like on the surface of the V1 cortex:
Black is from the left eye, white is from the right.
But superimposed on this structure are two other structures, one bigger and one smaller. Note the distance scale in the above picture. Taken together, the columnar architecture of V1 constitutes a matrix map of (the right or left half of) the visual field.
The matrix looks like this: for each point in the visual field, there is a "hypercolumn" approximately 1 mm in diameter. Within that, there are two ocular dominance columns, about 500 microns each. Within each ocular dominance column there is a "color blob" in the middle, surrounded by a pinwheel of "orientation-selective mini-columns", each 20-30 microns across. The pinwheel looks like this:
The black dot in the middle is the "color blob". It is a small region that stains positive for cytochrome oxidase. The color blob is color sensitive, but the orientation pinwheel is not.
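To make the nesting concrete, here's a minimal sketch of one hypercolumn as a plain data structure. The sizes are the approximate figures from above, and the 20-degree orientation spacing is my illustrative placeholder, not real anatomy:

```python
# Toy model of the V1 columnar hierarchy described above. Sizes are the
# rough figures from the text; the orientation spacing is a guess.
hypercolumn = {
    "diameter_um": 1000,  # ~1 mm per point in the visual field
    "ocular_dominance_columns": [
        {
            "eye": eye,
            "width_um": 500,
            "color_blob": {"stains_cytochrome_oxidase": True,
                           "color_sensitive": True},
            # pinwheel of orientation mini-columns, ~20-30 um each
            "orientation_minicolumns": [
                {"preferred_deg": d, "width_um": 25}
                for d in range(0, 180, 20)  # orientations span 0-180 deg
            ],
        }
        for eye in ("left", "right")
    ],
}
```

Nothing about the real tissue is this tidy, of course, but the nesting (hypercolumn → two ocular dominance columns → blob plus pinwheel) is the point.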
Thus, for each point in the visual field there is an input from each eye, and within that input there is a color and an array of edge orientations. So we end up with a vector field for color, and a tensor field for orientations.
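As a rough sketch of what "a vector field for color and a tensor field for orientation" could mean computationally: below I assume opponent-channel color vectors and a structure-tensor encoding of orientation. Those encoding choices are mine, for illustration, not claims about the actual neural code.

```python
import numpy as np

H, W = 4, 4  # a tiny patch of the visual field, one sample per hypercolumn

# Vector field: a 2-vector of opponent color channels (red-green,
# blue-yellow) at each point.
color_field = np.zeros((H, W, 2))

# Tensor field: a 2x2 structure tensor t t^T per point, where
# t = (cos theta, sin theta) for edge orientation theta.
theta = np.full((H, W), np.pi / 2)  # vertical edges everywhere, for demo
t = np.stack([np.cos(theta), np.sin(theta)], axis=-1)  # shape (H, W, 2)
orientation_field = t[..., :, None] * t[..., None, :]  # shape (H, W, 2, 2)
```

A nice property of the tensor encoding: theta and theta + pi produce the same tensor, matching the fact that an edge has an orientation but no intrinsic direction.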
Within the orientation mini-columns, there is a preference for motion normal to the orientation. So if the edge orientation is vertical, the cells will respond best to horizontal motion. In humans and monkeys, there is a slight over-representation of the 90-degree angle in the orientation pinwheels, and therefore a slight preference for vertical edges with horizontal movement.
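In code, the "motion normal to orientation" rule is just a 90-degree rotation. A tiny sketch, with both the edge orientation and the motion axis taken modulo 180 degrees (my simplification, since an axis has no sign):

```python
def preferred_motion_axis(edge_orientation_deg):
    """Axis of motion (degrees) that best drives an orientation
    mini-column, assuming the preferred motion is normal to the edge."""
    return (edge_orientation_deg + 90.0) % 180.0

# A vertical edge (90 deg) is driven best by horizontal motion (0 deg).
```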
In the V1 visual cortex there are 6 anatomical layers, which is an interesting story in and of itself. The columns we're talking about are vertical, whereas the layers are horizontal. So you can conceive of the processing matrix looking approximately like this:
Whereas the layers look like this:
The circuitry for a processing column looks like this:
The wiring between layers looks like this:
In the above diagram, "M" is the magnocellular input from the retina (via the lateral geniculate nucleus of the thalamus): the Y cells that handle motion. "P" is the parvocellular input from the X cells of the retina, which handle static luminance and contrast. With respect to color, a fraction of the X cells are red-green opponent (some are just grayscale, depending on whether they get input from rods or cones), whereas the "K" (koniocellular) inputs are blue-yellow opponent. As expected, the K inputs and the red-green P inputs feed the color blobs, which can be seen mostly in layer 3 of the cortex.
In turn, this visual information from V1 is then fed to V2, in the following way:
The color information ("blob") goes to the thin stripes of V2, whereas the orientation information goes to the thick and pale stripes. We'll talk about V2 tomorrow, that is where visual surface primitives begin to be assembled.
For now, you can see that the primary visual cortex V1 is basically a gigantic matrix engine that builds a vector field for color and a tensor field for orientation, at each point in visual space.
The cellular circuitry is quite complex, and I'm leaving a lot out of the description just to provide an overview. For example, you can see that the incoming and outgoing information is "multiplexed": each channel handles more than one piece of information, and how it does that has to do with the timing of neural firing and the relative excitation and inhibition between neurons. For now I just want to give you enough information to understand how surfaces are extracted and processed in V2 - because surfaces necessarily require combining the signals from each eye into binocular representations in 3 dimensions.
And then later we'll see how the middle temporal area (MT, also known as V5) handles motion.
The general picture of the higher visual cortex is that V2 extracts surfaces, V4 extracts objects, and V5/MT tracks those objects as they move. IT, the inferior temporal cortex, which lies near the hippocampus anatomically, is responsible for recognizing objects based on memory, and the hippocampus is the beginning of the navigation system that lets us run mazes for rewarding objects, and grasp rewards that are hidden behind other objects.
Here is an overview of the human visual system minus the recognition part:
You can see there's an upper (dorsal) portion, and a lower (ventral) portion. So far we're just talking about the lower (ventral) stream, which is sometimes called the "what" pathway, as distinct from the dorsal stream which is called the "where" pathway. Both pathways are fed from V2, the surface processing area.
V2 is what is labeled "extrastriate cortex" in the above pic. It surrounds V1 and is about 3 times as big. There is an area called V3 which is also part of the extrastriate cortex, but generally in humans it's lumped in with V2 because the boundaries are indistinct and the wiring is almost identical. In monkeys though, V3 is more distinct and has different wiring. Both monkeys and humans have a "where" pathway and a "what" pathway in the visual system.
It is the "where" pathway that accomplishes the changes of basis that allow the rapid switching of reference frames when we're navigating a maze or exploring new terrain. Such exploration builds coordinate maps of visual space around our head and body movements, which is something completely different from object recognition. But both pathways feed into the "cognitive map" in the hippocampus that allows us to purposefully navigate visual terrain.
There are other parts of the visual system we haven't talked about (and won't), like MTS, which has the specific purpose of recognizing faces and identifying the emotional components of facial maps. It's part of the "what" system, and is definitely related to surface mapping, but it's somewhat extraneous to surface geometry itself; it's a separate topic for a different thread.
Tomorrow we'll get to the differential geometry that allows us to identify surfaces from color and orientation. The good news is, it mainly boils down to simple algebra - in this case tensor algebra, which is just an extension of vector algebra. Neural networks excel at calculating dot products; the results are almost instantaneous. If you're familiar with differential geometry you'll know about the Jacobian and the metric tensor, which vary at each point in space. Part of what the neuroscientists call "lateral inhibition" and "recurrent excitation" has to do with calculating the metric tensor and using it to estimate the curvature of surfaces.
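To make "the metric tensor is just dot products" concrete, here's a tiny worked example in pure Python. It computes G = JᵀJ from a Jacobian, using polar coordinates as a stand-in parametrization (polar coordinates are my choice here, purely for illustration):

```python
import math

def metric_from_jacobian(J):
    """g_ij = dot product of the i-th and j-th columns of J, i.e. G = J^T J."""
    n = len(J[0])
    return [[sum(J[k][i] * J[k][j] for k in range(len(J)))
             for j in range(n)] for i in range(n)]

def polar_jacobian(r, phi):
    """Jacobian of (r, phi) -> (x, y) = (r cos phi, r sin phi)."""
    return [[math.cos(phi), -r * math.sin(phi)],
            [math.sin(phi),  r * math.cos(phi)]]

g = metric_from_jacobian(polar_jacobian(2.0, 0.3))
# For polar coordinates the metric comes out as diag(1, r^2) = diag(1, 4),
# independent of phi.
```

Each entry of the metric is literally one dot product, which is why a network that's good at dot products is good at this kind of geometry.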
We will eventually see that the full 4-dimensional picture for moving surfaces looks a lot like general relativity, mathematically. There, we have to use intrinsic geometry because there's nothing (we know of) that's "external" to the universe, so when we're looking at curvature we have to do it intrinsically. The brain is clever though: it uses both intrinsic and extrinsic descriptions, and maps between them to accurately reconstitute and track moving objects from moving surfaces in real time.