the Jacobian field (and what it's good for)

scruffy

In an earlier thread I outlined the behavior of the human brain's visual system in terms of differential geometry and the extraction of information about surfaces.

But now scientists at MIT have come one step closer: they've succeeded in matching the visual map with the motor map.

In humans, this relates to our ability to reach behind objects for other objects. We reach "around" the obstacle, and in most cases one trajectory is as good as another.

MIT scientists have duplicated this feat with machine learning. And, the mechanism is surprisingly simple.


In humans, the "world map" is generated in the temporal pole, in and around the hippocampus. Mappings from the various sensory and motor modalities "align" there.

The MIT scientists merely had to align the Jacobian map from the visual system, with the Jacobian map from the motor system. Throw in a learning algorithm, and the rest happens by itself.
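To make the idea concrete, here's a toy sketch of what "aligning two Jacobian maps with a learning algorithm" can look like. This is not the MIT system - the data, the 2x2 Jacobian sizes, and the single linear alignment transform are all assumptions for illustration - but the principle is the same: generate prediction errors between the two maps and let gradient descent close the gap.

```python
import numpy as np

# Hypothetical sketch: learn a linear transform A that aligns a field of
# "visual" Jacobians with the corresponding "motor" Jacobians.
# A_true, the 2x2 sizes, and the synthetic data are all made up for the demo.

rng = np.random.default_rng(0)

A_true = np.array([[0.8, -0.2], [0.1, 1.1]])   # unknown alignment (used only to make demo data)
J_vis = rng.normal(size=(100, 2, 2))            # 100 local visual Jacobians
J_mot = A_true @ J_vis                          # corresponding motor Jacobians

A = np.eye(2)                                   # start from the identity
lr = 0.01
for _ in range(2000):
    pred = A @ J_vis                            # predicted motor Jacobians
    err = pred - J_mot                          # prediction error across the field
    # gradient of 0.5*||err||^2 with respect to A, summed over the field
    grad = np.einsum('nij,nkj->ik', err, J_vis)
    A -= lr * grad / len(J_vis)

print(np.round(A, 3))                           # converges toward A_true
```

Throw in the learning rule, and "the rest happens by itself" - the transform is recovered from errors alone, with no explicit solve.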
 
The evolutionary story around this is quite interesting.

The cerebral cortex as we know it is only found in mammals and non-avian reptiles. Scientists theorize there was a common ancestor, before the split of the sauropsids (the reptile and bird line) and the synapsids (the mammal line), which they posit had the first cerebral cortex.

Prior to that, there was only an area called the pallium, which basically equates with the hippocampus in mammals. It fulfills the same functions: learning, memory, and sensory integration. The pallium is old; goldfish have one. Insects don't really have one; instead they have an "optic lobe" and their eyes are structured differently (they have "ommatidia" instead of rods and cones, which is a whole 'nother thread, I'd love to tell you about the hawk moth and why it's important).

But the pallium and its extension, the hippocampus, are what's called "allocortex", which is different from the "neocortex" in mammals. The first creatures with a true neocortex were the early mammals (themselves a synapsid lineage), which appeared around the Triassic/Jurassic transition some 200 million years ago. The difference is that neocortex has six layers, while allocortex only has three. A neocortex is like two allocortexes piled on top of each other.

So these Jacobian maps we're talking about are created in the neocortex. They depend on a "visual cortex". Before the visual cortex, there was only a crude reflex circuit for targeting prey and swiping at it, the superior colliculus, which also routes several other reflexes, regarding eye movements and such.

So what happens during evolution is the advanced neocortex comes to make use of the older systems. In humans the visual cortex processing is tightly coupled to eye movements, and this is what enables the Jacobian maps. (The Jacobian maps are the complete set of tangent planes in the visual field - basically every processing column in the visual cortex creates its own Jacobian matrix relative to its neighbors, so you get a field or lattice of these things).
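Here's a minimal numerical sketch of such a "field of Jacobians" - not the cortical mechanism, just the geometry. Assume a 2-channel feature map over the image (the channels and the sine/cosine test pattern are made up); each pixel then gets its own 2x2 Jacobian, estimated from its neighbors by finite differences, giving a lattice of local tangent-plane descriptions:

```python
import numpy as np

# Illustrative sketch (not the brain's algorithm): build a "Jacobian field"
# over an image. Each pixel gets a 2x2 Jacobian of a 2-channel feature map,
# computed relative to its neighbors by finite differences.

h, w = 64, 64
ys, xs = np.mgrid[0:h, 0:w]
f0 = np.sin(xs / 8.0)                  # made-up feature channel 0
f1 = np.cos(ys / 8.0)                  # made-up feature channel 1

gy0, gx0 = np.gradient(f0)             # neighbor differences, channel 0
gy1, gx1 = np.gradient(f1)             # neighbor differences, channel 1

# J[y, x] is the 2x2 Jacobian d(f0, f1)/d(x, y) at that pixel
J = np.empty((h, w, 2, 2))
J[..., 0, 0] = gx0; J[..., 0, 1] = gy0
J[..., 1, 0] = gx1; J[..., 1, 1] = gy1

print(J.shape)                         # one 2x2 Jacobian per pixel
```

The point of the exercise: the "map" isn't one matrix, it's a whole field of them, one per processing column.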

The maps have stability requirements, and what happens in humans is exceedingly clever. Obviously, if you're trying to calculate a Jacobian (which takes time) and your eyes suddenly move somewhere else in the middle of it, your map is destroyed and you'll get garbage output. So the human brain has an evoked potential sequence that guarantees the Jacobians are built immediately after an eye movement (basically every time there's a new snapshot of the visual field). WHILE this is going on, the eyes are tremoring at about 100 Hz, which moves the visual field back and forth very quickly by a very small amount, guaranteeing that the numbers being calculated are "statistical averages" relative to nearby receptors. We're talking single-photon resolution though, and 0.01 degrees of arc in the visual field. So "nearby" with emphasis.

And this process is what enables predictive coding. It's too long to explain the whole thing - but what happens in the neocortex that doesn't happen in the allocortex is that the visual system comes to predict the details of the next frame. Once the Jacobian is built, if the derivatives are correct we can predict the details of small motions. This process in humans is very fast, happening just milliseconds after a new scene.

Hence the tremor - it provides successive error signals to the predictive coder. After every error signal there is a calculation that updates the Jacobian from the difference between the predicted value and the actual value, which in turn reduces the Jacobian's error until it eventually converges to near zero.
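That loop - predict via the Jacobian, compare against the actual next sample, update from the error - can be sketched in a few lines. Everything here is a toy assumption (a locally linear "scene", a 1x2 Jacobian, an LMS-style update rule), not the neural implementation:

```python
import numpy as np

# Toy predictive-coding loop (a sketch, not the cortical circuit): a tremor
# jitters the "eye" by tiny dx steps; we predict the next sample with a
# first-order model f(x + dx) ~ f(x) + J.dx and update J from each error.

rng = np.random.default_rng(1)
J_true = np.array([0.7, -0.3])                  # true local gradient (unknown to the learner)
f = lambda x: J_true @ x                        # made-up locally linear scene patch

x = np.zeros(2)
J = np.zeros(2)                                 # estimated Jacobian, starts at zero
lr = 0.5
for _ in range(500):
    dx = rng.normal(scale=0.01, size=2)         # tiny tremor displacement
    pred = f(x) + J @ dx                        # predicted next sample
    x = x + dx
    err = f(x) - pred                           # prediction error signal
    J += lr * err * dx / (dx @ dx)              # normalized LMS-style Jacobian update

print(np.round(J, 3))                           # has converged toward J_true
```

Each tremor step probes a slightly different direction, so the successive errors are enough to pin down every component of the Jacobian.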

This is pretty sophisticated behavior, it's well ahead of the light-detection and dynamic filtering properties of the moth's optic lobe (which roughly equate with the lateral geniculate nucleus in humans, except without the feedback from the cerebral cortex). The interesting thing is that the cellular architecture required to accomplish all this is very simple.

If you read through the article you'll see the whole system only has two parts. 'Course they do it the machine way, not the brain way - but it's the same calculation, the same principles. Your organism is moving in time and space, and the visual system has to understand how the surfaces being seen are related to the motor actions being taken - and therefore the visual map has to line up with the motor map.

There are specific other areas of the neocortex (other than the primary visual cortex) involved with alignments of maps (for instance areas 5 and 7 in the parietal lobe). And the motor side of it is fascinating too, the motor and sensory areas end up having to predict each other, and there's a whole 'nother synchronization mechanism keeping that lined up with eye movements.

We're finally getting to the place where AI and neuroscience can inform each other.
 
The goal here is to build a 3-d map of the environment from the two images from each eye.

The first thing to consider is the "bounding box": the smallest rectangle that completely encloses an object in the visual field.

We can pretty easily figure out where the object boundaries are in an image, using contrast and color. With the image from a single eye or camera, you get bounding boxes that look like this (you may have seen the same concept in facial recognition, which generally uses only a single camera):

[attached image: 2-D bounding boxes drawn around objects in a road scene]
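A stripped-down version of the single-camera case fits in a few lines. This is just thresholding a synthetic image - real detectors (or the CNNs used for faces) do far more - but the output has the same shape: one (x_min, y_min, x_max, y_max) per object.

```python
import numpy as np

# Minimal sketch: find an "object" by contrast and put a 2-D bounding box
# around it. The image and threshold are made up for illustration.

img = np.zeros((100, 100))
img[30:60, 20:70] = 1.0                      # a bright object on a dark background

mask = img > 0.5                             # contrast threshold
ys, xs = np.nonzero(mask)                    # pixel coordinates of the object
bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
print(bbox)                                  # → (20, 30, 69, 59)
```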


However we need a THREE dimensional map, not just two dimensions. So the bounding boxes have to become cubes, not just squares. They have to look like this:

[attached image: 3-D bounding boxes drawn around vehicles in a road scene]
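What turns a 2-D box into a 3-D one, with two eyes, is disparity: the horizontal shift of a feature between the left and right images gives its depth via z = focal_length x baseline / disparity (the pinhole stereo model). The numbers below are assumed values, just to show the arithmetic:

```python
import numpy as np

# Stereo depth sketch: disparity between the two eyes' images gives depth.
# focal_px and baseline_m are assumed values for illustration.

focal_px = 800.0                             # focal length in pixels (assumed)
baseline_m = 0.065                           # ~6.5 cm, roughly human eye spacing

disparity_px = np.array([40.0, 20.0])        # near corner, far corner of an object
z = focal_px * baseline_m / disparity_px     # depth of each corner in meters
print(z)                                     # 1.3 m and 2.6 m
```

With depth for the near and far extents, the flat rectangle extrudes into a box.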


There are all kinds of methods for doing this, including point cloud photogrammetry. However we want it done in 10 msec or less, for a color image roughly 1000 x 1000 pixels.

The acquisition sequence is:

1. The eyes move to a new spot
2. The image from each eye is fed to the processor
3. The processor calculates object boundaries and places the bounding boxes around each object
4. From the object definitions we extract the surface features of each object

and finally and most importantly, when all this is done,

5. We translate camera coordinates into world coordinates, so our organism can navigate the maze tomorrow and the next day without requiring real time information.
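Step 5 is a rigid transform: rotate and translate points out of the camera (egocentric) frame into the world (allocentric) frame. The pose values below are invented for the example:

```python
import numpy as np

# Camera-to-world coordinates sketch: p_world = R @ p_cam + t.
# The rotation (a 90-degree yaw) and translation are made-up pose values.

theta = np.pi / 2                      # camera yawed 90 degrees in the world
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([5.0, 2.0, 0.0])          # camera position in the world frame

p_cam = np.array([1.0, 0.0, 0.0])      # a point 1 m along the camera's x-axis
p_world = R @ p_cam + t
print(np.round(p_world, 6))            # → [5. 3. 0.]
```

Once the maze is stored in world coordinates, tomorrow's navigation doesn't need the camera to be in the same place.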

These tasks can be performed by AI deep learning using neural networks, in near real time. A fast CPU with an NVIDIA graphics card can do it in just about 10 msec - which is about the same time our brains take.

The images above are road scenes from freeways and city streets. What happens if the cars are moving? Can we still say the hood of the leading car is slightly curved, within 10 msec? If a car is moving fast, say 60 mph, at some point it becomes impossible to describe its surface features in such detail. We trade detail for processing speed, because we care more that the car is moving towards us than about its make and model.

Two things about coordinates:

1. The eyes are at the top of the head, whereas we may require leg coordinates so we can get the hell out of the way.

2. The eyes give us egocentric coordinates whereas we may be interested in estimating the distance between cars, which would require an allocentric coordinate system.

And the further observation is that a stereo image is a "projection mapping": lines that are parallel in the world are no longer parallel in the image, and they converge toward the horizon. Any translation to allocentric coordinates has to undo (correct) the projection mapping first.
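Undoing the projection is straightforward once you have depth: with a pinhole model, a pixel (u, v) plus its depth z recovers the 3-D point that projection collapsed. The intrinsics below (focal length, principal point) are assumed values for a 1000 x 1000 image:

```python
import numpy as np

# Back-projection sketch: invert the pinhole projection
#   u = fx * x / z + cx,   v = fy * y / z + cy
# to recover the 3-D point from a pixel and its depth.
# fx, fy, cx, cy are assumed intrinsics for illustration.

fx = fy = 800.0
cx = cy = 500.0

def backproject(u, v, z):
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

p = backproject(700.0, 500.0, 4.0)
print(p)                               # x = (700 - 500) * 4 / 800 = 1.0
```

That per-point correction is the first move in going from egocentric camera coordinates to an allocentric map where distances between cars mean something.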
 
