Alignment of sensory fields

See? Scruffy is always right. This suddenly became a hot topic in robotics. Now someone has designed a network that automatically integrates new sensors.

That sounds amazingly similar to plug and play devices we have used on our computers for a long time. You've still got to tell the computer what to do with the data received, and how to integrate that into the running program, don't you?
 
Plug-and-play sensors are required to identify themselves and their capabilities under IEEE 1451.4.

PC sensors use "class" drivers once they've been identified. Some of those drivers will go out and look for SensorML files if they exist. Class drivers are kind of like middleware: they translate sensor data into an OS-recognized format.
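
Here's a toy sketch of what that identify-then-dispatch step could look like. To be clear, the record fields and the driver table below are invented for illustration; real IEEE 1451.4 TEDS is a packed binary format, which is not what's shown here.

```python
# Hypothetical sketch of plug-and-play sensor identification.
# The field names and registry are illustrative only -- real IEEE 1451.4
# TEDS is a packed binary format, not modeled here.
from dataclasses import dataclass

@dataclass
class TEDSLikeRecord:
    manufacturer_id: int
    model: str
    sensor_type: str      # e.g. "microphone", "load_cell"
    units: str            # e.g. "Pa", "N"
    min_value: float
    max_value: float

# A "class driver" table: map sensor_type to a handler that converts
# raw readings into application units.
CLASS_DRIVERS = {
    "load_cell": lambda raw, r: r.min_value + raw * (r.max_value - r.min_value),
    "microphone": lambda raw, r: raw * (r.max_value - r.min_value) / 2.0,
}

def handle_new_sensor(record: TEDSLikeRecord, raw_reading: float) -> float:
    """Dispatch to a class driver based on the sensor's self-description."""
    driver = CLASS_DRIVERS[record.sensor_type]
    return driver(raw_reading, record)

cell = TEDSLikeRecord(0x1A2B, "LC-100", "load_cell", "N", 0.0, 500.0)
print(handle_new_sensor(cell, 0.42))   # raw 0..1 reading -> 210.0 newtons
```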

Robotic sensors are different: many of them carry raw analog signals, like a microphone does. Imagine you had an omnidirectional mic and a cardioid-pattern mic, and you plugged one in at random. Which one did you plug in? The robot can only make that determination if it has simultaneous inputs from other sensors.

To a certain extent, some sensor output is self-explanatory. If you have a pushbutton you won't get an analog signal; it'll be binary. But if it's a gyro you'll have three analog streams coming out of the sensor, and if it's mounted on a 6-DOF limb, the brain needs to know its orientation before the signal streams can be interpreted properly.
 
The TEDS-enabled strain-gauge load cell is an analog device, like lots of other analog devices that fall under that standard. Instant Internet Experts say the silliest things, don't they?
 
The new design you describe might truly be groundbreaking, but you haven't demonstrated in what way it is unique.
 
You could always try doing a little research yourself.
 
Nope. My interests might bump against that field occasionally, but I have neither the time nor the interest to develop seriously in that area. However, I have close friends who work in those areas daily. They don't mind giving me the Cliff's Notes version when I ask. I have no need for the minute details. You're the one claiming a new development better than sliced bread. You haven't identified what is new or wonderful about it.
 

I gave you the link.

It's up to you to read it.
 
So, here's why alignment is important.

Barn owls locate their prey in the dark, based only on the sounds the prey makes. The location is determined by the "inter-aural time difference": sounds from the left hit the left ear first. Owls can resolve time differences of only 5 microseconds, using neurons whose individual temporal resolution is at best 2 msec.

How do they do this? It has been intensively studied by Mark Konishi at Caltech. It has to do with a neural architecture that resembles a delay line, but it's much more complicated than that. It involves phase locking of auditory receptors and a "head-related transfer function" based on the size and shape of the individual head.
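
To make the delay-line intuition concrete, here's a toy ITD estimator: each candidate delay acts like a Jeffress-style coincidence detector, and the best-matching delay wins. This is a sketch under made-up assumptions (broadband noise source, 200 kHz sampling), not Konishi's actual circuit model.

```python
import numpy as np

# Toy inter-aural time difference (ITD) estimator in the spirit of a
# Jeffress delay line: each candidate delay is a "coincidence detector",
# and the best-matching delay wins. Signals and sample rate are made up.
fs = 200_000                      # 200 kHz sampling -> 5 us resolution
t = np.arange(0, 0.01, 1 / fs)
true_itd = 25e-6                  # sound arrives 25 us earlier at left ear

rng = np.random.default_rng(0)
source = rng.standard_normal(t.size)          # broadband "prey rustle"
shift = int(round(true_itd * fs))
left = source
right = np.roll(source, shift)                # right ear hears it later

max_lag = 40                                  # +/- 200 us search window
lags = np.arange(-max_lag, max_lag + 1)
scores = [np.dot(left[max_lag:-max_lag],
                 np.roll(right, -lag)[max_lag:-max_lag]) for lag in lags]
best = lags[int(np.argmax(scores))]
print(f"estimated ITD: {best / fs * 1e6:.0f} us (true: {true_itd * 1e6:.0f} us)")
```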

But here's the thing: at some point the tonotopic frequency map coming out of the ear is converted to a spatial map in the midbrain. And here's the amazing part: this map self-organizes. Not only that, but during a critical period lasting about 200 days, starting at a couple of weeks of age, the auditory spatial map aligns itself with a visual map from a nearby brain area. And we can control how those two maps align by putting prisms on the eyes. If we shift the visual field by 20 degrees during the critical period, the owl's auditory localization in darkness will be off by 20 degrees.
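
And here's a deliberately crude sketch of how prism-shifted alignment might emerge from simple Hebbian co-activity: an auditory map wires itself to whatever visual location is co-active, so shifting the visual input shifts the learned mapping. The network, learning rule, and all constants are invented for illustration; this is a caricature of the experiment, not a model of the owl's circuit.

```python
import numpy as np

# Toy Hebbian alignment of an auditory map to a visual map.
# One "neuron" per degree of azimuth; a prism shifts the visual input
# by a fixed offset during the critical period.
rng = np.random.default_rng(1)
n = 41                                  # -20..+20 degrees, 1-deg bins
W = np.full((n, n), 1.0 / n)            # auditory->visual connection weights
prism_shift = 8                         # degrees of visual displacement
lr = 0.05

for _ in range(5000):
    true_az = rng.integers(0, n)        # a sound+sight event at one azimuth
    aud = np.zeros(n); aud[true_az] = 1.0
    vis = np.zeros(n); vis[np.clip(true_az + prism_shift, 0, n - 1)] = 1.0
    W += lr * np.outer(aud, vis)        # Hebb: co-active cells wire together
    W /= W.sum(axis=1, keepdims=True)   # normalize to keep weights bounded

# After "development", where does an auditory cue at 0 deg point visually?
aud_test = n // 2                       # center of the map
print("learned visual target:", int(np.argmax(W[aud_test])) - n // 2, "deg")
# -> approximately +8 deg: the auditory map aligned to the shifted visual map
```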

A similar process occurs in humans. The formation of a unified spatial map of the external world arose very early in evolution; it's in the midbrain (no cerebral cortex required), and it's highly conserved from fish upwards. Here's what it looks like in humans:

[Attached image: sagittal brain section, with a red dot marking the inferior colliculus in the midbrain]


You see the little red dot there, and you see the white area surrounding it. The white area looks like it has two little bumps on the left side (which would be the back of the brain). These are the superior and inferior colliculi (you're only looking at half the brain, so there are actually four bumps, two on each side).

The little red dot is the inferior colliculus, which is the one that deals with hearing. Just above it is the superior colliculus, which deals with vision. Between the neurons of the superior and inferior colliculi are connections that map the visual field to the auditory field. These connections self-organize. The result can be manipulated by changing the statistics of the environmental input.

This is a model system, and we're studying it for a reason: in both owls and humans (which can localize sounds with approximately equal precision), there is a next step. In the colliculi in the brainstem, the synaptic weights resulting from the self-organization process end up hard-wired. But what if they weren't? What if we could change the visual-to-auditory mapping by dynamically changing the synaptic weights?

We need this capability because of navigation. Think of a rat navigating a maze to a reward, where the maze has several entry points, and the rat enters randomly at one of them. Same maze, same reward location, just different entry point. You can train the rat on three entry points, and then introduce a brand new fourth point, and show that the rat is mentally rotating existing models to find a fit. (Not only that, but it "superstitiously" prefers trained choices while it's searching, often attempting learned sequences).

Vision and touch map input to a sensory spatial surface, but hearing has to recreate that surface from binaural time differences. What is most interesting about the barn owl is it can localize sounds behind itself (it can turn its head almost 180 degrees), where there is no visual input. So what would you guess, does the spatial map coming out of the inferior colliculus contain the "behind" region or not?
 
I don't know what your point is. You have to be more concise. In human thought, emotions are in control. Emotions can determine what you think you know. The brain has five systems. They are bilateral. The limbic system can control every brain system in a crisis, real or imagined. It only expresses emotion and memory. It's not verbal or explicit. The message goes up to the prefrontal cortex, which then sets goals and actions. The PFC is explicit and understands words. The ability to understand the LS message is called coherence. Some people have high coherence, others low, and some so little they make mistakes constantly.
 

We're talking about machine learning.

Synthesizing higher dimensional spaces from lower dimensional ones.

Changing reference frames so you get an allocentric map of 3d space instead of a bunch of egocentric samples.
 
This is how a self organizing map works:

[Attached animation: a self-organizing map converging on clustered data]


You'll notice that it's the coordinate system changing, not any of the data.

Within each cluster, the dots are arranged to reflect the primary features within the cluster.

In the case of the alignment of sensory fields, there will be a similar but slightly less radical organization going on.

The question is about the dimensionality. Ears are 3-d and hear behind; eyes are 2-d and only see in front. So if you have one 3-d map and one 2-d map, what will happen?
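
For anyone who wants to see the mechanics, here's a minimal Kohonen-style SOM in NumPy. All the parameters (grid size, learning rate, neighborhood decay) are arbitrary choices for the demo; the point is that the unit coordinate system deforms to fit the data while the data itself never changes.

```python
import numpy as np

# Minimal Kohonen self-organizing map: a 10x10 grid of units learns to
# tile 2-d data. The data never changes; only the map's coordinate
# system (the unit weight vectors) moves.
rng = np.random.default_rng(42)
data = rng.random((2000, 2))                 # points in the unit square
grid = np.stack(np.meshgrid(np.arange(10), np.arange(10)), -1).reshape(-1, 2)
weights = rng.random((100, 2))               # 100 units, random positions

for step in range(2000):
    x = data[rng.integers(len(data))]
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))    # best matching unit
    # neighborhood on the GRID (not in data space) shrinks over time
    sigma = 3.0 * np.exp(-step / 700)
    lr = 0.5 * np.exp(-step / 700)
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))
    weights += lr * h[:, None] * (x - weights)           # pull toward x

print("corner units:", weights[0], weights[9], weights[90], weights[99])
# after training these spread toward the corners of the data square
```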
 
Hm. Curiouser and curiouser.


I have an alternative explanation for these results. The brain is making a prediction. Since it has identified the location of an auditory signal in space, it is expecting something visual to show up there too.

We're going to have to revisit these earlier studies, in light of what we know now.
 
This is the most interesting part of alignment.

What we've been looking at till now is Alignment 101. This right here is the 401 advanced course.

The drawing is pretty self-explanatory. The parts in orange are the hippocampal areas that contain grid cells, place cells, and border definition cells. The blue part is the parietal cortex, which is all about "where" things are from an egocentric perspective, on the body surface, or in the visual field.

Between them is the purple area called "retrosplenial cortex", which is Brodmann's area 30 - so called because it sits right behind the splenium of the corpus callosum, near the junction of the parietal, temporal, and occipital lobes.

[Attached diagram: hippocampal areas in orange, parietal cortex in blue, retrosplenial cortex in purple; a sample scene appears on the right]


This area performs geometric transforms of the sensory space (image on right) and relates the position and orientation of landmarks "in the current view" to other views of the same scene.

This is a fully 3-dimensional capability that can be invoked by a human at will and learned by a wide variety of organisms (including mice and lab rats).

The mapping between egocentric and allocentric representations has to be updated every time the eyes move, every time the head moves, and every time the body moves, which can include rotations in place as well as running (in which case the visual image is shaky, jittery, and changes rapidly).

The lesson to be learned from "the way the brain does it" is that things are done sequentially, piecemeal, using simple calculations. The mapping of allocentric space can be constructed from examples based on landmarks.
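
The egocentric-to-allocentric step itself is just geometry. Here's a minimal 2-d sketch: given the animal's pose (position plus heading), each egocentrically sensed landmark is rotated and translated into world coordinates, and two different views of the same landmarks land on the same allocentric points. The poses and landmark values are made up.

```python
import numpy as np

# Egocentric -> allocentric conversion in 2-d. Given the animal's pose
# (position + heading), each egocentrically sensed landmark is rotated
# and translated into world coordinates.
def ego_to_allo(landmarks_ego, pos, heading):
    """landmarks_ego: (N,2) offsets in body frame; heading in radians."""
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s], [s, c]])          # body-to-world rotation
    return landmarks_ego @ R.T + pos

# The same two landmarks seen from two different poses map to the
# same allocentric locations -- which is the whole point of the map.
lm_seen_from_A = np.array([[2.0, 0.0], [0.0, 1.0]])
pose_A = (np.array([1.0, 1.0]), 0.0)
print(ego_to_allo(lm_seen_from_A, *pose_A))           # [[3 1] [1 2]]

lm_seen_from_B = np.array([[2.0, 2.0], [1.0, 0.0]])   # same landmarks, new view
pose_B = (np.array([1.0, 3.0]), -np.pi / 2)
print(ego_to_allo(lm_seen_from_B, *pose_B))           # [[3 1] [1 2]] again
```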
 
There is a visual attention signal emanating from cerebral cortical area V4 (that's the one that handles color; it feeds surface shading and specularity information to IT, the inferior temporal cortex, for object recognition). The attention signal can be picked up along with the EEG as a scalp-recorded ERP called the N2pc. It occurs 150-190 msec after presentation of a stimulus.


The attention signal tells the brain which objects to pay attention to. So by the time this signal hits V4, some preliminary determination of importance has already been made.

Paying attention to an object that doesn't happen to be perfectly centered in the eye (which is 99% of the time) results in a saccadic eye movement to foveate the object. Accordingly, it is the Frontal Eye Field that generates the attention signal.

V4 is part of the "what" pathway. It has a map of which visual features belong to which objects. It keeps a catalog of object features: in addition to center of mass and boundaries, there is a list of surfaces and orientations for each object. When there is a new visual scene, the frontal lobe will issue a series of attention commands to explore it.

This particular attention command, the N2pc, is useful because it's elicited by a controllable behavioral test and it's entirely reproducible.

What this tells us is the frontal lobe starts guiding exploration even before it knows which specific objects are in the scene. You can compare the timing of the N2pc with the N70 which comes from the primary visual cortex, and the P300 which starts in and around the hippocampus.

Somehow, the visual signal gets to the frontal lobes before it gets to the hippocampus. No one knows how this happens. Face recognition is associated with an N170 and the visual face area is approximately the same number of hops from V1 as is V4. Recognition is associated with an N250, and lack of recognition (or incongruence) with a P300 and an N400. So object recognition occurs maybe 250 msec after presentation.

The object being recognized can be any size, any color, any orientation - and it can be occluded by other objects.
 
By now we get a good feel for the magnitude of the task. In humans, each retina has about 100 million photoreceptors, divided into roughly 93 million rods and 7 million cones. The field of view is 180-200 degrees horizontally and about 130 degrees vertically. Photoreceptors are smaller and more densely packed near the fovea, where a single cone subtends well under a minute of arc, much smaller than a dot on a page.

The 100 million photoreceptors converge onto about 1.5 million retinal ganglion cells in each eye (through about 36 million bipolar cells). So one retina has roughly the resolution of an HD monitor, 1920x1080 pixels give or take, except the code is a little weirder than plain RGB. When information from the 1.5 million RGCs arrives at V1, it gets re-encoded across approximately 140 million cortical neurons. The entire human visual system, all together, is about 8 billion neurons.
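
A quick back-of-envelope check of those convergence numbers (using the figures quoted above):

```python
# Back-of-envelope check of the retinal convergence numbers above.
photoreceptors = 100e6          # rods + cones per eye
bipolar = 36e6
ganglion = 1.5e6
print(f"photoreceptor -> RGC convergence: {photoreceptors / ganglion:.0f}:1")
print(f"bipolar -> RGC convergence:       {bipolar / ganglion:.0f}:1")
print(f"HD monitor pixels: {1920 * 1080:,} vs RGCs per eye: {ganglion:,.0f}")
```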

About 10% of retinal ganglion cells connect to the superior colliculus (in other words, it's a relatively low-resolution map). The remaining 90% connect (with a few exceptions) to the dorsal lateral geniculate nucleus of the thalamus, which then feeds the primary visual cortex, V1. The retino-geniculate projection sends collaterals to an area of the thalamus we haven't talked about (much) yet, the pulvinar.

The pulvinar is "the" central switching hub for visual attention. It connects the "what" stream with the "where" stream. It receives inputs from all of the early visual areas, including the retina, LGN, superior colliculus, and V1. It connects with V4 and IT, the object identification areas. It also connects with MT and the superior and inferior parietal lobules, the object location areas. It also connects with the frontal eye fields, and other areas involved in cognition like the anterior cingulate cortex.

So our vision has one pathway (the "what" stream) that's completely invariant to size, position, orientation, color, and internal geometry within limits - and another pathway that cares about these things very much and maps their locations in space. During simple episodic memory like navigation, the details of visual images are omitted and objects are represented as labeled landmarks. Landmarks are defined and acquired through exploratory behavior, which involves multiple diverse views of the same scene. These views are synthesized into 3- and 4-dimensional allocentric representations and encoded into episodic memory by the hippocampal system, which includes the hippocampus proper and all the areas of cortex around it, like entorhinal and retrosplenial.

These names sound like a directed graph, but really they're all very close to each other in the brain. Here for example is some perspective:

[Attached image: perspective view of the thalamus and midbrain; the colliculi are on the left and the geniculate nuclei on the right, with the pulvinar labeled just above]


The red and blue areas are the colliculi on the left and the geniculates on the right. Just above them you can see the label "pulvinar". So these functions we're talking about, like visual attention, begin at the roof of the midbrain. That's an old area in evolution, so old even birds have it (in birds it's called the tectum, which means roof). The architecture is approximately the same for animals with lateral eyes (like rats) and primates with forward eyes.
 
So back to the brainstem for a minute, then we'll do system architecture.

The interesting thing about behavior is, eye movements rarely occur by themselves, unless the object is already very close to the center of vision. Usually eye movements are accompanied by head movements, and sometimes whole body movements. The principle of minimizing free energy says it might be cheaper to move the head and body once (so the object can be brought to the center of vision), rather than keeping the eyes in an eccentric position for a long time.

So this brainstem area called superior colliculus, not only organizes eye movements, but it organizes head and body movements too - but only those that are needed to foveate an object of interest. It connects for example with neck muscles through the tectospinal tract and tectoreticulospinal tract. To understand the system architecture, one need only imagine all the different ways a visual object can be brought into focus. And this exercise will highlight the importance of system dynamics in the expression of behavior.

A purely stimulus-driven foveation (when the object is already in the field of vision) can be accomplished by a simple reflex from the visual system to the oculomotor map in the superior colliculus. To this end, the SC has inputs from all the early visual areas, including retina, V1 visual cortex, and many of the other visual cortical areas like V2/V3, especially those along the dorsal "where" visual stream, which includes areas in the parietal cortex mapping visual, auditory, and tactile space.

The selection of which object to foveate is complicated and involves large parts of the brain (the dorsal and ventral attention networks), but ultimately everything feeds into the frontal eye fields, which handle "voluntary" eye movements, including those during exploration. Ultimately, you have one area for voluntary (in the frontal lobe), one area for stimulus driven (in the parietal lobe), and a bunch of "influence" from just about every higher visual area and all the attention areas, and all that feeds into the superior colliculus, which somehow has to decide which inputs to use and which to ignore.

One of the interesting things is, these systems are separate from episodic memory. In other words, the way AI is doing it right now (where weights are adjusted based on stimulus sequence relationships) is not the way the brain does it. The brain separates out irrelevant information. The details of eye movements are not relevant to the memory of the episode. Only the resulting scenes are relevant.

The other interesting thing about scene memory in the hippocampus is, it's not a targeting level description. Superior colliculus and its accompanying systems all the way up to parietal cortex are targeting level descriptions. But hippocampus doesn't use that, it does its own thing. It builds a whole new description of space called allocentric, which is approximate, it's a model. The grid that's used for navigation is completely separate from the grid that describes retinal coordinates.

When an episodic memory is stored, it's stored with both egocentric and allocentric coordinates, and this is part of the context delivered by the frontal lobe in response to a request from the hippocampus. It is one of the things that allows animals to do "mental navigation", which is a capability found even in lab rats.
 
To complete the picture (as it were), one must realize that what the organism actually sees is a projection. Stereo vision is a projective space. When the organism is on flat ground looking forward, there is a horizon in the distance and parallel lines converge there.

[Attached image: a scene in linear perspective, with parallel lines converging toward the horizon]


This is why the "map of space" in the hippocampus isn't fully 3d. It's kind of "2-1/2 d". For instance, to navigate a maze the organism plots a path along the ground, which is the bottom half of the visual field - using information from the upper quadrants to identify landmarks and boundaries. A maze is perceived as a rising 2d plane that is sloped in the direction of the horizon.

This is why organisms don't spend a lot of time calculating the exact 3d embeddings of every object. Instead they approximate a map of landmarks, and mentally rotate it when needed. In this case, precise positioning is not needed, all that is needed is the relative positions and orientations of landmarks with respect to each other.
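
Here's what "mentally rotating the landmark map" could look like computationally: brute-force over candidate rotations of the stored map and keep the best fit to the current view. A real brain almost certainly doesn't do an exhaustive search; this is just the cheapest possible sketch of the rotate-and-match idea, with invented landmark coordinates.

```python
import numpy as np

# "Mental rotation" of a stored landmark map: try candidate rotations
# of the remembered map and keep the one that best matches the current
# view. A crude analog of entering a known maze from a new direction.
def best_rotation(stored, observed, n_angles=360):
    best, best_err = 0.0, np.inf
    for ang in np.linspace(0, 2 * np.pi, n_angles, endpoint=False):
        c, s = np.cos(ang), np.sin(ang)
        R = np.array([[c, -s], [s, c]])
        err = ((stored @ R.T - observed) ** 2).sum()
        if err < best_err:
            best, best_err = ang, err
    return best

stored = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])   # remembered map
theta = np.deg2rad(140)                                      # unknown new entry
c, s = np.cos(theta), np.sin(theta)
observed = stored @ np.array([[c, -s], [s, c]]).T            # rotated view
print(np.rad2deg(best_rotation(stored, observed)))           # ~140 degrees
```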
 
To keep perspective on scale, humans have about 1.5 million retinal ganglion cells in each eye, which when normalized is about 1200x1200 pixels. This diverges to about 6 million relay neurons in the LGN, where interneurons exist in a 4:1 ratio.

Mice, on the other hand, have only 30,000 neurons in each LGN. Yet they can still navigate, still make spatial judgements, still perceive and act in depth. Apparently, it doesn't take many neurons to do the trick.

The images used by the machine learning people to train neural networks are generally as small as possible, for example NIST has a handwritten character dataset that's only 16x16 pixels - however it's only monocular. For depth perception we need something a little better, and NIST has a 256x256 set of images in left-right stereo pairs that will work nicely. Therefore our retina has to be at least 256x256, which is a little on the large side but it'll work.

And, to handle the concept of "channels" (like X and Y cells) we will preprocess the input signal to conform with the known channel types, which is nothing more than convolving each image with a filter. So at each of the 65,536 points in an image, we will extract contrast, color, and motion.
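
As a sketch of that preprocessing step: below, a difference-of-Gaussians stands in for center-surround contrast, a red-green difference for color opponency, and a frame difference for motion. These filters are simplified stand-ins for real X/Y channel properties, not a retina model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Simplified channel extraction at each of the 256x256 points.
# Difference-of-Gaussians stands in for center-surround contrast,
# R-G opponency for color, and a frame difference for motion.
def channels(frame_rgb, prev_gray):
    gray = frame_rgb.mean(axis=2)
    contrast = gaussian_filter(gray, 1.0) - gaussian_filter(gray, 2.5)  # DoG
    color = frame_rgb[..., 0] - frame_rgb[..., 1]        # red-green opponent
    motion = gray - prev_gray                            # temporal derivative
    return contrast, color, motion

rng = np.random.default_rng(0)
prev = rng.random((256, 256))
frame = rng.random((256, 256, 3))
c, col, m = channels(frame, prev)
print(c.shape, col.shape, m.shape)      # three 256x256 channel maps
```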

When a brand new visual scene arrives, there is not yet any prediction. The first thing that happens with a new visual frame is a lookup, the purpose of which is to generate a prediction. If we were doing object recognition in the inferior temporal cortex we would be looking up our object in associative memory, but here at this early stage in the visual system the analysis is hard-wired and local. For each point in space, V1 will encode what's there; for instance it'll say "at this point there's a blue edge moving slowly to the right", so it has combined the information from the "channels" into a unified representation.

Obviously if the object is moving to the right then in the next visual frame it will be displaced from where it currently is. The calculations are very easy and entirely logical, like if the edge is here AND it's moving, multiply and add the speed in the direction of motion to get the new position. This becomes the prediction. It's transmitted back downstream to the LGN.

Next frame, LGN will compare the prediction with the subsequent retinal image. If the prediction is correct the images will match exactly and there will be no error signal. (In real life there will never be "no" error signal, because of noise and the stochastic behavior of neurons - but we can accurately call it "near zero").
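
Reduced to its bare minimum, the predict-and-compare loop looks like this. The "prediction" is just the current frame shifted by the estimated velocity; in this toy the motion is exact, so the error is exactly zero, which real neurons would only approximate.

```python
import numpy as np

# Minimal predict-and-compare loop: V1's "edge moving right" message is
# reduced here to a single global velocity. Prediction = current frame
# shifted by that velocity; the LGN-style error is the difference.
rng = np.random.default_rng(3)
frame0 = rng.random((64, 64))
vx = 2                                        # object moves 2 px/frame right
frame1 = np.roll(frame0, vx, axis=1)          # what the retina sees next

prediction = np.roll(frame0, vx, axis=1)      # "multiply and add the speed"
error = frame1 - prediction
print("mean |error| with correct prediction:", np.abs(error).mean())   # 0 here

bad_prediction = frame0                       # predict no motion instead
print("mean |error| without prediction:", np.abs(frame1 - bad_prediction).mean())
```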

After this, we can do invariant object recognition in the usual way using one of Yann LeCun's convolutional networks - but that's not exactly what we want in this case. We want the SCENE, we want the object relationships, more than the details of the objects themselves. So we'll adopt David Marr's 2-1/2 d strategy and treat all the objects like stick figures and shape primitives - EXCEPT, we have identified and labeled each object based on the inferior temporal object recognition in the other part of the brain. So now it's just a matter of saying "this" object is in "that" position and then predicting how the relationship will change in the next frame.

This is a simple and elegant method because all the details of information are still there, they're just in different parts of the brain. All the brain has to do is link up the bits of information that belong to each object. You can see the process is far more statistical than it is geometric. Most of the geometric parts are self organizing and end up becoming hard wired.

The only mystery is how the system dynamics govern the sequence of processing. It's hard enough to figure out with a single Kalman filter, but here in the early visual system there are a dozen such filters and they're all interconnected. THAT it works is not an issue, but WHY it works is debatable. LGN relay neurons turn out to be a bit complicated. They have a T-type calcium channel that activates just below the resting potential, so immediately after an IPSP the next thing that happens is a burst. The timing is such that the LGN relay neurons will pass one or at most two retinal spikes before hyperpolarizing, and this happens about 40 msec after a visual stimulus. So spike-pause-burst, and that's just within a single neuron (i.e., not due to the population).
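
The spike-pause-burst pattern can be caricatured as a little state machine: hyperpolarization de-inactivates the T-channel, and the next depolarization triggers a burst. To be clear, this is a cartoon with arbitrary constants, not a conductance-based model.

```python
# Toy caricature of the LGN relay neuron's spike-pause-burst behavior.
# Hyperpolarization de-inactivates a T-type calcium channel; the next
# depolarization then triggers a burst. Cartoon state machine only.
def relay_neuron(retinal_spikes):
    output = []
    t_ready = 0.0          # T-channel de-inactivation (0 = unavailable)
    pause = 0              # steps of post-IPSP hyperpolarization left
    for s in retinal_spikes:
        if pause > 0:
            pause -= 1
            t_ready = min(1.0, t_ready + 0.25)   # channel recovers
            output.append(0)                      # the "pause"
        elif s and t_ready >= 1.0:
            output.append(3)                      # burst (several spikes)
            t_ready = 0.0                         # T-channel inactivates
            pause = 4
        elif s:
            output.append(1)                      # tonic relay of one spike
            pause = 4                             # then hyperpolarize
        else:
            output.append(0)
    return output

print(relay_neuron([1] * 15))   # -> spike, pause, burst, pause, burst, ...
```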
 
So now that we know (kinda sorta) how the brain does it, let's look at how machines do it. The two best methods in modern machine learning are multi-view convolutional networks and neural radiance fields.

Multi-view CNNs work pretty much like the brain does: they decompose each view into features, then link the features across views.
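
The essential multi-view trick, shrunk to a few lines: run the same feature extractor on every view, then pool across views so the descriptor doesn't depend on view order. The hand-written "features" below are a stand-in for the shared convolutional trunk a real multi-view CNN would learn.

```python
import numpy as np

# The core multi-view trick in miniature: extract features from each
# view with the SAME function, then pool across views so the result is
# view-order invariant.
def extract(view):                       # view: (H, W) image
    gx = np.abs(np.diff(view, axis=1)).mean()   # crude edge statistics
    gy = np.abs(np.diff(view, axis=0)).mean()
    return np.array([view.mean(), view.std(), gx, gy])

def multi_view_descriptor(views):
    feats = np.stack([extract(v) for v in views])
    return feats.max(axis=0)             # element-wise max "view pooling"

rng = np.random.default_rng(7)
views = [rng.random((64, 64)) for _ in range(12)]       # 12 views of an object
print(multi_view_descriptor(views))                     # one object descriptor
print(np.allclose(multi_view_descriptor(views),
                  multi_view_descriptor(views[::-1])))  # order invariant: True
```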


Neural radiance fields work off the details of the light in each view. In theory they can reconstruct a large fraction of the third dimension from a single 2d view. (My expensive state-of-the-art facial mesh software needs 3 views to create a 3d model, so 1 view is pretty impressive.)


AI is presently at approximately the level of the parietal lobe. It has the 3d map but it hasn't put together the scene yet. As we've discussed, scene reconstruction occurs in the hippocampus and its distinguishing feature is that it uses an allocentric (external, world based) coordinate system rather than an egocentric (internal, organism-based) one.

Scene reconstruction in the hippocampus is accompanied by context delivered by the frontal lobes. The hippocampus decides what portion of the context is relevant to the scene. This process involves drawing information from long term memory and then creating new short term memories. The new memories are organized into "episodes" and consolidated over a period of hours to days, but meanwhile they're still available as context for additional episodes.

AI has nothing like this. Now that it can reconstruct 3d in near real time, the hot topic is building the scene in the allocentric reference frame. In computer graphics they do it by "rendering", which can take anywhere from seconds to hours depending on the method (ray tracing is beautiful but it takes a long time). But humans can do it in real time and that will be the requirement for AI too.

The third dimension is necessary for "reaching". Animals without a parietal lobe depend on the brainstem for the locations of objects in visual space, and when they reach, the limb moves correctly in the direction of the object but they can't grasp (they try, but they can't get the depth right).

For the animal to locate its own self in a scene, the animal's own position has to be represented in allocentric coordinates. That's what the place cells are for, in the hippocampus. The cell fires every time the animal is at that location, regardless of orientation or direction or speed of travel. Place cells are derived from the grid cells in the entorhinal cortex, and those form a hexagonal tiling of space, which in projective space means they can be mapped to object locations with simple matrix multiplication.
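
For what "simple matrix multiplication in projective space" means mechanically: points get a third homogeneous coordinate, one 3x3 matrix moves them to a new view, and dividing by the last coordinate projects back. The matrix below is arbitrary; only the mechanics are the point.

```python
import numpy as np

# Projective mapping as a single matrix multiply: 2-d points become
# 3-vectors in homogeneous coordinates, a 3x3 matrix H moves them to a
# new view, and dividing by the third coordinate projects back.
H = np.array([[1.0, 0.2, 5.0],
              [0.1, 1.1, -3.0],
              [0.001, 0.002, 1.0]])

pts = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # object locations
hom = np.hstack([pts, np.ones((3, 1))])                  # homogeneous coords
mapped = hom @ H.T
mapped = mapped[:, :2] / mapped[:, 2:3]                  # divide by w
print(mapped)                                            # locations in new view
```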

More interesting evidence that perception is predictive comes from a phenomenon called "visual completion". It was discovered and studied by Hermann Helmholtz. Let's say you have a sentence like

My cat got chased by a dog

and you show it to a subject in such a way that the word "a" falls exactly in the blind spot of the eye. So the subject can't see it, because there are no photoreceptors there. Under these conditions, the subject will insist he sees the word, insist the word is there, and when asked to draw exactly what he sees, will draw the word.

This is context being delivered from memory by the frontal lobe. It's a specific word that depends on the context of the sentence. It is being predicted as somehow belonging to the scene, at that specific location.
 