alignment of sensory fields

So now we can talk about eye movements.

You have 3 systems: one for targeting, one that's event driven, and one voluntary.

Targeting = superior colliculus in the brainstem
Event Driven = posterior parietal cortex
Voluntary = Frontal Eye Fields

They all talk to each other, and they all share some function. Each area has both a sensory and a motor function. For example, the parietal lobe targets details of an image on a voluntary basis; it is instructed where to target by the attention systems.

As we proceed centrally from the retina, we eventually get to the scene map in the hippocampus. The next thing that happens is the frontal lobe assigns value to each landmark, by talking with the nucleus accumbens. This information is passed to the attention system and helps the organism navigate.
 
So this seems easy and straightforward at first glance. The first thing we need is a retinotopic targeting system, and we know enough about the oculomotor muscles to make it work.

So here's an eye. Could be a human eye, could be a robot eye. You can see the muscles that move the eye. They're organized in pairs. When the eye wants to move, one of them contracts and the other one relaxes.

[image]


Here's a side view showing the same muscles when the eye is in its socket. The optic nerve exits the retina at the optic disc, just nasal to the fovea, and travels between the muscle fibers into the brain.

[image]


We know how the superior colliculus works. Stimulate a spot in the visual field and the eye will move to that position. But how does this actually happen? Let's take the simple case of a horizontal saccade, a side to side movement of the eye involving the medial and lateral rectus muscles. So we need to understand the pathway between the brainstem and the muscles. Fortunately, we do.

The relevant oculomotor nuclei are the oculomotor, trochlear, and abducens; their fibers exit the brainstem in cranial nerves 3, 4, and 6. The pathway looks like this:

SC => gaze centers => integrators => motor neurons

There are two gaze centers: the horizontal one in the paramedian pontine reticular formation, and the vertical one in the rostral interstitial nucleus of the medial longitudinal fasciculus. The gaze centers generate commands to the oculomotor nuclei to move the eyes. For the eyes to be held in position, there has to be a baseline firing rate of the neurons, corresponding to a tonic contraction of the muscle. So the motor neuron starts at a baseline rate, fires a burst of impulses while the muscle is contracting, and then settles into a new baseline rate determined by the new eye position.

How does the brain tell when the eye has reached its desired final position? Two ways. First, the target is now foveated, and second, there are neural integrators that add up all the eye velocities to determine position. Here's the lower part of the horizontal gaze pathway:

[image]


The two cortical oculomotor areas (the frontal eye fields and the posterior parietal cortex) both send outputs to the superior colliculus, but the frontal areas connect directly to the PPRF as well, so they can bypass the targeting system and execute a voluntary eye movement directly.

How the integrator works is really interesting. Premotor neurons burst at a rate proportional to the desired eye velocity, and usually (but not always) larger eye movements correlate with faster velocities. The output of the integrator is a step-like function corresponding to the final eye position. So the network takes a desired target position, breaks it up into a series of piecewise linear velocities suitable for muscle contraction, and then adds them up to make sure they land on target. This is a "local dynamic" and once again emphasizes the importance of dynamics. There's more to the story (like "omnipause neurons" that can inhibit or stop an eye movement), but that's basically it.
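Here's the burst-and-integrate idea as a toy sketch (the burst shape, time step, and function names are made up for illustration): accumulate a velocity burst sample by sample, and the running sum is the step-like position signal.

```python
# Toy velocity-to-position integrator. A "burst" is a list of eye
# velocities sampled every millisecond; the integrator accumulates
# them, and the final value is the new holding position.

def integrate_burst(velocities, start_position=0.0, dt=0.001):
    """Accumulate velocity samples (deg/s) into an eye position trace (deg)."""
    position = start_position
    trace = []
    for v in velocities:
        position += v * dt
        trace.append(position)
    return trace

# A 50 ms burst that accelerates, peaks at ~200 deg/s, then decelerates:
burst = [200 * t * (50 - t) / 625 for t in range(50)]
trace = integrate_burst(burst)   # ramps up, settles near 6.7 degrees
```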

It looks like a pretty simple system, doesn't it? The part we should study is the negotiation between the subsystems that determines what the organism will look at next. This takes us into bounding boxes and object identification and such - which we've mostly covered already, except for the bounding box, which is easy to do with a Raspberry Pi.
 
You can see clearly that this very simple system far surpasses the capability of modern AI. First of all, it's entirely self-organized; the only commands are a few gradients and some genetic sequencing. Secondly, consider that when an organism studies a visual image, it wants to extract the maximum possible amount of information from that image in the shortest possible time. In some cases compromises are acceptable, in other cases they're not. So the eyes move within scenes, as well as between scenes. How long to hold on to a scene is a function of the attention system, and attention depends very much on the perceived value of whatever's being attended to.

So AFTER the scene map in the hippocampus: in addition to its connections to the frontal cortex for requesting scene-relevant context, the hippocampus sends a massive output into the anterior thalamus, which feeds the anterior cingulate cortex - together these make up the "anterior attention system". The organism selects which landmarks to pay attention to, and for each selection the OTHER eye movement system, in the parietal cortex, analyzes its details in terms of surfaces, shading, and 3-dimensional relationships.

The attention system is complicated. There's not just one, there's half a dozen. Somehow they all end up connecting to this area, the anterior cingulate cortex:

[image]


The ACC has no direct connections with the gaze centers. Instead, it's the "front end" of the attention system. It helps decide which information should be attended to based on its value. The ACC takes spatial-cognitive map information from the hippocampus, analyzes its context relative to the current task and goals, and selects "relevant" parts of it for further exploration and analysis.
 
So I want to build one of these, to show it can be done. Entirely self organizing, that's the requirement.

I want to use this as a platform for exploring the logic of attention. I already have an architecture in mind. Basically it's just a stereo camera mounted on a robot, with a little motor to turn the camera so its "eyes move".

Later we can get the robot to navigate 3d space, but for right now I want to focus on the eye movements.

So, to do this, we'll need at least rudimentary scene analysis capability. We want to know where objects are and what their geometric properties are. In machine learning, this part already exists, in the form of convolutional neural networks. So we will ASSUME that these work and can be made to work, and focus our first efforts on the eye movement itself - the translation between the visuotopic map and the contraction of the eye muscles.

To begin with, we note that when the eye is at rest and there is no visual activity, the optic line is "centered", that is to say, there is no net contraction of the medial and lateral rectus muscles. Even in this condition, though, there is a small amount of resting contraction in each individual muscle, which keeps the eye in place (and, like a servo, pulls it back when it deviates). This is our initial configuration; to simplify the explanation, we'll only consider horizontal saccades to start with.

[image]


So we will focus on four sets of neurons: two that move the eyes inward (bilateral contraction of the medial rectus muscle), and two that move the eyes outward (bilateral contraction of the lateral rectus muscle).

For the eyes to move leftward in the visual field (to bring an object in the left periphery closer to the midline fovea), both eyes have to move left and that means the medial rectus muscle of the right eye (oculomotor nucleus, nerve 3) contracts, and the lateral rectus muscle of the left eye contracts (abducens nucleus, nerve 6). Simultaneously, the medial rectus of the left eye and the lateral rectus of the right eye will relax.
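That pairing can be written down as a small lookup table. A sketch (the identifiers are mine, nothing standard):

```python
# Muscles that contract for each horizontal gaze direction, per the
# pairing described above; the mirror-image pair relaxes.

HORIZONTAL_SACCADE = {
    "left": {
        "contract": [("left eye", "lateral rectus", "CN6 abducens"),
                     ("right eye", "medial rectus", "CN3 oculomotor")],
        "relax":    [("left eye", "medial rectus"),
                     ("right eye", "lateral rectus")],
    },
    "right": {
        "contract": [("right eye", "lateral rectus", "CN6 abducens"),
                     ("left eye", "medial rectus", "CN3 oculomotor")],
        "relax":    [("right eye", "medial rectus"),
                     ("left eye", "lateral rectus")],
    },
}
```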

We want this system not to adapt or potentiate; we don't want our math changing just because neurons are getting tired. We will take as our initial visual input the W cells that are known to project from the retina to the superficial superior colliculus. We will also overlay retinotopic maps from parietal and frontal cortex onto this same area, which we'll come back and use later.

The first task is translating retinal eccentricity into the duration of muscle contraction. We'll base our calculations on a centered visual line.
 
This is approximately what we're dealing with in the superior colliculus:

[image]


Pretty easy, pretty straightforward. We have about 100,000 axons coming in from each retina (10% of a million), and we can arrange these any way we want but we should decide up front because the arrangement should be consistent. So let's use a hexagonal grid since that can be easily converted to both polar and Cartesian coordinates.
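For the hexagonal grid, the standard axial-coordinate conventions give us the conversions for free. A sketch, assuming a pointy-top lattice with unit spacing (the layout choice is arbitrary):

```python
import math

# Axial hex coordinates (q, r) -> Cartesian (x, y) -> polar (ecc, angle).

def hex_to_cartesian(q, r, size=1.0):
    x = size * math.sqrt(3) * (q + r / 2)
    y = size * 1.5 * r
    return x, y

def cartesian_to_polar(x, y):
    return math.hypot(x, y), math.atan2(y, x)

x, y = hex_to_cartesian(2, 0)        # two cells along the horizontal axis
ecc, ang = cartesian_to_polar(x, y)  # eccentricity 2*sqrt(3), angle 0
```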

In the drawing, visual inputs come in at the top, visual neurons are in green. We have to account for a few idiosyncrasies like the nucleus of Edinger-Westphal and the suprachiasmatic nucleus, which handle things like pupillary accommodation and circadian rhythms, but by and large the optic layer of the SC is just feed-forward topography with some local lateral inhibition for sharpening.

The motor layers are more complicated. We're specifically interested in the output labeled "paramedian pontine reticular formation". The PPRF burst neurons encode eye velocity, and that velocity signal is integrated into eye position by the medial vestibular nucleus (which handles reflexes involving head and body position, like the vestibulo-ocular reflex) and the nucleus prepositus hypoglossi, whose outputs combine with the burst signal at the oculomotor neurons. In a nutshell, SC neurons signal "move the eyes here to this target" and PPRF neurons convert that to saccade amplitude and velocity "from the current eye position" (which in this thought experiment always starts out centered on the visual line - the visual line, btw, points straight out of the eye, normal to the fovea; it is the gaze direction of each eye). PPRF neurons excite the ipsilateral abducens and the contralateral medial rectus, so the ones on the right side of the brain move the eyes to the right. (The laterality gets a little weird because there are lots of crossings.)


There are two distinct functions to look at: bursting that initiates a saccade, and terminating the saccade and thereafter holding the eyes in position. A saccade itself can take anywhere from 15 to 100 msec; it typically accelerates and then decelerates.

[image]



Note that a saccade can be manipulated to expose additional dynamics. We're not yet in a position to say where these dynamics come from, but the model will predict them eventually. Here's an example using just a behavioral paradigm, and you can do the same thing with brain stimulation, measuring either individual saccades or ensembles.

[image]


I'll let you read through this great paper which describes the integrator better than I can.

 
To get this to work, the first thing we have to do is enable a VOR. The vestibulo-ocular reflex keeps your eyes steady while you're walking. Our robot may or may not walk, but it will definitely be called upon to navigate over rough terrain, so one of these is essential. It's pretty easy: your head goes one way and your eyes go the other, in equal amounts, so as to keep the retinal image stable in both the horizontal and vertical directions. And since we already know horizontal and vertical are handled separately, we'll keep this motif because it's convenient for our inertial sensors (the stand-ins for the semicircular canals) as well.
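The ideal VOR is a one-liner per axis: eye velocity is the negative of head velocity, with the two axes handled independently. A sketch (a gain of 1.0 is the ideal; a real system has to calibrate it):

```python
# Idealized VOR: counter-rotate the eyes against measured head motion
# so the retinal image stays put. Axes are handled separately.

def vor_command(head_velocity, gain=1.0):
    """head_velocity: (horizontal, vertical) in deg/s from the gyros."""
    h, v = head_velocity
    return (-gain * h, -gain * v)

cmd = vor_command((30.0, -5.0))   # equal and opposite eye command
```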

So here's where we are right now: what we need is to model the ramp-and-step response of the integrators. You can see the action of the integrator in the lower half of figure A (the eye movement is in the upper half). In B you can see the PPRF "pre-motor" neuron, which only has a burst - it doesn't have a step. The burst signals the velocity and amplitude of the eye movement, whereas the step indicates the resulting eye position.

[image]


So now maybe we get a clue what the "up" and "down" states of a neuron are for. Which is a whole separate discussion in a different thread. If you're interested, Google "medium spiny neurons striatum" and look for the L-type calcium conductance.
 
In our robot we're going to replace the semicircular canals with accelerometers and gyroscopes. Cheap ones, the kind you find in cell phones. We will encode their outputs in binary form so we can use them with 1 dimensional neural networks.

The first task then, is to align the sensory maps from the gyros with the visual map from the retina. We will only do this for the portion of 3d space that the organism can actually see - for the rest there will never be any visual signal, so we can either prune those connections or just not map them.

Our neural network works on correlation and covariance, and it uses lateral inhibition to sharpen the boundaries between neighboring points. So what we want is for the origin in body space to align with the origin in eye space when the eyes are fixated forward (ie when there is no oculomotor command). This part is fairly easy to accomplish; it can be done frame by frame in the population on the basis of overlapping Gaussians. The harder part is to establish a linear relationship between the signal from the gyros and the deviation of the eyes. Because we need to know what the "correct" relationship is - we need a standard to measure against, because something has to create an error signal so our network can self-organize.

One way to do this is to measure how much the retinal image slips when the head moves, and use the slippage as our error signal (because we eventually want zero slippage). But to do this we need a spatial reference, and the only part of the visual field that isn't usually moving (in humans) is the extreme upper outer edge of the retina, which looks at the nose. (You can try it yourself, pay attention to your nose and then move your eyes from side to side - your nose doesn't move very much when you do this).
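The retinal-slip idea can be sketched as a toy adaptive loop: start with a wrong VOR gain, measure the slip left over after each head movement, and nudge the gain until the slip goes to zero. Everything here (the learning rate, the "plant" gain, the function name) is made up for illustration.

```python
# Calibrating a VOR gain from retinal slip, LMS-style.

def calibrate_vor_gain(head_velocities, plant_gain=1.0, gain=0.5, lr=0.1):
    """Drive retinal slip toward zero; returns the learned gain."""
    for hv in head_velocities:
        eye = -gain * hv                # our compensatory eye velocity
        slip = plant_gain * hv + eye    # leftover image motion on the retina
        gain += lr * slip * hv          # nudge the gain to cancel the slip
    return gain

g = calibrate_vor_gain([1.0, -1.0] * 200)   # converges toward 1.0
```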

There are two ophthalmologic laws we have to conform to, called Listing's law and Donders' law. Listing's law says eye rotations occur around axes that lie in a single plane, and Donders' law says the torsion of the eye at any given position is the same no matter how it got there.

So let us prepare for full 3d with six sets of accelerometers and gyros, 3 on each side, 1 for each degree of freedom (dimension). For now we will ignore the Z axis, although we'll train it anyway. X will be our horizontal axis, and Y will be our vertical axis. To start with we're only interested in horizontal, and once that works we'll figure out how to replicate it to vertical.

So we have 1 accelerometer and 1 gyro in each ear, that's 4 devices total, and we'll say each device puts out a 16 bit numerical value. So 64 bits total, 8 bytes. On the retina side we have a square grid of 100,000 inputs, so about 317x317, so each 16 bit value will get mapped to one of 317 slots, minus the fraction of the visual field that's never seen. We therefore need 317x4 hidden units, some of which will be redundant.
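Binning a 16-bit reading into one of the 317 slots is one line of integer arithmetic. A sketch (names are mine):

```python
# Map an unsigned 16-bit sensor value onto a 317-slot retinal axis
# (317 ~ sqrt(100,000), the grid dimension from above).

GRID = 317

def sensor_to_slot(raw16):
    """raw16 in 0..65535 -> slot index in 0..316, linear binning."""
    if not 0 <= raw16 <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    return raw16 * GRID // 0x10000

slot = sensor_to_slot(0x8000)   # mid-scale reading lands mid-grid
```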

1200 parameters is a TINY number for a neural network (ChatGPT has literally billions, but then again it uses Nvidia GPUs and we're just using a Raspberry Pi). It usually takes about 10,000 iterations to train a network, and in our case an iteration is just matching a sensor configuration against a position on the retina, which takes a few msec. So this part of our network can be trained in less than a day simply by opening the eyes and moving the head a lot. Easy peasy.
 
Here is the first piece of serious engineering. Amazingly enough, no one really knows how the integrator works. So we have two issues that are really part of the same problem. First, we have the topographic map that needs a reference. And second, we have a neural behavior that translates position to burst rate.

From a machine learning standpoint, the first part is easier. It's kinda sad reading through the biology literature about the second part though, the biologists don't know from engineering and just wave their hands over the entire issue. Like, "this signal goes one way and the other signal goes the other way", just kinda assuming that it's all possible and the engineers will eventually figure it out.

So here's a little input about the first part, the alignment of maps. In young birds this takes about 100 days, in a machine it can be done in less than a day.


The second part is intriguing. There are multiple ways to accomplish it in a neural network, but none of them are necessarily easy. We want a burst rate that's proportional to the velocity of the saccade - that's what the PPRF neurons encode - and saccade velocity is proportional to eccentricity (ie location) on the retina. The trouble is that bursting is related to the intrinsic membrane properties of neurons. The input doesn't burst; the neuron somehow translates eye position to spike rate based on the behavior of synaptically controlled ion channels.

The machine learning view tells us we should be very careful in defining what we want. An eye movement has an amplitude, a direction, and a velocity - and those are derived from the starting location (empirical) and ending location (desired, predicted). From a control systems viewpoint we have ONE muscle that can be driven by ONE neuron and all it has to do is correctly calculate the rate. But this is not at all how neurons work! Right?

The real muscle has a few thousand muscle fibers, and the more of them that get engaged the more contraction you get. Each individual muscle fiber is noisy and unreliable, but together the population averages out the hiccups and you get a nice smooth Gaussian. In this case the eye movement velocity is related to the mean of the Gaussian, because that determines how many muscle fibers are contracting on average, at the same time.

So one way you could do this is to have short unidirectional connections between neurons. Another way is with a crossbar. A third way is to have the network specifically control channel conductances in a very small population of highly connected integrators. All we really need is a signal of the form

integral from 0 to a, of x dx

Where a is the target position on the retina. The result will then be multiplied by some scaling factor to keep the eye movement within bounds (often this is an S shaped curve or logistic function).

From a biological standpoint, it's too risky and unreliable to have our network calculate channel conductances in specific neurons. The easiest and most reliable way to do this given that the starting position is always the origin is to have every source connected to every destination, and simply add up all the active destination elements under the following rule:

if this neuron lies between the origin and the target, output 1, else output 0

So let's say we're making an eye movement 10 degrees to the right, and the limit of our visual field is 100 degrees. The logic says the first 10% of our neurons should be active, resulting in a total output of 10. On the other hand, if we're moving 20 degrees to the right we get a total output of 20, and so on. We end up with a nice linear relationship between target position and output activity. And, the decision for each neuron is entirely local, making it biologically plausible. And, note that topography is no longer needed at this point, we just end up with a single number that says "amount" by which to contract the muscle.
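The whole scheme fits in a few lines. Each neuron makes the purely local decision described above, and the drive is just the count of active neurons; with a 100-degree field and one neuron per degree, a 10-degree target yields an output of 10. (The function and parameter names are mine.)

```python
# Population integrator: every neuron asks "am I between the origin
# and the target?", and the muscle drive is the number of yeses.

def population_drive(target_deg, n_neurons=100, field_deg=100.0):
    spacing = field_deg / n_neurons
    # neuron i sits at eccentricity (i + 1) * spacing
    return sum(1 for i in range(n_neurons)
               if (i + 1) * spacing <= target_deg)

d10 = population_drive(10.0)   # -> 10
d20 = population_drive(20.0)   # -> 20: linear in target position
```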

Now that we have a basic mechanism, we can fit it into what we already know about the biology, and see if it works in terms of predicting the known neuron types and behaviors, and the various neural and biochemical pathologies. And it does work, the only odd man out is the "omnipause" neuron that plays a specific role we haven't defined yet (but we will).

Once we have a basic integrator we can build a network. So let's do that. We will define our SC (superior colliculus) to be 320x320, and our retina can be the same since there's no need for higher precision at this point. Then to get the X axis, we'll have 320 neurons that map eccentricity using a "one-hot" code, in other words, the eye can only be in one place at a time, and we only need one neuron to show where it is. And finally, to get the burst rate we determine "how many neurons are medial to the selected position". Easy peasy, and trivial to replicate to the Y axis.

As far as neural "place codes" go, this is probably the easiest one. It can be done linearly, or it can be done statistically using a population code (which is probably closer to what we actually want, since we already know PPRF is a population and not a single neuron). In a population context, each neuron simply makes the decision "am I closer to the midline than the target", and we simply add up all the neurons that say yes.
 
Wow. Doing this with off the shelf components is going to result in a large and unwieldy prototype.

I want something small and portable, and it looks like I'll have to build it myself. Here's what I found for just the camera:

We need a stereo camera with two separate eyes, that can be individually positioned. In macro-world we could use two PTZ cameras, which already swivel in the horizontal and vertical directions.

[image]


It's way too big, it's for desktops. But that's the idea. Only, we want it small. Maybe like this:

[image]



Only, by the time we get a stepper motor on that thing it'll be just as big as the other one.

So here's what I finally settled on. There's a class of tiny PTZ camera that's used for motion capture. All of the robotics are actually inside the lens, there's a controller board for the motors and a graphics daughterboard. Looks like this:

[image]


You mount this once in a fixed position, and thereafter it's controlled by software. Two of these can be easily mounted on a mobile platform. They require IDE-type power for the motors, and the ribbon cable plugs directly into a Raspberry Pi and becomes a device in Linux.

I mentioned that we needed a visual line and a reference point, so we can just pick a pixel and designate it as the fovea. The obvious choice is the one right in the middle. We can save a lot of time and trouble down the line by having the driver translate coordinates for us: we want the origin in the middle of the frame, eccentricity and angle in both polar and Cartesian coordinates, and the same coordinate systems for the motors and the graphics.

Our first task is to overlay the motor map with the graphic map. The test goes like this:

1. a light turns on somewhere in the visual field
2. the camera detects where it is and provides its retinotopic coordinates
3. the neural network calculates a vector from the current eye (camera) position to the position that will foveate the light, and passes it to the motors
4. the motors execute a "saccade" so as to foveate the light
5. the light turns off

We can test various learning algorithms in the neural network. The obvious one is the self organizing map, which is easy computationally, something a Raspberry Pi can easily do. Once we have a map that works, we save off the weight matrix and we're done.
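For reference, the self-organizing map algorithm itself is tiny. This is a generic 1-D SOM on toy data in [0, 1] - the unit count, learning rate, and decay schedule are arbitrary choices, and it stands in for (not reproduces) the camera/motor calibration:

```python
import math, random

# Minimal 1-D self-organizing map: units learn to tile the input range.

def train_som(samples, n_units=10, epochs=200, lr=0.3, sigma=2.0):
    random.seed(0)
    w = [random.random() for _ in range(n_units)]
    for _ in range(epochs):
        for x in samples:
            bmu = min(range(n_units), key=lambda i: abs(w[i] - x))
            for i in range(n_units):
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                w[i] += lr * h * (x - w[i])   # pull bmu and neighbors toward x
        lr *= 0.98                            # decay rate and neighborhood
        sigma = max(0.5, sigma * 0.98)
    return w

samples = [i / 19 for i in range(20)]
weights = train_som(samples)   # the saved-off "weight matrix"
```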

We've waved our hands over a HUGE number of issues, not the least of which is noise and stochastic behavior. For now though, we just want to show that the mapping algorithm works, and we can come back and finesse things later.

The next step is to design the neural network. We have to choose an integration mechanism. For present purposes it really doesn't matter "how" it works, only "that" it works, so to simplify the math and save computation time we'll just use the pixel coordinates. In other words, the motors will simply subtract the present location from the destination coordinates to get the required movement.
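In the pixel-coordinate shortcut, the whole "network" reduces to a vector subtraction:

```python
# Saccade command as coordinate subtraction (retina-centered frame,
# fovea at the origin).

def saccade_vector(target, current=(0, 0)):
    """Both points are (x, y) pixel coordinates; returns (dx, dy)."""
    return (target[0] - current[0], target[1] - current[1])

move = saccade_vector((40, -15))          # from center
step = saccade_vector((10, 10), (4, 2))   # from an eccentric position
```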
 
This video shows the camera working in PTZ mode. This scenario could be, you're podcasting and the camera is located in an upper corner of the room.

 
Ha. I found something even better. Fully assembled for less than 20 bucks.


And, this camera, which has audio too:

[image]


Piece of cake. So now there are two scaling requirements:

1. The camera is 1080p and we really only want 320x320
2. The motors should come in at the same resolution as the image

So here's the plan: :p

Each eye will be an Arduino, with a PTZ gimbal and a camera. The Arduino will handle the motors only, not the camera. We will have standard resolutions Rx and Ry (320 to start with) which apply to all communication between devices, and each device will scale its own coordinates as needed.

The video will be fed to a Raspberry Pi with an AI hat, which is like a mini-GPU. The Raspberry Pi will talk to the Arduinos over USB. Each Arduino will have set and get registers for X and Y coordinates, as well as eccentricity and angle. And each Arduino will have an "execute" command, so the endpoint registers can be programmed in advance of an eye movement.
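Here's a host-side sketch of that register interface as an in-memory mock - no real USB or serial, and the class and register names are mine - just to pin down the protocol: program the endpoint registers, then issue "execute".

```python
# Mock of the per-eye register interface: set X/Y endpoints, then execute.

class EyeController:
    def __init__(self, rx=320, ry=320):
        self.rx, self.ry = rx, ry
        self.x, self.y = rx // 2, ry // 2   # start centered on the visual line
        self.pending = None

    def set_target(self, x, y):
        """Program the endpoint registers in advance of the movement."""
        if not (0 <= x < self.rx and 0 <= y < self.ry):
            raise ValueError("target out of range")
        self.pending = (x, y)

    def execute(self):
        """Commit the programmed endpoint - the 'saccade'."""
        if self.pending is None:
            raise RuntimeError("no endpoint programmed")
        self.x, self.y = self.pending
        self.pending = None
        return (self.x, self.y)

eye = EyeController()
eye.set_target(200, 120)
pos = eye.execute()   # -> (200, 120)
```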

Now we can get down to the real business, of programming the neural network. We have to work backwards, to understand the conversions. First, the single signal that says "amount of muscle contraction right now" (which equates with a command to turn the motor) is the sum of a large number of inputs (we said, all neurons medial to the target). Second, we need a more precise definition of how exactly the distance from the midline is calculated - BECAUSE, we need this information to determine whether we're properly aligned at the end of the eye movement (or more specifically, to calculate the amount of misalignment, which is our error signal).

A couple of observations on the technicals:

First, our gizmo has to actively scan each frame to determine whether a spot of light is present. To save computation time, we can say spot = 1 and no spot = 0 for each pixel and simply add all the pixels, and if we're non-zero we have a spot. And then we can scan only that particular frame to determine where the spot is.

Second, we need to monitor the state of the network, because eventually we might want to turn off the camera while an eye movement is in progress.

Third, there may be physical and device limitations we haven't considered yet. The motor may not have the same precision as the camera. Some calibration may be necessary, and maybe even some downloads of lookup tables from one device to another during system calibration. This is necessary at this early stage because we don't have a neural network yet!
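The first observation above - binarize, sum to test for a spot, and only then scan for its location - looks like this as a sketch:

```python
# Cheap spot detection: threshold every pixel to 0/1, sum the frame,
# and only scan for coordinates when the sum is non-zero.

def find_spot(frame, threshold=128):
    """frame: 2-D list of intensities. Returns (row, col) or None."""
    binary = [[1 if p >= threshold else 0 for p in row] for row in frame]
    if sum(map(sum, binary)) == 0:
        return None                      # no spot: skip the full scan
    for r, row in enumerate(binary):
        for c, bit in enumerate(row):
            if bit:
                return (r, c)

frame = [[0] * 5 for _ in range(5)]
frame[3][2] = 255
hit = find_spot(frame)   # -> (3, 2)
```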
 
Comparison of AI neurons and real neurons

As context for the neural network, and also to demonstrate how far modern AI is from anything real, let's take a quick look at neurons. Starting with the "prototypical" Hodgkin-Huxley neuron (the simplified picture from which modern AI neurons are abstracted), and then a real example of a pyramidal cell in the hippocampus (and we're looking at that because it's a space map, and we want to know what we should do with our map of visual space).

An H-H "theoretical neuron" has only three components: two channels and a pump. The sodium channel is voltage gated: it opens at a threshold voltage and only stays open for a little while. The story begins with the pump, which pumps sodium ions out of the cell. That leaves more positively charged sodium ions outside than inside, giving a negative voltage inside the cell, called the resting potential, usually around -70 mV. This is the "0" state of the neuron, the resting state. So in AI, when we say a neuron has 0 activity, this is what we mean. The pump has reached electrical and osmotic equilibrium and the neuron is just sitting there at a steady membrane potential.

When we activate a synapse, we get some additional receptor-linked conductances, local to the synapse, that drive the local membrane potential up or down. Synaptic conductances are also usually ionic, involving Na+, K+, Ca++, Cl-, or H+, although some synapses involve gap junctions and hormones. Generally speaking, we excite the cell by depolarizing it, and a single synapse only depolarizes by maybe 2 to 3 mV. But if we get 15 or 20 synaptic activations in a row, or simultaneously, we might hit the threshold of the voltage gated sodium channel, in which case it opens with positive feedback, resulting in a "spike" (an action potential). The threshold could be anywhere from -60 to -10 mV, depending on the transfer function of the neuron (which is to say, the concentration of ion channels). This is what the Hodgkin-Huxley dynamic looks like in the phase plane:

[image]



This is what the textbooks tell you, and this is how modern AI works:

The triggering of an action potential only happens in one place in the neuron, right at the very beginning of the axon (it's called the "axon hillock"), because that's where all the voltage gated sodium channels are concentrated. The rest of the nerve membrane (other than the axon, so like, the dendritic tree) is passive, so when we excite a synapse the depolarization has to diffuse to the axon hillock (summing with other synaptic events along the way), and then when the threshold is reached the nerve impulse only goes one way, which is down the axon. Here's the textbook view:

[image]


So in AI, we consider that everything north of the axon hillock is a passive integrating surface where the membrane potential is a simple sum

V(t) = sum over inputs (input * weight)

and the threshold is represented as a nonlinear activation function

Output(t) = f(V(t))
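For concreteness, those two formulas as runnable code (with the logistic function standing in for f):

```python
import math

# Textbook AI neuron: weighted sum of inputs, then a nonlinear activation.

def ai_neuron(inputs, weights, bias=0.0):
    v = sum(i * w for i, w in zip(inputs, weights)) + bias   # V(t)
    return 1.0 / (1.0 + math.exp(-v))                        # f(V(t))

out = ai_neuron([1.0, 0.0, 1.0], [2.0, -1.0, 2.0], bias=-4.0)   # v = 0 -> 0.5
```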

And this might be fine for motor neurons and Perceptrons, but it's not at all how cerebral neurons work!

First of all, when the action potential is generated, it travels in BOTH directions, not just one. The spike travels upwards (backwards) into the dendritic tree, where it influences the synapses. This backpropagating spike is what drives plasticity phenomena like LTP, LTD, and STDP, versions of which are used as learning rules in AI.

And second, there is not just "one" spike generating conductance, there are many. In addition to the voltage gated sodium channel found at the axon hillock, the most common is a voltage gated calcium channel that's found in the dendrites, and specifically in the dendritic spines that form the synapses. In pyramidal cells in the hippocampus, as well as Purkinje cells in the cerebellum and cortical neurons, there are multiple different kinds of calcium spikes with different time courses and different propagation characteristics.


Here you can see some dendritic events and the different latency bins for them. The little red dots are dendritic calcium spikes.

[image]


The backward traveling sodium spike from the axon hillock interacts with the forward traveling calcium spikes from the dendrites in highly nonlinear ways. The simple sum-and-activation neuron whose weights get optimized in networks like ChatGPT falls far short of what's actually happening. A dendritic spike can generate a "plateau potential" that puts the entire dendritic segment into a whole different sensitized state. Obviously these are far more sophisticated mechanisms than just simple optimization.

So we'd like to be aware of these issues before we race off generating self organizing maps.
 
Now we get to the good stuff. We're going to map our retina onto our oculomotor system, which will allow us to orient to visual stimuli.

The first thing to realize is that the primate SC is different: it only maps the contralateral visual field, whereas in cats and mice each colliculus maps the entire visual field. The input to the SC is binocular, but there are no ocular dominance columns. Nevertheless there is an overall mapping that resembles the mapping from retina to cortex.

1766363199882.webp


For this exercise we want the primate model, and here is a great review on what is specifically known about the primate superior colliculus.


Putting on a machine learning hat and thinking about mapping two retinas to two superior colliculi, an issue immediately leaps out at us: we have no feedback signal, so we can't generate an error term. There's no pathway back from the SC to the retina.

So how is this done in humans and monkeys? The retina has to drive everything: it has to connect to both the SC and the LGN in a topographic manner, without any neural feedback. It turns out the geometry of the retina in (approximately) x and y coordinates is programmed and maintained genetically by cell surface proteins called ephrins. There are gradients of ephrins from the nasal (nose) side of the eye to the temporal (ear) side, and the developmental process has already been modeled to about 1% precision.


So WE get lucky. This part of the work has already been done and it works, so we can just connect our retina to our SC according to the coordinates and move on to the more interesting part.

From a machine learning standpoint, we now need to encode the location of our target and convert it into a number that says how much to contract each muscle. The target is a directional vector in the visual field, represented in the upper (superficial) layers of the superior colliculus. We can keep using x and y coordinates if that's convenient, and we have the coordinates of the spot of light, but now we need to translate them into x and y movement amounts. We can do this in a number of ways.
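As a concrete illustration of one of those ways, here is a minimal sketch that maps an (x, y) target offset linearly onto push-pull commands for the two horizontal and two vertical rectus muscles. The function name, the linear mapping, and the gain constant are assumptions for illustration, not measured oculomotor parameters.

```python
# Sketch: translate a target offset on the SC map into push-pull muscle
# commands. Linear mapping and gain are illustrative assumptions.

def target_to_muscle_commands(x_deg, y_deg, gain=0.02):
    """Convert a target offset (degrees of visual angle) into contraction
    changes for the two horizontal and two vertical rectus muscles.
    Positive = contract, negative = relax."""
    horizontal = gain * x_deg   # drive toward the lateral rectus
    vertical = gain * y_deg     # drive toward the superior rectus
    return {
        "lateral_rectus":  +horizontal,
        "medial_rectus":   -horizontal,  # antagonist relaxes by the same amount
        "superior_rectus": +vertical,
        "inferior_rectus": -vertical,
    }

cmd = target_to_muscle_commands(10.0, -5.0)  # target 10 deg right, 5 deg down
```

The push-pull structure mirrors the paired-muscle arrangement described earlier: one member of each pair contracts while its partner relaxes by the same amount.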
 
Before engaging in a design review (where we often talk about things we can get rid of), it behooves us to understand the architecture and behavior of the two other major eye movement mechanisms: vergence and smooth pursuit. (We've already talked about saccades and the VOR, and there is technically a fifth type, the disjunctive saccade, which we'll see we can assemble from the others.)

The first thing to note is that portions of these pathways bypass the SC and go straight to the integrators or the motoneurons (so we can't get rid of those; we have to keep them). The pathways are very logical. Smooth pursuit involves axons from visual area MT (which processes motion) and from the frontal pursuit area abutting the frontal eye fields, ending in the brainstem. Vergence involves parietal area LIP (which we've talked about; it handles 3D vision) and a frontal vergence area that also abuts the frontal eye fields. Generally speaking, it is only the nerve fibers from the frontal cortex that bypass the superior colliculus, and these are the ones responsible for eye movements in the absence of a visual stimulus. There are such eye movements in humans; for example, during deep thought we may emit non-visual saccades.

A moving visual stimulus is usually required to initiate smooth pursuit. Smooth pursuit is a reflex. The pathway starts in the retina and goes through LGN to V1, then to V2/V3, and finally to area MT which generates the motion signals. It is important to note that the frontal pursuit area does not select which moving stimulus to respond to, it only figures out how to respond to the selected one. The job of selection is done by the attention systems in the anterior cingulate cortex.
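The pursuit reflex described above can be caricatured as a negative-feedback loop on retinal slip (target velocity minus eye velocity), which is roughly what the MT-to-brainstem pathway computes. The gain and time constants below are illustrative guesses, not fitted primate values.

```python
# Sketch of smooth pursuit as a feedback loop that cancels retinal slip.
# Gain, dt, and tau are made-up illustrative parameters.

def pursuit_step(eye_vel, target_vel, gain=0.9, dt=0.01, tau=0.05):
    slip = target_vel - eye_vel              # retinal slip, the MT motion signal
    return eye_vel + gain * slip * dt / tau  # accelerate the eye to cancel slip

eye_vel = 0.0
for _ in range(200):                         # 2 seconds of simulated tracking
    eye_vel = pursuit_step(eye_vel, 10.0)    # target moving at 10 deg/s
```

At steady state the slip goes to zero and eye velocity matches target velocity, which is exactly what makes pursuit work as a reflex with no explicit position goal.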



The vergence system is really interesting and bears some discussion. The depth dimension in human vision is entirely synthetic: it is derived rather than directly observed, yet vergence is still a reflex. The z axis is derived from binocular disparity. The calculation of disparity is somewhat complex and proceeds through at least 3 stages before being made explicit in area LIP, where the occipital, temporal, and parietal cortex meet. Area LIP feeds the scene map in the hippocampus, but it also feeds the SC and the frontal vergence area. In turn, the frontal vergence area bypasses the SC and synapses directly onto the integrators.
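For a feel of the geometry, here is the standard relation between fixation depth and the vergence angle the two eyes must adopt. The 6.3 cm interpupillary distance is a typical adult value, used here only for illustration.

```python
import math

# Sketch: the synthetic z axis from binocular geometry. The vergence angle
# needed to fixate a point at depth d is 2*atan(ipd / (2*d)).

def vergence_angle_deg(depth_m, ipd_m=0.063):
    """Vergence angle (degrees) for both eyes to fixate a point
    at depth_m meters, given interpupillary distance ipd_m."""
    return math.degrees(2.0 * math.atan(ipd_m / (2.0 * depth_m)))

near = vergence_angle_deg(0.30)   # reading distance: eyes converge strongly
far = vergence_angle_deg(10.0)    # far target: eyes nearly parallel
```

The steep falloff with distance is why disparity (and hence vergence) only gives useful depth information in near space.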


Two things about vergence:

1. It's coupled to lens accommodation, which keeps the image sharply focused on the retina.

2. It's handled separately from conjugate movements.

#2 is a big deal. It tells us the z axis is the odd man out: it's never truly integrated with the visual plane. The reason could be a historical accident of evolution, or it could be something required by the computation that we don't know about yet.

This context tells us that if we want to approach human performance we have to keep the neural systems as they are in the human brain. And speaking of which, an HD camera is approximately like a human retina or slightly better, so why not use the full resolution? We just need more precise motors, and it makes our neural network bigger, so we have to make sure we have enough memory.

So now we talk about 3 ways we can make the integrator work: synaptic, somatic, and recurrent. We can just pick one, but a heads-up: we may eventually require the recurrent version to unify the various inputs, because some come from the SC and some from the PPRF, and they don't necessarily use the same neurotransmitters.
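As a preview, here is a minimal sketch of the recurrent option: a leaky unit whose self-excitation is tuned to exactly cancel the membrane leak, so a velocity burst leaves behind a sustained position signal. The parameter values are illustrative, not fitted to brainstem data.

```python
# Sketch of the "recurrent" integrator: recurrent excitation balanced
# against the membrane leak turns velocity bursts into held positions.

def run_integrator(velocity_bursts, leak=0.05, recurrent_gain=0.05, dt=1.0):
    position = 0.0
    trace = []
    for v in velocity_bursts:
        # leak pulls position toward zero; recurrence pushes it back up
        position += dt * (-leak * position + recurrent_gain * position + v)
        trace.append(position)
    return trace

# a 10-step velocity burst followed by silence:
# the position signal should hold, not decay
trace = run_integrator([1.0] * 10 + [0.0] * 50)
```

When recurrent_gain equals leak the unit holds its output indefinitely; when it falls short, the integrator becomes "leaky" and the eye drifts back toward center, which is exactly the clinical signature of a damaged integrator.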
 
Here's the skinny on integrators.

 
You might be wondering why our map signals eye velocity, as distinct from position. This is because of the vestibulo-ocular reflex. The hair cells in the vestibular organs signal angular velocity (semicircular canals) and linear acceleration (otolith organs), not position.

The velocity signals are converted to head position in the superior vestibular nucleus, which then feeds the anterior thalamus and becomes the head direction signal in the hippocampus. At the same time, VN feeds the oculomotor integrators directly, bypassing the spatial mapping in the SC. So the head position signal will simply be subtracted from the coordinates given to the SC. This will be done automatically whenever the VOR is functioning correctly.

Thus, the quickest and easiest way to test the integrator is to make the VOR work. And that is easy: we just get some MEMS gyros, which output angular velocity directly (no separate integration of acceleration needed). We can do three dimensions like the inner ear, that's fine. To tune the integrator we set the gain (weight) so it balances the time constant of the leaky membrane. Once the integrator is tuned we can work backwards and tune the weights between the SC and the integrator so they conform to the VOR. This way, any head movements that occur during eye movements will be automatically and properly corrected for.
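A minimal sketch of that test, assuming the gyro hands us angular-rate samples directly: counter-rotate the eye by the integrated head velocity and check that gaze (head position plus eye position) stays put. The sample values and gains are made up for illustration.

```python
# Sketch of the VOR test: negated gyro velocity drives the eye, and gaze
# in space (head + eye) should remain fixed during a head turn.

def vor_gaze(head_velocities, vor_gain=1.0, dt=0.01):
    head_pos, eye_pos = 0.0, 0.0
    for w in head_velocities:          # gyro angular-rate samples, deg/s
        head_pos += w * dt
        eye_pos += -vor_gain * w * dt  # counter-rotate the eye
    return head_pos + eye_pos          # gaze direction in space

gaze = vor_gaze([20.0] * 100)          # 1 second head turn at 20 deg/s
```

With unity gain the gaze error is zero despite a 20 degree head turn; any residual error measures how far the gain is from correctly balancing the integrator, which is the tuning signal described above.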
 
Good. Here's the initial design.

IMG_20251223_000953938.webp


In the middle you can see the forward path: retina => SC => PPRF => oculomotor neuron

On the right is an expansion of the PPRF (integrator). When the lateral rectus muscle contracts it pulls the eye away from the midline, toward the ear. The excitatory burst neurons (EBNs) cause the ipsilateral lateral rectus to contract, and they also cause the yoked muscle of the other eye, the contralateral medial rectus, to contract by the same amount, keeping the eyes in register. The inhibitory burst neurons (IBNs) handle the antagonists: whenever the lateral rectus contracts, the opposing medial rectus relaxes, and vice versa.

Both EBNs and IBNs are gated by the "omnipause" neurons, which are also called fixation cells. They are tonically active during fixation and fall completely silent during saccades. The burst neurons can fire only when the omnipause neurons are silent.
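The gating logic can be sketched in a few lines. The all-or-none gate and the exact EBN/IBN symmetry below are simplifications for illustration, not a claim about the real circuit's dynamics.

```python
# Sketch of the burst-generator gate: EBNs and IBNs can fire only while
# the omnipause neurons (OPNs) are silent.

def burst_output(saccade_command, opn_active):
    """Return (agonist_drive, antagonist_drive) for a horizontal saccade.
    opn_active=True means we are fixating and the gate is closed."""
    if opn_active:
        return 0.0, 0.0
    ebn = max(saccade_command, 0.0)  # excite the agonist (e.g. lateral rectus)
    ibn = ebn                        # inhibit the antagonist by the same amount
    return ebn, -ibn
```

The hard gate is what makes fixation and saccades mutually exclusive states, which is exactly the state structure shown at the bottom of the drawing.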

Conversely, there are "long lead burst" cells that fire up to 100 msec before a saccade (as distinct from the omnipause cells, whose window is about 1 msec around the actual saccade). These signal "saccade about to happen," and their primary purpose is to negotiate between saccades, smooth pursuit, and fixation so there are no conflicts. The long lead burst cells in the PPRF are driven by neurons in the rostral superior colliculus.


At the bottom of the drawing you can see the states of the system, which are fixation, smooth pursuit, and saccade. Vergence and the VOR overlay these states, potentially affecting all of them.

The next step is to make a table of machine components vs. brain components to make sure we've got everything. Our job is going to be to connect the SC to the PPRF, knowing almost nothing about the neural circuitry of the superior colliculus. We will infer the circuitry from the required behavior, and at the end of it all we will have to match known medical conditions to the network to make sure we get the same output.
 
Learning and Training

Eye movements in human babies take several months to develop. The bad news is the fixation reflex requires V1, so any attempt to self-organize the oculomotor integrators will require a pre-existing mapping from V1 to SC.


The good news is there are muscle spindles in the oculomotor muscles, which means there is proprioception. In humans the palisade endings in the oculomotor muscles send signals to the primary somatosensory cortex S1.


The relationship of the muscles to the muscle spindles looks pretty classical:

1766553016682.webp


From S1 the eye position signals end up in the anterior parietal cortex.


These connections are natural and expected because of the pre-existing pathways through the cerebellum, which handle among other things the smoothing of eye position.

So we have some limited proprioceptive feedback from the muscle spindles, but this is not enough to program the integrator. Behavioral changes after surgical and pharmacological manipulation indicate that the integrator takes about 10 days to adapt. During that time it's doing about 3 eye movements a second for about 50,000 waking seconds a day, so 150,000 movements a day times 10 days is 1.5 million frames, which means the learning rate is very slow.

Before an infant starts saccading and following objects at about 3 months of age, the eye movements are jerky and usually miss their targets. They get better over the next month or so, and by 5 months vergence can track moving objects. So this requirement means we have to align the V1 projection to SC with the retinal projection to SC before we can train the integrator. Which is actually fairly easy, because we have a reference and can therefore generate an error cost. And as already mentioned, this has been done before, so we don't need to reinvent the wheel.

What is important, though, is the precision of the visual map in the superior colliculus. There are 100 million photoreceptors in a retina, and the receptive fields around the fovea get very tiny, less than a minute of arc. It is very doubtful that a single spot of light on a single photoreceptor would be enough to elicit an eye movement, but a spot of light of sufficient size and intensity surely would. How big a spot of light do we need to elicit an eye movement toward its center? Not entirely surprisingly, near the fovea the answer is about 2 * sqrt(10) photoreceptors on a side, which corresponds to the 10:1 loss in precision from V1 to the SC. So when the spot of light hits V1, we're going to get multiple neurons activating, which means multiple targets (they'll resemble overlaid Gaussians), and the center of mass of those is where we want the eye to move.
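The center-of-mass step is simple enough to sketch directly; the coordinates and activation values below are made up for illustration.

```python
# Sketch: several V1/SC neurons respond to one spot of light; the saccade
# target is the activity-weighted center of mass of the activations.

def center_of_mass(points):
    """points: list of (x, y, activation) tuples on the SC map."""
    total = sum(a for _, _, a in points)
    cx = sum(x * a for x, _, a in points) / total
    cy = sum(y * a for _, y, a in points) / total
    return cx, cy

# three overlapping Gaussian-ish activations centered around (5, 5)
target = center_of_mass([(4.0, 5.0, 1.0), (5.0, 5.0, 2.0), (6.0, 5.0, 1.0)])
```

Note that the population readout recovers precision the individual coarse units lack, which is how the SC can target better than its 10:1 map resolution suggests.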

So that's easy, if our maps are aligned. The retinal location will agree with the V1 location and both inputs will synergistically activate the same region of SC. This will cause SC to issue a command to the oculomotor system saying "move here", and then by some TBD magic this will result in a muscle contraction of the correct amplitude and duration.

This is the geometry of the mapping from retina to visual cortex.

1766555522989.webp


This is the geometry of the mapping from retina to SC.

1766555737202.webp


Or rotated to preserve topography,
1766555808247.webp


Pretty obvious, right? The maps align quite well. Now we can figure out the magnification factor at any point in the visual field, and relate that to the force we want on our muscle.
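A common first-order approximation for that magnification factor is M = M0 / (E + E2), in mm of cortex per degree at eccentricity E. The constants below are in the ballpark of published human V1 estimates but are used here only for illustration.

```python
# Sketch: cortical magnification falls off with eccentricity, following
# the standard M = M0 / (E + E2) approximation to the complex-log map.

def magnification(ecc_deg, m0=17.0, e2=0.75):
    """Millimeters of cortex per degree of visual field at a
    given eccentricity (degrees)."""
    return m0 / (ecc_deg + e2)

fovea = magnification(0.0)       # many mm per degree: fine representation
periphery = magnification(40.0)  # fraction of a mm per degree: coarse
```

Relating this curve to muscle force means peripheral targets need large, coarse contractions while foveal corrections need small, precise ones, which matches the saccade-then-fixate behavior we're building.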
 
It's Christmas time, gonna take a break, hang out with the kids for a while. The hardware should arrive while I'm gone, upon return it'll probably take me a week to set up the Raspberry Pi and then we're off to the races.

Hopefully you'll like this project, it'll start slow but in a month it'll do stuff no one else can do (not even Chinese robots). After the vision works I'd like to do approach-avoidance behavior, to make use of the third dimension and prove it works.
 
Sigh. Jet lag.

The entire field of deep learning for vision is barely a decade old. The first widely adopted object detector was the R-CNN network, developed by Ross Girshick and colleagues at UC Berkeley in 2014.

Since then, the industry-standard open source software has become YOLO, invented by Joseph Redmon at the University of Washington in 2016. People use it because it's fast. This is what the Raspberry Pi uses for its demo apps.

You can use YOLO in real time if the camera resolution isn't too fine. Inference time depends on how the training was done, but typically these days we're dealing with 30 or 60 frames per second from the camera, which I will point out is TEN TIMES BETTER than the human brain, which only samples the scene about 3 times per second. The difference is that our brains extract velocity information in hardware, whereas the computer has to figure it out in software. YOLO does just fine with a 320x640 security camera; it starts getting taxed if you try to run full HD at 60 fps.

This is what YOLO does, fundamentally.

1767007013785.webp


It identifies an object, and draws a box around it. The box is called a BOUNDING BOX, it's the smallest possible box inside which the object will fit. An image can contain many objects, therefore many bounding boxes. The objects can also be moving. This is what it looks like when YOLO analyzes people walking:

1767007263982.gif


Us humans can quickly extract moving objects from scenes, but what if I asked you "which one of these people is wearing white sneakers"? What do your eyes do when I ask that? (Other than :rolleyes: lol)

There are two strategies in common use. One is, the eyes saccade rapidly from shoe to shoe. The other is, the eyes hover near the center of mass of the sidewalk and wait for the white shoes. We can use both strategies, sometimes at once.

So this particular task of defining the boundaries of an object can be done with or without any knowledge of what the object is. In the first pic you see a bicycle, and it helps to know a little about what a bicycle looks like when we try to define the boundary between the front wheel and the head of the dog, because if we mistakenly assign any part of the bike frame to the dog's head, then the dog's bounding box changes. But we can also use details from the image itself (shading, specularity, etc.), it just takes longer.

YOLO is a one-shot object detector, just like our brains. YOLO stands for "You Only Look Once". What distinguishes YOLO is that it does the entire regression in a single pass, instead of first analyzing the scene and then identifying the objects. This concept is sure to send traditional computer scientists into a tizzy: "How can you identify an object if you don't know where its boundaries are?" But this is the magic of neural networks. They don't have to know about the logic of objects within scenes; instead they use optimization to arrive at the best joint fit of objects and boundaries. It doesn't take long, just a few milliseconds.

The camera in the second pic is static; it's just a security camera on a sidewalk somewhere. But for this experiment, we're going to tell the AI that one of the people walking on the sidewalk is named Joe, and Joe is wearing white sneakers. We want our AI to follow Joe. In other words, once Joe has been identified, we expect a series of eye movements to the right that follow Joe till he walks out of the image.

In some circles this experiment would be called "stimulus driven pursuit," but we're going to set it up differently. We will use a series of saccades to Joe's predicted next position. So it'll be almost like reading: the AI accepts a frame, it knows how long the frame takes to process and it knows Joe's velocity, so it simply calculates "in the next frame Joe's feet will be at the coordinates (x, y)," and that's where it moves the eyes. Sounds simple enough, right? (grin... :p)
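A sketch of that calculation, using linear extrapolation over one frame interval plus the processing latency. The positions, frame rate, and latency below are made-up numbers for illustration.

```python
# Sketch of the predictive-saccade scheme: given the target's position in
# the last two frames and the known processing latency, saccade to where
# it will be when the next frame lands.

def predict_next_position(prev_xy, curr_xy, frame_dt, latency):
    """Linearly extrapolate target position over one frame
    interval plus the processing latency (seconds)."""
    vx = (curr_xy[0] - prev_xy[0]) / frame_dt
    vy = (curr_xy[1] - prev_xy[1]) / frame_dt
    t = frame_dt + latency
    return (curr_xy[0] + vx * t, curr_xy[1] + vy * t)

# target moved 4 px right between frames at 30 fps; ~33 ms to process
nxt = predict_next_position((100, 200), (104, 200), 1 / 30, 0.033)
```

A constant-velocity extrapolation is the simplest possible predictor; it fails when the target turns or stops, which is exactly when humans emit a catch-up saccade too.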
 