alignment of sensory fields

Oh btw - if you're a careful analyst you will have noticed two important things about the first pic:

1. The dog is occluding the bicycle. Part of the dog is actually inside the bicycle's bounding box.

2. The bicycle's bounding box is incorrect on the left side. Note that the front tire extends well in front of the red boundary. Why do you think that happens?

Look carefully at the left edge of the dog's yellow bounding box. What do you see?

Well gee, that looks like it could be the left edge of the bicycle's bounding box too! Right?

The front wheel of the bike lines up exactly with the back of the dog's body, whereas the red box lines up with the handlebars instead of the wheel. Hm.

There's an additional interesting problem with the bicycle seat, I'll leave it to you to find it.

:)
 
So now we run headlong into the issue of ATTENTION.

If you're looking for white sneakers, your brain will automatically exclude everything else. And, once you've found Joe, you'll ignore everyone else.

This is a part of the robotics we'll save for later: the planning of action. You have accepted the task of finding the white sneakers, so this becomes your goal. You are engaging in "active scanning", which is quite different from passively perusing a scene.

The difference can be quantified in terms of attention, which is a measure (not an observable, but in theory it can be calculated from direct measurements). How much you exclude other information is generally characterized by a "radius of inclusion", and this requires that all the information be mapped into a coordinate system where the more relevant features sit closer together, using the distance metric defined by attention. We want "white" "sneakers": white is close to yellow but far from black, and sneakers are close to tennis shoes but far from wing tips. This is the domain of information geometry, and neural networks are really, really good at changing coordinate systems.
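To make the "distance metric defined by attention" concrete, here's a toy sketch. The three-dimensional feature vectors are pure inventions for illustration - a real system would learn its embedding from data:

```python
import math

# Toy hand-made feature vectors (purely illustrative -- a real system
# would learn these embeddings; the coordinates here are assumptions).
features = {
    "white":  (1.0, 1.0, 1.0),
    "yellow": (1.0, 1.0, 0.2),
    "black":  (0.0, 0.0, 0.0),
}

def distance(a, b):
    """Euclidean distance -- one possible attention metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# In a well-chosen coordinate system, "white" lands closer to "yellow"
# than to "black", so a radius of inclusion around "white" keeps yellow
# things in play and excludes black ones.
d_wy = distance(features["white"], features["yellow"])
d_wb = distance(features["white"], features["black"])
print(d_wy < d_wb)  # True
```

The whole trick is in choosing the coordinates; with a bad embedding the same radius of inclusion excludes the wrong things.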

The other thing is, it's really easy for you to focus on a foot - but how do you tell the computer to do it? You know which part of the body is the foot, you can find it without even thinking. But to a computer, Joe's foot translates into a spot of white just below the leg, that moves in a highly correlated way with the position of the head just above it. Joe's foot always belongs to Joe, with or without occlusions, no matter if someone gets in front of it or it disappears from view for a while.
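Here's a minimal sketch of that idea: a tracked body part that keeps belonging to its owner through occlusions, using nothing but nearest-neighbour association and a remembered last position. The frame coordinates and the 5-pixel gate are made-up numbers:

```python
# Minimal sketch of "Joe's foot always belongs to Joe": nearest-neighbour
# association plus a short-term memory that survives occlusion.
# Frame data and the 5-pixel gate are invented illustrative numbers.

class TrackedPart:
    def __init__(self, x, y, gate=5.0):
        self.x, self.y = x, y      # last known position (short-term memory)
        self.visible = True
        self.gate = gate           # max jump we'll accept per frame

    def update(self, detections):
        """detections: list of (x, y) candidate white blobs in this frame."""
        best = None
        for (dx, dy) in detections:
            dist = ((dx - self.x) ** 2 + (dy - self.y) ** 2) ** 0.5
            if dist <= self.gate and (best is None or dist < best[0]):
                best = (dist, dx, dy)
        if best is not None:
            _, self.x, self.y = best
            self.visible = True
        else:
            self.visible = False   # occluded: hold the remembered position

foot = TrackedPart(10.0, 50.0)
foot.update([(11.0, 50.5), (80.0, 50.0)])  # two blobs; take the near one
foot.update([])                            # someone walks in front
print(foot.x, foot.y, foot.visible)        # 11.0 50.5 False
```

Even this toy version shows the two pieces bitten off in one sentence: segmentation (which blob is the foot) and short-term memory (where it was when it vanished).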

Jeez, that's a mouthful. I just bit off segmentation and short term memory in one easy sentence. "Sounds simple enough, right?" :p
 
As good (semi-decent ha ha) engineers, before starting the project we'll investigate the work that's already been done.

There's a whole ton of hobbyist type stuff on YouTube, I won't list it for you but you can find it by googling "robotic eye movements".

But this right here is the real enchilada, and it's current.


I don't know what NARX is, never heard of it. But whatever it is, I'm sure I have a better one, and if the software is well built we can hook mine in instead.

Great minds think alike. This group is in Europe, and there's a Chinese group working on the same thing. Of course you realize this is all just infrastructure for the real goal. Science needs an experimental testbed to study attention. That's what we're building.
 
Building a system like this gives one a very good idea of how the human brain works, and why it works so well.

Think about the YOLO target identification scenario with the people walking down the street. We're doing eye movements, right? How many targets are there, and where are they located?

If you asked an ordinary PC this question it would take forever - because the computer analyzes each frame offline, after the fact. But in our brains, the camera has a dedicated visual coprocessor sitting right behind each eye - in hardware. Nothing is faster; there's no delay time. An ordinary GPU can do convolutions on an HD frame in about ten milliseconds, but our brains do it in under a millisecond, and all kinds of other information is extracted at the same time - in hardware.

It's a distributed architecture. We used to do this type of stuff when I worked on the Beowulf cluster. Any single machine wasn't fast enough to do the whole task, but fortunately the task could be broken down into sub-tasks which could be completed independently and then the results combined at the end.

Our brains handle vision exactly the same way. One area does color. Another does velocity. A third takes care of targets, it doesn't care about target properties, just where they are and how many of them there are. This last bit is used not only for eye movements, but also for mapping scenes into short term memory. Everything is topographic, all automatically aligned with the retinal map by connections that self organized shortly after birth. It's only rarely that someone needs the information about "which" target is at a particular location (the white sneakers example, where I specifically asked you otherwise you wouldn't have cared), and when the correlation has to be made, it's as simple as working backwards through the connection stream in the neural network.

At some point we'll spend a post talking about signals and coding and what the radio people call "modulation" - its relationship to information theory and how much information you can cram into a wire. The point being that algebra is still necessary: the brain has to be able to say "the object at this location was X and it had properties Y".

Which works beautifully for us, because this is nothing more than simple associative memory. Neuroscientists worry about how it happens with just one presentation; so far the machine learning types mostly don't care. But you can easily see how important this issue is in the context of attention, because these types of correlations are the lifeblood of all our higher processes. Our brains sample the visual field and build an internal model, then manipulate the world by manipulating the model - and the hope is that understanding this mirroring will help us better understand ourselves.
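A bare-bones sketch of one-shot associative memory: a single Hebbian pass binds a location cue to an object code, and the object comes back when cued with the location. The +/-1 patterns are arbitrary toys, nothing biological:

```python
# One-shot associative memory: one Hebbian presentation binds a location
# cue to an object code; the object is recovered from the cue alone.
# The +/-1 codes below are invented toy patterns.

def store(pairs, n_key, n_val):
    """Hebbian outer-product store: a single pass over the data."""
    W = [[0.0] * n_key for _ in range(n_val)]
    for key, val in pairs:
        for i in range(n_val):
            for j in range(n_key):
                W[i][j] += val[i] * key[j]
    return W

def recall(W, key):
    """Matrix-vector product followed by a sign threshold."""
    return [1 if sum(w * k for w, k in zip(row, key)) >= 0 else -1
            for row in W]

loc_A = [1, -1, 1, -1]    # "the object at this location..."
obj_X = [1, 1, -1]        # "...was X"
loc_B = [-1, 1, 1, 1]
obj_Y = [-1, 1, 1]

W = store([(loc_A, obj_X), (loc_B, obj_Y)], 4, 3)
print(recall(W, loc_A))   # [1, 1, -1] -- obj_X comes back
```

One presentation is all it takes; the worry is how the biology pulls that off with noisy spikes, not whether the algebra works.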

btw, for neuroscience junkies, the brain area at issue is the anterior cingulate cortex, and its hippocampal cousin in the dorsolateral prefrontal cortex. The ACC is part of the Circuit of Papez, which actually has very little to do with "affect" except insofar as it assigns values to objects. The mammillary bodies process head direction, and the habenula feeds the vestibular system. This circuit is for scene mapping, not emotions. The purpose of the hippocampus is encoding all the scene details in a way that can subsequently be retrieved by associative search. The DLPFC provides context (from memory) about objects in the current scene, while the ACC takes snapshots of the current scene relative to head position and eye position and maps them into a coherent "episode" with a story line and a set. The ACC and the DLPFC talk to each other, they're part of the "dorsal attention network".

We learn this simply by studying eye movements. The next step is to prove that these simple principles work by using neural network hardware to tackle real world problems. Like, reading. Or, if you're DOD, guiding a personalized missile up your enemy's ass even when he's in the middle of a moving crowd.
 
To begin this next section, we'll ask how the brain transforms scene maps into long term memory. By now we should have a pretty good understanding of how scene maps are formed. But short term memory is very specific: damage to a single brain area (the hippocampus) destroys it completely, whereas there is no single area for long term memory. Long term memory is all over the place; it's kind of holographic. If you chop out pieces of the brain you don't lose your memories, they just get fuzzy, with a little less detail.

The biological context around this is essential, and by reading through this link you can quickly pick up a lot of the key concepts. This paper talks about the inferior temporal cortex, which handles object recognition.


One of the principles in play is that brain areas often arise by duplication of other brain areas. The discussion about the retinotopic maps in the monkey brain is excellent. There are about 20 of them altogether, and if you look at them the right way they look like a big convolutional neural network from the machine learning textbooks. (I can find a complete map of all 20 areas if you wish; monkey brains are basically very similar to human brains, so you can get a feel for what ours looks like.)

Another principle is the interplay between top-down and bottom-up information flow, which is what we've been talking about in terms of "predictive coding". The biologists started out thinking the brain was programmed genetically, but now we know it's all about the data. Your brain is different from mine because you've had different experiences, it's that simple.

The infrastructure is the part that's programmed genetically. When we talk about memory consolidation, that is a top down process that depends on special biochemistry (the neuroscientists call it long term plasticity), involving a plethora of calcium ion channels with differing time courses, some voltage dependent and some not. But "attention" is before all that, because nothing irrelevant makes it into long term memory.

So far, the most popular theory of consolidation is "hippocampal replay" during sleep and quiet wakefulness. But the evidence is scant - replay definitely occurs, but its relationship to consolidation is a SWAG.

What helps, in terms of understanding the constraints, is to consider what happens "after" the scene map. Once you have the scene, what do you do? Well, you navigate... and you manipulate. Let's say you take an action on an object to bring about a result - say, the monkey pushes a button to get a reward. So that's our action: we're going to push a button in our scene somewhere. The button is the object that's been identified, but can we apply the same action to some other object? Sure! We can push the door knob, we can push the cage, we can push anything. But the button is the object associated with the reward. So we apply "this" action to "that" object. And of course the object could be anywhere in the visual field (or not in the visual field at all); it's the object we care about, and we only need its specific location when we push it.

So the semantic aspects of object recognition guide attention. If we can't find the object in the visual field, maybe we go looking for it. Maybe we look around the room. Maybe we get up and look in the kitchen or the bedroom. Maybe we even go foraging, go to the store or something. Interestingly enough, the same strategies that govern search in the visual field, also govern search in the scene. (Remember, the scene is the entire 3d map of our surroundings with ourselves in the middle of it, whereas the visual field is just the snapshot we happen to be looking at right now given our location and body/head orientation and eye position).

In addition to top down goal driven attention, there is also "exploratory attention". That's like when you scan an image with no particular goal. Maybe you go to the museum and see a piece of art and say "hm, that's interesting" without knowing who created it, so you don't know the idiosyncrasies to look for. You just kinda scan it, and maybe you notice the brush strokes (especially if you're an artist) or maybe you don't. If someone asks you half an hour later to describe it, you can do pretty well. But if you wait two days, a lot of the detail is lost - long term memory only stores the essentials.
 
It turns out NARX is exactly what I was talking about. It's essentially a Kalman-style adaptive filter, built with a recurrent neural network as the adaptive element. (NARX stands for Nonlinear AutoRegressive model with eXogenous inputs.)
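For the flavor of it, here's a stripped-down linear NARX-style predictor in plain Python: past outputs plus an exogenous input, with the weights adapted online by LMS. A real NARX replaces the linear combiner with a neural network, and the "plant" below is an invented toy system:

```python
import math

# Stripped-down NARX-style predictor: y[t] is predicted from the two
# previous outputs and the current exogenous input, with weights adapted
# online by LMS. A real NARX uses a neural network instead of this
# linear combiner; the "plant" being tracked is an invented toy system.

w = [0.0, 0.0, 0.0]          # weights for y[t-1], y[t-2], u[t]
lr = 0.05                    # LMS learning rate
y_hist = [0.0, 0.0]
errs = []

for t in range(2000):
    u = math.sin(0.1 * t)                                   # input
    y_true = 0.6 * y_hist[-1] - 0.2 * y_hist[-2] + 0.8 * u  # toy plant
    x = [y_hist[-1], y_hist[-2], u]
    y_pred = sum(wi * xi for wi, xi in zip(w, x))
    e = y_true - y_pred
    w = [wi + lr * e * xi for wi, xi in zip(w, x)]          # LMS step
    y_hist.append(y_true)
    errs.append(abs(e))

print(sum(errs[-100:]) / 100)   # mean prediction error after adaptation
```

The adaptive element chases the plant online, one sample at a time - the same shape of problem the Kalman filter solves, with the network standing in for the model.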

So, this looks do-able. There's one topic we haven't talked about yet that we should pay attention to: the interplay between saliency and reward. It's possible this effort will require a dopamine system, and the neuroscience is suggestive - our attention center in the ACC is tightly coupled with the nucleus accumbens.

All it is, is you're choosing which object or feature to pay attention to. The ACC is like a central switchboard, a crossbar that links features with their values. If you're scanning an image, your eyes tend to move to the features with the highest value (or "most important").
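In pseudocode-simple terms, the crossbar amounts to a winner-take-all over values. The feature names, locations, and value numbers below are invented for illustration:

```python
# Sketch of the "crossbar" idea: each detected feature carries a value,
# and gaze goes to the highest-valued location. All entries are invented
# illustrative data.

features = [
    {"name": "white sneakers", "loc": (120, 340), "value": 0.9},
    {"name": "red car",        "loc": (400, 200), "value": 0.4},
    {"name": "pigeon",         "loc": (250, 500), "value": 0.1},
]

def next_fixation(feats):
    """Winner-take-all over value: return the location to look at next."""
    best = max(feats, key=lambda f: f["value"])
    return best["name"], best["loc"]

print(next_fixation(features))  # ('white sneakers', (120, 340))
```

Swap the value column for a dopamine-driven estimate and you have the saliency/reward interplay in miniature.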

The only other thing we need is a decision as to what constitutes the end of a scene, so the scene can be consolidated. Typically this could take several forms - maybe goal reached and goal abandoned would be the two most common. "Interrupted" is probably common too.

The hardware is supposed to arrive next Monday. I have to make some space for this project, not sure how that's going to work but...

Attention is infrastructure. It's neither sensory nor motor; it's something in between. From a computing standpoint, it's obvious that a buffer is needed to hold all the context that belongs to a scene. When the scene is complete the buffer has to be flushed, otherwise scenes cannot be distinguished.
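A scene buffer in miniature might look like this: context accumulates until a scene-ending event (goal reached, goal abandoned, interrupted), then gets flushed to long-term storage. The event names and the storage list are illustrative assumptions, not a real design:

```python
# Miniature scene buffer: context accumulates until a scene-ending
# event, then the buffer flushes to "long-term" storage so scenes stay
# distinguishable. Event names and storage are illustrative assumptions.

class SceneBuffer:
    END_EVENTS = {"goal_reached", "goal_abandoned", "interrupted"}

    def __init__(self):
        self.current = []        # context for the scene in progress
        self.consolidated = []   # stand-in for long-term memory

    def observe(self, item):
        self.current.append(item)

    def event(self, name):
        if name in self.END_EVENTS:
            self.consolidated.append((name, self.current))
            self.current = []    # flush, or scenes blur together

buf = SceneBuffer()
buf.observe("white sneakers at (120, 340)")
buf.observe("joe heading north")
buf.event("goal_reached")        # scene over: flush
buf.observe("new scene begins")
print(len(buf.consolidated), len(buf.current))  # 1 1
```

The only design decision of substance is the END_EVENTS set - which is exactly the "what constitutes the end of a scene" question.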

Brain anatomy says we can use up to 20 convolutional layers between the retina and the cognitive map. In those 20 layers, we have to get from point-like retinal receptive fields to the spatial grid in the hippocampus.

My project will use a Lego cart (because all the right motors are conveniently available), that will be my organism. I want my organism to become aware of where it is, not by asking GPS but by doing it the natural way, by looking around and figuring it out. (Waymo can do this, but it asks GPS).
 
We're going to align the sensory fields for 20 topographic convolutional layers, using nothing but self organization and simple gradients.
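As a toy demonstration that self organization plus a simple gradient can produce an aligned topographic map, here's a sketch where noisy gradient-assigned positions get smoothed into retinotopic order by purely local interactions. Unit count, noise level, and rates are all invented numbers:

```python
import random

# Toy self-organization sketch: downstream units start with noisy
# "preferred retinal positions" (a crude chemical gradient), and local
# smoothing (standing in for correlated activity between neighbours)
# pulls the map into retinotopic order. All parameters are invented.

random.seed(1)
n = 20
# Gradient gives each unit a rough position; noise scrambles local order.
pref = [i / (n - 1) + random.uniform(-0.15, 0.15) for i in range(n)]

for _ in range(500):
    new = pref[:]
    for i in range(1, n - 1):
        # Drift toward the neighbours' mean; the endpoints stay anchored.
        new[i] = 0.5 * pref[i] + 0.25 * (pref[i - 1] + pref[i + 1])
    pref = new

ordered = all(a <= b for a, b in zip(pref, pref[1:]))
print(ordered)   # the map has sorted itself into retinotopic order
```

Nothing global ever looks at the map; order emerges from the gradient plus local agreement, which is the whole point.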

Note that the oculomotor map is also a sensory map in this context. The difference is which of these maps are inside feedback loops, and which aren't. If they're inside feedback loops they'll self organize without any additional instructions from the experimenter. If they're not inside feedback loops they might need an external gradient or something.

While the hardware is being built, I'll emulate this whole thing in software. It's really easy. It can be done in a few lines of code with PyTorch or TensorFlow.

Here's the rub from my perspective. PyTorch and TensorFlow only do algebra, and all the algebra involves synapses. They don't do dynamics, and they also don't do gap junctions and astrocytes. No Runge-Kutta, nothing like that. The back propagation algorithm won't allow that kind of model. It only allows "statistical" correlations, so for example it works with spiking neurons but it uses a Poisson model which involves a time-averaged "rate".

Where this matters is as follows:

In the machine learning version of this architecture, all 20 convolutional layers are in series. They have to be. You can't say "update layer 15 at the same time layer 14 is sending direct inputs to layer 16". If you try to do this, the credit assignment gets all mixed up.

Whereas, in the real biological version, there are lots of parallel pathways. There's a series path, V1 => V2 => V3 => V4, but there's also V1 => V3 and V1 => V4. So we will have to do this differently, we can't use ordinary back propagation.

Instead, we will use the predictive coding algorithm from Rao & Ballard, which doesn't care about synchronicity (that is to say, errors are handled locally instead of globally). If we code up our network the right way, this will allow us to install dynamics "later", and the difference in performance for a megapixel array is negligible.
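Here's the core of the Rao & Ballard scheme in miniature: a higher layer holds a representation, predicts the lower layer's activity, and only the local prediction error drives the update - no global backprop pass. The 2x2 weights and the input are toy values:

```python
# Minimal Rao & Ballard-style predictive coding step: a higher layer
# holds a representation r, predicts the lower layer's activity through
# generative weights W, and settles by descending on the LOCAL
# prediction error. The weights and input are toy values.

W = [[1.0, 0.0],
     [0.5, 1.0]]        # generative weights: prediction = W r

def predict(r):
    return [sum(wij * rj for wij, rj in zip(row, r)) for row in W]

def infer(x, steps=200, lr=0.1):
    """Settle r by gradient descent on the local prediction error."""
    r = [0.0, 0.0]
    for _ in range(steps):
        e = [xi - pi for xi, pi in zip(x, predict(r))]   # local error
        # r[j] += lr * sum_i W[i][j] * e[i] -- no global backward pass
        r = [rj + lr * sum(W[i][j] * e[i] for i in range(2))
             for j, rj in enumerate(r)]
    return r

x = [1.0, 1.5]           # "sensory" input from the layer below
r = infer(x)
e = [xi - pi for xi, pi in zip(x, predict(r))]
print(max(abs(v) for v in e))   # residual prediction error, near zero
```

Because each error term lives between adjacent layers, nothing cares whether layer 14 and layer 15 update simultaneously - which is what buys us the parallel V1 => V3 and V1 => V4 pathways.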

The machine learning model also cannot easily handle multiple synapses between the same two neurons. In the cerebral cortex there can be up to 1000 synapses between the same two pyramidal cells. Once again the credit assignment problem becomes difficult unless all the synapses are scaled equally - but the whole reason for having them in the first place is to be able to scale them unequally!

Anyway... these issues will go on the back burner for now, however we'll keep them in mind so as not to preclude later solutions. For training we can use any of the NIST stereo image datasets, which can be found here:


We'll train in the same order as the brain. Retina to V1 comes first, followed by retina to SC and then the oculomotor plant (because we want the plant inside the feedback loop for the fixation reflex).
 
Thinking of ways of testing this to make sure it actually works.

To test, we need an "eye movement monitoring system". This set of glasses is one of the better ones I found - it has built-in gyros and accelerometers for faithful reading of head position, and it has a screen that shows you what you're actually looking at.


This device has a spatial resolution of about 0.6 degrees. In the lab we can do much better. Scleral search coils with a quadrature drive will provide a resolution of about 0.03 degrees. I built one of these for my work with PML. Two square wave signals 90 degrees apart running through a 100 watt per channel stereo amp, driving four large coils wound on bicycle rims.
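For the curious, the quadrature demodulation works out in a few lines: with two field coils driven 90 degrees apart, the eye coil sees v(t) = k*sin(wt + theta), so mixing with sine and cosine references and averaging over whole carrier cycles recovers theta. The carrier frequency, gain, and sample counts below are made-up values:

```python
import math

# Sketch of quadrature demodulation for a scleral search coil: two field
# coils driven 90 degrees apart induce v(t) = k*sin(w*t + theta) in the
# eye coil, so mixing with sin/cos references and averaging over whole
# carrier cycles recovers theta. Carrier, gain, and sample counts are
# made-up values.

w = 2 * math.pi * 1000.0      # 1 kHz carrier
k = 0.7                       # coil gain
theta = math.radians(12.5)    # true eye angle
dt = 1e-6
N = 10000                     # exactly 10 carrier cycles

I = Q = 0.0
for n in range(N):
    t = n * dt
    v = k * math.sin(w * t + theta)      # induced coil voltage
    I += v * math.sin(w * t) * dt        # in-phase mix
    Q += v * math.cos(w * t) * dt        # quadrature mix

# Over whole cycles the mixes average to (k/2)cos(theta), (k/2)sin(theta).
est = math.degrees(math.atan2(Q, I))
print(round(est, 3))   # ~12.5
```

The angle falls out of atan2, independent of the gain k - which is why the coil method is so precise.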

The glasses above will give you eye position 200 times a second, and the advantage is they're external. Otherwise we could just read the motor position directly from the motor. But it's nice to have additional independent verification. The same is true for head position, we can duplicate all that for the VOR.

Okay well, I made it as far as Flagstaff, I'll be back in LA tomorrow and the box of hardware should be waiting at the front door. The glasses are 6 grand, so I'm not going to spring for that just yet, we'll use the direct read till things start working.
 