riddle me this, Batman

scruffy

Why does the brain require separate systems for first derivative information?

For example, in the retina, there are X cells and Y cells. The X cells are static, they fire continuously whenever there's light anywhere in the receptive field.

But Y cells process motion, they only respond to moving lights (not stationary lights), and lights that turn on or off (suddenly appear or disappear).

Why? Why can't the brain just extract the first derivative from the X cells?

The separation of first derivative information is ubiquitous, it occurs in every sensory system. Why?

Those of you into differential geometry probably know the answer. If you understand relativity you probably know the answer too.

Why are separate Nvidia GPUs required for the first derivative information?

Remember, there are two kinds of first derivative: time, and space. "Onset" is time, whereas "mesh size" (spatial frequency) is space. In the retina, the receptive field surround is always opposite to its center, for instance you have on-center/off-surround and vice versa. So the biggest response in an on-center cell happens when the light turns on just at the boundary of the receptive field center.

[attached image]
 

You knew I was going to give you the answer, right? :p

The answer is, the first derivatives are the BASIS VECTORS for the tangent plane TpM, without which distances and angles can not be measured, and projection and parallel transport can not take place.
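
To make the tangent-plane statement concrete, here is a minimal numerical sketch (my own illustration, with an invented example surface - not a claim about what neurons literally compute): the two first derivatives of a parametrized surface are the basis vectors of TpM, and their dot products give the metric tensor, which is what lets you measure lengths and angles in the tangent plane.

import numpy as np

# Example surface: a unit sphere, r(u, v) = (sin u cos v, sin u sin v, cos u)
def r(u, v):
    return np.array([np.sin(u) * np.cos(v),
                     np.sin(u) * np.sin(v),
                     np.cos(u)])

def tangent_basis(u, v, h=1e-6):
    # The first derivatives ARE the basis vectors of the tangent plane TpM
    e_u = (r(u + h, v) - r(u - h, v)) / (2 * h)
    e_v = (r(u, v + h) - r(u, v - h)) / (2 * h)
    return e_u, e_v

u, v = 1.0, 0.5
e_u, e_v = tangent_basis(u, v)

# Metric tensor g_ij = e_i . e_j: nothing but dot products of the basis vectors
g = np.array([[e_u @ e_u, e_u @ e_v],
              [e_v @ e_u, e_v @ e_v]])

# With g we can measure the length of a tangent vector w = a*e_u + b*e_v,
# and the angle between the two basis directions
a, b = 0.2, -0.1
w = np.array([a, b])
length = np.sqrt(w @ g @ w)
angle = np.degrees(np.arccos((e_u @ e_v) / np.sqrt((e_u @ e_u) * (e_v @ e_v))))
print(g, length, angle)

Without the derivatives there is no g, and without g there are no distances or angles - which is the point being made above.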

The visual system for example, is all about geometry. It needs to abstract "surfaces" from visual input. For example, reaching for an object hidden behind other objects. To do this it needs to generate a model of the outside world, in an ALLOCENTRIC reference frame, which is different from the egocentric reference frame defined by eye position and head direction.

The key point to understand is that the ALLOCENTRIC reference frame has to have an origin. (Otherwise it's affine, which severely restricts the types of geometry that can be performed). And, the origin has to coincide with the egocentric origin, which in the case of the visual system is somewhere behind the eyes.

The act of "reaching" occurs in the same coordinate system as the ALLOCENTRIC visual reference frame - then and only then is it subsequently re-translated back into a (different) egocentric reference frame based on body position.

To reach and grasp a hidden object, we have to abstract the SURFACE of the hidden object. Which requires differential geometry - because changing the position or orientation of the object is equivalent to parallel transport.

If the hidden object has an important feature (for example a handle on a coffee mug), "exploration" will typically occur after the reach but before the grasp. This is to exactly align the allocentric coordinates with the egocentric frame used in the grasp. Which tells us that this alignment is only "approximate" in our consciousness, it is a dynamic abstraction and the only way it occurs is by registering surfaces in an allocentric frame.
 
Evidence:

Functional MRI in humans, as well as single cell and evoked potential recording from monkeys and cats, shows that the transient response precedes the sustained response in visual cortex, areas V1, V2 and V3.


It is difficult to experimentally separate surface information from other types of visual information (contours, edges, etc). One of the clever methods involves visual illusions. Another involves the filling in of visual information in the blind spot. Yet another involves the sudden appearance of a surface within a moving point cloud.

In humans, inferior temporal cortex (the "what" pathway) is heavily involved in surface processing.
 
The story of visual surface identification begins in the retina.

Retinal cells can be considered to perform "local wavelet transforms" on selected aspects of the visual input. In humans the static processing is mostly in the midget cells of the fovea, which is essential for reading and face recognition. Primates are the only mammals with a fovea. 80% of foveal ganglion cells are midget ON or OFF cells.

Two new experimental techniques have elaborated retinal structure in detail. One is the in vitro growth of "organoids" from stem cells. The organoids grow into actual retinas which are functionally and genetically identical to real human retinas, allowing experimentation in the lab that can not be done in vivo.


The other is RNA mapping, which has identified 58 distinct cell types processing visual information in the human retina.


The 60-ish cell types are organized into 10 distinct functional and connective layers in the human retina.


Broadly there are 3 cell layers and 2 synapse layers. The outer and inner cell layers consist of photoreceptors and ganglion cells respectively. The middle layer has horizontal, bipolar, and amacrine cells and does most of the processing. Bipolar cells connect to one or a few photoreceptors, and feed the ganglion cells that transmit signals through the optic nerve. They handle the CENTER of the receptive fields, which can be either on-center or off-center. Horizontal cells handle the SURROUNDS of the receptive fields, which have the opposite polarity from the centers.

Combining the center and surround creates a "Mexican hat" shape which is a primitive wavelet. Mathematically, wavelets are different from the sines and cosines underlying Fourier transforms, because they have finite integrals, and thus preserve both frequency and time information when used in signal transforms.
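
To illustrate the center-surround wavelet idea numerically (the sigma values below are arbitrary illustration numbers, not measured receptive field sizes): subtracting a broad surround Gaussian from a narrow center Gaussian gives the "Mexican hat" profile, and because center and surround are balanced, the kernel integrates to roughly zero over a finite window - unlike a sine or cosine, which never dies out.

import numpy as np

x = np.linspace(-5, 5, 1001)

def gaussian(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Narrow excitatory center minus broad inhibitory surround = "Mexican hat"
center   = gaussian(x, sigma=0.5)
surround = gaussian(x, sigma=1.5)
mexican_hat = center - surround

dx = x[1] - x[0]
print("net integral of the kernel :", np.trapz(mexican_hat, dx=dx))          # ~0 (balanced center/surround)
print("integral of its magnitude  :", np.trapz(np.abs(mexican_hat), dx=dx))  # finite and localized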

The "periphery" of the human retina is defined as a distance > 5mm from the fovea. It has low spatial acuity and handles mostly motion information.

In the fovea, the signal generated by a wavelength-specific cone photoreceptor is split into 12 distinct components which ultimately end up in the neural spike trains of the retinal ganglion cells that feed the optic nerve. Each component is carried by a specific type of bipolar cell. The numbers of the different cell types are well known and have been thoroughly catalogued in dozens of species.

The outputs of bipolar cells are organized into 10 distinct sublayers in the retina's inner plexiform layer, which are then sampled by the ganglion cells. This is what the stratification looks like:

[attached image]


There are about 20 different types of retinal ganglion cells, each of which samples this stratification differently. The vast majority project to the lateral geniculate nucleus of the thalamus, with a small fraction projecting to the superior colliculus in the midbrain (which targets eye movements). After the thalamus, visual information goes to the cerebral cortex. The cells in the LGN mostly preserve the receptive field structure of the retinal ganglion cells, except that "tails" are added to the (circularly symmetric) wavelets. In the cortex however, these change into an array of oriented lines, with a full spectrum of angles covering each point in visual space. The construction is like a fiber bundle, with the visual field being the base space and the line orientation being the fiber. There are about 1 million axons in each optic nerve, however there are about a billion neurons in the primary visual cortex.
 
Evidence:

[attached image]


The first derivative information ("M") precedes the static information ("P") by a full 20 msec.

Next we'll talk about how the visual cortex processes surfaces - not in space, not in time, but in SPACETIME.

An interesting thing happens to the polar-coordinate retinal image when it hits the visual cortex: it becomes Cartesian.

[attached image]


[attached image]


This is accomplished by way of a complex log mapping that takes place in the blue pathway above.
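
A minimal sketch of the complex-log idea (a standard way of modeling the retino-cortical mapping; the eccentricities and angles below are invented sample points): writing retinal positions as complex numbers z = r·e^(i·theta), the map w = log(z) = log(r) + i·theta turns rings of constant eccentricity into vertical lines and radial directions into horizontal lines - in other words, polar becomes Cartesian.

import numpy as np

# Retinal positions in polar form: eccentricity r and angle theta (sample values)
r = np.array([0.5, 1.0, 2.0, 4.0, 8.0])          # one ring per eccentricity
theta = np.deg2rad(np.arange(0, 181, 45))         # a few radial directions

R, T = np.meshgrid(r, theta)
z = R * np.exp(1j * T)      # retinal position as a complex number
w = np.log(z)               # the complex-log map

# In w, each ring becomes a line of constant real part (= log r),
# and each radial direction becomes a line of constant imaginary part (= theta).
print(np.round(w.real, 3))  # constant down each column -> log r
print(np.round(w.imag, 3))  # constant along each row   -> theta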

The actual V1 visual cortex looks like this:

[attached image]


[attached image]


So, the combination of a change of coordinates, coupled with the early arrival of the tangent plane basis vectors, allows the visual cortex to compute surfaces using simple matrix math. (Tensors, as used in the tensor calculus for differential geometry).

The visual cortex generates multiple vector fields for each point in visual space: intensity, contrast, color, motion, and ocular dominance (depth, for stereoscopic vision to reconstruct 3D images). This information is combined in different ways for specific purposes. The "what" pathway leading to the temporal lobe handles object reconstruction for memory searches. The "where" pathway through the parietal lobe handles precise positioning for tracking and grasping under changing/moving visual conditions. Both pathways feed into episodic memory, although much of the "where" information is deliberately forgotten unless the navigation is somehow important (like, related to reward).
 
So the first step in analyzing the visual field is, we want to know where the objects are. Objects are surfaces. So, we build a map of the surfaces. Let's call this map S. For each eye, we have S(x,y), which when the two eyes are put together by stereoscopic vision, becomes S(x,y,z).

Obviously, objects move. They don't stay in one place. So we are interested in determining the boundaries of objects, which we can call "edges". And it turns out, the transient channel in our retina handles edges. Specifically, moving edges. The Y cells (which are also called M cells in primates, M stands for magnocellular because the neurons are big, and heavily myelinated so the signals arrive faster) process moving edges. They are transient in time, and they respond to motion. We can call this "first derivative information", and we have partial derivatives with respect to time and space.

How this actually works is, when an edge enters the OFF-surround of a Y cell, it emits a big negative pulse, and when it enters the ON-center, it emits a big positive pulse. Both pulses decay rapidly, in a matter of milliseconds. So if a Y cell emits a big negative pulse followed by a big positive pulse, it means a moving edge has entered the receptive field from the periphery. As the edge of an object sweeps across the retina, all the Y cells it encounters will behave in this manner. So it is fairly easy for the visual cortex to determine the presence of an edge, as well as its direction and speed. All it has to do is track the big transient negative and positive pulses in the Y channel.
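
Here is a toy version of that transient behavior (the time constants and kernel shape are invented for illustration, not fitted to real Y cells): convolving a step of light with a biphasic, zero-mean temporal kernel produces a brief positive pulse at onset and a brief negative pulse at offset, and essentially nothing in between - a first derivative in time.

import numpy as np

dt = 0.001                          # 1 ms steps
t = np.arange(0, 0.3, dt)

# Stimulus: light turns on at 100 ms and off at 200 ms
stim = ((t >= 0.1) & (t < 0.2)).astype(float)

# Biphasic temporal kernel: fast excitation minus slower inhibition (made-up time constants)
tk = np.arange(0, 0.05, dt)
kernel = np.exp(-tk / 0.005) - 0.5 * np.exp(-tk / 0.010)
kernel -= kernel.mean()             # zero net area: no response to a steady light level

response = np.convolve(stim, kernel, mode="full")[:len(t)] * dt
print("transient peak near onset   :", t[np.argmax(response)])   # just after 0.10 s
print("transient trough near offset:", t[np.argmin(response)])   # just after 0.20 s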

The objects in the visual field are large compared to the size of a retinal neuron's receptive field. Each eye handles some 150 degrees of visual field, but there are only about a million retinal ganglion cells. So the cortex sees "line segments" of edges, the Y cells fire in rapid sequence as the edge moves, and the nearest-neighbor Y cells can track and reconstruct the orientation and direction of moving edges.

There is only a very limited set of transformations needed to determine which edges move together. As an "object" moves, so do its edges. The visual cortex is much faster than the motion of most objects in the visual field. And before we talk about surface reconstruction, let's talk about that for a second.

There are two main classes of "interruptions" in the visual signal. The first is eye blinks, and the second is eye movements. Eye blinks are easy, they're handled by a reflex loop that begins in the cornea, travels through the trigeminal nerve to the trigeminal nucleus in the pons, and from there it goes to motor neurons in the facial nucleus that contract the orbicularis oculi muscle which lowers the upper eyelid. An eye blink lasts at least 100 msec (a long time!). When the eye blinks, the visual signal is temporarily disrupted. But we don't really notice the disruption, because collateral fibers from the trigeminal nucleus enter the thalamus and temporarily inhibit the visual signal. The important point is, when the eyes open again, everything is in a different place. The world has changed while our eyes were shut. The edges we were talking about, have moved. Not very far, because eye blinks are quick compared to object motion. But the visual signal is "discontinuous" during the blink.

The second type of interruption, is eye movements. In addition to the two best known classes of eye movements (called saccades and smooth pursuit), there are micro-tremors that move the eye from side to side as fast as 200 Hz. Each of these tremors causes the visual field to move "slightly", whereas a saccade causes it to move "bigly". It turns out the visual signal is also inhibited during saccades (but not smooth pursuit or tremors). The target of a saccade is determined in the superior colliculus, which receives input from the frontal eye fields for voluntary eye movements. Usually a voluntary saccade is a command from the frontal lobe to foveate a particular target. But there are also involuntary saccades, which can be caused by something as simple as the sudden appearance of a bright light, that don't require any input from the frontal lobe. In either case the visual signal is temporarily inhibited by collateral fibers from the superior colliculus to the thalamus. The point once again is, after any kind of eye movements, the visual field has changed.

If the visual field changes "enough", the brain does a reset, which may be accompanied by a P300 and some subsequent exploratory eye movements. But a lot of the time a reset is not necessary, the brain aligns the new image with the old image and picks up where it left off.

And this is a very important thing to understand. The thalamus is NOT simply a relay station. It calculates the difference between a prediction of the visual field, and the actual visual field. The prediction is driven by neurons in layer 6 of the cerebral cortex, that feed back into the thalamus. So you have a feedforward pathway that goes from retina to thalamus to cortex, and a feedback pathway that goes from cortex to thalamus. Both pathways operate at the alpha frequency, which is about 10 Hz. So basically the visual cortex is performing a new analysis of the visual field every 100 msec or so, interrupted by eye blinks and eye movements. It takes about 50 msec for a visual signal to get from the retina to the visual cortex. So what happens during the other 50 msec?

I'm glad you asked, because it takes us right back to the spiking behavior of the Y cells in the retina. The instant the eyes open after an eye blink, or the instant an eye movement lands, there is a burst of spiking from the Y cells (because, remember, they're transient). The Y cells are saying "the edges are HERE" (map), and this is the very first information that reaches the visual cortex. It is short lived, only 5 msec or so (because the retinal Y cells adapt almost instantly). Subsequently, the information about EDGE MOTION reaches the brain - so, having determined initial position we're now calculating direction and speed. As we'll see shortly, this is exactly the information needed to identify objects and map their surfaces.
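
Going back to the prediction-and-difference idea two paragraphs up, here is a toy version of that loop (entirely schematic; the 10 Hz frame rate echoes the alpha figure above, but the gain and noise values are invented): the "cortex" holds a prediction, the "thalamus" forwards only the difference between the prediction and each new frame, and the prediction is corrected from that error.

import numpy as np

rng = np.random.default_rng(0)

n_frames = 20                  # one frame per ~100 ms (alpha rhythm)
world = np.zeros(n_frames)
world[8:] = 1.0                # the scene changes at frame 8 (say, after a blink)

prediction = 0.0
gain = 0.6                     # how strongly each error corrects the prediction (invented)
for k in range(n_frames):
    sensory = world[k] + 0.05 * rng.standard_normal()
    error = sensory - prediction     # what the "thalamus" passes upward
    prediction += gain * error       # layer-6 style feedback updates the model
    print(f"frame {k:2d}  error {error:+.3f}  prediction {prediction:.3f}")

The error is large only when the world actually changes; the rest of the time the forwarded difference hovers near zero.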

We're not really interested in contours in the visual field per se; we're more interested in the contours of OBJECTS in the visual field.

Note in passing that it's not "just" eye position that changes where objects are, it's also head position and the movement of the organism itself. As a general rule "everything" in the visual field moves. Even static objects like walls move, their boundaries move because of eye movements, head movements, and the motion of the organism. The job of the visual system is to build maps of contour and shading for OBJECTS. Which is fairly easy once objects have been identified and their positions and boundaries are known.

The first piece of object analysis is the building of vector fields for luminance and shading based on the contours of moving edges. This gives us the properties of surfaces. Surfaces combine to form objects. Objects have shape. A chair is a chair no matter if it's big or small or what color it is, or whether it has a cushion or not, or even if it has 6 legs instead of 4. Generally speaking though, chairs have a consistent surface construction. A flat place for the butt, and multiple legs to keep the butt off the ground. Maybe (usually) a back (chairs without backs are called stools, or Swedish chairs, or some other word to indicate they're not "normal" chairs). The relationship of those surfaces is what constitutes a chair.
 
Okay - so to understand visual surface processing, you'll want to know a bit about differential geometry. I'll try to cover the basics inline.

Imagine we're looking at a ball. The ball is a sphere, geometrically (it may be decorated, but fundamentally the ball is a sphere). It's a 2-dimensional surface, geometrically. Not 3, just 2. Mathematicians call it a 2-sphere.

We, humans, perceive this ball as being embedded in our 3-dimensional visual space. But the surface properties (brightness, color, etc) are 2-dimensional. To understand this, you can imagine yourself as a bug or a little tiny man, living "in" the surface of the ball. Kind of like Flatland.

The 3-dimensional bird's eye view is handled by extrinsic geometry, where the surface is seen as S(x,y,z). However the bug's eye view is handled by intrinsic geometry, where the surface is a curved version of S(x,y).

When we see the ball bouncing around in 3 dimensions, it also has a motion component, which makes our view actually 4 dimensional (space plus time).

So our visual system has to align and translate between 2, 3, and 4 dimensions. It has to separate the features belonging to the ball, from the features that arise when the ball interacts with other objects. Like, shadows cast by other moving objects. When we say "the ball is green", that's a property of the 2-sphere - but when we say "the ball is bouncing" it's a 4 dimensional description. When we say "the bouncing ball is green", we're relating the intrinsic features to the extrinsic features.

Mathematically, the identification of curved surfaces is sophisticated, the geometry involves metric tensors that change for every point in space and time. The good news is, that these relationships all boil down to dot products, which is simple algebra. And neural networks are very good with matrix math. A photonic neural network can classify an object in less than 500 picoseconds.

The way our visual system handles surfaces, is with "primitives". A primitive is a geometrical abstraction, for instance the primitive of the shape of a man might be a stick figure. In our visual system, primitives are built by convolutional layers. This means features are extracted gradually, then put back together again in various meaningful ways. In addition to size and color, every moving surface has identifiable and extractable properties associated with it. Like, the normal vector, which points "out" from the surface, away from the direction of curvature. The normal vector doesn't exist in 2 dimensions. In 3 dimensions it's constant as long as the object doesn't change shape. In 4 dimensions it changes all the time, with the motion of the features of the object (for instance shadows can occlude the normal).

The structure of a convolutional neural network can be easily conceived as hierarchical layers of local dot products. An example is this:
[attached image]


The concept of "chair" doesn't happen till the recognition layer on the right. First, the edge extraction is converted to a series of stick-figure surfaces having the property that they all move together and retain shape. Then, one view is related to another so the shape becomes constant no matter what the viewing angle is, or how far away it is or what its orientation is or how fast it's moving. Finally, the object properties are matched against a library of known (and labeled) objects, and subsequently this match is passed to episodic memory along with its features.

You can easily see that such processing requires memory at many time scales. Eye tremor is 200 Hz, visual cortex frames are 10 Hz, eye blinks and eye movements may last for hundreds of milliseconds, and objects may move around in the visual field for dozens of seconds. But a chair is always a chair, once it's been identified it remains a chair. There is one sequence of processing for identification, and another sequence of processing for subsequent tracking.

We'll talk about identification first, because it's pretty easy, it mainly involves a lot of matrix math. But first I have to get some sleep, have to be up at 4am, so we'll continue tomorrow.
 
I don't know about anyone else but I read every word of all your posts,

As Mr. Spock would say

 
Okay. Surfaces.

First thing to understand, is how the visual cortex (V1, the first stage) actually looks, and how it relates to the convolutional neural network architecture shown above.

Then, after understanding what V1 does, we'll move on to V2, and then V4 and MT, and finally the inferior temporal cortex (IT) which is akin to the "recognition layer" on the far right of the convolution map.

We talked about primitives. It turns out, that V1 is "pre"-primitive. It extracts features from the visual space, which are then assembled into primitives by subsequent areas of the visual cortex. Specifically, the three features of greatest relevance for V1 are ocular dominance, edge orientation, and color.

Just to get a visual for what I'm about to say, first realize that the input from the eyes is combined in the visual cortex. The left half of the visual field (from both eyes) goes to the right cortex, and vice versa. BUT, the input from each eye is kept separated in V1, in the form of ocular dominance columns. Each ocular dominance column processes input from one eye. Here is what ocular dominance columns actually look like in the surface of the V1 cortex:

[attached image]


Black is from the left eye, white is from the right.

But superimposed on this structure, are two other structures, one bigger and one smaller. Note the distance scale in the above picture. Taken together, the columnar architecture of V1 constitutes a matrix map of the (right or left half of) the visual field.

The matrix looks like this: for each point in the visual field, there is a "hypercolumn" approximately 1mm in diameter. Within that, there are two ocular dominance columns, about 500 microns each. Within each ocular dominance column there is a "color blob" in the middle, surrounded by a pinwheel of "orientation selective mini-columns" of extent 20-30 microns each. The pinwheel looks like this:

[attached image]


The black dot in the middle is the "color blob". It is a small region that stains positive for cytochrome oxidase. The color blob is color sensitive, but the orientation pinwheel is not.

Thus, for each point in the visual field there is an input from each eye, and within that input there is a color and an array of edge orientations. So we end up with a vector field for color, and a tensor field for orientations.
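
One standard way to write "an array of edge orientations at a point" as a tensor is the 2x2 structure tensor built from image gradients - used here purely as an analogy from computer vision, not as a claim about the actual cortical code. Its dominant eigenvector gives the local edge orientation, and two comparable eigenvalues signal a corner.

import numpy as np

# Toy image patch containing an oblique edge
n = 16
y, x = np.mgrid[0:n, 0:n]
img = (x + y > n).astype(float)

# First derivatives in space
gy, gx = np.gradient(img)

# Structure tensor: averaged outer products of the gradient
J = np.array([[np.mean(gx * gx), np.mean(gx * gy)],
              [np.mean(gx * gy), np.mean(gy * gy)]])

evals, evecs = np.linalg.eigh(J)
grad_dir = evecs[:, np.argmax(evals)]               # gradient direction (normal to the edge)
edge_dir = np.array([-grad_dir[1], grad_dir[0]])    # edge orientation is perpendicular to it
print("edge orientation (deg):", np.degrees(np.arctan2(edge_dir[1], edge_dir[0])) % 180)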

Within the orientation mini-columns, there is a preference for motion normal to the orientation. So if the edge orientation is vertical, the cells will respond best to horizontal motion. In humans and monkeys, there is a slight over-representation of the 90 degree angle in the orientation pinwheels in general, so a slight preference for vertical edges with horizontal movement.

In the V1 visual cortex there are 6 anatomical layers, which is an interesting story in and of itself. The columns we're talking about are vertical, whereas the layers are horizontal. So you can conceive of the processing matrix looking approximately like this:
[attached image]


Whereas the layers look like this:

[attached image]


The circuitry for a processing column looks like this:

[attached image]


The wiring between layers looks like this:

[attached image]


In the above diagram "M" is the magnocellular input from retina (via the lateral geniculate nucleus of the thalamus), which is the Y cells that handle motion. Whereas "P" is the parvocellular input from the X cells of the retina that handle static luminance and contrast. With respect to color, a fraction of the X cells are red-green opponent (some are just black and white or gray scale, depending on whether they get input from rods or cones), whereas the "K" (koniocellular) inputs are blue-yellow opponent. As expected, the K inputs and the red-green P inputs feed the color blobs, which can be seen mostly in layer 3 of the cortex.

In turn, this visual information from V1 is then fed to V2, in the following way:

[attached image]


The color information ("blob") goes to the thin stripes of V2, whereas the orientation information goes to the thick and pale stripes. We'll talk about V2 tomorrow, that is where visual surface primitives begin to be assembled.

For now, you can see that the primary visual cortex V1 is basically a gigantic matrix engine that builds a vector field for color and a tensor field for orientation, at each point in visual space.

The cellular circuitry is quite complex, and I'm leaving a lot out of the description just to provide an overview. For example you can see that the incoming and outgoing information is "multiplexed", each channel handles more than one piece of information, and how it does that has to do with the timing of neural firing and the relative excitation and inhibition between neurons. For now I just want to give you enough information to understand how surfaces are extracted and processed in V2 - because surfaces necessarily require the combination of signals from each eye into binocular representations in 3 dimensions.

And then later we'll see how the middle temporal area (MT, also known as V5) handles motion.

The general picture of the higher visual cortex is that V2 extracts surfaces, V4 extracts objects, and V5/MT tracks those objects as they move. IT, the inferior temporal lobe which is near the hippocampus anatomically, is responsible for recognizing objects based on memory, and the hippocampus is the beginning of the navigation system that lets us run mazes for rewarding objects, and grasp rewards that are hidden behind other objects.

Here is an overview of the human visual system minus the recognition part:
[attached image]


You can see there's an upper (dorsal) portion, and a lower (ventral) portion. So far we're just talking about the lower (ventral) stream, which is sometimes called the "what" pathway, as distinct from the dorsal stream which is called the "where" pathway. Both pathways are fed from V2, the surface processing area.

V2 is what is labeled "extrastriate cortex" in the above pic. It surrounds V1 and is about 3 times as big. There is an area called V3 which is also part of the extrastriate cortex, but generally in humans it's lumped in with V2 because the boundaries are indistinct and the wiring is almost identical. In monkeys though, V3 is more distinct and has different wiring. Both monkeys and humans have a "where" pathway and a "what" pathway in the visual system.

It is the "where" pathway that accomplishes the changes of basis that allows the rapid switching of reference frames when we're navigating a maze or exploring new terrain. Such exploration builds coordinate maps of visual space around our head and body movements, which is something completely different from object recognition. But both pathways feed into the "cognitive map" in the hippocampus that allows us to purposefully navigate visual terrain.

There are other parts of the visual system we haven't talked about (and won't), like MTS which has the specific purpose of recognizing faces and identifying the emotional components of facial maps. It's part of the "what" system, and is definitely related to surface mapping, but it's somewhat extraneous to surface geometry itself, it's a separate topic for a different thread.

Tomorrow we'll get to the differential geometry that allows us to identify surfaces from color and orientation. The good news is, it mainly boils down to simple algebra - in this case tensor algebra which is just an extension of vectors. Neural networks excel at calculating dot products, the results are almost instantaneous. If you're familiar with differential geometry you'll know about the Jacobian and the metric tensor, which vary at each point in space. Part of what the neuroscientists call "lateral inhibition" and "recurrent excitation" has to do with calculating the metric tensor and using it to estimate the curvature of surfaces.

We will eventually see, that the full 4 dimensional picture for moving surfaces looks a lot like general relativity, mathematically. Where, we have to use intrinsic geometry because there's nothing (we know of) that's "external" to the universe, therefore when we're looking at curvature we have to do it intrinsically. The brain is clever though, it uses both intrinsic and extrinsic descriptions, and maps between them to be able to accurately reconstitute and track moving objects from moving surfaces in real time.
 
Okay, so V2. (We're still talking about surfaces in visual space, and how that relates to object perception).

The key word for V2 is (binocular) disparity.

This is an immensely complicated topic, and in the process of talking about it, I'll use other important keywords and underline them to emphasize their significance.

And a word of caution in advance: don't try to understand this using "receptive field tuning" in the neuroscience literature, you'll just get confused. In the old days they used to poke around in the brain with microelectrodes, and they didn't know a whole lot about the time course of neural responses. There is a huge literature from the 80's about "disparity tuning in cortical neurons", and most of it is misleading and considerably inaccurate. Steady state receptive fields are summed from surrounding neurons, so if you're seeing disparity tuning in V1 it means the ocular dominance columns are interacting with each other (most likely by way of inhibitory interneurons).

V2 is the first area where signals from the two eyes directly interact. It processes binocular disparity, and it does so using edge detection and orientation detection with an intermediate goal of figure-ground separation. To talk about this, we'll necessarily have to talk about geometry. We may have to do this in multiple posts, because the topic is too huge for one lengthy session.

The first thing to understand is that our eyes have a vergence angle. Eye vergence is controlled by the depth focus system in the frontal eye fields. It's a completely different system from saccades or smooth pursuit, handled by different pathways. In saccades the eyes move together; in vergence they move in opposite directions.

Our eyes are separated by 5-7 cm. The "vergence focus" is the point in visual space where geometric normals, emanating from the center of each fovea, meet. This picture shows the definition of "vergence angle":

[attached image]


When the eyes look at something in the visual field, each eye gets a slightly different image. (This is always true regardless of the vergence angle). For a given vergence angle, here in turn is the definition of binocular disparity:
[attached image]


There is absolute disparity, and relative disparity. Absolute disparity is the difference in the retinal position of a single point between the two eyes, whereas relative disparity is the difference in absolute disparity between two different points in space.

Computer vision junkies know a lot about this, but when they use two cameras the disparities are usually measured in pixels rather than angles. In the human brain there is instead a "point" of focus, and disparities are measured relative to that point.
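
For a rough feel for the angles involved, here is a simplified geometric sketch (invented numbers, points assumed to lie straight ahead on the midline): take the absolute disparity of a point to be the difference between its vergence angle and the vergence angle of the fixation point, and the relative disparity of two points to be the difference of their absolute disparities.

import numpy as np

B = 0.065                          # interocular distance, roughly 6.5 cm
to_arcmin = lambda a: np.degrees(a) * 60

def vergence_angle(Z):
    # angle subtended by the interocular baseline at a point straight ahead at depth Z
    return 2 * np.arctan(B / (2 * Z))

Z_fix = 1.0                        # fixation point, 1 m away
Z1, Z2 = 0.8, 1.2                  # two other points in the scene

# Absolute disparity = vergence angle of the point minus that of the fixation point
abs1 = vergence_angle(Z1) - vergence_angle(Z_fix)
abs2 = vergence_angle(Z2) - vergence_angle(Z_fix)
rel = abs1 - abs2                  # relative disparity between the two points
print(f"absolute: {to_arcmin(abs1):+.0f}' and {to_arcmin(abs2):+.0f}'  relative: {to_arcmin(rel):+.0f}'")

In this convention, points nearer than fixation come out with positive (crossed) disparity and farther points with negative (uncrossed) disparity.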

So you'll remember that V1 has a matrix-like organization that generates a vector field for color and a tensor field for orientation. The word field is important in geometry and physics. A scalar field has a number for each point in space (for example a heat map, where the number is temperature). A vector field has an n-dimensional vector for each point in space, for example in our color map the basis colors are red, green, yellow, and blue, therefore our vector is 4 dimensional (it has 4 numbers, one for each color, indicating the amount of that color at each point in space). A tensor field has an (m,n) tensor for each point in space, where m is the number of contravariant (upper) indices and n is the number of covariant (lower) indices. We'll talk a lot more about tensors, they're very important - so don't worry if tensors are new to you, you'll learn a lot about them by the time we're done. The orientation columns in V1 form a tensor field because there can be more than one orientation at each point in space. For example we might have a corner, which has two different orientations. Or, we might have a curve, which has a continuously varying orientation. In a curve specifically, the orientations can interact with each other, which is why we need a tensor and not a simple vector. This is especially important in binocular disparity, where each eye sees a slightly different version of the orientations and angles.

And, there's an even more important point. When objects move, in visual space, their 2-d orientations and angles may change - as seen by humans, in each eye. However the 3-d surfaces of the objects themselves are invariant (if they're solid objects and not amoebas or jellyfish). Thus, the process of figure-ground separation involves matching the 2-d images with the reconstructed 3-d estimates of the objects in the visual field, and extracting the invariants.

Invariance is a topological concept, a geometric concept, and also an algebraic concept. Physicists will naturally zero in on the concepts of eigenvalues and eigenvectors, but invariance is a lot more than that. Nevertheless it's a good conceptual start. If we have a map f, from one vector space to another, the eigenvectors are those whose directions don't get changed by the map - they only get scaled, and the eigenvalues are the scaling factors. In other words, eigenvectors are those whose directions are invariant under the map.
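
A quick numerical check of that statement, with an arbitrary example matrix:

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
evals, evecs = np.linalg.eig(A)
for lam, v in zip(evals, evecs.T):
    # A @ v is parallel to v: the direction is invariant, only the length is scaled by lambda
    print(np.allclose(A @ v, lam * v))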

And before we start talking about V2, one more note on the math - if you know linear algebra, you know what a "basis vector" is. In orthonormal Cartesian coordinates, our 2-d basis vectors are (1,0) and (0,1), and they never change, they're the same everywhere in space. But our eyes naturally use polar coordinates, (r, theta) where r is a length and theta is an angle. There is yet a third and vitally important definition of basis vectors, as the coordinates of the tangent plane for any surface. In differential geometry, this third set of basis vectors changes for every point in space. For example, if we have a sphere, the tangent plane at the north pole is different from the tangent plane at the equator. If we know the tangent plane, we can calculate its basis vectors in any coordinate system (Cartesian or polar).

In differential geometry, vectors and tensors are invariants, whereas their coordinates are not. The coordinates change along with the basis vectors. For example, if we have a circle of radius 1, the coordinates of the north pole are (0,1) in Cartesian and (1,π/2) in polar. The vector pointing to the north pole hasn't changed, but its coordinates have changed because we're using a different basis. We can map between coordinate systems, using simple algebra. In two dimensions, there is a 2x2 matrix called a "linear map" that will do the conversion for us. There is a forward map and a backward map, and the two mapping matrices are inverses of each other.
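
Here is that 2x2 linear map made explicit (plain textbook calculus, nothing brain-specific; the sample point and displacement are arbitrary): the Jacobian of the polar-to-Cartesian map converts small displacements written in (dr, dtheta) into (dx, dy), and its inverse is the backward map.

import numpy as np

def jacobian_polar_to_cartesian(r, theta):
    # x = r cos(theta), y = r sin(theta)  ->  J = d(x, y) / d(r, theta)
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

r, theta = 1.0, np.pi / 2            # the "north pole" of the unit circle
J = jacobian_polar_to_cartesian(r, theta)
J_inv = np.linalg.inv(J)             # the backward map

d_polar = np.array([0.01, 0.02])     # a small displacement in (dr, dtheta)
d_cart = J @ d_polar                 # the same displacement in (dx, dy)
print(d_cart, J_inv @ d_cart)        # the round trip recovers (0.01, 0.02)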

For simple shapes like squares and circles, the math is straightforward linear algebra. Things get a little more complicated when we start talking about curved surfaces of arbitrary shape. We'll talk more math later, for now this is enough for binocular disparity.

So V2 - this is what V2 actually looks like:
[attached image]


What they call "interstripes" is also called "pale stripes". In a nutshell, the thin stripes process color, the pale stripes process orientation, and the thick stripes process binocular disparity and the motions related to it.

How these stripes map the visual field is as follows:

[attached image]


Why is color important for binocular disparity? If you'd like to jump ahead, read:


Before we really get into this, it is important to know some of the methods by which physiologists and psychologists test the response to disparity. One of the most important methods is the "random dot stereogram", where a set of random dots is presented to each eye, only a few of which have their positions correlated between eyes. Here is an example:

[attached image]

So now we want to start talking about figure-ground separation, and how it relates to edge detection (in other words how orientation columns are used to reconstruct the boundaries of surfaces).

And I have to take a break, and then we'll get into the meat of what V2 does. (It calculates absolute and relative disparity, and from that it extracts the boundaries of moving surfaces). To show how it works, we'll use object occlusion to determine figure and ground.
 
Okay, so occlusion, figure-ground calculation, and absolute binocular disparity.

Each of us has a dominant eye. You can tell which one is yours, with the following simple test.

Find a vertical object behind another vertical object. Some distance away, is best. For example a telephone pole some distance behind one of those spiky fences with the spikes on top (which is kind of like a grating). Now, with both eyes open, choose a spike, and line up the telephone pole so it's directly in line with the spike.

Now close one eye, and then the other. You'll notice that on one side, the image remains unchanged (or only "very slightly" changed), whereas on the other side, there is a noticeable shift in the position of the telephone pole relative to the grating. The unchanged eye is your dominant eye.

What's happening here, is kind of like this (figure A):

[attached image]


In figure A, the lightly shaded tree nearer to the eye is occluding the darker tree that's farther away. The eye shown is the dominant eye, but let's imagine there's another eye right next to it, in a slightly different position. This second eye, will see slightly more or slightly less of the dark tree, depending on which side it's on. (Reference the earlier picture of binocular disparity in the previous post, for the viewing angles).

Figure B is generally perceived as a tilted square on a dark background, with the light-dark border being the contour (boundary) of the square - but notice the information is ambiguous. The square could be a window.

Figure C demonstrates the concept of border ownership. The interpretation of a 2-d image depends on how the contrast borders are assigned. Consider the border marked by the black dot. If the border is assigned left, the square is an object in front of a dark background. If the border is assigned right, the square becomes a piece of background seen through a window. Given a flat (monocular) display without depth cues, the visual system assumes the object interpretation. This is an example of perceptual figure-ground organization.

Binocular disparity information provides depth cues and disambiguates the three dimensional structure of the objects in the visual field. The same principle that applies between objects, also applies within objects.

These are static displays without motion. In such a situation it is fairly easy to map between the Cartesian coordinates of two trees and the polar coordinates of their representations in each retina. This is what V2 does. It uses absolute disparity to generate a 3d representation of the visual field. It multiplexes the motion information on top of this map and passes it on to the higher layers along with the map. It doesn't assign objects per se, but it assigns boundaries to what's inside them based on the disparity information.

Once again this is simple matrix math. In a computer it would be done with an exhaustive pixel by pixel search for related contours, but in the brain it's done all at once in parallel. This is made possible by the convolutional architecture of the neural network layers.

Now, if you jumped ahead and read the PNAS cite in the previous post, you'll understand why color is important in this process. (If you didn't jump ahead you can read the cite now). But again, be careful of the neuroscience literature. Old world monkeys (macaque, rhesus) tend to be trichromatic, but new world monkeys are often dichromatic and have different brain architecture. A cat brain is very different from a monkey brain (for one thing cats have no fovea), and a mouse brain isn't even close. If you read the neuroscience literature "after" understanding the convolutional architecture you'll say "yep, makes sense", but if you try to do it bottom up you'll get very confused.

By now, the motion part of this should be pretty obvious. When objects are moving relative to each other, the occlusions are changing, and the assignment of boundary contours is changing. Again, simple matrix math - all we need to know is which contour belongs to which object. That's what the feedback pathways are for in the brain. There is top-down tracking of the assignments when things switch. And, it's not a passive process, it's based on expectation (in other words it's predictive, and you can easily see how this makes perfect sense when the assignments are changing 10 times a second).

The first area in the brain where "objects" actually appear is V4, which is fed directly from V2. This is where the specular information is calculated. The motion processing areas are also fed directly from V2, and they combine the object information with the moving contour information. V5 for instance is in a circuit that involves the pulvinar nucleus of the thalamus and the superior colliculus that targets eye movements. V4 and V5 both respond to visual attention. We often ignore a substantial portion of the visual field, which leaves the neurons free for other things. But when we're paying attention, the entire visual system lights up like a Christmas tree, in an fMRI.

This entire description only takes us halfway through the convolutional layers shown earlier. It takes us as far as identifying that something "is" an object, without telling us "what" it is. The "what" part is handled by the inferior temporal cortex, it combines high level visual information from all the convolutional layers, and uses a weight matrix to assign importance to features. It then does an associative lookup on those features. IT is highly context sensitive, for instance the exact same orientation and size ratios that make up a "chair" may become a "corridor" if we're navigating a maze. Situational context is determined by a loop between the hippocampus and the lateral frontal cortex. In between IT and this loop sits the entorhinal cortex and the retrosplenial cortex, which is where we find the circuitry that allows us to switch rapidly between allocentric and egocentric reference frames (for example the grid cells, that only light up when the organism is in a specific position in the room). Grid cells are allocentric, they depend on a 3-d map of the world that only comes from many successive views of objects in the visual field (combined with head direction and body position information for each view).

Now we have to talk a little bit more about math, which'll be tomorrow. The conversion between allocentric and egocentric reference frames requires the tangent vector basis system, which, as it happens, is almost identical to the mapping between absolute and relative disparity that allows us to assign properties to objects. Global vs local disparity is a big issue. Global disparity can separate figure from ground, but local disparity is required to assign the sophisticated visual properties to objects, like texture, shading, and specular reflection.
 
Okay, to follow up, for those of you that are interested, I'll share a few links that will describe the rest of the situation more quickly and efficiently than I can.

So far I've given you an overview of the matrix structure underlying the biological analysis of the visual field. The rest of the story involves the actual math. That is to say, the math used by neurons, as distinct from the math used by computers.

To really understand this topic, you need to understand both. For example, differential geometry is used differently in computers than it is in the brain. Even though it's the same math and the same principles apply, the goals are usually a bit different.

And, computers can do everything they need to do with exterior geometry, whereas brains are often confined to the interior mode. So for example, to understand the reconstruction of arbitrary curved surfaces, you need to know how to use the metric tensor in each mode. In a way, the problem is easy, it's kind of like measuring distances and angles on a globe versus a Mercator projection. But in another way the problem is a little harder, because we don't know in advance if the Mercator projection is actually flat.

So I'll give you these links, which will help you understand surface reconstruction from the standpoint of computing, and then from the standpoint of brains. And I'll state in advance, that this is not the whole story, because brains use top-down controls that are missing from these descriptions. However these links should provide a pretty good description of the bottom-up story, which is simply the reconstruction of surfaces in visual space from binocular sensory information.

So first, computing:



Then, brains:



If you'd like a quick and direct demonstration of how the brain's real-time visual perception can be easily confused by very small changes in binocular disparity, get yourself one of those desktop magnifiers (the kind the electronic assembly people use when they're soldering circuit boards), and look at a small and detailed object through it (like, a circuit board). Now, wiggle the magnifier. Just a little bit, it doesn't take much. You can kick it with your hand, very gently, and look through the magnifier while it's still wiggling. It's very disorienting. It's a unique sensation, different from vertigo (although that element is in it), and different from the types of discomfort experienced by airplane pilots, although that element is there too. In this experiment you're specifically impairing visual cortical area V2, so this may be similar to what people with strokes in V2 experience, although in their case it's permanent, which is why in many cases they can no longer tolerate changing visual environments.
 
For the fully advanced view, see if you can digest this.

I mentioned that binocular vision was like general relativity in some ways. Specifically, they both naturally use hyperbolic geometry, which is why we need tensor calculus to handle it.

In vision, we have the two "approximately" flat projections into each retina, and then we have the integrated 4 dimensional spacetime of the "cyclopean eye". To get from one to the other we need unitary transformations in the complex plane, just like relativity and quantum mechanics.

In vision specifically, we need an automorphism group which turns out to be the Mobius transform group, where the eyes become fixed points of the automorphism.

Mathematically this is tensor algebra: a 3-point ratio from an object in space to each eye has an angle measuring the difference in azimuth and a modulus measuring the difference in distance. The 4-point cross ratio of such ratios is the only 4-point invariant under the Mobius transform.

This naturally generates a hyperbolic geometry because the complex upper half-plane is holomorphically equivalent to the open unit disk. Both Poincaré and Klein would agree with this interpretation. The mapping between two binocular views and the cyclopean view therefore becomes a hyperbolic tangent function.

Which, it turns out, is straightforward to represent in a neural network, as long as we use biological characteristics instead of machine learning algorithms.
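
Two of the purely mathematical claims above are easy to check numerically (the points and coefficients below are arbitrary examples): the 4-point cross ratio is unchanged by a Mobius transformation, and the Cayley map sends the upper half-plane into the open unit disk.

import numpy as np

def mobius(z, a, b, c, d):
    return (a * z + b) / (c * z + d)

def cross_ratio(z1, z2, z3, z4):
    return ((z1 - z3) * (z2 - z4)) / ((z1 - z4) * (z2 - z3))

# Four arbitrary points in the upper half-plane and an arbitrary Mobius map (ad - bc != 0)
pts = [1 + 2j, -0.5 + 1j, 0.3 + 0.7j, 2 + 0.1j]
a, b, c, d = 2 + 1j, 1, 0.5j, 3

before = cross_ratio(*pts)
after = cross_ratio(*[mobius(z, a, b, c, d) for z in pts])
print(np.isclose(before, after))            # True: the cross ratio is Mobius-invariant

# Cayley transform: upper half-plane -> open unit disk
cayley = lambda z: (z - 1j) / (z + 1j)
print([abs(cayley(z)) < 1 for z in pts])    # all True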

Neuroscience struggles with this because most biologists don't know math. And the few that do, often get brainwashed by the machine learning concepts. (Steve Grossberg's "adaptive resonance theory" is a prime example, he's a brilliant theoretician but his models are very sequential, almost digital, and he has to invoke "fuzzy logic" to get them to work in the real world).

Brains are fundamentally simple, they build complex behavior from very simple chemical kinetics and some simple mapping rules - which developing cells can use just by finding chemical gradients with cell surface receptors. Such maps are never "precise", what happens is they self organize around the simple geometry. If you doubt that you can look at the loss of binocular disparity during the self organization of the retinotopic mapping from the eyes to the thalamus. Neural signaling "turns off" while this happens, because the information has to be used in a different way to fix the map.

So relativity, quantum mechanics, and human vision all use the same math and the same set of rules - which makes sense because nature re-uses what works. A complex system will always reflect the properties of its simple components. The whole universe and all its components are built from unitary symmetries, and the geometry reflects that, and so does the algebra that deciphers complex behaviors from simple symmetry groups.

 
I don't know about anyone else but I read every word of all your posts,

As Mr. Spock would say


There's one born every minute as they say.

Also this dingbat uses the forum as a personal blog, that's not really what these forums are designed for, it's an abuse of the forum IMHO, why anyone wants to read page after page of plagiarized crap is anyone's guess.
 
An excellent reference on this topic is the book

Visual Differential Geometry and Forms

by Tristan Needham, Princeton University Press

The Mobius transformation is discussed in detail in chapter 6.

Mobius transformations are basically combinations of rotations, translations, and dilations (scaling), and depending on the type of geometry we're interested in, possibly inversions (reflections). The latter is only pseudo-conformal insofar as the orientation of angles is inverted (even though the magnitude of the angle stays the same).

The Mobius transformation f(z) = (az + b)/(cz + d) is simple matrix math; written in homogeneous form it looks like this:

| a b | | z |
| c d | | 1 |

With respect to neural networks, two things are germane. First, you need to know how multiplication and division are accomplished in synapses. Second, you need to know how complex numbers are represented in the network. (In the above matrix math, z is a complex number, and we already know by Euler's formula that the angle being represented is a combination of sines and cosines that are segregated by the imaginary number i - so in a neural network we simply do matrix projections to get our sine and cosine, and then ensure they're segregated along separate axes).

Using this architecture, Mobius transformations are easily accomplished both locally and globally, and binocular alignment becomes a simple matter of finding the kernel (the point where the result is 0) using local excitation and inhibition from neighboring neurons.
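
To make the homogeneous matrix form above explicit (illustrative matrices only, not a model of any particular circuit): applying the 2x2 matrix to the pair (z, 1) and dividing the two components reproduces f(z) = (az + b)/(cz + d), and composing two Mobius maps is just multiplying their matrices - which is why this fits so naturally into "simple matrix math".

import numpy as np

def mobius_matrix(M, z):
    # Apply a Mobius transform, given as a 2x2 complex matrix, to the point z
    w = M @ np.array([z, 1.0])
    return w[0] / w[1]              # de-homogenize: (a*z + b) / (c*z + d)

M1 = np.array([[1, 1j], [0, 1]])    # translation by i
M2 = np.array([[0, -1], [1, 0]])    # inversion z -> -1/z
z = 0.3 + 0.4j

one_then_two = mobius_matrix(M2, mobius_matrix(M1, z))
composed = mobius_matrix(M2 @ M1, z)
print(np.isclose(one_then_two, composed))   # True: composition = matrix product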

From a machine learning standpoint note that we're using "vanilla" neural networks, nothing is physics-informed or math-informed. The key to understanding the reconstruction process is realizing that orientations are tangent vectors. When we have a curve in space, the orientation columns with maximal response will be those that align with the tangent to the curve. Thus we have a natural intrinsic way to do differential geometry on the visual field, and we can do it in both interior and exterior modes by simply using Mobius transformations to map depth.

All this is simple matrix math, but it requires the presence of first derivative information so the orientations are "primed" before the static information arrives. (In other words the derivative has to exist up front, there's not enough time to calculate it, and calculating it would require extra processing power meaning more neurons and longer delays).

Multiplication and division can be accomplished in a number of ways by neurons, the fastest and most effective ways are axo-axonic and dendro-dendritic synapses (which are ubiquitous in the cerebral cortex). Division can be accomplished by "shunting" synapses, which are also ubiquitous. And note that our results don't have to be incredibly precise, all we have to do is align images. We want to find the kernel of our local maps, the point where the operation yields a 0 result, so it can be "give or take" a little bit. The precision in humans is quite remarkable given the architecture, and how that's accomplished is by phase coding, which is the same way bats and barn owls achieve millimeter resolution tracking prey, from neurons that only provide centimeter resolution. This is why we have orientation columns, to enhance the resolution of binocular disparity processing through phase coding.

If you like you can pull the numbers to verify that this mapping actually works. Retinal neurons in the fovea have a resolution (receptive field size) of a few minutes of arc, but by the time the information hits the thalamus it's been reduced to degrees. Cortex is in degrees too, not minutes. But the presence of orientation columns combined with phase coding restores minute resolution.

How the cortex actually works is fascinating. You have time-to-first-spike (TTFS) which carries one type of information, then you have bursting which carries a different type of information. All this occurs in cycles based on the alpha rhythm, and neighboring cycles can be synchronized or not depending on the type of processing that's occurring. Between TTFS and bursting there is a quiet period where calculations are being performed on the TTFS information using lateral inhibition, and a template is being generated for the arrival of the bursts. The burst information is then compared with the template and a difference vector is generated, which is then used in the learning process, for STP and STD. The longer term forms of learning (LTP and LTD) usually occur from bursting only, they don't depend on the template. A similar process occurs in the hippocampus (it's been very well studied), but the cerebral cortex has twice as many layers as the hippocampus and essentially performs two of these operations at once. One of them handles the real time aspects of the image and the other is more related to "recognition", which is to say top-down processing from the higher portions of the processing chain.

You will note in this architecture, that the segregation of first-derivative information is retained through the first four layers of visual processing. This is because the tangent information is used directly in calculations. In two dimensions the movement of a surface is parametrized by time, the location of a point in u,v coordinates becomes u(t) and v(t). However in three dimensions it's converted to a vector using a map (u,v) => (x,y,z), where the origin is behind the eyes at the focal depth. This is how we get from intrinsic to extrinsic geometry and back again, and how we translate between egocentric and allocentric reference frames.
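
A small sketch of that last map (standard differential geometry, with a made-up surface and trajectory): a point moving in intrinsic coordinates (u(t), v(t)) gets its extrinsic 3-d velocity by multiplying the intrinsic velocity (du/dt, dv/dt) by the Jacobian whose columns are the tangent basis vectors.

import numpy as np

# Embedding map (u, v) -> (x, y, z): an invented paraboloid placed in front of the observer
def S(u, v):
    return np.array([u, v, 1.0 + 0.1 * (u**2 + v**2)])

def tangent_jacobian(u, v, h=1e-6):
    e_u = (S(u + h, v) - S(u - h, v)) / (2 * h)   # tangent basis vector along u
    e_v = (S(u, v + h) - S(u, v - h)) / (2 * h)   # tangent basis vector along v
    return np.column_stack([e_u, e_v])

# A point on the surface and its intrinsic velocity (du/dt, dv/dt)
u, v = 0.5, -0.2
uv_dot = np.array([0.3, 0.1])

# Chain rule: extrinsic 3-d velocity = J(u, v) @ (du/dt, dv/dt)
xyz_dot = tangent_jacobian(u, v) @ uv_dot
print(xyz_dot)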
 
So let's recap.

Why does the first derivative information get there first?

So the tangent vectors are available for the calculations that require them.

Why do we have orientation columns?

So sines and cosines can be quickly and easily extracted relative to the tangent vectors, by simple matrix projection.

Why do we have ocular dominance columns?

Because the view from each eye needs to be calculated separately; the sines and cosines are different in each eye.

Why do we have color blobs?

Because color is needed to calculate shading and specularity relative to the surface boundaries (which means relative to the tangent vectors).

Why do we have binocular stripes in V2?

Because we need to merge the information from the two eyes both locally and globally.

Why is the color information kept separated in V2?

Because V4 handles it; texture can only be calculated once the surfaces are known.

Other than retinotopic matrices, what is the primary mechanism of information representation in the visual cortex?

Phase coding relative to the alpha rhythm, reset by eye blinks and eye movements.

What is the function of lateral inhibition in the cerebral cortex?

Among other things, to make phase coding more precise.

What is the first area in the brain to merge information from both halves of the visual field?

V2, because it calculates binocular disparity.

How is binocular disparity calculated?

By aligning the surface boundaries from each eye, which requires a color vector and an orientation tensor.

How are curved surfaces reconstructed in the brain?

By generating an orientation tensor at each point in visual space, represented as linear combinations of tangent vectors and co-vectors.

What is a co-vector and why is it needed?

A co-vector can be pictured as a stack of level surfaces (surfaces of equal value). It's needed to reconstruct the coordinate maps of curved surfaces in 3 dimensions.

What is the brain area that determines the motion of surfaces?

V5/MT, which preferentially feeds the "where" stream instead of the "what" stream.

What is the brain area that recognizes objects?

IT, the inferior temporal cortex.

What is the brain area that calculates egocentric and allocentric reference frames for navigating visual space?

The hippocampus, along with the entorhinal and retrosplenial cortex and Area 7 of the parietal lobe.

In what brain area are the where stream and the what stream first combined?

The hippocampus. The circuitry includes the entorhinal cortex.

What brain area determines the context for visual navigation?

The hippocampus. The circuitry includes the lateral frontal lobe. Context requires the extraction of related information from episodic memory.

So here is a basic but complete description of the human visual system, including its processing mechanisms. Human beings can process a visual scene within about 200 milliseconds (two alpha cycles). Humans can also memorize a face with one presentation. A photonic neural network can categorize an object in half a nanosecond, but it takes several thousand presentations to learn a face. Computer memory is also transient, whereas human memory is permanent.
 
By the way, four fixed points for the Mobius transformation are easy to define. They are:

the positions of the two eyes
the binocular focal point, and
the location of the object

This is why there is input from the eye-position system in the superior colliculus to every stage of the what stream, starting in the LGN, before visual information even hits the cortex.
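For what it's worth, here is a generic Mobius transformation acting on complex image-plane coordinates. How the four anchor points above would pin down the coefficients isn't spelled out here, so the a, b, c, d values below are arbitrary:

```
import numpy as np

# Generic Mobius transformation w = (a*z + b) / (c*z + d) acting on complex
# image-plane coordinates.  The coefficients here are arbitrary; they are not
# derived from the four anchor points described above.
def mobius(z, a, b, c, d):
    assert not np.isclose(a * d - b * c, 0), "degenerate map"
    return (a * z + b) / (c * z + d)

points = np.array([0.0 + 0.0j, 1.0 + 0.0j, 0.0 + 1.0j, 0.5 + 0.5j])
mapped = mobius(points, a=1, b=0.2j, c=0.1, d=1)
for z, w in zip(points, mapped):
    print(z, "->", np.round(w, 3))
```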

There is another important nucleus in the thalamus, called the pulvinar, that has to do with attention. It handles not only visual information, but also auditory information and other modalities. However the majority of pulvinar projections target the visual cortex. Pulvinar axons terminate in layer 1 of V1, and layer 4 of V2/V3 and other visual cortex areas, including IT.

Pulvinar has two visual maps and two maps of the cerebral cortex. It matches visual content with the cortical processing area.

One of the important features of cerebral behavior is "synchrony". Generally speaking, synchronized neurons talk to each other because they're phase-locked, which is to say, they process each other's TTFS (time to first spike). Phase locking also means they burst at the same time. This mechanism is a way of achieving multiple convolutional networks in parallel. (Think of it much like a "thread" in a computer operating system, where threads can be assigned to available CPUs.) Visual threads are likely controlled by the pulvinar.
 
So now we have a very easy way to understand the differential geometry.

To do that, we need to understand the difference between the way computers do things and the way the brain does things.

First of all, we need to establish that a vector is a geometric invariant. It looks the same in any coordinate system; it's just that the coefficients change depending on our choice of basis vectors.

For example - you're used to the Cartesian coordinate system, where the basis vectors are the same everywhere in space. Let's call the basis vectors E. In Cartesian coordinates then, E(x) is (1,0) and E(y) is (0,1), that kind of thing.

But in polar coordinates, the basis vectors change at every point in space. We can see this from the transformation formula, x = r cos theta and y = r sin theta, where the polar coordinates are r and theta and the basis vectors point along the r and theta directions.

To determine the distance between two points on a curved surface, we need the metric tensor, which is why we need differential geometry. In differential geometry, the basis vectors become the derivative operators in the direction of the coordinates. This is why we use the tangent plane TpM at a point p: in a small neighborhood around p the surface is "approximately flat", which means we can use calculus to obtain the tangent vectors in any given direction.
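Concretely, with x = r cos theta and y = r sin theta, the coordinate basis vectors at a point are the partial derivatives of the position with respect to each coordinate:

```
\mathbf{e}_r = \frac{\partial (x, y)}{\partial r} = (\cos\theta,\ \sin\theta), \qquad
\mathbf{e}_\theta = \frac{\partial (x, y)}{\partial \theta} = (-r\sin\theta,\ r\cos\theta)
```

Notice that e_theta grows with r, which is exactly the "basis vectors change at every point" statement above.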

So in this case, if we have Cartesian coordinates and we want to translate them to polar coordinates, we use the Jacobian matrix (actually its transpose, but I'll just call it the Jacobian to keep it simple).
[Attachment: the Jacobian matrix for the Cartesian/polar change of coordinates]
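(Presumably the attachment shows the standard Jacobian of the map (r, theta) => (x, y); for reference:)

```
J = \frac{\partial(x, y)}{\partial(r, \theta)} =
\begin{pmatrix}
\cos\theta & -r\sin\theta \\
\sin\theta & \phantom{-}r\cos\theta
\end{pmatrix}
```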


And similarly, if we start with polar coordinates and want to convert to Cartesian, we can use the inverse Jacobian.

To get the metric tensor, though, we need the dot product operation, and we can do that in one of two ways. If we know the lengths and angles, we can calculate the dot product from the norms and the cosine of the angle between the vectors. Or we can use the metric tensor to calculate the dot product by matrix multiplication, according to the formula g(v,w), where g is the metric tensor and v and w are any two vectors. The change of coordinates for the metric itself can be written as g_polar = J^T g_Cartesian J, where g_polar and g_Cartesian are the metric tensors in polar and Cartesian coordinates, and J is the Jacobian matrix that converts between the coordinate systems.
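A quick numerical check of that statement, in numpy; the point (r, theta) is arbitrary. Pulling the Cartesian identity metric back through the Jacobian gives the familiar polar metric diag(1, r^2):

```
import numpy as np

# Pull the Cartesian metric (the identity) back into polar coordinates:
# g_polar = J^T @ g_cartesian @ J, where J = d(x, y)/d(r, theta).
r, theta = 2.0, 0.7                         # arbitrary point
J = np.array([[np.cos(theta), -r * np.sin(theta)],
              [np.sin(theta),  r * np.cos(theta)]])
g_cart = np.eye(2)
g_polar = J.T @ g_cart @ J
print(np.round(g_polar, 6))                 # -> diag(1, r^2) = [[1, 0], [0, 4]]

# Dot product of two vectors expressed in polar components:
v = np.array([0.3, 0.1])                    # (dr, dtheta) components
w = np.array([-0.2, 0.4])
print(float(v @ g_polar @ w))               # g(v, w) in polar coordinates
```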

So for example - in the retinotopic mapping from the thalamus to the first area of the visual cortex V1, we have a complex log spatial mapping that converts polar coordinates to Cartesian coordinates, as mentioned earlier. This means we lose our polar coordinates, and if we need them again we have to recalculate them. It also means that feedback from the cortex to the thalamus has to perform the inverse mapping if we want it to remain retinotopic.
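A minimal sketch of a complex-log retinotopic map, which is the standard textbook approximation rather than a claim about the exact V1 geometry: a retinal point at eccentricity r and angle theta lands at cortical coordinates roughly (log r, theta), and the feedback direction just inverts the map:

```
import numpy as np

# Complex-log retinotopy (standard approximation): a retinal location
# z = r * exp(i*theta) maps to cortical location w = log(z) = (log r, theta),
# so polar structure on the retina becomes roughly Cartesian on the cortex.
def retina_to_cortex(r, theta_deg):
    z = r * np.exp(1j * np.deg2rad(theta_deg))
    w = np.log(z)
    return w.real, np.rad2deg(w.imag)       # (log-eccentricity, angle)

def cortex_to_retina(x, y_deg):
    z = np.exp(x + 1j * np.deg2rad(y_deg))  # inverse map (the "feedback" direction)
    return abs(z), np.rad2deg(np.angle(z))

print(retina_to_cortex(2.0, 45.0))          # -> (log 2, 45)
print(cortex_to_retina(*retina_to_cortex(2.0, 45.0)))   # round trip -> (2.0, 45.0)
```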

It is certainly easier to calculate dot products in Cartesian coordinates, where the metric tensor is just the identity matrix everywhere in space - which is why the brain does it that way. The purpose of the dot products is to compute the projections of surface vectors onto the coordinate system. We need this to calculate binocular disparity, and for subsequent calculation of surface boundaries in 3 dimensions.

But here's the twist: the complex log mapping doesn't change the motion information, which remains encoded in polar coordinates from the retina. For that reason, the orientation columns are not simple alignment vectors; they also process spatial frequency, which corresponds to the co-vectors. From a tensor algebra point of view, if you have the vectors and co-vectors you can generate any tensor, which includes any linear map.
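A small illustration of that last point: a spatial-frequency grating behaves like a co-vector (it eats a displacement vector and returns a phase advance), and outer products of vectors and co-vectors build higher-rank tensors, i.e. arbitrary linear maps. The numbers below are arbitrary:

```
import numpy as np

# A spatial-frequency "grating" can be treated as a co-vector k: it eats a
# displacement vector v and returns a number, k(v) = k . v (phase advance).
# Outer products of vectors and co-vectors then build higher-rank tensors,
# i.e. arbitrary linear maps.
k = np.array([3.0, 1.0])        # co-vector components (cycles per unit length)
v = np.array([0.2, -0.1])       # displacement / tangent vector
print("k(v) =", k @ v)          # scalar: how many cycles the displacement crosses

T = np.outer(v, k)              # a rank-2 tensor (linear map) built from v and k
u = np.array([1.0, 2.0])
print("T(u) =", T @ u)          # T applied to a vector gives another vector
```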

So you can see why things are the way they are: if we want to calculate the Jacobian and inverse Jacobian (which we need because they're our forward and backward transforms for coordinate systems), we need the projections of arbitrary surface vectors which means the sines and cosines. To translate these back to polar coordinates (to extract and map the motion information) we need the metric tensor. The reason we can't just scrape these from the retina is because we need the integrated "cyclopean" view in 3 dimensions. To get this "directly" in polar coordinates would be computationally difficult (and time consuming). So the brain does the clever thing: it first maps to Cartesian coordinates (using a hardware mapping which is computationally "free"), where calculations are much easier because the basis vectors are spatially consistent. Then it extracts the vectors and co-vectors using edge detectors and spatial frequency detectors. Then it uses those to build the surfaces by aligning the input from the two eyes (which can now be mapped by co-moving edges and spatial frequencies). Finally it assigns motion to the surfaces, the information for which has been multiplexed into the communication channels all the way from the retina, through every stage of processing.

This whole chain of computation is very quick, because it only uses matrix math. Everything that's computationally expensive is done in hardware, including change of coordinates, determination of angles and distances, and extraction of vectors and co-vectors. All of these things are done in a single alpha cycle, by V1 and V2 using TTFS (discussed earlier). A second alpha cycle is then required to calculate surfaces ("objects") from binocular disparity. A third alpha cycle is only needed for object recognition - so it makes perfect sense that a P300 should occur at the third alpha cycle (and not before) when the object information is nonsensical or surprising.

What is missing from this description is the role of synchronization - or more specifically its inverse, desynchronization. The short story is we just don't know. So far it looks like it has something to do with attention, and something to do with memory. We do know that visual hot spots (important stimuli that require attention) can drive the cortex into criticality (extreme desynchronization), and that the amount and precision of information being processed in such a state is 100x greater than normal. No one knows what this means yet.

But the rest of it is becoming cut and dried. Research on visual processing by neurons started in 1959, so it's taken 65 years to get this far. Thousands of rats, cats, and monkeys had to give their lives to make it happen. Now we can do it with machines, on sub-nanosecond time scales. The next frontier is photonic computing using micro-ring resonators, which requires practically no energy; combined with quantum memristors, the memory can be made permanent at zero processing cost.
 
