To begin this next section, we'll ask how the brain transforms scene maps into long term memory. By now we should have a pretty good understanding of how scene maps are formed. But short term memory is very specific: damage to a single brain area (the hippocampus) destroys it completely. There is no single area for long term memory, though. Long term memory is all over the place; it's kind of holographic. If you chop out pieces of the brain you don't lose your memories, they just get fuzzy, with a little less detail.
The biological context around this is essential, and by reading through this link you can quickly pick up a lot of the key concepts. This paper talks about the inferior temporal cortex, which handles object recognition.
Inferior temporal cortex (IT) is a key part of the ventral visual pathway implicated in object, face, and scene perception. But how does IT work? Here, I describe an organizational scheme that marries form and function and provides a framework for ...
pmc.ncbi.nlm.nih.gov
One of the principles in play is that brain areas often arise by duplication of other brain areas. The discussion about the retinotopic maps in the monkey brain is excellent. There are about 20 of them altogether, and if you look at them the right way they look like a big convolutional neural network from the machine learning textbooks. (I can find a complete map of all 20 areas if you wish. Monkey brains are basically very similar to human brains, so you can get a feel for what ours looks like.)
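To make the convolutional-network comparison a bit more concrete, here is a minimal sketch of how the "receptive field" of a unit grows as you stack convolutional layers, using the standard textbook recurrence. The layer list `STACK` is made up for illustration and is not meant to correspond to any particular set of brain areas; treating each layer as one retinotopic map is my assumption, not something the paper states.

```python
def receptive_fields(layers):
    """Receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, first layer first.
    Uses the usual recurrence: rf += (kernel - 1) * jump; jump *= stride.
    """
    rf, jump = 1, 1  # a single input pixel sees itself; spacing is 1 pixel
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Hypothetical four-stage stack (kernel_size, stride); numbers are illustrative only.
STACK = [(3, 1), (3, 2), (3, 2), (3, 2)]

print(receptive_fields(STACK))  # [3, 5, 9, 17]
```

Each stage still lays its units out in the same spatial order as the input, which is the sense in which it stays "retinotopic", but each unit covers a wider patch of the image than the stage before it.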
Another principle is the interplay between top-down and bottom-up information flow, which is what we've been talking about in terms of "predictive coding". The biologists started out thinking the brain was programmed genetically, but now we know it's all about the data. Your brain is different from mine because you've had different experiences, it's that simple.
The infrastructure is the part that's programmed genetically. When we talk about memory consolidation, that is a top-down process that depends on special biochemistry (the neuroscientists call it long term plasticity), involving a plethora of calcium ion channels with differing time courses, some voltage dependent and some not. But "attention" is before all that, because nothing irrelevant makes it into long term memory.
So far, the most popular theory of consolidation is "hippocampal replay" during sleep and quiet wakefulness. But the evidence is scant - replay definitely occurs, but its relationship to consolidation is still mostly a guess. What helps, in terms of understanding the constraints, is to consider what happens "after" the scene map. Once you have the scene, what do you do? Well, you navigate... and you manipulate.
Let's say you take an action on an object to bring about a result - say, the monkey pushes a button to get a reward. So that's our action: we're going to push a button somewhere in our scene. The button is the object that's been identified, but can we apply the same action to some other object? Sure! We can push the door knob, we can push the cage, we can push anything. But the button is the object associated with the reward. So we apply "this" action to "that" object. And of course, the object could be anywhere in the visual field (or not in the visual field at all). It's the object we care about, and we only need its specific location when we push it.
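The "this action, that object" idea can be sketched in a few lines. This is only an illustration under my own assumptions: the `scene` dictionary, the `reward_for` table, and the function names are all hypothetical, and nothing here is a claim about how the brain actually stores these associations. The point is just that the choice of object depends on the learned (action, object) pairing, while the location is looked up only at the moment of acting.

```python
# Hypothetical scene map: object name -> (x, y) location. Values are made up.
scene = {"button": (1.0, 0.2), "door knob": (-2.0, 0.5), "cage": (0.0, -1.5)}

# Hypothetical learned associations: (action, object) -> expected reward.
reward_for = {("push", "button"): 1.0}


def choose_target(action, scene, reward_for):
    """Pick the object with the highest expected reward for this action.

    No location is needed here, only the object's identity.
    Assumes the scene contains at least one object.
    """
    candidates = [(reward_for.get((action, name), 0.0), name) for name in scene]
    best_reward, best_name = max(candidates)
    return best_name


def execute(action, target, scene):
    """Look up the target's location only now, when we actually act."""
    location = scene.get(target)
    if location is None:
        return None  # the object is not in the scene map
    return (action, target, location)


target = choose_target("push", scene, reward_for)
print(execute("push", target, scene))
```

The same `execute` call works for the door knob or the cage; only the button carries a reward in the table.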
So the semantic aspects of object recognition guide attention. If we can't find the object in the visual field, maybe we go looking for it. Maybe we look around the room. Maybe we get up and look in the kitchen or the bedroom. Maybe we even go foraging, go to the store or something. Interestingly enough, the same strategies that govern search in the visual field also govern search in the scene. (Remember, the scene is the entire 3D map of our surroundings with ourselves in the middle of it, whereas the visual field is just the snapshot we happen to be looking at right now, given our location, body/head orientation, and eye position.)
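Here is a rough sketch of the scene versus visual field distinction, assuming a flat 2D scene with the observer at the origin. The object positions, the 90 degree field of view, and the 45 degree turning step are all made-up values, and `visual_field` and `search` are hypothetical helper names. The visual field is just the subset of the scene that falls inside the current gaze direction; searching means turning until the target shows up, and giving up (going foraging) if it isn't in the scene at all.

```python
import math

# Hypothetical scene map: object name -> (x, y), observer at the origin.
SCENE = {
    "button": (1.0, 0.2),
    "door knob": (-2.0, 0.5),
    "cage": (0.0, -1.5),
}


def visual_field(scene, gaze_deg, fov_deg=90.0):
    """Return the objects whose bearing lies within the field of view."""
    visible = []
    for name, (x, y) in scene.items():
        bearing = math.degrees(math.atan2(y, x))
        # Smallest signed angle between the object's bearing and the gaze.
        offset = (bearing - gaze_deg + 180.0) % 360.0 - 180.0
        if abs(offset) <= fov_deg / 2.0:
            visible.append(name)
    return visible


def search(scene, target, gaze_deg=0.0, step_deg=45.0):
    """Check the current view first; if the target is absent, turn and look again."""
    for i in range(int(360.0 / step_deg)):
        gaze = (gaze_deg + i * step_deg) % 360.0
        if target in visual_field(scene, gaze):
            return gaze  # gaze direction at which the target came into view
    return None  # not anywhere in the scene map


print(visual_field(SCENE, 0.0))    # only the button is in view when looking along +x
print(search(SCENE, "door knob"))  # turns until the door knob enters the view
print(search(SCENE, "banana"))     # None: not in the scene, so go look elsewhere
```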
In addition to top-down, goal-driven attention, there is also "exploratory attention". That's like when you scan an image with no particular goal. Maybe you go to the museum and you see a piece of art and you say "hm that's interesting" without knowing who created it, so you don't know the idiosyncrasies to look for. You just kinda scan it, and maybe you notice the brush strokes (especially if you're an artist) or maybe you don't. If someone asks you half an hour later to describe it, you can do pretty well. But if you wait two days, a lot of the detail is lost - long term memory only stores the essentials.