At this point some of the support mechanisms start to make sense, yes?
Think information theory: S = -Σ p log p (Shannon entropy, summed over all outcomes).
Before you can estimate information, you need to know the probabilities. Which means you need the distribution.
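To make the entropy formula concrete, here's a minimal sketch: a function that computes Shannon entropy in bits from a list of probabilities. The example distributions are illustrative, not from the text.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum_i p_i * log2(p_i), in bits.

    Terms with p_i = 0 are skipped (lim p->0 of p log p is 0).
    """
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A fair coin carries 1 bit; four equally likely outcomes carry 2 bits.
print(entropy([0.5, 0.5]))                # → 1.0
print(entropy([0.25, 0.25, 0.25, 0.25]))  # → 2.0
```

Note that you can't evaluate this without the p's, which is exactly the point: the distribution has to come first.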
To get the probability, you have to be able to separate the event from the background, which usually involves labeling it (assigning it a name) and estimating the distribution from sparse data using Bayesian inference.
The first step is object individuation and memorization, which is triggered by the late component of the P300 event-related potential. That component comes out of the dorsal attention stream and signals "I don't have this information."
This triggers a memory process that's very different from scene mapping. Here you're not putting together episodes; you're doing the exact opposite: removing time-related information so you can store the invariances. Scene mapping results in memory traces in the dorsolateral prefrontal cortex, whereas object individuation results in memory traces in the inferior temporal cortex.
To estimate the distribution, you can do one of two things:
A. optimize thermodynamically using an energy function
B. draw from a library of distributions, weighting each one to approximate your data
Method A is the Hopfield approach (Ising model, statistical mechanics); method B is a mixture network (information geometry).
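Neither method is spelled out here, so the following is only a toy sketch of the two flavours: for A, a Boltzmann distribution over a Hopfield/Ising-style energy on binary states; for B, a weighted mixture of Gaussians standing in for the "library" of distributions. The coupling matrix, weights, and component parameters are invented for illustration.

```python
import numpy as np

# --- Method A: energy-based (Hopfield / Ising flavour) ---
# Probabilities come from a Boltzmann distribution over an
# energy function E(s) = -1/2 * s^T W s on spin states s in {-1, +1}^n.
def boltzmann_probs(W, states, T=1.0):
    energies = np.array([-0.5 * s @ W @ s for s in states])
    weights = np.exp(-energies / T)
    return weights / weights.sum()

# --- Method B: mixture of library distributions (here: 1-D Gaussians) ---
# p(x) = sum_k w_k * N(x; mu_k, sigma_k), with the weights w_k tuned
# to approximate the observed data.
def mixture_pdf(x, weights, means, sigmas):
    return sum(w * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
               for w, m, s in zip(weights, means, sigmas))

states = [np.array(s) for s in [(1, 1), (1, -1), (-1, 1), (-1, -1)]]
W = np.array([[0.0, 1.0], [1.0, 0.0]])  # a single positive coupling
p = boltzmann_probs(W, states)
print(p)  # the aligned states (1,1) and (-1,-1) come out most probable

print(mixture_pdf(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]))
```

In A the distribution is implicit in the energy landscape; in B it's an explicit sum over known components, which is what makes the information-geometry view natural.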
The resulting distribution then becomes your initial guess for future Bayesian inferences.
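A minimal sketch of that prior-reuse loop, using a conjugate Beta-Bernoulli update (my choice for illustration, not anything specified above): the distribution estimated earlier serves as the prior, new evidence updates it, and the posterior becomes the prior for the next round.

```python
from fractions import Fraction

# Beta(a, b) prior over how often an event occurs; observations of
# hits and misses update it in closed form.
def update(a, b, hits, misses):
    return a + hits, b + misses

def mean(a, b):
    return Fraction(a, a + b)

a, b = 2, 8                              # prior learned earlier: event rate ~20%
a, b = update(a, b, hits=7, misses=3)    # fresh evidence: 7 hits in 10 trials
print(mean(a, b))                        # → 9/20, the new "initial guess"
```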
Attached to each object is a value that indicates what it's good for (what its capabilities are, so to speak). This is yet a third kind of memory; it happens in the medial prefrontal cortex, and the contextual input for scene mapping arrives via the nucleus accumbens. The three kinds of memory come together in the so-called VLAM, the vision-language-action model: an LLM with vision attached to it, used in robots with motor systems.
VLAMs are extensions of the VLMs that power AI systems like Canva: you give it a command in English and it generates an image for you. CoPilot is another example.
The "actions" are intended for warehouse workers, but you can just as easily have a kung fu master. In my case they'll be simple eye movements. The input will be a visual scene with moving objects and a command saying which object to target. The output will be a sequence of eye movements designed to maximize the information gained from the target.
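The last step, choosing eye movements to maximize information gained about the target, can be sketched as a greedy active-sensing policy: maintain a belief over candidate target locations and fixate wherever the expected posterior entropy is lowest. The detector parameters (`hit`, `false_alarm`) and the example belief are invented for illustration; nothing here is a committed design.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def best_fixation(belief, hit=0.9, false_alarm=0.1):
    """Pick the fixation index whose expected posterior entropy is lowest,
    i.e. the one with the largest expected information gain.

    Observation model (assumed): fixating location f reports "detected"
    with prob `hit` if the target is at f, else with prob `false_alarm`.
    """
    def expected_entropy(fix):
        h = 0.0
        for detected in (True, False):
            like = [hit if i == fix else false_alarm for i in range(len(belief))]
            if not detected:
                like = [1 - l for l in like]
            joint = [l * b for l, b in zip(like, belief)]
            p_obs = sum(joint)
            if p_obs > 0:
                h += p_obs * entropy([j / p_obs for j in joint])
        return h
    return min(range(len(belief)), key=expected_entropy)

belief = [0.1, 0.6, 0.1, 0.2]   # prior over four candidate target locations
print(best_fixation(belief))    # → 1: look at the most probable location first
```

Repeating this after each Bayesian belief update yields a fixation sequence, which is the kind of output described above.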