Believe it or not, we actually know very little about probability.
Most of what we know is based on "frequentist" assumptions, where everything is objective: the empirical frequencies have to converge to the objective probabilities, otherwise the assumptions break down.
For instance, the probability "distribution" never varies; it's cast in stone. This is why the "all possible paths" construction works, and why the Fisher information metric can tell us exactly how much information a new observation conveys about the parameters.
Information theory is this way too: the probability distribution has to be a known function for any of it to work.
But quite obviously, brains don't work this way; the probabilities are almost never known in advance. Brains use an entirely different method called "Bayesian", which is inherently subjective: you're allowed to make guesses about the probabilities, and then adjust those guesses based on observations.
So imagine, if you will, a Bayesian version of information theory. Let's say you have a bit transmitted on a wire. Information theory says the bit can be either a 0 or a 1, so you can calculate the uncertainty as the binary entropy -p log p - (1-p) log(1-p), which means the bit tells you nothing if it's always a 0 or always a 1. The only time you get information is if the bit could plausibly be either a 0 or a 1. When the two are equally likely, the arrival of the bit (the "outcome") is exactly like a fair coin flip, and information theory says you get 1 bit of information, because you now "know" the outcome.
The coin flip is called a "Bernoulli trial", and depending on the probability of success this is how much information you get:
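Here's a quick Python sketch of that curve, using the standard Shannon formula for a Bernoulli trial (the function name and the sample probabilities are just illustrative):

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli trial with success probability p."""
    if p == 0.0 or p == 1.0:
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Information per flip for a few success probabilities
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  ->  {binary_entropy(p):.3f} bits")
# The curve is 0 at p = 0 or p = 1 and peaks at exactly 1 bit when p = 0.5.
```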
For this to work, the relationship between the probability and the information must be known in advance; in this case a single flip is a Bernoulli distribution, and repeated flips follow a binomial distribution, so you can calculate the probabilities with "n choose k" and so on.
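That bookkeeping is a one-liner in code; a small sketch, using a fair-coin example purely for illustration:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent flips, each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g. the chance of exactly 7 heads in 10 fair flips
print(binomial_pmf(7, 10, 0.5))   # ~0.117
```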
But let's say we didn't know the distribution in advance - which is probably the case in most of nonlinear thermodynamics (e.g. quantum behavior in free space and time). What do your sensibilities tell you about how much information you'd get from the sudden appearance of a bit?
In a way, this situation can be reformulated in terms of an unknown probability distribution. Maybe instead of a Bernoulli/binomial distribution you're looking at a Poisson process, where bits arrive randomly at a rate r and the waiting times between arrivals are exponentially distributed. If you know in advance that you're dealing with that distribution you can calculate the information, but what if you don't know that?
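For comparison, if you do know the Poisson model in advance the calculation is mechanical; here's a sketch (the rate and window values are made up purely for illustration):

```python
import math

def poisson_surprisal(k, rate, window):
    """Surprisal (in bits) of seeing exactly k arrivals in a time window,
    assuming arrivals follow a Poisson process with the given rate."""
    lam = rate * window                                  # expected number of arrivals
    p_k = math.exp(-lam) * lam**k / math.factorial(k)    # Poisson probability of k arrivals
    return -math.log2(p_k)

# If bits arrive at r = 2 per second and we watch for one second,
# how surprising is the arrival of exactly one bit?
print(poisson_surprisal(1, rate=2.0, window=1.0))   # ~1.89 bits
```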
Clearly, the bit tells you "something". You just don't know what it is, because you have no context. The Bayesian method lets you define context. The more bits you see, the more context you have. This is formalized by the "prior" and the "posterior" in Bayesian statistics.
The Central Limit Theorem says that eventually, you will have "enough" context.
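Here's a minimal sketch of that prior-to-posterior updating for the coin-flip case, assuming a Beta prior on the unknown bias (the standard conjugate choice; nothing in the argument depends on that particular prior, and the batches of bits are made up):

```python
def update(alpha, beta, flips):
    """Return the Beta posterior parameters after observing a sequence of 0/1 flips."""
    for bit in flips:
        if bit == 1:
            alpha += 1
        else:
            beta += 1
    return alpha, beta

alpha, beta = 1.0, 1.0          # Beta(1, 1) is a uniform prior: no context yet
for batch in ([1], [1, 0, 1], [1, 1, 0, 1, 1, 0, 1, 1]):
    alpha, beta = update(alpha, beta, batch)
    mean = alpha / (alpha + beta)
    print(f"after {int(alpha + beta - 2)} bits, estimated P(bit = 1) is about {mean:.2f}")
# Each new bit sharpens the posterior a little: more bits, more context.
```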
To figure out what the bits "mean", you have to relate one bit to all the other bits. So this becomes a topological problem in set theory. The only requirement is that the bits be "separable", in other words distinguishable. The "meaning" relates to how many ways you can partition N bits in k dimensions, where k is a "guess". So what your brain does is it tries to optimize k: it tries to find the "maximum likelihood" for k, where k grows with N. This is fundamentally a problem in algebraic topology.
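One way to make the partition-counting concrete, assuming "partition N bits in k dimensions" means set partitions into k non-empty groups (those counts are the Stirling numbers of the second kind):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n distinguishable items into k non-empty subsets."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # The nth item either joins one of the k existing subsets
    # or starts a new subset of its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

n = 8
for k in range(1, n + 1):
    print(f"k = {k}: {stirling2(n, k)} partitions")
# For n = 8 the count peaks at k = 4 -- a crude stand-in for
# "picking the k that best organizes the bits seen so far".
```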
This is why "topological" quantum computing is so important. If your bit happens to be a qubit, it can take on any value between 0 and 1 with a continuous uniform distribution. A continuous uniform distribution is a topological "interval" in one dimension. You can no longer think in terms of discrete probabilities, because the number of points (possible outcomes) in the interval is uncountably infinite. The cardinality of the interval is the same as the cardinality of R (Georg Cantor proved this, along with many other similar results about cardinality).
So information theory won't work in this context, because the probability of any given point outcome is ZERO. You have to go to the von Neumann formulation, which defines what is essentially a "margin of confidence" - it is what the Bayesians call a "credible" interval. Topologically this is equivalent to partitioning the interval. For example you can say "I am 95% sure that the outcome will be between 0.1 and 0.9", or some similar statement of confidence. This is equivalent to partitioning the interval into three subsets, where the bulk of the interval is in the middle subset. Therefore information no longer has a defined value; it only has a confidence level - because the cardinality of each subset is the same as the cardinality of the original (total) set.
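Here's a sketch of what such a confidence statement looks like in practice, assuming a Beta posterior over the interval like the one built up in the updating sketch above (the 95% level and the posterior parameters are just illustrative):

```python
import random

def credible_interval(alpha, beta, mass=0.95, samples=100_000):
    """Approximate an equal-tailed credible interval for a Beta(alpha, beta)
    posterior over [0, 1] by Monte Carlo sampling."""
    draws = sorted(random.betavariate(alpha, beta) for _ in range(samples))
    lo = draws[int((1 - mass) / 2 * samples)]
    hi = draws[int((1 + mass) / 2 * samples) - 1]
    return lo, hi

# The posterior reached in the updating sketch above is roughly Beta(10, 4)
lo, hi = credible_interval(10, 4)
print(f"95% credible interval for the value: ({lo:.2f}, {hi:.2f})")
# The statement "95% sure the value lies in (lo, hi)" partitions [0, 1]
# into three pieces: below lo, inside the interval, and above hi.
```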
The astute observer will understand how this relates to the OP.