why information theory is wrong

scruffy

Information theory is wrong.

Its assumptions are erroneous.

Why?

Because it confuses information with knowledge.

Information is not knowledge. They are two very different things.

Sending a bit through a noisy channel is worthless, unless both parties know in advance what the bit means.

According to information theory, if 1000 bits are transmitted and the value of each of these bits is known to both the sender and receiver ahead of time with certainty, no information has been transmitted.

A claim which was made by Shannon and is regurgitated here:


However this is obviously false! Information HAS been transmitted. That much is self-evident. What they're really trying to say is that no KNOWLEDGE has been added to the receiver because of the transmission.
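For reference, here is the calculation that claim is based on; a minimal Python sketch (the two distributions are just illustrative, not anything from the linked article):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: sum of -p * log2(p) over outcomes with p > 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

known_bit = [1.0, 0.0]     # a bit whose value both sides already know with certainty
unknown_bit = [0.5, 0.5]   # a bit the receiver cannot predict

print(1000 * entropy_bits(known_bit))    # 0.0    -> 1000 fully predictable bits count as zero
print(1000 * entropy_bits(unknown_bit))  # 1000.0 -> 1000 unpredictable bits count in full
```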

If someone (say, an editor) knows the contents of a newspaper ahead of publishing time, is any information transmitted at publication? Of course it is!

If a grifter knows the outcome of a horse race before it happens, is any information transmitted when the result is posted? Of course it is!

Information theory is wrong. It's incomplete. And it has an erroneous description of what constitutes information.
 
Another example of how WRONG the fundamental assumption of information theory is:

If you have two test technicians as sender and receiver, and they send a test packet back and forth (which they both know the structure and contents of), is information being transmitted?

Hell yes it is!

Information theory is wrong. Its fundamental assumption is incorrect, BECAUSE it confuses information with knowledge.
 
As far as I can see, the wiki article doesn't get into "right" or "wrong". The theory is offered as either being useful or not useful in understanding how we process information.
 
Useful?

In the same sense that Newtonian physics is useful in a relativistic world.

Information theory specifically doesn't try to address how "we" process information. It equates information with knowledge. That is not only not useful, it's plain old wrong.

There is an encoding step preceding the transmission, during which knowledge is encoded into structured information.

Then there is the transmission itself.

Finally, there is a decoding step after the transmission, during which information is converted back into knowledge.

If you want to talk about uncertainty and probabilities, these three steps are not necessarily independent. An encoding error can be propagated and it can also be corrected, either randomly or by design.
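A toy sketch of those three steps (the repetition code and the single flipped bit are illustrative choices, just to show an error being corrected by design):

```python
def encode(bits, repeat=3):
    """Encoding step: protect each bit by repeating it (a trivial error-correcting code)."""
    return [b for bit in bits for b in [bit] * repeat]

def channel(bits, flip_at):
    """Transmission step: a noisy channel that flips one position."""
    out = list(bits)
    out[flip_at] ^= 1
    return out

def decode(bits, repeat=3):
    """Decoding step: majority vote over each group of repeats."""
    groups = [bits[i:i + repeat] for i in range(0, len(bits), repeat)]
    return [1 if sum(g) > repeat // 2 else 0 for g in groups]

message = [1, 0, 1, 1]
received = channel(encode(message), flip_at=4)   # error injected during transmission
print(decode(received) == message)               # True: the design corrects the error
```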
 
Mathematically -

Information theory is based on the probability of a message within the space of all possible messages.

Only, they DON'T consider the space of all possible messages. They only consider a small subset of it.

Proof: their definition of information is log S^n, where S is the number of symbols, n is the length of the message (so S^n is the number of possible messages), and the log function gives you an order of magnitude kind of like the Richter scale.
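Concretely, that measure works out to n * log2(S); a minimal sketch with example numbers:

```python
import math

def hartley_bits(num_symbols, msg_length):
    """Hartley measure: log2 of the number of possible messages, S^n."""
    return msg_length * math.log2(num_symbols)   # log2(S**n) == n * log2(S)

print(hartley_bits(2, 1000))   # 1000.0 bits for a 1000-symbol binary message
print(hartley_bits(26, 10))    # ~47.0 bits for 10 letters from a 26-symbol alphabet
```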

But this definition does not take into account the many ways the symbols can be re-used and re-defined by "modulation".

The method "would" work if they took into account "all possible paths" in the sense of Feynman, but they don't do that.

For instance information at the quantum level can be overloaded by a factor of sqrt(2), an observation which immediately contradicts the information theoretic definition.

And, any stream of symbols can be further modulated by treating it as a carrier. Which may or may not introduce additional symbols.

The bottom line is Shannon's communication scenario depends on HIDDEN SHARED KNOWLEDGE between the sender and receiver. The hidden shared knowledge is in the form of the probability distribution, and the form ("meaning") of the symbols.

For example - let's say, we have a voltage level that represents "1". Call it 5 volts, like in digital logic chips. And, because Shannon is a discrete theory, let's say we have a clock that defines the bit stream.

In this situation there are two ways to represent "0". One way is to assign a voltage level to it. The other way is to have it simply be "missing" (so "nothing there" indicates a 0). If the chosen voltage level in the first option is 0, these options become perceptually identical. But they're not necessarily identical in terms of the amount of information being transmitted. It depends on the type of modulation the two parties have AGREED to: if they're using time division multiplexing, "missing" means something different than it does in other modulation types.
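A toy version of that agreement (the threshold, clock slots, and the two conventions here are illustrative assumptions): the same voltage samples decode differently depending on which convention both parties agreed to.

```python
# One voltage sample per clock slot; 5 V nominally means "1".
samples = [5.0, 0.0, 5.0, 5.0, 0.0]

def decode_two_level(samples, threshold=2.5):
    """Convention A: 0 is an explicit voltage level; every slot carries a bit."""
    return [1 if v > threshold else 0 for v in samples]

def decode_presence(samples, threshold=2.5):
    """Convention B: a slot is either 'a 1 is present' or 'nothing there'."""
    return [i for i, v in enumerate(samples) if v > threshold]   # positions of pulses only

print(decode_two_level(samples))   # [1, 0, 1, 1, 0] -> five bits
print(decode_presence(samples))    # [0, 2, 3]       -> pulse positions; meaning depends on the agreement
```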

You can easily see how information theory fails when you consider the situation of a sender shoving bits into the pipe as fast as the channel will allow, and the receiver reading them out at a tenth the speed.
 
> Useful?
>
> In the same sense that Newtonian physics is useful in a relativistic world.
Actually, Newtonian physics is eminently useful in this age and relativistic physics is a refinement model which is significant only when motion occurs at relativistic speeds. University classes still teach Newtonian physics and the students end up using it most of the time.

Relativistic physics itself is a special case model which is only useful in its place. For example, at the very small scale (the quantum level) concepts like locality and time cease to be useful and other approaches are needed.
> Information theory specifically doesn't try to address how "we" process information. It equates information with knowledge. That is not only not useful, it's plain old wrong.
Not sure where u got that, the link you shared (tnx btw) had the word "knowledge" appear only twice and neither time was it defined as being equal to "information".
> There is an encoding step preceding the transmission, during which knowledge is encoded into structured information.
>
> Then there is the transmission itself.
>
> Finally, there is a decoding step after the transmission, during which information is converted back into knowledge.
>
> If you want to talk about uncertainty and probabilities, these three steps are not necessarily independent. An encoding error can be propagated and it can also be corrected, either randomly or by design.
Interesting, but it'll take a while for me to get comfortable w/ all you shared. Question, given that you show a profound grasp of what information theory has to offer, why have you studied it in the first place?
 
> Not sure where u got that, the link you shared (tnx btw) had the word "knowledge" appear only twice and neither time was it defined as being equal to "information".

Yes. They mix up the two concepts and use them interchangeably. Which is wrong.

> Interesting, but it'll take a while for me to get comfortable w/ all you shared. Question, given that you show a profound grasp of what information theory has to offer, why have you studied it in the first place?

Because I need an accurate definition of entropy.

Which is absolutely "not" any kind of thermodynamic relationship to the Gibbs distribution.

Are you familiar with Rényi entropy?

It has to do with SCALE. In a way it's a restatement of fractal geometry: the measured length depends on the length of the ruler.
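For anyone who hasn't met it, the Rényi entropy of order alpha is H_alpha(p) = log2(sum of p_i^alpha) / (1 - alpha), and it recovers the Shannon entropy as alpha approaches 1; a minimal sketch with an example distribution:

```python
import math

def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    return math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)

def shannon_entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.25, 0.125, 0.125]          # example distribution, chosen arbitrarily
for alpha in (0.5, 0.999, 2.0):
    print(alpha, renyi_entropy(p, alpha))
print("Shannon:", shannon_entropy(p))  # Renyi at alpha near 1 approaches this value (1.75 bits)
```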

The basic principle in play is what they call in topology "coverage". It's loosely the same concept as Feynman's "all possible paths". For example, as you decrease scale you may reach a point where things start bumping into each other, which changes the system behavior. You can try to describe that mechanistically, but it becomes computationally expensive, so sometimes it makes sense to use a probabilistic approach as the scale changes -

However the two approaches should yield the same answer!

In machine learning, and in neural networks in the brain, there is "signaling" in the form of unreliable transmission through unreliable channels, and this is the part information theory tries to address. But there is also "storage" which occurs in the form of associative memory, which is mainly correlations between content, and this is where "information theory" fails miserably, because the entropy calculation is way off.

If you really want to understand this you could look into "information geometry", which provides a much more robust and intuitively more accurate definition of entropy. The idea is, you can map the entire family of Gaussian distributions to the X-Y axes using just mean and variance, so as the distributions change based on learning you end up with "paths" in the parametrized probability space.
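A rough sketch of that picture (the trajectory and the use of the closed-form KL divergence between 1-D Gaussians are my own illustrative choices, not the Fisher-Rao metric that information geometry actually uses): each distribution is a point (mean, sigma), and learning traces a path through that plane.

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """KL divergence KL(N(mu1, s1^2) || N(mu2, s2^2)), closed form for 1-D Gaussians."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# A hypothetical learning trajectory: the belief drifts from N(0, 2^2) toward N(3, 0.5^2).
path = [(0.0, 2.0), (1.0, 1.5), (2.0, 1.0), (3.0, 0.5)]

# Crude "length" of the path: sum the divergences between consecutive points.
steps = [kl_gauss(m1, s1, m2, s2) for (m1, s1), (m2, s2) in zip(path, path[1:])]
print(steps)        # per-step divergences along the path in (mean, variance) space
print(sum(steps))
```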

Where it gets complicated is, these paths are part of a system dynamic, which looks like a bunch of coupled oscillators. (See "Kuramoto model"). So the "paths" ride on top of oscillatory behavior, which is meaningful because of phase coding, which is yet another modulation type unaccounted for by information theory.
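For reference, the Kuramoto model is dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i); a minimal Euler-integration sketch (the frequencies, coupling strength, and step size are arbitrary choices):

```python
import math, random

random.seed(0)
N, K, dt, steps = 10, 2.0, 0.01, 5000
omega = [random.gauss(0.0, 1.0) for _ in range(N)]          # natural frequencies
theta = [random.uniform(0, 2 * math.pi) for _ in range(N)]  # initial phases

def order_parameter(theta):
    """r in [0, 1]: how tightly the oscillator phases are clustered."""
    re = sum(math.cos(t) for t in theta) / len(theta)
    im = sum(math.sin(t) for t in theta) / len(theta)
    return math.hypot(re, im)

for _ in range(steps):
    coupling = [(K / N) * sum(math.sin(tj - ti) for tj in theta) for ti in theta]
    theta = [t + dt * (w + c) for t, w, c in zip(theta, omega, coupling)]

print(order_parameter(theta))   # close to 1 when the coupling K is strong enough to synchronize
```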

The problem with IT is the definition of entropy is incorrect, and that traces back to the fact that in their calculation of the total number of system states they MISS some. S^n is not correct. Already the quantum results prove it: you can get sqrt(2) * S^n with no effort at all, and 2*sqrt(2) * S^n with a little bit of effort.
 
Here's another way of looking at it.

In a discrete (finite) system, the entropy is supposed to encode the degree of departure from a pure state, in other words the degree of mixing.

This obviously applies to quantum physics, but it also applies to neural networks in the brain. There, one can formulate a gauge theory much like in physics, and describe the expectation of possible futures much like in physics.

In fact, there is a description of the "collapse" of possible futures around the point in time called "now", that parallels the description of quantum decoherence.

The short story is, it's the phases that give you the extra degrees of freedom, and to the extent that you can manipulate them and squeeze them and so on, you can reveal them and make use of them. Mathematically, it means you have to treat each symbol as a complex number, and so for example, you have the von Neumann entropy, which is -Tr(rho log rho) for a density matrix rho, and a trace-based description ignores the off-diagonal terms which are meaningful in any kind of non-commutative situation (of which there are many - anything where the order or sequence of things matters).
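To make the role of the off-diagonal terms concrete, here is a small numpy sketch of S(rho) = -Tr(rho log2 rho), computed from the eigenvalues, for a pure state with off-diagonal coherences and for the same matrix with them zeroed out (the example state is an arbitrary choice):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of the density matrix."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return max(0.0, float(-np.sum(evals * np.log2(evals))))   # clip the -0.0 artifact

# Pure state |+> = (|0> + |1>)/sqrt(2): the off-diagonal terms carry the phase information.
plus = np.array([[0.5, 0.5],
                 [0.5, 0.5]])
# Same diagonal with the off-diagonal terms discarded (fully dephased, a classical mixture).
dephased = np.diag(np.diag(plus))

print(von_neumann_entropy(plus))      # 0.0 bits: a pure state, no mixing
print(von_neumann_entropy(dephased))  # 1.0 bit: dropping the off-diagonals maximizes the mixing
```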

There are many interesting interpretations of the relationship between probability and complex numbers. Here is one that might be useful:

 
Here is yet another crucial concept: negative probability.

Negative probability is a very simple relaxation of Kolmogorov's first axiom: probabilities no longer have to be non-negative.

Negative probability is what gives rise to complex probability.

As follows:

We begin with the concept of a "half coin", which is basically the "square root" of an ordinary fair coin. If you flip the half coin twice in a row, you get the same distribution of outcomes as a single flip of a full fair coin.
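One standard way to make the half-coin explicit: take the probability generating function of a fair coin on {0, 1}, which is (1 + z)/2, and expand its square root; the series coefficients are the half-coin's quasi-probabilities, and convolving them with themselves recovers the fair coin. A minimal sketch (the truncation length is arbitrary):

```python
import numpy as np

# Coefficients of sqrt((1 + z)/2) = (1/sqrt(2)) * sum_k C(1/2, k) * z^k, truncated at N terms.
N = 12
coeffs = [1.0]
for k in range(1, N):
    coeffs.append(coeffs[-1] * (0.5 - (k - 1)) / k)   # binomial-series recurrence for C(1/2, k)
half_coin = np.array(coeffs) / np.sqrt(2)

print(half_coin[:6])        # signs alternate from the third term on: the negative "probabilities"
print(np.sum(half_coin))    # roughly 1: the quasi-probabilities still sum to (almost) 1

# Flipping the half-coin twice = convolving its distribution with itself.
full_coin = np.convolve(half_coin, half_coin)[:N]
print(np.round(full_coin, 12))   # [0.5, 0.5, 0, 0, ...]: an ordinary fair coin on {0, 1}
```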


Well, i is the sqrt(-1), and that's where the relationship begins.

Negative probabilities are not new, they were discussed by the physicist Wigner in 1932 in the context of phase space quantization, and by other physicists like Dirac and Feynman.


Here's the important part:

There is a deep relationship between fractional probabilities and fractional derivatives. Consider the classical differentiation operator D, and then consider its "square root" D^(1/2), which when applied twice in a row yields the same result as D.
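The discrete analogue can be checked directly with the Grünwald-Letnikov weights (-1)^k * C(alpha, k), which are the series coefficients of (1 - z)^alpha: convolving the alpha = 1/2 weights with themselves gives exactly the ordinary first difference. A minimal sketch (the truncation length is arbitrary):

```python
import numpy as np

def gl_coeffs(alpha, n):
    """Grunwald-Letnikov weights (-1)^k * C(alpha, k): the series of (1 - z)^alpha."""
    c = [1.0]
    for k in range(1, n):
        c.append(c[-1] * (k - 1 - alpha) / k)   # recurrence for (-1)^k * C(alpha, k)
    return np.array(c)

N = 10
half = gl_coeffs(0.5, N)              # weights of the "half derivative" D^(1/2)
full = np.convolve(half, half)[:N]    # applying D^(1/2) twice

print(np.round(full, 12))             # [1, -1, 0, 0, ...]: the ordinary first difference D
print(gl_coeffs(1.0, N))              # for comparison: D itself has weights [1, -1, 0, ...]
```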


This link has an excellent summary:


But it's even more intriguing than that.

Half-coins have an infinite number of sides, every other one of which has a negative probability. The density function ends up looking like a comb, which is "nowhere dense" in the same sense as a Cantor dust.
 
