For thousands of years, no one gave a good answer to the two questions above. Then, in 1948, Shannon introduced the concept of "information entropy" in his famous paper "A Mathematical Theory of Communication", which solved the problem of measuring information and quantified the role that information plays.
Without information, no amount of formulas or number games can eliminate uncertainty. This simple conclusion is very important: almost every application in natural language processing and in information and signal processing is a process of eliminating uncertainty.
What is information entropy?
The amount of information carried by a message is directly related to its uncertainty.
For example, to understand something that is highly uncertain, we need a lot of information. Conversely, if we already know a lot about something, not much information is needed to make it clear.
From this perspective, the amount of information equals the amount of uncertainty.
Probability and information entropy:
Example 1: Suppose someone tells you today that there will be no smog in Beijing in the winter of 2019. We cannot be sure of this, because over the past five years there have been very few smog-free winter days in Beijing. To settle the matter you would need to consult meteorological data, expert forecasts, and so on; this is a process of eliminating uncertainty with external information. The more uncertain the matter, the more external information is needed, and the larger the information entropy.
Example 2: Conversely, suppose someone tells you today that Beijing will continue to have smoggy weather in the winter of 2019. Based on past experience, this event has very little uncertainty, so almost no external information is needed; the information entropy in this case is very small.
If whether there is smog in Beijing in the winter of 2019 is treated as a random variable, the examples above show that the measure of information entropy depends on the probability distribution.
Definition of information entropy:
$$H(X) = -\sum_{x \in X} P(x)\log P(x)$$ (Formula 1)
Where $P(x)$ is the probability that $X$ takes the value $x$; with the logarithm taken to base 2, the unit is the bit.
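As a quick numerical illustration of Formula 1, here is a minimal Python sketch (the function name `entropy` and the example distributions are my own choices) that computes the entropy of a discrete distribution in bits:

```python
import math

def entropy(probs, base=2):
    # H(X) = -sum P(x) * log P(x); in bits when base = 2.
    # Zero-probability terms are skipped, following the convention 0 * log 0 = 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit, maximum uncertainty for two outcomes
print(entropy([0.99, 0.01]))  # heavily biased coin: about 0.08 bits
```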
Why the logarithm?
Suppose $X$ and $Y$ are two independent random variables; the probability that they occur together is the joint probability:
$$P(X, Y) = P(X)\,P(Y)$$ (Formula 2)
Now suppose we want the measure of information to be additive: the information obtained from learning both events should be the sum of the information of each, i.e. information A plus information B. In symbols:
$$H(X, Y) = H(X) + H(Y)$$ (Formula 3)
Comparing the product in Formula 2 with the sum we want in Formula 3, it is natural to take logarithms of Formula 2:
$$\log P(X, Y) = \log P(X) + \log P(Y)$$ (Formula 4)
Defining entropy as the expected value of $-\log P$ (Formula 1), taking the expectation of both sides of Formula 4 and negating turns Formula 4 into exactly Formula 3.
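To make the additivity argument concrete, here is a small sketch (the distributions are arbitrary examples of my own) checking that when Formula 2 holds, Formula 3 holds as well:

```python
import math
from itertools import product

def entropy(probs):
    # H = -sum p * log2(p), skipping zero-probability terms
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = [0.5, 0.5]   # a fair coin X
p_y = [0.9, 0.1]   # a biased coin Y, independent of X

# Independence (Formula 2): P(x, y) = P(x) * P(y)
p_xy = [px * py for px, py in product(p_x, p_y)]

# Additivity (Formula 3): H(X, Y) = H(X) + H(Y)
print(entropy(p_xy))                # ~1.469 bits
print(entropy(p_x) + entropy(p_y))  # ~1.469 bits
```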
Geometric understanding of information entropy:
As can be seen from Figure 1:
The closer the probability is to 0 (the event is very unlikely to happen) or to 1 (the event is very likely to happen), the smaller the value of the information entropy.
Example: suppose the probability of rain today is $P$, so the probability of no rain is $1 - P$ and the entropy is $H(P) = -P\log P - (1 - P)\log(1 - P)$.
The first case: when $P = 0$, rain today is impossible and the event is completely certain. The information entropy is $H = -(0 \cdot \log 0 + 1 \cdot \log 1) = 0$ (using the convention $0 \cdot \log 0 = 0$).
The second case: when $P = 1$, rain today is certain, again a deterministic event. The information entropy is likewise $H = -(1 \cdot \log 1 + 0 \cdot \log 0) = 0$.
The third case: when $0 < P < 1$, the event is genuinely uncertain; the information entropy is greater than 0 and reaches its maximum of 1 bit at $P = 0.5$.
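The three cases can be checked numerically. Here is a minimal sketch of the binary entropy curve plotted in Figure 1 (the helper name `binary_entropy` is mine):

```python
import math

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) taken as 0
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.5, 1.0):
    print(p, binary_entropy(p))
# 0.0 -> 0.0 bits (no rain for certain)
# 0.5 -> 1.0 bit  (maximum uncertainty)
# 1.0 -> 0.0 bits (rain for certain)
```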
What is conditional entropy?
Definition:
Suppose $X$ and $Y$ are two random variables, and $X$ is the one we want to know about. If we know its distribution $P(X)$, then we know its entropy:
$$H(X) = -\sum_{x \in X} P(x)\log P(x)$$
Now suppose we also know something about $Y$: the joint probability $P(X, Y)$ of $X$ and $Y$ occurring together, and the conditional distribution $P(X \mid Y)$ of $X$ under each value of $Y$. The conditional entropy is defined as:
$$H(X \mid Y) = -\sum_{x \in X,\, y \in Y} P(x, y)\log P(x \mid y)$$ (Formula 5)
Case 1: when $X$ and $Y$ are independent, see Figure 2:
Case 2: when $X$ and $Y$ are dependent, see Figure 3:
As can be seen from Figure 2 and Figure 3:
$$H(X \mid Y) \le H(X)$$ (Formula 6)
In other words, once the additional information $Y$ is available, the uncertainty about $X$ is reduced (or at least not increased)!
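As a rough illustration, the sketch below evaluates Formula 5 on a small made-up joint distribution and confirms the inequality in Formula 6 (the joint table is hypothetical, chosen so that $X$ and $Y$ are dependent):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution P(X, Y): rows index x, columns index y.
p_xy = [[0.4, 0.1],
        [0.1, 0.4]]

p_x = [sum(row) for row in p_xy]          # marginal P(x)
p_y = [sum(col) for col in zip(*p_xy)]    # marginal P(y)

# Formula 5: H(X|Y) = -sum P(x, y) * log P(x|y), with P(x|y) = P(x, y) / P(y)
h_x_given_y = -sum(p_xy[i][j] * math.log2(p_xy[i][j] / p_y[j])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)

print(entropy(p_x))   # H(X)   = 1.0 bit
print(h_x_given_y)    # H(X|Y) ~ 0.72 bits, so H(X|Y) <= H(X)  (Formula 6)
```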
What is mutual information?
Conditional entropy tells us that when the information we obtain is related to the thing we are studying, it helps us eliminate uncertainty. Of course, "related" is too vague a word; to quantify this relationship, Shannon introduced the concept of "mutual information" in information theory.
Definition:
$$I(X; Y) = \sum_{x \in X,\, y \in Y} P(x, y)\log \frac{P(x, y)}{P(x)\,P(y)}$$ (Formula 7)
In fact, mutual information is just the difference between the entropy of $X$ and the conditional entropy of $X$ given $Y$:
$$I(X; Y) = H(X) - H(X \mid Y)$$ (Formula 8)
As Formula 8 shows, mutual information is the amount of uncertainty about one variable that is eliminated once the other is known.
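A short sketch, reusing the hypothetical joint table from the conditional-entropy example, showing that Formula 7 and Formula 8 give the same number:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = [[0.4, 0.1],
        [0.1, 0.4]]
p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

# Formula 7: I(X;Y) = sum P(x, y) * log( P(x, y) / (P(x) * P(y)) )
i_direct = sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
               for i in range(2) for j in range(2) if p_xy[i][j] > 0)

# Formula 8: I(X;Y) = H(X) - H(X|Y)
h_x_given_y = -sum(p_xy[i][j] * math.log2(p_xy[i][j] / p_y[j])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)
i_diff = entropy(p_x) - h_x_given_y

print(i_direct, i_diff)   # both ~ 0.278 bits
```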
What is relative entropy?
Relative entropy, like mutual information, is used to measure correlation. Unlike mutual information, relative entropy measures the similarity of two positive-valued functions.
Definition:
$$KL\big(f(x) \,\big\|\, g(x)\big) = \sum_{x \in X} f(x)\log \frac{f(x)}{g(x)}$$ (Formula 9)
- For two identical functions, the relative entropy is 0.
- The greater the relative entropy, the greater the difference between the two functions.
- For probability distributions (whose values are all greater than 0), relative entropy measures the difference between the two distributions.
It should be pointed out that the relative entropy is asymmetric:
$$KL\big(f(x) \,\big\|\, g(x)\big) \ne KL\big(g(x) \,\big\|\, f(x)\big)$$ (Formula 10)
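A minimal sketch of Formula 9 (the two distributions are arbitrary examples of my own), which also exhibits the asymmetry stated in Formula 10:

```python
import math

def kl_divergence(f, g):
    # Formula 9: KL(f || g) = sum f(x) * log( f(x) / g(x) ), in bits
    return sum(fx * math.log2(fx / gx) for fx, gx in zip(f, g) if fx > 0)

f = [0.5, 0.5]
g = [0.9, 0.1]

print(kl_divergence(f, f))   # 0.0     -- identical distributions
print(kl_divergence(f, g))   # ~0.737  -- KL(f || g)
print(kl_divergence(g, f))   # ~0.531  -- KL(g || f): a different value (Formula 10)
```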
In order to eliminate the asymmetry, Jensen and Shannon proposed a new, symmetric way of computing it, averaging the two directions:
$$JS\big(f(x) \,\big\|\, g(x)\big) = \frac{1}{2}\Big[ KL\big(f(x) \,\big\|\, g(x)\big) + KL\big(g(x) \,\big\|\, f(x)\big) \Big]$$ (Formula 11)
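A small sketch of the symmetrized measure in Formula 11 (same example distributions as above), showing that it gives the same value in either direction:

```python
import math

def kl_divergence(f, g):
    return sum(fx * math.log2(fx / gx) for fx, gx in zip(f, g) if fx > 0)

def js_symmetric(f, g):
    # Formula 11: the average of the two KL directions
    return 0.5 * (kl_divergence(f, g) + kl_divergence(g, f))

f = [0.5, 0.5]
g = [0.9, 0.1]
print(js_symmetric(f, g), js_symmetric(g, f))   # identical: ~0.634 either way
```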