The Beauty of Mathematics Series 4: How to Measure Information?
Information is a very abstract concept. We often say that there is a lot of information or only a little, but it is hard to say exactly how much. For example, how much information does a 500,000-character Chinese book contain? It was not until 1948 that Shannon proposed the concept of "information entropy", which solved the problem of measuring information quantitatively.

The amount of information in a message is directly related to its uncertainty. If we want to understand something that is highly uncertain, or something we know nothing about, we need a great deal of information. Conversely, if we already know a lot about something, little additional information is needed to make it clear. From this point of view, the amount of information equals the amount of uncertainty.

How, then, do we quantify the amount of information? Consider an example. The World Cup is about to be held, and everyone wants to know which team will be the champion. Suppose I missed the tournament and ask a spectator who already knows the result, "Which team is the champion?" He refuses to tell me directly and makes me guess, and for each guess I must pay him one yuan before he will tell me whether I am right. How much do I have to pay to find out who the champion is? I can number the teams from 1 to 32 and ask, "Is the champion among teams 1-16?" If he says yes, I ask, "Is the champion among teams 1-8?" If he says no, I know the champion is among teams 9-16. Continuing this way, at most five guesses are needed to identify the champion team. So the message of who won the World Cup is worth only five yuan.

Of course, Shannon measured information not in money but in "bits". A bit is one binary digit, and a byte in a computer is eight bits. In the example above, the message carries five bits of information. If sixty-four teams entered the finals one day, the message "who won the World Cup" would carry six bits, because we would have to guess one more time. Readers may have noticed that the number of bits is the base-2 logarithm of the number of possible outcomes (log2 32 = 5, log2 64 = 6).

Some readers may also have noticed that we can often identify the champion without guessing five times, because teams like Brazil, Germany and Italy are much more likely to win than teams like Japan, the United States and South Korea. So for the first guess we need not split the 32 teams into two groups of 16; instead we can put the few most likely teams in one group and all the others in another, and ask whether the champion is among the favorites. Repeating this process, grouping the remaining candidates according to their probability of winning, we may find the champion in three or four guesses. Therefore, when the teams' probabilities of winning are not equal, the information in "who won the World Cup" is less than five bits. Shannon showed that its exact value is

H = -(p1·log p1 + p2·log p2 + … + p32·log p32),

where p1, p2, …, p32 are the probabilities of each of the 32 teams winning the championship. Shannon called this quantity "entropy", usually denoted by the symbol H and measured in bits. Interested readers can verify that when all 32 teams are equally likely to win, the corresponding information entropy is exactly five bits.
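To make the calculation concrete, here is a minimal Python sketch (not part of the original article) that evaluates the entropy formula above. The uniform case reproduces the five bits; the skewed probabilities are made up purely for illustration of the unequal-probability case.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 32 teams, all equally likely: the entropy is exactly log2(32) = 5 bits.
uniform = [1 / 32] * 32
print(entropy(uniform))          # 5.0

# Made-up, skewed probabilities: a few favorites carry most of the mass,
# so on average fewer than 5 bits are needed to identify the champion.
skewed = [0.2, 0.2, 0.15, 0.15] + [0.3 / 28] * 28
print(entropy(skewed))           # about 3.7 bits, less than 5
```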
Readers with a mathematical background can also prove that the value of the formula above can never exceed five. More generally, for any random variable X (such as the identity of the champion team), its entropy is defined as H(X) = -Σ P(x)·log P(x), where the sum runs over all possible values x of X. The greater the uncertainty of the variable, the greater its entropy, and the more information is needed to determine it.

With the concept of "entropy" we can answer the question raised at the beginning of this article: how much information, on average, does a 500,000-character Chinese book contain? There are about 7,000 commonly used Chinese characters (the national first- and second-level standards). If every character were equally likely, we would need about 13 bits (that is, a 13-bit binary number) to represent one character. But Chinese characters are used very unevenly: in fact, the most frequent 10% of characters account for more than 95% of running text. Therefore, even if we consider only the independent probability of each character and ignore contextual dependence, the entropy per character is only about 8-9 bits; if contextual dependence is taken into account as well, it drops to only about 5 bits. A 500,000-character Chinese book therefore contains about 2.5 million bits of information. Compressed with a good algorithm, the whole book can be stored in a file of roughly 320 KB. If instead we store it directly in the two-byte national standard encoding, it takes about 1 MB, roughly three times the size of the compressed file. The gap between these two quantities is called "redundancy" in information theory. It should be pointed out that the 2.5 million bits here is an average; the amount of information in books of the same length can differ greatly. If a book is highly repetitive, it carries less information and has greater redundancy. Different languages vary widely in redundancy, and Chinese has relatively low redundancy among them, which agrees with the common impression that Chinese is a concise language.

In the next episode, we will introduce the application of information entropy in information processing, along with two related concepts: mutual information and relative entropy. Readers interested in the information entropy of Chinese can read the article "Language Information Entropy and Complexity of Language Model", written by Professor Wang and me, published in Electronic Newsletter.
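As a rough check of the arithmetic above, here is a small Python sketch (again, not from the original article). The per-character entropy of 5 bits is simply taken from the text as given, and the byte conversions use 1024-based units, so the results are approximate.

```python
import math

# About 7000 commonly used characters, assumed equally likely:
print(math.log2(7000))               # ~12.77, i.e. about 13 bits per character

# Per-character entropy of ~5 bits (the article's figure, taken as given here)
bits_per_char = 5
book_chars = 500_000
total_bits = bits_per_char * book_chars
print(total_bits)                    # 2,500,000 bits of information
print(total_bits / 8 / 1024)         # ~305 KB, close to the cited 320 KB

# Storing the same book in a fixed two-byte encoding instead:
raw_bytes = 2 * book_chars
print(raw_bytes / (1024 * 1024))     # ~0.95 MB, roughly 3x the compressed size
```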