MPEG covers the compression of both video and audio; of course, only video compression is discussed here.
In fact, what compression really does is remove three kinds of redundancy from the source: spatial redundancy, temporal (dynamic) redundancy and structural (static) redundancy.
Adjacent pixels within the same frame of the source image have similar amplitudes: pixels next to each other in the same row are similar, and pixels at the same position in adjacent rows are similar. This is called the spatial redundancy of the image;
Pixels at the same position in two adjacent frames of the source image also have similar amplitudes, which reflects the temporal (dynamic) redundancy of the source image.
The number of bits used for each pixel of the source image defines its bit structure, and using more bits than necessary is wasteful; this reflects the static (structural) redundancy.
How does MPEG eliminate these redundancies? It mainly works from two directions:
1. Compression that exploits the statistical characteristics of the image signal.
That is:
Temporal redundancy is removed by motion compensation (MC).
Discrete Cosine Transform (DCT) and Run-length Coding (RLC) are used to eliminate spatial redundancy.
Variable length coding (VLC) is used to eliminate static redundancy.
I'll talk about the concrete implementation of these three algorithms later. For now, you just need to understand that they are not too complicated, at least not as daunting as their names sound.
2. Compression designed around the physiological characteristics of human vision.
The human eye's sensitivity differs for different frequency components and for different degrees of object motion, which is determined by its visual physiology. For example, the retina contains roughly 180 million rod cells, which are sensitive to brightness, and roughly 8 million cone cells, which are sensitive to color. Because rods vastly outnumber cones, the eye is more sensitive to brightness than to color. The image can therefore be coded in a way that matches these visual characteristics, so as to compress the image data. For example, the eye is more sensitive to low-frequency signals than to high-frequency signals, so high-frequency components can be represented with fewer bits; it is more sensitive to static objects than to moving ones, so the number of bits spent on moving objects can be reduced; it is more sensitive to the luminance signal than to the chrominance signals, so the bits representing chrominance can be reduced in both the horizontal and vertical directions; it is more sensitive to the central part of an image than to its edges, so fewer bits can be allocated to edge information; and it is more sensitive to horizontal and vertical detail than to diagonal detail, so the bits representing the high-frequency components of diagonal information can be reduced. In practice, luminance and chrominance can be processed separately precisely because the eye's sensitivity to them differs. (This passage is quoted from the 2003 TV Engineering textbook of the Beijing Broadcasting Institute.)
So we convert the three RGB color components into YUV (or YCrCb) components, emphasize the luminance information when encoding, and discard some of the chrominance information, for example by changing 4:4:4 to 4:2:2. This amounts to changing the bit structure of the video, and what is removed is the so-called static redundancy.
Removing structural redundancy in this way (RGB→YUV) already achieves a moderate amount of compression. Removing structural redundancy has no effect on image quality, so it can be called "lossless compression". However, the ratio achievable by lossless compression is not high and its capability is limited. To improve the compression ratio, the MPEG standard adopts "lossy compression", which does sacrifice some image quality, by removing the temporal and spatial redundancy described above. That is the cost, but it is a very worthwhile trade.
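To make this step concrete, here is a minimal NumPy sketch of the idea: convert RGB to YCbCr using the BT.601 coefficients and then average each horizontal pair of chroma samples to go from 4:4:4 to 4:2:2. Real encoders use studio-range levels and proper low-pass filters, so the function names and numbers below are illustrative assumptions, not the normative MPEG procedure.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an RGB image (H, W, 3) with values 0..255 to full-range YCbCr
    using the BT.601 coefficients. Only meant to show the principle."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0
    return y, cb, cr

def subsample_422(cb, cr):
    """4:4:4 -> 4:2:2 by averaging each horizontal pair of chroma samples.
    Real encoders use better low-pass filters; averaging is the simplest case."""
    cb422 = (cb[:, 0::2] + cb[:, 1::2]) / 2.0
    cr422 = (cr[:, 0::2] + cr[:, 1::2]) / 2.0
    return cb422, cr422

# Toy usage: a random 8x8 RGB block
rgb = np.random.randint(0, 256, size=(8, 8, 3)).astype(np.float64)
y, cb, cr = rgb_to_ycbcr(rgb)
cb422, cr422 = subsample_422(cb, cr)
print(y.shape, cb422.shape)   # (8, 8) (8, 4): chroma now carries half the samples
```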
Now let's go through the above algorithms in more detail ~ A little background in discrete mathematics / calculus will help, but it doesn't matter if you can't follow everything; you just need to understand the role each of these steps plays ~
First, motion-compensated prediction. This seems to be the part most familiar to the ccf members here; since so many people know it, I ought to describe it with extra care.
What is motion compensation? Motion compensation moves the corresponding macroblock of the previous image frame according to the motion vector that has been obtained. To compress the temporal redundancy of the video signal, MPEG adopts motion-compensated prediction.
Motion-compensated prediction assumes that the current image can be predicted locally by translating an earlier image. "Locally" means that the amplitude and direction of the displacement can differ from place to place in the picture. The result of motion estimation is used for motion compensation so as to make the prediction error as small as possible. Motion estimation comprises a set of techniques for extracting motion information from a video sequence, and the characteristics of these techniques, together with the image sequence being processed, determine the performance of the motion compensation.
The so-called prediction derives, from the previous frame (n-1), a predicted value for each pixel under consideration in the current frame n, and then, by coding the motion vector, transmits only the difference between the actual pixel value in frame n and its predicted value. For example, let a macroblock (MB) be a rectangular block of M×N pixels, and compare macroblocks of frame (n-1) with macroblocks of frame n. This is the macroblock-matching form of motion compensation: each 16×16-pixel macroblock of frame n is compared with all 16×16-pixel macroblocks of frame (n-1) inside a defined search region (SR), trying to determine where the macroblock of frame n has moved from in frame (n-1). If the luminance signal of frame (n-1) is f_{n-1}(i, j) and that of frame n is f_n(i, j), where (i, j) is any position within an M×N macroblock of frame n, then a macroblock of frame (n-1) can always be found that minimises the sum of absolute differences with the macroblock to be matched. This yields the motion vector (the motion data), and from frame (n-1) and that motion data the corresponding prediction for frame n is obtained. This is done until every pixel at every position (i, j) of every M×N macroblock of frame n has been predicted from the pixels of frame (n-1). As we all know, motion-compensated prediction is not limited to two adjacent frames like n and n-1; in fact MPEG-1 and MPEG-2 can also base the prediction on an earlier frame, not just the immediately previous one.
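As a rough illustration of macroblock matching, here is a minimal full-search sketch that minimises the sum of absolute differences (SAD) over a small search range. Real encoders work on 16×16 macroblocks with half-pel refinement and much faster search strategies; the function and parameter names below are my own, not from any standard.

```python
import numpy as np

def full_search_sad(prev, cur, top, left, block=16, search=7):
    """Full-search block matching: find the displacement (dy, dx) that minimises
    the SAD between the macroblock of the current frame at (top, left) and
    candidate blocks of the previous frame within +/- `search` pixels.
    Returns (best_dy, best_dx, best_sad)."""
    target = cur[top:top + block, left:left + block].astype(np.int32)
    best = (0, 0, np.inf)
    h, w = prev.shape
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = top + dy, left + dx
            if y0 < 0 or x0 < 0 or y0 + block > h or x0 + block > w:
                continue  # candidate block falls outside the reference frame
            cand = prev[y0:y0 + block, x0:x0 + block].astype(np.int32)
            sad = np.abs(target - cand).sum()
            if sad < best[2]:
                best = (dy, dx, sad)
    return best

# Toy usage: the whole picture moved 2 px right and 1 px down between frames
prev = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
cur = np.roll(np.roll(prev, 1, axis=0), 2, axis=1)
# Expect (-1, -2, 0): the block in frame n came from 1 px up and 2 px left in frame n-1
print(full_search_sad(prev, cur, top=16, left=16))
```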
In order to improve the prediction effect, field prediction can be used. This is clearly stated in the Bible of Silky Sama.
It should be noted that MPEG defines image prediction based on the frame, on the field and on dual fields, and also defines 16×8 motion compensation.
For progressive scanning, frame-based prediction can be used; for interlaced scanning, field-based prediction can also be used. An MPEG-2 encoder therefore has to decide, for each picture, whether to compress it in frame mode or in field mode. In interlaced material, frame-based prediction is used in scenes with little motion, because there is then almost no displacement between adjacent lines of the frame, the correlation between adjacent lines within the frame is stronger than within a field, and removing spatial redundancy over the whole frame removes much more than over a single field. In scenes with violent motion, field-based prediction is used instead: two adjacent lines of a frame are separated by one field period, so the pixel displacement between adjacent lines is large and the correlation between adjacent lines within the frame drops sharply, while the correlation between adjacent lines within a field remains stronger. Within one frame there are then many high-frequency components caused by inter-field motion (Silky's emphasis), and more of these high-frequency components are removed by working field by field than over the whole frame.

As can be seen from the above, the key to choosing between frame-based and field-based prediction is line correlation. So before the DCT, a choice between frame DCT coding and field DCT coding must be made: the correlation of adjacent lines within the frame and of adjacent lines within the field is computed from the differences in the original image, or in the 16×16 luminance block after motion compensation. If the correlation between adjacent lines in the frame is greater than that between adjacent lines in the field, frame DCT coding is selected; otherwise field DCT coding is selected.
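Here is a small sketch of that frame/field decision, using sums of absolute differences between adjacent lines as a simple stand-in for the correlation measure described above. The decision rule and names are only illustrative assumptions, not the normative MPEG-2 procedure.

```python
import numpy as np

def choose_frame_or_field_dct(block):
    """Decide frame vs. field DCT for a 16x16 luminance block by comparing how
    similar adjacent lines are within the frame (line i vs. line i+1) and
    within each field (line i vs. line i+2)."""
    block = block.astype(np.int32)
    # Difference between adjacent lines of the frame (lines of alternating fields).
    frame_diff = np.abs(block[0:-1, :] - block[1:, :]).sum()
    # Difference between adjacent lines of the same field (every other line).
    field_diff = np.abs(block[0:-2, :] - block[2:, :]).sum()
    return "frame DCT" if frame_diff <= field_diff else "field DCT"

# Toy usage: little motion -> frame DCT; strong inter-field motion -> field DCT
static = np.tile(np.arange(16, dtype=np.int32), (16, 1))
print(choose_frame_or_field_dct(static))   # adjacent lines identical -> frame DCT
moving = static.copy()
moving[1::2, :] += 40                       # the odd field differs strongly
print(choose_frame_or_field_dct(moving))    # fields diverge -> field DCT
```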
To reduce the spatial redundancy of the video signal, MPEG adopts the DCT, a compression transform proposed by Ahmed (a great mathematician) and others in the 1970s.
The DCT transforms a block of motion-compensation errors or of original image information into a set of coefficients representing different frequency components. This has two advantages: first, the signal tends to concentrate most of its energy in a small region of the frequency domain, so the unimportant components need only a few bits to describe; secondly, decomposing the signal by frequency mirrors the processing done by the human visual system and allows the subsequent quantisation step to be matched to its sensitivity.
This is described in detail in the tutorial I have at hand, so I will quote it directly:
The spectrum of a video signal lies in the range 0-6 MHz. Most video images consist mainly of low-frequency spectral lines; only the video at image edges, which occupy a small proportion of the picture area, contains high-frequency lines. Therefore, in the digital processing of video signals, bits can be allocated according to the spectrum: more bits to the low-frequency region, which carries a large amount of information, and fewer bits to the high-frequency region, which carries little, without causing perceptible damage to picture quality, thereby achieving bit-rate compression. However, all of this can be coded efficiently only when the entropy is low. Whether a string of data can be coded efficiently depends on the probability of each value occurring: if the probabilities differ greatly, the entropy is low and the data can be coded efficiently; if the probabilities are nearly equal, the entropy is high and efficient coding is impossible. Digitised video is produced by an A/D converter sampling the video level at a specified sampling frequency, and the amplitude of each pixel changes over time from field to field. The sum of the average information of all the pixels is the total average information, which is the entropy. Because each video level occurs with almost equal probability, the entropy of a video signal is high. Entropy is the parameter that bounds the achievable rate compression: the compression ratio of a video image depends on the entropy of the video signal, and in most cases that entropy is high. To code efficiently, the high entropy must be turned into a low entropy. How? By looking at the characteristics of the video spectrum. In most cases the amplitude of the video spectrum falls as frequency rises. The low-frequency spectrum takes levels from zero up to the maximum with almost equal probability, whereas the high-frequency spectrum usually takes low levels and only rarely high ones. Clearly the entropy of the low-frequency spectrum is high and that of the high-frequency spectrum is low, so the low-frequency and high-frequency components of the video can be processed separately and the high-frequency part compressed heavily.
As the quotation shows, rate compression rests on two algorithms: transform coding and entropy coding. The former lowers the entropy; the latter turns the data into an efficient code, which reduces the number of bits. In the MPEG standard the transform coding is the DCT. Although the transform itself produces no compression, the frequency coefficients it outputs are very amenable to compression. The whole digital video compression process actually consists of four main steps: block sampling, DCT, quantisation and coding. First the original image is divided into sampling blocks of M (horizontal) × N (vertical) pixels; 4×4, 4×8, 8×8, 8×16 or 16×16 blocks can be chosen as needed (in the example used here, the sample values of the block lie between 139 and 163). The blocks are fed to the DCT encoder, which converts each sampling block from the time domain into a block of DCT coefficients in the frequency domain. The DCT is performed block by block, and each sample in a block is a digitised value representing the amplitude of the video signal at the corresponding pixel within a field.
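To see the entropy argument from the quotation in numbers, here is a tiny sketch that computes the Shannon entropy of two made-up sample sets: raw pixel values where every level is roughly equally likely, and quantised high-frequency coefficients that are mostly zero. The two distributions are invented purely for illustration.

```python
import numpy as np

def entropy_bits(values):
    """Shannon entropy (bits per sample) of a sequence of integer values:
    the theoretical lower bound on the average code length."""
    _, counts = np.unique(np.asarray(values), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Raw samples: most levels occur with roughly equal probability -> high entropy
pixels = np.random.randint(0, 256, size=10_000)
print(f"raw samples:      ~{entropy_bits(pixels):.1f} bits/sample")   # close to 8

# Quantised high-frequency coefficients: mostly zero, a few small values -> low entropy
coeffs = np.random.choice([0, 0, 0, 0, 0, 0, 1, -1, 2], size=10_000)
print(f"quantised coeffs: ~{entropy_bits(coeffs):.1f} bits/sample")   # far below 8
```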
The concrete algorithm of DCT and its inverse operation during decompression is as follows.
When u = v = 0, the cosine basis function is a constant, so the reconstructed function after the IDCT is constant; the coefficient F(0,0) is therefore called the DC coefficient. When u and v are not both zero, the basis functions after the inverse transform are not constant, and the corresponding coefficients F(u, v) are the AC coefficients.
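Since the formula here is just the standard 8×8 DCT-II and its inverse, the following is a direct (and deliberately slow) implementation, only to make the DC/AC distinction and the losslessness of the transform concrete; real codecs use fast factorisations.

```python
import numpy as np

N = 8
C = np.ones(N)
C[0] = 1.0 / np.sqrt(2.0)   # normalisation: C(0) = 1/sqrt(2), C(k) = 1 otherwise
# COS[x, u] = cos((2x + 1) * u * pi / 16)
COS = np.cos((2 * np.arange(N)[:, None] + 1) * np.arange(N)[None, :] * np.pi / (2 * N))

def dct2(block):
    """Forward 8x8 DCT: F(u,v) = 1/4 C(u)C(v) sum_x sum_y f(x,y) cos(..) cos(..)"""
    return 0.25 * np.outer(C, C) * (COS.T @ block @ COS)

def idct2(coeff):
    """Inverse 8x8 DCT: f(x,y) = 1/4 sum_u sum_v C(u)C(v) F(u,v) cos(..) cos(..)"""
    return 0.25 * (COS @ (np.outer(C, C) * coeff) @ COS.T)

block = np.random.randint(0, 256, size=(N, N)).astype(np.float64)
F = dct2(block)
print(np.allclose(idct2(F), block))   # True: the transform itself is lossless
print(F[0, 0], 8 * block.mean())      # F(0,0), the DC coefficient, is ~ 8 * mean
```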
A concrete application of the DCT is shown in the figure below (it comes from a slide we made for staff training, which happens to fit nicely):
/rzy/Kean/DCTpro.jpg
Two things are worth noticing in the transform above. First, the 64 DCT frequency coefficients correspond one-to-one to the 64 pixels of the block before the DCT; there are 64 points both before and after, so the transform itself is lossless and performs no compression. Secondly, the spectrum of the whole coefficient block is concentrated almost entirely in its upper-left corner, and a compressed picture can essentially be formed from that part alone. The DC coefficient in the upper-left corner of the coefficient matrix output by the DCT has the largest amplitude, 315 in the figure, because it represents the DC component along both the X and Y axes, i.e. the average of all the amplitudes of the input matrix. The other DCT coefficients, moving down and to the right away from the DC coefficient, represent higher and higher frequencies and have smaller and smaller amplitudes, down to -0.11 in the lower-right corner of the figure. This means that most of the image information is concentrated in the DC coefficient and the low-frequency spectrum near it, while the high-frequency spectrum far from the DC coefficient contains almost no image information, often only noise. Clearly, although the DCT itself compresses nothing, it lays the essential foundation for the "keeping" and "discarding" done by the later compression steps.
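To show where the "keeping" and "discarding" actually happen after the DCT, here is a sketch that divides a made-up coefficient block (with the DC value 315 from the figure, the rest invented) by a flat quantisation matrix, scans it in zig-zag order so the zeros in the high-frequency corner cluster together, and emits (run, level) pairs ready for variable-length coding. The quantisation matrix and helper names are illustrative, not the normative MPEG tables.

```python
import numpy as np

def zigzag_order(n=8):
    """(row, col) indices of an n x n block in zig-zag order: sorted by
    anti-diagonal, alternating the direction of traversal."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_length(values):
    """Turn a 1-D list of quantised coefficients into (run_of_zeros, level)
    pairs, with a final end-of-block marker, roughly as run-length coding does."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, int(v)))
            run = 0
    pairs.append("EOB")   # everything after the last non-zero value is dropped
    return pairs

# A made-up 8x8 coefficient block: large DC value, a few low-frequency ACs, rest zero
F = np.zeros((8, 8))
F[0, 0], F[0, 1], F[1, 0], F[2, 1] = 315.0, -27.0, 18.0, 6.0

Q = np.full((8, 8), 16.0)                 # flat quantisation matrix, purely illustrative
quantised = np.round(F / Q).astype(int)   # small high-frequency values become zero

scan = [quantised[r, c] for r, c in zigzag_order()]
print(run_length(scan))                   # [(0, 20), (0, -2), (0, 1), 'EOB']
```

After this step, the (run, level) pairs are the symbols that the variable-length coder maps to short codes for frequent pairs and longer codes for rare ones, which is where the actual bit saving happens.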