When too much data is missing, delete the indicator directly.
For example, after examining the population data, we find that 40% of the values for the item "age" are missing, so we delete that indicator outright. The variable can then be ignored for the rest of the problem.
When a variable has too many missing values, any attempt to fill them in may end up far from the real situation, so the data has little value.
So how much missing data counts as "too much"? There is no hard and fast rule. Missing 30% to 40% of the values is clearly too much. But if a dataset covering 1.4 billion people is missing a few thousand or even tens of thousands of records, that is not much. Each problem therefore has to be judged on its own terms.
Applicable competition problems: variables with "too much" missing data.
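The deletion rule can be sketched in a few lines of Python. This is only an illustration: the 30% threshold, the helper names missing_fraction and drop_sparse_columns, and the sample survey table are all assumptions made for the example, and missing values are assumed to be stored as None.

```python
def missing_fraction(values):
    """Return the share of entries in `values` that are None."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_columns(table, threshold=0.3):
    """Keep only the columns whose missing share is below `threshold`.

    `table` maps a column name to a list of values. The default threshold
    is an assumed cutoff, not a fixed rule; choose it per problem.
    """
    return {
        name: column
        for name, column in table.items()
        if missing_fraction(column) < threshold
    }


# Hypothetical survey: "age" is missing 4 of 10 values, "height" 1 of 10.
survey = {
    "age": [23, None, None, 41, None, 35, None, 52, 29, 60],
    "height": [170, 165, None, 180, 175, 168, 172, 160, 178, 169],
}

cleaned = drop_sparse_columns(survey, threshold=0.3)
print(list(cleaned))  # only "height" survives the 30% cutoff
```

Because the threshold is a parameter, the same function can be reused with a stricter or looser cutoff depending on the size of the dataset.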
Fill in the gaps with the mean or the mode.
The mean is the arithmetic average, and the mode is the value that occurs most often.
For quantitative data, such as the height or age of a group of people, fill each missing position with the overall mean;
For qualitative data, such as gender, education level, or satisfaction with some event, fill each missing position with the most frequent value, that is, the mode.
Applicable competition problems: large datasets such as population, age, economic, or industry data, where high accuracy for each individual record is not required.
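A minimal sketch of both fills, using only Python's standard statistics module. The function names fill_with_mean and fill_with_mode and the sample lists are assumptions for illustration, and missing values are assumed to be stored as None.

```python
from statistics import mean, mode


def fill_with_mean(values):
    """Replace None in quantitative data with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    average = mean(observed)
    return [average if v is None else v for v in values]


def fill_with_mode(values):
    """Replace None in qualitative data with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    most_common = mode(observed)
    return [most_common if v is None else v for v in values]


# Hypothetical samples: heights in cm (quantitative), gender labels (qualitative).
heights = [170, None, 180, 160, None]
genders = ["F", "M", None, "F", "F"]

print(fill_with_mean(heights))  # gaps take the mean of 170, 180, 160
print(fill_with_mode(genders))  # the gap takes the most frequent label
```

Note that the mean and the mode are computed from the observed values only, so the filled entries do not shift the statistic that produced them.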
Newton interpolation.
Simply put, Newton interpolation constructs an approximating polynomial from the known data points according to a fixed formula, and the value of that polynomial is used to fill in the missing value.
Disadvantage: unstable oscillation near the ends of the interval, known as the Runge phenomenon. Put figuratively, when the degree of the interpolating polynomial is high, the function looks quite normal in the middle of the interval but swings wildly up and down near its edges.
Because of the Runge phenomenon, Newton interpolation is not suitable for problems that require derivatives, since the oscillations distort the slope of the fitted curve.
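The "fixed formula" can be sketched with divided differences, the standard way to build the Newton form of the interpolating polynomial. The function names, the sample points, and the test function 1 / (1 + 25 x^2) on equally spaced nodes (a common textbook choice for showing the Runge phenomenon) are assumptions for illustration, not taken from the text above.

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients f[x0], f[x0,x1], ..., f[x0..xn]."""
    coef = list(ys)
    n = len(xs)
    for order in range(1, n):
        # Walk backwards so lower-order entries are still intact when used.
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - order])
    return coef


def newton_eval(xs, coef, x):
    """Evaluate the Newton-form polynomial at x by nested multiplication."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result


# Filling a gap: values are known at x = 1, 2, 4, 5 and missing at x = 3.
xs = [1, 2, 4, 5]
ys = [1, 4, 16, 25]  # hypothetical data that happens to follow y = x**2
coef = divided_differences(xs, ys)
filled = newton_eval(xs, coef, 3)
print(filled)


# Runge phenomenon: a degree-10 interpolant on 11 equally spaced nodes.
def runge(x):
    return 1 / (1 + 25 * x * x)


nodes = [-1 + 2 * i / 10 for i in range(11)]
runge_coef = divided_differences(nodes, [runge(x) for x in nodes])

center_error = abs(newton_eval(nodes, runge_coef, 0.05) - runge(0.05))
edge_error = abs(newton_eval(nodes, runge_coef, 0.96) - runge(0.96))
print(center_error, edge_error)  # the error near the edge is much larger
```

In practice, keeping the polynomial degree low by interpolating from only a few points around each gap is one way to limit the oscillation.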