The basic idea for detecting duplicate records is "sort and merge". First, the records in the data set are sorted according to some rule; then adjacent records are compared for similarity to decide whether they are duplicates. There are thus two operations: sorting and similarity calculation. In a typical workflow, a duplicate-detection method is used to flag the repeated samples, which are then simply deleted.
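The sort-then-compare idea can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: records are plain strings, the sort key is the lowercased text, similarity is measured with the standard library's difflib.SequenceMatcher, and the threshold of 0.9 is an arbitrary choice. The function names are hypothetical and not taken from the original text.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def drop_near_duplicates(records, threshold=0.9):
    """Sort records, then keep a record only if it is not similar
    to the most recently kept one (its neighbor in sorted order).

    `threshold` is an assumed cut-off; tune it for real data.
    """
    # Step 1: sort so that likely duplicates end up next to each other.
    ordered = sorted(records, key=lambda r: r.lower())

    kept = []
    for rec in ordered:
        # Step 2: compare with the previous kept record only.
        if kept and similarity(kept[-1].lower(), rec.lower()) >= threshold:
            continue  # treated as a duplicate and dropped
        kept.append(rec)
    return kept


if __name__ == "__main__":
    names = ["John Smith", "john smith", "Jon Smith", "Alice Wong"]
    print(drop_near_duplicates(names))
```

Because only neighbors are compared, the result depends heavily on the sort key: duplicates that differ near the start of the key may not end up adjacent and would be missed.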
Conceptual analysis
Clustering divides a set of physical or abstract objects into classes made up of similar objects. Values that fall outside every cluster (outliers) are identified and removed, since these outliers are regarded as noise.
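As a rough illustration of cluster-based noise removal, the sketch below groups one-dimensional values into clusters by the gap between sorted neighbors and discards values whose cluster is too small. The gap-based grouping is a deliberately simple stand-in for a real clustering algorithm, and the names cluster_1d, remove_outliers, max_gap and min_size are assumptions made for this example.

```python
def cluster_1d(values, max_gap):
    """Group sorted values into clusters; a new cluster starts whenever
    the gap to the previous value exceeds `max_gap`."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # too far: start a new cluster
    return clusters


def remove_outliers(values, max_gap=2.0, min_size=3):
    """Keep only values that belong to a cluster with at least
    `min_size` members; everything else is treated as noise."""
    keep = {
        v
        for cluster in cluster_1d(values, max_gap)
        if len(cluster) >= min_size
        for v in cluster
    }
    # Preserve the original order of the input.
    return [v for v in values if v in keep]


if __name__ == "__main__":
    data = [10, 11, 12, 50, 30, 31, 32, 11.5]
    print(remove_outliers(data))
```

Here the value 50 forms a cluster of its own and is dropped. For multi-dimensional data, a proper clustering method would be needed in place of the gap rule.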
Regression tries to find the relationship between two related variables and smooths the data by fitting a function to it. In other words, a mathematical model is built and used to predict the next value. Both linear regression and nonlinear regression can be used.
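A minimal sketch of smoothing by linear regression follows, assuming a single input variable and an ordinary least-squares fit. The helper names fit_line and smooth and the sample numbers are made up for illustration; a nonlinear model would follow the same pattern with a different fitted function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx            # assumes xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 3.9, 6.2, 7.8, 10.0]   # noisy, roughly y = 2x
    slope, intercept = fit_line(xs, ys)
    print(smooth(xs, ys))              # smoothed values on the fitted line
    print(slope * 6 + intercept)       # prediction for the next x
```

The smoothed values lie exactly on the fitted line, so the noise in the original observations is replaced by the model's estimate.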