Indirect data mining: rather than selecting a specific target variable and describing it with a model, the goal is to establish relationships among all the variables.
Data mining methods
Neural network method
Neural networks have attracted growing attention in recent years. Their robustness, self-organization, adaptability, parallel processing, distributed storage and high fault tolerance make them well suited to data mining problems.
Genetic algorithm
A genetic algorithm is a randomized search algorithm modeled on biological natural selection and genetic mechanisms; it is a bionic global optimization method. Genetic algorithms are applied to data mining because of their implicit parallelism and because they combine easily with other models.
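As a rough illustration of selection, crossover and mutation, here is a minimal sketch of a genetic algorithm. The objective (counting the 1-bits in a bit string), the parameter values and the function names are assumptions chosen for illustration only; they are not taken from any particular data mining system.

```python
import random


def fitness(bits):
    # Toy objective (an assumption for illustration): number of 1-bits.
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in pop)
    for _ in range(generations):
        # Elitism: carry the best individual over unchanged.
        elite = max(pop, key=fitness)
        new_pop = [elite[:]]
        while len(new_pop) < pop_size:
            # Selection: two binary tournaments pick the parents.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            # One-point crossover.
            cut = rng.randint(1, n_bits - 1)
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness), initial_best


best, initial_best = evolve()
print(fitness(best), ">=", initial_best)
```

Because the best individual is always kept, the best fitness in the population never decreases from one generation to the next.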
Decision tree method
The decision tree is a commonly used algorithm for building predictive models. By classifying large amounts of data in a purposeful way, it uncovers valuable latent information. Its main advantages are a simple description and fast classification, which make it especially suitable for large-scale data processing.
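One common way to grow a decision tree is to split on the attribute with the highest information gain. The sketch below shows only that single step. The choice of information gain as the criterion, the toy records and the attribute names are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr, target):
    # Entropy before the split minus the weighted entropy after it.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_split(rows, attrs, target):
    return max(attrs, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy records.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
]
print(best_split(rows, ["outlook", "windy"], "play"))
```

In this toy table "outlook" separates the two classes perfectly, so it is chosen as the root split. A full tree would apply the same step recursively to each subset.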
Rough set method
Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. The rough set method has several advantages: it requires no additional information beyond the data; it simplifies the representation space of the input information; and the algorithm is simple and easy to apply. The object that rough set methods process is an information table, similar to a two-dimensional relational table.
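The central construction in rough set theory is the pair of lower and upper approximations of a target set, computed from an information table. The sketch below assumes a small hypothetical table keyed by object id; the attribute names and values are illustrative only.

```python
from collections import defaultdict


def approximations(table, attrs, target):
    # Objects with identical values on attrs are indiscernible
    # and fall into the same equivalence class.
    classes = defaultdict(set)
    for obj_id, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj_id)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # class lies entirely inside the target
            lower |= block
        if block & target:    # class overlaps the target
            upper |= block
    return lower, upper


# Hypothetical information table: object id -> attribute values.
table = {
    1: {"headache": "yes", "temperature": "high"},
    2: {"headache": "yes", "temperature": "high"},
    3: {"headache": "no", "temperature": "normal"},
    4: {"headache": "no", "temperature": "high"},
}
flu = {1, 4}  # objects assumed to belong to the target concept
lower, upper = approximations(table, ["headache", "temperature"], flu)
print(lower, upper)
```

Objects 1 and 2 cannot be told apart by the two attributes, yet only object 1 is in the target set. Object 1 therefore falls outside the lower approximation, while both objects fall inside the upper one.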
Covering positive examples and rejecting counterexamples
This method finds rules by covering all positive examples and rejecting all counterexamples. First, a seed is chosen from the positive example set and compared with the counterexamples one by one. A selector built from field values is discarded if it is compatible with a counterexample and retained otherwise. Repeating this process over all the positive examples yields the rules for the positive examples (conjunctions of selectors).
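The covering idea can be sketched with a simplified greedy procedure. This is not a full implementation of any specific published algorithm: the strategy for choosing selectors, the function names and the toy data are assumptions made for illustration.

```python
def covers(rule, example):
    # A rule is a conjunction of selectors: attribute == value.
    return all(example[a] == v for a, v in rule.items())


def learn_rules(positives, negatives, attrs):
    rules = []
    remaining = list(positives)
    while remaining:
        seed = remaining[0]
        rule = {}
        still_covered = list(negatives)
        # Add selectors from the seed until no counterexample is covered.
        while still_covered and len(rule) < len(attrs):
            unused = [a for a in attrs if a not in rule]
            # Greedy choice: the selector that keeps the fewest counterexamples.
            best = min(unused, key=lambda a: sum(n[a] == seed[a] for n in still_covered))
            rule[best] = seed[best]
            still_covered = [n for n in still_covered if n[best] == seed[best]]
        if not still_covered:
            rules.append(rule)
        # Drop the positives this rule explains (the seed is always among them).
        remaining = [p for p in remaining if not covers(rule, p)]
    return rules


# Hypothetical toy data.
positives = [{"shape": "round", "color": "red"}, {"shape": "round", "color": "green"}]
negatives = [{"shape": "square", "color": "red"}, {"shape": "square", "color": "green"}]
print(learn_rules(positives, negatives, ["shape", "color"]))
```

Here the single selector shape == "round" covers both positive examples and rejects both counterexamples, so one rule is enough.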
Statistical analysis technology
Fields in a database can be related in two ways: by functional relationships and by correlations. Both can be analyzed with statistical methods, that is, by applying statistical principles to the information in the database. Commonly used statistical methods include regression analysis, correlation analysis and difference analysis.
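As a small illustration of correlation and regression analysis, the sketch below computes the Pearson correlation coefficient and a least-squares line for two numeric fields. The sample values are made up for the example, and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def linear_fit(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up values lying exactly on y = 2x + 1
print(pearson(xs, ys), linear_fit(xs, ys))
```

A coefficient close to 1 or -1 indicates a strong linear correlation between the two fields, and the fitted line gives the corresponding regression relationship.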
Fuzzy set method
Fuzzy set theory is used to perform fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition and fuzzy cluster analysis on practical problems. The more complex a system is, the fuzzier it tends to be. Fuzzy set theory generally uses degrees of membership to describe things that belong partly to one category and partly to another.
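A degree of membership can be illustrated with a triangular membership function, one common choice among many. The function name, the "warm" temperature set and its breakpoints are assumptions for this sketch. Intersection and union use the standard minimum and maximum operators.

```python
def triangular(x, a, b, c):
    # Membership rises linearly from a to the peak at b, then falls to c.
    # Assumes a < b < c.
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzy_and(mu1, mu2):
    return min(mu1, mu2)


def fuzzy_or(mu1, mu2):
    return max(mu1, mu2)


# Hypothetical fuzzy set "warm" over temperatures in degrees Celsius.
def warm(t):
    return triangular(t, 10, 20, 30)


print(warm(15), warm(20), warm(35))
```

A temperature of 15 degrees is "warm" to degree 0.5, rather than being simply inside or outside the set.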
Data mining tasks
Association analysis
When the values of two or more variables exhibit some regularity, the variables are said to be associated. Data association is an important kind of knowledge that can be discovered in a database. Associations are divided into simple associations, time-series associations and causal associations. The purpose of association analysis is to find the hidden network of associations in the database. Generally, two thresholds, support and confidence, are used to measure the relevance of association rules, and further parameters such as interestingness and correlation have been introduced so that the mined rules better meet requirements.
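Support and confidence can be computed directly from a list of transactions. The sketch below uses made-up shopping baskets, and the helper names are hypothetical.

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    # Estimated probability of the consequent given the antecedent.
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

A rule such as bread -> milk is kept only if both its support and its confidence reach the chosen thresholds.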
Cluster analysis
Clustering divides data into several categories according to similarity: data within the same category are similar to each other, while data in different categories are dissimilar. Cluster analysis can establish macro-level concepts and reveal data distribution patterns and possible relationships between data attributes.
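Grouping by similarity can be sketched with k-means, one of many clustering algorithms and chosen here purely as an example. The points, the initial centers and the use of squared Euclidean distance are assumptions for the illustration.

```python
def dist2(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, iterations=10):
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Made-up 2-D points forming two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
labels, centers = kmeans(points, [(1, 1), (8, 8)])
print(labels)
```

The first three points end up in one cluster and the last three in the other, reflecting their similarity in position.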
Classification
Classification finds a concept description for each category. The description represents the overall information about that class of data, that is, its intensional description, and is used to construct a model, usually expressed as rules or a decision tree. Classification rules are obtained by applying an algorithm to a training data set. Classification can be used for both rule description and prediction.
Prediction
Prediction uses historical data to find patterns of change, builds a model from them, and uses the model to predict the types and characteristics of future data. Prediction is concerned with accuracy and uncertainty, which are usually measured by the prediction variance.
Time series pattern
A time series pattern is a pattern that recurs with high probability in a time series. Like regression, it uses known data to predict future values, but the data differ in the time at which the variables are observed.
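One simple way to look for recurring patterns is to count fixed-length subsequences in a discretized series. The sketch below assumes the series has already been reduced to symbols such as "up" and "down". The function name, window length and threshold are illustrative choices.

```python
from collections import Counter


def frequent_patterns(sequence, length, min_count):
    # Slide a window over the sequence and count each distinct subsequence.
    windows = (tuple(sequence[i:i + length])
               for i in range(len(sequence) - length + 1))
    counts = Counter(windows)
    return {p: c for p, c in counts.items() if c >= min_count}


# Made-up series of daily movements.
moves = ["up", "up", "down", "up", "up", "down", "up", "up", "down"]
print(frequent_patterns(moves, length=3, min_count=3))
```

In this series the subsequence up, up, down occurs three times, so it is reported as a frequent pattern and could be used to anticipate the next movement.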
Deviation analysis
Deviations carry a lot of useful knowledge, and the data in a database often contain many anomalies, so detecting them is very important. The basic method of deviation detection is to look for differences between observed results and reference values.
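Comparing observations with a reference value can be sketched as follows. Using the mean as the default reference and two sample standard deviations as the default threshold are assumptions made for this example; other references and thresholds are equally valid.

```python
import statistics


def find_deviations(observations, reference=None, threshold=None):
    # Default reference: the mean of the observations (an assumption).
    if reference is None:
        reference = statistics.mean(observations)
    # Default threshold: two sample standard deviations (an assumption).
    if threshold is None:
        threshold = 2 * statistics.stdev(observations)
    # Report (index, difference from reference) for each outlying value.
    return [(i, x - reference)
            for i, x in enumerate(observations)
            if abs(x - reference) > threshold]


readings = [10, 11, 9, 10, 12, 10, 50]  # made-up measurements
print(find_deviations(readings))
print(find_deviations(readings, reference=10, threshold=5))
```

Only the last reading differs from the reference by more than the threshold, so it is the only one flagged as a deviation.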