Overview of data mining concepts

First, what is data mining?

1.1 The history of data mining

Over the past decade or so, information technology has greatly increased people's ability to generate and collect data, and tens of millions of databases are in use in business management, government administration, scientific research and engineering development. This trend will continue. A new challenge has therefore emerged: in this era of information explosion, information overload is a problem almost everyone has to face. How can useful knowledge be found in time in the vast ocean of information, and how can the utilization of information be improved? For data to truly become a company's resource, it must be fully exploited to serve the company's own business decisions and strategic development; otherwise, large amounts of data may become a burden or even garbage. Hence the challenge that "people are drowning in data yet starving for knowledge." On the other hand, artificial intelligence, another field of computer technology, has made great progress since its birth in 1956. Having passed through the game-playing, natural language understanding and knowledge engineering stages, its current research focus is machine learning, the science of using computers to simulate human learning; its mature algorithms include neural networks and genetic algorithms. Storing data with database management systems, analyzing it with machine learning methods, and thereby mining the knowledge hidden behind large amounts of data: the combination of the two gave rise to knowledge discovery in databases (KDD). Data mining and knowledge discovery (DMKD) technology thus came into being and has flourished, showing ever stronger vitality.

Data mining is also called KDD, data analysis, data fusion and decision support. The term KDD first appeared at the 11th International Joint Conference on Artificial Intelligence, held in August 1989. KDD workshops were then held in 1991, 1993 and 1994, bringing together researchers and application developers from many fields and focusing on data statistics, massive data analysis and computation, knowledge representation and knowledge application. As the number of participants grew, the KDD meeting developed into an annual international conference. The 4th International Conference on Knowledge Discovery and Data Mining, held in New York in 1998, featured not only academic discussions but also more than 30 software companies exhibiting their data mining products, many of which are already in use in North America, Europe and elsewhere.

1.2 The concept of data mining

From 1989 to the present, the definition of KDD has been refined as research has deepened. The currently accepted definition was given by Fayyad et al.: KDD is the process of identifying valid, novel, potentially useful and ultimately understandable patterns from data sets. As the definition shows, data mining is a process of extracting hidden information and knowledge from large amounts of incomplete, noisy, fuzzy and random data; this information and knowledge is not known in advance, but it is potentially useful. The raw data are regarded as the source of knowledge, much as ore is the source of metal. The raw data may be structured, such as data in a relational database, or semi-structured, such as text, graphics and image data, or even heterogeneous data distributed over a network. The methods for discovering knowledge may be mathematical or non-mathematical, deductive or inductive. The discovered knowledge can be used for information management, query optimization, decision support, process control and so on, as well as for maintenance of the data itself. Data mining is therefore a very broad interdisciplinary field that brings together researchers from different areas, especially scholars and engineers in databases, artificial intelligence, mathematical statistics, visualization and parallel computing.

In particular, data mining technology has been application-oriented from the start. It is not simply retrieval and querying of a particular database, but statistical analysis, synthesis and reasoning over the data at the micro, meso and macro levels, in order to guide the solution of practical problems, to discover correlations between events, and even to use the existing data to predict future activity.

Generally speaking, the term KDD is used in scientific research and the term data mining in engineering.

Second, the steps of data mining

KDD includes the following steps:

1. Data preparation

The objects processed by KDD are large amounts of data, generally stored in database systems and accumulated over a long period. These data are usually not directly suitable for knowledge mining, so data preparation is required. Data preparation generally includes data selection (choosing the relevant data), purification (eliminating noise and redundant data), speculation (estimating missing data), transformation (conversion between discrete-valued and continuous-valued data, grouping and classifying data values, computing combinations of data items, etc.) and data reduction (reducing the amount of data). If the object of KDD is a data warehouse, these tasks are often already done when the data warehouse is built. Data preparation is the first and an important step of KDD; how well the data are prepared affects the efficiency and accuracy of data mining and the validity of the final model.
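
As an illustration only, the sketch below walks through these preparation steps with pandas on a made-up table; the column names, value ranges and imputation choice are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records containing noise and missing values.
raw = pd.DataFrame({
    "age":    [23, 45, np.nan, 31, 120, 52],          # 120 is an implausible outlier
    "income": [3200, 5800, 4100, np.nan, 7500, 6900],
    "gender": ["F", "M", "M", "F", "F", "M"],
})

# Selection: keep only the fields relevant to the mining task.
data = raw[["age", "income"]].copy()

# Purification: drop records whose age falls outside a plausible range.
data = data[data["age"].isna() | data["age"].between(0, 100)]

# Speculation (imputation): estimate missing values, here with the column mean.
data = data.fillna(data.mean(numeric_only=True))

# Transformation: discretize the continuous income field into groups.
data["income_band"] = pd.cut(data["income"], bins=[0, 4000, 7000, np.inf],
                             labels=["low", "mid", "high"])

print(data)   # data reduction (e.g. sampling) could follow on larger tables
```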

2. Data mining technology

Data mining is the most critical step in KDD and also its main technical difficulty. Most KDD researchers work on data mining techniques, which include decision trees, classification, clustering, rough sets, association rules, neural networks, genetic algorithms and so on. According to the goal of the KDD task, data mining selects an appropriate algorithm and its parameters, analyzes the data, and obtains pattern models that may form knowledge.

3. Evaluate and interpret the model

The pattern models obtained above may have no practical significance or value, may not accurately reflect the true meaning of the data, and in some cases may even contradict the facts, so they must be evaluated to determine which patterns are valid and useful. The evaluation can draw on the user's years of experience, and some models can be tested for accuracy directly against the data. This step also includes presenting the patterns to the user in an easily understandable way.

4. Consolidate knowledge

A pattern model understood by the user is regarded as knowledge in a practical and valuable form. At the same time, the consistency of this knowledge should be checked, resolving conflicts and contradictions with previously acquired knowledge, so that the knowledge is consolidated.

5. Use knowledge

Knowledge is discovered in order to be applied, and how to apply it is also one of the steps of KDD. There are two ways to use knowledge: one is to rely only on the relationships or results described by the knowledge itself to support decision-making; the other is to apply the knowledge to new data, which may raise new problems and require the knowledge to be refined further.

Third, the characteristics and functions of data mining

3.1 The characteristics of data mining

Data mining has the following characteristics. Of course, these characteristics are closely related to the data and purpose of data mining.

1. The scale of the data to be processed is huge.

2. Queries are generally ad hoc queries raised on the spot by decision makers (users), who often cannot formulate precise query requirements.

3. Because the data changes quickly and may be out of date soon, it is necessary to respond quickly to dynamic data to provide decision support.

4. It is mainly based on the statistical laws of large samples, and the laws it finds may not be applicable to all data.

3.2 The functions of data mining

Data mining can discover the following kinds of knowledge:

Generalized knowledge, reflecting the common properties of similar things;

Characteristic knowledge, reflecting the characteristics of the various aspects of things;

Differential knowledge, reflecting attribute differences between different things;

Relational knowledge, reflecting dependencies or associations between things;

Predictive knowledge, inferring future data from historical and current data;

Deviation knowledge, revealing abnormal phenomena in which things deviate from the norm.

All of this knowledge can be discovered at different conceptual levels and, as the concept tree is climbed, from the micro to the meso and macro levels, to meet the needs of different users and different levels of decision-making. For example, from a supermarket's data warehouse a typical association rule might be "nine out of ten customers who buy bread and butter also buy milk" or "almost all customers who buy food pay by credit card", which is very useful for merchants formulating and implementing tailored sales plans and strategies. As for discovery tools and methods, there are classification, clustering, dimensionality reduction, pattern recognition, visualization, decision trees, genetic algorithms, uncertainty handling and so on. In summary, data mining has the following functions:

Prediction/verification function: predicting or verifying the values of other, unknown fields from several known fields in the database. Prediction methods include statistical analysis, association rules, decision tree prediction, regression tree prediction and so on.

Descriptive function: finding understandable patterns that describe the data. Descriptive methods include data classification, regression analysis, clustering, induction, construction of dependency models, variation and deviation analysis, pattern discovery, path discovery and so on.

Fourth, the patterns of data mining

The task of data mining is to find patterns in data. A pattern is an expression E in a language L that describes the characteristics of the data in a data set F; the data described by E form a subset FE of F. To count as a pattern, E must be simpler than an enumeration of all the elements of the subset FE. For example, "if the score is between 81 and 90, the grade is excellent" can be called a pattern, whereas "if the score is 81, 82, 83, 84, 85, 86, 87, 88, 89 or 90, the grade is excellent" cannot.

There are many kinds of patterns; by function they can be divided into two broad categories: predictive patterns and descriptive patterns.

A predictive pattern is one that can determine a particular outcome from the values of the data items, and the data used to mine predictive patterns have outcomes that are also clearly known. For example, from data on various animals a model such as "all viviparous animals are mammals" can be built; when data on a new animal arrive, this model can be used to judge whether the animal is a mammal.

A descriptive pattern describes the regularities that exist in the data, or groups the data by similarity; descriptive patterns cannot be used directly for prediction. For example: on Earth, 70% of the surface is covered by water and 30% is land.

In practical applications, patterns are often divided into the following six types according to their actual function:

1. Classification pattern

A classification pattern is a classification function (a classifier) that maps data items in a data set to one of a set of given classes. Classification patterns are usually represented as classification trees: starting from the root of the tree and following, at each node, the branch that the data values satisfy, one reaches a leaf, which determines the category.
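
The following toy sketch shows that traversal on a hand-written tree of nested dictionaries; the attributes, thresholds and class labels are invented for illustration and are not taken from any real classifier.

```python
# A hypothetical classification tree stored as nested dictionaries.
# Internal nodes hold an attribute and a threshold; leaves hold a class label.
tree = {
    "attribute": "body_temperature",
    "threshold": 35.0,
    "above": {"attribute": "gives_birth", "threshold": 0.5,
              "above": {"label": "mammal"},
              "below": {"label": "bird"}},
    "below": {"label": "reptile"},
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch each value satisfies."""
    if "label" in node:                       # reached a leaf: return its class
        return node["label"]
    branch = "above" if record[node["attribute"]] > node["threshold"] else "below"
    return classify(node[branch], record)

print(classify(tree, {"body_temperature": 37.2, "gives_birth": 1}))  # -> mammal
```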

2. Regression pattern

The function definition of a regression pattern is similar to that of a classification pattern; the difference is that the value predicted by a classification pattern is discrete, while the value predicted by a regression pattern is continuous. For example, given the characteristics of an animal, a classification pattern can determine whether the animal is a mammal or a bird; given a person's education and work experience, a regression pattern can estimate whether the annual salary is below 6,000 yuan, between 6,000 and 10,000 yuan, or above 10,000 yuan.

3. Time series pattern

A time series pattern predicts future values from the trend of the data over time. The special nature of time must be considered here: periodic definitions of time such as week, month, quarter and year; the possible influence of particular days, such as holidays; the way dates themselves are calculated; and the correlation between earlier and later points in time (how much the past influences the future). Only by fully accounting for the time factor and using the series of existing values over time can future values be predicted well.
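
As a rough sketch of this idea, the snippet below builds a made-up monthly sales series with pandas, produces a seasonal-naive forecast (next January is assumed to look like last January) and a 3-month moving average for the trend; the figures and the choice of forecast rule are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical monthly sales with a yearly seasonal pattern (figures made up).
idx = pd.date_range("2022-01-01", periods=24, freq="MS")
sales = pd.Series([100, 90, 120, 130, 150, 170, 160, 155, 140, 135, 180, 220] * 2,
                  index=idx)

# Seasonal-naive forecast: assume each month next year repeats the same month this year.
forecast = sales.iloc[-12:].copy()
forecast.index = forecast.index + pd.DateOffset(years=1)

# A 3-month moving average smooths short-term noise before inspecting the trend.
trend = sales.rolling(window=3).mean()

print(forecast.head(3))
print(trend.tail(3))
```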

4. Clustering pattern

A clustering pattern divides the data into different groups, making the differences between groups as large as possible and the differences within each group as small as possible. Unlike classification, before clustering we do not know how many groups the data will fall into, what kinds of groups they will be, or which data item(s) will define the groups. In general, people with good business knowledge should be able to interpret the meaning of the resulting groups; if the resulting pattern is incomprehensible or unusable, it may be meaningless, and one must go back to an earlier stage and reorganize the data.

5. Association pattern

An association pattern is an association rule between data items. An association rule has a form such as: "Among customers unable to repay their loans, 60% have a monthly income below 3,000 yuan."

6. Sequence pattern

A sequence pattern is similar to an association pattern, but the association between the data is related to time. To discover sequence patterns, it is necessary to know not only whether an event occurred but also when it occurred. For example, among people who buy a colour TV set, 60% buy a DVD player within three months.
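
A small sketch of checking such a rule over a hypothetical transaction log with pandas; the customers, dates and the 90-day window are made up.

```python
import pandas as pd

# Hypothetical transaction log: customer id, item, purchase date.
tx = pd.DataFrame({
    "customer": [1, 1, 2, 2, 3],
    "item":     ["TV", "DVD player", "TV", "DVD player", "TV"],
    "date":     pd.to_datetime(["2024-01-05", "2024-02-20",
                                "2024-01-10", "2024-06-01", "2024-03-03"]),
})

tv  = tx[tx["item"] == "TV"][["customer", "date"]].rename(columns={"date": "tv_date"})
dvd = tx[tx["item"] == "DVD player"][["customer", "date"]].rename(columns={"date": "dvd_date"})

# Join TV purchases with later DVD purchases by the same customer,
# keeping only those that happened within 90 days.
pairs = tv.merge(dvd, on="customer")
within = pairs[(pairs["dvd_date"] > pairs["tv_date"]) &
               (pairs["dvd_date"] - pairs["tv_date"] <= pd.Timedelta(days=90))]

share = within["customer"].nunique() / tv["customer"].nunique()
print(f"{share:.0%} of TV buyers bought a DVD player within 3 months")
```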

Fifth, the discovery tasks of data mining

Data mining involves many disciplines and methods, and there are many ways to classify it. By mining task, it can be divided into classification or prediction model discovery, data summarization, clustering, association rule discovery, sequence pattern discovery, dependency or dependency model discovery, anomaly and trend discovery, and so on. By mining object, there are relational databases, object-oriented databases, spatial databases, temporal databases, text data sources, multimedia databases, heterogeneous databases, legacy databases and the World Wide Web. By mining method, it can be roughly divided into machine learning methods, statistical methods, neural network methods and database methods. Machine learning can be subdivided into inductive learning (decision trees, rule induction, etc.), case-based learning, genetic algorithms, and so on. Statistical methods can be subdivided into regression analysis (multiple regression, autoregression, etc.), discriminant analysis (Bayesian discriminant, Fisher discriminant, non-parametric discriminant, etc.), cluster analysis (hierarchical clustering, dynamic clustering, etc.), exploratory analysis (principal component analysis, correlation analysis, etc.) and so on. Neural network methods can be subdivided into feed-forward neural networks (the BP algorithm, etc.) and self-organizing neural networks (self-organizing feature maps, competitive learning, etc.). Database methods are mainly multidimensional data analysis, or OLAP, methods, together with attribute-oriented induction.

From the perspective of mining tasks and mining methods, four discovery tasks are particularly important: data summarization, classification discovery, clustering, and association rule discovery.

5.1 Data summarization

The purpose of data summarization is to condense data and give a compact description of it. The traditional and simplest methods are to compute statistics such as sums, averages and variances over the fields of a database, or to present the data graphically with histograms, pie charts and the like. Data mining approaches summarization mainly from the angle of data generalization. Data generalization is the process of abstracting the relevant data in a database from a low conceptual level to a higher one. Because the data or objects in a database always carry the most primitive, basic information (so that no potentially useful information is lost), people sometimes want to process or browse the data from a higher-level view, and the data must therefore be generalized at different levels to meet various query requirements. At present there are two main techniques for data generalization: multidimensional data analysis and attribute-oriented induction.

1. Multidimensional data analysis is a data warehouse technique, also known as online analytical processing (OLAP). A data warehouse is an integrated, stable, historical, time-variant collection of data for decision support. The premise of decision-making is data analysis, which frequently uses aggregation operations such as sums, averages, maxima and minima, and these operations are particularly computation-intensive. For the convenience of decision-support systems, it is therefore natural to compute and store the results of aggregation operations in advance; the store holding these precomputed aggregates is called a multidimensional database. Multidimensional data analysis has been successfully applied in decision-support systems, for example the well-known SAS data analysis package, Business Objects' decision-support system, and IBM's decision analysis tools.
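
The sketch below imitates, on a tiny made-up fact table, what such precomputation looks like using pandas: aggregates are computed once per combination of the region and quarter dimensions (including "All" margins) so that later queries can read them instead of recomputing. The table and column names are invented.

```python
import pandas as pd

# Hypothetical sales facts with two dimensions (region, quarter) and one measure.
facts = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "amount":  [1200, 900, 1500, 700, 1100],
})

# Pre-compute the aggregates a decision-support query would otherwise
# recompute on the fly: one cell per (region, quarter) combination, plus totals.
cube = facts.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum", margins=True)
print(cube)
```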

Multidimensional data analysis summarizes the data in a data warehouse, and the data warehouse stores offline, historical data.

2. To handle online data, researchers proposed the attribute-oriented induction method. Its idea is to generalize directly over the data view the user is interested in (which can be obtained with an ordinary SQL query), rather than storing generalized data in advance as multidimensional data analysis does. The proponents of the method call this data generalization technique attribute-oriented induction. After generalization, the original relation becomes a generalized relation, which summarizes the original relation at a higher conceptual level. The generalized relation can then be processed further to produce knowledge that meets the user's needs, such as characteristic rules, discrimination rules, classification rules and association rules derived from it.
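
A minimal sketch of the idea, assuming a hypothetical one-level concept hierarchy: attribute values are climbed up the hierarchy, and tuples that become identical are merged, with a count of how many raw tuples each generalized tuple covers.

```python
import pandas as pd

# Hypothetical concept hierarchy: raw item -> higher-level concept.
hierarchy = {"jacket": "coat", "ski shirt": "coat", "dress shirt": "shirt"}

# A user-selected data view, as might be returned by an ordinary SQL query.
view = pd.DataFrame({
    "item": ["jacket", "ski shirt", "dress shirt", "jacket"],
    "city": ["Beijing", "Beijing", "Shanghai", "Beijing"],
})

# Climb the concept tree, then merge tuples that become identical,
# keeping a count of how many raw tuples each generalized tuple covers.
view["item"] = view["item"].map(hierarchy)
generalized = view.groupby(["item", "city"]).size().reset_index(name="count")
print(generalized)
```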

5.2 Classification discovery

Classification is a very important task in data mining and is currently the most widely used in business. The purpose of classification is to learn a classification function or classification model (also called a classifier) that can map data items in the database to one of a set of given categories. Both classification and regression can be used for prediction: the purpose of prediction is to derive automatically, from historical data records, a general description of the given data, so as to predict future data. Unlike regression, the output of classification is a discrete category value, whereas the output of regression is a continuous value.

To construct a classifier, a training data set is needed as input. The training set consists of a set of database records or tuples; each tuple is a feature vector made up of the values of the relevant fields (also called attributes or features), and each training sample also carries a class label. A sample thus takes the form (v1, v2, ..., vn; c), where vi are the field values and c is the class.

Classifier construction methods include statistical methods, machine learning methods and neural network methods. Statistical methods include Bayesian methods and non-parametric methods (nearest-neighbour or case-based learning), whose knowledge representations are discriminant functions and prototype cases respectively. Machine learning methods include decision tree methods and rule induction methods; the former are represented as decision (discriminant) trees, the latter generally as production rules. The neural network method is mainly the BP algorithm, whose model is a feed-forward neural network (a framework of nodes representing neurons and edges representing connection weights); the BP algorithm is essentially a nonlinear discriminant function. In addition, a newer method, rough sets, has recently appeared, whose knowledge representation is production rules.

Different classifiers have different characteristics. There are three scales for evaluating or comparing classifiers: (1) prediction accuracy; (2) computational complexity; (3) simplicity of the model description. Prediction accuracy is the most commonly used measure, especially for predictive classification tasks; the generally accepted method at present is 10-fold cross-validation. Computational complexity depends on implementation details and the hardware environment; in data mining, because the objects of computation are huge databases, space and time complexity are very important considerations. For descriptive classification tasks, the simpler the model description the better: rule-based classifier construction methods are more useful here, whereas the results produced by neural network methods are hard to understand.
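
A minimal sketch of the accuracy scale using 10-fold cross-validation in scikit-learn; the bundled iris data and the decision tree settings are stand-ins for a real training set and classifier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Each record is a feature vector (v1, ..., vn) with a class label c.
X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: train on 9 folds, test on the held-out fold, repeat.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```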

In addition, it should be noted that classification performance generally depends on the characteristics of the data: some data are noisy, some have missing values, some are sparsely distributed, some fields or attributes are highly correlated, and some attributes are discrete while others are continuous or mixed. It is generally accepted at present that no single method suits data of every characteristic.

5.3 Clustering

Clustering divides a group of individuals into classes according to similarity ("birds of a feather flock together"); the aim is to make the distance between individuals of the same class as small as possible and the distance between individuals of different classes as large as possible. Clustering methods include statistical methods, machine learning methods, neural network methods and database-oriented methods.

In statistics, clustering is called cluster analysis, one of the three major methods of multivariate data analysis (the other two being regression analysis and discriminant analysis). It mainly studies clustering based on geometric distance, such as Euclidean distance and Minkowski distance. Traditional statistical cluster analysis methods include hierarchical clustering, decomposition, addition, dynamic clustering, ordered-sample clustering, overlapping clustering and fuzzy clustering. This kind of clustering is based on global comparison: all individuals must be examined before the classes can be determined, so all the data must be given in advance and new data objects cannot be added dynamically. Moreover, cluster analysis does not have linear computational complexity, so it is difficult to apply to very large databases.
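
As a brief sketch of distance-based clustering, the snippet below runs scikit-learn's k-means (which minimizes within-cluster Euclidean distance) on synthetic two-dimensional points; note that all the data and the number of groups must be given up front, in line with the limitations described above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points forming two loose groups.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
                    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2))])

# k-means assigns each point to the nearest of k centres and iterates.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_[:5])          # cluster label of the first few points
print(km.cluster_centers_)     # the two group centres that were found
```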

In machine learning, clustering is called unsupervised or teacher-less induction, because, compared with classification learning, the examples or data objects in classification learning carry class labels while the examples to be clustered do not; the labels must be determined automatically by the clustering algorithm. In much of the artificial intelligence literature, clustering is also called conceptual clustering, because the distance here is no longer the geometric distance of statistical methods but is determined from descriptions of concepts. When the objects to be clustered can be added dynamically, conceptual clustering is called concept formation.

In neural networks there is an unsupervised learning approach: self-organizing neural network methods, such as Kohonen's self-organizing feature map and competitive learning networks. In the data mining field, the reported neural network clustering method is mainly the self-organizing feature map; IBM specifically mentions using this method for database clustering and segmentation in its white paper on data mining.

5.4 Association rule discovery

Association rules are rules of the form "90% of customers who buy bread and butter also buy milk" (bread + butter → milk). The main object of association rule discovery is the transaction database, and the typical application is sales data, also known as shopping-basket data. A transaction generally consists of the transaction time, the set of items the customer bought, and sometimes a customer identifier (such as a credit card number).

Thanks to the development of bar-code technology, retailers can use point-of-sale registers to collect and store large amounts of sales data. Analyzing these historical transaction data can therefore provide extremely valuable information about customers' purchasing behaviour, for example how to arrange goods on the shelves (putting items that customers often buy together next to each other) and how to plan promotions (which items to offer together). Discovering association rules from transaction data is thus very important for improving decision-making in retail and other commercial activities.

If support and confidence are not considered, there are infinitely many association rules in a transaction database; in practice, people are generally interested only in rules that meet certain support and confidence requirements. In the literature, rules satisfying such requirements (sufficiently high support and confidence) are called strong rules. To find meaningful association rules, two thresholds must therefore be given: minimum support and minimum confidence. The former is the minimum support a user-specified association rule must satisfy, representing the minimum statistical significance a set of items should have; the latter is the minimum confidence a user-specified rule must satisfy, reflecting the minimum reliability of the rule.
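
The snippet below computes support and confidence for a "bread and butter → milk" style rule directly over a handful of invented transactions, and keeps the rule only if it clears the two user-given thresholds.

```python
# Hypothetical shopping-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "diapers"},
]

antecedent, consequent = {"bread", "butter"}, {"milk"}

n = len(transactions)
has_antecedent = sum(antecedent <= t for t in transactions)
has_both = sum((antecedent | consequent) <= t for t in transactions)

support = has_both / n                     # fraction of all transactions containing both sides
confidence = has_both / has_antecedent     # fraction of antecedent transactions that also contain the consequent

# Keep the rule only if it clears the user-given minimum thresholds.
min_support, min_confidence = 0.2, 0.6
if support >= min_support and confidence >= min_confidence:
    print(f"bread, butter -> milk  support={support:.2f} confidence={confidence:.2f}")
```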

In practice, the more useful association rules are generalized association rules. Because there are hierarchical relationships among item concepts (for example, jackets and ski shirts belong to coats, and coats and shirts belong to clothes), the hierarchy can help discover more meaningful rules, such as "customers who buy coats also buy shoes" (here coats and shoes are higher-level items or concepts, so the rule is a generalized association rule). Because a shop or supermarket carries thousands of items, the support of each individual item (such as ski shirts) is on average very low, so useful rules are sometimes hard to find; but if higher-level items (such as coats) with higher support are considered, useful rules may well be found. In addition, the idea of association rule discovery can also be used for sequence pattern discovery: when customers buy goods, besides the associations described above there are also temporal or sequential regularities, since customers often buy some things on one visit, buy related things on the next visit, and then buy further related things after that.