Research Status of Data Mining at Home and Abroad
With the rapid development of network and database technology and the wide application of database management systems, people have accumulated more and more data. Data mining extracts hidden information and knowledge from large volumes of practical application data. It draws on many technologies, including databases, artificial intelligence and mathematical statistics, and is a method of in-depth data analysis.

Keywords: data mining; knowledge; analysis; marketing; financial investment

As the volume of accumulated data has grown, data mining technology has come into being. This paper gives a brief introduction to data mining technology and its applications.

First, the definition of data mining

Data mining is the process of extracting hidden, previously unknown but potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy and random practical application data. It is a new business information processing technology whose main feature is to extract, transform, analyze and model large volumes of business data from commercial databases, and to extract the key data that assists business decision-making. In short, data mining is a method of in-depth data analysis. From this point of view, data mining can also be described as an advanced and effective method of exploring and analyzing large volumes of enterprise data in order to reveal hidden or unknown regularities, or to verify known ones, and to model them further.

Second, data mining technology

Data mining technology is the result of long-term research and development in database technology, and the development of data warehouse technology is closely related to data mining. In most cases, data mining first takes the data out of the data warehouse and puts it into a data mining library or data mart, because the data warehouse has already cleaned the data and resolved data inconsistencies, which brings many benefits to data mining. In addition, data mining takes advantage of advances in artificial intelligence (AI) and statistical analysis, both of which are devoted to pattern discovery and prediction. Databases, artificial intelligence and mathematical statistics are the three pillars of data mining technology. The technologies used differ according to the kind of knowledge to be discovered.

1. Generalized knowledge. This is general descriptive knowledge of the characteristics of a category. Starting from the micro-level characteristics of the data, it discovers representative, universal knowledge at higher conceptual levels (meso and macro) that reflects what similar things have in common; it is a generalization, refinement and abstraction of the data. There are many methods and techniques for discovering generalized knowledge, such as the data cube and attribute-oriented induction. The basic idea of the data cube is to precompute commonly used, costly aggregate functions, such as count, sum, average and maximum, and to store the resulting materialized views in a multidimensional database. Attribute-oriented induction expresses the data mining query in an SQL-like language, collects the relevant data set from the database, and then generalizes that data set by applying a series of techniques, including attribute removal, concept tree ascension, attribute threshold control, and propagation of counts and other aggregate values.
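The data cube idea described above can be illustrated with a minimal sketch. The sales records, the dimension names and the function build_cube are hypothetical and chosen only for illustration; a real multidimensional database would store and index these views far more efficiently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales records: (region, product, quarter, amount).
SALES = [
    ("North", "Tea", "Q1", 120.0),
    ("North", "Tea", "Q2", 80.0),
    ("North", "Coffee", "Q1", 200.0),
    ("South", "Tea", "Q1", 50.0),
    ("South", "Coffee", "Q2", 150.0),
]
DIMENSIONS = ("region", "product", "quarter")


def build_cube(rows, dimensions):
    """Precompute count, sum, max and average for every subset of dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), k):
            cells = defaultdict(
                lambda: {"count": 0, "sum": 0.0, "max": float("-inf")}
            )
            for row in rows:
                key = tuple(row[i] for i in dims)  # group-by key for this view
                amount = row[-1]
                cell = cells[key]
                cell["count"] += 1
                cell["sum"] += amount
                cell["max"] = max(cell["max"], amount)
            for cell in cells.values():
                cell["avg"] = cell["sum"] / cell["count"]
            # Store the aggregated view under the names of its dimensions.
            cube[tuple(dimensions[i] for i in dims)] = dict(cells)
    return cube


cube = build_cube(SALES, DIMENSIONS)
print(cube[("region",)][("North",)])           # totals for one region
print(cube[("region", "product")][("North", "Tea")])
print(cube[()][()])                            # grand total over all records
```

Once the views are stored, a query such as "average sales per region" becomes a dictionary lookup instead of a scan over the raw records, which is the saving the data cube is meant to provide.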

2. Association knowledge. This reflects the dependence or correlation between one event and other events. If two or more attributes are associated, the value of one attribute can be predicted from the values of the others. The best-known association rule mining algorithms are the Apriori algorithm and the FP-growth algorithm. The discovery of association rules can be divided into two steps: the first step iteratively identifies all frequent itemsets, that is, itemsets whose support is not lower than the minimum support set by the user; the second step constructs, from the frequent itemsets, the rules whose confidence is not lower than the minimum confidence set by the user. Identifying all frequent itemsets is the core of association rule discovery algorithms and also the most computationally intensive part.
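The two steps can be sketched in a few lines in the style of Apriori. This is a simplified illustration and not an optimized implementation; the transactions, the thresholds and the function names are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical shopping transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search for all itemsets with support >= min_support."""
    items = {item for t in transactions for item in t}
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def rules(frequent, min_confidence):
    """Step 2: build rules whose confidence is >= min_confidence."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append(
                        (set(antecedent), set(itemset - antecedent), confidence)
                    )
    return found


frequent = frequent_itemsets(TRANSACTIONS, min_support=0.6)
strong_rules = rules(frequent, min_confidence=0.8)
print(strong_rules)
```

The pruning step is what keeps the first, expensive step tractable: a candidate is counted against the data only if all of its smaller subsets have already been found frequent.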

3. Classification knowledge. This reflects the characteristics common to things of the same class and the characteristics that distinguish different classes. Methods listed under classification include decision trees, naive Bayes, neural networks, genetic algorithms, rough set methods, fuzzy set methods, linear regression and K-means partitioning (the last of these is, strictly speaking, a clustering method rather than a classifier). The most typical classification method is the decision tree, which is constructed from a set of examples and is a supervised learning method.

First, a decision tree is built from a training subset. If the tree cannot classify all objects correctly, some of the exceptions are added to the training subset and the process is repeated until a correct decision tree is formed. The final result is a tree whose leaf nodes are class names and whose internal nodes are attributes with branches, each branch corresponding to one possible value of that attribute.
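The procedure above can be sketched as follows. The text does not say how the attribute at each node is chosen; this sketch assumes information gain, as in ID3. The weather-style examples and all function names are hypothetical, and for simplicity every misclassified example is added to the training subset rather than a selection of them.

```python
import math
from collections import Counter

# Hypothetical examples: (attribute values, class name).
DATA = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "overcast", "windy": "no"}, "play"),
    ({"outlook": "overcast", "windy": "yes"}, "play"),
]


def entropy(rows):
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def build_tree(rows, attributes):
    """Return a class name (leaf) or (attribute, {value: subtree})."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        remainder = 0.0
        for value in {x[attr] for x, _ in rows}:
            subset = [r for r in rows if r[0][attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(rows) - remainder

    best = max(attributes, key=gain)  # assumed criterion: information gain
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in {x[best] for x, _ in rows}:
        subset = [r for r in rows if r[0][best] == value]
        branches[value] = build_tree(subset, rest)
    return (best, branches)


def classify(tree, x, default=None):
    while isinstance(tree, tuple):
        attr, branches = tree
        if x[attr] not in branches:
            return default  # value never seen during training
        tree = branches[x[attr]]
    return tree


def windowed_tree(rows, attributes, window_size):
    """Grow the training subset with exceptions until none remain."""
    window = list(rows[:window_size])
    while True:
        tree = build_tree(window, attributes)
        exceptions = [
            r for r in rows
            if r not in window and classify(tree, r[0]) != r[1]
        ]
        if not exceptions:
            return tree
        window.extend(exceptions)


tree = windowed_tree(DATA, ["outlook", "windy"], window_size=2)
print(tree)
```

On this small data set the first tree, built from two examples, is a single leaf; the exceptions it misclassifies are then pulled into the training subset until the tree classifies every example correctly.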

4. Predictive knowledge. This infers future data from historical and current time series data, and can also be regarded as association knowledge with time as the key attribute. At present, time series prediction methods include classical statistical methods, neural networks and machine learning. In 1968, Box and Jenkins put forward a relatively complete body of time series modeling theory and analysis methods. These classical mathematical methods predict a time series by establishing a stochastic model. However, a large number of time series are non-stationary, and their characteristic parameters and data distribution change over time. A single neural network prediction model trained once on some historical data therefore cannot predict accurately over the long run. For this reason, retraining methods based on statistics and on accuracy have been proposed: when the existing prediction model is found to be no longer applicable to the current data, the model is retrained to obtain new weight parameters and establish a new model.
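The accuracy-based retraining idea can be sketched as follows. To keep the example short and dependency-free, a first-order linear autoregressive model fitted by least squares stands in for the neural network mentioned above; the series, the window length, the tolerance and all function names are assumptions for illustration only.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a * x[t-1] + b; returns (a, b)."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # constant input: predict the mean
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def forecast_with_retraining(series, window=8, check=3, tolerance=1.0):
    """One-step-ahead forecasts; refit when recent errors grow too large."""
    model = fit_ar1(series[:window])
    predictions, errors, retrained_at = [], [], []
    for t in range(window, len(series)):
        a, b = model
        prediction = a * series[t - 1] + b
        predictions.append(prediction)
        errors.append(abs(prediction - series[t]))
        recent = errors[-check:]
        # The model is judged "no longer applicable" when the mean absolute
        # error over the last `check` forecasts exceeds the tolerance.
        if len(recent) == check and sum(recent) / check > tolerance:
            model = fit_ar1(series[t - window + 1 : t + 1])  # newest data only
            retrained_at.append(t)
            errors.clear()
    return predictions, retrained_at


# Hypothetical non-stationary series: the step size changes from 1 to 5.
series = list(range(12)) + [11 + 5 * k for k in range(1, 13)]
predictions, retrained_at = forecast_with_retraining(series)
print(retrained_at)
```

The model is refitted several times while the training window still mixes old and new behavior, and stops being refitted once the window contains only data from the new regime.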

5. Deviation knowledge. This describes differences and extreme special cases, revealing abnormal phenomena in which things deviate from the norm, such as special cases outside the standard classes and outliers outside the data clusters. All of these kinds of knowledge can be discovered at different conceptual levels, and as the conceptual level rises from micro to meso and macro, they can meet the needs of different users at different levels of decision-making.
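As a small illustration of deviation knowledge, the sketch below flags values that lie far from the mean of a data set. The z-score criterion, the threshold of two standard deviations and the daily sales figures are assumptions chosen for the example; the text does not prescribe a particular outlier test.

```python
import statistics


def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard
    deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]


# Hypothetical daily sales with one abnormal day.
daily_sales = [102, 98, 101, 97, 103, 99, 100, 250, 96, 104]
deviations = find_deviations(daily_sales)
print(deviations)
```

With this data only the day with sales of 250 is flagged; the remaining days lie well within one standard deviation of the mean.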

Third, the data mining process

Data mining refers to the complete process of mining previously unknown, valid and practical information from large databases and using this information to make decisions or enrich knowledge. The general content of each main step in the process is as follows:

1. Identify the business objects. Clearly defining the business problem and understanding the purpose of data mining is an important step. The final result of mining is unpredictable, but the problem to be explored should be foreseeable. Mining for the sake of mining is blind and will not succeed.

2. Data preparation. (1) Data selection. Search all internal and external data related to the business objects and select the data suitable for the data mining application. (2) Data preprocessing. Study the quality of the data and carry out data integration, conversion, reduction, compression and so on, to prepare for further analysis and to determine the type of mining operation to be carried out. (3) Data conversion. Convert the data into an analytical model built for the mining algorithm. Establishing an analytical model that truly suits the mining algorithm is the key to the success of data mining.

3. Data mining. Mine the transformed data. Apart from refining and selecting the appropriate mining algorithm, all other work can be completed automatically.

4. Result analysis. Explain and evaluate the results. Generally speaking, the analysis method used should depend on the mining operation, and visualization technology is usually used.

5. Knowledge assimilation. Integrate the knowledge gained from the analysis into the organizational structure of the business information system.

Fourth, the application of data mining

Data mining technology is application-oriented from the beginning. At present, data mining is a very fashionable word in many fields, especially in banking, telecommunications, insurance, transportation, retail (such as supermarkets) and other commercial fields.

1. Marketing. Owing to the wide use of management information systems and POS systems in commerce, especially in the retail industry, and in particular the use of bar code technology, large amounts of data about customers' purchases can be collected, and the volume of data keeps growing. For marketing, understanding the characteristics of customers' shopping behavior through data analysis is of great help in improving competitiveness and promoting sales. By applying data mining technology to customer data, we can obtain information such as customers' purchasing preferences and interests, thus providing a reliable basis for business decision-making. The application of data mining in marketing can be divided into two categories: database marketing and basket analysis. The task of database marketing is to select potential customers through interactive query, data segmentation and model prediction, so that products can be sold to them. By analyzing the existing customer data, customers can be divided into different grades, and the higher the grade, the greater the likelihood that they will buy. Basket analysis identifies customers' buying behavior patterns by analyzing market sales data. For example, if goods A are purchased, the probability that goods B are also purchased is 95%. Such patterns help to decide the layout of the store and the placement of goods on the shelves in order to promote certain goods, and also help to select and match goods more purposefully. Systems in this field include Opportunity Explorer, which can be used for causal analysis of abnormal sales in supermarkets. In addition, IBM has developed tools for identifying customers' buying behavior patterns (part of Intelligent Miner and QUEST).
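The grading of customers in database marketing can be illustrated with a very simple sketch: customers are segmented, and each segment is graded by how often its members responded to past offers. The campaign history, the segment names and the function grade_segments are hypothetical; a real system would use a predictive model rather than raw response rates.

```python
from collections import defaultdict

# Hypothetical past campaign records: (customer_id, segment, responded).
HISTORY = [
    (1, "frequent", True), (2, "frequent", True), (3, "frequent", False),
    (4, "occasional", True), (5, "occasional", False), (6, "occasional", False),
    (7, "lapsed", False), (8, "lapsed", False),
]


def grade_segments(history):
    """Map each segment to (grade, response rate); grade 1 is the highest."""
    totals, hits = defaultdict(int), defaultdict(int)
    for _, segment, responded in history:
        totals[segment] += 1
        hits[segment] += responded  # True counts as 1, False as 0
    rates = {segment: hits[segment] / totals[segment] for segment in totals}
    ranked = sorted(rates, key=rates.get, reverse=True)
    return {
        segment: (grade, rates[segment])
        for grade, segment in enumerate(ranked, start=1)
    }


grades = grade_segments(HISTORY)
for segment, (grade, rate) in grades.items():
    print(segment, grade, round(rate, 2))
```

New customers who fall into a highly graded segment would then be contacted first, since past data suggests they are the most likely to buy.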

2. Financial investment. Typical fields of financial analysis include investment evaluation and stock market prediction, and the analysis generally relies on model prediction methods (such as neural networks or statistical regression). Because financial investment carries high risk, it is all the more necessary, when making investment decisions, to analyze the data relating to the various investment options in order to choose the best one. Both investment evaluation and stock market prediction are predictions about how things will develop, and such predictions are based on the analysis of data. Data mining can discover the relationships between data objects by processing existing data, and then make reasonable predictions using the patterns it has learned. Systems in this field include Fidelity Stock Selector and LBS Capital Management. The former uses a neural network model to select investments, while the latter uses expert systems, neural networks and genetic algorithms to assist in managing securities worth up to 600 million US dollars.

3. Fraud screening. Banks and commercial enterprises often fall victim to fraudulent acts such as malicious overdrafts, which cause them huge losses. Predicting this kind of fraud can reduce the losses. Fraud screening mainly summarizes the relationship between normal behavior and fraudulent behavior and derives the characteristics of fraud, so that when a transaction shows these characteristics, decision makers can be warned.

The most successful systems in this field are the FALCON system and the FAIS system. FALCON is a credit card fraud assessment system developed by HNC, which has been used by a considerable number of retail banks to detect suspicious credit card transactions. FAIS is a system for identifying financial transactions related to money laundering, and it works from ordinary government data forms. In addition, data mining can also be used for detecting distant stars in astronomy, genetic engineering research, web information retrieval and so on.

Concluding remarks

With the development of database, artificial intelligence, mathematical statistics and computer software and hardware technology, data mining technology will be widely used in more fields.
