Decision tree algorithm
The C4.5 algorithm inherits the strengths of the ID3 algorithm and improves on it in the following ways:

1) It selects attributes by information gain ratio, which corrects the bias of plain information gain toward attributes with many distinct values;

2) It prunes during tree construction;

3) It can discretize continuous attributes;

4) It can handle incomplete data.

C4.5 has the advantage that the classification rules it generates are easy to understand and reasonably accurate. Its drawback is that the data set must be scanned and sorted repeatedly while the tree is being built, which makes the algorithm inefficient. In addition, C4.5 only works on data sets that fit in memory; when the training set is too large for memory, the program cannot run.
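The information gain ratio mentioned in point 1) can be sketched as follows. This is an illustrative implementation with function names of my own choosing, not code from any particular library:

```python
# Sketch of the information gain ratio used by C4.5 for attribute selection.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """Information gain of an attribute divided by its split information."""
    total = len(labels)
    # Partition the class labels by the attribute's value.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    # Information gain: parent entropy minus the weighted child entropies.
    gain = entropy(labels) - sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    # Split information penalizes attributes with many distinct values,
    # which is exactly the ID3 bias that C4.5 corrects.
    split_info = -sum(
        (len(part) / total) * log2(len(part) / total)
        for part in partitions.values()
    )
    return gain / split_info if split_info else 0.0

# Toy data: the attribute perfectly separates the two classes.
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(gain_ratio(rows, labels, 0))  # → 1.0
```

Dividing the gain by the split information is what prevents a many-valued attribute (such as an ID column) from always winning the selection.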

The specific algorithm steps are as follows:

1 Create node N.

2 If the training set is empty, return node N marked "failure".

3 If all the records in the training set belong to the same class, mark node N with that class.

4 If the candidate attribute list is empty, return N as a leaf node marked with the most common class in the training set.

5 For each candidate attribute in attribute_list:

6 If the attribute is continuous, then

7 discretize this attribute.

8 Select the attribute D with the highest information gain ratio from attribute_list.

9 Mark node N with attribute D.

10 For each known value d of attribute D:

11 Grow a branch from node N for the condition D = d.

12 Let s be the set of training samples for which D = d.

13 If s is empty,

14 attach a leaf marked with the most common class in the training set;

15 else attach the subtree returned by C4.5(R - {D}, C, s).
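The steps above can be sketched as a recursive build. This is illustrative pseudocode made runnable, not a full C4.5 implementation: the scoring function is passed in as a parameter (C4.5 would use the gain ratio), and discretization and pruning are omitted:

```python
# Minimal recursive tree build following the numbered steps above.
from collections import Counter

def majority_class(labels):
    """Most common class label in the training set (step 4 / step 14)."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attrs, score):
    # Steps 1-2: an empty training set yields a "failure" leaf.
    if not rows:
        return "failure"
    # Step 3: all records share one class -> leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 4: no candidate attributes left -> leaf with the majority class.
    if not attrs:
        return majority_class(labels)
    # Steps 5-9: pick the best-scoring attribute (gain ratio in C4.5).
    best = max(attrs, key=lambda a: score(rows, labels, a))
    node = {"attr": best, "branches": {}}
    remaining = [a for a in attrs if a != best]
    # Steps 10-15: grow one branch per value of the chosen attribute.
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        node["branches"][value] = build_tree(sub_rows, sub_labels,
                                             remaining, score)
    return node

# Toy run; a trivial score stands in for the gain ratio here.
rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels, [0, 1],
                  lambda r, l, a: len(set(x[a] for x in r)))
```

Each internal node records the chosen attribute and one branch per observed value; leaves carry a class label, matching steps 3, 4, and 14.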

CART (Classification and Regression Trees) is an effective nonparametric method for both classification and regression. It makes predictions by constructing a binary tree.

The CART model was first proposed by Breiman et al. and has been widely used in statistics and data mining. It constructs prediction criteria in a way quite different from traditional statistics: the model is given as a binary tree, which makes it easy to understand, use, and interpret. In many cases the prediction tree built by CART is more accurate than the algebraic prediction criteria built by common statistical methods, and the more complex the data and the more variables there are, the more pronounced the algorithm's advantage. The key to this model lies in the construction and accuracy of the prediction criteria.

Definition:

Classification and regression trees first build prediction criteria from known multivariate data, and then predict one variable from the values of the others. In classification, one first measures an object and then determines, by some classification standard, which category the object belongs to. For example, given the identifying characteristics of a fossil, predict which family, genus, or even species it belongs to. Another example: given the geological and geophysical information of an area, predict whether the area contains ore deposits. Regression differs from classification in that it predicts a value for an object rather than the object's category. For example, given the characteristics of mineral resources in an area, predict the amount of resources in that area.
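The binary splits at the heart of CART can be sketched for a single numeric feature: find the threshold that minimizes the weighted Gini impurity of the two resulting groups. This is a hand-rolled illustration with my own function names and toy data loosely echoing the ore-deposit example above, not a complete CART implementation:

```python
# One CART-style binary split: choose the threshold on a numeric feature
# that minimizes the weighted Gini impurity of the two child groups.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) of the best split v <= t vs. v > t."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # candidate thresholds
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: ore grade vs. whether the area contains a deposit.
grades = [0.1, 0.2, 0.3, 0.8, 0.9]
has_mine = ["no", "no", "no", "yes", "yes"]
print(best_binary_split(grades, has_mine))  # → (0.3, 0.0)
```

A full CART builder would apply this split recursively and, for regression, replace Gini impurity with the variance of the target values in each group.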