Watermelon Book Chapter 4 Decision Tree
This chapter mainly introduces what a decision tree is and how to build one. The first three sections explain how to construct a decision tree over discrete attributes, and the fourth section explains how to handle continuous values when building a decision tree.

Decision tree: based on a tree structure, it makes judgments and decisions according to the attributes of a sample. For example, given a watermelon's attributes, such as color = "green", root = "curled up" and knocking sound = "dull", we judge from these attributes whether the watermelon is a good melon or not.

The leaf nodes of a decision tree correspond to the decision results, each of the other nodes corresponds to a test on an attribute, and the root node contains the full sample set.

How do we choose the optimal split attribute? The purpose of splitting by attributes is to make the samples at each node of the decision tree belong to the same class as much as possible, that is, the "purity" of the nodes keeps increasing.

There are three commonly used measures of purity, each of which is the criterion of a classic decision tree algorithm.

Let's first look at a new definition, information entropy, which is a commonly used measure of the purity of a sample set. Assuming the proportion of the $k$-th class of samples in the current sample set $D$ is $p_k$ $(k = 1, 2, \ldots, |\mathcal{Y}|)$, the information entropy of $D$ is:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

The smaller the information entropy, the higher the purity of D.
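
As a quick sanity check, here is a minimal sketch of the entropy formula in Python; the plain list-of-labels data layout and the toy counts are my own choices for illustration, not the book's code.

```python
# A minimal sketch of Ent(D) = -sum_k p_k * log2(p_k).
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# 8 "good" and 9 "bad" labels give an entropy close to 1 bit, i.e. low purity.
print(entropy(["good"] * 8 + ["bad"] * 9))  # ≈ 0.998
```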

Suppose discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \ldots, a^V\}$. If the sample set $D$ is split by attribute $a$, $V$ branch nodes are produced, and the $v$-th branch node contains all samples in $D$ whose value on attribute $a$ is $a^v$, recorded as $D^v$. $\mathrm{Ent}(D^v)$ denotes the information entropy of $D^v$, and $|D^v|/|D|$ represents the proportion of samples whose value on $a$ is $a^v$.

After these definitions are clear, let's take a look at what information gain is. Information gain: simply put, it is the information entropy of $D$ minus the weighted average of the information entropies of the subsets obtained by splitting on attribute $a$:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

**Note:** the greater the information gain, the higher the purity after splitting on this attribute. Information gain is the criterion used by the ID3 decision tree learning algorithm.
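
Continuing the sketch above, the gain can be computed like this; the (attribute_value, label) pair layout and the toy data are assumptions of mine, and the entropy helper is repeated so the snippet runs on its own.

```python
# A minimal sketch of Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v).
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples):
    """samples: list of (attribute_value, class_label) pairs."""
    branches = defaultdict(list)
    for v, y in samples:
        branches[v].append(y)
    weighted = sum(len(ys) / len(samples) * entropy(ys) for ys in branches.values())
    return entropy([y for _, y in samples]) - weighted

# A toy attribute that separates the two classes perfectly recovers the full entropy.
data = [("green", "good"), ("green", "good"), ("pale", "bad"), ("pale", "bad")]
print(information_gain(data))  # 1.0
```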

The gain ratio is the criterion used by the C4.5 decision tree algorithm, and it is an improvement on information gain:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

Definition:

$$\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

is called the "intrinsic value" of attribute $a$.

C4.5 does not directly choose the attribute with the highest gain ratio. First, the information gain of each attribute is calculated and its average is taken; the attributes whose information gain is above the average are selected, and then the attribute with the highest gain ratio among them is chosen.
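
Below is a small sketch of the gain ratio itself (the filtering heuristic above just sits on top of it); the helpers are repeated from the previous sketches so the snippet is self-contained, and the toy data is made up.

```python
# A minimal sketch of C4.5's gain ratio: Gain_ratio(D, a) = Gain(D, a) / IV(a).
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples):
    branches = defaultdict(list)
    for v, y in samples:
        branches[v].append(y)
    return entropy([y for _, y in samples]) - sum(
        len(ys) / len(samples) * entropy(ys) for ys in branches.values())

def gain_ratio(samples):
    """Information gain divided by the intrinsic value IV(a) of the attribute."""
    n = len(samples)
    counts = Counter(v for v, _ in samples)
    iv = -sum((c / n) * log2(c / n) for c in counts.values())
    return information_gain(samples) / iv if iv > 0 else 0.0

data = [("green", "good"), ("green", "good"), ("pale", "bad"), ("pale", "bad")]
print(gain_ratio(data))  # gain 1.0 / intrinsic value 1.0 = 1.0
```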

Gini index: the Gini value reflects the probability that the class labels of two samples randomly drawn from data set $D$ are inconsistent; it can also be understood as 1 minus the probability that two randomly drawn samples belong to the same class.

Formula:

$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

When we want to calculate the Gini index of an attribute $a$:

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$$

The attribute with the smallest Gini index is the optimal split attribute; this is the criterion used by the CART decision tree algorithm.
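
Here is a minimal sketch of both formulas, using the same made-up (attribute_value, label) layout as the earlier sketches.

```python
# A minimal sketch of Gini(D) = 1 - sum_k p_k^2 and
# Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v).
from collections import Counter, defaultdict

def gini(labels):
    """Gini value of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(samples):
    """samples: list of (attribute_value, class_label) pairs."""
    branches = defaultdict(list)
    for v, y in samples:
        branches[v].append(y)
    return sum(len(ys) / len(samples) * gini(ys) for ys in branches.values())

data = [("green", "good"), ("green", "good"), ("pale", "bad"), ("pale", "bad")]
print(gini_index(data))  # 0.0: every branch is pure, so this attribute is an ideal split
```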

Pruning is the main means by which decision tree learning algorithms deal with "overfitting", and the basic strategies are "pre-pruning" and "post-pruning".

Pre-pruning: during the generation of the decision tree, each node is evaluated before it is split. If splitting it improves the generalization performance of the decision tree, the split is performed; otherwise the current node is marked as a leaf node.

Post-pruning: first generate a complete decision tree from the training set, then examine the nodes from the bottom up and judge whether replacing the subtree rooted at a node with a leaf node improves generalization performance; if so, replace it.

We use the performance evaluation methods from Chapter 2 to judge whether generalization performance improves. Taking the hold-out method as an example: pre-pruning compares the validation accuracy before and after splitting a node, and the node is split only if accuracy increases; post-pruning first generates the full decision tree, then compares the validation accuracy before and after replacing a subtree with a leaf node to decide whether to replace it (the examples in the book are easy to follow, so I won't repeat them here).
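
Here is a hedged sketch of the pre-pruning decision under the hold-out method; the function and argument names are mine, and each branch is treated as a leaf that predicts the majority class of its own training samples.

```python
# A sketch of the pre-pruning check: split a node only if hold-out accuracy improves.
from collections import Counter

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def should_split(node_train_labels, node_val_labels, branches):
    """branches: list of (branch_train_labels, branch_val_labels) for a candidate split;
    every validation sample of the node falls into exactly one branch."""
    # Accuracy if the node stays a leaf predicting the training majority class.
    leaf_pred = majority_label(node_train_labels)
    acc_leaf = sum(y == leaf_pred for y in node_val_labels) / len(node_val_labels)
    # Accuracy if we split: each branch becomes a leaf with its own training majority class.
    correct = sum(sum(y == majority_label(tr) for y in va) for tr, va in branches)
    acc_split = correct / len(node_val_labels)
    return acc_split > acc_leaf

# Toy check: the split separates the validation samples better than a single leaf.
print(should_split(["good", "good", "bad"], ["good", "bad"],
                   [(["good", "good"], ["good"]), (["bad"], ["bad"])]))  # True
```

Post-pruning runs the same comparison in the opposite direction, deciding whether replacing a subtree with a leaf improves the validation accuracy.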

The decision tree generation in the first three sections takes discrete values as an example. Here, let's talk about how to generate a decision tree from continuous values.

By the way, linear regression in Chapter 3 required discrete values to be converted into continuous ones (ordinal attributes by mapping them to ordered numeric values, non-ordinal ones by vectorization); in Chapter 4, decision trees instead require continuous values to be discretized.

Bi-partition (dichotomy): assume continuous attribute $a$ takes $n$ different values in sample set $D$, and sort these values from small to large as $\{a^1, a^2, \ldots, a^n\}$. We can take a split point $t$ to divide $D$ into subsets $D_t^-$ and $D_t^+$, where $D_t^-$ contains the samples whose value on attribute $a$ is no greater than $t$, and $D_t^+$ contains the samples whose value is greater than $t$. Because $D$ is divided into two, the method is called bi-partition.

Note that $t$ is usually chosen as the midpoint of two adjacent attribute values, so the set of candidate split points is

$$T_a = \left\{ \frac{a^i + a^{i+1}}{2} \;\middle|\; 1 \le i \le n-1 \right\}$$

After applying bi-partition, we need to judge whether the division is optimal, so we use the information gain defined above. The information gain for a continuous attribute is:

$$\mathrm{Gain}(D, a) = \max_{t \in T_a} \mathrm{Gain}(D, a, t) = \max_{t \in T_a} \left[ \mathrm{Ent}(D) - \sum_{\lambda \in \{-, +\}} \frac{|D_t^\lambda|}{|D|}\,\mathrm{Ent}(D_t^\lambda) \right]$$

Note: the greater the information gain, the better the corresponding split point. Also note that, unlike a discrete attribute, a continuous attribute used as the split attribute of the current node can still serve as a split attribute in its descendant nodes.
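
A small sketch of the bi-partition search follows: candidate points are the midpoints of adjacent sorted values, and the best one maximizes the information gain. The (value, label) layout and the toy numbers are made up, and the entropy helper is repeated for self-containment.

```python
# A minimal sketch of choosing the best split point t for a continuous attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(samples):
    """samples: list of (continuous_value, class_label); returns (best t, its gain)."""
    values = sorted({v for v, _ in samples})
    labels = [y for _, y in samples]
    best_t, best_gain = None, float("-inf")
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint of two adjacent attribute values
        left = [y for v, y in samples if v <= t]
        right = [y for v, y in samples if v > t]
        gain = (entropy(labels)
                - len(left) / len(samples) * entropy(left)
                - len(right) / len(samples) * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy continuous attribute (values are invented, loosely "density"-like):
print(best_split([(0.2, "bad"), (0.3, "bad"), (0.6, "good"), (0.7, "good")]))  # (0.45, 1.0)
```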

When some attribute values of the samples are missing, simply discarding those samples wastes information, and with a limited number of samples this loss may leave the learner performing poorly.

Facing a sample set with missing attribute values, we need to solve two problems: ① how to choose the split attribute when attribute values are missing; ② given the split attribute, how to split the samples whose value on that attribute is missing.

For the first problem, we judge the quality of attribute $a$ using only the samples whose value on $a$ is not missing. Each sample is given a weight (initially 1): a sample with a known value on the split attribute goes into a single branch with its weight unchanged, while a sample with a missing value is divided among the branches by weight.

Given training set $D$ and attribute $a$, let $\tilde{D}$ denote the subset of samples in $D$ whose value on $a$ is not missing. Suppose $a$ has $V$ possible values $\{a^1, \ldots, a^V\}$; let $\tilde{D}^v$ denote the subset of $\tilde{D}$ whose value on $a$ is $a^v$, and $\tilde{D}_k$ the subset of $\tilde{D}$ belonging to the $k$-th class. Give each sample $\boldsymbol{x}$ a weight $w_{\boldsymbol{x}}$ and define:

$$\rho = \frac{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in D} w_{\boldsymbol{x}}}$$

For attribute $a$, $\rho$ represents the proportion of samples without missing values.

$$\tilde{p}_k = \frac{\sum_{\boldsymbol{x} \in \tilde{D}_k} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}}$$

For attribute $a$, $\tilde{p}_k$ indicates the proportion of samples of the $k$-th class among the samples without missing values.

$$\tilde{r}_v = \frac{\sum_{\boldsymbol{x} \in \tilde{D}^v} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}}$$

For attribute $a$, $\tilde{r}_v$ represents the proportion of samples taking the value $a^v$ on $a$ among the samples without missing values.

Information gain:

$$\mathrm{Gain}(D, a) = \rho \times \mathrm{Gain}(\tilde{D}, a) = \rho \times \left( \mathrm{Ent}(\tilde{D}) - \sum_{v=1}^{V} \tilde{r}_v\,\mathrm{Ent}(\tilde{D}^v) \right),$$

where

$$\mathrm{Ent}(\tilde{D}) = -\sum_{k=1}^{|\mathcal{Y}|} \tilde{p}_k \log_2 \tilde{p}_k.$$

Through the information gain above we can judge which attribute is best as the split attribute, which settles the first problem. The second problem is how to divide the samples with missing attribute values among the branches by weight. An example makes this easier to understand: suppose the split attribute is $a$, and $a$ has three values 1, 2 and 3. First, the samples with known values of $a$ are placed into the corresponding branches with their weights unchanged. Suppose there are 10 such samples: 5 with value 1, 3 with value 2 and 2 with value 3. Then a sample whose value on $a$ is missing is placed into all three branches simultaneously, with its weight multiplied by 5/10, 3/10 and 2/10 respectively.
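
To tie the definitions together, here is a hedged sketch of the weighted information gain; the (value_or_None, label, weight) layout is my own choice, with None marking a missing value.

```python
# A sketch of Gain(D, a) = rho * (Ent(D~) - sum_v r~_v * Ent(D~^v)) with sample weights.
from collections import defaultdict
from math import log2

def weighted_entropy(pairs):
    """pairs: list of (class_label, weight)."""
    total = sum(w for _, w in pairs)
    by_class = defaultdict(float)
    for y, w in pairs:
        by_class[y] += w
    return -sum((w / total) * log2(w / total) for w in by_class.values())

def missing_value_gain(samples):
    """samples: list of (attribute_value_or_None, class_label, weight)."""
    total_w = sum(w for _, _, w in samples)
    known = [(v, y, w) for v, y, w in samples if v is not None]
    known_w = sum(w for _, _, w in known)
    rho = known_w / total_w                     # weight share of non-missing samples
    branches = defaultdict(list)
    for v, y, w in known:
        branches[v].append((y, w))
    ent_known = weighted_entropy([(y, w) for _, y, w in known])
    cond = sum(sum(w for _, w in ys) / known_w * weighted_entropy(ys)
               for ys in branches.values())
    return rho * (ent_known - cond)

# The last sample's value is missing; it only lowers rho here. In the tree it would
# then enter every branch with its weight scaled by that branch's proportion r~_v.
data = [(1, "good", 1.0), (1, "good", 1.0), (2, "bad", 1.0), (None, "bad", 1.0)]
print(missing_value_gain(data))  # 0.75 * 0.918... ≈ 0.689
```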

If each attribute is regarded as a coordinate axis, a sample described by $d$ attributes corresponds to a point in $d$-dimensional space.

The classification boundary formed by a decision tree has an obvious feature: it is axis-parallel, that is, the classification boundary consists of several segments parallel to the coordinate axes.