Thought: the whole idea is to simplify the complex and grasp the essence of the problem, that is, to reduce the dimension. Of course, since only the essence is kept, some accuracy is naturally sacrificed.
Problem addressed: each variable reflects some information about the problem under study, and because the indicators are correlated with one another, the information carried by the collected statistical data overlaps to some extent. When statistical methods are used to study multivariate problems, too many variables increase both the amount of calculation and the complexity of the analysis.
Ideally, a quantitative analysis would involve as few variables as possible while retaining as much information as possible. To reduce redundancy and noise, we can generally keep one variable out of a group of related variables, or combine several related variables into a single representative variable, so that a few variables stand in for all of them.
Principle: because the many variables involved in an evaluation are correlated, there must be dominant factors. Based on this, by studying the relationship between the original variables and the internal structure of their correlation matrix, a few comprehensive indexes that affect the target variable are found, such that each comprehensive index is a linear combination of the original variables. In this way, the comprehensive indexes not only retain the main information of the original variables but also have some properties superior to them, which makes it easier to grasp the main contradictions when evaluating a complex target.
Intuitive understanding
For example, suppose there are two columns, M and F. The M column is 1 if the student is male and 0 if the student is female; the F column is 0 if the student is male and 1 otherwise. From this relationship we know the two columns are perfectly correlated: as long as we keep one column, we can completely restore the other, so one of them can be deleted without losing information. Of course, dimensionality reduction is not limited to deleting data; it also includes transforming data, and deletion can be understood as one such transformation.
Of course, such a clean situation hardly occurs in real data; the example above only introduces the idea. In practice we must consider which column can be deleted to minimize the loss, or whether the loss can be reduced by transforming the data. How do we measure the loss of information? What are the steps for reducing the dimension of the original data?
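To make this concrete, here is a minimal NumPy sketch (the column values are made up for illustration) showing that the two columns carry the same information: their correlation is exactly -1, and either column can be rebuilt from the other.

```python
import numpy as np

# Toy M/F columns as described above: M is 1 for male, 0 for female, and F = 1 - M.
M = np.array([1, 0, 1, 1, 0, 0])
F = 1 - M

# The correlation coefficient is exactly -1: the columns are perfectly (negatively) correlated.
print(np.corrcoef(M, F)[0, 1])        # -1.0

# Keeping only M loses nothing, because F can always be restored as 1 - M.
print(np.array_equal(F, 1 - M))       # True
```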
Example of coordinates:
Look at the picture below: it is an elliptical cloud of points. An ellipse has a major axis and a minor axis. If we want to express the main trend of variation of this point cloud, we can construct a new coordinate system whose axes are the major and minor axes (or are parallel to them). In the extreme case where the minor axis shrinks to a point, the major axis alone can represent the trend and characteristics of the point cloud, and the two-dimensional data becomes one-dimensional.
Background knowledge
Inner product and projection:
The inner product maps two vectors to a real number: A · B = |A||B|cos(θ). Its geometric meaning is that when B is a unit vector, A · B = |A|cos(θ), the length of the projection of A onto the direction of B. (The figure here uses two-dimensional vectors as an example; the same holds in higher-dimensional space.)
In this formula, B is the unit vector of the direction, that is, the basis direction.
Similarly, taking figure B as an example, the vector B = (3, 2) actually means that its projection onto the x-axis is 3 and its projection onto the y-axis is 2. This carries an implicit assumption: the coordinate axes are the unit vectors in the x and y directions. The x and y axes here are in fact the basis we speak of; they simply default to (1, 0) and (0, 1).
Therefore, to describe a set of vectors, we must first fix a set of basis vectors and then find the projections of each vector onto that basis. The basis vectors are only required to be linearly independent, not necessarily orthogonal, but we usually use an orthonormal basis because of its good properties.
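As a quick numerical check of the projection idea, here is a small NumPy sketch; the vectors (3, 2) and (1, 1) are just example values, not taken from the original figure.

```python
import numpy as np

# Inner product as projection: for a unit vector b, a . b is the length of the
# projection of a onto the direction of b.
a = np.array([3.0, 2.0])
b = np.array([1.0, 1.0])
b_unit = b / np.linalg.norm(b)        # normalize b so it can serve as a unit basis direction

print(a @ b_unit)                     # scalar projection of a onto b: 5/sqrt(2) ≈ 3.54

# In the default basis (1, 0), (0, 1) the projections are simply the coordinates of a.
print(a @ np.array([1.0, 0.0]), a @ np.array([0.0, 1.0]))   # 3.0 2.0
```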
Change of basis
With the principles above, describing (3, 2) in a new basis only requires multiplying the vector by the matrix formed from the new basis vectors.
What if there are several basis vectors and several vectors to describe? Then we multiply by a matrix: stack the basis vectors as the rows of one matrix and the data vectors as the columns of another.
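Here is a minimal sketch of such a change of basis, assuming an orthonormal basis rotated 45° from the standard axes (the particular basis and data values are only examples):

```python
import numpy as np

# Stack the new basis vectors as the rows of a matrix P; the coordinates of a vector v
# in the new basis are then P @ v.
P = np.array([[ 1/np.sqrt(2), 1/np.sqrt(2)],      # first new basis vector
              [-1/np.sqrt(2), 1/np.sqrt(2)]])     # second new basis vector
v = np.array([3.0, 2.0])
print(P @ v)                                      # (3, 2) expressed in the rotated basis

# Several vectors at once: arrange them as columns of a matrix and multiply by P.
X = np.array([[3.0, 1.0, -2.0],
              [2.0, 0.0,  4.0]])
print(P @ X)                                      # each column is one vector in the new basis
```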
How to reduce dimensions
With the ideas above in hand, how do we reduce dimension by a change of basis? Here is an example. Suppose we have the following matrix.
For convenience of processing, we first subtract each field's (variable's) mean from that field, so the matrix becomes as follows.
Plotted in a coordinate system, the data looks as shown below.
Now we want to represent the data with one-dimensional coordinates while keeping as much of the original information as possible. How do we choose the direction (the basis)? (Reducing two dimensions to one.)
The idea is to choose the direction along which the projected values are as spread out as possible, so that points do not pile up on top of one another.
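This criterion can be checked directly: project the centered points onto a candidate direction and measure the variance of the projected values. The sketch below uses a small made-up centered data set (not necessarily the matrix from the example above).

```python
import numpy as np

# Rows are samples, columns are the two variables; each column already has mean 0.
X = np.array([[-1.0, -2.0],
              [-1.0,  0.0],
              [ 0.0,  0.0],
              [ 2.0,  1.0],
              [ 0.0,  1.0]])

def projection_variance(X, direction):
    d = direction / np.linalg.norm(direction)   # unit direction vector
    return np.var(X @ d)                        # variance of the projected values

print(projection_variance(X, np.array([1.0, 0.0])))   # project onto the x axis: 1.2
print(projection_variance(X, np.array([1.0, 1.0])))   # project onto the diagonal: 2.0
# The direction with the larger projection variance preserves more of the data's spread.
```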
Covariance:
In probability theory and statistics, covariance measures the degree to which two random variables vary together. Variance is a special case of covariance: the covariance of a variable with itself.
Expectation: in probability theory and statistics, the expectation (also called the mathematical expectation or expected value) of a discrete random variable is the sum of every possible outcome multiplied by its probability. For example, the expected value of a fair six-sided die is 1×1/6 + 2×1/6 + … + 6×1/6 = 3.5.
The covariance formula is: Cov(X, Y) = E[(X − u)(Y − v)], where E(X) = u and E(Y) = v.
Covariance reflects the joint error of two variables, in contrast to variance, which involves the error of only one variable. If the two variables tend to move together, that is, when one is above its own expected value the other also tends to be above its own expected value, then their covariance is positive. If they tend to move in opposite directions, that is, when one is above its expected value the other tends to be below its expected value, then their covariance is negative. If X and Y are statistically independent, their covariance is 0.
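A short sketch of the definition, with made-up numbers, comparing the formula above against NumPy's built-in np.cov:

```python
import numpy as np

# Covariance by its definition: cov(X, Y) = E[(X - u)(Y - v)], where u = E(X), v = E(Y).
x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

u, v = x.mean(), y.mean()
cov_manual = np.mean((x - u) * (y - v))       # population covariance (divide by n)
cov_numpy = np.cov(x, y, bias=True)[0, 1]     # bias=True makes np.cov divide by n as well

print(cov_manual, cov_numpy)                  # both positive: x and y tend to move together
```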
Process and steps
Step 1: Standardization
Standardize the range of the variables in the input data set so that each variable contributes to the analysis in proportion. Simply put, data on very different scales is turned into comparable data; for example, a variable ranging over 0-100 is converted to one ranging over 0-1. This is usually done by subtracting the mean of each variable and dividing by its standard deviation. The standard deviation is σ = sqrt( Σ (x_i − mean)² / n ).
The commonly used standardized variable is then z = (x − mean) / σ.
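A minimal sketch of this standardization, with two made-up variables on very different scales:

```python
import numpy as np

# Rows are samples; column 0 lives on a 0-100 scale, column 1 on a 0-1 scale.
X = np.array([[90.0, 0.8],
              [60.0, 0.5],
              [75.0, 0.9],
              [30.0, 0.2]])

# Subtract each variable's mean and divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))    # approximately 0 for every variable
print(Z.std(axis=0))     # 1 for every variable
```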
Step 2: Calculation of covariance matrix
The purpose of this step is to understand how the variables in the input data set vary relative to their means, in other words, to see whether there are relationships among them. Variables are sometimes so highly correlated that they contain redundant information. To identify these correlations, we compute the covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) whose entries are the covariances of all possible pairs of initial variables.
Now we know that the covariance matrix is just a table summarizing the correlations between all possible pairs of variables. The next step is to compute its eigenvectors and eigenvalues in order to extract the principal components.
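A small sketch computing the p × p covariance matrix of standardized data (the data here is random, just to show the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 3))        # 100 samples, p = 3 standardized variables

# rowvar=False tells np.cov that columns are variables and rows are samples.
C = np.cov(Z, rowvar=False)

print(C.shape)                           # (3, 3): one entry per pair of variables
print(np.allclose(C, C.T))               # True: the covariance matrix is symmetric
print(np.diag(C))                        # the diagonal holds each variable's variance
```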
Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix and identify the principal components.
Eigenvectors and eigenvalues are concepts from linear algebra; they are computed from the covariance matrix in order to determine the principal components of the data. Before explaining these concepts, let us first understand what a principal component means.
Principal components are new variables constructed as linear combinations (mixtures) of the initial variables. These new variables are uncorrelated with one another, and most of the information in the initial variables is squeezed into the first components. So 10-dimensional data yields 10 principal components, but PCA tries to put as much information as possible into the first component, then as much of the remaining information as possible into the second, and so on.
For example, suppose you have 10-dimensional data; you will end up with something like the figure below, in which the first principal component contains most of the information in the original data set while the last ones contain very little. Organizing information this way lets us reduce dimension without losing much: we simply discard the components that carry little information.
The relationship between variance and information here is: the greater the variance carried along a line, the more the data points are spread out along that line, and the more spread out they are, the more information the line carries. Simply put, treat each principal component as a new axis that provides the best angle from which to view and evaluate the data, so that differences between observations stand out most clearly.
The eigenvectors of the covariance matrix are in fact the directions of the axes with the greatest variance (the most information), which we call the principal components. Sorting the eigenvectors by their eigenvalues, from highest to lowest, gives the principal components in order of importance.
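A sketch of this step with NumPy (random data again, only to show the mechanics): eigh is used because the covariance matrix is symmetric, and the results are re-sorted from largest to smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 4))
C = np.cov(Z, rowvar=False)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Sort from largest to smallest eigenvalue: principal components in order of importance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]    # column i is the direction of the i-th principal component

print(eigenvalues)                       # decreasing variances along the component directions
```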
Step 4: Feature Vector
As we saw in the previous step, computing the eigenvectors and sorting them in descending order of their eigenvalues lets us find the principal components in order of importance. In this step we choose whether to keep all of these components or to discard those of low importance (low eigenvalue), and form a matrix from the remaining eigenvectors, which we call the feature vector.
So the feature vector is simply a matrix whose columns are the eigenvectors of the components we decided to keep. This is the first real step of dimensionality reduction: if we choose to keep only k of the p eigenvectors (components), the final data set will have only k dimensions.
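Continuing the previous sketch, forming the feature vector is just column selection (the name build_feature_vector is our own, not a library function):

```python
import numpy as np

def build_feature_vector(eigenvectors, k):
    """Keep the first k eigenvectors (columns already sorted by decreasing eigenvalue)."""
    return eigenvectors[:, :k]           # p x k matrix: the "feature vector"

# Example: an already-sorted 4 x 4 eigenvector matrix reduced to k = 2 components.
eigenvectors = np.eye(4)                 # placeholder for a real sorted eigenvector matrix
W = build_feature_vector(eigenvectors, k=2)
print(W.shape)                           # (4, 2): the reduced data will have 2 dimensions
```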
Step 5: redraw the data along the principal component axes.
In the previous steps, apart from standardization, no data was modified; we only selected the principal components and formed the feature vector, while the input data set remained expressed in terms of the original axes (the initial variables).
In this final step, the goal is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes onto the principal component axes (hence the name principal component analysis). This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set.
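Putting the pieces together, here is a sketch of this last step (random standardized data, keeping two components):

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal((50, 3))                 # standardized data: 50 samples, 3 variables

C = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
feature_vector = eigvecs[:, order][:, :2]        # keep the top 2 principal components

# FinalData = FeatureVector^T @ StandardizedData^T : each column is one sample
# expressed on the principal component axes.
final_data = feature_vector.T @ Z.T
print(final_data.shape)                          # (2, 50)
```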
Advantages and disadvantages
Advantages: simplifies a complex problem and reduces the amount of calculation.
Disadvantages: some accuracy is lost, and PCA can only handle linear structure; it is a linear dimensionality-reduction technique.
Summary
Suppose we have a data set with m samples, each described by n features (variables). We can then reduce its dimension by the following steps (a code sketch of the whole procedure follows the list):
1. Take each sample in the data set as a column vector and arrange it by columns to form a matrix with n rows and m columns;
2. From each row vector (variable) of the matrix, subtract that row's mean, so that each new row vector has mean 0; this gives the new data matrix X;
3. Compute the covariance matrix of X, and find its eigenvalues λ and unit eigenvectors E;
4. Arrange the unit eigenvectors as rows, in order from the largest eigenvalue to the smallest, to obtain the transformation matrix P; the principal component matrix is then computed as PX;
5. Use the eigenvalues to compute each component's variance contribution rate and the cumulative variance contribution rate, and keep the top k principal components whose cumulative contribution rate exceeds 85%. If the data must be reduced to a specific dimension k, simply take the top k principal components.
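Below is a NumPy sketch that follows these five steps literally (samples as columns, variables as rows); the function name pca and the toy data are our own choices, not part of any library.

```python
import numpy as np

def pca(data, k=None, threshold=0.85):
    """Reduce `data` (shape: n features x m samples) following the five steps above."""
    # Step 2: subtract each row's (variable's) mean so every row has mean 0.
    X = data - data.mean(axis=1, keepdims=True)

    # Step 3: covariance matrix of X and its eigenvalues / unit eigenvectors.
    C = X @ X.T / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C)

    # Step 4: unit eigenvectors as rows, ordered from largest to smallest eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    P = eigvecs[:, order].T

    # Step 5: variance contribution rates; keep enough components to pass the threshold.
    cumulative = np.cumsum(eigvals / eigvals.sum())
    if k is None:
        k = int(np.searchsorted(cumulative, threshold) + 1)

    return P[:k] @ X, cumulative[:k]     # principal components (k x m) and cumulative rates

# Toy usage: 3 variables, 5 samples (one sample per column), reduced by the 85% rule.
data = np.array([[2.5, 0.5, 2.2, 1.9, 3.1],
                 [2.4, 0.7, 2.9, 2.2, 3.0],
                 [1.0, 1.1, 0.9, 1.0, 1.2]])
Y, cum = pca(data)
print(Y.shape, cum)
```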