A summary of three commonly used dimensionality reduction methods
The difference between LDA and PCA is that LDA is a supervised dimensionality reduction method: it maps the features into a low-dimensional space in such a way that the class labels of the original data are still clearly reflected there, so the low-dimensional data can still be used for classification.

Let us look at the two-class, two-dimensional case first. We want to find a vector such that, after the data points are mapped onto it, the distance between the two classes is as large as possible while the scatter among the samples within each class is as small as possible. This gives an objective function: the numerator is the squared difference between the two projected class means, which we want to be as large as possible, and the denominator is the sum of the within-class scatters (variances) of the two classes after projection. Expanding the numerator gives the transpose of the projection vector, times the outer product of the class-mean difference vector before projection, times the projection vector, i.e. w'S_B w with S_B = (μ1 − μ2)(μ1 − μ2)'. The denominator becomes the transpose of the projection vector, times the sum of the within-class scatter matrices before projection, times the projection vector, i.e. w'S_W w. We now look for the projection vector that maximizes this objective. Since the objective value does not change when the projection vector is scaled up or down, we can constrain the denominator to equal 1 and apply the Lagrange multiplier method. When the within-class scatter matrix S_W has an inverse, the optimal projection vector is an eigenvector of S_W^{-1}S_B, the product of that inverse with the outer product of the class-mean difference vector before projection. Simplifying further, S_B w is always a multiple of (μ1 − μ2), so the left-hand side reduces to S_W^{-1}(μ1 − μ2) times a constant. Because the projection vector can be rescaled arbitrarily without affecting the result, we drop the constants on both sides and obtain w = S_W^{-1}(μ1 − μ2), the within-class scatter matrix's inverse applied to the difference of the class mean vectors before projection. This is Fisher discriminant analysis.
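As a minimal sketch of the closed-form solution above, here is a NumPy implementation of the two-class Fisher direction w = S_W^{-1}(μ1 − μ2). The function name and the toy Gaussian data are illustrative assumptions, not from the original text.

```python
# Minimal two-class Fisher LDA sketch (illustrative names and toy data).
import numpy as np

def fisher_direction(X1, X2):
    """Return the Fisher projection vector w = S_W^{-1} (mu1 - mu2).

    X1, X2: arrays of shape (n_samples, n_features), one per class.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix: sum of the two class scatter matrices.
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # Direction maximizing between-class over within-class scatter.
    w = np.linalg.solve(S_W, mu1 - mu2)
    return w / np.linalg.norm(w)   # the scale of w is irrelevant, normalize for convenience

# Toy usage: two Gaussian blobs.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(50, 2))
w = fisher_direction(X1, X2)
print("Fisher direction:", w)
# The 1-D projections are well separated and can be used for classification.
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())
```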

PCA projects the original samples into a low-dimensional space that preserves most of the information in the samples while reducing the number of features, which makes the model less prone to overfitting. The idea is: for the m-dimensional vectors in the original space, find k projection vectors such that the variance of the data projected onto them is as large as possible, so that as much of the original sample information as possible is retained. We can first find a single vector that maximizes the projection variance along its direction. The steps are as follows (a short code sketch is given after the list):

1. Before projecting, center the data so that each original feature has zero mean.

2. Compute the covariance matrix of the centered samples. It is an m*m matrix, where m is the number of original features; the element in row i, column j is the covariance between the i-th and j-th columns (features) of the data.

3. Compute the eigenvalues and eigenvectors of the covariance matrix. Each eigenvector is a unit vector with modulus 1.

4. Select k eigenvectors with the largest eigenvalues.

5. Compute the k new features corresponding to the k largest eigenvalues. For each new feature, multiply the original (centered) data matrix (n rows, m columns) by the corresponding eigenvector (m rows, 1 column, where m is the number of original variables). The result therefore has n rows and one column: the value of that new feature for each sample.
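The sketch below walks through the five steps with NumPy. The array shapes follow the text (n samples in rows, m original features in columns); the function name and the random data are illustrative assumptions.

```python
# Sketch of the five PCA steps above (illustrative names and data).
import numpy as np

def pca(X, k):
    # 1. Center the data so every feature has zero mean.
    Xc = X - X.mean(axis=0)
    # 2. Sample covariance matrix, shape (m, m).
    C = np.cov(Xc, rowvar=False)
    # 3. Eigenvalues and unit-length eigenvectors (C is symmetric, so use eigh).
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    W = eigvecs[:, order]                      # shape (m, k)
    # 5. Project: (n, m) @ (m, k) -> (n, k), one new feature value per sample.
    return Xc @ W, eigvals[order], W

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                  # 100 samples, 5 original features
scores, variances, components = pca(X, k=2)
print(scores.shape, variances)                 # (100, 2) plus the two largest eigenvalues
```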

Center (and, if appropriate, normalize) the data, then project it onto a vector and compute the variance of the data points along that direction. After simplification, this variance equals the transpose of the projection vector, times the covariance matrix of the original data, times the projection vector, under the constraint that the projection vector is a unit vector. We then maximize this variance λ; the projection vector at which the maximum is attained is the first principal component. So how do we solve for this vector? Applying the Lagrange multiplier method with the unit-norm constraint gives Σu = λu, which means the projection vector u is an eigenvector of the covariance matrix Σ. The maximum variance λ therefore corresponds to the largest eigenvalue of Σ, and its eigenvector gives the direction of the first principal component; the eigenvector corresponding to the second largest eigenvalue gives the direction of the second principal component.
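In symbols (with Σ denoting the sample covariance matrix and u the unit projection vector, as in the text), the derivation reads:

```latex
\max_{u}\; u^{\top}\Sigma u \quad \text{s.t.}\; u^{\top}u = 1,
\qquad
\mathcal{L}(u,\lambda) = u^{\top}\Sigma u - \lambda\,(u^{\top}u - 1),
\qquad
\frac{\partial \mathcal{L}}{\partial u} = 2\Sigma u - 2\lambda u = 0
\;\Rightarrow\; \Sigma u = \lambda u,
\qquad
u^{\top}\Sigma u = \lambda .
```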

Centering the data is not strictly necessary, but it makes the representation and the computation easier. PCA works on the covariance matrix of the samples, and centering does not change the eigenvectors or the eigenvalues of that matrix, so PCA works even if the data are not centered. However, the mathematical expression of the sample covariance matrix becomes much simpler when the data are centered: if the data points are the columns of the data matrix X, the covariance matrix is simply XX' (up to a constant factor). How simple that is! Technically, PCA does include a centering step, but only in order to compute the covariance matrix, which is then eigendecomposed to obtain the eigenvalues and eigenvectors.

Standardizing the data is not always necessary. If some variables have a much larger or smaller variance than others, PCA will be biased toward the variables with large variance. For example, if you increase the variance of one variable, that variable may go from having little influence to dominating the first principal component. So if you want PCA to be invariant to such changes of scale, standardize the variables first. Of course, if the scale of your variables is itself meaningful, you should not standardize them. Scaling matters a great deal in PCA, because PCA is a variance-maximizing procedure: it projects your original data onto the directions of maximum variance.
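The short sketch below illustrates the scale dependence described above: rescaling one variable changes which variable dominates the first principal component, and standardizing removes the effect. The data and the scale factor are made-up assumptions for the demonstration.

```python
# Illustrative sketch: rescaling one variable changes the first principal component.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))            # two variables with comparable variance

def first_pc(X):
    C = np.cov(X - X.mean(axis=0), rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    return vecs[:, np.argmax(vals)]

print(first_pc(X))                       # neither variable dominates
X_scaled = X.copy()
X_scaled[:, 0] *= 100                    # blow up the variance of variable 0
print(first_pc(X_scaled))                # first PC now points almost entirely along variable 0

# Standardizing (dividing each column by its standard deviation) removes the scale dependence.
X_std = X_scaled / X_scaled.std(axis=0)
print(first_pc(X_std))
```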

(1) If the original features are highly correlated, the result of PCA is unstable;

(2) The new features are linear combinations of the original features and therefore lack interpretability.

(3) PCA does not require the original data to follow a multivariate Gaussian distribution, unless you use the technique for predictive modeling and want to compute confidence intervals.

The effect of matrix multiplication is a linear transformation: multiplying a vector by a matrix can stretch or shrink the vector. We all know that the eigenvectors of a symmetric matrix are mutually orthogonal: given a symmetric matrix M, we can find orthogonal vectors v such that Mv = λv, that is, the matrix M stretches the vector and λ is the stretching factor. For an ordinary matrix, how can we linearly transform one orthogonal grid of the plane into another orthogonal grid?

For an orthogonal matrix, the corresponding transformation is called an orthogonal transformation. Such a transformation changes neither the lengths of vectors nor the angles between them. The rotation part of an orthogonal transformation merely represents the vector with respect to another set of orthogonal basis vectors: in this process the vector is not stretched and its position in space does not change; instead, the original coordinate system is rotated and a new coordinate system is obtained. So how do we find this rotation matrix? For a vector in two-dimensional space, the rotation expresses the vector's coordinates in a new coordinate system: each coordinate component in the new system is the projection of the vector onto one of the two orthogonal basis vectors of the new system. Equivalently, rotating the original two-dimensional vector into the new coordinate system amounts to multiplying the vector by a rotation matrix, and finding that matrix gives the matrix of the rotation transformation. As noted above, an orthogonal transformation does not change the position of the vector in space, but coordinates are relative: once the basis vectors move from the original coordinate system to the new one, the coordinates of the vector change.
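As a concrete example (a standard result, not specific to this text), the two-dimensional rotation matrix is orthogonal: it changes the coordinates of a vector but preserves lengths and angles.

```latex
R(\theta) =
\begin{pmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{pmatrix},
\qquad
R(\theta)^{\top} R(\theta) = I,
\qquad
\lVert R(\theta)\,x \rVert = \lVert x \rVert .
```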

Multiplying a matrix by a vector again yields a vector of the same dimension. So matrix multiplication corresponds to a transformation that turns one vector into another vector of the same dimension.

For certain special vectors, a square-matrix transformation leaves the direction unchanged (or exactly reversed) and only stretches or shrinks the vector (the stretching factor may be negative, which corresponds to reversing the direction of the vector). This is precisely the definition of an eigenvector.

The geometric meaning of an eigenvector is that under the transformation by the square matrix A it is only stretched or shrunk, while its direction stays unchanged. The eigenvalue expresses how important that direction is, similar to a weight. Geometrically, an eigenvector can be pictured as a point, and the direction from the origin to that point is the direction of the vector.

An eigenvector of a transformation (or matrix) is a vector that keeps its direction under this particular transformation and only changes in length. Eigenvalue decomposition combines rotation and scaling effects. Because the matrix A in an eigenvalue decomposition is square, there is clearly no projection effect. In other words, we have found a set of basis vectors (the eigenvectors) under which the action of the matrix is pure scaling: matrix A takes a vector from the space spanned by this basis, scales it along each basis direction, and returns it to the same space. Because the two bases are identical, there is neither rotation nor projection overall.

Let us analyze the process of eigenvalue decomposition in detail. Because the eigenvectors (of a symmetric matrix) are orthogonal, the matrix U formed by the eigenvectors is an orthogonal square matrix; multiplying both sides by the inverse of this matrix gives the expression A = UΛU'. Multiplying both sides by a vector, i.e. left-multiplying the vector by A, rotates or stretches the vector. This transformation decomposes into three mappings: the first rotates the vector x, i.e. represents x in the new (eigenvector) coordinate system; the second is a stretching, in which each component of x in that system is stretched or shrunk by the corresponding eigenvalue; the third is the inverse of the first rotation (it is the inverse matrix of the first one and is also a rotation), mapping the result back to the original coordinate system. From the stretching step it can be seen that if A is not full rank, i.e. some eigenvalue is 0, then the transformation maps x into a subspace of the m-dimensional space (where A is m*m), and A sends the m-dimensional vector x into its column space. If A is two-dimensional, one can find a rectangle in the plane that remains a rectangle after the transformation by A.
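Here is a small sketch of the three mappings (U', then Λ, then U) for a symmetric matrix A = UΛU'. The matrix and the vector are illustrative assumptions.

```python
# The three mappings of the eigendecomposition A = U Lambda U' (illustrative data).
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # symmetric, so eigenvectors are orthogonal
eigvals, U = np.linalg.eigh(A)          # A = U @ diag(eigvals) @ U.T
x = np.array([1.0, -2.0])

step1 = U.T @ x                         # rotate: express x in the eigenvector basis
step2 = eigvals * step1                 # stretch each component by its eigenvalue
step3 = U @ step2                       # rotate back to the original coordinate system

print(np.allclose(step3, A @ x))        # True: the three mappings reproduce A @ x
```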

In eigenvalue decomposition the matrix A must be square. For an arbitrary m*n matrix, can we still find a set of orthogonal basis vectors that remain orthogonal after the transformation? That is the essence of SVD.

A = UΣV'. Let us analyze the action of matrix A. First, rotation: the column vectors of U form an orthonormal basis, and so do those of V; in other words, we have found two sets of bases, and the action of A is to rotate a vector from the space spanned by the V basis to the space spanned by the U basis. Second, scaling: applying V' to the vector x amounts to representing x in the orthonormal basis V, and then Σ scales each component of x; the scaling factors are the elements on the main diagonal of Σ, the singular values. Finally, projection: if the dimension of U is less than that of V, the process also involves a projection.

The goal now is to find a set of orthonormal basis vectors that are still orthogonal after the matrix transformation. Suppose such a basis has been found; what must it satisfy? It suffices that the original basis vectors v_i are eigenvectors of A'A; then |Av_i| is the square root of the corresponding eigenvalue of A'A, that is, the singular value. Next we take the unit vectors u_i of Av_i, and these u_i are also mutually orthogonal. We have thus found two sets of orthonormal bases, and A transforms the V basis into the U basis; V is called the right singular vectors, U the left singular vectors, and the modulus of Av_i is the singular value. Finally we extend v_1, ..., v_k with v_{k+1}, ..., v_n (an orthonormal basis of the null space where Ax = 0) so that v_1, ..., v_n is an orthonormal basis of the n-dimensional space, and similarly extend u_1, ..., u_k to u_1, ..., u_m.

Analysis of the mapping process of matrix A: if we take a hyper-rectangle in the n-dimensional space whose sides lie along the eigenvector directions of A'A, then its image under A is still a hyper-rectangle. The v_i are the eigenvectors of A'A, the u_i are the eigenvectors of AA' and the unit vectors of Av_i, and σ_i is the square root of the i-th eigenvalue of A'A. From these relations the singular value decomposition of matrix A can be computed.
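The sketch below verifies the construction just described: the v_i are eigenvectors of A'A, σ_i is the square root of the eigenvalue, and u_i = Av_i / |Av_i|. The 3x2 matrix is an illustrative assumption.

```python
# Building the SVD from the eigendecomposition of A'A (illustrative matrix).
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])              # an m x n matrix, here 3 x 2

eigvals, V = np.linalg.eigh(A.T @ A)    # eigen-decomposition of A'A (n x n)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(eigvals)                # singular values sigma_i = sqrt(eigenvalue)
U = (A @ V) / sigma                     # unit vectors u_i = A v_i / |A v_i|

print(np.allclose(A @ V, U * sigma))    # A v_i = sigma_i u_i
print(np.allclose(U.T @ U, np.eye(2)))  # the u_i are orthonormal
```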

Singular value decomposition thus transforms one grid of mutually perpendicular lines into another grid of mutually perpendicular lines. Using the roles of U and V described above, we can trace how A acts on a vector: first write the vector x in terms of the orthonormal basis V, then left-multiply x by the matrix A and substitute Av_i = σ_i u_i. This finally yields a decomposition of A, not as a matrix factorization but as a sum of rank-one (vector outer product) terms. It can be seen that if some singular values are small or even 0, then of the n terms in the sum only the terms with nonzero singular values really contribute; if we keep only the first k terms, then the closer k is to n, the closer the result is to A.

(1) It can be used to reduce the storage required for the matrix elements.

(2) It can be used for noise reduction: remove the terms with small singular values. We consider those terms to carry little important information about the samples and to be mostly noise, so removing them loses little information.

(3) Data analysis: for example, when we have some sample points for modeling, we can use SVD to remove the terms with small singular values and analyze the reconstructed data, which makes the analysis more accurate.
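As a small illustration of the truncation idea above (and of the noise-reduction use), here is a sketch of the rank-k approximation A ≈ Σ_{i≤k} σ_i u_i v_i'. The random matrix is only an assumption for the demonstration.

```python
# Rank-k approximation from the SVD (illustrative random matrix).
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in range(1, 5):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep only the k largest singular values
    err = np.linalg.norm(A - A_k)
    print(k, round(err, 4))                        # the error shrinks as k approaches n
```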

We know that in PCA, reducing the dimensionality of the variables is equivalent to multiplying the data matrix A (m*n) on the right by a matrix P (n*r), obtaining a matrix of size m*r; that is, the feature vector of each sample now has only r dimensions. The matrix P consists of the r column eigenvectors of the n*n covariance matrix of A corresponding to the r largest eigenvalues, each of which is n-dimensional. Compare this with SVD: multiplying both sides of A = UΣV' on the right by V (n*r) makes the product of V' (r*n) and V (n*r) on the right-hand side the identity. Because the columns of V (n*r) are the r eigenvectors of A'A corresponding to the first r nonzero eigenvalues, and because A'A is symmetric so its eigenvectors are orthogonal, we obtain exactly the formula just derived for PCA.

Similarly, multiplying the data matrix A (m*n) on the left by a matrix P (r*m) gives a matrix of size r*n, which means each feature is now described by only r values; the matrix P consists of r m-dimensional vectors, each being a direction onto which the column vector associated with each feature is projected. Compare this with SVD: multiplying both sides of A = UΣV' on the left by U' (r*m) gives an r*n matrix, i.e. the dimension is reduced along the rows. On the right-hand side, the product of U' (r*m) and U (m*r) is the identity, because the columns of U (m*r) are the m-dimensional eigenvectors of AA' corresponding to its first r eigenvalues; and because AA' is a symmetric matrix, its eigenvectors are orthogonal, so we again obtain the PCA formula, this time in the row direction.

It can be seen that:

- PCA is essentially a wrapper around SVD: once we have implemented SVD, we have implemented PCA.

- Better still, with SVD we can do PCA in both directions, whereas an eigendecomposition of A'A only gives PCA in one direction.
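The sketch below illustrates "PCA in both directions" from a single SVD, with shapes as in the text (A is m*n); the data and the choice r = 2 are illustrative assumptions.

```python
# PCA in two directions from one SVD (illustrative data; A is m x n).
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(8, 5))
A = A - A.mean(axis=0)                   # center so that A'A is proportional to the covariance matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2

# Column-direction PCA: right-multiply by the first r right singular vectors,
# A V_r = U_r Sigma_r, i.e. each sample is described by r new features.
col_reduced = A @ Vt[:r, :].T            # shape (m, r)

# Row-direction PCA: left-multiply by the first r left singular vectors,
# U_r' A = Sigma_r V_r', i.e. the data are compressed along the rows.
row_reduced = U[:, :r].T @ A             # shape (r, n)

print(col_reduced.shape, row_reduced.shape)
print(np.allclose(col_reduced, U[:, :r] * s[:r]))   # A V_r = U_r Sigma_r
```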