Chapter 11: Dimension Reduction
Removing irrelevant and redundant data to lower computational cost without overfitting requires condensing the feature set without losing important information; mathematically, this is called dimensionality reduction. It is widely used in pattern recognition, text retrieval, and machine learning, and falls into two main categories: feature extraction and feature screening. The former projects high-dimensional data into a low-dimensional space; the latter replaces the original feature set with a subset of it, and comprises feature ranking, which scores each feature, and feature screening, which searches for an optimized feature subset.

Feature extraction can be divided into two kinds of methods: linear and nonlinear. Linear extraction tries to find an affine subspace that best explains the variation in the data distribution, while nonlinear extraction is effective for data distributed on a high-dimensional nonlinear curved surface (a manifold).

Feature screening methods:

The filter approach first sets a scoring criterion and then selects the features that satisfy it.

The algorithm first calls a weight function to obtain a weight for each feature; here the evaluation index is the mean decrease in accuracy (importance.type = 1). Besides the random forest importance used above, you can also use chi.squared or information.gain as the weight function.
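
Here is a minimal sketch of this filter approach with the FSelector package; the iris data set and the choice of keeping two features are illustrative assumptions, not the source's original code:

```r
library(FSelector)

# Score each feature with a random forest; importance.type = 1 uses the
# mean decrease in accuracy as the evaluation index
weights <- random.forest.importance(Species ~ ., iris, importance.type = 1)
print(weights)

# Alternative weight functions: chi-squared and information gain
chi.squared(Species ~ ., iris)
information.gain(Species ~ ., iris)

# Keep the k highest-scoring features (k = 2 here is arbitrary)
subset <- cutoff.k(weights, 2)
as.simple.formula(subset, "Species")
```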

The wrapper approach then searches for the optimized feature subset. The importance of each candidate subset is evaluated with 5-fold cross-validation. A hill-climbing search algorithm can select an optimized feature subset from the original feature set; you can also choose other search algorithms, such as forward.search, or use the caret package for feature screening, which is said to be a treasure chest that covers everything.
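
A hedged sketch of the wrapper approach, following the pattern in the FSelector documentation; the iris data and the rpart classifier are assumptions:

```r
library(FSelector)
library(rpart)

# Evaluator: score a candidate feature subset by the 5-fold
# cross-validated accuracy of an rpart decision tree
evaluator <- function(subset) {
  k <- 5
  splits <- runif(nrow(iris))
  results <- sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train <- iris[!test.idx, , drop = FALSE]
    test  <- iris[test.idx, , drop = FALSE]
    tree  <- rpart(as.simple.formula(subset, "Species"), train)
    sum(predict(tree, test, type = "class") == test$Species) / nrow(test)
  })
  mean(results)
}

# Hill-climbing search over feature subsets; forward.search(...) is a
# drop-in alternative with the same arguments
best <- hill.climbing.search(names(iris)[-5], evaluator)
as.simple.formula(best, "Species")
```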

Principal component analysis (PCA) is a widely used linear dimensionality reduction method, suited to data sets with many features and redundancy (correlation) among them. It maps high-dimensional data into a low-dimensional space by reducing the feature set to a few principal components that capture the most important variation in the original features.

In feature selection, features that are correlated with one another but still valuable may be removed; such features should instead be combined into a single feature during feature preparation. PCA uses an orthogonal transformation to convert correlated features into principal components, so that the directions of greatest variance can be identified.

The algorithm mainly consists of the following steps: 1) compute the mean vector of the data points; 2) compute the covariance matrix; 3) compute the eigenvectors and eigenvalues; 4) sort the eigenvectors and select the first k; 5) construct the eigenvector matrix; finally, transform the data samples into the new k-dimensional subspace.

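The collapsed code is not visible here, so the following is a reconstruction under assumptions: prcomp applied to the built-in swiss data set, with the Fertility column dropped.

```r
data(swiss)
swiss <- swiss[, -1]   # drop the Fertility column, keeping the five predictors

# Center and scale, then compute principal components via SVD
swiss.pca <- prcomp(swiss, center = TRUE, scale. = TRUE)
summary(swiss.pca)     # standard deviation and variance explained per PC

# Map samples into the new subspace
predict(swiss.pca, newdata = head(swiss, 3))
```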

princomp is another principal component analysis function. It differs from prcomp above, which is based on singular value decomposition: princomp instead uses an eigenvalue decomposition of the correlation or covariance matrix.
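
A minimal example, again assuming the swiss data from above:

```r
# cor = TRUE works from the correlation matrix rather than the covariance matrix
swiss.princomp <- princomp(swiss, cor = TRUE)
summary(swiss.princomp)
loadings(swiss.princomp)   # the eigenvectors (component loadings)
```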

The above two functions are from the stats package, and you can also use the principal function in the psych package:
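
For instance (nfactors = 2 is an arbitrary illustrative choice):

```r
library(psych)
# Principal components via the psych package, without rotation
swiss.principal <- principal(swiss, nfactors = 2, rotate = "none")
swiss.principal
```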

Acceptable selection rules include the Kaiser method, the scree test, and the proportion of variance explained. The scree test presents the principal component results as a scree plot and looks for the point where the slope of the curve changes fastest.

Here the slope changes fastest at the second principal component. You can also use the nFactors package to run Cattell's scree test in a non-graphical way, together with parallel analysis.
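
A sketch of both approaches, assuming the swiss data and the swiss.pca object from above; the nFactors package is an assumption for the non-graphical test:

```r
# Scree plot of the components' variances ("lines" draws the classic elbow curve)
screeplot(swiss.pca, type = "lines", main = "Scree plot")

# Non-graphical Cattell scree test plus parallel analysis via nFactors
library(nFactors)
ev <- eigen(cor(swiss))      # eigenvalues of the correlation matrix
nS <- nScree(x = ev$values)
plotnScree(nS)
```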

biplot draws the projection of both the data points and the original features onto the first two principal components. Provinces with high agriculture but low education and examination scores score high on PC1; provinces with high infant mortality and low agriculture score high on PC2.
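
Assuming the swiss.pca object from above:

```r
# Project both the provinces (points) and the original features (arrows)
# onto the first two principal components
biplot(swiss.pca)
```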

Multidimensional scaling (MDS) graphically displays the similarities or differences (distances) among multiple objects. "Multidimensional" refers to mapping the objects into a one-, two-, or higher-dimensional space to represent their relative distances; usually a one- or two-dimensional space is used.

MDS is divided into two categories: metric and non-metric. Metric MDS tries to keep the distances between objects after dimensionality reduction as close as possible to their distances in the original space, while non-metric MDS assumes only that the rank order of the distances in the two spaces is known, and preserves that ranking after the transformation.

We can compare MDS and PCA by drawing their projection dimensions in a scatter plot: if MDS uses Euclidean distance, its projection dimensions will be exactly consistent with those of PCA.
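
A minimal sketch of both MDS variants on the assumed swiss data:

```r
# Metric (classical) MDS with Euclidean distance, which matches PCA's projection
swiss.dist <- dist(swiss)                  # Euclidean distance by default
swiss.mds  <- cmdscale(swiss.dist, k = 2)  # classical MDS down to 2 dimensions
plot(swiss.mds[, 1], swiss.mds[, 2], type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(swiss.mds[, 1], swiss.mds[, 2], rownames(swiss), cex = 0.7)

# Non-metric MDS, which preserves only the rank order of the distances
library(MASS)
swiss.nmmds <- isoMDS(swiss.dist, k = 2)
```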

Singular value decomposition (SVD) is a form of matrix factorization that decomposes a matrix into two orthogonal matrices and a diagonal matrix; multiplying the three together recovers the original matrix. From a linear-algebra point of view, it can help remove linearly correlated, redundant parts of a matrix, and it is applied in feature screening, image processing, and clustering.

Singular value decomposition is a common method for decomposing real or complex matrices, and principal component analysis can be regarded as a special case of singular value decomposition: the matrices the two produce are essentially the same (up to sign).
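
A sketch of that correspondence on the assumed swiss data:

```r
# SVD of the centered and scaled data
swiss.svd <- svd(scale(swiss))

# The right singular vectors equal the PCA rotation matrix up to sign
swiss.pca <- prcomp(swiss, center = TRUE, scale. = TRUE)
all.equal(abs(swiss.svd$v), abs(swiss.pca$rotation), check.attributes = FALSE)
```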


The test image is the most widely used standard test image in the field of image compression: the Lenna photo, originally of a Playboy model!

I don't know why every rendering I get looks like a negative. Let's continue for now:
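
A hedged sketch of SVD image compression; the bmp package, the grayscale assumption, and the file name are all illustrative:

```r
# Read a grayscale test image as a numeric matrix (file name is illustrative)
library(bmp)
img <- read.bmp("lenna512.bmp")

# Keep only the k largest singular values to get a rank-k approximation
img.svd <- svd(img)
k <- 20
img.k <- img.svd$u[, 1:k] %*% diag(img.svd$d[1:k]) %*% t(img.svd$v[, 1:k])

# Render the compressed image; image() may need the matrix rotated, and an
# inverted palette can make it look like a negative
image(img.k, col = gray.colors(256))
```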

ISOMAP is a manifold learning method that extends dimensionality reduction from linear spaces to nonlinear data structures. Like MDS, it can graphically display the similarities or differences (distances) between objects; however, because the data are assumed to have a nonlinear structure, ISOMAP replaces the Euclidean distance used in MDS with geodesic distance.

ISOMAP (isometric feature mapping) is a nonlinear dimensionality reduction method. If the Euclidean distances between data points in metric MDS are replaced with geodesic distances computed over a neighborhood graph, ISOMAP can be regarded as an extension of metric MDS.

The algorithm is divided into four steps: determine the neighbors of each point, construct the neighborhood graph, compute the shortest paths, and run MDS to find a low-dimensional embedding of the data.

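The collapsed code is not shown, so this sketch is an assumption: it uses the isomap function from the vegan package on vegan's bundled BCI data set.

```r
library(vegan)
data(BCI)

# Geodesic distances are approximated by shortest paths over a
# k-nearest-neighbour graph built on the dissimilarity matrix
dis <- vegdist(BCI)
ord <- isomap(dis, ndim = 2, k = 3)
plot(ord)   # low-dimensional embedding together with the neighbourhood graph
```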

The RnavGraph package uses graphs as the basic way of browsing data, enabling visualization of high-dimensional data.
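
RnavGraph's bundled demos are the easiest entry point; listing them is a safe first step (which demos ship with your installed version is an assumption):

```r
# List whatever demos the package ships for its graph-based visualizations
library(RnavGraph)
demo(package = "RnavGraph")
```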

The LLE (locally linear embedding) algorithm is an extension of PCA that compresses data by mapping a manifold embedded in high-dimensional space down to a low-dimensional space. Whereas ISOMAP is a global nonlinear dimensionality reduction algorithm, LLE works locally: it assumes that each data point can be reconstructed as a linear combination of its k nearest neighbors, and that the original data's properties are preserved after the mapping.

LLE is a nonlinear dimensionality reduction algorithm: it produces a mapping of high-dimensional data into a low-dimensional space that preserves the neighborhood relationships of the original data. The algorithm has three main steps: find the k nearest neighbors of each point; compute the reconstruction weights so that each point is optimally reconstructed from its neighbors, i.e. the residual sum is minimized; and finally compute the low-dimensional embedding that best preserves those weights.

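The collapsed code is not shown; a sketch with the lle package and its bundled S-curve data, both assumptions:

```r
library(lle)
data(lle_scurve_data)

# Embed the 3-D S-curve in two dimensions using k = 12 neighbours per point;
# id = TRUE also estimates the intrinsic dimension
results <- lle(lle_scurve_data, m = 2, k = 12, id = TRUE)
plot(results$Y, xlab = "y1", ylab = "y2", main = "LLE embedding")
```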

You can also choose the RDRToolbox package for nonlinear dimensionality reduction; it supports both the ISOMAP and LLE algorithms.
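
A short sketch of both algorithms via RDRToolbox; the SwissRoll data generator and the parameter choices are illustrative assumptions:

```r
# RDRToolbox (Bioconductor) implements both Isomap and LLE;
# SwissRoll() generates sample manifold data to try them on
library(RDRToolbox)
swissData <- SwissRoll(N = 1000)

swissIsomap <- Isomap(data = swissData, dims = 2, k = 10)
plot(swissIsomap$dim2, xlab = "Dim 1", ylab = "Dim 2")

swissLLE <- LLE(data = swissData, dim = 2, k = 12)
plot(swissLLE, xlab = "Dim 1", ylab = "Dim 2")
```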