DALS02 1-MDS and PCA

Date: 2019-08-21 12:00:00


This is the second part of the notes on Chapter 8 (statistical models) of Data Analysis for the Life Sciences. The main topics of this part are MDS and PCA; the corresponding R Markdown documents can be found on the author's GitHub.

MDS stands for multidimensional scaling. In this part we will use gene expression data as the example. To keep the illustration simple, we only consider samples from three tissues and store their expression values in the matrix mat, with one column per sample.
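Below is a minimal sketch of this setup. It assumes the tissuesGeneExpression dataset used throughout the book; the three tissues chosen here are illustrative.

```r
library(tissuesGeneExpression)
data(tissuesGeneExpression)                      # provides e (expression matrix) and tissue (labels)
colind <- tissue %in% c("kidney", "colon", "liver")
mat <- e[, colind]                               # genes in rows, samples in columns
group <- factor(tissue[colind])                  # tissue of each remaining sample
dim(mat)                                         # how many genes and samples remain
```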

Now we want to study this data set: how similar are the gene expression profiles, stored in the columns of mat, across the different tissues? Because the data are so high dimensional, we cannot plot the samples as points directly. In practice we can only draw two-dimensional plots, and plotting every pair of genes against each other is unrealistic. MDS plots were proposed to solve exactly this problem.

We have already covered the singular value decomposition and matrix algebra, which makes MDS relatively easy to understand. To illustrate the idea, let's start from the SVD of the data matrix, $\mathbf{Y} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$.

Assume that the sum of squares of the first two columns of $\mathbf{U}\mathbf{D}$, that is $d_1^2 + d_2^2$, is much larger than the sum of squares of the remaining columns. Writing $\mathbf{Y}_i$ for the $i$-th column (sample) of $\mathbf{Y}$, the distance between two samples is then driven almost entirely by the first two columns. If we define the two-dimensional vector $\mathbf{Z}_i = (d_1 v_{i,1},\, d_2 v_{i,2})$, the derivation tells us that the distance between samples $i$ and $j$ is approximately equal to the distance between the two-dimensional points $\mathbf{Z}_i$ and $\mathbf{Z}_j$.
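Written out as equations, the argument is roughly the following (the notation $\mathbf{Y} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$ follows the SVD section; the row/column indexing is my own convention):

$$
(\mathbf{Y}_i - \mathbf{Y}_j)^\top (\mathbf{Y}_i - \mathbf{Y}_j)
 = (\mathbf{V}_i - \mathbf{V}_j)\,\mathbf{D}\,\mathbf{U}^\top\mathbf{U}\,\mathbf{D}\,(\mathbf{V}_i - \mathbf{V}_j)^\top
 = \sum_{k} d_k^2\,(v_{i,k} - v_{j,k})^2
 \approx d_1^2 (v_{i,1} - v_{j,1})^2 + d_2^2 (v_{i,2} - v_{j,2})^2
 = \|\mathbf{Z}_i - \mathbf{Z}_j\|^2,
$$

where $\mathbf{V}_i$ is the $i$-th row of $\mathbf{V}$ (so that $\mathbf{Y}_i = \mathbf{U}\mathbf{D}\mathbf{V}_i^\top$) and $\mathbf{Z}_i = (d_1 v_{i,1},\, d_2 v_{i,2})$.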

Because each $\mathbf{Z}_i$ is a two-dimensional vector, we can visualize the distances between samples by plotting the first coordinate against the second. Let's make that plot:
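A sketch of how this plot could be produced, assuming the mat and group objects from the setup sketch above; the centering and plotting details are my choices.

```r
s <- svd(mat - rowMeans(mat))       # center each gene (row), then decompose
z1 <- s$d[1] * s$v[, 1]             # first coordinate of Z
z2 <- s$d[2] * s$v[, 2]             # second coordinate of Z
plot(z1, z2, pch = 16, col = as.numeric(group),
     xlab = "First dimension", ylab = "Second dimension")
legend("topleft", legend = levels(group), pch = 16,
       col = seq_along(levels(group)))
```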

As we can see from the figure, the samples separate according to tissue. How good this two-dimensional approximation is depends on how much of the variability is explained by the first two components. As described above, we can plot the proportion of variability that each component explains:
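For example, using the svd object s from the sketch above:

```r
# Proportion of total variability explained by each component
plot(s$d^2 / sum(s$d^2), xlab = "Component",
     ylab = "Proportion of variability explained")
```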

Although the first two components explain more than 50% of the variability, the plot above still leaves out a lot of information. Even so, this kind of plot is very useful for visualizing a large dataset in two dimensions. We can also plot other components to explore the data; for example, let's plot the third and fourth components:
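Assuming the same s and group objects:

```r
z3 <- s$d[3] * s$v[, 3]             # third coordinate
z4 <- s$d[4] * s$v[, 4]             # fourth coordinate
plot(z3, z4, pch = 16, col = as.numeric(group),
     xlab = "Third dimension", ylab = "Fourth dimension")
```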

As can be seen from the figure, the fourth component strongly separates the kidney samples. Later, when we discuss batch effects, we will see what explains this.

Above we did the computation with the svd() function, but there is a function in R designed specifically for computing MDS and producing MDS plots: cmdscale(). It takes a distance object as its argument and then approximates those distances using principal component analysis. This can be more efficient than working with svd() directly, since computing the full SVD takes time. By default the function returns a two-dimensional result, but we can change the number of dimensions returned by setting the argument k (the default is k = 2):
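A minimal sketch of the cmdscale() workflow, assuming mat and group from the setup above:

```r
d <- dist(t(mat))                   # Euclidean distances between samples (columns)
mds <- cmdscale(d)                  # two dimensions by default (k = 2)
plot(mds[, 1], mds[, 2], pch = 16, col = as.numeric(group),
     xlab = "First dimension", ylab = "Second dimension")
```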

Let's compare this result with the one we obtained from the SVD; the two are equivalent up to an arbitrary sign change:

Why is the sign arbitrary? The SVD is not unique: we can multiply any column of $\mathbf{V}$ by $-1$ as long as we also multiply the same column of $\mathbf{U}$ by $-1$, because the two sign changes cancel in the product $\mathbf{U}\mathbf{D}\mathbf{V}^\top$. We can see this with the following transformation:
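A small numerical check of this, assuming the svd object s computed earlier:

```r
# Flipping the sign of the same column in U and V leaves U D V^T unchanged
flipped <- s
flipped$u[, 1] <- -flipped$u[, 1]
flipped$v[, 1] <- -flipped$v[, 1]
all.equal(s$u %*% diag(s$d) %*% t(s$v),
          flipped$u %*% diag(flipped$d) %*% t(flipped$v))   # TRUE
```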

In the calculations above we subtracted the row means, $\boldsymbol{\mu}$, before computing the SVD. If our goal is to approximate the distance between columns, this changes nothing: the distance between $\mathbf{Y}_i - \boldsymbol{\mu}$ and $\mathbf{Y}_j - \boldsymbol{\mu}$ is the same as the distance between $\mathbf{Y}_i$ and $\mathbf{Y}_j$, because $\boldsymbol{\mu}$ cancels when we take the difference: $(\mathbf{Y}_i - \boldsymbol{\mu}) - (\mathbf{Y}_j - \boldsymbol{\mu}) = \mathbf{Y}_i - \mathbf{Y}_j$.

Subtracting the row means is still worthwhile because it reduces the total variation, which makes the approximation based on the first few dimensions of the SVD more accurate.


The related materials for PCA can also be found on the author's GitHub.

We mentioned PCA before, so let's go one step further and talk about the mathematical principle behind PCA.

Let's illustrate the idea of a rotation, which is closely related to PCA, with simulated data:
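A sketch of such a simulation, following the twin-height example used below; the sample size and the 0.95 correlation are illustrative choices.

```r
library(MASS)
set.seed(1)
n <- 100
# Heights of n pairs of twins in standardized units, highly correlated
y <- t(mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)                              # 2 features (twin 1, twin 2) x n samples
```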

First, let's explain what a principal component is.

We use a matrix $\mathbf{Y}$ to represent our data: think of the two rows as two measured features (for example two genes, or the heights of twin 1 and twin 2), and each column as one sample. What we want is a unit vector $\mathbf{v}$, with $\mathbf{v}^\top\mathbf{v} = 1$, that maximizes the sum of squares of $\mathbf{v}^\top\mathbf{Y}$. Each entry of $\mathbf{v}^\top\mathbf{Y}$ can be viewed as the projection of one sample onto the direction $\mathbf{v}$, so we are really looking for a new coordinate system, obtained by rotation, whose first coordinate shows the maximum variation.

Let's try $\mathbf{v} = (1, 0)^\top$ first. This projection only gives us the height of twin 1 (shown in orange in the figure). The title of the figure shows the sum of squares.

Can we find a direction that shows more variation after the rotation? Take, for example, $(1, -1)^\top$.

How about this one? It does not satisfy $\mathbf{v}^\top\mathbf{v} = 1$, so instead we use the normalized vector $(1/\sqrt{2},\, -1/\sqrt{2})^\top$.

This direction is related to the difference between the twins' heights, which we know is small, and the sum of squares confirms it. Finally, let's try the vector $(1/\sqrt{2},\, 1/\sqrt{2})^\top$:

Up to rescaling, this projection is the average height of each pair of twins, and it has the largest sum of squares of the directions we have tried. There is a mathematical procedure for finding the vector $\mathbf{v}$ that maximizes the sum of squares, and the SVD provides it.
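A sketch of the sums of squares for these three candidate directions, assuming the simulated matrix y from above; sum_sq is a helper defined here for illustration.

```r
# Sum of squares of the projection v^T y for a candidate direction v
sum_sq <- function(v, y) {
  v <- v / sqrt(sum(v^2))           # normalize so that v^T v = 1
  sum(crossprod(v, y)^2)            # crossprod(v, y) is t(v) %*% y
}
sum_sq(c(1, 0), y)                  # height of twin 1 only
sum_sq(c(1, -1), y)                 # difference between twins: small
sum_sq(c(1, 1), y)                  # average height: largest of the three
```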

The unit vector $\mathbf{v}_1$ that maximizes the sum of squares $(\mathbf{v}^\top\mathbf{Y})(\mathbf{v}^\top\mathbf{Y})^\top$ defines the first principal component (PC): $\mathbf{v}_1^\top\mathbf{Y}$ is referred to as the first PC, the weights $\mathbf{v}_1$ used to obtain it are referred to as the loadings, and in rotation terminology $\mathbf{v}_1$ is the direction of the first PC.

To obtain the second PC, we repeat the operation above, but on the residuals, i.e. what is left of $\mathbf{Y}$ after removing the part explained by the first PC: $\mathbf{r} = \mathbf{Y} - \mathbf{v}_1\mathbf{v}_1^\top\mathbf{Y}$.

The direction $\mathbf{v}_2$ of the second PC has the following properties: it is a unit vector ($\mathbf{v}_2^\top\mathbf{v}_2 = 1$), it is orthogonal to the first direction ($\mathbf{v}_2^\top\mathbf{v}_1 = 0$), and it maximizes the sum of squares of the projected residuals, $(\mathbf{v}_2^\top\mathbf{r})(\mathbf{v}_2^\top\mathbf{r})^\top$.

When $\mathbf{Y}$ has more than two rows, we can repeat this procedure to find the third, fourth, fifth, and later principal components.

We have shown how to use the singular value decomposition to compute the PCs. However, there is a function in R designed specifically for finding principal components: prcomp(). It centers the data by default and is used as follows:
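For the simulated twin data above, the call might look like this (the transpose is explained at the end of this section):

```r
# prcomp() expects samples in rows and features in columns, hence t(y);
# centering is done by default
pc <- prcomp(t(y))
```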

The results are the same as those obtained with the SVD, up to an arbitrary sign flip: as noted in the MDS section, the sign of each singular vector pair is arbitrary, so the PCs can come out with their signs reversed.
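A sketch of the comparison, assuming the y and pc objects above:

```r
s <- svd(y - rowMeans(y))           # SVD of the row-centered twin data
head(pc$x[, 1])                     # first PC as returned by prcomp() ...
head(s$d[1] * s$v[, 1])             # ... matches d_1 * v_1, up to sign
```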

The loadings can be extracted from the rotation component of the prcomp() result. They are equivalent, up to at most a sign flip, to the left singular vectors (the u component) returned by svd() on the row-centered data.
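A sketch, continuing with the same pc and s objects:

```r
pc$rotation                         # loadings from prcomp() ...
s$u                                 # ... equal the left singular vectors, up to sign
```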

The variance explained is reported through the sdev component (the standard deviation of each PC). The proportions of variance explained computed from it match those computed from the squared singular values.
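Continuing with the same objects, the proportions agree exactly:

```r
pc$sdev^2 / sum(pc$sdev^2)          # proportion of variance explained (prcomp)
s$d^2 / sum(s$d^2)                  # same proportions from the squared singular values
```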

One last note on why we transposed y: prcomp() uses a layout different from the one usually used for high-throughput data. We normally store samples in the columns and features in the rows, while prcomp() expects the opposite, with features in the columns and samples in the rows.