From chi-square test to the correlation between things
Life is full of all kinds of things: basketball, cars, sports, stocks. Any two of them are either related or unrelated (stating the obvious, haha). In other words there is uncertainty, and that uncertainty is part of what makes life wonderful and interesting. Is there a way to determine whether two things, say writing and playing basketball, are related? And can that correlation even be quantified?

The answer is: there is a way!

This method is called the chi-square test. Before introducing it in detail, let's cover some basics. The chi-square test is a hypothesis-testing method that measures the deviation between the values actually observed in a sample and the values expected in theory. The larger the chi-square value, the larger the deviation; the smaller the chi-square value, the smaller the deviation; a chi-square value of 0 means the observations match the expectation exactly. Put bluntly, it measures the gap between how something actually is and how you think it should be.

The problem of quantifying correlation raised above can therefore be turned into the problem of measuring the degree of deviation in the samples. So how do we describe the degree of deviation?

In high school mathematics we learned about the expectation E and the variance σ² (remember? A quick plain-language review: the expectation is the average of all samples, and the variance is how spread out the samples are around that average). Can the expectation alone measure the degree of deviation? The expectation represents the overall level of the sample set, but it says nothing about how the samples are distributed. For example, suppose the average math score in a class of 50 students is 65. It is not entirely reasonable to describe the math level of this class as 65, because 40 of the 50 students may have scored more than 80 while the other 10 scored less than 10. The overall level of this class is clearly rather high; a few students are dragging the average down. The variance captures how far the samples deviate from the average, which makes up for exactly this shortcoming of the expectation.
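The class example can be checked numerically. The sketch below uses made-up scores (40 students at 81 points and 10 students at 1 point, chosen only so that the average comes out at 65) and compares them with a second hypothetical class in which everyone scores exactly 65; the names `skewed_class` and `uniform_class` are illustrative.

```python
from statistics import mean, pvariance

# Hypothetical scores: 40 students with 81 points, 10 students with 1 point.
skewed_class = [81] * 40 + [1] * 10
# A second hypothetical class in which every student scores exactly 65.
uniform_class = [65] * 50

for name, scores in [("skewed", skewed_class), ("uniform", uniform_class)]:
    # Both classes share the same mean; only the variance tells them apart.
    print(name, "mean =", mean(scores), "variance =", pvariance(scores))
```

Both classes have the same expectation, so the expectation alone cannot distinguish them; the population variance can.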

Therefore, a good indicator of the degree of deviation needs to combine both parameters, expectation and variance. We can define the deviation index as follows (loosely speaking, variance divided by expectation):
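One form consistent with this description (squared deviation divided by the expectation, summed over the samples) is the Pearson statistic. The symbols $x_i$ for the i-th observed value and $E$ for its expected value are assumed names, not taken from the original text:

```latex
\chi^2 = \sum_{i=1}^{n} \frac{(x_i - E)^2}{E} \tag{1}
```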

In the above formula, n is the number of samples and i indexes one of them. In fact, formula (1) is already enough to calculate the degree of deviation, but it looks rather abstract and is not very convenient to apply, so let's transform it further.

The most common problems in everyday life can generally be handled by judging the correlation between two things (technically, a chi-square test with 1 degree of freedom). Suppose there are two things, A and B, and the samples are tabulated as follows:

Table 1
                 B present    B absent
A present            a            b
A absent             c            d

In Table 1, a, b, c and d are counts over the sample set: a is the number of samples in which A and B both occur, b is the number in which A occurs but B does not, c is the number in which B occurs but A does not, and d is the number in which neither A nor B occurs. From these counts the chi-square formula can be derived as follows:
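With a, b, c and d as the four cell counts of the table, the standard closed form of the Pearson chi-square statistic for a 2x2 table is the following; $n$ is an assumed shorthand for the total number of samples:

```latex
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}, \qquad n = a + b + c + d \tag{2}
```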

The derivation of formula (2) is not expanded here; interested readers can work through it themselves. In brief: from the data in Table 1 we can derive the expected count of each of the four cells under the assumption that A and B are independent of each other, and then apply formula (1) to get the deviation between each of a, b, c and d and its expected value. The total deviation of A and B from the independent case is the sum of the deviations of the four cells, and this sum simplifies to formula (2), the chi-square value of A and B. The larger the chi-square value of A and B, the more strongly they are related; a chi-square value of 0 means the counts show no correlation at all. (Tip: because correlation is used here as a relative measure, the constant factor (a+b+c+d) is often dropped when chi-square values computed on the same sample set are compared.)
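The derivation outlined above can be checked numerically. The sketch below is illustrative only: the function names and the example counts are hypothetical, and the closed form used is the standard 2x2 Pearson chi-square. It sums (observed - expected)² / expected over the four cells, with expected counts computed under independence, and compares the result with the closed form.

```python
def chi_square_from_cells(a, b, c, d):
    """Sum (observed - expected)^2 / expected over the four cells."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected count of each cell if A and B were independent:
    # (row total) * (column total) / n.
    expected = [
        (a + b) * (a + c) / n,  # A present, B present
        (a + b) * (b + d) / n,  # A present, B absent
        (c + d) * (a + c) / n,  # A absent,  B present
        (c + d) * (b + d) / n,  # A absent,  B absent
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_closed_form(a, b, c, d):
    """Closed form for a 2x2 table: n(ad - bc)^2 / product of the margins."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts, chosen only for illustration.
print(chi_square_from_cells(20, 10, 5, 15))
print(chi_square_closed_form(20, 10, 5, 15))
```

Both functions assume every row and column total is nonzero; a table with an empty row or column would cause a division by zero.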

Let's go back to the question at the beginning of the article: determine the correlation between writing and playing basketball.

Take the 50 students in a class as the sample space, let event A be "the student plays basketball" and event B be "the student writes", count the number of students in each of the four cells of Table 1, and substitute the counts into formula (2) to get the chi-square value. This value indicates how strongly playing basketball and writing are related in this class: the larger the value, the stronger the correlation.

To summarize, the steps for calculating the correlation between two things are:

Select an appropriate sample space -> count the samples in the four cells -> plug the counts into the formula.
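The three steps can be sketched end to end as follows. The survey records are entirely made up for illustration (they are not data from the article), and the formula used is the standard closed form of the 2x2 Pearson chi-square.

```python
from collections import Counter

# Step 1: sample space. A hypothetical survey of 50 students; each record is
# (plays_basketball, writes). The counts are invented for illustration only.
records = (
    [(True, True)] * 18
    + [(True, False)] * 7
    + [(False, True)] * 9
    + [(False, False)] * 16
)

# Step 2: count the samples in the four cells of the 2x2 table.
cells = Counter(records)
a = cells[(True, True)]    # plays basketball and writes
b = cells[(True, False)]   # plays basketball, does not write
c = cells[(False, True)]   # writes, does not play basketball
d = cells[(False, False)]  # neither

# Step 3: plug the counts into the 2x2 chi-square formula.
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(f"a={a} b={b} c={c} d={d} chi-square={chi2:.3f}")
```

For a test with 1 degree of freedom, the conventional critical value at the 5% significance level is about 3.841, so a chi-square value above that is usually read as evidence that the two things are not independent.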

The chi-square test with 1 degree of freedom can be used to solve many problems in daily life. For example, when preparing to open a fruit shop, it can be used to determine the correlation between the store location and the subway entrance. The multi-degree-of-freedom chi-square test and its application scenarios will be introduced next.