Minimalist Statistics: helping you get started with statistics quickly
Minimalist Statistics has 21 chapters, organized into two parts. The first part introduces a number of descriptive statistics, and the second part uses them to carry out statistical inference. Reading through the book, one finds that its ultimate goal is to complete just two very meaningful inferences.

Naturally, this article does not reproduce the full derivation. Following the structure of the original, it is divided into two parts: statistics and interval estimation.

Average = sum of (group value * relative frequency)

Average = (sum of the data values)/(number of data)

Both formulas compute the arithmetic mean; the first may be the one used more often. Keep in mind, however, that the arithmetic mean is only one of several ways to take an average.

Generally speaking, use the arithmetic mean when the total of the data must be preserved; use the geometric mean when the product must be preserved, as with growth rates; and use the harmonic mean for quantities such as speeds.
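The three averages can be compared with a minimal sketch using Python's standard `statistics` module. The sales figures, growth factors, and speeds below are made-up illustration values, not data from the original book.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Arithmetic mean: preserves the total (hypothetical daily sales).
sales = [10, 20, 30]
arithmetic = fmean(sales)            # 3 * arithmetic equals sum(sales)

# Geometric mean: preserves the product (hypothetical yearly growth factors).
growth = [1.10, 1.21]
geometric = geometric_mean(growth)   # constant factor giving the same compound growth

# Harmonic mean: average speed over two equal distances (hypothetical km/h).
speeds = [40, 60]
harmonic = harmonic_mean(speeds)

print(arithmetic, geometric, harmonic)
```

Note that the harmonic mean of 40 and 60 is lower than their arithmetic mean of 50, because more time is spent travelling at the slower speed.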

Deviation = (data value) - (average)

Variance = (sum of squares of deviations)/(number of data)

Standard deviation = square root of variance = root mean square of deviation

The variance can also be calculated from the relative frequencies after grouping:

Variance = sum of [(group value - average)^2 * (relative frequency)]
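As a quick check, the following sketch computes the variance and standard deviation both directly from the data and from relative frequencies, and the two routes agree. The data set is a hypothetical example; the variance divides by the number of data, as in the formulas above.

```python
from collections import Counter
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

# Direct route: deviations from the mean, squared and averaged.
n = len(data)
mean = sum(data) / n
deviations = [x - mean for x in data]
variance = sum(d * d for d in deviations) / n
sd = sqrt(variance)

# Grouped route: weight each distinct value by its relative frequency.
rel_freq = {value: count / n for value, count in Counter(data).items()}
grouped_mean = sum(v * f for v, f in rel_freq.items())
grouped_variance = sum((v - grouped_mean) ** 2 * f for v, f in rel_freq.items())

print(mean, variance, sd, grouped_variance)
```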

The mean is a single representative number taken from the distribution of the data, so the data can be thought of as spread around the mean. The standard deviation is the measure used to evaluate this spread: it averages how far the data lie from the mean. Because the deviations are squared, a deviation counts as positive whether it lies above or below the mean, which prevents deviations from cancelling each other out in the average.

Here, as in the original book, the standard deviation is abbreviated S.D.; it is a very important statistic in the book. The S.D. generally serves as the yardstick for judging how unusual a data point is: data within about 1 S.D. of the mean can be regarded as ordinary, and data more than 2 S.D. away as special.

The S.D. has the following properties:

In standard mathematical treatments, the normal distribution is defined by its probability density function and derived from probability theory. For simplicity, the original book does not involve probability theory at all, and neither does this article; it only explains the essentials of the normal distribution from an applied point of view.

Data whose distribution follows the bell-shaped curve in the figure below can be regarded as normally distributed (μ represents the mean and σ represents the standard deviation):

The standard normal distribution is a normal distribution with a mean value of 0 and a standard deviation of 1.

From the above figure, we can see some properties of normal distribution:

With knowledge of the normal distribution, we can make "predictions". If the uncertain phenomenon we care about can be treated as normally distributed, the properties of the normal distribution described above let us predict the range in which future data will appear.

The normal curve shows that improving the hit rate of a prediction requires widening the predicted range. A 100% hit rate would need a range from negative infinity to positive infinity. Common choices are a "95% hit rate" or a "99% hit rate"; the original book uses 95%, the most widely used level, and the explanations that follow are based on it.
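The trade-off between hit rate and range width can be sketched with `statistics.NormalDist` from the Python standard library. The helper name `prediction_interval` is hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist

def prediction_interval(mu, sigma, hit_rate=0.95):
    """Central range that a normal(mu, sigma) observation falls into
    with probability hit_rate (hypothetical helper for illustration)."""
    tail = (1 - hit_rate) / 2
    z = NormalDist().inv_cdf(1 - tail)   # about 1.96 for a 95% hit rate
    return mu - z * sigma, mu + z * sigma

low95, high95 = prediction_interval(0, 1, 0.95)
low99, high99 = prediction_interval(0, 1, 0.99)
print(low95, high95)
print(low99, high99)
```

For the standard normal distribution the 95% range is roughly -1.96 to +1.96, and the 99% range is wider, which is the price of the higher hit rate.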

Starting from the 95% hit interval, two applications follow: hypothesis testing and interval estimation.

A hypothesis test that a population parameter of a normal (or approximately normal) distribution equals a certain value can be carried out as follows:

Suppose the population is normally distributed and the hypothesis states that its mean is μ and its standard deviation is σ. If the observed value x satisfies the inequality

-1.96 ≤ (x - μ)/σ ≤ 1.96

the hypothesis is not rejected (it is accepted); otherwise the hypothesis is rejected.
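The test can be sketched in a few lines, assuming the standard 95% critical value of 1.96 for the standardized observation (x - μ)/σ. The function name and the numbers are hypothetical and serve only as an illustration.

```python
def hypothesis_is_kept(x, mu, sigma, critical=1.96):
    """Return True when observation x is compatible, at the 95% level,
    with a normal population of mean mu and standard deviation sigma."""
    z = (x - mu) / sigma          # standardized distance from the hypothesized mean
    return -critical <= z <= critical

# Hypothetical example: the hypothesis says mean 170 and S.D. 6.
kept_178 = hypothesis_is_kept(178, mu=170, sigma=6)   # z is about 1.33
kept_185 = hypothesis_is_kept(185, mu=170, sigma=6)   # z is 2.5
print(kept_178, kept_185)
```

An observation of 178 lies inside the 95% range, so the hypothesis is kept; an observation of 185 lies outside it, so the hypothesis is rejected.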

In fact, nothing is being predicted here. Instead, a test is made to see whether a hypothesized population parameter is reasonable. The basis of the test is the working assumption that the data we observe fall within the 95% prediction interval of the population distribution. If the observation falls outside that interval under the hypothesized parameter, the hypothesis is rejected; otherwise it is accepted.

Interval estimation works as follows: collect every population parameter value under which the actual observation falls within the "95% prediction interval" for that value. The range of parameter values determined in this way is called the "95% confidence interval". In other words, it is the set of all population parameter values that are not rejected by the hypothesis test described above (the first application).

When the standard deviation σ of the normal population is known, the unknown mean μ is estimated as follows. Using the observed value x, solve the inequality

-1.96 ≤ (x - μ)/σ ≤ 1.96

for μ, which gives

x - 1.96σ ≤ μ ≤ x + 1.96σ

A 95% confidence interval is an interval such that, when intervals are computed by the same method from many different observations, 95% of them contain the true population parameter.
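This interpretation can be checked with a small simulation. The sketch below assumes a single observation from a normal population with known σ and the standard 95% critical value 1.96; the true mean, the seed, and the function names are arbitrary choices for illustration.

```python
import random

def confidence_interval(x, sigma, critical=1.96):
    """95% confidence interval for the mean from one observation x, sigma known."""
    return x - critical * sigma, x + critical * sigma

def coverage(true_mu, sigma, trials=10_000, seed=0):
    """Share of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.gauss(true_mu, sigma)            # one new observation
        low, high = confidence_interval(x, sigma)
        if low <= true_mu <= high:
            hits += 1
    return hits / trials

rate = coverage(true_mu=50, sigma=10)
print(rate)   # expected to be close to 0.95
```

Any individual interval either contains the true mean or does not; the 95% describes how often the method succeeds over many repetitions.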

This concludes the description of statistics. It mainly covered how different statistics describe the characteristics of data, and briefly explained "statistical testing" and "interval estimation" for a normal population.

In real life it is almost impossible to observe every member of a population; in most cases we can obtain only part of the data. Experience does suggest that "with enough observations, the population can be captured quite clearly", but our goal is to "infer the population without making that many observations."

We know that data observed from a population are constrained by the characteristics of that population. The original book gives the following conclusion:

Let us return to the earlier goal: inferring the population from sample data. This relies on some mathematical properties of the population. As an introductory statistics book, the original does not prove these results, and they are likewise used directly here.

From the above properties, we can draw the conclusion that:

For a normal population with mean μ and standard deviation σ, the mean A of n samples satisfies the following inequality with 95% probability, and solving it for μ gives the 95% confidence interval:

-1.96 ≤ (A - μ)·√n/σ ≤ 1.96

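This part pursues four goals: estimating the population mean when the population variance is known, estimating the population variance when the population mean is known, estimating the population variance when the population mean is unknown, and estimating the population mean when the population variance is unknown.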

Let's go through them one by one.

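Estimating the population mean when the population standard deviation σ is known is very simple. With A denoting the mean of n samples, the formula

Z = (A - μ)·√n/σ, with -1.96 ≤ Z ≤ 1.96

leads to the 95% confidence interval of μ:

A - 1.96·σ/√n ≤ μ ≤ A + 1.96·σ/√n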
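A minimal sketch of this estimate, assuming the standard 95% critical value 1.96 and the interval sample mean ± 1.96·σ/√n. The sample values and the function name are hypothetical.

```python
from math import sqrt
from statistics import fmean

def mean_ci_known_sigma(sample, sigma, critical=1.96):
    """95% confidence interval for the population mean when sigma is known."""
    n = len(sample)
    a = fmean(sample)                         # sample mean
    half_width = critical * sigma / sqrt(n)   # shrinks as n grows
    return a - half_width, a + half_width

sample = [52, 48, 50, 51, 49, 53, 47, 50, 50]   # hypothetical data, n = 9
low, high = mean_ci_known_sigma(sample, sigma=3)
print(low, high)
```

With nine observations the half-width is 1.96·3/3 = 1.96, so the interval is centered on the sample mean of 50. Quadrupling the sample size would halve the width.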

The sample mean of a normal population follows a normal distribution and reflects the population mean, so the estimate of the population mean can be derived from the inequality above. The sample variance likewise reflects the population variance, but it does not follow a normal distribution. Instead, a suitably scaled sample variance follows a chi-square distribution.

Chi-square distribution: if n independent random variables ξ1, ξ2, ..., ξn all follow the standard normal distribution (that is, they are independent and identically distributed standard normal variables), then the sum of their squares is a new random variable, and its distribution is called the chi-square distribution with n degrees of freedom.
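The definition can be made concrete with a short simulation. It relies on the known fact that a chi-square variable with n degrees of freedom has mean n; the seed and the number of draws are arbitrary choices for this sketch.

```python
import random

def chi_square_draw(n, rng):
    """One draw: the sum of squares of n independent standard normal values."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(n))

rng = random.Random(1)
n = 5
draws = [chi_square_draw(n, rng) for _ in range(20_000)]

mean_of_draws = sum(draws) / len(draws)
print(mean_of_draws)   # expected to be close to n
print(min(draws))      # a sum of squares is never negative
```

Because the variable is a sum of squares, its distribution lives only on non-negative values and is skewed to the right, unlike the symmetric normal curve.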

The distribution curve of chi-square distribution is as follows:

From the definition above, for n samples x1, x2, ..., xn observed from a normal population with mean μ and standard deviation σ, the statistic V defined by the following formula follows a chi-square distribution with n degrees of freedom:

V = [(x1 - μ)² + (x2 - μ)² + ... + (xn - μ)²] / σ²

The critical value table of chi-square distribution is as follows:

By looking up the table, we can find the range in which V falls with 95% probability, and from it obtain the 95% confidence interval of the population variance. For example, for a chi-square statistic V with 5 degrees of freedom, the 95% range is 0.83 ≤ V ≤ 12.83.
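A sketch of the resulting estimate, assuming V is the sum of squared deviations from the known population mean divided by σ². The critical values 0.8312 and 12.8325 are the standard 2.5% and 97.5% points of the chi-square distribution with 5 degrees of freedom; the sample and the function name are hypothetical.

```python
def variance_ci_known_mean(sample, mu, chi2_low, chi2_high):
    """95% confidence interval for the population variance when mu is known.

    chi2_low and chi2_high are table values for len(sample) degrees of freedom.
    """
    ss = sum((x - mu) ** 2 for x in sample)   # V * sigma^2
    # chi2_low <= ss / sigma^2 <= chi2_high, solved for sigma^2
    return ss / chi2_high, ss / chi2_low

sample = [8, 11, 9, 12, 10]   # hypothetical data, n = 5
low, high = variance_ci_known_mean(sample, mu=10, chi2_low=0.8312, chi2_high=12.8325)
print(low, high)
```

Dividing by the larger critical value gives the lower bound, and dividing by the smaller one gives the upper bound, so the interval is far from symmetric.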

The estimate above requires the population mean of the normal population. It is unnatural to estimate the population variance only when the population mean is known, because in practice the population mean is usually unknown as well. So how can we estimate the population variance without knowing the population mean?

The natural idea is to estimate the population variance from the sample mean and the sample variance. Statisticians have proved that the following statistic W also follows a chi-square distribution, except that its degrees of freedom are n - 1 rather than the number of sample data n (where A is the sample mean):

W = [(x1 - A)² + (x2 - A)² + ... + (xn - A)²] / σ²

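There is no need to consider how to prove that W follows a chi-square distribution with n - 1 degrees of freedom; the conclusion can be used directly. The sample variance, computed as before by dividing by the number of data, with A denoting the sample mean, is:

s² = [(x1 - A)² + (x2 - A)² + ... + (xn - A)²] / n

Therefore, it can be inferred that:

W = (sum of squares of deviations from the sample mean) / σ² = n·s² / σ²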

Because W follows a chi-square distribution with n - 1 degrees of freedom, the 95% range of W gives an inequality in σ². Solving this inequality yields the 95% confidence interval of the population variance, which completes the estimation of the population variance.
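A sketch of this estimate, under the assumption that W = n·s²/σ² with the sample variance s² computed by dividing by n. The critical values 0.4844 and 11.1433 are the standard 2.5% and 97.5% points of the chi-square distribution with 4 degrees of freedom; the sample and the function name are hypothetical.

```python
from statistics import fmean

def variance_ci_unknown_mean(sample, chi2_low, chi2_high):
    """95% confidence interval for the population variance, mean unknown.

    chi2_low and chi2_high are table values for len(sample) - 1 degrees of freedom.
    """
    n = len(sample)
    mean = fmean(sample)
    s2 = sum((x - mean) ** 2 for x in sample) / n   # sample variance (divide by n)
    numerator = n * s2                               # equals W * sigma^2
    # chi2_low <= numerator / sigma^2 <= chi2_high, solved for sigma^2
    return numerator / chi2_high, numerator / chi2_low

sample = [8, 11, 9, 12, 10]   # hypothetical data, n = 5, so 4 degrees of freedom
low, high = variance_ci_unknown_mean(sample, chi2_low=0.4844, chi2_high=11.1433)
print(low, high)
```

Only sample quantities enter the calculation; the price for not knowing the population mean is one lost degree of freedom.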

Only one problem remains: how to estimate the population mean from sample data alone. The previous discussion suggests the answer. If we can find a statistic that, apart from the population mean μ, is built only from quantities computed from the sample, and whose distribution is known, then an estimate of the population mean follows naturally.

The British chemist Gosset discovered such a distribution, which is called the t distribution. Let us look at its definition and characteristics.

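The statistic T defined by the following formula follows the t distribution with n - 1 degrees of freedom, where A is the sample mean and s is the sample standard deviation (the square root of the variance computed by dividing by n):

T = (A - μ)·√(n - 1) / s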

We know that the statistic Z obtained from n samples of a normal population, with A denoting the sample mean, follows the standard normal distribution:

Z = (A - μ)·√n / σ

In reality, however, the population standard deviation σ is usually unknown, so Z cannot be used to estimate the population mean. The statistics T and Z are similar in form. If n is large enough, the distribution of T is close to the standard normal distribution; when n is small, the difference between the two cannot be ignored.
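The gap between the two distributions can be illustrated by simulation. The sketch assumes the form T = (sample mean - μ)·√(n - 1)/s, with s computed by dividing by n, which is equivalent to the more common form using the n - 1 variance and √n. The seed, trial count, and function names are arbitrary choices.

```python
import random
from math import sqrt

def t_statistic(sample, mu):
    """T = (sample mean - mu) * sqrt(n - 1) / s, with s dividing by n."""
    n = len(sample)
    mean = sum(sample) / n
    s = sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return (mean - mu) * sqrt(n - 1) / s

def share_outside(n, critical=1.96, trials=10_000, seed=2):
    """Share of simulated T values beyond the normal 95% cutoff of 1.96."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        if abs(t_statistic(sample, 0)) > critical:
            outside += 1
    return outside / trials

small = share_outside(n=5)     # well above 5%: heavy tails for small n
large = share_outside(n=100)   # close to 5%: nearly normal
print(small, large)
```

For n = 5 far more than 5% of the T values fall outside ±1.96, so using the normal cutoff with a small sample would produce intervals that are too narrow.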

The probability density curve and the characteristics of the t distribution are as follows:

T-distribution lookup table

For each degree of freedom, the 95% range of T is obtained by cutting off a tail of 0.025 on each side. For example, for a t distribution with 10 degrees of freedom, the 95% range is -2.228 ≤ T ≤ 2.228.
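Putting the pieces together, the following sketch estimates the population mean when the variance is unknown. It assumes the interval sample mean ± t·s/√(n - 1), with s computed by dividing by n, and uses the table value 2.228 for 10 degrees of freedom. The sample and the function name are hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev

def mean_ci_unknown_sigma(sample, t_critical):
    """95% confidence interval for the population mean, variance unknown.

    t_critical is the table value for len(sample) - 1 degrees of freedom.
    """
    n = len(sample)
    mean = fmean(sample)
    s = pstdev(sample)                           # standard deviation dividing by n
    half_width = t_critical * s / sqrt(n - 1)
    return mean - half_width, mean + half_width

sample = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54, 50]   # hypothetical, n = 11
low, high = mean_ci_unknown_sigma(sample, t_critical=2.228)
print(low, high)
```

The same interval results from the more familiar form mean ± t·s'/√n, where s' is the standard deviation computed by dividing by n - 1.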

This completes the estimation of the population mean when the population variance is unknown.

This article has introduced some commonly used statistics and several common methods for estimating population parameters from sample data. The original book is, after all, only an introduction to statistics, and this article should give a general picture of its structure and content. To learn about specific application scenarios for these statistics, consult the original book; for more advanced material, readers can take more specialized statistics courses.

In fact, the two most basic statistics, the mean and the standard deviation, already describe some important characteristics of data. Being able to infer the population mean and standard deviation from sample data is a powerful capability, and it can be of great help in everyday life and in production practice.

Finally, a mind map of the main contents of this book is attached.