# Statistical Basis of Big Data
Probability theory is the foundation of statistics. Statistics works on the front line of applications, and probability theory supplies its weapons.

When we study R, we will do hypothesis testing. A basic technique in hypothesis testing is to construct a test statistic that, if the hypothesis holds, follows a known probability distribution. We then compute the value of this statistic from the data and see where it falls in that distribution: which region it lies in and how likely a value in that region is. If that probability is too low, we reject the hypothesis. Constructing a suitable statistic is a highly technical task, carried out by mathematicians, and it rests on probability theory.
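As a minimal sketch of the procedure described above, the following Python example tests whether a coin is fair. The observed data (60 heads in 100 tosses), the 0.05 threshold, and the function name are assumptions made for illustration. The statistic is the number of heads, which follows a binomial distribution if the hypothesis is true.

```python
from math import comb

def binomial_two_sided_p(heads: int, tosses: int, p: float = 0.5) -> float:
    """Exact two-sided p-value: total probability of all outcomes
    that are no more likely than the observed one under the hypothesis."""
    pmf = [comb(tosses, k) * p**k * (1 - p) ** (tosses - k)
           for k in range(tosses + 1)]
    observed = pmf[heads]
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))

# Hypothetical observation: 60 heads in 100 tosses of a supposedly fair coin.
p_value = binomial_two_sided_p(60, 100)
reject = p_value < 0.05  # assumed significance threshold
print(f"p-value = {p_value:.4f}, reject hypothesis: {reject}")
```

With these assumed numbers the probability is low but still above the threshold, so the hypothesis of a fair coin is not rejected.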

Classical probability theory: when tossing a coin, heads and tails each have probability 1/2, and the tosses are independent of one another. But the assumption of equally likely outcomes is not very rigorous, which is interesting to think about. Andrey Kolmogorov founded modern probability theory. He put it on an axiomatic basis and thereby turned probability theory into a rigorous subject.

Learning and using probability helps people reason more clearly about uncertainty and make more accurate decisions.

Statistics: Statistics can be divided into descriptive statistics and inferential statistics. Descriptive statistics summarize the central tendency and dispersion of data with specific figures or charts. For example, the average score, the highest score and the number of people in each score band all fall within the scope of descriptive statistics. Inferential statistics infer the characteristics of the whole population from sample data. For example, product quality inspection generally uses sampling, and the pass rate of the sampled products is taken as an estimate of the overall pass rate. Statistics is widely used: wherever there is data, there is a place for statistics. Currently popular application areas include economics, medicine, psychology, and big data in the IT industry.

For example, for the data set 1 2 3 4 5, which number would you use as its representative? The answer is 3, because 3 is the center of this set of data. If only one number is allowed to represent a set of data, how should it be chosen? Choose the center of the data, that is, a statistic that reflects the central tendency of the data set. Central tendency: in statistics, the degree to which the data cluster around a central value. It reflects the location of the center of the data. Commonly used statistics of central tendency are the mean: the arithmetic average, which describes the average level; the median: the number in the middle after the data are sorted by size, which describes the middle level; and the mode: the value that occurs most often, which describes the typical level.

Mean: the arithmetic average. For example, in a math test, the scores of the members of Group A and Group B are A: 70, 85, 62, 98, 92 and B: 82, 87, 95, 80, 83. Compare the scores of the two groups.

The mean score of Group B (85.4) is higher than that of Group A (81.4), that is, Group B scored higher overall than Group A.
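The comparison of the two groups can be checked with a short Python sketch that uses the standard `statistics` module. The variable names are chosen here for illustration.

```python
from statistics import mean

group_a = [70, 85, 62, 98, 92]
group_b = [82, 87, 95, 80, 83]

mean_a = mean(group_a)  # 407 / 5
mean_b = mean(group_b)  # 427 / 5
print(f"Group A mean: {mean_a:.1f}")
print(f"Group B mean: {mean_b:.1f}")
print("Group B scored higher overall:", mean_b > mean_a)
```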

Median: the number in the middle after arranging the data in order of size (from large to small or from small to large). For example, for 58, 32, 46, 92, 73, 88, 23: 1. Sort first: 23, 32, 46, 58, 73, 88, 92. 2. Take the number in the middle position: 58. With an even number of values, for example 58, 32, 46, 92, 73, 88, 23, 63: 1. Sort first: 23, 32, 46, 58, 63, 73, 88, 92. 2. Take the two middle numbers, 58 and 63, and average them: (58 + 63) / 2 = 60.5.
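The two median examples can be reproduced with a small Python sketch. The helper `median_by_sorting` is a hypothetical name that spells out the sort-then-pick-the-middle procedure, and the standard-library `statistics.median` is shown alongside as a cross-check.

```python
from statistics import median

def median_by_sorting(values):
    """Sort, then take the middle value (odd count)
    or the average of the two middle values (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [58, 32, 46, 92, 73, 88, 23]
even_data = [58, 32, 46, 92, 73, 88, 23, 63]

print(median_by_sorting(odd_data), median(odd_data))
print(median_by_sorting(even_data), median(even_data))
```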

Mode: the value with the highest frequency in the data (the value that makes up the largest share). A set of data may have several modes or none. In 1 2 2 3 3, the modes are 2 and 3. In 1 2 3 4 5 there is no mode: when all the frequencies are the same, there is no mode. In 1 1 2 2 3 3 4, the values 1, 2 and 3 share the highest frequency, so all three are modes. The mode applies not only to numerical data but also to non-numerical data such as {apples, apples, bananas, oranges}, whose mode is apples. R has no built-in function that computes the statistical mode directly, but the mode can be found indirectly by counting the frequency of each value.
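The frequency-counting approach described above can be sketched in Python with `collections.Counter`. The function name `modes` is hypothetical, and the rule that equal frequencies mean "no mode" follows the convention stated in the text.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, or an empty list when
    all distinct values occur equally often (no mode)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # every frequency is the same: no mode
    return [value for value, c in counts.items() if c == top]

print(modes([1, 2, 2, 3, 3]))   # two modes
print(modes([1, 2, 3, 4, 5]))   # no mode
print(modes(["apples", "apples", "bananas", "oranges"]))  # non-numerical data
```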

Let's compare the advantages and disadvantages of the mean, median and mode. [Figure: comparison of mean, median and mode]

For example, the employees and salaries of two companies are as follows. A: 1 manager with a monthly salary of 100000; 15 senior employees with a monthly salary of 10000; 20 ordinary employees with a monthly salary of 7500. B: 1 manager with a monthly salary of 20000; 20 senior employees with a monthly salary of 11000; 15 ordinary employees with a monthly salary of 9000. Please compare the salary levels of the two companies. If you only consider salary, which company would you choose?

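Computed from these figures, the mean is about 11111 for A and about 10417 for B; the median is 7500 for A and 11000 for B; the mode is 7500 for A and 11000 for B.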

Judging by the mean, the monthly salary at Company A is higher than at Company B, but Company A has one extreme value that pulls its mean up considerably. In this case it is clearly unsound to judge by the mean alone. Judging by the median and the mode, the salary level of Company B is higher. For an ordinary employee, choosing Company B is the more reasonable decision.
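The salary comparison can be recomputed with a short Python sketch that expands each company's payroll into a list of individual salaries. The list and dictionary names are chosen here for illustration.

```python
from statistics import mean, median, mode

# One list entry per employee: manager, senior employees, ordinary employees.
company_a = [100000] * 1 + [10000] * 15 + [7500] * 20
company_b = [20000] * 1 + [11000] * 20 + [9000] * 15

summary = {
    name: (mean(salaries), median(salaries), mode(salaries))
    for name, salaries in {"A": company_a, "B": company_b}.items()
}

for name, (avg, med, most_common) in summary.items():
    print(f"{name}: mean={avg:.0f} median={med:.0f} mode={most_common}")
```

The single manager salary of 100000 raises the mean of A above that of B, while the median and mode, which are not sensitive to one extreme value, are both higher for B.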

Compare the following two groups of data: A: 1 2 5 8 9 and B: 3 4 5 6 7. The mean of both groups is 5, but the data in group B are clearly closer to 5. Statistics of central tendency alone are therefore not enough to describe data. We also need statistics that describe the dispersion of the data.

Range: maximum minus minimum, which simply describes the span of the data. A: 9 - 1 = 8; B: 7 - 3 = 4. With five numbers in each group, the range of A is larger than that of B, so A is also more dispersed than B. However, the range alone is not enough to measure dispersion. For example, for A: 1 2 5 8 9 and B: 1 4 5 6 9, the two groups have the same range, 9 - 1 = 8, yet the values in B lie closer to the center.
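The range calculations in this paragraph can be reproduced with a minimal Python sketch. The function name `value_range` and the variable names are hypothetical.

```python
def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

a = [1, 2, 5, 8, 9]
b = [3, 4, 5, 6, 7]
b_alt = [1, 4, 5, 6, 9]  # second example: same extremes as a

print(value_range(a))      # 9 - 1
print(value_range(b))      # 7 - 3
print(value_range(b_alt))  # same range as a, although the middle values differ
```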

Variance: in statistics, variance is the more commonly used measure of the dispersion of data. The farther the data lie from the center, the more dispersed they are, and the larger the variance, the more dispersed the data set.

For the earlier set of data 1 2 5 8 9, the variance is 12.5 (the sample variance, which divides by n - 1). Comparing 12.5 with the original data, we see that it is larger than every value in the data. Does this mean that the data are very dispersed? In fact, the unit of variance differs from the unit of the original data, so the comparison is meaningless. If the unit of the original data is m, then the unit of variance is m². To keep the units consistent, we introduce a new statistic, the standard deviation: sqrt(var()), which avoids the measurement problem caused by the squared unit. Like variance, the larger the standard deviation, the more dispersed the data. For A: 1 2 5 8 9 the standard deviation is about 3.54, and for B: 3 4 5 6 7 it is about 1.58.
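As a sketch of the calculation, the Python `statistics` module provides `variance` and `stdev`, which use the sample formula (dividing by n - 1) and therefore reproduce the value 12.5 quoted above. `pvariance` is included only to show that the population formula (dividing by n) gives a different number.

```python
from statistics import pvariance, stdev, variance

a = [1, 2, 5, 8, 9]
b = [3, 4, 5, 6, 7]

var_a, var_b = variance(a), variance(b)  # sample variance, divides by n - 1
sd_a, sd_b = stdev(a), stdev(b)          # square root of the sample variance

print(f"A: variance={var_a} sd={sd_a:.2f}")
print(f"B: variance={var_b} sd={sd_b:.2f}")
print(f"A, population variance (divides by n): {pvariance(a)}")
```

Both groups have mean 5, but the standard deviation of A is more than twice that of B, and it is expressed in the same unit as the data.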

The math test results of 40 students in a class are as follows:

63, 84, 91, 53, 69, 81, 61, 69, 78, 75, 81, 67, 76, 81, 79, 99.

The variance or standard deviation of this set of data can be calculated.

But even after computing these statistics, we still do not have a full picture of the score distribution of the whole class. The raw data are too messy to reveal any regularity, and numbers describing central tendency and dispersion alone rarely give an intuitive impression of the data. That is why we need charts to present these numbers.

1. Find the maximum and minimum values in the above data and determine the data range.

After sorting the scores, it is easy to find the maximum value of 95 and the minimum value of 53.

2. Organize the data and divide it into several groups by score. Scores are typically divided into 50-60, 60-70, 70-80, 80-90 and 90-100 (in general, 5 to 10 groups are used), and then the number of scores falling in each interval is counted. It can be seen that the 80-90 interval contains the most students. Note that when drawing a histogram, you must know whether the intervals are left-closed and right-open or left-open and right-closed, because this can directly affect the counts.
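The grouping step can be sketched in Python as follows. The function `bin_counts` and its `right_closed` parameter are hypothetical names, and the example uses only the scores actually listed above, so its counts need not match those of the full class.

```python
def bin_counts(values, edges, right_closed=False):
    """Count values per interval.
    right_closed=False: [lo, hi) intervals, last one also includes its upper edge.
    right_closed=True:  (lo, hi] intervals, first one also includes its lower edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
            if right_closed:
                inside = lo < v <= hi or (i == 0 and v == lo)
            else:
                inside = lo <= v < hi or (i == len(counts) - 1 and v == hi)
            if inside:
                counts[i] += 1
                break
    return counts

# Only the scores listed in the text above.
scores = [63, 84, 91, 53, 69, 81, 61, 69, 78, 75, 81, 67, 76, 81, 79, 99]
edges = [50, 60, 70, 80, 90, 100]

print(bin_counts(scores, edges))
# A score exactly on a boundary lands in different groups under the two conventions.
print(bin_counts([80], edges), bin_counts([80], edges, right_closed=True))
```

The second print shows why the convention matters: a hypothetical score of exactly 80 is counted in 80-90 with left-closed intervals and in 70-80 with right-closed intervals.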

The picture above is a count histogram: the count in each interval is the ordinate and the score is the abscissa. The histogram gives a very intuitive impression of the results. Besides the count histogram, there is another kind of histogram: the relative-frequency histogram. Compared with the count histogram, only the ordinate changes: it uses relative frequency / class width, where relative frequency = count / total and the class width is the extent of each interval, which is 10 here.
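As a small worked sketch of the relative-frequency ordinate, the Python snippet below converts interval counts into relative frequencies and then into bar heights. The counts are an assumption for illustration: they are the counts of the scores listed earlier in the intervals 50-60 through 90-100, not of the full class.

```python
# Assumed counts per interval (50-60, 60-70, 70-80, 80-90, 90-100),
# taken from the scores listed in the text.
counts = [1, 5, 4, 4, 2]
class_width = 10

total = sum(counts)
relative = [c / total for c in counts]        # relative frequency = count / total
heights = [r / class_width for r in relative]  # ordinate = relative frequency / class width

# Each bar's area (height * width) equals its relative frequency,
# so the total area of the histogram is 1.
area = sum(h * class_width for h in heights)

for c, r, h in zip(counts, relative, heights):
    print(f"count={c} relative={r:.4f} height={h:.5f}")
print(f"total area = {area:.2f}")
```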

In addition to the histogram, a simple box plot also gives a rough view of the distribution of the data.


To understand a box plot, you need to learn some of its technical terms. Lower quartile (Q1): the value at the 25% position when all data are sorted from small to large. Upper quartile (Q3): the value at the 75% position when all data are sorted from small to large. Interquartile range (IQR): equal to Q3 - Q1, a statistic that measures the dispersion of the data. Outlier: a value less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR (note the factor of 1.5 applied to the IQR). Upper edge: the maximum value in the data excluding outliers. Lower edge: the minimum value in the data excluding outliers.
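The terms above can be turned into a short Python sketch. The function name `box_plot_stats` is hypothetical, and the quartiles are computed with `statistics.quantiles` using its "inclusive" method. Other quartile conventions exist and can give slightly different values, so treat the exact numbers as one possible choice.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Quartiles, IQR, 1.5 * IQR fences, outliers and whisker ends."""
    data = sorted(values)
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
        "lower_edge": min(inside),  # smallest non-outlier
        "upper_edge": max(inside),  # largest non-outlier
    }

# Only the scores listed earlier in the text.
scores = [63, 84, 91, 53, 69, 81, 61, 69, 78, 75, 81, 67, 76, 81, 79, 99]
print(box_plot_stats(scores))
```

For these scores no value falls outside the fences, so the edges coincide with the minimum and maximum of the data.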


The stem-and-leaf plot displays the distribution of the data intuitively while retaining all of the original values. The stem is on the left and the leaves are on the right. If you rotate a stem-and-leaf plot by 90 degrees, you get a diagram similar to a histogram. Drawing a stem-and-leaf plot is also simple. Each value is split into two parts, stem and leaf, where the stem is the tens digit and the leaf is the ones digit. Write the stems (tens digits) from top to bottom in increasing order. Then, next to each stem, write the leaves (ones digits) that belong to it from left to right in increasing order.
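The drawing method described above can be sketched in Python. The function name `stem_and_leaf` is hypothetical, the sketch assumes non-negative integers with the tens digit as stem, and the example uses the scores listed earlier in the text.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = ones)."""
    if not values:
        return []
    leaves = defaultdict(list)
    for v in sorted(values):           # sorting puts leaves in increasing order
        leaves[v // 10].append(v % 10)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):  # stems from top to bottom
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines

scores = [63, 84, 91, 53, 69, 81, 61, 69, 78, 75, 81, 67, 76, 81, 79, 99]
for line in stem_and_leaf(scores):
    print(line)
```

Every original score can be read back from the plot, for example stem 6 with leaf 3 is 63, which is the property that distinguishes it from a histogram.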

However, the stem-and-leaf plot also has a drawback: when many values, for example 110 of them, are drawn on one plot, it becomes crowded and hard to read, and the rows of leaves can become too long.

Line chart: with time as the abscissa and the variable as the ordinate, it shows how the variable changes over time.

Bar chart: shows how data change over a period of time or compares different items.

Pie chart: the area of each slice is determined by the percentage of each item. It is simple and easy to understand, and it shows the proportion of each item clearly. Used properly, statistical charts make a presentation more vivid than a dry, purely numerical description.

Learning link:/video/bv1ut411r7rg.