The Internet has matured, and the Internet of Things is under construction.
Everyone produces data, but only a few people know how to work with it.
With data in hand, experts seem to see the future like prophets, while the rest of us cannot even find our bearings.
From precisely targeted advertising to predicting and influencing the US presidential election, why is data so powerful?
First, simple data values
1. Value of data
A. What is data
Everything that can be recorded electronically is data.
This is not limited to numbers; it also includes voice input, photos taken with digital cameras, videos recorded on mobile phones and other electronic records. This definition seems narrow, but it helps us better understand the changes in the data industry and develop a view of data that fits the times.
B. What is data used for
The value of data to any individual must be tied to the core needs of their own business. Only when the commercial value of data is clear will customers readily pay for it, data companies readily earn income, and the data industry become less chaotic. So, what is the value of data?
We can look at this question from three angles: income, expenditure and risk.
Income. The most typical example is Baidu's paid search advertising. Through in-depth analysis and accurate matching of user search data, it brings advertisers a large volume of traffic, and the revenue growth this creates is the value of data.
Expenditure. Using information collected through Internet of Things technology, TV manufacturers found that only 1% of the users of a certain TV were still using the old VGA video interface. So they decided to remove the interface, saving the companies hundreds of millions of yuan in costs every year. This is also value created by data analysis.
Risk. Many commercial banks have online application systems, and the risk is generally higher than with offline signing. Data analysis can help them distinguish more accurately between good and bad online applications. This is the indirect value that data brings to a company by reducing business risk.
2. What is data thinking?
To explain clearly the most important concept in this book, "data thinking", we have to introduce a statistical term: regression analysis, a statistical method for determining the quantitative relationship between two or more variables.
As the ancients said: the Dao governs technique, and technique drives the Dao. On the level of the Dao, regression analysis is a way of thinking. Under its guidance, we can redefine "business problems" as "data-analyzable problems". On the level of technique, regression analysis is a practical data analysis tool, which is introduced in the last chapter of this interpretation.
What kind of problem counts as a data-analyzable problem? You need to find two kinds of variables:
Dependent variable Y: a variable that changes because other variables change. It represents the core demand of the business.
Independent variable X: a related variable used to explain the dependent variable Y. Generally speaking, a change in X affects Y. X represents the data analyst's insight into the business.
Case study
Suppose a gentleman asks to borrow 10,000 yuan from you. You might start by analyzing his usual behavior, then weigh whether your relationship is strong enough, whether he has signed an IOU, his family situation and other factors to gauge how likely he is to repay. Here, the likelihood of repayment is the dependent variable Y; his behavior, your relationship, the IOU and his family circumstances are all independent variables X.
Data thinking means redefining "business problems" as "data-analyzable problems". The method is to pin down the core demand of the business (dependent variable Y) amid a tangle of business problems, find the related factors that affect it (independent variables X), and then use data analysis tools to study them further.
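The framing step in the loan case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LoanRecord` class, its field names and the three sample records are invented here and do not come from the book.

```python
from dataclasses import dataclass


@dataclass
class LoanRecord:
    # Independent variables X (hypothetical encodings of the factors in the case)
    behaves_reliably: bool
    close_relationship: bool
    signed_iou: bool
    stable_family: bool
    # Dependent variable Y: the core demand, i.e. whether the money came back
    repaid: bool


def split_xy(records):
    """Turn business records into an X table and a Y column for later analysis."""
    x_rows = [
        [int(r.behaves_reliably), int(r.close_relationship),
         int(r.signed_iou), int(r.stable_family)]
        for r in records
    ]
    y_column = [int(r.repaid) for r in records]
    return x_rows, y_column


# Invented example records, for illustration only
history = [
    LoanRecord(True, True, True, True, repaid=True),
    LoanRecord(False, True, False, True, repaid=False),
    LoanRecord(True, False, True, False, repaid=True),
]
x_rows, y_column = split_xy(history)
```

Once the problem is written down this way, any of the regression tools described later in this interpretation could in principle be applied to `x_rows` and `y_column`.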
In the next chapter, we will focus on one question: why is it so important to have data thinking?
Second, what is big data?
Without an understanding of data analysis, it is easy to mythologize big data and imagine that it works magic. In fact, big data is not so mysterious. It is closely linked to statistics, which many people have already encountered.
1. The relationship between big data and statistics
In this programme, Professor Wang Hansheng mentioned that there are at least two relationships between big data and statistics:
A. The core concern of statistics is the analysis and modeling of data. Characterizing business uncertainty through models is a major contribution of statistics to big data.
B. Big data cannot replace sampling. On the contrary, sampling matters even more when working with big data.
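One way to see why sampling still matters is to compare an estimate from a small random sample with the value computed from the full data set. The sketch below is an assumption-laden illustration: the "population" is just the integers from 0 to 999,999 standing in for a large data set, and `sample_mean` is a hypothetical helper, not something from the book.

```python
import random


def sample_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


# Stand-in for a large data set (invented for illustration)
population = range(1_000_000)
true_mean = sum(population) / len(population)

# Use only 0.1% of the records
estimate = sample_mean(population, 1_000)
relative_error = abs(estimate - true_mean) / true_mean
```

With a well-drawn sample, `relative_error` is typically small even though only a tiny fraction of the records is touched, which is the practical appeal of sampling on large data.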
2. How accurate is big data?
"It is normal to predict incorrectly, and it is abnormal to predict accurately." Professor Wang's words punctured many people's good expectations for prediction.
Why, then, are we so eager for accuracy? Because that is the essence of science. Statistical research turns up a great many correlations, and only a very small fraction of them are causal; yet the importance of causality remains irreplaceable.
Correlation: the uncertain interdependence between objective phenomena. Example: The cock crows and the sun rises.
Causality: the relationship between a first event (the cause) and a second event (the result), in which the second is considered a consequence of the first. Example: press the power button and the computer turns on.
We often confuse these two concepts. Sometimes event A and event B have nothing to do with each other but often happen together, so we blindly assume a causal relationship and make fools of ourselves.
Distinguishing correlation from causality is therefore not only the golden key to understanding big data, but also a key step in cultivating scientific literacy: say no to pseudoscience!
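The rooster example can be made concrete with a correlation coefficient. The sketch below computes the Pearson correlation by hand; the crowing and sunrise times are invented numbers (minutes after midnight on five days across the seasons), chosen only to show that a very high correlation can appear without one event causing the other.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented observations, for illustration only
rooster_crow = [300, 320, 345, 370, 400]
sunrise = [310, 332, 350, 381, 405]

r = pearson(rooster_crow, sunrise)  # close to 1: strongly correlated
```

A value of `r` near 1 says the two series move together. It says nothing about whether silencing the rooster would stop the sun from rising.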
Third, everyone should have data thinking.
Data thinking is an essential skill. We live in the information age, and almost everything we do is related to data to some extent. Without data thinking, we will pay the "IQ tax" as readily as people who trade stocks without any knowledge of economics!
1. Improve communication efficiency
At work, we often run into this situation: data experts speak in technical language, while the requesting departments talk about business problems (some analyzable with data, some not), and communication between the two sides rarely goes smoothly.
To solve this problem, professionals need to escape the curse of their own knowledge, and requesting departments need to overcome their fear of data. Data thinking has to be cultivated from top to bottom within the company. Decision-makers should recognize what is data-related, and requesting departments should be able to state their core requirements clearly.
Teacher Fan put it vividly: having data thinking means being able to "open your mouth and order the Sichuan pork".
This greatly improves communication efficiency and maximizes the value of data analysis!
2. Seize business opportunities
Data thinking can also help entrepreneurs, especially in ventures closely tied to data. It can help them seize business opportunities, in three steps:
A. Where do I start my business? Can data help me?
B. If data is important, sort out the dependent variable Y and the independent variables X in the business.
C. At the strategic level, ensure a high-quality supply and long-term accumulation of Y and X.
3. Data thinking in life
If you are not an entrepreneur and the business problems you deal with have nothing to do with data analysis, what is the use of cultivating data thinking? In fact, data thinking can inform most of the small things in life; the key is how you use it.
First, cultivating data thinking helps you develop a clear habit of thought: what is the purpose of the analysis? What is the core demand? What is the dependent variable Y?
Second, once the purpose is clear, you can focus on the related independent variables X, and you will not fall into the confusion that "everything is the key".
Finally, you can try the simplest analysis. Even without professional modeling, you can at least tell which relationships are merely correlations and which are causal.
Fourth, various data analysis methods.
Has reading this far made you interested in data analysis? The book also introduces several commonly used data analysis tools. If you are interested, you can study them and try using them to solve data analysis problems.
1. Regression analysis
On the technical level, regression analysis is a family of statistical models. There are five main types: linear regression, 0-1 regression, ordered regression, counting regression and survival regression.
Linear regression, more strictly speaking, is ordinary linear regression. Its main feature is that the dependent variable Y must be continuous data, while there is no strict requirement on the explanatory variable X. In the data world, linear regression can be applied to stock investment, customer lifetime value, medical care and other fields.
0-1 regression is a regression model whose dependent variable Y is 0-1 data (only two possible values). For example, gender is either "male" or "female"; a purchase decision is either "buy" or "don't buy"; a cancer diagnosis is either "cancer" or "non-cancer". 0-1 regression can be applied to Internet credit scoring, personalized recommendation, social friend recommendation and so on.
Ordered regression is a regression model whose dependent variable Y is ordered data (data that carries an order). For example, suppose you are asked to rate the author's appearance in this issue according to your preference: 1 means you like it very much, 2 means you like it a little, 3 means you feel neutral, 4 means you dislike it a little, and 5 means you dislike it very much. This is ordered data. Common applications of ordered regression include movie ratings (1 to 5 stars) and e-commerce product satisfaction scores (1 to 5 stars).
Counting regression. If the dependent variable Y is count data (non-negative integers), the corresponding model is counting regression. It is commonly used for the RFM model in customer relationship management, for example the number of customer visits in a given period, and in research on the two-child policy, for the number of children a couple chooses to have.
Survival regression is short for survival data regression, that is, a regression model whose dependent variable Y is survival data (describing how long a phenomenon or individual lasts), such as human life span, the service life of electronic products or the lifetime of start-up companies.
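Of the five model types above, the first two are the easiest to sketch. The Python below fits an ordinary linear regression with one explanatory variable by least squares, and a 0-1 regression using a logistic model fitted by plain gradient ascent. Both are minimal illustrations rather than the book's own method: treating 0-1 regression as a logistic model is an assumption made here, and the function names and data sets are invented.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable: y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Logistic model for a 0-1 dependent variable, fitted by gradient ascent
    on the average log-likelihood (a simple, not an efficient, fitting method)."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_intercept = 0.0
        grad_slope = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(intercept + slope * x)
            grad_intercept += error
            grad_slope += error * x
        intercept += learning_rate * grad_intercept / n
        slope += learning_rate * grad_slope / n
    return intercept, slope


# Continuous Y (invented): sales as a function of advertising spend
line_intercept, line_slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# 0-1 Y (invented): bought (1) or not (0) as a function of number of visits
visits = [0, 1, 2, 3, 4, 5]
bought = [0, 0, 1, 0, 1, 1]
logit_intercept, logit_slope = fit_logistic(visits, bought)
p_no_visits = sigmoid(logit_intercept + logit_slope * 0)
p_five_visits = sigmoid(logit_intercept + logit_slope * 5)
```

The only thing that changes between the two fits is the type of Y: continuous values in the first case, 0-1 values in the second. That choice of model by the type of the dependent variable is the organizing idea behind the five regression types.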
2. Data visualization
The most basic data visualization method is the statistical chart, and a good statistical chart should meet four standards: accuracy, effectiveness, conciseness and beauty. Common statistical charts include the bar chart, stacked bar chart, pie chart, histogram, line chart, scatter plot, box plot and stem-and-leaf plot.
3. Machine learning
Machine learning covers a large class of powerful data modeling and analysis methods, and it is required study for readers who are determined to become data scientists. It mainly involves naive Bayes, decision trees (including random forests), neural networks (including deep learning), K-means clustering and other methods.
4. Unstructured data
Whether data is structured or unstructured is a relatively subjective notion. Still, there is some consensus: the generally recognized kinds of unstructured data include Chinese text, data structures, images and so on.
Case study
That text data is unstructured does not mean we cannot analyze it. Take "The Heaven Sword and Dragon Saber" as an example. Whom does Zhang Wuji love most: Zhao Min, Zhou Zhiruo, Yin Li or Xiao Zhao? The book uses data analysis to find the answer!
The first step is to extract the main characters and the names they go by. Next, choose the unit of analysis, which here is the natural paragraph. So whom does Zhang Wuji really love, and how can that be turned into a data-analyzable problem? The book analyzes the characters from several angles, such as frequency, time and intimacy. Only the most important one, the intimacy analysis, is described here: it counts the number of natural paragraphs in which each character appears together with Zhang Wuji.
As the saying goes, time reveals the truth. By this measure, Zhang Wuji and Zhao Min have the most opportunities to be close to each other, so Zhao Min is the one he is most likely to fall in love with.
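The intimacy count described above can be sketched as a paragraph-level co-occurrence count. This is a minimal illustration, not the book's actual code: the function name, the alias lists and the four sample paragraphs are invented, and simple substring matching stands in for whatever name extraction the book uses.

```python
def cooccurrence_counts(paragraphs, hero_aliases, candidates):
    """Count, for each candidate, the paragraphs that mention both the hero
    and that candidate (under any of their aliases)."""
    counts = {name: 0 for name in candidates}
    for paragraph in paragraphs:
        # Skip paragraphs in which the hero does not appear at all
        if not any(alias in paragraph for alias in hero_aliases):
            continue
        for name, aliases in candidates.items():
            if any(alias in paragraph for alias in aliases):
                counts[name] += 1
    return counts


# Invented sample text, for illustration only
paragraphs = [
    "Zhang Wuji met Zhao Min at the inn.",
    "Zhou Zhiruo practiced alone.",
    "Wuji and Miss Zhao argued; Xiao Zhao listened.",
    "Zhang Wuji thought of Zhou Zhiruo.",
]
hero_aliases = ["Zhang Wuji", "Wuji"]
candidates = {
    "Zhao Min": ["Zhao Min", "Miss Zhao"],
    "Zhou Zhiruo": ["Zhou Zhiruo"],
    "Xiao Zhao": ["Xiao Zhao"],
}

counts = cooccurrence_counts(paragraphs, hero_aliases, candidates)
```

Here the dependent variable Y is "intimacy", made measurable as a co-occurrence count, which is exactly the move from a vague question to a data-analyzable one.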
Note: Details of this case are available from the WeChat official account Bear Club (ID: Club).
label
This is a book that raises your level of understanding. It does not offer much methodology, and it will not change your life overnight. Even listening to it as an audiobook takes some effort. But if you occasionally step out of your comfort zone and try to understand the scientific questions you never dared to touch before, you may be pleasantly surprised to find yourself saying "Oh! I see!" Isn't that an improvement in itself?
About the author
Wang Hansheng
Professor, doctoral supervisor and head of the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University; director of the Peking University Business Intelligence Research Center; founder of the WeChat official account "Bear Club". Fellow of the American Statistical Association (2014), winner of the National Outstanding Youth Fund (2016), editor-in-chief of JASA, JBES and China Science: Mathematics.
Essential interpretation
The following is an interpretation of the essentials of "Data Thinking", offered for book lovers' reference. You are welcome to share it, but it may not be used for commercial purposes without permission.
Contents
First, simple data values
Second, what is big data?
Third, everyone should have data thinking.
Fourth, various data analysis methods.
Introduction
If a car has a driver with no sense of direction, it will never reach its destination, no matter how powerful the engine. The same is true of big data. Without the data thinking that turns business problems into data-analyzable problems, big data will create no business value, however much it is mythologized.
Big data is very popular, but few people really know how to do it. Professor Wang Hansheng is one of them. Amid the noise of new media, Professor Wang has found a new way to help us develop data thinking in our work and life, with a sincere and down-to-earth academic temperament.