I recently caught up with an old friend. She had always been interested in data science, but only entered the field 10 months ago, when she joined an organization as a data scientist. She has clearly learned a lot in her new role. During our chat, however, she mentioned a problem that still lingers in my mind: no matter how well she performed, every project or analysis had to be redone several times before her manager was satisfied. She also said she often realized afterwards that it need not have taken so long!
Does this sound familiar? Do you rework an analysis many times before arriving at a decent answer? Or write code for similar tasks over and over again? If so, this article is for you. Let me share some ways to improve efficiency and reduce unnecessary duplication of work.
Note: please don't get me wrong. I'm not saying iteration is bad. This article focuses on how to tell which iterations are necessary and which are unnecessary and should be avoided.
What causes duplicated work in data analysis? In my view, an analysis should not be repeated unless new information has been added (with one exception, mentioned later). The following kinds of repetition can be avoided:
The customer's problem was misdiagnosed, so the analysis failed to meet the need and had to be redone. The analysis was repeated in order to collect additional variables you didn't realize you needed. Biases or assumptions affecting the analysis were overlooked at first and only considered later, forcing a redo. Which iterations are necessary? Here are two examples. First, you built a model six months ago and new information has since arrived; the resulting iteration is healthy. Second, you deliberately start with a simple model and gradually work toward a more complex one as your understanding deepens.
These examples do not cover every possible situation, but I believe they are enough to help you judge whether an iteration of your analysis is healthy.
What is the impact of these productivity killers? One thing is certain: no one wants unhealthy iterations and productivity killers in their analysis. No data scientist enjoys repeating the entire analysis just to add a variable.
Unhealthy iterations and inefficiency leave analysts and data scientists deeply frustrated and lacking a sense of accomplishment. So let's do everything we can to avoid them.
How to avoid unhealthy iterations and improve efficiency. Tip 1: Focus only on the major problems.
Every organization has many small problems that could be solved with data! But the main reason to hire a data scientist is not to solve these small problems. To make the best use of scarce talent, select the three or four data problems with the greatest impact on the whole organization and hand those to your data scientists. These problems are generally challenging and give your analysis the greatest leverage (think of the all-or-nothing stakes in lending or stock trading). You shouldn't be solving small problems while bigger ones remain unsolved.
This may not sound like much, but many organizations do it poorly! I have seen banks run marketing campaigns instead of using data analysis to improve their risk scores, and insurance companies try to build incentive plans for agencies instead of using data analysis to improve customer retention.
Tip 2: Create the presentation of your analysis (a possible layout and structure) from the very beginning.
I have been doing this for a long time and have benefited greatly. Building the skeleton of the analysis report should be the first thing you do after a project starts. It may sound counterintuitive, but once it becomes a habit, it saves time.
How do you build the framework? You can use slides, a document, or even a plain paragraph; the form doesn't matter. What matters is listing all the possible branches from the start. For example, if you are trying to reduce the bad-debt charge-off rate, you might lay the analysis out as follows:
Next, consider how each factor affects the charge-off rate. Suppose, for example, that the bank's charge-off rate rose after customer credit lines were increased. You could:
First, verify that the charge-off rate has not also risen among customers whose credit lines were not increased.
Then, quantify the size of this effect with a simple calculation.
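The first check can be sketched in a few lines of pandas; the data and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical account-level data: whether the customer's credit line was
# increased, and whether the account was eventually charged off.
customers = pd.DataFrame({
    "credit_line_increased": [True, True, True, False, False, False],
    "charged_off":           [1,    1,    0,    0,    1,    0],
})

# Charge-off rate in each segment: did the rise really come from the
# customers whose lines were increased?
rates = customers.groupby("credit_line_increased")["charged_off"].mean()

# The gap between the two segments is one simple measure of the effect.
effect = rates.loc[True] - rates.loc[False]
```

If the rate also rose in the untouched segment, the credit-line story alone does not explain the increase, and that branch of the framework needs more work.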
Once you have considered every branch of analysis, you have created a good starting point for yourself.
Tip 3: Define data requirements in advance.
Data requirements follow directly from the final deliverable. If you have planned thoroughly what analysis to do and what results to produce, you already know what the data requirements are. A few pointers:
Structure the data requirements: instead of just writing down a list of variables, think explicitly about which tables the analysis needs. For example, to reduce the charge-off rate you will need customer demographics, statistics from past marketing campaigns, the last 12 months of customer transactions, documents on credit-policy changes, and so on.
Collect all the data you might need: even if you are not 100% sure you will use every variable, collect it all at this stage. It takes more work up front, but it is far more efficient than going back for extra variables later.
Define the time interval of the data you are interested in.
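One way to make such requirements concrete is to record them as a small structured spec rather than a flat variable list; the table names, grains, and time windows below are illustrative only:

```python
# A hypothetical data-requirements spec: each table states its grain and the
# history window it must cover, so gaps are visible before extraction starts.
data_requirements = {
    "customer_demographics":    {"grain": "customer",    "window_months": None},
    "marketing_campaign_stats": {"grain": "campaign",    "window_months": 24},
    "customer_transactions":    {"grain": "transaction", "window_months": 12},
    "credit_policy_changes":    {"grain": "document",    "window_months": 24},
}

# Which tables need historical extracts (and how far back)?
tables_needing_history = [t for t, spec in data_requirements.items()
                          if spec["window_months"] is not None]
```

Writing the spec down once, before extraction, is what prevents the "go back and pull one more variable" iteration the article warns against.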
Tip 4: Make sure your analysis is reproducible.
This tip may sound simple, but both beginners and experienced analysts struggle with it. Beginners do every step in Excel, including copying and pasting data. Advanced users may do work at the command line that cannot be reproduced afterwards.
Similarly, you need to be extra careful with notebooks. Restrain yourself from modifying earlier cells, especially when their output has already been consumed by later cells. Notebooks are powerful precisely because they maintain this data flow, in which later steps depend on earlier ones; if that flow is broken, the notebook loses its value.
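One simple way to keep an analysis reproducible, sketched here with hypothetical file and column names, is to wrap each step in a function and drive the whole pipeline from a single entry point:

```python
# A sketch of a reproducible analysis: every step is a function, and one
# entry point reruns the whole pipeline from the raw file. The file name,
# column names, and cleaning steps are all hypothetical.
import pandas as pd

def load_data(path):
    return pd.read_csv(path)

def clean(df):
    # Drop rows with a missing transaction amount.
    return df.dropna(subset=["amount"])

def summarize(df):
    # Total spend per customer.
    return df.groupby("customer_id")["amount"].sum()

def run_pipeline(path):
    return summarize(clean(load_data(path)))
```

Rerunning `run_pipeline("transactions.csv")` reproduces the result end to end; there is no manual copy-and-paste step to forget.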
Tip 5: Build a standard code base.
There is no need to rewrite code for simple, common operations every time. Doing so not only wastes time but also invites syntax errors. A better trick is to create a standard code library for common operations and share it with the whole team.
This not only ensures that the whole team uses the same code, but also makes them more efficient.
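As a sketch of what such a shared library might contain (the helper and its name are hypothetical), one well-tested implementation of a routine check replaces many ad-hoc copies:

```python
import pandas as pd

def missing_value_report(df):
    """Return the fraction of missing values per column, sorted descending."""
    return df.isna().mean().sort_values(ascending=False)

# Every analyst calls the same helper instead of rewriting it each time.
report = missing_value_report(pd.DataFrame({"a": [1, None], "b": [1, 2]}))
```

Because the whole team imports the same function, a bug fix or improvement made once benefits every analysis that uses it.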
Tip 6: Build an intermediate data mart.
Often you will need the same information again and again. For example, many analyses and reports use all customers' credit-card spending records. Although this could be extracted from the raw transaction tables every time, creating an intermediate data mart containing these tables saves real time and effort. Likewise, there is no need to re-query and re-aggregate the marketing-campaign summary table every time.
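As an illustration, assuming a raw transaction table with hypothetical columns, the mart is built once and then reused by every downstream analysis:

```python
import pandas as pd

# Hypothetical raw transaction table.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "month":       ["2024-01", "2024-02", "2024-01", "2024-01", "2024-02"],
    "amount":      [100.0, 50.0, 20.0, 30.0, 10.0],
})

# Build the intermediate mart once: monthly spend per customer.
spend_mart = (transactions
              .groupby(["customer_id", "month"], as_index=False)["amount"]
              .sum())

# Downstream reports read the small mart instead of rescanning raw
# transactions, e.g. by persisting it with spend_mart.to_parquet(...).
```

The design choice is simply to pay the aggregation cost once, at a granularity most analyses share, rather than once per analysis.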
Tip 7: Use a holdout sample and cross-validation to prevent overfitting.
Many beginners underestimate the power of a holdout set and cross-validation. Many assume that with a large enough training set, overfitting is hardly possible, so there is no need for cross-validation or a holdout sample.
Thinking this way tends to end badly. You don't have to take my word for it: compare the public and private leaderboards of any Kaggle competition. You will find people in the public top ten whose models overfit, and whose rankings drop sharply on the private leaderboard. You can imagine that some of these are experienced data scientists.
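A minimal sketch of the practice using scikit-learn (the dataset here is synthetic): cross-validate on the training portion, and touch the holdout set only once at the end:

```python
# Holdout set plus cross-validation with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# Keep a final holdout set the model never sees during development.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training portion estimates generalization.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# The holdout score is checked only once, at the very end.
holdout_score = model.fit(X_train, y_train).score(X_hold, y_hold)
```

If the cross-validation scores are high but the holdout score collapses, the model has overfit the development data, which is exactly the public-versus-private leaderboard gap described above.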
Tip 8: Work in focused blocks and take regular breaks.
For me, the best working state is concentrating on a single problem or project for 2-3 hours. As a data scientist, it is hard to multitask well; you need to give your full attention to one problem. A 2-3 hour window is the most productive for me; set yours according to your own situation.
Postscript: the above are some of the ways I improve my working efficiency. I don't insist on getting things right the first time, but you must build the habit of getting them right every time; that is how you become a professional data scientist.
Do you have any good ways to improve work efficiency? Please leave a message in the comments below.
Original title: 8 Productivity Hacks for Data Scientists & Business Analysts
Translator's notes:
1. "Catch up with someone" can also mean to get back in touch with someone, i.e., to become current with what has been going on in that person's life while you were not in contact. So the opening sentence means "It's good to be in touch with you again," especially after not seeing each other for a while.
2. "Productivity killer": a factor that reduces productivity or hinders its improvement.
3. The charge-off rate is an important indicator in the credit-card industry, mainly used to measure the credit quality of assets: the amount charged off in a month divided by total credit-card receivables at the beginning of that month, expressed as an annualized proportion.
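Under that definition (my reading of the note, not an authoritative formula), the arithmetic is a one-liner; the figures below are made up:

```python
# Hypothetical annualized charge-off rate: monthly charge-offs divided by
# beginning-of-month receivables, scaled to a yearly rate.
charge_offs = 2_000_000.0        # amount written off this month (made up)
receivables = 400_000_000.0      # receivables at start of month (made up)
annualized_rate = charge_offs / receivables * 12
print(f"{annualized_rate:.1%}")  # → 6.0%
```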
4. "Brand strategy changes" in the illustration: changes in brand strategy may lead to an increase in the charge-off rate, for example when a competitive-brand or marginal-brand strategy is adopted.
5. Brand strategies:
Image brand: in brand competition, an image brand wins public trust and builds word of mouth, playing an important role in accumulating and promoting brand equity, and it can lift the company's other brands. For example, Nestlé's parent brand "Nestlé" is an image brand that effectively promotes many of its sub-brands; a company's brand-management strategy can hardly do without one.
Competitive brand: usually launched against similar products in the market; through a distinctive positioning in technology, price, or service, it breaks through competitors' defenses or opens new target markets. Its main purpose is clearly to win market share and create competitive advantage. It may not bring much profit now, but it has great development potential and is the key to the company's future brand competition.
Profit brand: the center of a company's multi-brand management. Creating profit is an important feature of modern brand management; a profit brand generally embodies the company's unique technology (its core competence), which competitors find hard to match in the short term, creating large or even excess profits. Of course, without upgrading and improvement such a brand may enter decline.
Marginal brand: a necessary supplement to a multi-brand strategy. A marginal brand is neither an image brand nor a competitive brand, so on the surface it is hard for it to create profit, but because it has an established customer base it does not need as much investment as other brands. Even if its sales stagnate or slowly decline, a group of loyal consumers will not abandon it. Its role is to generate surplus resources, support the company's competitive, image, and profit brands, and help offset fixed operating expenses.
6. The "acquisition" in the illustration (as in "acquisition drive"): acquisition can mean (1) a merger or acquisition; (2) the acquisition of books and materials (by purchase, exchange, etc.), or the books (or periodicals) so obtained; (3) the acquisition of knowledge, skills, etc. "Data acquisition" refers to collecting data.
7. The "Spent simulation" in the illustration: the translator rendered it as "cost simulation" from context. A search suggests Spent is an interactive online game launched by a non-profit organization to raise awareness of homelessness and poverty: players try to get through a month on $1,000, facing choices such as "cover the minimum on your credit cards, or pay the rent?". First launched in February 2011, it has been played more than 4 million times by some 2 million people in 218 countries. If customers face such trade-offs in real life, their credit-card payments may become overdue. Reference links: http://umdurham.org/ and https://en.wikipedia.org/wiki/Spent_(online_game)
8. "Data requirements": related to market demand and product demand, with product requirements tied most closely to data requirements, because data requirements develop with the product's business logic. To collect data on a product, first understand its business logic, such as the interactions between features and the logic of each single feature; next, break the business logic into nodes, identify the important ones, and prioritize them; then instrument those nodes, adding tracking events and parameters to the important nodes that need statistics; finally, write up the data-requirements document.
9. "More often than not": usually; in most cases.
After reading and translating this article, I feel data analysts can learn from two directions. First, from the traditional management-consulting industry: the abilities a data analyst needs are the consulting industry's problem-solving skills plus data-processing skills. For example, the second tip in this article resembles an important consulting method, structured thinking; see Barbara Minto's The Pyramid Principle: Logic in Writing, Thinking and Problem Solving (Chinese translation: The Pyramid Principle: the logic of thinking, expressing, and solving problems), a classic McKinsey training text that introduces many practical methods for thinking and expressing with clear logic and clear focus. Second, from traditional data resource planning: the third tip, on determining data requirements, can draw on the systematic method of deriving data requirements from business requirements and modeling both business and data; see Professor Gao Fuxian's Information Resource Planning: The Basic Engineering of Informatization for details.
The article closes with remarks on work rhythm, which varies from person to person. I think the following points deserve attention:
First, evaluate overall efficiency. Being extremely efficient once or twice a week may still yield lower overall output than keeping a steady rhythm all week. You can try the Pomodoro technique as a time-management tool to quantify your own situation.
Second, adjust your living habits. Data analysis requires plenty of energy, and many factors affect it; overeating, for example, can have a negative impact.
Third, pay attention to breathing. If we are physically and mentally at ease and breathe naturally while working efficiently, that state is sustainable; if you often hold your breath while concentrating, that style is more depleting. Meditation and mindfulness training may help.
Work is like running a marathon. Some people aim not to run fast but to run long, hoping to keep running until age 60; they need to control their heart rate rather than speed up. Others want to improve performance as quickly as possible and sprint through a few important events, willingly paying the price of extra wear and tear. Data analysis is the same: set your goal, then run accordingly.
The above are the eight tips for improving data-analysis efficiency shared here. For more content, you can follow Global Ivy.