How to become a data analyst? What skills are needed?
Before learning data analysis, you must know what you want to achieve. In other words, what problems do you want to solve, or what plans do you want to carry out, with this skill? With that goal in mind, you can draw up a clear study plan and define the knowledge you need. Only by being clear about the goal and learning the most useful parts first can you avoid irrelevant information that lowers your learning efficiency.

1. Define the knowledge framework and learning path.

If you want to be a data analyst, go to a recruitment website and see what the corresponding positions require. This generally gives you a preliminary understanding of the knowledge structure you should master. For data analyst positions, employers' skill requirements can be summarized as follows:

Basic SQL database operations and basic data management;

Able to use Excel/SQL to extract, analyze and present basic data;

Able to use a scripting language, Python or R, for data analysis;

A plus: able to obtain external data, for example by writing web crawlers or through familiarity with public data sets;

Basic data visualization skills and the ability to write data reports;

Familiar with commonly used data mining algorithms: regression analysis, decision trees, classification and clustering methods.

What is an efficient way to learn? Follow the process of data analysis itself. A data analyst's learning path can generally follow these steps: data collection, data storage and extraction, data preprocessing, data modeling and analysis, data visualization. Going step by step in this order, you will know what each part has to accomplish, what you need to learn and what you can leave for later. Each part you study then produces some real output, which brings positive feedback and a sense of accomplishment and makes you willing to invest more time. When you learn with the goal of solving problems, efficiency is naturally not low.

Following this process, we distinguish analysts who need to obtain external data from those who do not. The learning paths are summarized as follows:

1. Analysts who need to obtain external data:

Python basics

Python web crawling

SQL language

Python scientific computing packages: pandas, numpy, scipy, scikit-learn.

Basic statistics

Regression analysis method

Basic algorithms of data mining: classification and clustering

Model optimization: feature extraction

Data visualization: seaborn, matplotlib

2. Analysts who do not need to obtain external data:

SQL language

Python basics

Python scientific computing packages: pandas, numpy, scipy, scikit-learn.

Basic statistics

Regression analysis method

Basic algorithms of data mining: classification and clustering

Model optimization: feature extraction

Data visualization: seaborn, matplotlib

Next, let's look at what to learn in each part and how to learn it.

Data collection: open data, Python crawler

If you only work with data in the company database and do not need external data, you can skip this part.

There are two main ways to obtain external data.

The first is to obtain external public data sets. Some scientific research institutions, enterprises and governments will open some data, and you need to go to a specific website to download these data. These data sets are usually relatively complete and of relatively high quality.

The second way is to use a web crawler.

For example, with a crawler you can collect the job postings for a position on a recruitment website, the rental listings for a city on a rental website, the list of highest-rated movies on Douban, the ranking of most-liked answers on Zhihu and the ranking of comments on NetEase Cloud Music. Based on data captured from the web, you can analyze a particular industry or a particular group of people.

Before crawling, you need some basic Python: data types (list, dictionary, tuple, etc.), variables, loops and functions (a good beginner tutorial covers all of these), and you need to know how to build a web crawler with mature Python libraries (urllib, BeautifulSoup, requests, scrapy). If you are a beginner, start with urllib and BeautifulSoup. (PS: the later data analysis steps also need Python, so you can return to the same tutorial when you run into problems.)

There is no shortage of crawler tutorials online. Douban pages are a good first target: the page structure is relatively simple, and Douban is relatively friendly to crawlers.
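To make the parsing step concrete, here is a minimal sketch of pulling titles out of a page. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without installing anything; the HTML fragment, the class name "title" and the helper names are made up for illustration and do not reflect any real site's markup.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a title span.
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    """Return the title strings found in an HTML document."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


# Hypothetical fragment standing in for a downloaded page. In a real crawler
# the text would come from something like
# urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """
<ol>
  <li><span class="title">Movie A</span><span class="rating">9.1</span></li>
  <li><span class="title">Movie B</span><span class="rating">8.7</span></li>
</ol>
"""

print(extract_titles(SAMPLE_HTML))
```

With BeautifulSoup the same extraction is usually a one-line selector query; the point here is only the shape of the task: download the page, locate the elements you care about, collect their text.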

After mastering basic crawling, you need some advanced skills to deal with the anti-crawler restrictions of different websites: regular expressions, simulating user login, using proxies, limiting crawl frequency, using cookie information and so on.

In addition, commonly used e-commerce, Q&A, review, second-hand trading, matchmaking and recruitment websites are all good practice targets. They yield data that is well suited to analysis and, most importantly, there is plenty of mature code to refer to.

Data access: SQL language

You may wonder why Excel is not covered. For data sets of up to about ten thousand records, Excel is generally fine for analysis. Once the data volume grows, it falls short, and a database handles this well. Moreover, most companies store their data in SQL databases, so as an analyst you also need to know how to work with SQL and be able to query and extract data.

As the most classic database tool, SQL makes it possible to store and manage massive data and greatly improves the efficiency of data extraction. You need to master the following skills:

Extracting data under given conditions: the data in a company database is bound to be large and complicated, so you need to extract the parts you need. For example, you might extract all sales data for 2018, the top 50 products sold this year, or the consumption data of users in Shanghai and Guangdong. SQL lets you complete these tasks with simple commands.

Inserting, deleting, querying and updating: these are the most basic database operations, and each can be done with a simple command, so you only need to remember the commands.

Grouping and aggregating data, and relating multiple tables: this is the more advanced part of SQL. Relating tables is very useful when you work with multidimensional data or several data sets, and it lets you handle more complex data.
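The three skills above can be tried out without installing a database server. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table names, columns and rows are invented for illustration, and a company database would of course have its own schema and SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema and rows, for illustration only.
cur.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    product TEXT,
    amount REAL,
    order_date TEXT
);
INSERT INTO users VALUES (1, 'Shanghai'), (2, 'Guangdong'), (3, 'Beijing');
INSERT INTO orders VALUES
    (1, 1, 'keyboard', 120.0, '2018-03-01'),
    (2, 2, 'monitor', 900.0, '2018-07-15'),
    (3, 3, 'mouse', 40.0, '2018-09-09'),
    (4, 1, 'monitor', 950.0, '2019-01-20');
""")

# Extract under a condition: all sales made in 2018.
sales_2018 = cur.execute(
    "SELECT order_id, amount FROM orders "
    "WHERE order_date BETWEEN '2018-01-01' AND '2018-12-31' "
    "ORDER BY order_id"
).fetchall()

# Group and aggregate: the top 2 products by total revenue.
top_products = cur.execute(
    "SELECT product, SUM(amount) AS revenue FROM orders "
    "GROUP BY product ORDER BY revenue DESC LIMIT 2"
).fetchall()

# Relate two tables: total spending per city, Shanghai and Guangdong only.
city_totals = cur.execute(
    "SELECT u.city, SUM(o.amount) FROM orders AS o "
    "JOIN users AS u ON u.user_id = o.user_id "
    "WHERE u.city IN ('Shanghai', 'Guangdong') "
    "GROUP BY u.city ORDER BY u.city"
).fetchall()

print(sales_2018)
print(top_products)
print(city_totals)
```

The WHERE clause covers conditional extraction, GROUP BY with SUM covers aggregation, and JOIN covers relating tables; INSERT, UPDATE and DELETE follow the same pattern of one short statement per operation.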

Data preprocessing: Python (pandas)

The data we get is often dirty: duplicated records, missing values, outliers and so on. It then needs to be cleaned, and the records that would distort the analysis dealt with, in order to get more accurate results.

Take air quality data: many days have no readings because of equipment problems, some readings are recorded twice, and some are invalid because the equipment malfunctioned. Or take user behavior data: many invalid operations are meaningless for analysis and need to be deleted.

We then need to choose a suitable treatment. For incomplete data, for example, do we drop the record or fill it in from adjacent values? These are all questions to consider.

For data preprocessing, learn how to use pandas and you can handle general data cleaning. The knowledge points to master are as follows:
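As a small illustration of that choice, the sketch below treats the same gaps in three ways with pandas. The readings are invented, and which option is appropriate depends on the data and the analysis; none of them is the single correct answer.

```python
import pandas as pd

# Hypothetical daily air-quality readings; None marks days with no measurement.
readings = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=5, freq="D"),
        "pm25": [35.0, None, 42.0, None, 50.0],
    }
)

# Option 1: drop the rows that have no measurement.
dropped = readings.dropna(subset=["pm25"])

# Option 2: carry the previous day's value forward.
forward_filled = readings.assign(pm25=readings["pm25"].ffill())

# Option 3: interpolate linearly between the neighbouring days.
interpolated = readings.assign(pm25=readings["pm25"].interpolate())

print(dropped)
print(forward_filled)
print(interpolated)
```

Dropping keeps only observed values but shortens the series; filling keeps every day but introduces values that were never measured, so it is worth recording which rows were filled.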

Selection: data access (by label, by specific value, Boolean indexing, etc.)

Missing value processing: delete or fill rows with missing data.

Duplicate value processing: detecting and deleting duplicate values.

Handling of whitespace and outliers: remove unnecessary spaces and extreme abnormal values.

Related operations: descriptive statistics, apply, histograms, etc.

Merge: merge operations that match various logical relationships.

Grouping: splitting the data, applying a function to each group independently and recombining the results.

Reshaping: quickly generating pivot tables.
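Several of the operations in this list fit into one short pandas sketch. The order records, the second table of regions and the outlier threshold of 10,000 are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical order records with typical defects: a duplicated row,
# stray whitespace in a text column, and one implausible amount.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3, 4],
        "city": [" Shanghai", "Guangdong ", "Guangdong ", "Shanghai", "Beijing"],
        "category": ["book", "book", "book", "toy", "toy"],
        "amount": [20.0, 35.0, 35.0, 15.0, 99999.0],
    }
)

clean = orders.drop_duplicates().copy()    # duplicate value processing
clean["city"] = clean["city"].str.strip()  # remove unnecessary spaces
clean = clean[clean["amount"] < 10_000]    # Boolean indexing; threshold is an assumption

# Merge: attach a region label from a second (hypothetical) table.
regions = pd.DataFrame(
    {"city": ["Shanghai", "Guangdong"], "region": ["East", "South"]}
)
merged = clean.merge(regions, on="city", how="left")

# Grouping: split by city, sum each group, combine the results.
per_city = merged.groupby("city")["amount"].sum()

# Reshaping: pivot table of total amount by city and category.
pivot = merged.pivot_table(
    index="city", columns="category", values="amount", aggfunc="sum", fill_value=0
)

print(merged)
print(per_city)
print(pivot)
```

Calling clean.describe() on the result gives the descriptive statistics mentioned in the list, and clean["amount"].hist() draws a histogram when matplotlib is installed.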

Probability theory and statistical knowledge

What is the overall distribution of data? What are population and sample? How to apply basic statistics such as median, mode, mean and variance? If there is a time dimension, how does it change with time? How to do hypothesis testing in different scenarios? Data analysis methods mostly come from the concept of statistics, so statistical knowledge is also essential. The knowledge points to be mastered are as follows:

Basic statistics: mean, median, mode, percentile, extreme value, etc.

Other descriptive statistics: skewness, variance, standard deviation, significance, etc.

Other statistical knowledge: population and sample, parameters and statistics, error bars.

Probability distribution and hypothesis testing: various distribution and hypothesis testing processes

Other knowledge of probability theory: conditional probability, Bayes, etc.
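Most of the basic statistics listed above are available in Python's standard statistics module. The sketch below computes them for an invented sample and then works one conditional-probability example with Bayes' rule; all the numbers are hypothetical.

```python
import statistics

# Hypothetical sample: daily order counts for ten days.
sample = [12, 15, 15, 18, 20, 22, 22, 22, 30, 44]

mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)
variance = statistics.variance(sample)         # sample variance (n - 1 denominator)
std_dev = statistics.stdev(sample)             # sample standard deviation
quartiles = statistics.quantiles(sample, n=4)  # the three quartile cut points

print(mean, median, mode, variance, std_dev, quartiles)

# Bayes' rule with made-up rates:
# P(churn | complaint) = P(complaint | churn) * P(churn) / P(complaint)
p_churn = 0.10
p_complaint_given_churn = 0.60
p_complaint_given_stay = 0.20
p_complaint = (
    p_complaint_given_churn * p_churn
    + p_complaint_given_stay * (1 - p_churn)
)
p_churn_given_complaint = p_complaint_given_churn * p_churn / p_complaint

print(p_churn_given_complaint)
```

The same quantities are available in pandas and numpy for larger data sets; the standard library version is enough to check your understanding of the definitions.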

With basic statistical knowledge, you can use these statistics for basic analysis. By presenting data indicators visually, you can already draw many conclusions, such as which items rank in the top 100, what the average level is, how things have changed in recent years, and so on.

You can use the Python package Seaborn for this kind of visual analysis; it makes it easy to draw a variety of charts and get informative results. Once you understand hypothesis testing, you can judge whether a sample indicator differs from the assumed population indicator and whether the result falls within an acceptable range.
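As one concrete form of that check, the sketch below runs a two-sided one-sample z-test using only the standard library. It relies on the normal approximation, which is reasonable for larger samples; for small samples a t-test (for example scipy.stats.ttest_1samp) is the usual choice. The function name and the sample values are assumptions for illustration.

```python
import math
import statistics
from statistics import NormalDist


def one_sample_z_test(sample, hypothesized_mean):
    """Two-sided test of whether the sample mean differs from hypothesized_mean.

    Returns the z statistic and the p-value under the normal approximation.
    """
    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (sample_mean - hypothesized_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical sample of 40 measurements with a mean of 51.5.
measurements = [52, 48, 55, 51, 49, 53, 50, 54] * 5

z, p = one_sample_z_test(measurements, hypothesized_mean=50)
print(z, p)
```

A small p-value (commonly below 0.05) suggests the sample mean differs from the assumed population mean by more than chance alone would explain; a large one means the data are consistent with the assumption.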

Python data analysis

If you have looked around, you will know that there are many books on Python data analysis on the market, but each is very thick and daunting to learn from. In fact, only a small part of what they contain is the most useful. For example, using Python to run hypothesis tests in different situations already lets you validate data well.

Likewise, once you master regression analysis, linear regression and logistic regression let you analyze most data and draw relatively accurate conclusions. For example, DataCastle's training competitions "house price forecast" and "position forecast" can both be tackled with regression analysis. The knowledge points to master in this part are as follows:

Regression analysis: linear regression and logistic regression.

Basic classification algorithms: decision tree, random forest ...

Basic clustering algorithm: k-means ...

Basis of feature engineering: how to optimize the model through feature selection

Parameter tuning: how to tune parameters to optimize the model

Python data analysis packages: scipy, numpy, scikit-learn, etc.

At this stage of data analysis, most problems can be solved by focusing on regression analysis. Using descriptive statistical analysis and regression analysis, you can get a good analysis conclusion.
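For a sense of what a regression fit looks like in code, here is a minimal sketch of ordinary least squares with numpy. The data are invented and deliberately noise-free (price = 3 * area + 20) so the fitted coefficients are easy to check; real data will not fit exactly. scikit-learn's LinearRegression is the more common route in practice.

```python
import numpy as np

# Hypothetical data: floor area and price, constructed so that
# price = 3 * area + 20 exactly.
area = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
price = np.array([170.0, 200.0, 260.0, 320.0, 380.0])

# Design matrix with an intercept column, solved by ordinary least squares.
X = np.column_stack([area, np.ones_like(area)])
(slope, intercept), *_ = np.linalg.lstsq(X, price, rcond=None)

# Predict the price for an area that is not in the data.
predicted = slope * 90.0 + intercept

print(slope, intercept, predicted)
```

The slope is the estimated change in price per unit of area and the intercept is the estimated price at zero area; with more explanatory variables, each one becomes an extra column in X.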

Of course, as you practice more, you may run into more complicated problems and need some more advanced algorithms, such as classification and clustering; you will then learn which model suits which type of problem. For model optimization, you need to learn how to improve prediction accuracy through feature extraction and parameter tuning. This starts to overlap with data mining and machine learning. In fact, a good data analyst can be regarded as a junior data mining engineer.

Systematic hands-on practice

By this point you have basic data analysis ability. You still need hands-on practice across different cases and business scenarios. If you can complete analysis tasks independently, you are already ahead of most data analysts on the market.

How do you get hands-on practice?

Start with the open data sets mentioned above: find data in an area that interests you, analyze it from different angles and see what valuable conclusions you can draw.

Another approach is to look for problems you can analyze in your own life and work. For example, the e-commerce, recruitment and social networking platforms mentioned above offer many questions to explore.

At first you may not think every problem through, but as you gain experience you will gradually find your direction and learn the common dimensions of analysis, such as rankings, average levels, regional distribution, age distribution, correlation analysis and trend prediction. With more experience you develop a feel for data, which is what we usually call data thinking.

You can also read industry analysis reports to see the perspectives excellent analysts take and the dimensions along which they analyze a problem. This is not actually difficult.

After mastering the basic analysis methods, you can also try some data analysis competitions, such as the three training competitions DataCastle designed for data analysts, where you get a score and a ranking by submitting answers:

Employee turnover prediction training competition

King County (USA) housing price prediction training competition

Beijing PM2.5 concentration analysis training competition

The best time to plant a tree was ten years ago; the second-best time is now. Go find a data set and get started!