Why use Python for data analysis?
I have been using Python for three years, and I fell in love with it because of its simple, readable syntax and its powerful libraries. Its Pythonic language features are extremely friendly to people. Even someone who has never programmed before can learn to read Python code without much difficulty.

Compared with R, MATLAB, SAS, Stata and other tools, Python has advantages in interactive data analysis, exploratory computing, data visualization and so on. In recent years, thanks to the continuous improvement of Python libraries (such as pandas), it has become prominent in the field of data mining. Combined with its great strength in general-purpose programming, this means we can use Python alone as the single language for building data-centric applications.

Because Python is an interpreted language, code written in most compiled languages runs faster than Python code, so some people look down on Python. But the editor believes that Python, as a high-level language, makes programmers more productive. A programmer's time is usually more valuable than CPU time, so when weighing the pros and cons, Python is well worth considering.

Python's strength in computing comes from its rich and powerful libraries:

NumPy

Short for Numerical Python, NumPy is the foundational package for scientific computing in Python. It provides:

1. ndarray, a fast and efficient multi-dimensional array object.

2. Functions for performing element-wise calculations on arrays and mathematical operations between arrays.

3. Linear algebra operations, Fourier transforms and random number generation.

4. Tools for integrating C, C++ and Fortran code into Python.

In addition to providing Python with fast array processing capabilities, NumPy has another major role in data analysis: it serves as a container for passing data between algorithms. For numerical data, NumPy arrays are much more efficient than the built-in Python data structures for both storing and processing data. In addition, libraries written in low-level languages (such as C and Fortran) can operate directly on the data in NumPy arrays without copying it.
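The NumPy features listed above can be sketched in a few lines. This is only a minimal illustration: the array values, the small linear system and the seed are made up for the example and do not come from any real data set.

```python
import numpy as np

# A 2x3 ndarray; all elements share a single dtype (float64 here).
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Element-wise arithmetic, with no explicit Python loop.
doubled = a * 2
roots = np.sqrt(a)

# A reduction along an axis: one sum per column.
col_sums = a.sum(axis=0)

# Linear algebra: solve the 2x2 system M @ x = b.
M = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(M, b)

# Fourier transform of a short, made-up signal.
spectrum = np.fft.fft(np.array([1.0, 0.0, -1.0, 0.0]))

# Random numbers from a seeded generator, so runs are repeatable.
rng = np.random.default_rng(0)
samples = rng.normal(size=5)

print(col_sums, x)
```

Operations such as `a * 2` run in compiled code over the whole array, which is where most of the speed advantage over plain Python lists comes from.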

SciPy

SciPy is a collection of packages, each addressing a standard problem domain in scientific computing. It mainly includes the following packages:

1. scipy.integrate: numerical integration routines and differential equation solvers.

2. scipy.linalg: linear algebra routines and matrix decompositions that extend those provided by numpy.linalg.

3. scipy.optimize: function optimizers (minimizers) and root-finding algorithms.

4. scipy.signal: signal processing tools.

5. scipy.sparse: sparse matrices and sparse linear system solvers.

6. scipy.special: a wrapper around SPECFUN (a Fortran library that implements many commonly used mathematical functions, such as the gamma function).

7. scipy.stats: standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions, etc.), various statistical tests and more descriptive statistics.

8. scipy.weave: a tool that speeds up array calculations by using inline C++ code (it has since been removed from SciPy and is distributed as the separate weave package).

Note: Used together, NumPy and SciPy can replace much of the computing functionality of MATLAB (including some of its add-on toolboxes).
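A few of the SciPy packages listed above can be tried out as follows. This is a small sketch rather than a tour of the library: the integrand, the quadratic being minimized and the root-finding interval are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import integrate, optimize, stats

# scipy.integrate: integrate sin(x) over [0, pi]; the exact value is 2.
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# scipy.optimize: minimize (x - 3)^2, whose minimum lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# scipy.optimize: find the root of x^2 - 2 between 0 and 2, i.e. sqrt(2).
root = optimize.brentq(lambda x: x ** 2 - 2.0, 0.0, 2.0)

# scipy.stats: density and cumulative probability of a standard normal at 0.
density_at_zero = stats.norm.pdf(0.0)
prob_below_zero = stats.norm.cdf(0.0)

print(area, result.x, root, density_at_zero, prob_below_zero)
```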

SymPy

SymPy is Python's symbolic mathematics library. It can be used to derive and evaluate mathematical expressions symbolically.
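As a brief sketch of what symbolic calculation means in practice, the following differentiates, integrates and solves expressions exactly instead of numerically. The expressions themselves are arbitrary examples.

```python
import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x) * sp.exp(x)

# Symbolic derivative, obtained with the product rule.
derivative = sp.diff(expr, x)

# Symbolic antiderivative of 2*x.
antiderivative = sp.integrate(2 * x, x)

# Exact roots of a quadratic equation.
roots = sp.solve(x**2 - 5 * x + 6, x)

# Substitute a concrete value into the expression.
value = expr.subs(x, 0)

print(derivative, antiderivative, roots, value)
```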

pandas

pandas provides a large number of data structures and functions that let us process structured data quickly and conveniently. You will soon find that it is one of the important factors that make Python a powerful and efficient data analysis environment.

pandas combines the high-performance array computing of NumPy with the flexible data manipulation of spreadsheets and relational databases (such as SQL). It provides sophisticated indexing functionality, which makes it more convenient to reshape, slice and dice, aggregate and select subsets of data.

Users who do statistical computing in R will certainly be familiar with the name DataFrame, because it comes from R's data.frame object. The two objects are not the same, however: the functionality provided by R's data.frame is only a subset of what the pandas DataFrame provides. In other words, the pandas DataFrame is more powerful than R's data.frame.
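The selection, aggregation and reshaping operations mentioned above might look like this on a small DataFrame. The table contents (cities, years and sales figures) are invented purely for the example.

```python
import pandas as pd

# A small, made-up table of sales figures.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10.0, 12.0, 7.0, 9.0],
})

# Select a subset of rows with a boolean condition.
recent = df[df["year"] == 2021]

# Aggregate: total sales per city.
totals = df.groupby("city")["sales"].sum()

# Reshape: one row per city, one column per year.
wide = df.pivot(index="city", columns="year", values="sales")

# Label-based indexing into the reshaped table.
oslo_2021 = wide.loc["Oslo", 2021]

print(totals)
print(wide)
```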

matplotlib

matplotlib is the most popular Python library for drawing data charts. It was originally created by John D. Hunter (JDH) and is currently maintained by a large development team. It is ideal for creating charts for use in publications. It is well integrated with IPython (which will be discussed soon), thus providing a very useful interactive plotting environment. The charts you draw are also interactive: you can use the toolbar in the plot window to zoom in on an area of the chart, or to pan across the whole chart.
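A minimal plotting sketch is shown here. It assumes a non-interactive backend ("Agg") and writes the image to an in-memory buffer so that it runs without a display; in an interactive session you would normally omit the backend line and call plt.show() to get the window with the toolbar described above. The curve and labels are arbitrary.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Sample one period of a sine wave.
x = np.linspace(0.0, 2.0 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A sine wave")
ax.legend()

# Render the figure as PNG into memory; pass a filename instead to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
png_bytes = buffer.getvalue()
```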

TVTK

TVTK is a powerful 3D visualization library for Python data. It provides a Python-style API and supports Traits attributes and NumPy arrays. (Python is a dynamic programming language whose variables have no declared types; this helps rapid development but also has shortcomings. The Traits library adds validation to the attributes of objects, which improves the readability of programs and reduces the error rate.) The library is very large, so the company that develops it provides a documentation browser that users can launch with the following statements:

>>> from enthought.tvtk.tools import tvtk_doc

>>> tvtk_doc.main()

scikit-learn

scikit-learn is a Python machine learning library built on NumPy, SciPy and matplotlib. It offers simple and efficient tools for data mining and data analysis, and its documentation and examples are fairly complete.
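To give a feel for the library, here is a short sketch of the usual fit-and-score workflow. The choice of the bundled iris data set, a k-nearest-neighbors classifier and a 75/25 split are assumptions made for illustration, not a recommendation from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small data set that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a k-nearest-neighbors classifier on the training part.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Fraction of held-out samples classified correctly.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Most scikit-learn estimators follow the same fit and score (or predict) pattern, so another model can usually be swapped in by changing a single line.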

The editor's suggestion: beginners should use Python(x,y), a free Python distribution for science and engineering that provides mathematical computing, data analysis and visualization. Very convenient!