As a proud programmer, you should master the basics of mathematics; with them, you are far more likely to build a great product.
A vector is an ordered array of real numbers that has both magnitude and direction. An n-dimensional vector a consists of n ordered real numbers and is written a = [a1, a2, ..., an].
Matrix
Linear mapping: A matrix usually represents a linear map f: V → W from an n-dimensional linear space V to an m-dimensional linear space W.
Note: for convenience of writing, xᵀ denotes the transpose of the vector x. Here x = (x1, x2, ..., xn)ᵀ and y = (y1, y2, ..., ym)ᵀ are column vectors in the two linear spaces V and W respectively, and A is an m × n matrix that describes the linear map from V to W, so that y = Ax.
Transpose: Exchanging the rows and columns of a matrix A yields its transpose Aᵀ, with [Aᵀ]ij = [A]ji.
Addition: If A and B are both m × n matrices, then their sum A + B is also an m × n matrix, and each element is the sum of the corresponding elements of A and B: [A + B]ij = aij + bij.
Multiplication: If A is a k × m matrix and B is an m × n matrix, then the product AB is a k × n matrix, with [AB]ij = Σp aip bpj, summing over p = 1, ..., m.
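To make these operations concrete, here is a minimal sketch using NumPy (NumPy and the example arrays are my own additions, not part of the original text):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2 x 3 matrix
B = np.ones((2, 3), dtype=int)   # 2 x 3 matrix of ones

print(A + B)   # elementwise addition: [A+B]ij = aij + bij
print(A.T)     # transpose: a 3 x 2 matrix

C = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # 3 x 2 matrix
print(A @ C)   # product of a 2x3 and a 3x2 matrix is 2x2
```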
Diagonal matrix: A diagonal matrix is a matrix in which all elements off the main diagonal are 0; the elements on the diagonal may be 0 or any other value. An n × n diagonal matrix A satisfies [A]ij = 0 if i ≠ j, for all i, j ∈ {1, ..., n}.
Eigenvalues and eigenvectors: If a scalar λ and a nonzero vector v satisfy Av = λv, then λ and v are called an eigenvalue and an eigenvector of the matrix A, respectively.
Matrix decomposition: A matrix can usually be expressed in terms of several relatively simple matrices; this is called matrix decomposition.
Singular value decomposition: The singular value decomposition (SVD) of an m × n matrix A is

A = UΣVᵀ

where U and V are orthogonal matrices of size m × m and n × n respectively, and Σ is an m × n diagonal matrix whose diagonal elements are called the singular values.
Eigendecomposition: The eigendecomposition of an n × n square matrix A is defined as

A = QΛQ⁻¹

where Q is an n × n square matrix whose columns are the eigenvectors of A, and Λ is a diagonal matrix whose diagonal elements are the corresponding eigenvalues of A. If A is a symmetric matrix, it can be decomposed as

A = QΛQᵀ

where Q is an orthogonal matrix.
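Both decompositions can be checked numerically; the sketch below uses NumPy (an assumption of mine, not mentioned in the original) with example matrices chosen purely for demonstration:

```python
import numpy as np

# Singular value decomposition: A = U @ Sigma @ V.T
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
U, s, Vt = np.linalg.svd(A)            # U: 2x2, s: singular values, Vt: 3x3
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)   # embed singular values in an m x n matrix
print(np.allclose(A, U @ Sigma @ Vt))  # True

# Eigendecomposition of a symmetric matrix: S = Q @ Lambda @ Q.T
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
w, Q = np.linalg.eigh(S)               # eigh is specialized for symmetric matrices
print(np.allclose(S, Q @ np.diag(w) @ Q.T))  # True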
Derivative: Consider a function f: R → R whose domain and codomain are both the real numbers. If, in a neighborhood of the point x0, the limit

f'(x0) = lim (Δx → 0) [f(x0 + Δx) − f(x0)] / Δx

exists, the function f(x) is said to be differentiable at the point x0, and f'(x0) is called its derivative; the function f'(x) is the derivative function. If f(x) is differentiable at every point of an interval contained in its domain, we say f(x) is differentiable on that interval. A continuous function is not necessarily differentiable, but a differentiable function must be continuous. For example, the function |x| is continuous, but it is not differentiable at x = 0.
Addition rule
If y = f(x) and z = g(x), then (y + z)' = f'(x) + g'(x).
Multiplication rule
With y and z as above, (y·z)' = f'(x)·g(x) + f(x)·g'(x).
Chain rule: The chain rule is the rule for differentiating a composite function, and it is a common method of computing derivatives in calculus. If x ∈ R, y = g(x) ∈ R, and z = f(y) ∈ R, then

dz/dx = (dz/dy)·(dy/dx) = f'(y)·g'(x)
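The chain rule is easy to verify numerically; the sketch below uses a central-difference approximation, with the functions f and g chosen purely for illustration:

```python
import math

def numeric_derivative(f, x, h=1e-6):
    # Central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

g = lambda x: x ** 2          # y = g(x)
f = lambda y: math.sin(y)     # z = f(y)
compose = lambda x: f(g(x))   # z = f(g(x))

x = 1.3
lhs = numeric_derivative(compose, x)  # d(f∘g)/dx, estimated numerically
rhs = math.cos(g(x)) * 2 * x          # f'(g(x)) * g'(x), by the chain rule
print(abs(lhs - rhs) < 1e-5)          # True
```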
The logistic function is a commonly used S-shaped function. The Belgian mathematician Pierre François Verhulst coined the name while studying population growth models in 1844–1845; it was originally a biological model. The logistic function is defined as

logistic(x) = L / (1 + e^(−k(x − x0)))

where x0 is the midpoint, L is the maximum value, and k controls the steepness of the curve.
When the parameters are k = 1, x0 = 0, and L = 1, the logistic function is called the standard logistic function, denoted σ(x):

σ(x) = 1 / (1 + e^(−x))
The standard logistic function is widely used in machine learning, usually to map a real number into the interval (0, 1). Its derivative is

σ'(x) = σ(x)(1 − σ(x))
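A small sketch of the standard logistic function and its derivative (the NumPy implementation and sample points are illustrative choices, not from the original):

```python
import numpy as np

def sigmoid(x):
    # Standard logistic function: sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6, 6, 5)
print(sigmoid(x))       # values squashed into (0, 1)
print(sigmoid_grad(x))  # maximal at x = 0, where sigma'(0) = 0.25
```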
The softmax function maps multiple scalars to a probability distribution. For k scalars x1, ..., xk, the softmax function is defined as

zi = softmax(xi) = e^(xi) / Σj e^(xj), summing over j = 1, ..., k
In this way, the k variables x1, ..., xk are transformed into a distribution z1, ..., zk, which satisfies zi ∈ (0, 1) and Σi zi = 1.
When the input of the softmax function is the k-dimensional vector x,

softmax(x) = exp(x) / (1kᵀ exp(x))

where 1k = [1, ..., 1] is the k-dimensional all-ones vector. Its derivative (Jacobian) is

∂softmax(x)/∂x = diag(softmax(x)) − softmax(x) softmax(x)ᵀ
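Here is a short illustrative implementation (the max-subtraction trick is a common numerical-stability measure, not something the original text prescribes):

```python
import numpy as np

def softmax(x):
    # Subtracting max(x) avoids overflow in exp(); it does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
z = softmax(x)
print(z, z.sum())  # a probability distribution summing to 1

# Jacobian: diag(softmax(x)) - softmax(x) softmax(x)^T
J = np.diag(z) - np.outer(z, z)
print(J)
```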
Discrete optimization and continuous optimization: Mathematical optimization problems can be divided into discrete and continuous optimization problems according to whether the input variable x takes values in a discrete set or in the real numbers.
Unconstrained optimization and constrained optimization: Among continuous optimization problems, a problem is unconstrained or constrained according to whether there are constraints on the variables.

### Optimization algorithm
Global optimization and local optimization
Hessian matrix: The Hessian matrix of a multivariate function f is the square matrix of its second-order partial derivatives, with [H(f)]ij = ∂²f / ∂xi∂xj.
The gradient is a familiar notion in operations research; an earlier article already used gradient-based step-by-step computation: the gradient descent algorithm.
A gradient is a vector: it points in the direction in which the directional derivative of a function at a point attains its maximum, i.e., the direction along which the function changes fastest at that point, and the rate of change in that direction is the largest (equal to the modulus of the gradient).
Gradient descent method
The gradient descent method, also known as the method of steepest descent, is often used to solve unconstrained minimization problems. Starting from an initial point x0, it iterates xt+1 = xt − α∇f(xt), where α > 0 is the step size (learning rate).
The process of gradient descent is shown in the figure. Each curve is a contour line (level set), i.e., a set of points on which the function f takes a constant value. The red arrows point in the direction opposite to the gradient at each point (the gradient direction is perpendicular to the contour line through the point). Following the direction of gradient descent, we eventually reach a local minimum of f.
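A minimal sketch of the iteration described above, with a simple quadratic objective chosen purely for illustration:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Iterate x_{t+1} = x_t - lr * grad(x_t) toward a local minimum.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x, y) = x^2 + 10*y^2; its gradient is (2x, 20y).
grad_f = lambda v: np.array([2 * v[0], 20 * v[1]])
print(gradient_descent(grad_f, [3.0, 2.0], lr=0.05))  # close to (0, 0)
```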
Gradient ascent method
To solve a maximization problem instead, search in the positive direction of the gradient, gradually approaching a local maximum of the function; this process is called the gradient ascent method.
Probability theory mainly studies the quantitative laws underlying large numbers of random phenomena. It is widely applied, covering almost every field.
Discrete random variable
If the possible values of a random variable X are finite and enumerable, say the n values {x1, ..., xn}, then X is called a discrete random variable. To understand the statistical law of X, we must know the probability of each possible value xi, that is,

P(X = xi) = p(xi), i ∈ {1, ..., n}

which is called the probability distribution (or simply the distribution) of the discrete random variable X. It satisfies

p(xi) ≥ 0 for all i, and Σi p(xi) = 1
Common discrete probability distributions include:
Bernoulli distribution: a single trial with success probability p, so that P(X = 1) = p and P(X = 0) = 1 − p.
Binomial distribution: the number of successes in n independent Bernoulli trials, P(X = k) = C(n, k) p^k (1 − p)^(n−k).
Continuous random variable
Unlike discrete random variables, some random variables X take uncountably many values, consisting of all real numbers or of some intervals, for example

X ∈ [a, b], −∞ < a < b < +∞

Such an X is called a continuous random variable.
Probability density function
The probability distribution of a continuous random variable X is generally described by a probability density function p(x). It is an integrable function satisfying

p(x) ≥ 0 and ∫ p(x) dx = 1 (integrating over the whole real line)
Uniform distribution: If a and b are finite numbers, the probability density function of the uniform distribution on [a, b] is defined as

p(x) = 1 / (b − a) for a ≤ x ≤ b, and p(x) = 0 otherwise
The normal distribution, also known as the Gaussian distribution, is the most common distribution in nature; it has many good properties and a very important influence in many fields. Its probability density function is

p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

where σ > 0, and μ and σ are constants. If the random variable X obeys the distribution with parameters μ and σ, we abbreviate this as X ~ N(μ, σ²).
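As a quick sanity check of these two distributions, here is an illustrative sampling sketch with NumPy (the parameters and sample sizes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform distribution on [a, b]: the mean is (a + b) / 2
u = rng.uniform(low=2.0, high=6.0, size=100_000)
print(u.mean())           # close to 4.0

# Normal distribution N(mu, sigma^2)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
print(x.mean(), x.std())  # close to mu = 1.0 and sigma = 2.0
```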
Cumulative distribution function
For a random variable X, the cumulative distribution function is the probability that the value of X is less than or equal to x:

cdf(x) = P(X ≤ x)

Taking a continuous random variable X as an example, the cumulative distribution function is defined as

cdf(x) = ∫(−∞ to x) p(t) dt

where p(x) is the probability density function. The cumulative distribution function of the standard normal distribution is

Φ(x) = ∫(−∞ to x) (1/√(2π)) e^(−t²/2) dt
Random vector
A random vector is a vector composed of a set of random variables. If x1, x2, ..., xn are n random variables, then [x1, x2, ..., xn] is called an n-dimensional random vector. A one-dimensional random vector is just a random variable. Random vectors are divided into discrete random vectors and continuous random vectors.

Conditional probability distribution: For a discrete random vector (X, Y), given X = x, the conditional probability of Y = y is

p(y | x) = P(Y = y | X = x) = p(x, y) / p(x)
For a two-dimensional continuous random vector (X, Y), given X = x, the conditional probability density function of Y is

p(y | x) = p(x, y) / p(x)
Expectation: For a discrete random variable X with probability distribution p(x1), ..., p(xn), the expected value or mean of X is defined as

E[X] = Σi xi p(xi)

For a continuous random variable X with probability density function p(x), its expectation is defined as

E[X] = ∫ x p(x) dx
Variance: The variance of a random variable X measures the dispersion of its probability distribution and is defined as

var(X) = E[(X − E[X])²]

The variance of a random variable X is also called its second moment. The square root of the variance is called the root variance or standard deviation of X.
Covariance: The covariance of two continuous random variables X and Y measures the overall variability between their distributions and is defined as

cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Covariance is usually used to measure the linear correlation between two random variables. If the covariance of two random variables is 0, they are said to be linearly uncorrelated. Linear uncorrelatedness does not imply independence: there may still be some nonlinear functional relationship between them. Conversely, if X and Y are statistically independent, their covariance must be 0.
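A small numerical illustration of expectation, variance, and covariance (the distributions and sample sizes here are arbitrary choices of mine):

```python
import numpy as np

# Discrete expectation: E[X] = sum_i x_i * p(x_i)
xs = np.array([1.0, 2.0, 3.0])
ps = np.array([0.2, 0.5, 0.3])
mean = np.sum(xs * ps)
var = np.sum((xs - mean) ** 2 * ps)   # var(X) = E[(X - E[X])^2]
print(mean, var)

# Empirical covariance of linearly related samples
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)  # linear relation plus noise
print(np.cov(x, y)[0, 1])             # close to 2 = cov(X, Y)
```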
A stochastic process is a set of random variables Xt, where t belongs to an index set T. The index set T can be defined over a time domain or a space domain, but it is generally a time index represented by real numbers or positive integers. When t is a real number, the process is a continuous-time stochastic process; when t is an integer, it is a discrete-time stochastic process. Many examples in daily life, including stock fluctuations, speech signals, and height changes, can be regarded as stochastic processes. Common time-dependent stochastic process models include the Bernoulli process, random walks, Markov processes, and so on.
A Markov process is a stochastic process in which, given the current state and all past states, the conditional probability distribution of the future state depends only on the current state:

P(Xt+1 = xt+1 | X0:t = x0:t) = P(Xt+1 = xt+1 | Xt = xt)

where X0:t denotes the variable set X0, X1, ..., Xt, and x0:t is a state sequence in the state space.
Markov chain: A discrete-time Markov process is also called a Markov chain. If the conditional probability of the Markov chain, P(Xt+1 = s | Xt = s'), is independent of the time t, it is called a time-homogeneous Markov chain.
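A minimal simulation sketch of a time-homogeneous Markov chain (the two-state transition matrix is an invented example):

```python
import numpy as np

# A time-homogeneous Markov chain with 2 states and transition matrix P,
# where P[i, j] = P(X_{t+1} = j | X_t = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=P[state])  # next state depends only on the current one
    counts[state] += 1
print(counts / counts.sum())  # approaches the stationary distribution (5/6, 1/6)
```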
A use of Markov chains can be seen in an earlier, light-hearted article: "Can you guess your girlfriend's mind? Markov chains tell you." Stochastic processes and Gaussian processes are quite involved, so I won't go into details here.
Information theory is an interdisciplinary field spanning mathematics, physics, statistics, and computer science. It was first proposed by Claude Shannon and mainly studies methods for quantifying, storing, and communicating information. Information theory also has many applications in machine learning, such as feature extraction, statistical inference, and natural language processing.
In information theory, entropy measures the uncertainty of random events. Suppose we encode a random variable X whose value set is C and whose probability distribution is p(x), x ∈ C. The self-information I(x), i.e., the amount of information or encoding length of the event X = x, is defined as

I(x) = −log p(x)

The average encoding length of the random variable X, namely its entropy, is defined as

H(X) = E[I(x)] = −Σ(x ∈ C) p(x) log p(x)

where, when p(x) = 0, we define 0 log 0 = 0. Entropy is thus the average encoding length of the random variable, i.e., the mathematical expectation of the self-information. The higher the entropy, the more information the random variable carries; the lower the entropy, the less information. If the variable X takes a single value x with p(x) = 1, its entropy is 0; that is, for deterministic information, the entropy is 0 and the amount of information is also 0. If the probability distribution is uniform, the entropy is maximal. For example, if the random variable X has three possible values x1, x2, x3, the entropy ranges from 0 (all probability mass on one value) up to its maximum log 3 at the uniform distribution (1/3, 1/3, 1/3).
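A short sketch computing entropy for a three-valued variable like the one above (base-2 logarithms, i.e., bits, are my choice of unit):

```python
import numpy as np

def entropy(p):
    # H(X) = -sum p(x) log p(x), with 0 log 0 defined as 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # drop zero-probability values (0 log 0 = 0)
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0, 0.0]))    # 0.0: a certain outcome carries no information
print(entropy([1/3, 1/3, 1/3]))    # log2(3) ~ 1.585: uniform maximizes entropy
print(entropy([0.5, 0.25, 0.25]))  # 1.5: in between
```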
Joint entropy and conditional entropy: For two discrete random variables X and Y, suppose the value set of X is 𝒳 and the value set of Y is 𝒴, with joint probability distribution p(x, y). The joint entropy of X and Y is

H(X, Y) = −Σ(x ∈ 𝒳) Σ(y ∈ 𝒴) p(x, y) log p(x, y)

The conditional entropy of X given Y is

H(X | Y) = −Σ(x ∈ 𝒳) Σ(y ∈ 𝒴) p(x, y) log p(x | y) = H(X, Y) − H(Y)
Mutual information: Mutual information measures how much knowing one variable reduces the uncertainty about the other. The mutual information of two discrete random variables X and Y is defined as

I(X; Y) = Σx Σy p(x, y) log [ p(x, y) / (p(x) p(y)) ]
Cross entropy and divergence: For a random variable with distribution p(x), the entropy H(p) represents its optimal encoding length. Cross entropy is the length of encoding information from the true distribution p when the code is chosen optimally for the distribution q, defined as

H(p, q) = E_p[−log q(x)] = −Σx p(x) log q(x)

Given p, the closer q is to p, the smaller the cross entropy; the farther q is from p, the larger the cross entropy. The gap between the two defines the Kullback–Leibler (KL) divergence: D_KL(p ‖ q) = H(p, q) − H(p).
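A brief illustrative computation of cross entropy (the distributions are invented for the example, and bits are my choice of unit):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])
print(cross_entropy(p, p))                # equals H(p) = 1.5 when q = p
print(cross_entropy(p, [0.4, 0.3, 0.3]))  # larger once q drifts from p
print(cross_entropy(p, [0.1, 0.1, 0.8]))  # larger still as q moves farther away
```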