As a proud programmer, you should master the basics of mathematics; with them, you are far more likely to build a great product.
A vector is an ordered array of real numbers that has both magnitude and direction. An n-dimensional vector a consists of n ordered real numbers and is written a = [a1, a2, ..., an].
Matrix
Linear mapping: A matrix usually represents a linear map f: V → W from an n-dimensional linear space V to an m-dimensional linear space W.
Note: for convenience of writing, xᵀ denotes the transpose of the vector x. Here x = (x1, x2, ..., xn)ᵀ and y = (y1, y2, ..., ym)ᵀ are column vectors in the two linear spaces V and W respectively, and A is an m × n matrix that describes the linear map from V to W, so that y = Ax.
Transpose: Exchanging the rows and columns of a matrix A yields its transpose Aᵀ, with [Aᵀ]ij = [A]ji.
Addition: If A and B are both m × n matrices, then their sum A + B is also an m × n matrix, and each element is the sum of the corresponding elements of A and B: [A + B]ij = aij + bij.
Multiplication: If A is a k × m matrix and B is an m × n matrix, then the product AB is a k × n matrix, with [AB]ij = Σp aip bpj, summing over p = 1, ..., m.
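To make these operations concrete, here is a minimal sketch using NumPy (NumPy and the example arrays are my own additions, not part of the original text):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2 x 3 matrix
B = np.ones((2, 3), dtype=int)   # 2 x 3 matrix of ones

print(A + B)   # elementwise addition: [A+B]ij = aij + bij
print(A.T)     # transpose: a 3 x 2 matrix

C = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # 3 x 2 matrix
print(A @ C)   # product of a 2x3 and a 3x2 matrix is 2x2
```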
Diagonal matrix: A diagonal matrix is a matrix in which all elements off the main diagonal are 0; the elements on the diagonal may be 0 or any other value. An n × n diagonal matrix A satisfies [A]ij = 0 if i ≠ j, for all i, j ∈ {1, ..., n}.
Eigenvalues and eigenvectors: If a scalar λ and a nonzero vector v satisfy Av = λv, then λ and v are called an eigenvalue and an eigenvector of the matrix A, respectively.
Matrix decomposition: A matrix can usually be expressed in terms of several relatively simple matrices; this is called matrix decomposition.
Singular value decomposition: The singular value decomposition (SVD) of an m × n matrix A is

A = UΣVᵀ

where U and V are orthogonal matrices of size m × m and n × n respectively, and Σ is an m × n diagonal matrix whose diagonal elements are called the singular values.
Eigendecomposition: The eigendecomposition of an n × n square matrix A is defined as

A = QΛQ⁻¹

where Q is an n × n square matrix whose columns are the eigenvectors of A, and Λ is a diagonal matrix whose diagonal elements are the corresponding eigenvalues of A. If A is a symmetric matrix, it can be decomposed as

A = QΛQᵀ

where Q is an orthogonal matrix.
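Both decompositions can be checked numerically; the sketch below uses NumPy (an assumption of mine, not mentioned in the original) with example matrices chosen purely for demonstration:

```python
import numpy as np

# Singular value decomposition: A = U @ Sigma @ V.T
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
U, s, Vt = np.linalg.svd(A)            # U: 2x2, s: singular values, Vt: 3x3
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)   # embed singular values in an m x n matrix
print(np.allclose(A, U @ Sigma @ Vt))  # True

# Eigendecomposition of a symmetric matrix: S = Q @ Lambda @ Q.T
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
w, Q = np.linalg.eigh(S)               # eigh is specialized for symmetric matrices
print(np.allclose(S, Q @ np.diag(w) @ Q.T))  # True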
Derivative: Consider a function f: R → R whose domain and codomain are both the real numbers. If, in a neighborhood of the point x0, the limit

f'(x0) = lim (Δx → 0) [f(x0 + Δx) − f(x0)] / Δx

exists, the function f(x) is said to be differentiable at the point x0, and f'(x0) is called its derivative; the function f'(x) is the derivative function. If f(x) is differentiable at every point of an interval contained in its domain, we say f(x) is differentiable on that interval. A continuous function is not necessarily differentiable, but a differentiable function must be continuous. For example, the function |x| is continuous, but it is not differentiable at x = 0.
Addition rule
If y = f(x) and z = g(x), then (y + z)' = f'(x) + g'(x).
Multiplication rule
With y and z as above, (y·z)' = f'(x)·g(x) + f(x)·g'(x).
Chain rule: The chain rule is the rule for differentiating a composite function, and it is a common method of computing derivatives in calculus. If x ∈ R, y = g(x) ∈ R, and z = f(y) ∈ R, then

dz/dx = (dz/dy)·(dy/dx) = f'(y)·g'(x)
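The chain rule is easy to verify numerically; the sketch below uses a central-difference approximation, with the functions f and g chosen purely for illustration:

```python
import math

def numeric_derivative(f, x, h=1e-6):
    # Central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

g = lambda x: x ** 2          # y = g(x)
f = lambda y: math.sin(y)     # z = f(y)
compose = lambda x: f(g(x))   # z = f(g(x))

x = 1.3
lhs = numeric_derivative(compose, x)  # d(f∘g)/dx, estimated numerically
rhs = math.cos(g(x)) * 2 * x          # f'(g(x)) * g'(x), by the chain rule
print(abs(lhs - rhs) < 1e-5)          # True
```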
The logistic function is a commonly used S-shaped function. The Belgian mathematician Pierre François Verhulst coined the name while studying population growth models in 1844–1845; it was originally a biological model. The logistic function is defined as

logistic(x) = L / (1 + e^(−k(x − x0)))

where x0 is the midpoint, L is the maximum value, and k controls the steepness of the curve.
When the parameters are k = 1, x0 = 0, and L = 1, the logistic function is called the standard logistic function, denoted σ(x):

σ(x) = 1 / (1 + e^(−x))
The standard logistic function is widely used in machine learning, usually to map a real number into the interval (0, 1). Its derivative is

σ'(x) = σ(x)(1 − σ(x))
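A small sketch of the standard logistic function and its derivative (the NumPy implementation and sample points are illustrative choices, not from the original):

```python
import numpy as np

def sigmoid(x):
    # Standard logistic function: sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6, 6, 5)
print(sigmoid(x))       # values squashed into (0, 1)
print(sigmoid_grad(x))  # maximal at x = 0, where sigma'(0) = 0.25
```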
The softmax function maps multiple scalars to a probability distribution. For k scalars x1, ..., xk, the softmax function is defined as

zi = softmax(xi) = e^(xi) / Σj e^(xj), summing over j = 1, ..., k
In this way, the k variables x1, ..., xk are transformed into a distribution z1, ..., zk, which satisfies zi ∈ (0, 1) and Σi zi = 1.
When the input of the softmax function is the k-dimensional vector x,

softmax(x) = exp(x) / (1kᵀ exp(x))

where 1k = [1, ..., 1] is the k-dimensional all-ones vector. Its derivative (Jacobian) is

∂softmax(x)/∂x = diag(softmax(x)) − softmax(x) softmax(x)ᵀ
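Here is a short illustrative implementation (the max-subtraction trick is a common numerical-stability measure, not something the original text prescribes):

```python
import numpy as np

def softmax(x):
    # Subtracting max(x) avoids overflow in exp(); it does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
z = softmax(x)
print(z, z.sum())  # a probability distribution summing to 1

# Jacobian: diag(softmax(x)) - softmax(x) softmax(x)^T
J = np.diag(z) - np.outer(z, z)
print(J)
```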
Discrete optimization and continuous optimization: Mathematical optimization problems can be divided into discrete and continuous optimization problems according to whether the input variable x takes values in a discrete set or in the real numbers.
Unconstrained optimization and constrained optimization: Among continuous optimization problems, a problem is unconstrained or constrained according to whether there are constraints on the variables.

### Optimization algorithm
Global optimization and local optimization
Hessian matrix: The Hessian matrix of a multivariate function f is the square matrix of its second-order partial derivatives, with [H(f)]ij = ∂²f / ∂xi∂xj.
The gradient is a familiar notion in operations research; an earlier article already used gradient-based step-by-step computation: the gradient descent algorithm.
A gradient is a vector: it points in the direction in which the directional derivative of a function at a point attains its maximum, i.e., the direction along which the function changes fastest at that point, and the rate of change in that direction is the largest (equal to the modulus of the gradient).
Gradient descent method
The gradient descent method, also known as the method of steepest descent, is often used to solve unconstrained minimization problems. Starting from an initial point x0, it iterates xt+1 = xt − α∇f(xt), where α > 0 is the step size (learning rate).
The process of gradient descent is shown in the figure. Each curve is a contour line (level set), i.e., a set of points on which the function f takes a constant value. The red arrows point in the direction opposite to the gradient at each point (the gradient direction is perpendicular to the contour line through the point). Following the direction of gradient descent, we eventually reach a local minimum of f.
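A minimal sketch of the iteration described above, with a simple quadratic objective chosen purely for illustration:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Iterate x_{t+1} = x_t - lr * grad(x_t) toward a local minimum.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x, y) = x^2 + 10*y^2; its gradient is (2x, 20y).
grad_f = lambda v: np.array([2 * v[0], 20 * v[1]])
print(gradient_descent(grad_f, [3.0, 2.0], lr=0.05))  # close to (0, 0)
```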
Gradient ascent method
To solve a maximization problem instead, search in the positive direction of the gradient, gradually approaching a local maximum of the function; this process is called the gradient ascent method.
Probability theory mainly studies the quantitative laws underlying large numbers of random phenomena. It is widely applied, covering almost every field.
Discrete random variable
If the possible values of a random variable X are finite and enumerable, say the n values {x1, ..., xn}, then X is called a discrete random variable. To understand the statistical law of X, we must know the probability of each possible value xi, that is,

P(X = xi) = p(xi), i ∈ {1, ..., n}

which is called the probability distribution (or simply the distribution) of the discrete random variable X. It satisfies

p(xi) ≥ 0 for all i, and Σi p(xi) = 1
Common discrete probability distributions include:
Bernoulli distribution: a single trial with success probability p, so that P(X = 1) = p and P(X = 0) = 1 − p.
Binomial distribution: the number of successes in n independent Bernoulli trials, P(X = k) = C(n, k) p^k (1 − p)^(n−k).
Continuous random variable
Unlike discrete random variables, some random variables X take uncountably many values, consisting of all real numbers or of some intervals, for example

X ∈ [a, b], −∞ < a < b < +∞

Such an X is called a continuous random variable.
Probability density function
The probability distribution of a continuous random variable X is generally described by a probability density function p(x). It is an integrable function satisfying

p(x) ≥ 0 and ∫ p(x) dx = 1 (integrating over the whole real line)
Uniform distribution: If a and b are finite numbers, the probability density function of the uniform distribution on [a, b] is defined as

p(x) = 1 / (b − a) for a ≤ x ≤ b, and p(x) = 0 otherwise
The normal distribution, also known as the Gaussian distribution, is the most common distribution in nature; it has many good properties and a very important influence in many fields. Its probability density function is

p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

where σ > 0, and μ and σ are constants. If the random variable X obeys the distribution with parameters μ and σ, we abbreviate this as X ~ N(μ, σ²).
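As a quick sanity check of these two distributions, here is an illustrative sampling sketch with NumPy (the parameters and sample sizes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform distribution on [a, b]: the mean is (a + b) / 2
u = rng.uniform(low=2.0, high=6.0, size=100_000)
print(u.mean())           # close to 4.0

# Normal distribution N(mu, sigma^2)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
print(x.mean(), x.std())  # close to mu = 1.0 and sigma = 2.0
```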
Cumulative distribution function
For a random variable X, the cumulative distribution function is the probability that the value of X is less than or equal to x:

cdf(x) = P(X ≤ x)

Taking a continuous random variable X as an example, the cumulative distribution function is defined as

cdf(x) = ∫(−∞ to x) p(t) dt

where p(x) is the probability density function. The cumulative distribution function of the standard normal distribution is

Φ(x) = ∫(−∞ to x) (1/√(2π)) e^(−t²/2) dt
Random vector
A random vector is a vector composed of a set of random variables. If x1, x2, ..., xn are n random variables, then [x1, x2, ..., xn] is called an n-dimensional random vector. A one-dimensional random vector is just a random variable. Random vectors are divided into discrete random vectors and continuous random vectors.

Conditional probability distribution: For a discrete random vector (X, Y), given X = x, the conditional probability of Y = y is

p(y | x) = P(Y = y | X = x) = p(x, y) / p(x)
For a two-dimensional continuous random vector (X, Y), given X = x, the conditional probability density function of Y is

p(y | x) = p(x, y) / p(x)
Expectation: For a discrete random variable X with probability distribution p(x1), ..., p(xn), the expected value or mean of X is defined as

E[X] = Σi xi p(xi)

For a continuous random variable X with probability density function p(x), its expectation is defined as

E[X] = ∫ x p(x) dx
Variance: The variance of a random variable X measures the dispersion of its probability distribution and is defined as

var(X) = E[(X − E[X])²]

The variance of a random variable X is also called its second moment. The square root of the variance is called the root variance or standard deviation of X.
Covariance: The covariance of two continuous random variables X and Y measures the overall variability between their distributions and is defined as

cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Covariance is usually used to measure the linear correlation between two random variables. If the covariance of two random variables is 0, they are said to be linearly uncorrelated. Linear uncorrelatedness does not imply independence: there may still be some nonlinear functional relationship between them. Conversely, if X and Y are statistically independent, their covariance must be 0.
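A small numerical illustration of expectation, variance, and covariance (the distributions and sample sizes here are arbitrary choices of mine):

```python
import numpy as np

# Discrete expectation: E[X] = sum_i x_i * p(x_i)
xs = np.array([1.0, 2.0, 3.0])
ps = np.array([0.2, 0.5, 0.3])
mean = np.sum(xs * ps)
var = np.sum((xs - mean) ** 2 * ps)   # var(X) = E[(X - E[X])^2]
print(mean, var)

# Empirical covariance of linearly related samples
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)  # linear relation plus noise
print(np.cov(x, y)[0, 1])             # close to 2 = cov(X, Y)
```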
A stochastic process is a set of random variables Xt, where t belongs to an index set T. The index set T can be defined over a time domain or a space domain, but it is generally a time index represented by real numbers or positive integers. When t is a real number, the process is a continuous-time stochastic process; when t is an integer, it is a discrete-time stochastic process. Many examples in daily life, including stock fluctuations, speech signals, and height changes, can be regarded as stochastic processes. Common time-dependent stochastic process models include the Bernoulli process, random walks, Markov processes, and so on.
A Markov process is a stochastic process in which, given the current state and all past states, the conditional probability distribution of the future state depends only on the current state:

P(Xt+1 = xt+1 | X0:t = x0:t) = P(Xt+1 = xt+1 | Xt = xt)

where X0:t denotes the variable set X0, X1, ..., Xt, and x0:t is a state sequence in the state space.
Markov chain: A discrete-time Markov process is also called a Markov chain. If the conditional probability of the Markov chain, P(Xt+1 = s | Xt = s'), is independent of the time t, it is called a time-homogeneous Markov chain.
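A minimal simulation sketch of a time-homogeneous Markov chain (the two-state transition matrix is an invented example):

```python
import numpy as np

# A time-homogeneous Markov chain with 2 states and transition matrix P,
# where P[i, j] = P(X_{t+1} = j | X_t = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=P[state])  # next state depends only on the current one
    counts[state] += 1
print(counts / counts.sum())  # approaches the stationary distribution (5/6, 1/6)
```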
A use of Markov chains can be seen in an earlier, light-hearted article: "Can you guess your girlfriend's mind? Markov chains tell you." Stochastic processes and Gaussian processes are quite involved, so I won't go into details here.
Information theory is an interdisciplinary field spanning mathematics, physics, statistics, and computer science. It was first proposed by Claude Shannon and mainly studies methods for quantifying, storing, and communicating information. Information theory also has many applications in machine learning, such as feature extraction, statistical inference, and natural language processing.
In information theory, entropy measures the uncertainty of random events. Suppose we encode a random variable X whose value set is C and whose probability distribution is p(x), x ∈ C. The self-information I(x), i.e., the amount of information or encoding length of the event X = x, is defined as

I(x) = −log p(x)

The average encoding length of the random variable X, namely its entropy, is defined as

H(X) = E[I(x)] = −Σ(x ∈ C) p(x) log p(x)

where, when p(x) = 0, we define 0 log 0 = 0. Entropy is thus the average encoding length of the random variable, i.e., the mathematical expectation of the self-information. The higher the entropy, the more information the random variable carries; the lower the entropy, the less information. If the variable X takes a single value x with p(x) = 1, its entropy is 0; that is, for deterministic information, the entropy is 0 and the amount of information is also 0. If the probability distribution is uniform, the entropy is maximal. For example, if the random variable X has three possible values x1, x2, x3, the entropy ranges from 0 (all probability mass on one value) up to its maximum log 3 at the uniform distribution (1/3, 1/3, 1/3).
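A short sketch computing entropy for a three-valued variable like the one above (base-2 logarithms, i.e., bits, are my choice of unit):

```python
import numpy as np

def entropy(p):
    # H(X) = -sum p(x) log p(x), with 0 log 0 defined as 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # drop zero-probability values (0 log 0 = 0)
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0, 0.0]))    # 0.0: a certain outcome carries no information
print(entropy([1/3, 1/3, 1/3]))    # log2(3) ~ 1.585: uniform maximizes entropy
print(entropy([0.5, 0.25, 0.25]))  # 1.5: in between
```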
Joint entropy and conditional entropy: For two discrete random variables X and Y, suppose the value set of X is 𝒳 and the value set of Y is 𝒴, with joint probability distribution p(x, y). The joint entropy of X and Y is

H(X, Y) = −Σ(x ∈ 𝒳) Σ(y ∈ 𝒴) p(x, y) log p(x, y)

The conditional entropy of X given Y is

H(X | Y) = −Σ(x ∈ 𝒳) Σ(y ∈ 𝒴) p(x, y) log p(x | y) = H(X, Y) − H(Y)
Mutual information: Mutual information measures how much knowing one variable reduces the uncertainty about the other. The mutual information of two discrete random variables X and Y is defined as

I(X; Y) = Σx Σy p(x, y) log [ p(x, y) / (p(x) p(y)) ]
Cross entropy and divergence: For a random variable with distribution p(x), the entropy H(p) represents its optimal encoding length. Cross entropy is the length of encoding information from the true distribution p when the code is chosen optimally for the distribution q, defined as

H(p, q) = E_p[−log q(x)] = −Σx p(x) log q(x)

Given p, the closer q is to p, the smaller the cross entropy; the farther q is from p, the larger the cross entropy. The gap between the two defines the Kullback–Leibler (KL) divergence: D_KL(p ‖ q) = H(p, q) − H(p).
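A brief illustrative computation of cross entropy (the distributions are invented for the example, and bits are my choice of unit):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])
print(cross_entropy(p, p))                # equals H(p) = 1.5 when q = p
print(cross_entropy(p, [0.4, 0.3, 0.3]))  # larger once q drifts from p
print(cross_entropy(p, [0.1, 0.1, 0.8]))  # larger still as q moves farther away
```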