Recurrent Neural Network
The notes on RNNs from the Flower Book are recorded in /p/206090600f 13.

In a feedforward neural network, information is transmitted in one direction only. This restriction makes the network easier to learn, but it also weakens the capability of the model to some extent; in biological neural networks, the connections between neurons are far more complicated. A feedforward neural network can be regarded as a complex function in which every input is independent, that is, the output of the network depends only on the current input. In many practical tasks, however, the output of the network is related not only to the current input but also to outputs over some past period of time. Feedforward networks therefore have difficulty processing time-series data such as video, speech, and text. The length of time-series data is generally not fixed, whereas a feedforward neural network requires input and output dimensions that are fixed and cannot be changed at will. A more powerful model is therefore needed for such time-series problems.

A recurrent neural network (RNN) is a neural network with short-term memory. In a recurrent neural network, neurons can receive information not only from other neurons but also from themselves, forming a network structure with loops. Compared with feedforward neural networks, recurrent neural networks are closer in structure to biological neural networks. Recurrent neural networks have been widely used in speech recognition, language modeling, and natural language generation. Their parameters can be learned with the backpropagation through time (BPTT) algorithm.

To process such time-series data and exploit its historical information, the network needs short-term memory. A feedforward network is a static network and lacks this memory ability.

A simple way to use historical information is to add extra delay units that store the network's history (its inputs, outputs, hidden states, and so on). A typical model of this kind is the time delay neural network.

The time delay neural network adds a delay unit to each non-output layer of a feedforward network to record the neurons' most recent activity values. At time $t$, the activity of the neurons in layer $l$ depends on the activity of the neurons in layer $l-1$ over the most recent $K$ time steps, namely:

$$h_t^{(l)} = f\bigl(h_t^{(l-1)}, h_{t-1}^{(l-1)}, \dots, h_{t-K}^{(l-1)}\bigr)$$

The time delay neural network shares weights along the time dimension to reduce the number of parameters. For sequence input, it is therefore equivalent to a convolutional neural network.

The autoregressive (AR) model is a time-series model commonly used in statistics. It predicts a variable from its own history:

$$y_t = w_0 + \sum_{k=1}^{K} w_k\, y_{t-k} + \epsilon_t$$

where $K$ is a hyperparameter, $w_0, \dots, w_K$ are parameters, and $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ is noise at time $t$ whose variance $\sigma^2$ does not depend on time.
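As a rough illustration of the autoregressive recursion, the sketch below simply rolls an AR(3) model forward in NumPy; the coefficient values, noise scale, and initial history are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative AR(3): y_t = w0 + sum_k w_k * y_{t-k} + eps_t
w0 = 0.1
w = np.array([0.5, 0.2, -0.1])    # w_1..w_K with K = 3 (hand-picked, not fitted)
sigma = 0.05                      # noise standard deviation, independent of time

K = len(w)
T = 20
y = np.zeros(T)
y[:K] = rng.normal(size=K)        # arbitrary initial history

for t in range(K, T):
    history = y[t - K:t][::-1]    # y_{t-1}, y_{t-2}, ..., y_{t-K}
    y[t] = w0 + w @ history + sigma * rng.normal()

print(y)
```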

The nonlinear autoregressive model with exogenous inputs (NARX) extends the autoregressive model: at every time step there is an external input $x_t$ that produces an output $y_t$. NARX keeps the most recent external inputs and outputs in delay units, and the output at time $t$ is:

$$y_t = f\bigl(x_t, x_{t-1}, \dots, x_{t-K_x},\; y_{t-1}, y_{t-2}, \dots, y_{t-K_y}\bigr)$$

where $f(\cdot)$ is a nonlinear function that can be realized by a feedforward network, and $K_x$ and $K_y$ are hyperparameters.

By using neurons with self-feedback, the recurrent neural network can process time series data of any length.

Given an input sequence $x_{1:T} = (x_1, x_2, \dots, x_T)$, the recurrent neural network updates the activity value $h_t$ of the hidden layer, which carries a feedback edge, as follows:

$$h_t = f(h_{t-1}, x_t), \qquad h_0 = 0$$

where $f(\cdot)$ is a nonlinear function, which can itself be a feedforward network.

Mathematically, the formula above can be regarded as a dynamical system. A dynamical system is a mathematical concept: a system whose state evolves over time according to fixed rules. Concretely, a dynamical system uses a function to describe how all points in a given space (for example, the state space of a physical system) change with time. For this reason, the activity value of the hidden layer is also called the state or hidden state in much of the literature. In theory, a recurrent neural network can approximate any nonlinear dynamical system.

The simple recurrent network (SRN) is a very simple recurrent neural network with only one hidden layer.

In a two-layer feedforward neural network, connections exist only between adjacent layers, and the nodes within the hidden layer are not connected to each other. The simple recurrent network adds a feedback connection from the hidden layer to itself.

Suppose that at time $t$ the input of the network is $x_t$ and the hidden-layer state (that is, the activity value of the hidden-layer neurons) is $h_t$. Then $h_t$ depends not only on the input $x_t$ at the current time step but also on the hidden state $h_{t-1}$ at the previous time step:

$$z_t = U h_{t-1} + W x_t + b, \qquad h_t = f(z_t)$$

where $z_t$ is the net input of the hidden layer, $f(\cdot)$ is a nonlinear activation function, usually the logistic function or the hyperbolic tangent function, $U$ is the state-state weight matrix, $W$ is the state-input weight matrix, and $b$ is the bias. The two formulas above are often written directly as:

$$h_t = f(U h_{t-1} + W x_t + b)$$
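A minimal NumPy sketch of this update, assuming tanh as the activation; the dimensions and the random initialization are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 4, 8, 5

# Parameters: state-state matrix U, state-input matrix W, bias b
U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

x_seq = rng.normal(size=(T, input_dim))   # an arbitrary input sequence x_1..x_T
h = np.zeros(hidden_dim)                  # h_0 = 0

states = []
for x_t in x_seq:
    z_t = U @ h + W @ x_t + b             # net input z_t
    h = np.tanh(z_t)                      # h_t = f(z_t)
    states.append(h)

print(np.stack(states).shape)             # (T, hidden_dim)
```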

If we regard the state at each time step as a layer of a feedforward neural network, then the recurrent neural network can be seen as a neural network whose weights are shared along the time dimension. The following figure shows the recurrent neural network unrolled in time.

Recurrent neural networks are very powerful because their short-term memory acts as a storage device. A feedforward neural network can approximate any continuous function, while a recurrent neural network can simulate any program.

Define a fully connected recurrent neural network whose input is $x_t$ and whose output is $y_t$:

$$h_t = f(U h_{t-1} + W x_t + b), \qquad y_t = V h_t$$

where $h$ is the hidden state, $f(\cdot)$ is a nonlinear activation function, and $U$, $W$, $b$, and $V$ are the network parameters.

Such a fully connected recurrent neural network can approximately solve all computable problems.

Recurrent neural networks can be applied to many different types of machine learning tasks. According to the characteristics of these tasks, they can be divided into the following modes: sequence-to-category mode, synchronous sequence-to-sequence mode, and asynchronous sequence-to-sequence mode.

The sequence-to-category mode is mainly used for classifying sequence data: the input is a sequence and the output is a category. For example, in text classification the input is a sequence of words and the output is the category of the text.

Suppose a sample $x_{1:T} = (x_1, \dots, x_T)$ is a sequence of length $T$ and the output is a category $y \in \{1, \dots, C\}$. We can feed the sample into the recurrent neural network time step by time step and obtain the hidden states $h_1, \dots, h_T$ at the different time steps. The last state $h_T$ can be regarded as the final representation (or feature) of the whole sequence and fed into a classifier $g(\cdot)$ for classification:

$$\hat{y} = g(h_T)$$

The classifier $g(\cdot)$ can be a simple linear classifier (such as logistic regression) or a more complex classifier (such as a multilayer feedforward neural network).

Besides using the state at the last time step as the representation of the sequence, we can also average the states over the whole sequence and use this average state as its representation:

$$\hat{y} = g\Bigl(\frac{1}{T}\sum_{t=1}^{T} h_t\Bigr)$$
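The sketch below, under the same assumptions as the earlier SRN snippet (a tanh recurrence with random parameters), contrasts the two sequence representations just described: the last hidden state versus the average of all hidden states. The linear softmax classifier is a hypothetical choice added only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_classes, T = 4, 8, 3, 6

U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
V = rng.normal(scale=0.1, size=(num_classes, hidden_dim))  # linear classifier g

x_seq = rng.normal(size=(T, input_dim))
h = np.zeros(hidden_dim)
states = []
for x_t in x_seq:
    h = np.tanh(U @ h + W @ x_t + b)
    states.append(h)
states = np.stack(states)

y_last = softmax(V @ states[-1])            # classify from h_T
y_mean = softmax(V @ states.mean(axis=0))   # classify from the average state
print(y_last, y_mean)
```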

The synchronous sequence-to-sequence mode is mainly used for sequence labeling, where there is an input and an output at every time step and the input and output sequences have the same length. For example, in part-of-speech tagging, each word needs to be tagged with its corresponding part-of-speech tag.

Here the input is a sequence $x_{1:T} = (x_1, \dots, x_T)$ and the output is a sequence $y_{1:T} = (y_1, \dots, y_T)$. The sample is fed into the recurrent neural network time step by time step, producing the hidden states $h_1, \dots, h_T$. The hidden state at each time step summarizes the current and historical information, and the label at that time step is obtained by feeding it into a classifier:

$$\hat{y}_t = g(h_t), \qquad \forall t \in [1, T]$$
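For the synchronous mode, a per-time-step classifier shared across all time steps is enough; this sketch reuses the same tanh-SRN assumptions as above and emits one (hypothetical) tag distribution per input step.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_tags, T = 4, 8, 5, 6

U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
V = rng.normal(scale=0.1, size=(num_tags, hidden_dim))  # classifier g shared over time

x_seq = rng.normal(size=(T, input_dim))
h = np.zeros(hidden_dim)
for t, x_t in enumerate(x_seq):
    h = np.tanh(U @ h + W @ x_t + b)
    y_hat_t = softmax(V @ h)            # one label distribution per time step
    print(t, y_hat_t.argmax())
```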

The asynchronous sequence-to-sequence mode is also called the encoder-decoder model: the input sequence and the output sequence need not correspond one-to-one, nor do they need to have the same length. For example, in machine translation the input is a word sequence in the source language and the output is a word sequence in the target language.

In the asynchronous sequence-to-sequence mode, the input is a sequence $x_{1:T} = (x_1, \dots, x_T)$ of length $T$ and the output is a sequence $y_{1:M} = (y_1, \dots, y_M)$ of length $M$. It is usually realized by encoding followed by decoding. First, the sample is fed, time step by time step, into a recurrent neural network (the encoder) to obtain its encoding. This encoding is then used in another recurrent neural network (the decoder) to produce the output sequence. To establish dependence between the outputs, a nonlinear autoregressive model is usually used in the decoder:

$$h_t = f_1(h_{t-1}, x_t), \qquad \forall t \in [1, T]$$

$$h_{T+t} = f_2(h_{T+t-1}, \hat{y}_{t-1}), \qquad \forall t \in [1, M]$$

$$\hat{y}_t = g(h_{T+t}), \qquad \forall t \in [1, M]$$

where $f_1(\cdot)$ and $f_2(\cdot)$ are the recurrent neural networks used as the encoder and the decoder respectively, $g(\cdot)$ is a classifier, and $\hat{y}_{t-1}$ denotes the vector representation of the prediction at the previous step.
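A minimal encoder-decoder sketch under the same SRN assumptions: one recurrent network encodes the input sequence into its final hidden state, and a second one decodes from it while feeding back its own greedy predictions. The sizes, the fixed output length, and the one-hot feedback of previous predictions are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
in_dim, hid, vocab, T, M = 4, 8, 10, 5, 3

# Encoder f1 and decoder f2 parameters (random, for illustration only)
U1, W1 = rng.normal(scale=0.1, size=(hid, hid)), rng.normal(scale=0.1, size=(hid, in_dim))
U2, W2 = rng.normal(scale=0.1, size=(hid, hid)), rng.normal(scale=0.1, size=(hid, vocab))
V = rng.normal(scale=0.1, size=(vocab, hid))      # classifier g

x_seq = rng.normal(size=(T, in_dim))

# Encoding: h_t = f1(h_{t-1}, x_t)
h = np.zeros(hid)
for x_t in x_seq:
    h = np.tanh(U1 @ h + W1 @ x_t)

# Decoding: h_{T+t} = f2(h_{T+t-1}, y_hat_{t-1}), y_hat_t = g(h_{T+t})
prev = np.zeros(vocab)                            # stands in for a "start" symbol
outputs = []
for _ in range(M):                                # fixed output length M (assumption)
    h = np.tanh(U2 @ h + W2 @ prev)
    probs = softmax(V @ h)
    token = probs.argmax()                        # greedy choice
    outputs.append(token)
    prev = np.eye(vocab)[token]                   # one-hot of the previous prediction

print(outputs)
```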

The parameters of a recurrent neural network can be learned by gradient descent. Given a training sample $(x, y)$, where $x_{1:T} = (x_1, \dots, x_T)$ is an input sequence of length $T$ and $y_{1:T} = (y_1, \dots, y_T)$ is a label sequence of length $T$, there is supervision information $y_t$ at every time step $t$, and we define the loss at time $t$ as:

$$\mathcal{L}_t = \mathcal{L}\bigl(y_t, g(h_t)\bigr)$$

where $g(h_t)$ is the output at time $t$ and $\mathcal{L}$ is a differentiable loss function, such as the cross-entropy. The loss function of the whole sequence is then:

$$\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t$$

The gradient of the whole-sequence loss with respect to the parameter $U$ is:

$$\frac{\partial \mathcal{L}}{\partial U} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial U}$$

that is, the sum over all time steps of the partial derivatives of the per-step losses with respect to the parameters.

Because a recurrent neural network contains a function that is called recursively over time, the way its parameter gradients are computed differs from that of a feedforward network. There are two main methods for computing gradients in recurrent neural networks: backpropagation through time (BPTT) and real-time recurrent learning (RTRL).

The main idea of the backpropagation through time (BPTT) algorithm is to compute the gradient by error backpropagation, similarly to the feedforward neural network.

The BPTT algorithm regards the recurrent neural network as an unrolled multilayer feedforward network in which "each layer" corresponds to "each time step" of the recurrent network. In this "unrolled" feedforward network, all layers share the same parameters, so the true gradient of a parameter is the sum of the parameter gradients of all the "unrolled layers".

Because the parameter $U$ appears in the net input $z_k = U h_{k-1} + W x_k + b$ of the hidden layer at every time step $k$ ($1 \le k \le t$), the gradient of the loss at time $t$ with respect to an entry $u_{ij}$ is:

$$\frac{\partial \mathcal{L}_t}{\partial u_{ij}} = \sum_{k=1}^{t} \Bigl(\frac{\partial^{+} z_k}{\partial u_{ij}}\Bigr)^{\!\top} \frac{\partial \mathcal{L}_t}{\partial z_k}$$

where $\frac{\partial^{+} z_k}{\partial u_{ij}}$ denotes the "direct" partial derivative, that is, the derivative of $z_k = U h_{k-1} + W x_k + b$ with respect to $u_{ij}$ while keeping $h_{k-1}$ fixed:

$$\frac{\partial^{+} z_k}{\partial u_{ij}} = \bigl[0, \dots, [h_{k-1}]_j, \dots, 0\bigr]^{\top} \triangleq \mathbb{I}_i\bigl([h_{k-1}]_j\bigr)$$

where $[h_{k-1}]_j$ is the $j$-th dimension of the hidden state at time $k-1$, and $\mathbb{I}_i(x)$ denotes a vector whose $i$-th entry is $x$ and whose remaining entries are all 0.

Define the error term $\delta_{t,k} = \frac{\partial \mathcal{L}_t}{\partial z_k}$ as the derivative of the loss at time $t$ with respect to the net input $z_k$ of the hidden layer at time $k$. Then:

$$\delta_{t,k} = \frac{\partial h_k}{\partial z_k}\,\frac{\partial z_{k+1}}{\partial h_k}\,\frac{\partial \mathcal{L}_t}{\partial z_{k+1}} = \operatorname{diag}\bigl(f'(z_k)\bigr)\, U^{\top} \delta_{t,k+1}$$

Therefore:

$$\frac{\partial \mathcal{L}_t}{\partial u_{ij}} = \sum_{k=1}^{t} [\delta_{t,k}]_i\, [h_{k-1}]_j$$

Written in matrix form:

$$\frac{\partial \mathcal{L}_t}{\partial U} = \sum_{k=1}^{t} \delta_{t,k}\, h_{k-1}^{\top}$$

The gradient of the loss of the whole sequence with respect to the parameter $U$ is therefore:

$$\frac{\partial \mathcal{L}}{\partial U} = \sum_{t=1}^{T} \sum_{k=1}^{t} \delta_{t,k}\, h_{k-1}^{\top}$$

Similarly, the gradients with respect to the weight matrix $W$ and the bias $b$ are:

$$\frac{\partial \mathcal{L}}{\partial W} = \sum_{t=1}^{T} \sum_{k=1}^{t} \delta_{t,k}\, x_k^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b} = \sum_{t=1}^{T} \sum_{k=1}^{t} \delta_{t,k}$$

In the BPTT algorithm, one complete forward pass and one complete backward pass are needed to obtain the parameter gradients and update the parameters, as shown in the figure below.
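As a concrete illustration of the forward-then-backward computation, here is a compact NumPy sketch of BPTT for the tanh SRN used above. The per-step squared-error loss, the output layer $V$, and all shapes and initializations are assumptions made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, hid, out_dim, T = 3, 5, 2, 4

U = rng.normal(scale=0.1, size=(hid, hid))
W = rng.normal(scale=0.1, size=(hid, in_dim))
b = np.zeros(hid)
V = rng.normal(scale=0.1, size=(out_dim, hid))

x = rng.normal(size=(T, in_dim))
y = rng.normal(size=(T, out_dim))        # supervision at every time step

# Forward pass: store all hidden states (h[0] is h_0 = 0)
h = np.zeros((T + 1, hid))
for t in range(T):
    h[t + 1] = np.tanh(U @ h[t] + W @ x[t] + b)

# Backward pass (BPTT), squared-error loss L_t = 0.5 * ||V h_t - y_t||^2
dU, dW, db, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(b), np.zeros_like(V)
carry = np.zeros(hid)                    # gradient flowing back from future time steps
for t in reversed(range(T)):
    out_err = V @ h[t + 1] - y[t]        # dL_t / d(output)
    dV += np.outer(out_err, h[t + 1])
    dh = V.T @ out_err + carry           # total gradient w.r.t. h_t
    delta = dh * (1.0 - h[t + 1] ** 2)   # error term w.r.t. the net input z_t
    dU += np.outer(delta, h[t])
    dW += np.outer(delta, x[t])
    db += delta
    carry = U.T @ delta                  # propagate to h_{t-1}

print(dU.shape, dW.shape, db.shape)
```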

Unlike the backward-propagating BPTT algorithm, real-time recurrent learning (RTRL) computes the gradient by forward propagation.

Suppose that the state $h_{t+1}$ of the recurrent neural network at time $t+1$ is:

$$h_{t+1} = f(z_{t+1}) = f(U h_t + W x_{t+1} + b)$$

Its partial derivative with respect to the parameter $u_{ij}$ is:

$$\frac{\partial h_{t+1}}{\partial u_{ij}} = \operatorname{diag}\bigl(f'(z_{t+1})\bigr)\Bigl(\mathbb{I}_i\bigl([h_t]_j\bigr) + U\,\frac{\partial h_t}{\partial u_{ij}}\Bigr)$$

Starting from time $t = 1$, the RTRL algorithm computes not only the hidden states of the recurrent neural network but also the partial derivatives $\frac{\partial h_t}{\partial u_{ij}}$, forward in time, one step after another.
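A forward-mode sketch of this sensitivity update for a single parameter entry $u_{ij}$, under the same tanh-SRN assumptions: it carries $\partial h_t / \partial u_{ij}$ forward together with the state and checks the result against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, hid, T = 3, 5, 6
i, j = 1, 2                              # track the sensitivity to one entry u_ij

U = rng.normal(scale=0.1, size=(hid, hid))
W = rng.normal(scale=0.1, size=(hid, in_dim))
b = np.zeros(hid)
x = rng.normal(size=(T, in_dim))

def run(Umat):
    h = np.zeros(hid)
    for t in range(T):
        h = np.tanh(Umat @ h + W @ x[t] + b)
    return h

# RTRL: propagate s_t = dh_t/du_ij forward together with the state
h = np.zeros(hid)
s = np.zeros(hid)
for t in range(T):
    z = U @ h + W @ x[t] + b
    direct = np.zeros(hid)
    direct[i] = h[j]                     # "direct" derivative of z w.r.t. u_ij
    h_new = np.tanh(z)
    s = (1.0 - h_new ** 2) * (direct + U @ s)
    h = h_new

# Finite-difference check of dh_T/du_ij
eps = 1e-6
Up = U.copy()
Up[i, j] += eps
print(np.allclose(s, (run(Up) - run(U)) / eps, atol=1e-5))
```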

Comparison of the two learning algorithms:

Both the RTRL algorithm and the BPTT algorithm are based on gradient descent and apply the chain rule in forward mode and reverse mode respectively to compute the gradient. In recurrent neural networks, the output dimension of the network is generally much lower than the input dimension, so the BPTT algorithm requires less computation; however, BPTT must store the intermediate gradients of all time steps, so its space complexity is high. The RTRL algorithm does not need to propagate gradients backward, so it is well suited to tasks that require online learning or involve infinite sequences.

The main problem in training recurrent neural networks is that long-range dependencies between states are hard to model because gradients vanish or explode.

In the BPTT algorithm, expanding the error term gives:

$$\delta_{t,k} = \prod_{\tau=k}^{t-1} \bigl(\operatorname{diag}(f'(z_\tau))\, U^{\top}\bigr)\, \delta_{t,t}$$

If we define $\gamma \cong \bigl\lVert \operatorname{diag}(f'(z_\tau))\, U^{\top} \bigr\rVert$, then:

$$\delta_{t,k} \cong \gamma^{\,t-k}\, \delta_{t,t}$$

If $\gamma > 1$, then as $t - k \to \infty$ we have $\gamma^{\,t-k} \to \infty$, which makes the system unstable; this is the exploding gradient problem. Conversely, if $\gamma < 1$, then as $t - k \to \infty$ we have $\gamma^{\,t-k} \to 0$, and we get a vanishing gradient problem similar to that of deep feedforward neural networks.
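A tiny numeric illustration of the $\gamma^{\,t-k}$ behaviour: with $\gamma$ slightly above 1 the factor blows up over a long interval, and with $\gamma$ slightly below 1 it shrinks toward zero. The two $\gamma$ values are arbitrary.

```python
for gamma in (1.2, 0.9):
    # factor gamma**(t - k) for an interval of t - k = 50 time steps
    print(gamma, gamma ** 50)
# 1.2 -> about 9.1e3 (explodes), 0.9 -> about 5.2e-3 (vanishes)
```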

Although a simple recurrent network can in theory establish dependencies between states separated by long time intervals, in practice it can only learn short-range dependencies because of exploding or vanishing gradients. Thus if the output at time $t$ depends on an input at a much earlier time, a simple recurrent network has difficulty modeling this long-distance dependence when the interval is large; this is the so-called long-term dependency problem.

Generally speaking, the exploding gradient problem of recurrent networks is relatively easy to solve, usually by weight decay or gradient clipping. Weight decay limits the range of the parameters by adding an $\ell_1$ or $\ell_2$ norm regularization term on them, so that $\gamma \le 1$. Gradient clipping is another effective heuristic: when the norm of the gradient exceeds a certain threshold, it is truncated to a smaller value.
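A minimal sketch of gradient clipping by norm, the heuristic just described: if the gradient's norm exceeds a threshold, rescale it so that its norm equals the threshold. The threshold value is an arbitrary choice.

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    """Rescale grad so that its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])                 # norm 5
print(clip_by_norm(g, threshold=1.0))    # -> [0.6, 0.8], norm 1
```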

Vanishing gradients are the main problem of recurrent networks. Besides using optimization tricks, a more effective approach is to change the model, for example by letting $U = I$ and at the same time requiring $\partial h_t / \partial h_{t-1} = I$ to be the identity matrix, that is:

$$h_t = h_{t-1} + g(x_t; \theta)$$

where $g(\cdot)$ is a nonlinear function and $\theta$ are the parameters.

In the formula above, $h_t$ depends linearly on $h_{t-1}$ with weight coefficient 1, so there is no exploding or vanishing gradient. However, this change also removes the nonlinear activation on the neurons' feedback, which reduces the representational power of the model.

To avoid this shortcoming, a more effective improvement can be adopted:

$$h_t = h_{t-1} + g(x_t, h_{t-1}; \theta)$$

In this way, $h_t$ depends on $h_{t-1}$ both linearly and nonlinearly, which can alleviate the vanishing gradient problem. However, two problems remain with this improvement: the exploding gradient problem can still occur, and there is a memory capacity problem, since $h_t$ keeps accumulating new information and will eventually saturate.

To solve these two problems, the model can be further improved by introducing a gating mechanism.

To improve the ability of recurrent neural networks to capture long-range dependencies, a very good solution is to introduce a gating mechanism that controls the speed of information accumulation, including selectively adding new information and selectively forgetting previously accumulated information. Such networks are called gated RNNs. This section mainly introduces two kinds of gated recurrent neural networks: the long short-term memory (LSTM) network and the gated recurrent unit (GRU) network.

The long short-term memory (LSTM) network is a variant of the recurrent neural network that can effectively solve the exploding and vanishing gradient problems of the simple recurrent neural network.

Building on the simple recurrent network, the LSTM network improves it in the following two aspects: it introduces a new internal state $c_t$ dedicated to linear recurrent information transfer, and it introduces a gating mechanism to control the paths of information flow. The internal state and the external state are updated by:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

$$h_t = o_t \odot \tanh(c_t)$$

where $f_t$, $i_t$, and $o_t$ are three gates that control the paths of information transfer; $\odot$ is the element-wise product of vectors; $c_{t-1}$ is the memory cell at the previous time step; and $\tilde{c}_t$ is a candidate state obtained through a nonlinear function:

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$

At every time step $t$, the internal state $c_t$ of the LSTM network records the historical information up to the current time step.

In digital circuits, a gate is a binary variable in {0, 1}: 0 represents the closed state, where no information is allowed to pass, and 1 represents the open state, where all information is allowed to pass. The "gates" in the LSTM network are "soft" gates whose values lie in (0, 1), meaning that information passes through in a certain proportion. The roles of the three gates in the LSTM network are:

(1) The forget gate $f_t$ controls how much information from the internal state $c_{t-1}$ at the previous time step needs to be forgotten.

(2) The input gate $i_t$ controls how much information from the candidate state $\tilde{c}_t$ at the current time step needs to be saved.

(3) The output gate $o_t$ controls how much information from the internal state $c_t$ at the current time step needs to be output to the external state $h_t$.
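A single-step LSTM cell sketch in NumPy that follows the gate roles described above; the logistic and tanh choices match the standard formulation, while the parameter shapes and the random initialization are only illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
in_dim, hid = 4, 6

# One (W, U, b) triple per gate and for the candidate state
params = {name: (rng.normal(scale=0.1, size=(hid, in_dim)),
                 rng.normal(scale=0.1, size=(hid, hid)),
                 np.zeros(hid))
          for name in ("f", "i", "o", "c")}

def lstm_step(x_t, h_prev, c_prev):
    gate = lambda n, act: act(params[n][0] @ x_t + params[n][1] @ h_prev + params[n][2])
    f_t = gate("f", sigmoid)              # forget gate: how much of c_{t-1} to keep
    i_t = gate("i", sigmoid)              # input gate: how much of the candidate to add
    o_t = gate("o", sigmoid)              # output gate: how much of c_t to expose
    c_tilde = gate("c", np.tanh)          # candidate state
    c_t = f_t * c_prev + i_t * c_tilde    # internal state update
    h_t = o_t * np.tanh(c_t)              # external (hidden) state
    return h_t, c_t

h, c = np.zeros(hid), np.zeros(hid)
for x_t in rng.normal(size=(3, in_dim)):  # a short arbitrary input sequence
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```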