The above figure shows that the artificial neural network is a hierarchical model, which can be logically divided into three layers:
Input layer: the input layer receives the feature vector X.
Output layer: the output layer generates the final prediction H.
Hidden layer: the hidden layer sits between the input layer and the output layer. It is called "hidden" because the values it produces are not directly observable, unlike the sample matrix X used by the input layer or the label matrix Y used by the output layer.
Here are some symbols to help describe the model:
$a^{(j)}_{i}$ stands for the $i$-th activation unit in layer $j$. $\theta^{(j)}$ represents the weight matrix mapping from layer $j$ to layer $j+1$; for example, $\theta^{(1)}$ is the weight matrix mapping the first layer to the second layer. Its size is: the number of activation units in layer $j+1$ rows by the number of activation units in layer $j$ plus 1 columns. For the neural network shown above, $\theta^{(1)}$ has size $3 \times 4$.
For the model shown in the figure above, the activation units and the output are expressed as:
$a^{(2)}_{1} = g(\theta^{(1)}_{10}x_0 + \theta^{(1)}_{11}x_1 + \theta^{(1)}_{12}x_2 + \theta^{(1)}_{13}x_3)$

$a^{(2)}_{2} = g(\theta^{(1)}_{20}x_0 + \theta^{(1)}_{21}x_1 + \theta^{(1)}_{22}x_2 + \theta^{(1)}_{23}x_3)$

$a^{(2)}_{3} = g(\theta^{(1)}_{30}x_0 + \theta^{(1)}_{31}x_1 + \theta^{(1)}_{32}x_2 + \theta^{(1)}_{33}x_3)$

$h_{\theta}(x) = g(\theta^{(2)}_{10}a^{(2)}_0 + \theta^{(2)}_{11}a^{(2)}_1 + \theta^{(2)}_{12}a^{(2)}_2 + \theta^{(2)}_{13}a^{(2)}_3)$
Let's take the above neural network as an example and calculate the values of the second layer by vectorization:
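A minimal NumPy sketch of this vectorized forward pass (hypothetical weight values; `Theta1` is $3 \times 4$ and `Theta2` is $1 \times 4$, matching the network above):

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights matching the 3-unit hidden layer above
Theta1 = np.random.randn(3, 4) * 0.1   # maps layer 1 (plus bias) to layer 2
Theta2 = np.random.randn(1, 4) * 0.1   # maps layer 2 (plus bias) to the output

x = np.array([2.0, -1.0, 0.5])         # one training example with 3 features

a1 = np.insert(x, 0, 1.0)              # add bias unit x_0 = 1
z2 = Theta1 @ a1                       # z^(2) = Theta^(1) a^(1)
a2 = sigmoid(z2)                       # a^(2) = g(z^(2))
a2 = np.insert(a2, 0, 1.0)             # add bias unit a_0^(2) = 1
h  = sigmoid(Theta2 @ a2)              # h_theta(x) = g(Theta^(2) a^(2))
```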
For multi-class classification problems:
Neural network classification problems fall into two cases: binary (two-class) classification and multi-class classification.
Binary classification: $S_L = 1$, and $y = 0$ or $1$ indicates which class.

Multi-class classification: $S_L = k$, and $y_i = 1$ indicates membership of class $i$ ($k > 2$).
In a neural network we can have many output variables: $h_{\theta}(x)$ is a vector of dimension $K$, and the dependent variable in our training set is a vector of the same dimension, so the cost function is more complicated than the one for logistic regression. Here $h_{\theta}(x) \in \mathbb{R}^{K}$ and $(h_{\theta}(x))_i$ denotes the $i$-th output.
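A reconstruction of this cost function in its standard form, using the notation above (with $m$ training samples, $L$ layers, $s_l$ units in layer $l$, and regularization parameter $\lambda$):

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)} \log\left((h_{\theta}(x^{(i)}))_k\right) + \left(1 - y_k^{(i)}\right) \log\left(1 - (h_{\theta}(x^{(i)}))_k\right) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\theta_{ji}^{(l)}\right)^2$$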
As with logistic regression, the cost function lets us measure the error between the predictions and the ground truth. The only difference is that for each row of features we now produce $K$ predictions. Concretely, we can use a loop to compute the $K$ outputs for each row, then select the most likely of the $K$ predictions and compare it with the actual value in $y$.
The regularization term simply excludes each layer's $\theta_0$ and then sums over each layer's $\theta$ matrix. The innermost loop over $j$ runs over all the rows (determined by the number of activation units in layer $l+1$, i.e. $s_{l+1}$), while the loop over $i$ runs over all the columns (determined by the number of activation units in layer $l$, i.e. $s_l$). In other words: the distance between $h_{\theta}(x)$ and the true value is summed over every sample and every class output, and the regularization term handles the sum of squares of all the remaining parameters.
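A minimal sketch of that computation in Python (hypothetical names; `h` and `Y` are assumed to be $m \times K$ arrays of predictions and one-hot labels, and `thetas` a list of weight matrices):

```python
import numpy as np

def nn_cost(h, Y, thetas, lam):
    """Regularized cross-entropy cost for a K-output network (sketch)."""
    m = Y.shape[0]
    # error term: summed over every sample and every class output
    error = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # regularization term: squares of all weights except each layer's theta_0 column
    reg = lam / (2 * m) * sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)
    return error + reg
```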
Because a neural network allows multiple hidden layers, i.e. the neurons in every layer produce intermediate outputs, $J(\theta)$ cannot be minimized directly with the gradient descent procedure used for traditional regression problems; the prediction error has to be considered and propagated layer by layer. Therefore, in a multi-layer neural network, the back propagation algorithm is used to compute the gradients. First, the prediction error of each layer $l$ is defined as a vector $\delta^{(l)}$.
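The standard definitions of these error vectors for the sigmoid activation $g$ (with $\odot$ denoting element-wise multiplication and $L$ the output layer) are:

$$\delta^{(L)} = a^{(L)} - y$$

$$\delta^{(l)} = \left(\theta^{(l)}\right)^{T}\delta^{(l+1)} \odot g'\left(z^{(l)}\right), \qquad g'\left(z^{(l)}\right) = a^{(l)} \odot \left(1 - a^{(l)}\right)$$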
Training process:
When we use the gradient descent algorithm on a more complex model (such as a neural network), there may be subtle bugs that are hard to notice, which means that although the cost appears to decrease, the final result is not necessarily the optimal solution.
To avoid this problem, we use a technique called gradient checking (numerical gradient checking). The idea is to verify that the derivatives we compute are really the ones we need by estimating the gradient numerically.
The gradient is estimated by choosing two points on the cost function very close to $\theta$ and computing the slope of the line through them. That is, for a specific $\theta$, we compute the cost at $\theta - \epsilon$ and at $\theta + \epsilon$ ($\epsilon$ is a very small value, usually 0.001), and use the difference of the two costs to estimate the derivative at $\theta$.
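Written out, this two-sided estimate is:

$$\frac{d}{d\theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$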
When $\theta$ is a vector, we need to check the partial derivatives. Because each check only varies one parameter of the cost function, the following example checks only $\theta_1$:
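$$\frac{\partial}{\partial \theta_1} J(\theta) \approx \frac{J(\theta_1 + \epsilon, \theta_2, \dots, \theta_n) - J(\theta_1 - \epsilon, \theta_2, \dots, \theta_n)}{2\epsilon}$$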
If the above approximation holds, it confirms that the back propagation implementation in the network is correct. At that point gradient checking is turned off (because the numerical approximation of the gradient is very slow) and the training of the network continues.
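A minimal sketch of such a numerical check (hypothetical names; `cost_fn` maps a flat parameter vector to $J(\theta)$, and `analytic_grad` is the gradient produced by back propagation):

```python
import numpy as np

def gradient_check(cost_fn, theta, analytic_grad, eps=1e-3):
    """Compare back-propagation gradients with two-sided numerical estimates (sketch)."""
    numeric_grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        numeric_grad[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * eps)
    # relative difference; a very small value suggests back propagation is correct
    return np.linalg.norm(numeric_grad - analytic_grad) / np.linalg.norm(numeric_grad + analytic_grad)
```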