Why can neural networks fit functions with arbitrary complexity with arbitrary accuracy?
Before we begin, let's look at Wikipedia's description of the universal approximation theorem:

The universal approximation theorem (Cybenko, 1989; Hornik et al., 1989) shows that a feedforward neural network with a single hidden layer and a finite number of neurons can approximate functions of arbitrary complexity to arbitrary accuracy. This theorem has been rigorously proved. Below, we use a simple story to explain why neural networks can (in theory) fit any function.

Anyone who has seen the movie Despicable Me knows that Minions love bananas. Our Minion, however, currently has only 12 apples, and it has no interest in apples. It thought it would be nice if someone would trade bananas for its apples. By chance, it stumbled upon a magical hut.

The Minion put one apple on the windowsill of the magic hut, and the hut gave it nothing back. It tried putting five apples on the windowsill, and the hut suddenly spat out 16 bananas! The Minion was overjoyed. Then it tried six apples, and the hut spat out 20 bananas.

Now the Minion has used up all 12 apples. Holding its newly earned bananas, it wondered: if I had given the hut three apples, how many bananas would it have spat out?

This is a primary-school pattern-finding question. How do we solve it?

You may blurt out: 8 bananas! Well, that means your IQ is up to a subject as profound as AI.

How do we solve this primary-school problem with the steps of machine learning? (You may feel this is using a sledgehammer to crack a nut.)

We use the variable x to represent the number of apples thrown into the magic hut (the input) and the variable y to represent the number of bananas the hut spits out (the output). Then we get a data set: (x, y) = (1, 0), (5, 16), (6, 20).

Our goal is to build a mathematical model that captures the implicit law of the data set: feed the model an input value x, and it outputs the corresponding value y.

Primary-school pupils have learned the unary linear function (y = wx + b). Since this is a primary-school problem, a relatively simple function should be enough to simulate the law of the data set, so we define a linear function model: y = wx + b.
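As a minimal sketch, assuming the data set (1, 0), (5, 16), (6, 20) from the story, the linear model looks like this in Python:

```python
# Data set from the story: apples thrown in (x) -> bananas returned (y).
data = [(1, 0), (5, 16), (6, 20)]

def linear_model(x, w, b):
    """Unary linear function y = w*x + b."""
    return w * x + b

# The blurted-out answer w = 4, b = -4 gives 8 bananas for 3 apples
# and reproduces the whole data set exactly.
print(linear_model(3, 4, -4))                        # 8
print([linear_model(x, 4, -4) for x, _ in data])     # [0, 16, 20]
```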

Then the question arises: how do we determine the two parameters w and b of the function?

If you are quick, you may blurt it out again: y = 4x + (−4)! Well, you have once again proved that your IQ surpasses that of a primary-school (or even junior-high) student.

But the Minion is not as clever as you; it can only guess. If w = 1 and b = 0, what happens?

Obviously, the model's predicted values differ quite a lot from the real values in the data set. The Minion was not satisfied, so it made another wild guess: w = 2, b = 2. What happens now?

This time the gap between the model's outputs and the real values in the data set doesn't seem so large. The Minion wondered: which of the two candidate models is better (that is, better simulates the law of the data set)? And how do we quantify "better"?

Then, we introduce the concept of loss function.

We use the sum of squared differences between the predicted values wx + b and the real values y as the quantification of "better": L(w, b) = Σ(wxᵢ + b − yᵢ)². The smaller the loss function, the smaller the gap between predictions and reality, which means the parameters w and b better simulate the law in the data set.
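The loss function can be checked numerically. A sketch, again assuming the story's data set (1, 0), (5, 16), (6, 20):

```python
data = [(1, 0), (5, 16), (6, 20)]

def loss(w, b):
    """Sum of squared differences between predictions w*x + b and true y."""
    return sum((w * x + b - y) ** 2 for x, y in data)

print(loss(1, 0))  # 318
print(loss(2, 2))  # 68
```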

With the loss function in hand, let's compute the loss values of the two candidate models above.

The loss of model y = 2x + 2 is L(2, 2) = 68, which is less than L(1, 0) = 318, so the candidate model y = 2x + 2 wins.

But the Minion pursues perfection. Although the loss value 68 is less than 318, it is still very large. Are there other parameters w and b that make the loss L(w, b) smaller than 68?

Therefore, we introduce the concept of optimizer.

We try to find the parameters w, b that minimize the loss L(w, b). Since the Minion has never learned gradient descent (an optimization algorithm for convex functions; don't worry if you haven't either), it can only use a "random trial-and-error" method.

The Minion starts from the parameters w = 2, b = 2 and makes trial moves with a step size of 1. That is, within the range of "plus one, minus one", it tries the four points around the coordinate point (2, 2): (3, 2), (2, 3), (1, 2), (2, 1). It finds that the loss at (3, 2) is smaller than at the other three points and at the starting point itself.

So the Minion found a better candidate model, y = 3x + 2, whose loss value is 26, far less than 68. Excited, it repeated the procedure and, continuing in the same way, found two more coordinate points: L(3, 1) = 17 and L(3, 0) = 14. However, no point around (3, 0) has a loss value smaller than 14.

Is it over?

With your high IQ, you can surely see that the loss value is smallest at the point (4, −4): L(4, −4) = 0. However, the trial-and-error procedure above never reaches the point (4, −4).

What went wrong? It is a matter of choosing the initial point.

The Minion found that if it starts from the coordinate point (−2, −4), it eventually reaches the point (4, −4) that minimizes the loss. Studying this in depth leads to the problem of optimal search, which is beyond the scope of this article.
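The Minion's trial-and-error can be sketched as a greedy neighbour search (a deterministic simplification of the story's random attempts: always move to the best of the four neighbours, stop when none improves):

```python
data = [(1, 0), (5, 16), (6, 20)]

def loss(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data)

def greedy_search(w, b):
    """Repeatedly step to the best of the four neighbours
    (w +/- 1, b) and (w, b +/- 1); stop when none improves the loss."""
    while True:
        neighbours = [(w + 1, b), (w - 1, b), (w, b + 1), (w, b - 1)]
        best = min(neighbours, key=lambda p: loss(*p))
        if loss(*best) >= loss(w, b):
            return w, b
        w, b = best

print(greedy_search(2, 2))    # (3, 0): stuck at loss 14
print(greedy_search(-2, -4))  # (4, -4): reaches loss 0
```

Starting from (2, 2) the search gets stuck at (3, 0) with loss 14, exactly as in the story; starting from (−2, −4) it reaches the true minimum (4, −4).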

For now, we only need to know that the model parameters w and b that minimize the loss function can be found by optimization methods (such as the least-squares method).
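For this tiny data set, the least-squares solution can even be written in closed form. A sketch using the standard normal-equation formulas for a one-variable fit:

```python
data = [(1, 0), (5, 16), (6, 20)]

n   = len(data)
sx  = sum(x for x, _ in data)
sy  = sum(y for _, y in data)
sxy = sum(x * y for x, y in data)
sxx = sum(x * x for x, _ in data)

# Closed-form least-squares solution for y = w*x + b.
w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - w * sx) / n
print(w, b)  # 4.0 -4.0
```

Since y = 4x − 4 fits all three points exactly, least squares recovers it with zero loss.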

The above story is linear regression.

To explain what linear regression is, we need a slightly stricter definition. Here is a sentence from Machine Learning (by Zhou Zhihua):

Matching this sentence against our story: the model function y = 4x − 4 is the "linear model" that was learned. In our story, however, the model does not merely predict the true outputs as accurately as possible; it predicts them exactly, so the loss function reaches its minimum value of 0.

Of course, reality is not that simple... let's extend the story a little.

One day, the Minion discovered that if it gave the magic hut 1 apple, 2 bananas and 3 pears, the hut would spit out a cat. Truly amazing...

In this case the model function is no longer a simple unary function but a function of three variables, with three input variables (x1, x2, x3) and four parameters to optimize (w1, w2, w3, b): y = w1·x1 + w2·x2 + w3·x3 + b. We call this situation "multiple linear regression". This is in fact the prototype of an image-recognition model, but we won't pursue it further here.
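A sketch of the multiple-linear-regression form. The weights below are made up purely for illustration, since the story never reveals the hut's three-input rule:

```python
def multi_linear(xs, ws, b):
    """Multiple linear regression: y = w1*x1 + w2*x2 + w3*x3 + b."""
    return sum(w * x for w, x in zip(ws, xs)) + b

# Hypothetical inputs (1 apple, 2 bananas, 3 pears) and hypothetical weights.
print(multi_linear([1, 2, 3], [0.5, 0.5, 0.5], 0))  # 3.0
```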

Having discovered the banana-exchange law of the magic hut, the Minion was very, very happy. It gathered a lot of apples and set off to trade them for bananas. But... such is life: just when you are at your proudest, a bucket of cold water is often poured over you.

(Note: for convenience of drawing, the earlier data set has been adjusted from x = 1, 5, 6 to x = 1, 2, 3.)

The Minion tried giving the magic hut four apples and then five apples, and received nine bananas and 10 bananas respectively. Something seems wrong! Following the previously discovered rule y = 4x − 4, it should have received 12 and 16 bananas. The Minion was mystified.

At this moment, the magic hut spat out a note that said: if you throw in too many apples, I will give you fewer bananas. The Minion was a little depressed.

If you model the new data with the earlier unary linear function, you get the following model.

You are probably much smarter than the Minion and can see at a glance that the model function above is inappropriate: its loss function can never reach the minimum value of 0.

If the model function for the new points were instead y = x + 5, the corresponding loss would reach the minimum value of 0. But now there seem to be two model functions, and as the saying goes, one mountain cannot hold two tigers. Can we merge the two functions into one?

Then you blurt out again: a piecewise function! Facts prove that your IQ has reached high-school level.

When x < 3, s1 equals 1, s2 equals 0, and the function reduces to y = 4x − 4;

When x >= 3, s1 equals 0, s2 equals 1, and the function reduces to y = x + 5.

So y = (4x − 4)·s1 + (x + 5)·s2 is a perfect combined function.
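The piecewise combination can be sketched with hard 0/1 switches (assuming the two branches y = 4x − 4 for x < 3 and y = x + 5 for x >= 3):

```python
def piecewise(x):
    """Combined model: s1, s2 are hard 0/1 switches around x = 3."""
    s1 = 1 if x < 3 else 0
    s2 = 1 - s1
    return (4 * x - 4) * s1 + (x + 5) * s2

print(piecewise(2))  # 4
print(piecewise(4))  # 9
print(piecewise(3))  # 8 (both branches agree at the boundary)
```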

Then a new problem arises: what exactly are s1 and s2? How do we determine them?

If s2 is regarded as a function of x, then ideally it should be a step function: 0 when x < 3, jumping to 1 at x = 3.

However, the step function is discontinuous and not smooth.

At this point, the Minion said casually: I seem to have seen a continuous function that looks a bit like this, called the Sigmoid function, σ(t) = 1/(1 + e^(−t)).

Seeing this Sigmoid curve, you get angry and say to the Minion: if you're stupid, talk less. This function resembles the step function? The difference is far too big! Just look at it: it doesn't look like one at all.

The Minion: just give the variable t a coefficient and change it to σ(1000t). (picks its nose)

Unless you look very carefully, you can hardly even see the extremely steep curve rising from 0 to 1 on the vertical axis. Everyone was speechless and began to regard the Minion with new respect.

When x = 0.1, s = σ(1000 × 0.1) = σ(100) ≈ 1;

When x = −0.1, s = σ(1000 × (−0.1)) = σ(−100) ≈ 0.

By shifting and scaling this Sigmoid function, we can obtain all the step-like functions we need.
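A quick numerical check that σ(1000t) behaves like a step function, using the standard sigmoid:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# Scaling the argument by 1000 makes the sigmoid almost a step at t = 0.
print(sigmoid(1000 * 0.1))   # ~1.0
print(sigmoid(1000 * -0.1))  # ~0.0
```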

So we get a new model function: y = (4x − 4)·σ(−1000x + 3000) + (x + 5)·σ(1000x − 3000).

For example, when x = 4: y = 12·σ(−1000) + 9·σ(1000) ≈ 12 × 0 + 9 × 1 = 9.
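The combined smooth model can be evaluated directly. One practical caveat (my addition, not in the story): for arguments like ±1000 a naive sigmoid overflows `math.exp`, so the sketch below uses a numerically stable form:

```python
import math

def sigmoid(t):
    # Numerically stable sigmoid: avoids exp overflow for large |t|.
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    e = math.exp(t)
    return e / (1 + e)

def smooth_model(x):
    """y = (4x - 4)*sigma(-1000x + 3000) + (x + 5)*sigma(1000x - 3000)."""
    return ((4 * x - 4) * sigmoid(-1000 * x + 3000)
            + (x + 5) * sigmoid(1000 * x - 3000))

print(round(smooth_model(4)))  # 9
print(round(smooth_model(2)))  # 4
```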

In this process, the Minion deserves the credit for putting forward the concept of an activation function.

Let's look at a slightly stricter definition of logistic regression.

This sentence is enough. In the first section we learned the linear regression model y = wx + b. Observing the graph, we can see that logistic regression simply wraps the linear model in the activation function: y = σ(wx + b). The predicted value of the linear regression model y = wx + b can be any real number in (−∞, +∞), while the predicted value of the logistic regression model y = σ(wx + b) can only be a real number in (0, 1). If you understand this relationship between linear regression and logistic regression, you have grasped the essence of both.
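A sketch of the contrast between the two models' output ranges, reusing the parameters w = 4, b = −4 from the story:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def linear(x, w=4, b=-4):
    """Linear regression: output can be any real number."""
    return w * x + b

def logistic(x, w=4, b=-4):
    """Logistic regression: the same linear output squashed into (0, 1)."""
    return sigmoid(w * x + b)

print(linear(10), linear(-10))      # 36 -44 (unbounded)
print(logistic(10), logistic(-10))  # ~1.0 ~0.0 (bounded in (0, 1))
```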

The Minion reasoned that although the hut now gives fewer bananas, at least it still spits out more bananas than the apples thrown in. So it tried throwing seven and then nine apples into the magic hut.

As a result, the magic hut returned only 10 bananas both times. The Minion was dumbfounded.

Although the Minion is slow in other matters, it becomes much sharper whenever bananas are involved. Five apples used to buy 10 bananas, and now nine apples buy only 10 bananas! Clearly it has been short-changed. But since it doesn't much like apples anyway, it swallowed its resentment and tried to work out the rule behind all this.

After the previous routine, you are resourceful enough to think of a solution.

Yes, exactly that: divide the data set into three segments, construct a linear model function for each, and then combine them with activation functions.

The problem has arisen again.

When x < 3, s1 = σ(−1000x + 3000) = 1, otherwise it is 0;

When x >= 5, s3 = σ(1000x − 5000) = 1, otherwise it is 0;

When 3 <= x < 5, s2 should equal 1, otherwise 0. But unlike s1 and s3, this middle switch cannot be written as a single sigmoid of x.

You may have noticed that s1 and s3 both take x as their variable. Changing perspective, we can regard s2 as a binary function of s1 and s3; that is, whether s2 equals 1 or 0 is determined by the values of s1 and s3:

s2 = σ(−1000·s1 − 1000·s3 + 500)
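The three-segment model can be checked numerically. A sketch (the numerically stable sigmoid avoids overflow at arguments like ±4000):

```python
import math

def sigmoid(t):
    # Numerically stable sigmoid: avoids exp overflow for large |t|.
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    e = math.exp(t)
    return e / (1 + e)

def three_segment(x):
    """Three-piece model from the story. The switch s2 is computed from
    s1 and s3, not directly from x: a second 'layer' of the network."""
    s1 = sigmoid(-1000 * x + 3000)               # 1 when x < 3
    s3 = sigmoid(1000 * x - 5000)                # 1 when x >= 5
    s2 = sigmoid(-1000 * s1 - 1000 * s3 + 500)   # 1 only when s1 = s3 = 0
    return (4 * x - 4) * s1 + (x + 5) * s2 + 10 * s3

print(round(three_segment(1)))  # 0
print(round(three_segment(4)))  # 9
print(round(three_segment(7)))  # 10
```

Feeding s1 and s3 into another sigmoid is exactly the step from a single-layer model toward a neural network with a hidden layer.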

Although the number of bananas would not increase any further, the Minion was a little pleased that such a complicated problem could be solved (modelling the data set with linear regression plus logistic regression). It still had some apples in hand, so it tried throwing 10, 11 and 12 apples into the magic hut. As a result... the Minion collapsed!!

A note from the magic hut said: don't be insatiable; know when to stop and be content. The Minion broke down. Now only one problem remains unsolved: how to model this new data set.

No matter how clever you are, it seems you can directly write down only two of the switch functions.

Let s1 = σ(−1000x + 3000); that is, s1 = 1 when x < 3, otherwise 0.

Let s4 = σ(1000x − 9000); that is, s4 = 1 when x >= 9, otherwise 0.

So how do we determine s2 and s3?

From the previous experience, you can roughly see that s2 and s3 should be determined by the values of s1 and s4.

Follow-up:

Now suppose we have many such data sets.

Recapping the process:

Here comes the key point.

Disclaimer (please be kind): this article draws on the University of Oxford × DeepMind open course on natural language processing.