In a regression problem, the program predicts real-valued output. In linear regression, we have a training set or dataset in which we have to fit the best possible straight line or say the best possible linear relation with the least chance of error.
Let's say that we have a data set of used vehicles that consists of a relation between kilometers driven and price. We have observed the sale of used vehicles for 6 months and came up with this data, which we will term as our training set. Using the training set a linear relation(straight line) has to be generated with error as low as possible. The linear relation( let's name it hypothesis) will now be able to predict prices for other vehicles whenever we give it the input of kilometers driven.
Linear relation can be represented as:
O = a1 + (a2)*x { or y = mx+c equation of a straight line } ……………..(1)
Our dataset includes the sale statement of 50 used vehicles. So let's denote our number of training examples by T,
T = 50
And, x = input (which we will enter), and
O = output (predicted price by the program)
Now as you can see in the first equation above a1
and a2
are the parameters which will determine the linear relation or in simple words the orientation of the line (like vertical, horizontal etc).
The values of the parameters a1 and a2 should be selected or chosen such that the error possibility is least. Here the sum of squared error will be our first source to evaluate the relation or model, after which we have to work on our hypothesis which generate the best possible line with least error chances.
Here the squaring of error is necessary for eliminating the negative value. And after squaring the error we will get more accurate value.
Our hypothesis will be,
H(x) = a1 + (a2)*x
So our squared error cost function or mean squared error would look like this:
About H(x),
a2
this is a function of x,For different values of a2
slope of this function varies (something like this),
About E(a2) {let's assume E
for only one parameter for better understanding}
a2
,Controls the slope of the line (as seen below),
Now for minimizing the squared error cost function E, we have an algorithm called Gradient Descent. It is used all over in machine learning. Now let's look at the problem, we have E(a1, a2) and where we have a1
and a2
such that E(a1, a2) is minimum.
So what we are going to do is:
Start with some random values of a1
and a2
.
Keep changing a1
, a2
to reduce E(a1, a2)
until we reach the minimum.
Repeat until convergence {
}
Now we also have to update the values of our parameters simultaneously.
The correct way is:
a1 := Variable1
a2 := Variable2
here :=
is the assignment operator and =
is considered as truth assertion and ?
is the learning rate.
The incorrect way is:
About parameter a2
,
If a2 > 0, then x(input or feature or predictor) and O(output or target) have a positive relationship, i.e. if x increase then O will also increase.
If a2 < 0, then x(input or feature or predictor) and O(output or target) have a negative relationship, i.e. if x increase then O will decrease.
About parameter a1
,
About learning rate ?
,
If the learning rate is too small, gradient descent can be very slow.
If the learning rate is too large then the gradient descent can exceed or overshoot the minimum point in the function. As a result of which it may fail to converge or even diverge in some cases.
Gradient descent can converge to a local minimum, even with the learning rate ?
is fixed.
As we approach a local minimum, gradient descent will automatically take a smaller step, hence we don't need to decrease ?
overtime. We can see that in the image below,
Putting it all together,
Gradient Descent Algorithm is,
Repeat until convergence {
}
Now lets combine them together, for that simplification we need to do the partial derivative of E(a1, a2) with respect to a1
and a2
,
[ substituting the value of H(x)]
After doing the partial derivative we get,
so here's how our algorithm looks like,
Repeat until convergence {
} and we have to update it simultaneously.