Machine Learning: Cost Function
Machine learning is the ability of computer algorithms to improve continuously through experience. One of the most common machine learning techniques is supervised learning. In supervised learning algorithms, we have two sets of values: the Input Features and the Output Variables. Most supervised learning problems fall into two types:
- Regression: In regression problems, we have a set of input features mapped against continuous output variables. The problem is to predict a real-valued output for an unseen input, as close to the actual value as possible.
- Classification: In classification problems, each input belongs to one of a given set of categories. The problem is to map unseen input values into discrete categories.
In this article, we shall discuss regression problems in machine learning, the definition of the cost function, and the need for minimizing it.
What is Regression? What scenarios require the regression method of problem solving in machine learning?
Regression is a mathematical problem-solving method in which we try to formulate a function that predicts an unknown variable whose value depends on the values of known variables.
Assume that we have the problem of predicting the monetary value of a house based on three factors: Dimensions, Number of bedrooms, and Age of the house. One can say that the value of the house increases if the Dimensions and the Number of bedrooms increase. On the other hand, the value of the house decreases if the Age increases.
In this problem,
- Unknown/Dependent variable(y): Value of the house
- Known/Independent variable/s(x1, x2, x3): Dimension, Number of bedrooms, Age
- Regression function example: f(x) = θ1x1 + θ2x2 + θ3x3
Where x1, x2, and x3 represent Dimension, Number of bedrooms, and Age respectively, and y is the actual value of the house. The coefficients θ1, θ2, and θ3 of the independent variables x1, x2, and x3 are determined by the training samples used to formulate the regression function f(x).
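As a rough illustration, a regression function of this kind can be written as a weighted sum in Python. The coefficient values and the sample house below are made up for demonstration; they do not come from a real training set.

```python
def predict_value(theta, features):
    """Weighted sum f(x) = θ1*x1 + θ2*x2 + θ3*x3."""
    return sum(t * x for t, x in zip(theta, features))

# Hypothetical coefficients for Dimension, Number of bedrooms, and Age.
# The negative coefficient for Age lowers the value of older houses.
theta = [100.0, 5000.0, -300.0]

# Hypothetical house: dimension 1200, 3 bedrooms, 10 years old.
house = [1200.0, 3.0, 10.0]

predicted = predict_value(theta, house)
print(predicted)  # 132000.0
```

Note that a positive coefficient makes the predicted value grow with that variable, while a negative coefficient makes it shrink, which matches the intuition about Dimensions, bedrooms, and Age.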
If we have only one independent variable (x), we call this simple linear regression (linear regression with a single variable). For the sake of simplicity, I will be using it in the rest of the article.
Consider the linear regression problem containing only one independent variable. Let's define the linear regression function by:
f(x) = θ0 + θ1x
This function is a hypothesis function: we hypothesize that, for certain values of θ0 and θ1, f(x) gives predictions very close to the actual value of y for a given x.
Let us consider two graphs. Graph 1 plots the output y against the input values x as dots. Graph 2 plots the hypothesis function f(x) as a straight line. If we merge these two graphs into a single graph, the vertical distance between the predicted value (a point on the hypothesis function) and the actual value (the training sample value) represents the cost of an individual prediction.
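The hypothesis function can be sketched in a few lines of Python. The parameter values used here are hypothetical and chosen only to show how a prediction is computed.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis f(x) = θ0 + θ1 * x."""
    return theta0 + theta1 * x

# Hypothetical parameters, chosen only for demonstration.
theta0, theta1 = 1.0, 2.0

predictions = [hypothesis(theta0, theta1, x) for x in [0.0, 1.0, 2.0]]
print(predictions)  # [1.0, 3.0, 5.0]
```

Here θ0 is the value predicted at x = 0 and θ1 is the amount by which the prediction changes when x increases by 1.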
Let us view this graph:
In the above graph:
- Individual points on the graph represent the training samples (y versus x).
- The line on the graph represents the hypothesis function f(x).
- The vertical distance between a point and the line represents the cost for that individual training sample.
Cost representation: We have seen the representation of cost graphically. Let us try to derive a mathematical equation out of this graphical representation. We define the following terms used in the cost function.
- Ji => Individual cost of the ith training sample.
- f(xi) => Value of the hypothesis function for the ith training sample.
- yi => ith actual value in the training set.
The individual costs can be defined as the difference between the value of the hypothesis function and the actual value: J1 = f(x1) - y1, J2 = f(x2) - y2, and so on.
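To make the individual costs concrete, here is a small Python sketch. The training samples and the parameters θ0 and θ1 are made up for demonstration.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis f(x) = θ0 + θ1 * x."""
    return theta0 + theta1 * x

# Made-up training samples (xi, yi), for demonstration only.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 6.0]

# Hypothetical parameters of the hypothesis function.
theta0, theta1 = 1.0, 2.0

# Individual cost of each sample: f(xi) - yi
individual_costs = [hypothesis(theta0, theta1, x) - y for x, y in zip(xs, ys)]
print(individual_costs)  # [0.0, -1.0, 1.0]
```

A cost of zero means the point lies exactly on the line, a negative cost means the line passes below the point, and a positive cost means the line passes above it.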
The total cost of all the values present in the training set can be represented as: Total cost = Σ[i=1..m] (f(xi) - yi), where m is the number of training samples.
The final goal of linear regression is to find the hypothesis function using training samples, such that the total cost of the hypothesis function is minimal. Let us derive the equation that represents the cost function to be minimized.
- We know that the total cost of the hypothesis function, given a training set, can be defined as: Total cost = Σ[i=1..m] (f(xi) - yi)
- We want the cost to be minimum; in other words, the difference between f(xi) and yi should be as small as possible. With signed differences, positive and negative errors can cancel each other, so a poor fit could still give a total cost close to zero. Squaring each difference makes every term non-negative. Note also that squaring magnifies a difference larger than 1 and shrinks a difference smaller than 1, so large errors are penalized more heavily than small ones: Total cost = J(θ) = Σ[i=1..m] (f(xi) - yi)²
- Let us consider that the training set has m samples in it. The average cost can be represented as: (Σ[i=1..m] (f(xi) - yi)²)/m
Hence the average cost function to be minimized can be represented as: J(θ) = (Σ[i=1..m] (f(xi) - yi)²)/m
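The effect of squaring can be checked with a small Python sketch. The differences below are made-up values of f(xi) - yi for three training samples.

```python
# Made-up individual differences f(xi) - yi for three training samples.
differences = [0.0, -1.0, 1.0]
m = len(differences)

# Signed total: the errors -1 and +1 cancel each other out.
signed_total = sum(differences)

# Squared total: every term is non-negative, so nothing cancels.
squared_total = sum(d ** 2 for d in differences)

# Average squared cost over the m samples.
average_cost = squared_total / m

print(signed_total)   # 0.0
print(squared_total)  # 2.0
print(average_cost)
```

The signed total is zero even though two of the three predictions are wrong, while the squared total correctly reports a non-zero cost.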
If we substitute the hypothesis function f(xi) = θ0 + θ1xi into this expression, we get the cost function as:
J(θ) = (Σ[i=1..m] ((θ0 + θ1xi) - yi)²)/m
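As a minimal sketch, this cost function can be written in Python and evaluated for different candidate values of θ0 and θ1. The training set below is made up so that it lies exactly on the line y = 1 + 2x.

```python
def cost(theta0, theta1, xs, ys):
    """Average squared cost: the sum of ((θ0 + θ1*xi) - yi)^2 over all samples, divided by m."""
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / m

# Made-up training set lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect = cost(1.0, 2.0, xs, ys)  # parameters that match the data
worse = cost(0.0, 2.0, xs, ys)    # intercept is off by 1

print(perfect, worse)  # 0.0 1.0
```

The parameters that fit the data exactly give a cost of zero, and moving away from them increases the cost. Minimizing J(θ) means searching for the pair (θ0, θ1) with the lowest such value.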
There are many algorithms that can be implemented to minimize this cost function. Gradient descent is one commonly used algorithm; however, note that there is more than one way to minimize the cost function. I hope this article gave an insight into how the average cost function is derived from the hypothesis function in linear regression.