Machine Learning: Cost Function

May 22, 2020 — 5 Min Read

Machine Learning - The Supervised kind

Machine learning is the ability of computer algorithms to improve continuously through experience. One of the most common types of machine learning techniques include supervised learning. In learning algorithms, we have two sets of values - the Input Features and the Output Variables. Most of the supervised learning algorithms are classified into two types of problems:

Regression: In regression problems, we have a set of continuous input features, mapped against the output variables. The problem is to predict a real-valued output against an anonyomous input feature, as close to the actual value as possible.
Classification: In classification problems, we have a set of inputs belonging to a given category. The problem is to map anonymous input values into discreete categories.

In this article, we shall discuss about the Regression type problems in machine learning, the definition of cost function and the need for minimizing it.

What is Regression? What scenarios require regression method of problem solving in machine learning?

Regression is a mathematical problem solving method in which a we try to formulate a function through which an unknown variable can be predicted, whose value depends upon the values of known variables.

Assume that we have the problem of predicting the monetary value of house based on three factors: Dimensions, Number of bedrooms, and Age of the house. One can say that, the value of the house increases if the Dimensions and Number of bedrooms increases. On the otherhand, the value of house decreases if the Age increases

In this problem,

Unknown/Dependent variable(y): Value of the house
Known/Independent variable/s(x₁, x₂, x₃): Dimension, Number of bedrooms, Age
Regression function example: f(x) = θ₁x₁ + θ₂x₂ + θ₂x₂

Where, x₁, x₂, and x₃ represent Dimension, No.of bedrooms and Age respectively. y is the correct value of the house. The co-efficients θ₁, θ₂, and θ₃ of the independent variables x₁, x₂, and x₃ will change according to the training samples given while formulating the regression function f(x).

If we have only one independent variable(x), we call this as linear regression. For the sake of simplicity, I will be using linear regression in the rest of the article.

Cost of the hypothesis function

source

Consider the linear regression problem containing only one independent variable. Let's define the linear regression function by:

f(x) = θ₀ + θ₁x

This function is a hypothesis function in which we say that, for certain value of θ₀ and θ₁, given the value of x, we get the predictions of f(x) very close to the actual value.

Let us consider two graphs. Graph 1 contains the plot of output y versus the input values x as dots on a graph. Graph 2 contains the plot of hypothesis function f(x) plotted on the graph as a straight line. If we merge these two graphs and represent them in a single graph, the distance between the predicted value(a point on the hypothesis function) and the actual value(training sample value) represents the cost of individual predictions.

Let us view this graph:

In the above graph:

Individual points on the graph represent each training sample y v/s x.
The line on the graph represents the hypothesis function f(x)
The distance between the points and the line represents the cost for the individual training sample

Cost representation: We have seen the representation of cost graphically. Let us try to derrive a mathematical equation out of this graphical representation. We define the following parameters used in the cost funciton.

J(θ_i) => Individual Cost function.
f(x_i) => Hypothesis function for i_th training set.
y_i => i_th value in training set.

The individual costs can defined as the difference between the value of hypothesis function and the actual value: J(θ₁) = f(x₁) - y₁, J(θ₂) = f(x₂) - y₂

The total cost of all the values present in the training set can be represented as: Total cost = Σ_0-i (f(x_i) - y_i)

Minimizing cost function: We don't wana spend too much now, Do we?

The final goal of linear regression is to find the hypothesis function using training samples, such that the final total cost of the hypothesis function is minimal. Let us derrive the equation that represents cost function to be minimized.

We know that the total cost of the hypothesis function, given a training set can be defined as: Total cost = Σ_0-i (f(x_i) - y_i)
We want the cost to be minimum, in other words, the difference between (f(x_i) and y_i) should be minimum. Note that when we square an integer, its value increases, however, if we square a fraction, its value decreases. Squaring the the difference will make sure that the cost is at its absolute minimum: Total cost = J(θ_i) = Σ_0-i (f(x_i) - y_i)²
Let us consider that the training sample has m number of values in it. The average cost can be represented as: (Σ_0-i (f(x_i) - y_i)²)/m

Hence the average cost function to be minimized can be represented as: J(θ) = (Σ_0-i (f(x_i) - y_i)²)/m

If we substitute the hypothesis function with the actual values of θ and x, we get the cost function as:

J(θ) = (Σ_0-i ((θ₀ + θ₁x_i) - y_i)²)/m

There are many algorithms that can be implemented to minimize this cost function. Gradient descent is one such algorithm commonly used, however, note that there are more than one ways to reduce the cost function. I hope this article gave an insight on understanding how the average cost function is derrived from the hypothesis function in linear regression.

Written by Aparna Joshi who works as a software engineer in Bangalore. Aparna is also a technology enthusiast, writer, and artist. She has an immense passion and curiosity towards psychology and its implications on human behavior. Her links: Blog, Twitter, Email, Newsletter