Machine Learning

DECEMBER 8, 2018
by **SmritiS**

In my previous posts (this, this, this, this and this), we got familiar with various keywords and a few concepts related to machine learning. We also understood **Linear Regression** along with an example of predicting the prices of cars, given a set of cars and their prices (also known as the "dataset").

In today's post, we will explore and learn yet another machine learning algorithm: **Logistic Regression**.

**Logistic regression** is a process that **classifies** the given dataset into discrete outputs (outcomes). It is one of the most popular and widely used algorithms in the industry as of now. It is a supervised machine learning algorithm, since it learns from examples whose input variables come with known output values. Even though the name suggests that this is a regression algorithm, it is NOT one. It is a **classification algorithm**.

**Examples:**

- We classify an email as spam or not spam using Logistic Regression.
- We can classify tumors as malignant or benign using Logistic Regression.
- We can use a student's previous grades to determine if the student would pass or fail.
- We can determine whether a transaction made online is legit or fraudulent.

If you look at the above examples carefully, you can see a pattern. In all of them, we are trying to predict whether a variable is **zero** or **one**, **false** or **true**, **yes** or **no**, **fraudulent** or **legit**, **spam** or **not spam**, etc. This means the variable we are trying to predict can take either of two values, **0** or **1**.

Consequently, we are trying to **CLASSIFY** the variable as **one** or **zero**, which means **logistic regression is a classification algorithm**. The outcomes/predictions of Logistic Regression are always between 0 and 1. They can't be less than 0 and can't be greater than 1.

**Important NOTE:** Never apply Linear Regression when you encounter a classification problem. The reason is that linear regression can output values much larger than 1 or much smaller than 0, whereas the outputs of Logistic Regression always lie between 0 and 1. When Linear Regression is used to classify, outputs greater than 1 or less than 0 still have to be forced into one of the two classes, 0 or 1. Basically, our model needs to represent an equation like this:

$$h_\theta(x) = g(\theta^T x), \qquad 0 \le h_\theta(x) \le 1$$

where `h` is the **hypothesis function**, `T` represents **transpose**, and `g` is a function that we still need to define.

When we learned about Linear Regression, we used an expression to represent our hypothesis function. On similar lines, we need to define a function for logistic regression as well:

$$g(z) = \frac{1}{1 + e^{-z}}$$

This function `g` is known as the "**sigmoid function**", or "**logistic function**". The classification algorithm got its name from the term logistic function. As a side note, sigmoid function and logistic function mean the same thing. When we combine the model equation with the `g(z)` equation, we get an expression like this:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

(This equation was obtained by replacing the value for `g` in the previous equation.)

Using this function, we need to select the right values for the parameter **theta** and find the value of $h_\theta(x)$. This expression will help us in classifying the values present in the dataset.
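To make the hypothesis function more concrete, here is a minimal Python sketch of the sigmoid applied to theta-transpose-x. The function names `sigmoid` and `hypothesis`, as well as the sample values of theta and x, are assumptions made up for illustration and do not come from a fitted model.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), with theta and x given as equal-length lists."""
    z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x (a dot product)
    return sigmoid(z)

# Hypothetical parameters and input, chosen only for illustration.
theta = [0.5, -1.0]
x = [2.0, 1.0]                 # theta^T x = 0.5*2 + (-1)*1 = 0
print(hypothesis(theta, x))    # g(0) = 0.5
```

Whatever values of theta and x are plugged in (within the range `math.exp` can handle), the result stays between 0 and 1, which is exactly the property we wanted from the model.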

Graphical representation of sigmoid function:

Since the value needs to be strictly between 0 and 1, the function starts near **0** for large negative inputs and goes all the way up towards **1** for large positive inputs. In between, it passes through **0.5** at `z = 0` and then flattens out as it approaches 1. This basically means that the function **asymptotes at 0 and 1**.
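The shape described above can be checked numerically. The following is a small sketch that evaluates the sigmoid at a few hand-picked inputs (the sample points are arbitrary choices, not values from the post):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary sample points spread around zero.
sample_points = [-10, -2, 0, 2, 10]
values = [sigmoid(z) for z in sample_points]

for z, g in zip(sample_points, values):
    print(f"g({z:>3}) = {g:.5f}")
```

The printed values get closer and closer to 0 on the left and to 1 on the right without ever reaching them, and the value at `z = 0` is exactly 0.5.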

Consider an example where a student has scored 3 out of 10 (failed) in the first test and 5 out of 10 (passed) in the second test. These values can be used by the hypothesis function to predict if the student will pass (1) or fail (0) in the third test (provided the student puts in exactly the same effort and prepares as consistently as before). Suppose the hypothesis function outputs a value of `0.4` for the outcome `y=1`, which means we are only finding the probability of the student passing the third test (remember that the output should be strictly between 0 and 1). This `0.4` means that there is a **40 percent** chance of the student passing and a **60 percent** chance of the student failing the third test.

Also, remember that the probabilities of the two outcomes should add up to 1.
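In symbols, if we read the output of the hypothesis function as the probability of the outcome being 1 (the usual interpretation, written out here as an assumption about notation), the two probabilities are:

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

For the student example, $h_\theta(x) = 0.4$ gives $P(y = 1 \mid x; \theta) = 0.4$ and $P(y = 0 \mid x; \theta) = 1 - 0.4 = 0.6$.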

When our hypothesis function predicts an outcome of 1 (that is, when $h_\theta(x) \ge 0.5$), $\theta^T x$ should be greater than or equal to 0. Similarly, when our hypothesis function predicts an outcome of 0 (that is, when $h_\theta(x) < 0.5$), $\theta^T x$ should be less than 0. This can be inferred from the graphical representation of the sigmoid function, which crosses 0.5 exactly at 0.
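The link between the sign of the sigmoid's input and the predicted outcome can be sketched as follows. The helper `predict` and the cut-off of 0.5 are assumptions for illustration (0.5 is the conventional threshold), and `z` stands for the value of theta-transpose-x.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(z, threshold=0.5):
    """Return 1 when g(z) >= threshold, otherwise 0 (z plays the role of theta^T x)."""
    return 1 if sigmoid(z) >= threshold else 0

# Arbitrary inputs on both sides of zero.
for z in [-3.0, -0.1, 0.0, 0.1, 3.0]:
    # g(z) >= 0.5 exactly when z >= 0, so checking the sign of z gives the same answer.
    sign_based = 1 if z >= 0 else 0
    print(z, round(sigmoid(z), 3), predict(z), sign_based)
```

In every row the last two columns agree, so once theta is known we only need the sign of theta-transpose-x to decide between 1 and 0.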

The **decision boundary** is a property of the hypothesis function and depends on the coefficients of the variables in question. Let us consider an example to understand the concept of the decision boundary.

Consider a hypothesis function like this:

$$h_\theta(x) = g(\theta_1 x_1 + \theta_2 x_2 + \theta_3)$$

where $\theta_1$, $\theta_2$ and $\theta_3$ are the coefficients/parameters that are chosen (we will see how they are chosen in subsequent posts) and $x_1$ and $x_2$ are variables. Let us arbitrarily assume the values of $\theta_1$ to be 1, $\theta_2$ to be 1 and $\theta_3$ to be -1. When we substitute the values of the parameters in the hypothesis function, we get the following expression for predicting an outcome of 1:

$$x_1 + x_2 - 1 \ge 0$$

which can further be re-written as follows:

$$x_1 + x_2 \ge 1$$

When the boundary of the above inequality, the line $x_1 + x_2 = 1$, is plotted on a graph, it looks like below:

This line is known as the **decision boundary** because it separates the regions that are used to predict outcomes 1 and 0.

When predicting an outcome of 1, the region to the right of the line, which is shaded in **pink**, is considered. On the other hand, when predicting an outcome of 0, we consider the region below the line, which is shaded in **purple**. In general, after the values for theta are found, there is no need to plot this graph, because the values for theta themselves define the decision boundary.
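As a rough sketch, the decision boundary can also be applied in code without plotting anything. This example assumes the hypothesis has the form g(theta1 * x1 + theta2 * x2 + theta3) with the values 1, 1 and -1 from the example, which makes the boundary the line x1 + x2 = 1. That layout of the parameters, the function name `classify` and the test points are all assumptions for illustration.

```python
# Assumed layout: theta1 and theta2 multiply x1 and x2, theta3 is the constant term.
THETA1, THETA2, THETA3 = 1.0, 1.0, -1.0

def classify(x1, x2):
    """Predict 1 on or to the right of the line x1 + x2 = 1, and 0 otherwise."""
    z = THETA1 * x1 + THETA2 * x2 + THETA3
    return 1 if z >= 0 else 0

# Hypothetical points on both sides of the boundary.
points = [(2.0, 2.0), (0.0, 0.0), (0.5, 0.5), (1.0, -1.0)]
for x1, x2 in points:
    print((x1, x2), "->", classify(x1, x2))
```

The point (0.5, 0.5) lies exactly on the line and is assigned to outcome 1, because the sigmoid gives exactly 0.5 there and this sketch treats 0.5 as a prediction of 1.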

In the upcoming posts, we will see how to find the values for theta, fit these parameters, and use them on a sample dataset.