
LAST UPDATED: JULY 18, 2018

We now know what supervised learning means and how a line can be fitted to data by minimizing a cost function. Now we will dive into **unsupervised learning**.

Consider a real-life example: you visit a new country and are absolutely clueless about its lifestyle, food, dress, and so on. What do you do? You slowly observe the people, places, and food, and draw conclusions about them. After drawing conclusions, you gradually adapt to the surroundings, environment, language, and much more. Whenever you find that an observation or judgment was wrong, you revise your conclusion based on what you currently observe. This, in simple words, is unsupervised learning.

As mentioned earlier, in unsupervised learning there is no "correct answer" provided with the dataset. The data comes without labels, and no relationship between the variables is established in advance. The algorithm needs to figure out the relationships and learn about the dataset on its own. There is no teacher here; the algorithm must discover facts about the data by itself. Unsupervised learning can be further classified into **clustering** and **association** problems.

**Clustering**: a task where you need to figure out the inherent groups within the dataset provided. For example, grouping people based on their age, weight, etc.

**Association**: a task where the algorithm needs to figure out relationships (rules) in the dataset, with the condition that each relationship applies to a large portion of the data. For example, discovering that people who buy bread often also buy butter.

Unsupervised learning is considerably more complicated for some use cases, and some people even call it "**closely associated with true artificial intelligence**". Despite its difficulty, it enables us to tackle problems that humans normally wouldn't be able to solve on their own. The fact that a machine can learn to identify complicated patterns and work on them, without any external guidance from human beings, is truly fascinating.

Some commonly used unsupervised algorithms are as follows:

1. K-Means algorithm

2. Apriori algorithm

The K-Means family of algorithms solves the clustering problem by treating items in the real world as points in **n**-dimensional space. It is well suited to large datasets. It categorizes items (points) into **k** different groups (clusters) based on the relationship/similarity between these items. Here, we use **Euclidean distance** to measure the similarity between items.

Based on the type of item, we can use different metrics (instead of Euclidean distance) to calculate the similarity.
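As a quick sketch, Euclidean distance between two **n**-dimensional points can be computed like this (the function name is mine, chosen for illustration):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Any other metric (cosine similarity, Manhattan distance, etc.) could be dropped in here instead, as long as it captures what "similar" means for your items.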

- Read the dataset.
- Pick a value for **k** (the number of groups or clusters we want in the end).
- Initially, pick one item (point) at random for each of the **k** clusters.
- Examine the dataset and place each item into the cluster whose centroid is at minimal distance from it.
- Repeat the above step for each item in the dataset.
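The core of the assignment step is finding the nearest centroid for a point. A minimal sketch (function name is mine):

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [
        math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
        for centroid in centroids
    ]
    return distances.index(min(distances))
```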

Now let us get into the details of this clustering algorithm:

- Pick a value for **k** randomly (instead of picking randomly, **k** can also be chosen in a more principled way).
- Initialize **k** points randomly (again, the initialization can be done in a smarter way). These points are known as "**means**", hence the name **K-Means**. Each of them holds the mean value of the items in its category; initially each mean is just a single value (the **centroid**), but as the algorithm proceeds it becomes the average of many items.
- Now, examine the dataset. At this point, each of the **k** clusters contains just one item. Assign each item in the dataset to the cluster whose centroid is at minimal distance from it.
- As the number of items in a cluster increases, its centroid changes. Hence, the centroid needs to be recomputed whenever new items are added to any of the **k** clusters.
- After the centroids have been updated, the distances between centroids and items will have changed too. Re-assign every item to the centroid that is now closest to it; in this phase, some items may move between clusters.
- Repeat these steps until convergence, i.e. until points no longer move between clusters and the centroids stop changing.
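The steps above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions (points as tuples, random data points as initial centroids), not an optimized implementation; the function names are mine:

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two n-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (tuples) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random data points
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed clusters
        labels = new_labels
        # Update step: move each centroid to the mean of its cluster's members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels
```

On two well-separated groups of points, this loop settles into the expected two clusters after a handful of iterations.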

Choosing a value for **k** is more or less a game of trial and error.

- Pick a value for **k** and check the average distance of the points from their centroids. This average should be small.
- When **k** is too small, the average distance from the centroids to the data points will be too large.
- When **k** is too large, you won't see a significant decrease in the average distance.
- When you finally hit the right value for **k**, the drop in the average distance between "**k** too small" and "**k** just right" is quite significant.

Make sure that no point in the dataset ends up too far from its centroid; if some do, a higher value of **k** may be needed so that every point is reasonably close to a centroid.

The final clustering depends largely on the initial **k** points we pick. Hence it is essential to pick them well.

- Don't pick **k** initial points that all lie in the same cluster.
- Don't pick outliers (points that are not close to any of the other points).

In the upcoming posts, you will see an example of this algorithm and many more supervised and unsupervised algorithms.
