Consider a real-life example. You visit a new country and know nothing about its lifestyle, food, or clothing. What do you do? You observe the people, the food, and the customs around you and draw conclusions about them. You then gradually adapt to the surroundings, the language, and much more. Whenever an observation or judgment turns out to be wrong, you revise it based on what you currently see. This, in simple words, is unsupervised learning.
As mentioned earlier, in unsupervised learning no "correct answer" is provided for the given dataset. The algorithm is handed an unlabeled dataset, generally with no relationship between the variables established in advance. It needs to figure out those relationships and learn more about the dataset on its own; there is no teacher here. Unsupervised learning problems can be further divided into clustering and association problems.
Clustering: a task where you need to discover the inherent groups in the dataset provided. For example, grouping people based on their age, weight, etc.
Association: a task where the algorithm needs to discover relationships in the dataset provided, with the condition that each relationship holds for a large portion of the data.
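To make "a relationship that holds for a large portion of the data" a little more concrete, here is a minimal Python sketch. It uses support and confidence, two measures commonly used in association rule mining. The shopping baskets and the helper names `count`, `support`, and `confidence` are made up for illustration and are not taken from any library.

```python
# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction
    that also contain `consequent`."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

# How often do bread and milk appear together?
print(support({"bread", "milk"}, transactions))
# When bread is bought, how often is milk bought too?
print(confidence({"bread"}, {"milk"}, transactions))
```

A rule such as "bread implies milk" would only be kept if both numbers are above thresholds that you choose yourself.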
Unsupervised learning is considerably complicated for some use cases, and some people even call it "closely associated with true artificial intelligence". Even though it is complicated, it gives us a way to tackle problems that humans normally wouldn’t be able to solve. The fact that a machine can learn to handle complicated tasks without any external guidance from human beings is truly fascinating.
Some commonly used unsupervised algorithms are as follows:
1. K-Means algorithm
2. Apriori algorithm
The K-Means algorithm solves the clustering problem by treating items in the real world as points in n-dimensional space. It is well suited to large datasets. It assigns the items (or points) to k different groups (or clusters) based on the similarity between them. Here, we use Euclidean distance to measure that similarity.
Based on the type of item, we can use different metrics (instead of Euclidean distance) to calculate the similarity.
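The idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the `distance` parameter, the `seed` parameter, and the sample points are all assumptions made for this example. Note that swapping in another metric here only changes how points are assigned to clusters; the centroids are still computed as simple means.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, distance=euclidean, max_iters=100, seed=0):
    """Group `points` into `k` clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k points picked at random from the dataset.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels

# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

The first three points should end up with one label and the last three with the other; which group gets label 0 depends on the random starting centroids.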
Now let us get into the details of this clustering algorithm.
Choosing the value of k is more or less a game of trial and error.
Make sure that the points in the dataset are not too far from their centroid. If they are, use a higher value of k so that no point is too far from its centroid.
The final clustering depends largely on the initial k points we pick, so it is essential to choose them carefully.
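One way to check whether any point is "too far from the centroid" is to measure the largest point-to-centroid distance for a given grouping. The sketch below does this for two hand-made groupings of the same hypothetical points. The helper names and the data are assumptions for illustration, and the groupings are written out by hand rather than produced by an actual K-Means run. It uses `math.dist`, which needs Python 3.8 or newer.

```python
import math

def centroid(group):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(dim) / len(group) for dim in zip(*group))

def max_distance_to_centroid(clusters):
    """Largest Euclidean distance from any point to the centroid
    of the cluster it belongs to."""
    worst = 0.0
    for group in clusters:
        c = centroid(group)
        worst = max(worst, max(math.dist(p, c) for p in group))
    return worst

# The same four hypothetical points, grouped with k = 1 and with k = 2.
one_cluster = [[(1, 1), (1, 3), (9, 1), (9, 3)]]
two_clusters = [[(1, 1), (1, 3)], [(9, 1), (9, 3)]]

print(max_distance_to_centroid(one_cluster))
print(max_distance_to_centroid(two_clusters))
```

With k = 1 the farthest point is about 4.12 units from the centroid; with k = 2 that drops to 1.0. Repeating this kind of check for several values of k is one simple way to run the trial and error.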
In the upcoming posts, you will see an example of this algorithm and many more supervised and unsupervised algorithms.