What is Elasticsearch?
If you ask a group of developers to find data in a traditional RDBMS by matching some text, a beginner would use the LIKE clause in an SQL query, a more enthusiastic developer might write a PL/SQL procedure, and someone might even SELECT all the data, load it into a data structure, and apply a search algorithm. There are many such techniques, but all of them become less impressive as the data grows. Once the data reaches the order of millions of rows, a single text search may take a couple of seconds to execute.
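To make the LIKE approach concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The products table and its rows are made up for illustration. With only three rows the query is instant, but a pattern with a leading wildcard generally cannot use an ordinary index, so the work grows with the number of rows.

```python
import sqlite3

# Hypothetical product table, kept in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (name) VALUES (?)",
    [("Red running shoes",), ("Blue denim jacket",), ("Trail running socks",)],
)


def search_with_like(connection, term):
    """Return the names of products whose name contains `term`."""
    # The pattern %term% asks the database to look for the text anywhere
    # in the column, which typically means examining every row.
    rows = connection.execute(
        "SELECT name FROM products WHERE name LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [name for (name,) in rows]


matches = search_with_like(conn, "running")
print(matches)
```

This is the kind of search that works well on small tables and starts to hurt as the row count climbs.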
To store large amounts of semi-structured or unstructured data and run very fast full-text searches on it, we can use Elasticsearch.
Elasticsearch is not a database but a distributed search engine. It stores data as JSON objects (documents) for very fast full-text search, is highly scalable, and is easy to use even for a beginner.
Elasticsearch comes with a simple but extensive API for performing various operations, preconfigured default values that can be used in production without any changes for simple use cases, and a small learning curve.
You can install Elasticsearch on a Windows laptop, on macOS, or on any Linux machine, or deploy it on multiple physical or virtual servers as a cluster.
Elasticsearch can also be run in Docker using its Docker image, and in a Kubernetes environment.
Why do we need Elasticsearch?
As mentioned in the introduction of this tutorial, an RDBMS falls short when it has to query huge amounts of data for text matches, which is a common use case these days as data keeps growing.
For example, an e-commerce website easily has thousands or millions of products listed, and every customer searches for a product before buying it. If the search takes too long, the user will in most cases leave for another e-commerce website. Hence, a fast search system has become a business necessity.
An RDBMS, by its nature, keeps data spread across normalised tables. This gives the data a proper structure but makes it difficult to search across it. Hence, the software industry slowly shifted towards NoSQL systems, and Elasticsearch, which also follows the NoSQL approach, is one of them.
Elasticsearch provides a JSON-based query language called Query DSL (domain-specific language), which is used to query the stored data.
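As a rough illustration of what a Query DSL request body looks like, the sketch below builds a simple full-text match query as a Python dictionary and prints it as JSON. The helper name match_query and the field name "name" are assumptions made for this example. Nothing is sent to a server here.

```python
import json


def match_query(field, text, size=10):
    """Build the body of a full-text `match` query (illustrative helper)."""
    return {
        "size": size,  # maximum number of hits to return
        "query": {"match": {field: text}},
    }


# "name" is a hypothetical field on hypothetical product documents.
body = match_query("name", "running shoes", size=5)
print(json.dumps(body, indent=2))
```

The printed JSON is the kind of body that would be sent to a search endpoint of an index.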
Elasticsearch is distributed, which means it can be deployed as a cluster with multiple nodes running on different physical or virtual servers, although a single-node setup works as well. Let's try to understand what all this means in simple language.
An Elasticsearch cluster is a group of Elasticsearch nodes running on different physical or virtual machines. When we set up a cluster, one node is elected as the master node, which is used to make configuration changes to the cluster, and the other nodes in the cluster know which node is the master.
We can communicate with any node through the REST API by making an HTTP request, while internally the nodes communicate with each other via the transport layer.
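Talking to a node over the REST API can be sketched with Python's standard urllib module. The example only prepares the HTTP request and does not send it. The address http://localhost:9200 assumes a local node listening on the default HTTP port, and the index name "products" and the helper build_search_request are made up for illustration.

```python
import json
import urllib.request


def build_search_request(base_url, index, body):
    """Prepare (but do not send) a search request for one index."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local node and hypothetical "products" index.
request = build_search_request(
    "http://localhost:9200",
    "products",
    {"query": {"match": {"name": "running shoes"}}},
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would actually send it, which requires a running node that accepts the connection.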
Below we have a simple diagram to explain a simple Elasticsearch service cluster.
The default name of an Elasticsearch cluster is elasticsearch. We can specify the number of nodes to start initially when running Elasticsearch, and we can set the IP address of the physical or virtual server in the config/elasticsearch.yml file, which holds all the other configuration settings too.
Following are the main logical units of Elasticsearch service:
Cluster - An Elasticsearch cluster consists of one or more nodes and is identifiable by its cluster name.
Node - A single Elasticsearch instance. In most environments, each node runs on a separate box or virtual machine.
Index - In Elasticsearch, an index is a collection of documents. It is loosely comparable to a table in an RDBMS.
Shards - Because Elasticsearch is a distributed search engine, an index is usually split into smaller units known as shards, which are distributed across multiple nodes.
Replicas - Elasticsearch also keeps copies of the shards, called replicas, as a fail-safe mechanism.
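The idea of spreading the documents of an index over shards can be sketched as follows. Conceptually, a document is assigned to a primary shard by hashing its routing value (by default its ID) and taking the result modulo the number of primary shards. This is a simplification: zlib.crc32 is a stand-in hash chosen for the sketch, not the hash Elasticsearch uses, and the document IDs and shard count are made up.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards):
    """Map a document ID to a primary shard number (simplified model)."""
    # Stand-in hash for illustration only; Elasticsearch uses a different one.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


# Hypothetical document IDs distributed over three primary shards.
doc_ids = ["product-1", "product-2", "product-3", "product-4"]
placement = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
print(placement)
```

Because the same ID always hashes to the same shard, a node can work out where a document lives without searching every shard for it.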
We will cover all of these in detail in separate tutorials.
Applications of Elasticsearch
I first crossed paths with Elasticsearch when we were moving an enterprise application to the cloud using Kubernetes. We had a use case of aggregating logs from each Docker container into a data storage engine, from which they could be shown in a UI. Naturally, we went with Elasticsearch for storing the logs and used Kibana (another product from the same company) as the UI for visualization.
That is a very specific use case for Elasticsearch, though a popular one.
Following are some of the main use cases in which Elasticsearch passes with flying colors:
Application or website search, for social networking websites with huge data, e-commerce websites with a large product catalogue, etc.
Enterprise search for any SaaS product which provides large data to other products.
Logging and log analytics, as I mentioned above.
System metrics and container monitoring in Docker to keep an eye on the health of the container.
Application performance monitoring
And various other forms of analytics where real-time data is collected and later analysed.
Programming Languages Supported by Elasticsearch
Elasticsearch can be used from many programming languages. Official clients are available for several of them, including Java, JavaScript, Python, Ruby, Go, PHP, and .NET.
Download and Install
You can download and install Elasticsearch from here: https://www.elastic.co/downloads/elasticsearch
There are several ways to do so:
1. By downloading the .zip file and executing the shell script bin/elasticsearch in the extracted folder to run the Elasticsearch service (see Install Elasticsearch on Linux Machine with Kibana and Fluent Bit), or on Windows by running the bin/elasticsearch.bat file.
2. By using a package manager such as yum, apt-get, or Homebrew on macOS.
3. In Docker, using the Elasticsearch docker image.
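Whichever installation method you choose, a quick way to check that a node is up is to request its root URL, which responds with a small JSON document describing the node. The sketch below assumes a node at http://localhost:9200 that accepts plain HTTP without authentication. Recent versions enable security by default, in which case this unauthenticated request would be rejected and the function would simply return None. The helper name check_node is made up for this example.

```python
import http.client
import json
import urllib.error
import urllib.request


def check_node(base_url, timeout=2.0):
    """Return the node's info as parsed JSON, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON reply.
        return None


# Assumed address of a locally running node.
info = check_node("http://localhost:9200")
print("node is up" if info else "node is not reachable")
```

If the node is reachable, the returned dictionary includes fields such as the cluster name and version details.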
So Elasticsearch is a distributed storage engine that is very fast at searching the data stored in it, even when that data runs into tens or hundreds of GBs. In the next tutorial we will look at its architecture to better understand how it works and why it is so fast at searching data.