Understanding Elasticsearch Architecture
Elasticsearch has a simple, straightforward architecture. If you have worked with any distributed service, you will find the base architecture of Elasticsearch easy to understand.
In this tutorial we will cover the Elasticsearch architecture: what a cluster and its nodes are, how Elasticsearch processes incoming data, and what happens when we search for data in an Elasticsearch cluster. We will also cover some basics of indices, documents, shards and replicas.
So let's start with the basic architecture of an Elasticsearch deployment.
Elasticsearch Architecture: Cluster and Nodes
Elasticsearch is deployed as a cluster with a minimum of 1 node. A node is a running instance of the Elasticsearch service, typically on its own physical or virtual server, and it is a part of the cluster. The data stored in Elasticsearch is divided among the nodes in the cluster, which distributes the load.
So a cluster is nothing but a group of nodes among which all of the cluster's data is divided. The default name for an Elasticsearch cluster is elasticsearch. When we start a new node with that cluster name, it joins the elasticsearch cluster, provided it can discover the existing nodes on the network. Older releases did this discovery automatically, while recent releases generally need the discovery settings to be configured.
Because the data is divided among the nodes, a query for some data runs in parallel on every node that holds a relevant part of that data.
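To make the idea of divided data and parallel queries concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the node names, the CRC32-based placement and the helper functions are all hypothetical, and Elasticsearch's real routing works on shards rather than whole nodes.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib


def pick_node(doc_id, node_names):
    """Pick a node from a stable hash of the document ID (toy routing only)."""
    return node_names[zlib.crc32(doc_id.encode("utf-8")) % len(node_names)]


def index_document(nodes, doc_id, body):
    """Store the document on exactly one node and return that node's name."""
    name = pick_node(doc_id, sorted(nodes))
    nodes[name].append({"id": doc_id, "body": body})
    return name


def search_node(docs, term):
    """Return the IDs of the documents on one node whose body contains the term."""
    return [doc["id"] for doc in docs if term in doc["body"].lower()]


def search_cluster(nodes, term):
    """Run the same search on every node in parallel, then merge the results."""
    term = term.lower()
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda docs: search_node(docs, term), nodes.values()))
    return sorted(doc_id for partial in partials for doc_id in partial)


if __name__ == "__main__":
    # Hypothetical three-node cluster; each "node" is just a list of documents.
    cluster = {"node-1": [], "node-2": [], "node-3": []}
    index_document(cluster, "1", "Elasticsearch cluster basics")
    index_document(cluster, "2", "Nodes and shards")
    index_document(cluster, "3", "Cluster health checks")
    print(search_cluster(cluster, "cluster"))  # prints ['1', '3']
```

Each document lives on exactly one node, so no single node has to hold or scan everything, and the final answer is simply the merged per-node results.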
The Ports 9200 and 9300
Another point that confuses many people is why the nodes have two ports, 9200 and 9300.
Port 9200 is for the REST API. It accepts incoming HTTP requests, which are generally Elasticsearch API calls such as running a query, creating an index or listing all indices. Any node can accept these requests: the node that receives a request communicates internally with the other nodes in the cluster, and that inter-node communication is done via port 9300. On larger clusters it is common to keep client traffic away from dedicated master nodes so that they stay free for cluster management.
So we can say that for the outside world, i.e. for requests coming from outside the cluster (HTTP requests), port 9200 is used, and for inter-node communication port 9300 is used. The inter-node communication takes place via the transport layer.
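As a rough illustration of the split between the two ports, the Python sketch below talks to the REST API on port 9200 and never touches 9300. It assumes an unsecured development node reachable over plain HTTP on localhost; the helper names rest_url and cluster_health are made up for this example, while _cluster/health and _cat/indices are standard Elasticsearch REST endpoints.

```python
import json
from urllib.request import urlopen

HTTP_PORT = 9200       # REST API: used by clients outside the cluster
TRANSPORT_PORT = 9300  # transport layer: node-to-node traffic, not called by clients


def rest_url(host, path, port=HTTP_PORT):
    """Build the URL of a REST endpoint on a node's HTTP port."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def cluster_health(host="localhost"):
    """Fetch the cluster health document (needs a reachable, unsecured node)."""
    with urlopen(rest_url(host, "_cluster/health"), timeout=5) as response:
        return json.load(response)


if __name__ == "__main__":
    print(rest_url("localhost", "_cat/indices?v"))  # prints http://localhost:9200/_cat/indices?v
    # Uncomment when a local node is running:
    # print(cluster_health()["status"])
```

A cluster with security enabled would need HTTPS and credentials on top of this, which the sketch leaves out.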
More about Elasticsearch Nodes
The idea behind having multiple nodes in an Elasticsearch cluster is not just to divide the data to be stored; we can also assign different roles/tasks to different nodes. Following are some of the types of nodes we can have in an Elasticsearch cluster:
Data nodes - These nodes store the data and perform data-related operations such as search queries.
Master nodes - This node is in charge of cluster-wide management and configuration actions such as adding and removing nodes. Because this work is critical to the cluster, a dedicated master node is usually kept free of search and indexing traffic.
Client nodes - This node is optional (recent releases call it a coordinating-only node), but under heavy load it can be used to forward cluster requests to the master node and data-related requests to data nodes.
Ingest nodes - This node acts as the gateway for incoming data, where we can pre-process the documents before indexing them.
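The node types above map onto the node.roles setting in a node's elasticsearch.yml file. The Python sketch below prints the line each type would use. It assumes Elasticsearch 7.9 or later, where node.roles is available; the NODE_ROLES table and the node_roles_setting helper are hypothetical names for this example, and the client node is modelled as a coordinating-only node, that is, a node with an empty role list.

```python
# Node type from the list above -> roles to put in elasticsearch.yml.
# Assumption: Elasticsearch 7.9+, which supports the node.roles setting.
NODE_ROLES = {
    "data": ["data"],
    "master": ["master"],
    "client": [],          # coordinating-only: no roles, it only routes requests
    "ingest": ["ingest"],
}


def node_roles_setting(node_type):
    """Render the node.roles line for one of the node types above."""
    roles = NODE_ROLES[node_type]
    inner = " " + ", ".join(roles) + " " if roles else " "
    return f"node.roles: [{inner}]"


if __name__ == "__main__":
    for node_type in NODE_ROLES:
        print(f"{node_type:>6} -> {node_roles_setting(node_type)}")
```

A single node can also carry several roles at once, which is how small clusters usually run; dedicated roles start to pay off as the cluster grows.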