World of Hadoop on AWS through the path of Terraform
❗❗ Hello everyone ❗❗
In this article, we are going to walk through an interesting integration of three popular technologies: Hadoop, AWS and Terraform. Together they form a HAT cluster that can store huge amounts of data using distributed storage, and the best part is that the whole setup is fully automated🤩. You may have heard of these technologies separately; today we will combine them into one setup. Let's take a quick overview of each of them.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
AWS (Amazon Web Services) is a secure cloud services platform that offers compute power, database storage, content delivery and various other functionalities. Here we are going to use its compute and storage resources.
Terraform (by HashiCorp), an AWS Partner Network (APN) Advanced Technology Partner and member of the AWS DevOps Competency, is an “infrastructure as code” tool similar to AWS CloudFormation that allows you to create, update, and version your Amazon Web Services (AWS) infrastructure.
Overview
🎯 In this world of automation, we engineers always try to automate everything. With that in mind, we have created a Terraform setup that launches a Hadoop cluster on the AWS cloud containing:
- 1 Master Node
- 3 Slave Nodes
- 2 Client Nodes
In this cluster, every slave node shares 10 GB of its storage (in my case), so whenever a client throws some data into the cluster, it is stored in a distributed manner across the slaves. The default block size of files dumped into the cluster is 64 MB and the replication factor is 3, i.e., by default 3 copies of each block are kept as a backup in case of data loss. Here we will dump two empty files just for demonstration, but in practice you can try it with large amounts of data.
Implementation
We will cover everything step by step, so let's start implementing our HAT Cluster.
Step 1 : Creating an IAM user and accessing its credentials to automate the process using Terraform
To create the IAM user, refer to the following article:
Step 2 : Create a Master Node
a). To make a connection to the AWS account, we use:
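A minimal sketch of what this looks like; the region and the CLI profile name below are placeholders for whatever you configured with the IAM user's credentials in Step 1:

```hcl
# Connect Terraform to the AWS account using the IAM user's credentials.
# The region and profile name are placeholders -- use your own.
provider "aws" {
  region  = "ap-south-1"
  profile = "hadoop-user"
}
```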
b). Creating a security group. Since we are creating a public cluster here, we will allow all the traffic, so we use:
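Something along these lines (the group name is a placeholder); opening every port like this is acceptable only for a short-lived demo cluster:

```hcl
# Security group that allows all inbound and outbound traffic,
# so the Hadoop daemons and web UIs are reachable from anywhere.
resource "aws_security_group" "hadoop_sg" {
  name        = "hadoop-allow-all"
  description = "Allow all traffic for the HAT cluster demo"

  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"          # -1 means all protocols
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```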
c). Creating the instance that will run the Master Node, we use:
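Roughly like this; the AMI ID, instance type and key pair name are placeholders, and the instance references the security group created above:

```hcl
# EC2 instance for the Master Node (the NameNode of the cluster).
resource "aws_instance" "master" {
  ami             = "ami-0xxxxxxxxxxxxxxxx"              # an Amazon Linux AMI in your region
  instance_type   = "t2.micro"
  key_name        = "hat-key"                            # your existing key pair
  security_groups = [aws_security_group.hadoop_sg.name]

  tags = {
    Name = "MasterNode"
  }
}
```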
d). Creating and attaching a volume for the Master Node, we use:
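A sketch of the volume and its attachment; the size (10 GiB here) and the device name are my assumptions:

```hcl
# EBS volume for the master, created in the same availability zone
# as the instance and then attached to it.
resource "aws_ebs_volume" "master_vol" {
  availability_zone = aws_instance.master.availability_zone
  size              = 10
}

resource "aws_volume_attachment" "master_attach" {
  device_name  = "/dev/sdh"
  volume_id    = aws_ebs_volume.master_vol.id
  instance_id  = aws_instance.master.id
  force_detach = true
}
```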
e). Provisioning the instance to set up and launch the Master Node. To set up the Master Node we need to configure both the hdfs-site.xml and core-site.xml files, so we will clone both files from our GitHub repository.
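A condensed sketch of the provisioning step, assuming Hadoop 1.x and the JDK are installed from RPMs stored in the GitHub repository via git-lfs (see the links at the end); the SSH user, key file, repository URL and package file names are placeholders:

```hcl
# Provision the master over SSH: install the software, pull the NameNode
# configuration from GitHub, then format and start the NameNode.
resource "null_resource" "master_setup" {
  depends_on = [aws_volume_attachment.master_attach]

  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hat-key.pem")
    host        = aws_instance.master.public_ip
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum install git -y",
      "git clone https://github.com/<your-user>/<master-config-repo>.git",     # hypothetical repo
      "sudo rpm -ivh <master-config-repo>/jdk-8u171-linux-x64.rpm",             # JDK RPM (fetched via git-lfs)
      "sudo rpm -ivh <master-config-repo>/hadoop-1.2.1-1.x86_64.rpm --force",   # Hadoop 1.x RPM
      "sudo cp <master-config-repo>/hdfs-site.xml /etc/hadoop/hdfs-site.xml",
      "sudo cp <master-config-repo>/core-site.xml /etc/hadoop/core-site.xml",
      "echo Y | sudo hadoop namenode -format",
      "sudo hadoop-daemon.sh start namenode"
    ]
  }
}
```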
We also want the Hadoop dashboard to open automatically as soon as the Master Node is up:
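One way to do this is a local-exec provisioner that opens the NameNode web UI (port 50070 on Hadoop 1.x) once the master is provisioned; the `start chrome` command assumes a Windows workstation, so swap in `xdg-open` or `open` on Linux/macOS:

```hcl
# Open the NameNode dashboard in a local browser after provisioning.
resource "null_resource" "open_dashboard" {
  depends_on = [null_resource.master_setup]

  provisioner "local-exec" {
    command = "start chrome http://${aws_instance.master.public_ip}:50070"
  }
}
```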
The complete code will be:
After the complete code is in place, we need to initialize the working directory before applying it, using the command “terraform init”.
After initializing, we apply the Terraform code to set up the Master Node.
After a few seconds, we see that it is set up successfully and the dashboard is launched.
Let's see whether it has been created in the AWS cloud or not. Obviously it has!!
Now let's move on to the next step of the Hadoop cluster setup.
Step 3: Create the Slave Nodes
We start by making the connection to the AWS account with the same provider block that we already used while creating the Master Node, i.e.,
a). Creating the EC2 instance to launch the Slave Node, just as we did for the master. A slave node is also called a data node, so don't get confused if I use “data node” instead of “slave node”.
b). Creating and attaching a volume for the Slave Node (10 GB in my case), we use the same pattern as for the master.
c). Provisioning the instance to set up the Slave Node, again through a Terraform provisioner.
To set up a slave node we need to configure both the hdfs-site.xml and core-site.xml files, so we will again clone both files from our GitHub repository.
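The instance and volume resources mirror the master's, so here is a condensed sketch for one complete slave covering steps a) to c), with the same placeholder AMI, key and repository names as before; the slave's core-site.xml in the repository is assumed to point at the master's IP:

```hcl
# One slave (DataNode): instance, 10 GiB data volume, and provisioning.
resource "aws_instance" "slave" {
  ami             = "ami-0xxxxxxxxxxxxxxxx"
  instance_type   = "t2.micro"
  key_name        = "hat-key"
  security_groups = ["hadoop-allow-all"]   # the security group created earlier

  tags = {
    Name = "SlaveNode"
  }
}

resource "aws_ebs_volume" "slave_vol" {
  availability_zone = aws_instance.slave.availability_zone
  size              = 10
}

resource "aws_volume_attachment" "slave_attach" {
  device_name = "/dev/sdh"
  volume_id   = aws_ebs_volume.slave_vol.id
  instance_id = aws_instance.slave.id
}

resource "null_resource" "slave_setup" {
  depends_on = [aws_volume_attachment.slave_attach]

  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hat-key.pem")
    host        = aws_instance.slave.public_ip
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum install git -y",
      "git clone https://github.com/<your-user>/<slave-config-repo>.git",      # hypothetical repo
      "sudo rpm -ivh <slave-config-repo>/jdk-8u171-linux-x64.rpm",
      "sudo rpm -ivh <slave-config-repo>/hadoop-1.2.1-1.x86_64.rpm --force",
      "sudo cp <slave-config-repo>/hdfs-site.xml /etc/hadoop/hdfs-site.xml",    # dfs.data.dir for this slave
      "sudo cp <slave-config-repo>/core-site.xml /etc/hadoop/core-site.xml",    # points at the master
      "sudo hadoop-daemon.sh start datanode"
    ]
  }
}
```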
This code sets up one Slave Node; if you want to set up more, reuse the same pattern. We have created more than one slave node, and the complete code for multiple Slave Nodes is:
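The full listing lives in the repository linked at the end; as a sketch, one compact way to launch several identical slaves (not necessarily what the original code does) is Terraform's count meta-argument:

```hcl
# Launch three identical slave instances; the volume, attachment and
# provisioner resources follow the same pattern, referring to
# aws_instance.slave[count.index].
resource "aws_instance" "slave" {
  count           = 3
  ami             = "ami-0xxxxxxxxxxxxxxxx"
  instance_type   = "t2.micro"
  key_name        = "hat-key"
  security_groups = ["hadoop-allow-all"]

  tags = {
    Name = "SlaveNode-${count.index + 1}"
  }
}
```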
After the code is ready, we initialize Terraform in the SlaveNode directory using the command “terraform init”.
After initializing, we apply the Terraform code to set up multiple Slave Nodes:
After a few seconds, all the slaves are set up successfully.
Let's check on the AWS dashboard whether they have been created or not. Obviously they have!!
Let's also check whether they are connected to the master or not. We will verify this through the Master Node dashboard.
Now we will move on to the next step: creating the client nodes, which will use the cluster created above as distributed storage for their data.
Step 4 : Create the Client Nodes
In this step we will set up the client nodes in much the same way as the slave nodes, starting with the same connection to the AWS account, i.e.,
a). Creating the instance to set up the Client Node. Since the client is going to use the storage of the Hadoop cluster we created above, it does not need any extra volume.
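A sketch of the client instance, again with placeholder AMI, key and security group names; note that there is no EBS volume or attachment this time:

```hcl
# EC2 instance for a client node. No extra volume is attached --
# the client stores its data in the cluster's distributed storage.
resource "aws_instance" "client" {
  ami             = "ami-0xxxxxxxxxxxxxxxx"
  instance_type   = "t2.micro"
  key_name        = "hat-key"
  security_groups = ["hadoop-allow-all"]

  tags = {
    Name = "ClientNode"
  }
}
```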
b). Provisioning the instance to set up the Client Node. Since we do not need to configure the hdfs-site.xml file here, we only clone the core-site.xml file.
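A condensed sketch of the client provisioning, with the same placeholder repository and package names; the final command is my assumption of how the empty demo file mentioned below was created (a.txt on this client, b.txt on the other):

```hcl
# Provision the client: install Hadoop, pull only core-site.xml
# (which points at the master), then drop an empty demo file into HDFS.
resource "null_resource" "client_setup" {
  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hat-key.pem")
    host        = aws_instance.client.public_ip
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum install git -y",
      "git clone https://github.com/<your-user>/<client-config-repo>.git",     # hypothetical repo
      "sudo rpm -ivh <client-config-repo>/jdk-8u171-linux-x64.rpm",
      "sudo rpm -ivh <client-config-repo>/hadoop-1.2.1-1.x86_64.rpm --force",
      "sudo cp <client-config-repo>/core-site.xml /etc/hadoop/core-site.xml",
      "hadoop fs -touchz /a.txt"                                               # empty demo file
    ]
  }
}
```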
This is the code to set up a single client; you can create more as per your requirements. In my case, I have created two client nodes. The complete code to set up two client nodes is:
Let's check on the AWS dashboard whether the client nodes have been created or not. Obviously they have!!
While provisioning, we also dumped two empty files named a.txt and b.txt, one from each client node. Let's check whether they have been uploaded or not.
Finally, we have successfully created a complete distributed Hadoop cluster the advanced way, i.e., using AWS and Terraform. Hence the name we have given to this cluster: the HAT cluster, a Hadoop cluster set up over the AWS cloud through Terraform.
Now that we have finished using the cluster and completed our work, it is good practice to tear the setup down. Here we do not need to close each service separately; we can destroy the Master, Slave and Client nodes directly through Terraform with “terraform destroy”.
Step 5 : Destroy the Complete Setup
a). Destroy the Client Nodes
b). Destroy the Slave Nodes
c). Destroy the Master Node
Hence, we have successfully destroyed our HAT Cluster after achieving our target🎯.
To provision Hadoop and the JDK in the cluster instances, we need the concept of git-lfs. To learn more about git-lfs and how to use it, refer to:
To access the complete code, refer to this:
To access the configuration files for the Master Node, refer to this:
To access the configuration files for the Slave and Client Nodes, refer to this:
Hope you like the setup!! Any queries and suggestions are welcome.
Thank You !!!😇