World of Hadoop on AWS through the path of Terraform
❗❗ Hello everyone ❗❗
In this article, we are going to walk through an interesting integration of three popular technologies: Hadoop, AWS and Terraform. Together they form a HAT cluster that can store huge amounts of data using distributed storage, and the best part is that the whole setup is fully automated🤩. You may have heard of these technologies separately; today we will combine them into one setup. Let's take a quick overview of each of them.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
AWS (Amazon Web Services) is a secure cloud services platform that offers compute power, database storage, content delivery and various other functionalities. Here we are going to use its compute and storage resources.
Terraform (by HashiCorp), an AWS Partner Network (APN) Advanced Technology Partner and member of the AWS DevOps Competency, is an “infrastructure as code” tool similar to AWS CloudFormation that allows you to create, update, and version your Amazon Web Services (AWS) infrastructure.
Overview
🎯 In this world of automation, we engineers always try to automate everything. With that in mind, we have created a Terraform setup that launches a Hadoop cluster on the AWS cloud containing:
- 1 Master Node
- 3 Slave Nodes
- 2 Client Nodes
In this cluster, every slave node shares 10 GB of its storage (in my case), so whenever a client throws some data into the cluster, it is stored in a distributed manner across the slaves. The default block size of files dumped into the cluster is 64 MB and the replication factor is 3, i.e., by default 3 copies of each block are kept as a backup in case of data loss. Here we will dump two empty files just for demonstration, but in practice you can try it with large amounts of data.
Implementation
We will cover everything step by step, so let's start implementing our HAT Cluster.
Step 1 : Creating an IAM user and accessing its credentials to automate the process using Terraform
To create the IAM user, refer to the following article:
Step 2 : Create a Master Node
a). To make a connection to the AWS account, we use:
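A minimal sketch of what this looks like; the region and the CLI profile name below are placeholders for whatever you configured with the IAM user's credentials in Step 1:

```hcl
# Connect Terraform to the AWS account using the IAM user's credentials.
# The region and profile name are placeholders -- use your own.
provider "aws" {
  region  = "ap-south-1"
  profile = "hadoop-user"
}
```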
b). Creating a security group. Since we are creating a public cluster here, we will allow all the traffic, so we use:
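Something along these lines (the group name is a placeholder); opening every port like this is acceptable only for a short-lived demo cluster:

```hcl
# Security group that allows all inbound and outbound traffic,
# so the Hadoop daemons and web UIs are reachable from anywhere.
resource "aws_security_group" "hadoop_sg" {
  name        = "hadoop-allow-all"
  description = "Allow all traffic for the HAT cluster demo"

  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"          # -1 means all protocols
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```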
c). Creating the instance that will run the Master Node, we use:
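Roughly like this; the AMI ID, instance type and key pair name are placeholders, and the instance references the security group created above:

```hcl
# EC2 instance for the Master Node (the NameNode of the cluster).
resource "aws_instance" "master" {
  ami             = "ami-0xxxxxxxxxxxxxxxx"              # an Amazon Linux AMI in your region
  instance_type   = "t2.micro"
  key_name        = "hat-key"                            # your existing key pair
  security_groups = [aws_security_group.hadoop_sg.name]

  tags = {
    Name = "MasterNode"
  }
}
```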
d). Creating and attaching a volume for the Master Node, we use:
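A sketch of the volume and its attachment; the size (10 GiB here) and the device name are my assumptions:

```hcl
# EBS volume for the master, created in the same availability zone
# as the instance and then attached to it.
resource "aws_ebs_volume" "master_vol" {
  availability_zone = aws_instance.master.availability_zone
  size              = 10
}

resource "aws_volume_attachment" "master_attach" {
  device_name  = "/dev/sdh"
  volume_id    = aws_ebs_volume.master_vol.id
  instance_id  = aws_instance.master.id
  force_detach = true
}
```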
e). Provisioning the instance to set up and launch the Master Node. To set up the Master Node we need to configure both the hdfs-site.xml and core-site.xml files, so we will clone both files from our GitHub repository.
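A condensed sketch of the provisioning step, assuming Hadoop 1.x and the JDK are installed from RPMs stored in the GitHub repository via git-lfs (see the links at the end); the SSH user, key file, repository URL and package file names are placeholders:

```hcl
# Provision the master over SSH: install the software, pull the NameNode
# configuration from GitHub, then format and start the NameNode.
resource "null_resource" "master_setup" {
  depends_on = [aws_volume_attachment.master_attach]

  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hat-key.pem")
    host        = aws_instance.master.public_ip
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum install git -y",
      "git clone https://github.com/<your-user>/<master-config-repo>.git",     # hypothetical repo
      "sudo rpm -ivh <master-config-repo>/jdk-8u171-linux-x64.rpm",             # JDK RPM (fetched via git-lfs)
      "sudo rpm -ivh <master-config-repo>/hadoop-1.2.1-1.x86_64.rpm --force",   # Hadoop 1.x RPM
      "sudo cp <master-config-repo>/hdfs-site.xml /etc/hadoop/hdfs-site.xml",
      "sudo cp <master-config-repo>/core-site.xml /etc/hadoop/core-site.xml",
      "echo Y | sudo hadoop namenode -format",
      "sudo hadoop-daemon.sh start namenode"
    ]
  }
}
```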
We also want the Hadoop dashboard to open automatically as soon as the Master Node is up:
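One way to do this is a local-exec provisioner that opens the NameNode web UI (port 50070 on Hadoop 1.x) once the master is provisioned; the `start chrome` command assumes a Windows workstation, so swap in `xdg-open` or `open` on Linux/macOS:

```hcl
# Open the NameNode dashboard in a local browser after provisioning.
resource "null_resource" "open_dashboard" {
  depends_on = [null_resource.master_setup]

  provisioner "local-exec" {
    command = "start chrome http://${aws_instance.master.public_ip}:50070"
  }
}
```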
The complete code will be:
After the complete code is in place, we need to initialize the working directory before applying it, using the command “terraform init”.
After initializing, we apply the Terraform code to set up the Master Node.
After a few seconds, we see that it is set up successfully and the dashboard is launched.
Let's see whether it has been created in the AWS cloud or not. Obviously it has!!
Now let's move on to the next step of the Hadoop cluster setup.
Step 3: Create the Slave Nodes
We start by making the connection to the AWS account with the same provider block that we already used while creating the Master Node, i.e.,
a). Creating the EC2 instance to launch the Slave Node, just as we did for the master. A slave node is also called a data node, so don't get confused if I use “data node” instead of “slave node”.
b). Creating and attaching a volume for the Slave Node (10 GB in my case), we use the same pattern as for the master.
c). Provisioning the instance to set up the Slave Node, again through a Terraform provisioner.
To set up a slave node we need to configure both the hdfs-site.xml and core-site.xml files, so we will again clone both files from our GitHub repository.
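The instance and volume resources mirror the master's, so here is a condensed sketch for one complete slave covering steps a) to c), with the same placeholder AMI, key and repository names as before; the slave's core-site.xml in the repository is assumed to point at the master's IP:

```hcl
# One slave (DataNode): instance, 10 GiB data volume, and provisioning.
resource "aws_instance" "slave" {
  ami             = "ami-0xxxxxxxxxxxxxxxx"
  instance_type   = "t2.micro"
  key_name        = "hat-key"
  security_groups = ["hadoop-allow-all"]   # the security group created earlier

  tags = {
    Name = "SlaveNode"
  }
}

resource "aws_ebs_volume" "slave_vol" {
  availability_zone = aws_instance.slave.availability_zone
  size              = 10
}

resource "aws_volume_attachment" "slave_attach" {
  device_name = "/dev/sdh"
  volume_id   = aws_ebs_volume.slave_vol.id
  instance_id = aws_instance.slave.id
}

resource "null_resource" "slave_setup" {
  depends_on = [aws_volume_attachment.slave_attach]

  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hat-key.pem")
    host        = aws_instance.slave.public_ip
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum install git -y",
      "git clone https://github.com/<your-user>/<slave-config-repo>.git",      # hypothetical repo
      "sudo rpm -ivh <slave-config-repo>/jdk-8u171-linux-x64.rpm",
      "sudo rpm -ivh <slave-config-repo>/hadoop-1.2.1-1.x86_64.rpm --force",
      "sudo cp <slave-config-repo>/hdfs-site.xml /etc/hadoop/hdfs-site.xml",    # dfs.data.dir for this slave
      "sudo cp <slave-config-repo>/core-site.xml /etc/hadoop/core-site.xml",    # points at the master
      "sudo hadoop-daemon.sh start datanode"
    ]
  }
}
```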
This code sets up one Slave Node; if you want to set up more, reuse the same pattern. We have created more than one slave node, and the complete code for multiple Slave Nodes is:
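The full listing lives in the repository linked at the end; as a sketch, one compact way to launch several identical slaves (not necessarily what the original code does) is Terraform's count meta-argument:

```hcl
# Launch three identical slave instances; the volume, attachment and
# provisioner resources follow the same pattern, referring to
# aws_instance.slave[count.index].
resource "aws_instance" "slave" {
  count           = 3
  ami             = "ami-0xxxxxxxxxxxxxxxx"
  instance_type   = "t2.micro"
  key_name        = "hat-key"
  security_groups = ["hadoop-allow-all"]

  tags = {
    Name = "SlaveNode-${count.index + 1}"
  }
}
```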
After the code is ready, we initialize Terraform in the SlaveNode directory using the command “terraform init”.
After initializing, we apply the Terraform code to set up multiple Slave Nodes:
After a few seconds, all the slaves are set up successfully.
Let's check on the AWS dashboard whether they have been created or not. Obviously they have!!
Let's also check whether they are connected to the master or not. We will verify this through the Master Node dashboard.
Now we will move on to the next step: creating the client nodes, which will use the cluster created above as distributed storage for their data.
Step 4 : Create the Client Nodes
In this step we will set up the client nodes in much the same way as the slave nodes, starting with the same connection to the AWS account, i.e.,
a). Creating the instance to set up the Client Node. Since the client is going to use the storage of the Hadoop cluster we created above, it does not need any extra volume.
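A sketch of the client instance, again with placeholder AMI, key and security group names; note that there is no EBS volume or attachment this time:

```hcl
# EC2 instance for a client node. No extra volume is attached --
# the client stores its data in the cluster's distributed storage.
resource "aws_instance" "client" {
  ami             = "ami-0xxxxxxxxxxxxxxxx"
  instance_type   = "t2.micro"
  key_name        = "hat-key"
  security_groups = ["hadoop-allow-all"]

  tags = {
    Name = "ClientNode"
  }
}
```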
b). Provisioning the instance to set up the Client Node. Since we do not need to configure the hdfs-site.xml file here, we only clone the core-site.xml file.
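A condensed sketch of the client provisioning, with the same placeholder repository and package names; the final command is my assumption of how the empty demo file mentioned below was created (a.txt on this client, b.txt on the other):

```hcl
# Provision the client: install Hadoop, pull only core-site.xml
# (which points at the master), then drop an empty demo file into HDFS.
resource "null_resource" "client_setup" {
  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hat-key.pem")
    host        = aws_instance.client.public_ip
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum install git -y",
      "git clone https://github.com/<your-user>/<client-config-repo>.git",     # hypothetical repo
      "sudo rpm -ivh <client-config-repo>/jdk-8u171-linux-x64.rpm",
      "sudo rpm -ivh <client-config-repo>/hadoop-1.2.1-1.x86_64.rpm --force",
      "sudo cp <client-config-repo>/core-site.xml /etc/hadoop/core-site.xml",
      "hadoop fs -touchz /a.txt"                                               # empty demo file
    ]
  }
}
```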
This is the code to set up a single client; you can create more as per your requirements. In my case, I have created two client nodes. The complete code to set up two client nodes is:
Let's check on the AWS dashboard whether the client nodes have been created or not. Obviously they have!!
While provisioning, we also dumped two empty files named a.txt and b.txt, one from each client node. Let's check whether they have been uploaded or not.
Finally, we have successfully created a complete distributed Hadoop cluster the advanced way, i.e., using AWS and Terraform. Hence the name we have given to this cluster: the HAT cluster, a Hadoop cluster set up over the AWS cloud through Terraform.
Now that we have finished using the cluster and completed our work, it is good practice to tear the setup down. Here we do not need to close each service separately; we can destroy the Master, Slave and Client nodes directly through Terraform with “terraform destroy”.
Step 5 : Destroy the Complete Setup
a). Destroy the Client Nodes
b). Destroy the Slave Nodes
c). Destroy the Master Node
Hence, we have successfully destroyed our HAT Cluster after achieving our target🎯.
To provision Hadoop and the JDK in the cluster instances, we need the concept of git-lfs. To learn more about git-lfs and how to use it, refer to:
To access the complete code, refer to this:
To access the configuration files for the Master Node, refer to this:
To access the configuration files for the Slave and Client Nodes, refer to this:
Hope you like the setup!! Any queries and suggestions are welcome.
Thank You !!!😇