Tuesday, 18 August 2015

Hadoop Interview Questions


Which operating system(s) are supported for production Hadoop deployment?
The main supported operating system is Linux. However, with some additional software Hadoop can be deployed on Windows.
What is the role of the namenode?
The namenode is the "brain" of the Hadoop cluster and responsible for managing the distribution blocks on the system based on the replication policy. The namenode also supplies the specific addresses for the data based on the client requests.
What happens on the namenode when a client tries to read a data file?
The namenode looks up the information about the file in the edit log and the in-memory filesystem snapshot. Since the namenode needs to support a large number of clients, it only sends back the location of the data; the datanode itself is responsible for the retrieval.
What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?
There are no strict requirements for datanodes. However, the namenodes require a specified amount of RAM to store the filesystem image in memory. Based on the design of the primary namenode and secondary namenode, the entire filesystem information is stored in memory, so both namenodes need enough memory to hold the entire filesystem image.
What mode(s) can Hadoop code be run in?
Hadoop can be deployed in standalone mode, pseudo-distributed mode or fully-distributed mode. Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine, and as a single process, for testing purposes.

How would a Hadoop administrator deploy the various components of Hadoop in production?
Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes. Only one namenode and one jobtracker are needed on the system. The number of datanodes depends on the available hardware.
What is the best practice to deploy the secondary namenode?
Deploy the secondary namenode on a separate, standalone machine so that it does not interfere with primary namenode operations. The secondary namenode has the same memory requirements as the primary namenode.

Is there a standard procedure to deploy Hadoop?
No, there are some differences between the various distributions. However, they all require that the Hadoop jars be installed on the machine. There are some common requirements for all Hadoop distributions, but the specific procedures will differ from vendor to vendor since they all have some degree of proprietary software.
What is the role of the secondary namenode?
The secondary namenode performs the CPU-intensive operation of combining the edit logs with the current filesystem snapshot. It was separated out as its own process because of this CPU-intensive work and the additional requirement of metadata back-up.


What are the side effects of not running a secondary name node?
The cluster performance will degrade over time since the edit log will grow bigger and bigger. If the secondary namenode is not running at all, the edit log will grow significantly and will slow the system down. Also, the system will go into safemode for an extended time at startup, since the namenode needs to combine the edit log and the current filesystem checkpoint image.

What happens if a datanode loses its network connection for a few minutes?
The namenode will detect that the datanode is not responsive and will start re-replicating the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted. The replication factor is actively maintained by the namenode: it monitors the status of all datanodes and keeps track of which blocks are located on each node. The moment a datanode becomes unavailable, replication of the data from the existing replicas is triggered. However, if the datanode comes back up, the over-replicated data will be deleted. Note: the data might be deleted from the original datanode.

What happens if one of the datanodes has a much slower CPU?
The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact. Hadoop was specifically designed to work with commodity hardware, and speculative execution helps to offset slow workers: multiple instances of the same task are created, the jobtracker takes the first result into consideration, and the other instances of the task are killed.


What is speculative execution?
If speculative execution is enabled, the job tracker will issue multiple instances of the same task on multiple nodes and it will take the result of the task that finished first. The other instances of the task will be killed.
Speculative execution is used to offset the impact of slow workers in the cluster. The jobtracker creates multiple instances of the same task and takes the result of the first one to finish successfully; the remaining instances are discarded.
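As a rough illustration (the jar and driver class names below are placeholders, not part of any real job), speculative execution can be tuned per job in Hadoop 1.x through the mapred.*.tasks.speculative.execution properties:

                hadoop jar myjob.jar MyDriver \
                  -Dmapred.map.tasks.speculative.execution=true \
                  -Dmapred.reduce.tasks.speculative.execution=false \
                  /input /output

Note that the -D overrides only take effect if the driver uses ToolRunner/GenericOptionsParser; otherwise the same properties can be set in mapred-site.xml.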

How many racks do you need to create a Hadoop cluster in order to make sure that the cluster operates reliably?
In order to ensure reliable operation it is recommended to have at least 2 racks with rack placement configured. Hadoop has a built-in rack awareness mechanism that distributes data between the different racks based on the configuration.
Are there any special requirements for namenode?
Yes, the namenode holds information about all files in the system and needs to be extra reliable. The namenode is a single point of failure: it needs to be extra reliable and its metadata needs to be replicated in multiple places. Note that the community is working on solving the single point of failure issue with the namenode.
If you have a 128MB file and the replication factor is set to 3, how many blocks can you find on the cluster that correspond to that file (assuming the default Apache and Cloudera configuration)?
6
Based on the configuration settings the file will be divided into multiple blocks according to the default block size of 64MB: 128MB / 64MB = 2. Each block will be replicated according to the replication factor setting (default 3): 2 * 3 = 6.
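To double-check this on a running cluster (the path below is just an example), hadoop fsck lists the blocks and replica locations for a file:

                hadoop fsck /user/hduser/file.txt -files -blocks -locations

For a 128MB file with a 64MB block size and replication factor 3, the output should show 2 blocks, each with 3 replicas.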
What is distributed copy (distcp)?
Distcp is a Hadoop utility that launches MapReduce jobs to copy data. Its primary usage is copying large amounts of data. One of the major challenges in the Hadoop environment is copying data across multiple clusters, and distcp allows multiple datanodes to be leveraged for parallel copying of the data.
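A typical invocation (the cluster names and paths here are invented for illustration) copies a directory from one cluster's namenode to another:

                hadoop distcp hdfs://namenodeA:8020/data/logs hdfs://namenodeB:8020/backup/logs

distcp runs as a map-only MapReduce job, so the copy work is spread across the datanodes of the cluster it is launched from.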
What is replication factor?
The replication factor controls how many times each individual block is replicated.
Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of failure.
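The cluster-wide default comes from dfs.replication in hdfs-site.xml (see the installation section below), and the factor can also be changed per file or directory after the fact; for example (the path is hypothetical):

                hadoop fs -setrep -w 3 /user/hduser/important-data

The -w flag makes the command wait until the requested replication level has actually been reached.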
What daemons run on Master nodes?
NameNode, Secondary NameNode and JobTracker
Hadoop is comprised of five separate daemons, and each of these daemons runs in its own JVM. The NameNode, Secondary NameNode and JobTracker run on master nodes; the DataNode and TaskTracker run on each slave node.
What is rack awareness?
Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions. Hadoop will try to minimize the network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this thanks to rack awareness.
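Rack awareness is not automatic: in Hadoop 1.x the administrator points topology.script.file.name (in core-site.xml) at an executable that maps each datanode address to a rack id. A minimal sketch of such a script, with invented rack names and subnets:

                #!/bin/bash
                # Print one rack id for every host/IP argument the namenode passes in.
                for host in "$@"; do
                  case $host in
                    10.1.1.*) echo "/rack1" ;;
                    10.1.2.*) echo "/rack2" ;;
                    *)        echo "/default-rack" ;;
                  esac
                done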


What is the role of the jobtracker in a Hadoop cluster?
The jobtracker is responsible for scheduling tasks on slave nodes, collecting results, and retrying failed tasks. The jobtracker is the main component of map-reduce execution: it controls the division of the job into smaller tasks, submits tasks to individual tasktrackers, tracks the progress of the job and reports results back to the calling code.
How does the Hadoop cluster tolerate datanode failures?
Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
The namenode actively tracks the status of all datanodes and acts immediately if the datanodes become non-responsive. The namenode is the central "brain" of the HDFS and starts replication of the data the moment a disconnect is detected.
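An administrator can see the namenode's view of the datanodes at any time with:

                hadoop dfsadmin -report

The report lists live and dead datanodes together with their capacity and the amount of data stored on each.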
What is the procedure for namenode recovery?
A namenode can be recovered in two ways: starting new namenode from backup metadata or promoting secondary namenode to primary namenode.
The namenode recovery procedure is very important to ensure the reliability of the data. It can be accomplished by starting a new namenode using backup data or by promoting the secondary namenode to primary.


Web-UI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?
This means that the namenode is trying to retire those datanodes by moving their replicas to the remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes before decommissioning has finished.
Because of the replication strategy, it is possible to lose data if datanodes are removed en masse prior to completing the decommissioning process. Decommissioning refers to the namenode moving the replicas held on those datanodes to the remaining datanodes.
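Decommissioning is normally triggered by the administrator rather than by failure: the hosts to be retired are listed in the file referenced by dfs.hosts.exclude (the file path and hostname below are examples only) and the namenode is told to re-read it:

                echo "datanode07.example.com" >> /home/hduser/hadoop/conf/excludes
                hadoop dfsadmin -refreshNodes

The nodes stay in the decommissioning state until all their blocks have been re-replicated elsewhere, after which they can be removed safely.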
What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?
Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.
The Hadoop cluster will detect new datanodes automatically. However, in order to optimize cluster performance it is recommended to start the balancer to redistribute the data evenly between the datanodes.
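A sketch of running the balancer (the threshold value is optional and chosen here only for illustration):

                hadoop balancer -threshold 10

The threshold is the allowed difference, in percent of disk usage, between any datanode and the cluster average; the balancer keeps moving blocks until every node is within that band or it is stopped.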
If the Hadoop administrator needs to make a change, which configuration file does he need to change?
Each node in the Hadoop cluster has its own configuration files, and the changes need to be made in every file. One of the reasons for this is that the configuration can be different for every node.

Map Reduce jobs are failing on a cluster that was just restarted. They worked before restart. What could be wrong?
The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.
This is a very common situation when there is no secondary namenode on the cluster and the cluster has not been restarted in a long time. The namenode will go into safemode and combine the edit log with the current filesystem checkpoint image.
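The safemode state can be checked and, if the administrator is sure the filesystem is healthy, left manually:

                hadoop dfsadmin -safemode get
                hadoop dfsadmin -safemode leave

Normally it is better to simply wait: the namenode leaves safemode on its own once enough block reports have been received from the datanodes.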
Map Reduce jobs take too long. What can be done to improve the performance of the cluster?
One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of the tasks. The number of tasks has to match the number of available slots on the cluster.
Hadoop is not a hardware-aware system. It is the responsibility of the developers and the administrators to make sure that resource supply and demand match.
How often do you need to reformat the namenode?
Never. The namenode needs to be formatted only once, in the beginning. Reformatting the namenode will lead to loss of the data on the entire cluster.
The namenode only needs to be formatted once. Formatting creates the directory structure for the file system metadata and the namespaceID for the entire file system.
After increasing the replication level, I still see that data is under replicated. What could be wrong?
Data replication takes time due to the large quantities of data involved. The Hadoop administrator should allow sufficient time for data replication.

Depending on the data size, the replication will take some time. The Hadoop cluster still needs to copy data around, and if the data size is big enough it is not uncommon for replication to take from a few minutes to a few hours.
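Progress can be monitored with fsck, which reports the number of under-replicated blocks for a path (here the whole filesystem):

                hadoop fsck /

When the "Under-replicated blocks" counter reaches zero, the new replication level has been fully applied.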

Hive Installation

Before you start installing Hive, you should already have installed Hadoop.
Step 1: Download Hive from the Apache website (check compatibility with your Hadoop version)
Step 2: Go to the downloads folder and extract hive.tar.gz
Step 3: Copy the extracted directory into /home/hduser/hive.
Step 4: Edit /etc/bash.bashrc
                $sudo gedit /etc/bash.bashrc
                (or)
                $sudo leafpad /etc/bash.bashrc
Step 5: Set Home Path
                export HADOOP_HOME=/home/hduser/hadoop
                export HIVE_HOME=/home/hduser/hive
                export PATH=$PATH:$HADOOP_HOME/bin
                export PATH=$PATH:$HIVE_HOME/bin
Step 6: Save and Close
Step 7: Open terminal and enter hive 
Now you can enter hive shell.
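As a quick smoke test (the query is arbitrary), a statement can also be run without entering the shell:

                hive -e "show databases;"

If the installation is correct this prints at least the default database.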

Hive Introduction

Introduction

  • Data warehousing tool on top of Hadoop
  • SQL like interface
  • Provides SQL like language to analyze the data stored on HDFS
  • Can be used by people who know SQL
  • Not all traditional SQL capabilities are supported
  • Under the hood hive queries are executed as MapReduce jobs
  • No extra work is required to run the queries as MapReduce jobs


Hive Components
  • MetaStore
  • It is a database consisting of table definitions and other metadata
  • By default it is stored on the local machine in a Derby database
  • It can be kept on a shared machine, such as a relational database, if multiple users are using Hive

Query Engine
  • Hive-QL, which gives a SQL-like query language
  • Internally, Hive queries are run as MapReduce jobs

Hive Data Models
  • Hive forms or layers table definitions on top of data residing on HDFS

Databases
  • Namespace that separates tables into groups and avoids naming conflicts

Table
  • Homogeneous unit of data having the same schema

MetaStore
It is a database consisting of table definitions and other metadata. By default it is stored on the local machine in a Derby database. It can be kept on a shared machine, such as a relational database, if multiple users are using Hive.

Monday, 17 August 2015

Hadoop Installation (Pseudo Distributed Mode)

Steps for Installation of Hadoop
If you are on a Windows environment, follow the steps below.
Step 1: Install VMware Workstation
                Download Product (http://www.vmware.com/products/workstation/workstation-evaluation)
Step 2: Download any flavor of Linux (e.g. Red Hat Linux/Ubuntu; if you have a low-configuration machine, use Lubuntu)
Step 3: Start VMware and create a new vm.
Step 4: Install Linux(any flavor)
Step 5: Start the Linux command prompt (terminal)
Step 6: Log in as root and create a new group
Step 7: Add a new group
Command  :  sudo addgroup hadoop
It will ask for the root password; give the root password.
sudo is used when you want to run any command as the super user.
Step 8: Add a new user for hadoop in group “hadoop”
Command : sudo adduser --ingroup hadoop hduser
It will ask for the password that you want to set.
Step 9: Now add hduser to the list of sudoers, so that you can run any command as hduser
Command :  sudo adduser hduser sudo
Step 10: Now logout root and Login hduser
Step 11: Open terminal
Step 12: Hadoop is developed in Java. Java should be installed on your machine before we start using Hadoop.
                We need Java 1.5+ (that is, 1.5 or a later version)
Command :         sudo apt-get install openjdk-6-jdk

apt-get is the package manager of Ubuntu; it helps to install software.
Step 13: Install the SSH server (Secure Shell)
Command :         sudo apt-get install openssh-server
Step 14:  Once ssh installed we can login to remote machine using following command
Command :         ssh <ipaddress>
If you try ssh localhost, you will notice that it will prompt you for a password. Now we want to make this login password-less. One way of doing this is to use keys. We can generate keys using the following command.
Command :         ssh-keygen -t rsa -P ""
This command will generate two keys at the "/home/hduser/.ssh/" path: id_rsa and id_rsa.pub.
id_rsa is the private key.
id_rsa.pub is the public key.
Command :         ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@localhost
Give password for hduser.

Step 15: Download Hadoop from Apache website
Step 16: Extract hadoop and put it in folder "/home/hduser/hadoop"

Step 17: Now we need to make configurations in the Hadoop configuration files. You will find these files in the "/home/hduser/hadoop/conf" folder.

Step 18: There are 4 important files in this folder

     a) hadoop-env.sh
     b) hdfs-site.xml
     c) mapred-site.xml
     d) core-site.xml
a) hadoop-env.sh is a file that contains Hadoop environment-related properties.
Here we can set the Java home.
                export  JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386

b) hdfs-site.xml is a file that contains properties related to HDFS.
 We need to set the replication factor here.
 By default the replication factor is 3.
 Since we are installing Hadoop on a single machine, we will set it to 1.

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

c) mapred-site.xml is a file that contains properties related to MapReduce.
Here we set the IP address and port of the machine on which the job tracker is running.
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

d) core-site.xml is a property file that contains properties which are common to, or used by, both MapReduce and HDFS.
 Here we set the IP address and port number of the machine on which the namenode will be running.
 The other property tells Hadoop where it should store files such as the fsimage, blocks, etc.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/hadoop_tmp_files</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
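After the configuration files are in place, the usual next steps (assuming the paths used above) are to format the namenode once and start the daemons:

                /home/hduser/hadoop/bin/hadoop namenode -format
                /home/hduser/hadoop/bin/start-all.sh
                jps

jps should list NameNode, SecondaryNameNode, JobTracker, DataNode and TaskTracker. Remember from the interview questions above: the namenode is formatted only once; reformatting destroys the filesystem metadata.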

MapReduce

The MapReduce engine consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker periodically to report its status. The JobTracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.
The Hadoop Map/Reduce framework harnesses a cluster of machines and executes user-defined Map/Reduce jobs across the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs. Tasks in each phase are executed in a fault-tolerant manner; if nodes fail in the middle of a computation, the tasks assigned to them are re-distributed among the remaining nodes. Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead.

Hadoop Map/Reduce – Goals:
• Process large data sets
• Cope with hardware failure

• High throughput

Execution overview:

1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 MB to 64 MB per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the assigned input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. The locations of these buffered pairs on the local disk are passed back to the master, who forwards these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls (RPC) to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program: the MapReduce call in the user program returns back to the user code. The output of the MapReduce execution is available in the R output files (one per reduce task).
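To see these phases in action on the pseudo-distributed cluster from the installation section, the example jar that ships with Hadoop can be used (the jar name varies with the Hadoop version, and the input/output paths here are placeholders):

                hadoop jar hadoop-examples-*.jar wordcount /user/hduser/input /user/hduser/output
                hadoop fs -cat /user/hduser/output/part*

The JobTracker web UI (port 50030 by default) shows the map and reduce tasks as they are scheduled and completed.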

Hadoop's Distributed File System (HDFS)

Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode and a cluster of datanodes that together form the HDFS cluster; not every node needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to talk to the namenode and datanodes. HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack. Datanodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.

Hadoop Distributed File System – Goals:
      Store large data sets
      Cope with hardware failure
      Emphasize streaming data access

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS:
      is highly fault-tolerant and is designed to be deployed on low-cost hardware.
      provides high throughput access to application data and is suitable for applications that have large data sets.
      relaxes a few POSIX requirements to enable streaming access to file system data.
      is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.

Data Storage in HDFS
A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, mainly to minimize the cost of seeks.
A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.

Case 1: If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?
No, not at all! 64 MB is just the unit in which the data is stored. In this particular situation, only 50 MB will be consumed by the HDFS block and 14 MB will be free to store something else. It is the master node that does data allocation in an efficient manner.

Case 2: If we want to copy 10 blocks from one machine to another, but the other machine can hold only 8.5 blocks, can the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the master node will figure out the actual amount of space required, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.

HADOOP

Introduction to Hadoop:


Hadoop is an open-source software framework for distributed processing of large datasets across large clusters of computers.

0 or 1 = 1 bit
8 bits = 1 byte
1024 bytes = 1 KiloByte
1024 KB = 1 MegaByte
1024 MB = 1 GigaByte
1024 GB = 1 TeraByte
1024 TB = 1 PetaByte
1024 PB = 1 ExaByte
1024 EB = 1 ZettaByte



Hadoop History

Hadoop was derived from Google’s MapReduce and Google File System (GFS) papers. Hadoop was created by Doug Cutting and Michael J. Cafarella and named after Doug’s son’s toy elephant.

Dec 2004: Google GFS paper published
July 2005: Nutch uses MapReduce
Feb 2006: Starts as a Lucene subproject
Apr 2007: Yahoo! on a 1000-node cluster
Jan 2008: An Apache Top Level Project
Jul 2008: A 4000-node test cluster
May 2009: Hadoop sorts a Petabyte in 17 hours

1 Hard Disk = 100MB/sec (~1Gbps)
Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)
Rack = 20 Servers = 24GB/sec (~240Gbps)
Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)
Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)

HADOOP

Hadoop is an open-source software framework for distributed processing of large datasets across large clusters of computers.
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
(In other terms…)
It provides a distributed file system and a framework for the analysis and transformation of very large data sets using the Map-Reduce paradigm.
Hadoop was inspired by Google's MapReduce and Google File System (GFS). Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses. Applications need a write-once-read-many access model.
The Hadoop framework consists of two main layers:
Distributed file system (HDFS) - stands for Hadoop Distributed File System. It uses a framework involving many machines which store large amounts of data in files over a Hadoop cluster.
Execution engine (MapReduce) - MapReduce is a set of programs used to access and manipulate large data sets over a Hadoop cluster.

  • Hadoop is a framework for running applications on large clusters built of commodity hardware.
  • The Hadoop framework transparently provides applications both reliability and data motion.
  • Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
  • In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
  • Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

Why is Hadoop able to compete?

-- Hadoop is not a database; it is an architecture with a file system called HDFS. The data is stored in HDFS, which does not have any predefined containers.
-- A relational database stores data in predefined containers.

Distributed Databases vs. Hadoop

Computing Model
-       Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
-       Hadoop: notion of jobs; the job is the unit of work; no concurrency control

Data Model
-       Distributed databases: structured data with known schema; Read/Write mode
-       Hadoop: any data will fit in any format, (un)(semi)structured; ReadOnly mode

Cost Model
-       Distributed databases: expensive servers
-       Hadoop: cheap commodity machines

Fault Tolerance
-       Distributed databases: failures are rare; recovery mechanisms
-       Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance

Key Characteristics
-       Distributed databases: efficiency, optimizations, fine-tuning
-       Hadoop: scalability, flexibility, fault tolerance



Architecture 

The Hadoop Common package contains the necessary Java ARchive (JAR) files and scripts needed
to start Hadoop. The package also provides source code, documentation and a contribution
section that includes projects from the Hadoop Community.

HDFS follows a master/slave architecture. An HDFS installation consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on.

The Namenode makes file system namespace operations like opening, closing, renaming
etc. of files and directories available via an RPC interface. It also determines the mapping
of blocks to Datanodes.

The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.

Master node (Hadoop currently configured with centurion064 as the master node)
Keeps track of namespace and metadata about items
Keeps track of MapReduce jobs in the system

Slave nodes (Centurion064 also acts as a slave node)
Manage blocks of data sent from master node
In terms of GFS, these are the chunk servers

Hadoop Master/Slave Architecture

Hadoop is designed as a master-slave shared-nothing architecture

In total, 5 daemons run in the Hadoop master-slave architecture.

On Master Node: Name Node, Secondary Name Node and Job Tracker

On Slave Node: Data Node and Task Tracker

Name node:-
The name node is one of the daemons that runs on the master node and holds the meta info about where each particular chunk of data (i.e. which data node) resides. Based on this meta info it maps incoming jobs to the corresponding data nodes. Data corruption is detected through checksums: every piece of data has a record followed by a checksum, and if the checksum does not match the original, a data-corrupted error is reported.
The name node is responsible for managing HDFS and supplies the addresses of the data on the different data nodes. The name node requires a specified amount of RAM to store the file system image in memory. The name node holds information about all the files in the system and needs to be extra reliable. The name node will detect that a data node is not responsive and will start replication of the data from the remaining replicas. When the data node comes back online, the extra replicas will be deleted.


Note:- The replication factor controls how many times each individual block is replicated.

Secondary name node:- The secondary name node performs the CPU-intensive operation of combining the edit logs and the current file system snapshot. It is recommended to run the secondary name node on a separate machine that has master-node capacity. If we are not running a secondary name node, cluster performance will degrade over time since the edit log will grow bigger and bigger.

Job Tracker:- The Job Tracker is responsible for scheduling tasks on slave nodes, collecting results and retrying failed tasks. There can only be one Job Tracker in the cluster. It can be run on the same machine that runs the Name Node. It receives heartbeats from the task trackers, based on which the Job Tracker decides whether an assigned task is completed or not.
Note:- The default port number for the Job Tracker web UI is 50030.

Note:- A daemon is a process or service that runs in the background.

Datanode:- Datanodes are the slaves which are deployed on each machine and provide the actual storage. The datanodes are responsible for serving read and write requests from file system clients; they also perform block creation, deletion, and replication upon instruction from the namenode.

Task tracker:- The task tracker is also a daemon that runs on datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker will initialize the job, divide the work and assign it to different task trackers to perform MapReduce tasks. While performing this work, the task tracker will simultaneously communicate with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it will assume that the task tracker has crashed and assign that task to another task tracker in the cluster.

Pig - Platform for analyzing large data sets. Pig Latin queries are converted into MapReduce jobs. Developed at Yahoo!
Hive - Data warehousing tool on top of Hadoop. Provides a SQL-like interface/language to analyze the data stored on HDFS. Developed at Facebook
HBase - Persistent, distributed, sorted, multidimensional, sparse map. Based on Google BigTable, which provides interactive access to information
Sqoop - Imports SQL-based data into Hadoop
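A representative Sqoop import (the connection string, table name and credentials are invented for illustration) pulls a relational table into HDFS as files that Hive or MapReduce can then process:

                sqoop import --connect jdbc:mysql://dbhost/sales --table orders --username hduser -P --target-dir /user/hduser/orders

Each of these tools ultimately compiles its work down to the same HDFS and MapReduce layers described above.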