Monday, 17 August 2015

HADOOP

Introduction to Hadoop:


Hadoop is an open-source software framework for distributed processing of large datasets across large clusters of computers.

0 or 1 = 1 bit
8 bits = 1 byte
1024 bytes = 1 KiloByte
1024 KB = 1 MegaByte
1024 MB = 1 GigaByte
1024 GB = 1 TeraByte
1024 TB = 1 PetaByte
1024 PB = 1 ExaByte
1024 EB = 1 ZettaByte



Hadoop History

Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop was created by Doug Cutting (then working at Yahoo!) and Michael J. Cafarella, and named after Doug's son's toy elephant.

Dec 2004 - Google GFS paper published
July 2005 - Nutch uses MapReduce
Feb 2006 - Starts as a Lucene subproject
Apr 2007 - Yahoo! runs it on a 1000-node cluster
Jan 2008 - Becomes an Apache top-level project
Jul 2008 - A 4000-node test cluster
May 2009 - Hadoop sorts a petabyte in 17 hours

1 Hard Disk = 100MB/sec (~1Gbps)
Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)
Rack = 20 Servers = 24GB/sec (~240Gbps)
Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)
Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)
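
To see why this aggregate bandwidth matters, the short sketch below (a back-of-the-envelope calculation in Java, using the assumed throughput figures above; the class name is just illustrative) compares how long a full scan of 1 TB takes on a single disk versus spread across an average cluster:

// Rough scan-time comparison based on the assumed throughput figures above.
public class ScanTimeEstimate {
    public static void main(String[] args) {
        double oneTerabyte = 1024.0 * 1024 * 1024 * 1024;   // bytes in 1 TB
        double singleDisk  = 100.0 * 1024 * 1024;           // ~100 MB/sec for one disk
        double avgCluster  = 144.0 * 1024 * 1024 * 1024;    // ~144 GB/sec for a 6-rack cluster

        System.out.printf("Single disk: %.0f seconds (~%.1f hours)%n",
                oneTerabyte / singleDisk, oneTerabyte / singleDisk / 3600);
        System.out.printf("Avg cluster: %.1f seconds%n", oneTerabyte / avgCluster);
    }
}

Reading the same terabyte drops from roughly three hours on one disk to a few seconds when the work is spread over a whole cluster, which is exactly the effect Hadoop is designed to exploit.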

HADOOP

Hadoop is an open-source software framework for distributed processing of large datasets across large clusters of computers.
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
(In other terms…)
It provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm.
Hadoop was inspired by Google's MapReduce and Google File System (GFS). Hadoop is a top-level Apache project built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses. Hadoop applications assume a write-once-read-many access model for files.
The Hadoop framework consists of two main layers:
Distributed file system (HDFS) - the Hadoop Distributed File System stores large amounts of data in files spread across the many machines of a Hadoop cluster.
Execution engine (MapReduce) - MapReduce is the programming model and set of programs used to access and manipulate large data sets stored on a Hadoop cluster.

- Hadoop is a framework for running applications on large clusters built of commodity hardware.
- The Hadoop framework transparently provides applications with both reliability and data motion.
- Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster (a minimal word-count sketch follows this list).
- In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
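
To make the Map/Reduce paradigm concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API (the WordCount class and its nested mapper/reducer names are illustrative, not part of any particular distribution): the map fragments emit (word, 1) pairs, and the reduce tasks sum the counts for each word, on whichever nodes the framework schedules them.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: each small fragment of input text is turned into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all pairs sharing the same word are summed, wherever the task runs.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

If a map or reduce task fails, the framework simply re-executes that fragment on another node; the code above does not have to handle node failures itself.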

Why is Hadoop able to compete?

-- Hadoop is not a database; it is an architecture with a file system called HDFS. The data is stored in HDFS, which does not impose any predefined containers on it.
-- A relational database, by contrast, stores data in predefined containers (tables with a fixed schema).

Distributed Databases vs. Hadoop

Computing Model
- Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties and concurrency control
- Hadoop: notion of jobs; the job is the unit of work; no concurrency control

Data Model
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data will fit in any format - (un)structured, semi-structured, structured; read-only mode

Cost Model
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines

Fault Tolerance
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance

Key Characteristics
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance



Architecture 

The Hadoop Common package contains the necessary Java ARchive (JAR) files and scripts needed
to start Hadoop. The package also provides source code, documentation and a contribution
section that includes projects from the Hadoop Community.

HDFS follows a master/slave architecture. An HDFS installation consists of a single
Namenode, a master server that manages the file system namespace and regulates
access to files by clients. In addition, there are a number of Datanodes, one per node in
the cluster, which manage the storage attached to the nodes that they run on.

The Namenode makes file system namespace operations, such as opening, closing, and renaming
files and directories, available via an RPC interface. It also determines the mapping
of blocks to Datanodes.

The Datanodes are responsible for serving read and write requests from file system
clients; they also perform block creation, deletion, and replication upon instruction from
the Namenode.
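
As an illustration of this division of labour, the sketch below uses the standard org.apache.hadoop.fs.FileSystem client API (the /demo paths are made up for this example): the namespace operations (mkdirs, rename, listStatus, delete) are resolved by the Namenode, while the bytes written and read flow to and from the Datanodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // client handle; metadata requests go to the Namenode

        Path dir = new Path("/demo");                      // hypothetical directory for this sketch
        fs.mkdirs(dir);                                    // namespace operation (Namenode)

        Path file = new Path(dir, "hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {   // blocks are written to Datanodes
            out.writeUTF("hello hdfs");
        }

        fs.rename(file, new Path(dir, "greeting.txt"));    // namespace operation (Namenode)

        try (FSDataInputStream in = fs.open(new Path(dir, "greeting.txt"))) {
            System.out.println(in.readUTF());              // blocks are read from Datanodes
        }

        for (FileStatus status : fs.listStatus(dir)) {     // listing resolved by the Namenode
            System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
        }

        fs.delete(dir, true);                              // recursive delete
    }
}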

Master node (Hadoop currently configured with centurion064 as the master node)
- Keeps track of the namespace and metadata about items
- Keeps track of MapReduce jobs in the system

Slave nodes (centurion064 also acts as a slave node)
- Manage blocks of data sent from the master node
- In terms of GFS, these are the chunk servers

Hadoop Master/Slave Architecture

Hadoop is designed as a master-slave, shared-nothing architecture.

In total, five daemons run in the Hadoop master-slave architecture.

On the Master Node: Name Node, Job Tracker, and Secondary Name Node

On the Slave Nodes: Data Node and Task Tracker

Name node:-
The name node is one of the daemons that runs on the master node. It holds the metadata describing where each particular chunk of data resides (i.e., on which data node), and based on that metadata it maps incoming work to the corresponding data nodes. The name node handles data corruption through checksums: every piece of data has a record followed by a checksum, and if the checksum does not match the original, a data-corruption error is reported.
The name node is responsible for managing HDFS and supplies the address of the data on the different data nodes. The name node requires a specified amount of RAM to store the file system image in memory; it holds information about all the files in the system and needs to be extra reliable. The name node will detect that a data node is not responsive and will start replication of its data from the remaining replicas. When the data node comes back online, the extra replicas will be deleted.


Note:- The replication factor controls how many times each individual block is replicated.
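
For illustration, the replication factor can be set as a default for new files through the dfs.replication property, or changed for an existing file through the client API (the path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);                 // default replication for files created by this client
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing (hypothetical) file to 2;
        // the name node will schedule removal or creation of replicas accordingly.
        fs.setReplication(new Path("/demo/greeting.txt"), (short) 2);
    }
}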

Secondary name node:- The secondary name node performs the CPU-intensive operation of combining the edit logs with the current file system snapshot. It is recommended to run the secondary name node on a separate machine with master-node capacity. If a secondary name node is not running, cluster performance will degrade over time, since the edit log will grow bigger and bigger.

Job Tracker:- The job tracker is responsible for scheduling tasks on the slave nodes, collecting results, and retrying failed tasks. There can be only one job tracker in the cluster, and it can run on the same machine as the name node. It receives heartbeats from the task trackers, and based on these the job tracker decides whether an assigned task has completed or not.
Note:- The default port for the job tracker web UI is 50030.
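
A minimal driver sketch that submits such a MapReduce job is shown below (it reuses the illustrative WordCount mapper and reducer from the earlier sketch; the input and output paths are made up). On classic MRv1 clusters the job tracker then schedules the individual map and reduce tasks onto the slave nodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");            // job name shown in the tracker UI
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);      // local pre-aggregation of map output
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/demo/input"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);         // block until the job finishes
    }
}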

Note:- A daemon is a process or service that runs in the background.

Datanode:- Datanodes are the slaves deployed on each machine; they provide the actual storage. The Datanodes are responsible for serving read and write requests from file system clients, and they also perform block creation, deletion, and replication upon instruction from the Namenode.

Task tracker:- The task tracker is also a daemon that runs on the datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job, divides the work, and assigns it to different task trackers to perform the MapReduce tasks. While performing this work, the task trackers continuously communicate with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.


Pig - Platform for analyzing large data sets. Pig Latin queries are converted into MapReduce jobs. Developed at Yahoo!
Hive - Data warehousing tool on top of Hadoop. Provides an SQL-like interface/language to analyze the data stored on HDFS. Developed at Facebook.
HBase - Persistent, distributed, sorted, multidimensional, sparse map. Based on Google's BigTable; provides interactive access to information.
Sqoop - Imports SQL-based (relational) data into Hadoop.
References
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004. (Google)
[2] D. Cutting and E. Baldeschwieler, "Meet Hadoop," OSCON, Portland, OR, USA, 25 July 2007. (Yahoo!)
[3] R. E. Bryant, "Data Intensive Scalable Computing: The Case for DISC," Tech Report CMU-CS-07-128, http://www.cs.cmu.edu/~bryant
