Monday, 17 August 2015

HADOOP

Introduction to Hadoop:


Hadoop is an open-source software framework for distributed processing of large datasets across large clusters of computers.

0 or 1 = 1 bit
8 bits = 1 byte
1024 bytes = 1 KiloByte
1024 KB = 1 MegaByte
1024 MB = 1 GigaByte
1024 GB = 1 TeraByte
1024 TB = 1 PetaByte
1024 PB = 1 ExaByte
1024 EB = 1 ZettaByte



Hadoop History

Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop was created by Doug Cutting (then working at Yahoo!) and Michael J. Cafarella, and named after Doug's son's toy elephant.

Dec 2004 - Google GFS paper published
July 2005 - Nutch uses MapReduce
Feb 2006 - Starts as a Lucene subproject
Apr 2007 - Yahoo! runs it on a 1000-node cluster
Jan 2008 - Becomes an Apache top-level project
Jul 2008 - A 4000-node test cluster
May 2009 - Hadoop sorts a petabyte in 17 hours

1 Hard Disk = 100MB/sec (~1Gbps)
Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)
Rack = 20 Servers = 24GB/sec (~240Gbps)
Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)
Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)
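
To see why this aggregate bandwidth matters, the short sketch below (a back-of-the-envelope calculation in Java, using the assumed throughput figures above; the class name is just illustrative) compares how long a full scan of 1 TB takes on a single disk versus spread across an average cluster:

// Rough scan-time comparison based on the assumed throughput figures above.
public class ScanTimeEstimate {
    public static void main(String[] args) {
        double oneTerabyte = 1024.0 * 1024 * 1024 * 1024;   // bytes in 1 TB
        double singleDisk  = 100.0 * 1024 * 1024;           // ~100 MB/sec for one disk
        double avgCluster  = 144.0 * 1024 * 1024 * 1024;    // ~144 GB/sec for a 6-rack cluster

        System.out.printf("Single disk: %.0f seconds (~%.1f hours)%n",
                oneTerabyte / singleDisk, oneTerabyte / singleDisk / 3600);
        System.out.printf("Avg cluster: %.1f seconds%n", oneTerabyte / avgCluster);
    }
}

Reading the same terabyte drops from roughly three hours on one disk to a few seconds when the work is spread over a whole cluster, which is exactly the effect Hadoop is designed to exploit.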

HADOOP

Hadoop is an open-source software framework for distributed processing of large datasets across large clusters of computers.
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
(In other terms…)
It provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm.
Hadoop was inspired by Google's MapReduce and Google File System (GFS). Hadoop is a top-level Apache project built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses. Hadoop applications assume a write-once-read-many access model for files.
The Hadoop framework consists of two main layers:
Distributed file system (HDFS) - the Hadoop Distributed File System stores large amounts of data in files spread across the many machines of a Hadoop cluster.
Execution engine (MapReduce) - MapReduce is the programming model and set of programs used to access and manipulate large data sets stored on a Hadoop cluster.

- Hadoop is a framework for running applications on large clusters built of commodity hardware.
- The Hadoop framework transparently provides applications with both reliability and data motion.
- Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster (a minimal word-count sketch follows this list).
- In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
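
To make the Map/Reduce paradigm concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API (the WordCount class and its nested mapper/reducer names are illustrative, not part of any particular distribution): the map fragments emit (word, 1) pairs, and the reduce tasks sum the counts for each word, on whichever nodes the framework schedules them.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: each small fragment of input text is turned into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all pairs sharing the same word are summed, wherever the task runs.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

If a map or reduce task fails, the framework simply re-executes that fragment on another node; the code above does not have to handle node failures itself.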

Why is Hadoop able to compete?

-- Hadoop is not a database; it is an architecture with a file system called HDFS. The data is stored in HDFS, which does not impose any predefined containers on it.
-- A relational database, by contrast, stores data in predefined containers (tables with a fixed schema).

Distributed Databases vs. Hadoop

Computing Model
- Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties and concurrency control
- Hadoop: notion of jobs; the job is the unit of work; no concurrency control

Data Model
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data will fit in any format - (un)structured, semi-structured, structured; read-only mode

Cost Model
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines

Fault Tolerance
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance

Key Characteristics
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance



Architecture 

The Hadoop Common package contains the necessary Java ARchive (JAR) files and scripts needed
to start Hadoop. The package also provides source code, documentation and a contribution
section that includes projects from the Hadoop Community.

HDFS follows a master/slave architecture. An HDFS installation consists of a single
Namenode, a master server that manages the file system namespace and regulates
access to files by clients. In addition, there are a number of Datanodes, one per node in
the cluster, which manage the storage attached to the nodes that they run on.

The Namenode makes file system namespace operations, such as opening, closing, and renaming
files and directories, available via an RPC interface. It also determines the mapping
of blocks to Datanodes.

The Datanodes are responsible for serving read and write requests from file system
clients; they also perform block creation, deletion, and replication upon instruction from
the Namenode.
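
As an illustration of this division of labour, the sketch below uses the standard org.apache.hadoop.fs.FileSystem client API (the /demo paths are made up for this example): the namespace operations (mkdirs, rename, listStatus, delete) are resolved by the Namenode, while the bytes written and read flow to and from the Datanodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // client handle; metadata requests go to the Namenode

        Path dir = new Path("/demo");                      // hypothetical directory for this sketch
        fs.mkdirs(dir);                                    // namespace operation (Namenode)

        Path file = new Path(dir, "hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {   // blocks are written to Datanodes
            out.writeUTF("hello hdfs");
        }

        fs.rename(file, new Path(dir, "greeting.txt"));    // namespace operation (Namenode)

        try (FSDataInputStream in = fs.open(new Path(dir, "greeting.txt"))) {
            System.out.println(in.readUTF());              // blocks are read from Datanodes
        }

        for (FileStatus status : fs.listStatus(dir)) {     // listing resolved by the Namenode
            System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
        }

        fs.delete(dir, true);                              // recursive delete
    }
}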

Master node (Hadoop currently configured with centurion064 as the master node)
- Keeps track of the namespace and metadata about items
- Keeps track of MapReduce jobs in the system

Slave nodes (centurion064 also acts as a slave node)
- Manage blocks of data sent from the master node
- In terms of GFS, these are the chunk servers

Hadoop Master/Slave Architecture

Hadoop is designed as a master-slave, shared-nothing architecture.

In total, five daemons run in the Hadoop master-slave architecture.

On the Master Node: Name Node, Job Tracker, and Secondary Name Node

On the Slave Nodes: Data Node and Task Tracker

Name node:-
The name node is one of the daemons that runs on the master node. It holds the metadata describing where each particular chunk of data resides (i.e., on which data node), and based on that metadata it maps incoming work to the corresponding data nodes. The name node handles data corruption through checksums: every piece of data has a record followed by a checksum, and if the checksum does not match the original, a data-corruption error is reported.
The name node is responsible for managing HDFS and supplies the address of the data on the different data nodes. The name node requires a specified amount of RAM to store the file system image in memory; it holds information about all the files in the system and needs to be extra reliable. The name node will detect that a data node is not responsive and will start replication of its data from the remaining replicas. When the data node comes back online, the extra replicas will be deleted.


Note:- The replication factor controls how many times each individual block is replicated.
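
For illustration, the replication factor can be set as a default for new files through the dfs.replication property, or changed for an existing file through the client API (the path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);                 // default replication for files created by this client
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing (hypothetical) file to 2;
        // the name node will schedule removal or creation of replicas accordingly.
        fs.setReplication(new Path("/demo/greeting.txt"), (short) 2);
    }
}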

Secondary name node:- The secondary name node performs the CPU-intensive operation of combining the edit logs with the current file system snapshot. It is recommended to run the secondary name node on a separate machine with master-node capacity. If a secondary name node is not running, cluster performance will degrade over time, since the edit log will grow bigger and bigger.

Job Tracker:- The job tracker is responsible for scheduling tasks on the slave nodes, collecting results, and retrying failed tasks. There can be only one job tracker in the cluster, and it can run on the same machine as the name node. It receives heartbeats from the task trackers, and based on these the job tracker decides whether an assigned task has completed or not.
Note:- The default port for the job tracker web UI is 50030.
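
A minimal driver sketch that submits such a MapReduce job is shown below (it reuses the illustrative WordCount mapper and reducer from the earlier sketch; the input and output paths are made up). On classic MRv1 clusters the job tracker then schedules the individual map and reduce tasks onto the slave nodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");            // job name shown in the tracker UI
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);      // local pre-aggregation of map output
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/demo/input"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);         // block until the job finishes
    }
}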

Note:- A daemon is a process or service that runs in the background.

Datanode:- Datanodes are the slaves deployed on each machine; they provide the actual storage. The Datanodes are responsible for serving read and write requests from file system clients, and they also perform block creation, deletion, and replication upon instruction from the Namenode.

Task tracker:- The task tracker is also a daemon that runs on the datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job, divides the work, and assigns it to different task trackers to perform the MapReduce tasks. While performing this work, the task trackers continuously communicate with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.


Pig - Platform for analyzing large data sets. Pig Latin queries are converted into MapReduce jobs. Developed at Yahoo!
Hive - Data warehousing tool on top of Hadoop. Provides an SQL-like interface/language to analyze the data stored on HDFS. Developed at Facebook.
HBase - Persistent, distributed, sorted, multidimensional, sparse map. Based on Google's BigTable; provides interactive access to information.
Sqoop - Imports SQL-based (relational) data into Hadoop.
References
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004. (Google)
[2] D. Cutting and E. Baldeschwieler, "Meet Hadoop," OSCON, Portland, OR, USA, 25 July 2007. (Yahoo!)
[3] R. E. Bryant, "Data Intensive Scalable Computing: The Case for DISC," Tech Report CMU-CS-07-128, http://www.cs.cmu.edu/~bryant
