Introduction to Hadoop:
Hadoop is an open-source software framework for the distributed processing of large datasets across large clusters of computers.
Data size units (each step is a factor of 1024):
- 1 bit = 0 or 1
- 8 bits = 1 byte
- 1024 bytes = 1 KiloByte (KB)
- 1024 KB = 1 MegaByte (MB)
- 1024 MB = 1 GigaByte (GB)
- 1024 GB = 1 TeraByte (TB)
- 1024 TB = 1 PetaByte (PB)
- 1024 PB = 1 ExaByte (EB)
- 1024 EB = 1 ZettaByte (ZB)
|
Hadoop History
Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. It was created by Doug Cutting (then at Yahoo!) and Michael J. Cafarella, and named after the toy elephant of Doug's son.
- Dec 2004: Google GFS paper published
- July 2005: Nutch uses MapReduce
- Feb 2006: Starts as a Lucene subproject
- Apr 2007: Yahoo! on 1000-node cluster
- Jan 2008: An Apache top-level project
- Jul 2008: A 4000-node test cluster
- May 2009: Hadoop sorts a petabyte in 17 hours
Aggregate I/O bandwidth scales with cluster size:
- 1 hard disk = 100 MB/sec (~1 Gbps)
- 1 server = 12 hard disks = 1.2 GB/sec (~12 Gbps)
- 1 rack = 20 servers = 24 GB/sec (~240 Gbps)
- Average cluster = 6 racks = 144 GB/sec (~1.4 Tbps)
- Large cluster = 200 racks = 4.8 TB/sec (~48 Tbps)
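As a quick check of these numbers, here is a small plain-Java sketch (no Hadoop dependencies; the per-disk, per-server and per-rack figures are the same assumptions used above) that reproduces the multiplication chain.

public class AggregateBandwidth {
    public static void main(String[] args) {
        double diskMBps = 100;                      // one hard disk: ~100 MB/s
        double serverMBps = diskMBps * 12;          // 12 disks per server
        double rackMBps = serverMBps * 20;          // 20 servers per rack
        double avgClusterMBps = rackMBps * 6;       // average cluster: 6 racks
        double largeClusterMBps = rackMBps * 200;   // large cluster: 200 racks

        System.out.printf("Server: %.1f GB/s%n", serverMBps / 1000);
        System.out.printf("Rack: %.1f GB/s%n", rackMBps / 1000);
        System.out.printf("Avg. cluster: %.1f GB/s%n", avgClusterMBps / 1000);
        System.out.printf("Large cluster: %.1f TB/s%n", largeClusterMBps / 1_000_000);
    }
}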
|
HADOOP
Hadoop is an open-source software framework for the distributed processing of large datasets across large clusters of computers.
Large datasets → terabytes or petabytes of data
Large clusters → hundreds or thousands of nodes
(In other terms…)
It provides a distributed file system and a framework for the analysis and transformation of very large data sets using the Map/Reduce paradigm.
Hadoop was inspired by Google's MapReduce and Google File System (GFS). Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses. Hadoop applications follow a write-once-read-many access model.
The Hadoop framework consists of two main layers:
Distributed file system (HDFS) - The Hadoop Distributed File System stores large amounts of data in files spread across the many machines of a Hadoop cluster.
Execution engine (MapReduce) - MapReduce is the framework used to access and manipulate large data sets over a Hadoop cluster.
- Hadoop is a framework for running applications on large clusters built of commodity hardware.
- The Hadoop framework transparently provides applications both reliability and data motion.
- Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
- In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
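To make the Map/Reduce paradigm described above concrete, below is a minimal word-count sketch against the standard Hadoop MapReduce Java API. The class names and the command-line input/output paths are illustrative, and it assumes the Hadoop client libraries are on the classpath; treat it as a sketch, not a reference implementation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper works on one fragment (split) of the input
    // and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each mapper processes one input split and emits (word, 1) pairs; the framework groups the pairs by key, and each reducer sums the counts. Any of these small fragments of work can be re-executed on another node if its node fails, which is exactly the model described above.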
Why is Hadoop able to compete?
-- Hadoop is not a database; it is an architecture with a file system called HDFS. Data is stored in HDFS, which does not impose any predefined containers.
-- A relational database stores data in predefined containers.
Distributed Databases vs. Hadoop:

Computing Model
- Distributed databases: notion of transactions; a transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; a job is the unit of work; no concurrency control

Data Model
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data will fit in any format, (un)(semi)structured; read-only mode

Cost Model
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines

Fault Tolerance
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance

Key Characteristics
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance
Architecture
The Hadoop distribution package contains everything needed to start Hadoop. The package also provides source code, documentation and a contribution section that includes projects from the Hadoop community.
HDFS follows a master/slave architecture. An HDFS installation consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, one per node in the cluster, which manage storage attached to the nodes that they run on.
The NameNode makes file system namespace operations such as opening, closing and renaming files and directories available via an RPC interface. It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from file system clients; they also perform block creation, deletion, and replication upon instruction from the NameNode.
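As an illustration of how a client interacts with this architecture, the sketch below uses the Hadoop FileSystem Java API. The file path is hypothetical, and the cluster address is assumed to come from core-site.xml on the classpath. Namespace operations (create, open) go to the NameNode over RPC, while the actual bytes are streamed to and from DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);       // RPC handle backed by the NameNode

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the NameNode allocates blocks, the client streams the data to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes("UTF-8"));
        }

        // Read: the NameNode returns block locations, the client reads from DataNodes.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}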
Master node (Hadoop currently configured with centurion064 as the master node):
- Keeps track of the namespace and metadata about items
- Keeps track of MapReduce jobs in the system

Slave nodes (centurion064 also acts as a slave node):
- Manage blocks of data sent from the master node
- In terms of GFS, these are the chunk servers
Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave, shared-nothing architecture.
In total, five daemons run in the Hadoop master-slave architecture:
On the master node: NameNode, Secondary NameNode and JobTracker
On each slave node: DataNode and TaskTracker
NameNode:-
The NameNode is one of the daemons that runs on the master node; it holds the metadata describing on which DataNode each chunk of data resides, and based on this metadata it maps incoming jobs to the corresponding DataNodes. The NameNode detects data corruption through checksums: every piece of data is stored with a checksum, and if a checksum does not match the original, a data-corruption error is reported.
The NameNode is responsible for managing HDFS and supplies the addresses of the data on the different DataNodes. The NameNode requires a sufficient amount of RAM to hold the file system image in memory; it holds information about all the files in the system and therefore needs to be extra reliable. The NameNode will detect that a DataNode is not responsive and will start re-replication of its data from the remaining replicas. When the DataNode comes back online, the extra replicas will be deleted.
Note:- The replication factor controls how many copies of each individual block are stored.
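For illustration, the replication factor of an individual file can also be changed through the same FileSystem API. The path below is hypothetical; the cluster-wide default comes from the dfs.replication property (typically 3).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask HDFS to keep 3 copies of each block of this (hypothetical) file.
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);

        fs.close();
    }
}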
Secondary NameNode:- The Secondary NameNode performs the CPU-intensive operation of merging the edit logs with the current file system snapshot. It is recommended to run the Secondary NameNode on a separate machine with master-node capacity. If a Secondary NameNode is not running, cluster performance will degrade over time, since the edit log will grow bigger and bigger.
JobTracker:- The JobTracker is responsible for scheduling tasks on the slave nodes, collecting results, and retrying failed tasks. There can be only one JobTracker in the cluster, and it can run on the same machine as the NameNode. It receives heartbeats from the TaskTrackers, based on which it decides whether an assigned task has completed.
Note:- The default port for the JobTracker web interface is 50030.
Note:- A daemon is a process or service that runs in the background.
DataNode:- DataNodes are the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from file system clients, and they also perform block creation, deletion, and replication upon instruction from the NameNode.
TaskTracker:- The TaskTracker is also a daemon; it runs on the DataNodes and manages the execution of individual tasks on the slave node. When a client submits a job, the JobTracker initializes the job, divides the work, and assigns it to different TaskTrackers to perform the MapReduce tasks. While performing these tasks, each TaskTracker continuously communicates with the JobTracker by sending heartbeats. If the JobTracker does not receive a heartbeat from a TaskTracker within a specified time, it assumes that the TaskTracker has crashed and assigns its tasks to another TaskTracker in the cluster.
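A minimal sketch of this flow from the client side (reusing the hypothetical WordCount mapper and reducer from the earlier sketch, with illustrative HDFS paths): the client submits the job, the framework splits the work across the TaskTrackers, and the progress reported back through heartbeats can be polled by the client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitAndMonitor {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(SubmitAndMonitor.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);   // from the earlier sketch
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        job.submit();   // hand the job to the framework, which schedules tasks on the slaves

        // Poll progress; the framework aggregates it from the task heartbeats.
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
    }
}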