Hadoop's Distributed File System (HDFS) is designed to reliably store very large files across the machines of a large cluster. It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
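Since the block size and replication factor can be set per file, the following is a minimal sketch of what that looks like through Hadoop's Java FileSystem API (the path, buffer size, and values here are illustrative assumptions, not part of the text above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the configured file system (HDFS)

        Path file = new Path("/user/demo/events.log");   // illustrative path
        short replication = 3;                           // per-file replication factor
        long blockSize = 64L * 1024 * 1024;              // per-file block size: 64 MB

        // create() opens the single writer this "write once" file is allowed to have
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello HDFS");
        }
    }
}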
HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode together with a cluster of datanodes; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients communicate with the namenode using remote procedure calls (RPC). HDFS stores large files (ideally sized in multiples of the 64 MB block size) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on the hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. Datanodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
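As a rough illustration of the client/namenode interaction described above, the sketch below asks the namenode which datanodes hold each block of a file (the path is an assumption for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");    // illustrative path

        FileStatus status = fs.getFileStatus(file);
        // The namenode answers with the datanodes holding each block replica
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + ", length " + block.getLength()
                + ", hosts " + String.join(",", block.getHosts()));
        }
    }
}

With the default replication factor of 3, each printed block should list three hosts, typically spread over two racks.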
Hadoop Distributed File System – Goals:
• Store large data sets
• Cope with hardware failure
• Emphasize streaming data access
The Hadoop Distributed File System (HDFS) is a
distributed file system designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences
from other distributed file systems are significant.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
• HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.
Data Storage in HDFS
A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the 8192-byte block size typical of Unix/Linux file systems. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks primarily to minimize the cost of seeks.
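To see these numbers for a real file, the sketch below reads the block size recorded for a file and works out how many block-sized chunks the file occupies (the path is illustrative, and 64 MB is only the default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockCountExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.dat")); // illustrative path

        long length = status.getLen();          // file size in bytes
        long blockSize = status.getBlockSize(); // block size recorded for this file (64 MB by default)

        // A file is split into block-sized chunks; only the last chunk may be smaller
        long fullBlocks = length / blockSize;
        long remainder = length % blockSize;
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        System.out.println(length + " bytes -> " + totalBlocks + " blocks"
            + (remainder > 0 ? " (last block " + remainder + " bytes)" : ""));
    }
}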
A file can be larger than any single disk in the network. Nothing requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file also simplifies the storage subsystem. Blocks provide fault tolerance and availability: to guard against corrupted blocks and against disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
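Because replication is tracked per file, the factor can also be changed after a file has been written; here is a minimal sketch, assuming an existing file at an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");   // illustrative path

        // Ask the namenode to keep 3 replicas of every block of this file;
        // datanodes then copy or delete block replicas in the background to match.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("replication change accepted: " + accepted);
    }
}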
Case 1: If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?
No, not at all! 64 MB is just the unit in which the data is stored. In this particular situation, only 50 MB will be consumed by the HDFS block, and the remaining 14 MB is free to store something else. It is the master node (namenode) that allocates the data in an efficient manner.
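The arithmetic behind that answer can be written out directly (plain Java, using only the numbers from the example above):

public class Case1Arithmetic {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // default block size: 64 MB
        long fileSize  = 50L * 1024 * 1024;   // the 50 MB file from the example

        // ceil(50 / 64) = 1 block, and that block occupies only 50 MB on disk
        long blocksUsed = (fileSize + blockSize - 1) / blockSize;
        long lastBlockBytes = (fileSize % blockSize == 0) ? blockSize : fileSize % blockSize;

        System.out.println(blocksUsed + " block(s), last block occupies " + lastBlockBytes + " bytes");
    }
}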
Case 2: If we want to copy 10 blocks from one machine to another, but the other machine can hold only 8.5 blocks, can the blocks be broken up at the time of replication?
In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the master node figures out the actual amount of space required, how many blocks are being used, and how much space is available, and it allocates the blocks accordingly.
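The kind of space check mentioned here can also be approximated from the client side; a minimal sketch, assuming the figures reported by FileSystem.getStatus() reflect the namenode's view of the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class CapacityExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        FsStatus status = fs.getStatus();  // overall capacity, used, and remaining bytes
        System.out.println("capacity:  " + status.getCapacity()  + " bytes");
        System.out.println("used:      " + status.getUsed()      + " bytes");
        System.out.println("remaining: " + status.getRemaining() + " bytes");
    }
}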