The main supported operating system is Linux.
However, with some additional software, Hadoop can be deployed on Windows.
What is the role of the namenode?
The namenode is the "brain" of the Hadoop cluster and is responsible for managing the distribution of blocks across the system based on the replication policy. The namenode also supplies the specific block addresses in response to client requests.
What happens on the namenode when a client tries to read a data file?
The namenode will look up the information about the file in the edit log and then retrieve the remaining information from the in-memory filesystem snapshot. Since the namenode needs to support a large number of clients, the primary namenode will only send back the location of the data. The datanode itself is responsible for the retrieval.
What are the hardware requirements for a Hadoop cluster (primary
and secondary namenodes and datanodes)?
There are no special requirements for datanodes. However, the namenodes require a specific amount of RAM to store the filesystem image in memory. By design, both the primary and secondary namenode keep the entire filesystem information in memory. Therefore, both namenodes need enough memory to hold the entire filesystem image.
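As a hedged sizing sketch (the per-object figure is a commonly cited rule of thumb, not from this document): each file, directory, and block costs on the order of 150 bytes of namenode heap, so roughly 10 million objects need about 1.5 GB. The heap itself is set in conf/hadoop-env.sh:

    # Give the namenode JVM a 4 GB heap (value is in MB; adjust to the
    # size of your namespace).
    export HADOOP_HEAPSIZE=4096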
What mode(s) can Hadoop code be run in?
Hadoop can be deployed in standalone mode, pseudo-distributed mode, or fully-distributed mode. Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine, and even as a single process, for testing purposes.
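For illustration, standalone mode needs no daemons at all; a sketch (the examples jar name varies by release, so treat it as a placeholder):

    # Standalone mode: the job runs as a single JVM process against the
    # local filesystem, which is handy for testing.
    hadoop jar hadoop-examples-*.jar wordcount input/ output/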
How
would a Hadoop administrator deploy various components of Hadoop in
production?
Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes. There is a need for only one namenode and jobtracker on the system. The number of datanodes depends on the available hardware.
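With the classic control scripts this amounts to two commands on the master node:

    # Starts the namenode locally and a datanode on every host listed
    # in conf/slaves.
    start-dfs.sh
    # Starts the jobtracker locally and tasktrackers on the slaves.
    start-mapred.sh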
What is the best practice to deploy the secondary namenode?
Deploy the secondary namenode on a separate standalone machine. The secondary namenode needs to be deployed on a separate machine so that it will not interfere with primary namenode operations. It has the same memory requirements as the main namenode.
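In the classic distribution the host that runs the secondary namenode is, somewhat confusingly, listed in the masters file (the hostname below is hypothetical):

    # conf/masters names the host(s) where start-dfs.sh launches the
    # *secondary* namenode, not the primary.
    echo "checkpoint1.example.com" > conf/masters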
Is there a standard procedure to deploy Hadoop?
No, there are some differences between the various distributions. However, they all require that the Hadoop jars be installed on the machine. There are some common requirements for all Hadoop distributions, but the specific procedures will differ between vendors, since they all include some degree of proprietary software.
What is the role of the secondary namenode?
The secondary namenode performs the CPU-intensive operation of combining the edit log with the current filesystem snapshot. It was separated out into its own process because of this CPU-intensive work and the additional requirement to back up the metadata.
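A sketch of inspecting and forcing that checkpoint by hand (the interval is normally driven by fs.checkpoint.period, 3600 seconds by default):

    # Report the current size of the namenode's edit log.
    hadoop secondarynamenode -geteditsize
    # Run a checkpoint immediately instead of waiting for the period.
    hadoop secondarynamenode -checkpoint force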
What are the side effects of not running a secondary name node?
Cluster performance will degrade over time, since the edit log will grow bigger and bigger. If the secondary namenode is not running at all, the edit log will grow significantly and slow the system down. Also, the system will go into safemode for an extended time on restart, since the namenode needs to combine the edit log with the current filesystem checkpoint image.
What happens if a datanode loses its network connection for a few minutes?
The namenode will detect that a datanode is not responsive and will start replication of the data from the remaining replicas. When the datanode comes back online, the extra replicas will be deleted.
The replication factor is actively maintained by the namenode. The namenode monitors the status of all datanodes and keeps track of which blocks are located on each node. The moment a datanode becomes unavailable, the namenode triggers replication of the data from the existing replicas. However, if the datanode comes back up, the over-replicated data will be deleted. Note: the data might be deleted from the original datanode.
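The process can be watched from the shell; for example:

    # Lists live and dead datanodes, together with per-node capacity and
    # usage, as seen by the namenode.
    hadoop dfsadmin -report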
What happens if one of the datanodes has a much slower CPU?
Task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact. Hadoop was specifically designed to work with commodity hardware, and speculative execution helps to offset slow workers. Multiple instances of the same task are created, the jobtracker takes the first result into consideration, and the second instance of the task is killed.
What is speculative execution?
If speculative execution is enabled, the jobtracker will issue multiple instances of the same task on multiple nodes and will take the result of the task that finishes first. The other instances of the task will be killed.
Speculative execution is used to offset the impact of slow workers in the cluster. The jobtracker creates multiple instances of the same task and takes the result of the first successful one. The rest of the tasks are discarded.
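In classic MapReduce this is controlled per job type by two properties (enabled by default); a sketch, assuming a job that uses ToolRunner so -D options are picked up, with a hypothetical jar and class:

    # Disable speculative execution for both maps and reduces of one job.
    hadoop jar analytics.jar MyJob \
        -D mapred.map.tasks.speculative.execution=false \
        -D mapred.reduce.tasks.speculative.execution=false \
        input/ output/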
How many racks do you need to create an Hadoop cluster in order
to make sure that the cluster operates reliably?
In order to ensure reliable operation, it is recommended to have at least 2 racks with rack placement configured. Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.
Are there any special requirements for namenode?
Yes, the namenode holds information about all files in the system and needs to be extra reliable. The namenode is a single point of failure. It needs to be extra reliable, and its metadata needs to be replicated in multiple places. Note that the community is working on solving the single point of failure issue with the namenode.
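The usual mitigation is to write the metadata to several directories, one of them typically an NFS mount; a sketch with hypothetical paths:

    # hdfs-site.xml: the namenode writes its image and edit log to every
    # directory in the comma-separated list, so losing one disk does not
    # lose the metadata.
    #   dfs.name.dir = /data/1/dfs/nn,/mnt/nfs/dfs/nn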
If you have a file of 128M in size and the replication factor is set to 3,
how many blocks can you find on the cluster that will correspond to that file
(assuming the default Apache and Cloudera configuration)?
6
Based on the configuration settings, the file will be divided into multiple blocks according to the default block size of 64M: 128M / 64M = 2. Each block will be replicated according to the replication factor setting (default 3): 2 * 3 = 6.
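The arithmetic can be checked against a real file (the path is hypothetical):

    # Prints size in bytes, block size, and replication factor.
    hadoop fs -stat "%b %o %r" /user/alice/file128m
    # With the defaults above this should print: 134217728 67108864 3,
    # i.e. 2 blocks x 3 replicas = 6 stored blocks.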
What is distributed copy (distcp)?
Distcp is a Hadoop utility for launching MapReduce jobs to copy data. The primary usage is for copying a large amount of data. One of the major challenges in the Hadoop environment is copying data across multiple clusters, and distcp allows multiple datanodes to be leveraged for parallel copying of the data.
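A typical invocation, with hypothetical host names and the common namenode port:

    # Copy a directory tree between clusters; the copy itself runs as a
    # MapReduce job, so many nodes move the data in parallel.
    hadoop distcp hdfs://nn1.example.com:8020/src hdfs://nn2.example.com:8020/dest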
What is replication factor?
Replication factor controls how many times each individual block is replicated.
Data is replicated in the Hadoop cluster based on the replication factor. A high replication factor guarantees data availability in the event of failure.
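The cluster default comes from dfs.replication in hdfs-site.xml; the factor can also be changed per path, for example (the path is hypothetical):

    # Set replication to 3 for everything under the path; -R recurses,
    # -w waits until re-replication actually completes.
    hadoop fs -setrep -R -w 3 /user/alice/dataset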
What daemons run on Master nodes?
NameNode, Secondary NameNode and
JobTracker
Hadoop is comprised of five separate daemons, and each of these daemons runs in its own JVM. NameNode, Secondary NameNode and JobTracker run on master nodes. DataNode and TaskTracker run on each slave node.
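Since each daemon is its own JVM, the JDK's jps tool shows at a glance which daemons a given node is running:

    # On a master node expect NameNode, SecondaryNameNode and JobTracker;
    # on a slave node, DataNode and TaskTracker.
    jps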
What is rack awareness?
Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions. Hadoop will try to minimize the network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this due to rack awareness.
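The rack definitions usually come from an administrator-supplied script named by topology.script.file.name in core-site.xml; a minimal sketch (the subnet-to-rack mapping is hypothetical):

    #!/bin/sh
    # Print a rack name for each datanode address Hadoop passes in.
    for node in "$@"; do
        case "$node" in
            10.1.*) echo "/rack1" ;;
            10.2.*) echo "/rack2" ;;
            *)      echo "/default-rack" ;;
        esac
    done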
What is the role of the jobtracker in a Hadoop cluster?
The jobtracker is responsible for scheduling tasks on slave nodes, collecting results, and retrying failed tasks. The jobtracker is the main component of map-reduce execution. It controls the division of the job into smaller tasks, submits tasks to individual tasktrackers, tracks the progress of the jobs, and reports results back to the calling code.
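The jobtracker's view of the work can be queried from the shell, for example:

    # List the jobs the jobtracker currently knows about.
    hadoop job -list
    # Show completion and counters for one job (the id is hypothetical).
    hadoop job -status job_201301011200_0001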
How does the Hadoop cluster tolerate datanode failures?
Since Hadoop is designed to run on commodity hardware, datanode failures are expected. The namenode keeps track of all available datanodes and actively maintains the replication factor on all data.
The namenode actively tracks the status of all datanodes and acts immediately if a datanode becomes non-responsive. The namenode is the central "brain" of HDFS and starts replication of the data the moment a disconnect is detected.
What is the procedure for namenode recovery?
A namenode can be recovered in two ways: starting a new namenode from backup metadata, or promoting the secondary namenode to primary.
The namenode recovery procedure is very important to ensure the reliability of the data. It can be accomplished by starting a new namenode using backup data or by promoting the secondary namenode to primary.
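A sketch of the first option, assuming the checkpoint directory from the secondary namenode is reachable from the replacement machine:

    # Start a fresh namenode with an empty dfs.name.dir; it loads the
    # filesystem image from fs.checkpoint.dir and saves it into its own
    # storage directories.
    hadoop namenode -importCheckpoint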
The Web-UI shows that half of the datanodes are in decommissioning
mode. What does that mean? Is it safe to remove those nodes from the network?
This means that the namenode is trying to retrieve data from those datanodes by moving the replicas to the remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes before decommissioning is finished.
Due to the replication strategy, it is possible to lose some data if datanodes are removed en masse before the decommissioning process completes. Decommissioning refers to the namenode moving the replicas held on those datanodes to the remaining datanodes.
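Decommissioning is started deliberately by the administrator; the usual sequence, with a hypothetical exclude file and host:

    # Add the host to the file named by dfs.hosts.exclude in
    # hdfs-site.xml...
    echo "dn7.example.com" >> /etc/hadoop/conf/dfs.exclude
    # ...then have the namenode re-read it; the node stays in
    # "decommissioning" until its replicas are copied elsewhere.
    hadoop dfsadmin -refreshNodes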
What does the Hadoop administrator have to do after adding new
datanodes to the Hadoop cluster?
Since the new nodes will not have any
data on them, the administrator needs to start the balancer to redistribute
data evenly between all nodes.
The Hadoop cluster will detect new datanodes automatically. However, in order to optimize cluster performance, it is recommended to start the balancer to redistribute the data between datanodes evenly.
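For example:

    # Start the balancer; the threshold is the allowed deviation, in
    # percent, of a datanode's utilization from the cluster average.
    start-balancer.sh -threshold 10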
If the Hadoop administrator needs to make a change, which
configuration file does he need to change?
Each node in the Hadoop cluster has its own configuration files, and the changes need to be made in every file. One of the reasons for this is that the configuration can be different for every node.
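In practice the distribution of the configuration directory is scripted; a minimal sketch with hypothetical paths:

    # Push the same base configuration to every slave; per-node
    # differences can be overlaid afterwards.
    for host in $(cat conf/slaves); do
        rsync -av conf/ "$host:/etc/hadoop/conf/"
    done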
Map Reduce jobs are failing on a cluster that was just restarted. They worked before the restart. What could be wrong?
The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode before restarting the jobs.
This is a very common mistake by Hadoop administrators when there is no secondary namenode on the cluster and the cluster has not been restarted in a long time. The namenode will go into safemode and combine the edit log with the current filesystem checkpoint.
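The state is easy to check and wait out from the shell:

    # Report whether the namenode is currently in safe mode.
    hadoop dfsadmin -safemode get
    # Block until the namenode leaves safe mode, then resubmit the jobs.
    hadoop dfsadmin -safemode wait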
Map Reduce jobs take too long. What can be done to improve the
performance of the cluster?
One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of tasks. The number of tasks has to match the number of available slots on the cluster.
Hadoop is not a hardware-aware system. It is the responsibility of the developers and the administrators to make sure that resource supply and demand match.
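The slot supply is configured per tasktracker; a hedged sketch (the property names are from classic MapReduce, the values are illustrative):

    # mapred-site.xml on each tasktracker:
    #   mapred.tasktracker.map.tasks.maximum    = 4
    #   mapred.tasktracker.reduce.tasks.maximum = 2
    # List the tasktrackers the jobtracker can currently schedule on.
    hadoop job -list-active-trackers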
How often do you need to reformat the namenode?
Never. The namenode needs to be formatted only once, in the beginning. Reformatting the namenode will lead to loss of the data on the entire cluster.
The namenode is the only system that needs to be formatted, and only once. Formatting creates the directory structure for the filesystem metadata and the namespaceID for the entire filesystem.
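The one-time command, for reference (destructive, which is exactly why it is never repeated):

    # Run once before the very first start of the cluster. Running it on
    # a live cluster wipes the metadata and orphans every block.
    hadoop namenode -format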
After increasing the replication level, I still see that data is
under replicated. What could be wrong?
Data replication takes time due to the large quantities of data. The Hadoop administrator should allow sufficient time for data replication.
Depending on the data size, the replication will take some time. The Hadoop cluster still needs to copy data around, and if the data size is big enough, it is not uncommon for replication to take from a few minutes to a few hours.
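Progress can be monitored rather than guessed at, for example:

    # The fsck summary includes a count of under-replicated blocks;
    # rerun periodically and watch the number fall.
    hadoop fsck / | grep -i "under"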