Can you explain YARN and its components in detail? For example:
1. What is the Resource Manager and how does it work? What happens when it goes down, and how is everything restored when the Resource Manager restarts?
2. What is the Node Manager?
3. What is the Application Manager?
Please find the answers below:
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation's open source distributed processing framework. Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for Big Data applications.
YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.
For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs.
The original incarnation of Hadoop closely paired the Hadoop Distributed File System (HDFS) with the batch-oriented MapReduce programming framework, which handles resource management and job scheduling on Hadoop systems and supports the parsing and condensing of data sets in parallel.
YARN combines a central resource manager that reconciles the way applications use Hadoop system resources with node manager agents that monitor the processing operations of individual cluster nodes. Running on clusters of commodity hardware, Hadoop has attracted particular interest as a staging area and data store for large volumes of structured and unstructured data intended for use in analytic applications. Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can't wait for batch jobs to finish.
The fundamental idea of YARN is to split the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate entities.
In Hadoop 2.0, the JobTracker and TaskTracker no longer exist and have been replaced by three components:
ResourceManager: a scheduler that allocates available resources in the cluster amongst the competing applications.
NodeManager: runs on each node in the cluster and takes direction from the ResourceManager. It is responsible for managing resources available on a single node.
ApplicationMaster: An instance of a framework-specific library, an ApplicationMaster runs a specific YARN job and is responsible for negotiating resources from the ResourceManager and also working with the NodeManager to execute and monitor Containers.
The actual data processing occurs within the Containers, which the ApplicationMaster requests and the NodeManagers launch on its behalf.
A Container grants rights to an application to use a specific amount of resources (memory, CPU, etc.) on a specific host.
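To make the relationship between the three components and Containers more concrete, here is a minimal toy model in Python. It is only a sketch, not the YARN API: the class names mirror the YARN concepts, but the methods (can_fit, allocate, request), the first-fit placement and the node sizes are assumptions made up for illustration. The real ResourceManager uses pluggable schedulers and a heartbeat-based protocol.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Container:
    """A grant to use a fixed amount of resources on one specific host."""
    container_id: int
    host: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Tracks the free resources of a single node (hypothetical model)."""
    host: str
    free_memory_mb: int
    free_vcores: int

    def can_fit(self, memory_mb: int, vcores: int) -> bool:
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores


class ResourceManager:
    """Hands out containers from the registered nodes using simple first fit."""

    def __init__(self, nodes: List[NodeManager]) -> None:
        self.nodes = nodes
        self._next_id = 1

    def allocate(self, memory_mb: int, vcores: int) -> Optional[Container]:
        for node in self.nodes:
            if node.can_fit(memory_mb, vcores):
                # Reserve the resources on the chosen node.
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                container = Container(self._next_id, node.host, memory_mb, vcores)
                self._next_id += 1
                return container
        return None  # no node has enough free resources


class ApplicationMaster:
    """Negotiates containers for one job from the ResourceManager."""

    def __init__(self, rm: ResourceManager) -> None:
        self.rm = rm
        self.containers: List[Container] = []

    def request(self, count: int, memory_mb: int, vcores: int) -> int:
        """Ask for `count` containers and return how many were granted."""
        granted = 0
        for _ in range(count):
            container = self.rm.allocate(memory_mb, vcores)
            if container is None:
                break
            self.containers.append(container)
            granted += 1
        return granted


nodes = [NodeManager("node1", 4096, 4), NodeManager("node2", 2048, 2)]
rm = ResourceManager(nodes)
am = ApplicationMaster(rm)
granted = am.request(count=4, memory_mb=2048, vcores=1)
hosts = [c.host for c in am.containers]
print(granted, hosts)  # 3 ['node1', 'node1', 'node2']
```

In this run the job asks for four 2 GB containers, but the two nodes only have 6 GB free between them, so three are granted and the fourth request is refused until resources are released.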
YARN is not the only new major feature of Hadoop 2.0. HDFS has undergone a major transformation with a collection of new features that include:
NameNode HA: automated failover with a hot standby and resiliency for the NameNode master service.
Snapshots: point-in-time recovery for backup, disaster recovery and protection against user errors.
Federation: a clear separation of namespace and storage by enabling a generic block storage layer.
NameNode HA is achieved using existing components like ZooKeeper along with new components like a quorum of JournalNodes and the ZooKeeper Failover Controller (ZKFC) processes.
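The "quorum" of JournalNodes means that an edit only needs to reach a majority of them, so the cluster keeps working while a minority is down. The short Python sketch below just works through that arithmetic. The function names are made up for illustration and are not part of any Hadoop API.

```python
from typing import List


def majority(journal_nodes: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes: int) -> int:
    """How many JournalNodes can be down while a majority is still reachable."""
    return (journal_nodes - 1) // 2


def edit_committed(acks: List[bool]) -> bool:
    """An edit counts as committed once a majority of nodes acknowledged it."""
    return sum(acks) >= majority(len(acks))


print(majority(3), tolerated_failures(3))   # 2 1
print(edit_committed([True, True, False]))  # True
print(edit_committed([True, False, False])) # False
```

This is also why JournalNodes are usually deployed in odd numbers: going from three to four nodes raises the majority to three without tolerating any more failures.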
You can go through the links below for video lectures and further reference.