Hope you are doing well.


It's a pleasure to talk to you.


Here is a simple example of cluster sizing.


Hadoop – Cluster Sizing:


Once the hardware for the worker nodes has been selected, the next obvious question is how many of those machines are required to complete a workload. The complexity of sizing a cluster comes from knowing—or more commonly, not knowing—the specifics of such a workload: its CPU, memory, storage, disk I/O, or frequency of execution requirements. Worse, it’s common to see a single cluster support many diverse types of jobs with conflicting resource requirements. Much like a traditional relational database, a cluster can be built and optimized for a specific usage pattern or a combination of diverse workloads, in which case some efficiency may be sacrificed.

 

There are a few ways to decide how many machines are required for a Hadoop deployment. The first, and most common, is sizing the cluster based on the amount of storage required. Many clusters are driven by high data ingest rates; the more data coming into the system, the more machines required. It so happens that as machines are added to the cluster, we get compute resources in addition to the storage capacity. Given the earlier example of 1 TB of new data every day, a growth plan can be built that maps out how many machines are required to store the total amount of data. It usually makes sense to project growth for a few possible scenarios. For instance:

 

Sample cluster growth plan based on storage:


Parameter                        Value        Notes
Average daily ingest rate        1 TB
Replication factor               3            Copies of each block
Daily raw consumption            3 TB         Ingest × replication
Node raw storage                 24 TB        12 × 2 TB SATA II HDD
MapReduce temp space reserve     25%          For intermediate MapReduce data
Node-usable raw storage          18 TB        Node raw storage – MapReduce reserve
1 year (flat growth)             61 nodes     Ingest × replication × 365 / node-usable raw storage
1 year (5% growth per month)     81 nodes
1 year (10% growth per month)    109 nodes
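The node counts in the plan follow directly from the figures above. As a quick check, here is a minimal Python sketch of the same arithmetic; the function name, parameter names, and the assumption of 365/12 days per month are illustrative choices rather than anything specified in the plan itself.

import math

def nodes_required(daily_ingest_tb=1.0,     # average daily ingest rate
                   replication=3,           # copies of each block
                   node_raw_tb=24.0,        # 12 x 2 TB SATA II HDD
                   mapreduce_reserve=0.25,  # temp space for intermediate MapReduce data
                   monthly_growth=0.0,      # month-over-month ingest growth
                   months=12):
    # Usable storage per node after the MapReduce temp space reserve (18 TB here).
    usable_per_node_tb = node_raw_tb * (1 - mapreduce_reserve)
    days_per_month = 365 / 12
    # Sum the ingest for each month, growing the daily rate by monthly_growth.
    total_ingest_tb = sum(daily_ingest_tb * (1 + monthly_growth) ** m * days_per_month
                          for m in range(months))
    total_raw_tb = total_ingest_tb * replication
    return math.ceil(total_raw_tb / usable_per_node_tb)

print(nodes_required())                     # flat growth        -> 61 nodes
print(nodes_required(monthly_growth=0.05))  # 5% growth / month  -> 81 nodes
print(nodes_required(monthly_growth=0.10))  # 10% growth / month -> 109 nodes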


There’s a clear chicken-and-egg problem: a job must be run with a subset of the data in order to understand how many machines are required to run the job at scale. An interesting property of MapReduce jobs is that map tasks are almost always uniform in execution. If a single map task takes one minute to execute and consumes some amount of user and system CPU time, some amount of RAM, and some amount of I/O, then 100 map tasks will simply take 100 times the resources. Reduce tasks, on the other hand, don’t have this property: the number of reducers is defined by the developer rather than being based on the size of the data.
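Because map tasks scale this way, a trial run over a subset of the data can be extrapolated to full scale. The sketch below shows one possible way to do that; the function name, all figures, and the idea of sizing the map phase against a completion deadline are hypothetical assumptions for illustration, not something prescribed by the text.

import math

def machines_for_deadline(sample_input_gb,      # data processed in the trial run
                          sample_map_tasks,     # map tasks the trial run spawned
                          minutes_per_map,      # observed average map task runtime
                          full_input_gb,        # data volume at full scale
                          map_slots_per_node,   # concurrent map tasks per worker node
                          deadline_minutes):    # required map-phase completion time
    # Map tasks grow linearly with input size because each task is roughly uniform.
    full_map_tasks = math.ceil(full_input_gb / sample_input_gb * sample_map_tasks)
    total_map_minutes = full_map_tasks * minutes_per_map
    # Concurrent map slots needed to finish the map phase within the deadline.
    slots_needed = math.ceil(total_map_minutes / deadline_minutes)
    return math.ceil(slots_needed / map_slots_per_node)

# Hypothetical example: a trial over 50 GB ran 400 one-minute map tasks; the full
# 5 TB job should finish its map phase within 60 minutes on nodes that run 12
# map tasks concurrently.
print(machines_for_deadline(50, 400, 1.0, 5000, 12, 60))   # -> 56 nodes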



I think this will resolve the issue.


Please let me know if you have any further issues regarding this.


Feel free to contact us if you have any queries.