I am sending you the details of the interview questions and also of the certification questions.
1. Describe in detail the biggest Hadoop cluster you have worked on. Describe your involvement in modeling, ingestion, transformations, aggregation and the data access layer.
The number of nodes a company uses depends entirely on the volume of data the company analyses.
The number of DataNodes also depends on the hard disk capacity and the number of cores of your commodity hardware.
So these things depend on the type of project. The number of nodes in a medium cluster would be anywhere between 20 and 40.
2. My involvement in modeling, ingestion, transformations, aggregation and the data access layer -- what kind of activity is the interviewer expecting? Can you please explain?
The interviewer is mainly interested in the data flow pattern in the Hadoop framework.
Modeling and ingestion are the steps of bringing the data into HDFS; transformations and aggregation can be done using the various components of Hadoop, such as Pig and MapReduce, and the transformed and aggregated data is then put back into HDFS.
From this transformed data we can build reports and reach conclusions through BI tools.
You can explain your role by walking the interviewer through the steps of the project you carried out while implementing the POC provided by us.
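As an illustration only, the ingest-transform-aggregate flow could be sketched in HiveQL (the table and column names here are hypothetical):

```sql
-- ingestion: expose raw files already loaded into HDFS as an external table
CREATE EXTERNAL TABLE raw_sales (sale_date STRING, region STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/sales';

-- transformation + aggregation: write the summarised result back into HDFS
CREATE TABLE sales_summary AS
SELECT region, SUM(amount) AS total_amount
FROM raw_sales
GROUP BY region;
```

A BI tool would then query sales_summary for reporting.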
3. Some challenges that everyone faces while implementing
Basically, the most common challenge everyone comes across is parsing the data sets. The pre-processing and cleaning of the data set before we analyse it needs much attention.
1. Can we create views in Hive?
Yes, we can create views in Hive.
To create a view, the following syntax is used:
CREATE VIEW ViewName AS SELECT ... FROM TableName;
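For example (assuming a hypothetical emp table with emp_name and salary columns):

```sql
-- define a view over the hypothetical emp table
CREATE VIEW high_paid_emp AS
SELECT emp_name, salary
FROM emp
WHERE salary > 50000;

-- the view can then be queried like an ordinary table
SELECT emp_name FROM high_paid_emp;
```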
2. Can we update data in Hive tables?
No, we cannot update it like a normal SQL table. However, Hive does provide an UPDATE command (on ACID/transactional tables), which internally writes out new delta files rather than modifying the existing rows in place.
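A minimal sketch, assuming a Hive version with ACID transactions enabled (older Hive versions additionally require the table to be bucketed; the table and column names are hypothetical):

```sql
-- transactional tables must be stored as ORC and marked transactional
CREATE TABLE emp_txn (emp_id INT, emp_name STRING, salary DOUBLE)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- the UPDATE writes delta files under the table directory
-- instead of modifying existing rows in place
UPDATE emp_txn SET salary = salary * 1.1 WHERE emp_id = 101;
```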
3. What are map-side joins?
In a map-side join, records from two files are merged on the basis of a common key in the mapper itself, without a reduce phase; typically the smaller file is loaded into memory (for example via the distributed cache) and joined against on the map side.
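In Hive this can be requested with the MAPJOIN hint (or left to hive.auto.convert.join); the tables here are hypothetical:

```sql
-- hint Hive to load the small table dept into memory in the mappers,
-- so the join completes on the map side with no reduce phase
SELECT /*+ MAPJOIN(d) */ e.emp_name, d.dept_name
FROM emp e
JOIN dept d ON e.dept_id = d.dept_id;
```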
4. How is data processed using Sqoop?
Sqoop is mainly a tool to import structured, tabular data from RDBMS tables into Hadoop storage such as HDFS or Hive tables (and to export it back).
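A typical import looks like the following (the JDBC URL, credentials, table name and target directory are placeholders); Sqoop turns this into a map-only job whose mappers pull slices of the table in parallel:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/sales_db \
  --username dbuser -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```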
5. How can we achieve high availability
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
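A skeleton of the relevant hdfs-site.xml entries for an HA pair (the nameservice and host names are placeholders; a real setup also needs shared-edits and failover settings, such as JournalNodes and automatic failover via ZooKeeper, which are not shown here):

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```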
6. Can the replication factor be 0?
You can set the replication factor to 0 in the configuration, which means all the daemons will be up and running, but you will not be allowed to store data on the DataNodes, since HDFS requires at least one replica per block.
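The replication factor of existing files can be changed with hdfs dfs -setrep (the path here is a placeholder); HDFS rejects values below the configured minimum (dfs.namenode.replication.min, 1 by default):

```shell
# lower the replication factor of a file to 2 and wait for it to take effect
hdfs dfs -setrep -w 2 /data/orders
```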
7. How is security handled in Hadoop if we are processing confidential data in HDFS?
Data confidentiality is maintained either by restricting the HDFS paths to a single user (via permissions) or by encrypting the data itself:
Data Encryption on RPC
This covers the data transferred between Hadoop services and clients. Setting hadoop.rpc.protection to "privacy" in core-site.xml activates data encryption.
Data Encryption on Block Data Transfer
You need to set dfs.encrypt.data.transfer to "true" in hdfs-site.xml in order to activate encryption for the DataNode data transfer protocol.
Data Encryption on HTTP
Data transfer between the web console and clients is protected by using SSL (HTTPS).
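Putting the first two settings together (RPC encryption in core-site.xml, block transfer encryption in hdfs-site.xml; enabling HTTPS additionally requires keystore settings not shown here):

```xml
<!-- core-site.xml: encrypt RPC traffic between clients and services -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the DataNode data transfer protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```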
1. Select * from emp ; in hive will it run mapreduce Job?
2. Select emp_name from emp; will this run mapreduce Job?
SELECT * FROM emp with no filtering does not run an MR job: Hive serves it with a simple fetch task that streams the files straight out of HDFS.
SELECT emp_name FROM emp involves a column projection, so in older Hive versions it launches an MR job; with hive.fetch.task.conversion set to "more" (the default in recent versions), simple projections can be served as a fetch task as well. Any query that actually processes data, such as filters, joins and aggregations, is executed by the MR framework and fires MR jobs.
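Whether a given query launches a job can be checked with EXPLAIN (emp is the hypothetical table from the questions above): a plan containing only a Fetch stage means no MapReduce job is run.

```sql
EXPLAIN SELECT * FROM emp;        -- plan shows only a Fetch stage
EXPLAIN SELECT emp_name FROM emp; -- may show a Map Reduce stage,
                                  -- depending on hive.fetch.task.conversion
```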
3. If the data nodes' capacity is 2 TB, with replication factor 3, can a file of 1 TB be stored in HDFS?
No. If the data nodes' total capacity is 2 TB and the replication factor is 3, a 1 TB file replicated thrice becomes 1 TB x 3 = 3 TB, which exceeds the total space of the data nodes, so it cannot be stored.
This is what one of our very active forum participant has written in the discussion forums (LMS):
I cleared CCD-410 today...huh...it was quite a lot of reading...happy to get it thru in the end.
To share my experience - first of all, it tests the underlying concepts more than your hands-on expertise (this may be true for all certifications). So whatever hands-on work we have done as part of the assignments is good enough (strongly advisable to do those). Obviously, if you can do more - it's always better.
I started with reading through Hadoop in Action, which is a little easy-going, and it made me more comfortable with the subject. I then (re)started with "Hadoop: The Definitive Guide". I read through the first 8 chapters of the book, and the basics I knew from the Pig, Hive, HBase and Sqoop classes were good enough, as these areas are just touched upon. Read about Oozie too.
It is very important to read the first 8 chapters of the definitive guide very well. Reading chapter 4 (Hadoop I/O) was killing me, but I did as much as I could. I did not have experience working with Distributed Cache, Sequence File and MapFile - so it took more time to digest.
Understanding the different components of YARN and what it is up to is very important.
Sample tests on crinlogic.com and skillsign.com helped to get a feel for it. There were a few scenario/use-case based questions as well.
That's pretty much it. I wish you best of luck.