Below is the link of the website.
Project: Analyze social bookmarking sites to find insights
Industry: Social Media
Data: It comprises information gathered from bookmarking sites such as reddit.com and stumbleupon.com, which allow you to bookmark, review, rate, and search links on any topic. The data is in XML format and contains the URLs of links/posts, the categories that describe them, and the ratings associated with them.
Problem Statement: Analyze the data in the Hadoop ecosystem to:
1. Fetch the data into the Hadoop Distributed File System (HDFS) and analyze it with MapReduce, Pig, and Hive to find the top-rated links based on user comments, likes, etc.
2. Using MapReduce, convert the semi-structured XML data into a structured format and categorize the user rating as positive or negative for each of the thousand links.
3. Push the output to HDFS and then feed it into Pig, which splits the data into two parts: category data and ratings data.
4. Write a Hive query to analyze the data further and push the output into a relational database (RDBMS) using Sqoop.
5. Use a web server running on Grails/Java/Ruby/Python that renders the results on a website in real time.
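To make steps 2 and 3 more concrete, here is a minimal Python sketch of the record-level logic a mapper might apply: parse each XML record, flatten it into a row, label the rating, and split the rows into category data and ratings data. The project description does not give the XML schema, so the element names (link, url, category, rating), the sample records, and the cutoff of 3 for a "positive" rating are all assumptions made for illustration. In the actual project this logic would run inside a MapReduce job and the split would be done in Pig.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input. The real feed's element names are not
# given in the project description, so these are assumptions.
SAMPLE_XML = """<links>
  <link>
    <url>http://example.com/a</url>
    <category>science</category>
    <rating>4</rating>
  </link>
  <link>
    <url>http://example.com/b</url>
    <category>sports</category>
    <rating>2</rating>
  </link>
</links>"""

# Assumed cutoff: ratings at or above this value count as positive.
POSITIVE_THRESHOLD = 3


def parse_links(xml_text):
    """Flatten XML records into (url, category, rating, label) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for link in root.iter("link"):
        url = link.findtext("url", default="").strip()
        category = link.findtext("category", default="").strip()
        rating = float(link.findtext("rating", default="0"))
        label = "positive" if rating >= POSITIVE_THRESHOLD else "negative"
        rows.append((url, category, rating, label))
    return rows


def split_rows(rows):
    """Split rows into category data and ratings data (step 3)."""
    category_data = [(url, category) for url, category, _, _ in rows]
    ratings_data = [(url, rating, label) for url, _, rating, label in rows]
    return category_data, ratings_data


if __name__ == "__main__":
    # Emit tab-separated rows, a format Pig and Hive can load directly.
    for row in parse_links(SAMPLE_XML):
        print("\t".join(str(field) for field in row))
```

Tab-separated output is used here only because it is easy to load into Pig and Hive; any delimiter that does not appear in the URLs would work as well.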
Hope this helps.
Please let me know if you have any other concerns.