Hadoop was initially designed for batch processing: take a large dataset as input all at once, process it, and write a large output. The very concept of MapReduce is geared towards batch rather than real-time work. That was only strictly true in Hadoop's early days, though, and there are now plenty of ways to use Hadoop in a more real-time fashion.
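
To make the batch model concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce steps applied to a word count. It does not use any Hadoop API; the function names and the tiny in-memory dataset are made up for illustration. The point is only that the whole input is consumed before any result is available.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group all emitted values by key, as the framework would do between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collapse each key's list of values into a single total.
    return {key: sum(values) for key, values in groups.items()}

def batch_word_count(lines):
    # The result only exists once the entire dataset has been processed.
    return reduce_phase(shuffle(map_phase(lines)))

# Hypothetical input standing in for a large file on HDFS.
dataset = ["hadoop runs batch jobs", "batch jobs read large input"]
result = batch_word_count(dataset)
print(result)
```

On a real cluster the same three steps run distributed across many machines, which is where the job startup and scheduling overhead comes from.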

First, I think it's important to define what you mean by real-time. You might be interested in stream processing, or you might want to run queries on your data that return results in real time.

Hadoop does not natively provide stream processing, but several other projects integrate with it easily:

  • Storm-YARN allows you to use Storm on your Hadoop cluster via YARN.
  • Spark Streaming, a component of Spark, runs on YARN, integrates with HDFS, and processes streaming data in small batches in near real-time.
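
As a contrast with the batch model, the following plain-Python sketch shows the general idea behind stream processing: results are updated incrementally as small groups of records arrive, instead of being computed once over the full dataset. It does not use the Storm or Spark APIs; the `RunningWordCount` class and the sample stream are hypothetical and exist only to illustrate the concept.

```python
from collections import Counter

class RunningWordCount:
    """Keeps word counts up to date as new records arrive."""

    def __init__(self):
        self.counts = Counter()

    def update(self, micro_batch):
        # Fold a small group of new lines into the running totals
        # and return a snapshot that is usable immediately.
        for line in micro_batch:
            self.counts.update(line.lower().split())
        return dict(self.counts)

# Hypothetical stream: each inner list is one small group of records.
stream = [["storm on yarn"], ["spark on hdfs", "storm again"]]

counter = RunningWordCount()
snapshots = [counter.update(batch) for batch in stream]

for snapshot in snapshots:
    print(snapshot)
```

Each snapshot is available as soon as its group of records has been handled, which is what gives a streaming system its low latency.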

For real-time queries, several projects also build on Hadoop:

  • Impala from Cloudera uses HDFS but bypasses MapReduce altogether, because MapReduce adds too much overhead for low-latency queries.
  • Apache Drill is another project that integrates with Hadoop to provide real-time query capabilities.
  • The Stinger project aims to make Hive itself more real-time.

There are probably other projects that would fit into the list of "Making Hadoop real-time", but these are the best-known ones.

So, as you can see, Hadoop is moving more and more towards real-time and, even though it wasn't designed for that, you have plenty of options for extending it for real-time purposes.
