We hope you are doing great.

Consider using LZO compression. Once indexed, LZO files are splittable, which means a big .lzo file can be processed by many mappers. Bzip2 is splittable too, but it's slow.

 

For MapReduce, LZO strikes a good balance between compression ratio and compression/decompression speed.


Link :  http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/


Link :  http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2


BZIP2 is splittable in Hadoop. It provides a very good compression ratio, but compression is very CPU-intensive, so its CPU time and overall performance are not optimal.
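To get a rough feel for the ratio-versus-CPU trade-off on your own data, a small Python sketch like the one below can help. It only uses the standard library, so zlib at level 1 stands in for a "fast, light" codec (it is not LZO or Snappy), and the helper names and sample data are made up for illustration. Timings depend entirely on the machine and the input, so treat the numbers as indicative only.

```python
import bz2
import time
import zlib

def measure(name, compress, data):
    """Time one compression call and report the achieved ratio."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return {"codec": name, "ratio": len(data) / len(out), "seconds": elapsed}

def compare(data):
    """Compare a light codec (zlib level 1) with bzip2 at its highest level."""
    return [
        measure("zlib-1", lambda d: zlib.compress(d, 1), data),
        measure("bz2-9", lambda d: bz2.compress(d, 9), data),
    ]

# Hypothetical sample input: repetitive, log-like text.
sample = b"".join(b"row %d,some repeated text\n" % i for i in range(5000))
results = compare(sample)
for r in results:
    print(r["codec"], round(r["ratio"], 1), round(r["seconds"], 4))
```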


LZO is splittable in Hadoop: with hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to process them in parallel. The library provides tools for generating these indexes either locally or in a distributed manner.
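The idea behind such an index can be sketched in a few lines of Python. This is only a conceptual illustration, not hadoop-lzo's actual file format: zlib stands in for LZO (which is not in the standard library), and the function names and block size are assumptions. The point is that once the start offset of every independently compressed block is known, any block can be decompressed without reading the ones before it, which is what lets separate mappers work on separate parts of one file.

```python
import zlib

def compress_blocks(data: bytes, block_size: int):
    """Compress data in independent blocks; return (payload, block start offsets)."""
    payload = bytearray()
    offsets = []
    for start in range(0, len(data), block_size):
        offsets.append(len(payload))  # where this compressed block begins
        payload += zlib.compress(data[start:start + block_size])
    return bytes(payload), offsets

def read_block(payload: bytes, offsets: list, i: int) -> bytes:
    """Decompress block i alone, using the index to find its boundaries."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(payload)
    return zlib.decompress(payload[offsets[i]:end])

# Hypothetical input: 4 KiB blocks over some line-oriented text.
data = b"".join(b"line %d\n" % i for i in range(2000))
payload, offsets = compress_blocks(data, 4096)
third_block = read_block(payload, offsets, 2)  # no need to touch blocks 0 and 1
```

Without the offsets list, a reader would have to start from the beginning of the file to find where each block starts, which is why an unindexed file ends up being handled by a single mapper.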


LZ4 is splittable in Hadoop: with hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any speed/compression-ratio level: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes that provide a higher compression ratio, almost comparable to GZIP's.


Why is Snappy useful for Hadoop?


Many Hadoop clusters use LZO compression for intermediate MapReduce output. This output, which is never seen by the user, is always written to disk by the mappers, and then accessed across the network by reducers. It is a prime candidate for compression since it tends to be compressible (there is some redundancy in the key space, since the map outputs are sorted), and because writing to disk is slow it pays to perform some light compression to reduce the number of bytes written (and later read). 
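To illustrate why sorted map output tends to shrink even under light compression, here is a minimal Python sketch. It uses zlib at level 1 purely as a stand-in for a fast codec (it is not LZO or Snappy), and the record layout and function name are hypothetical.

```python
import zlib

def light_compress_sorted(records):
    """Sort key/value records, join them, and compress with a fast zlib setting."""
    raw = "".join(sorted(records)).encode("utf-8")
    compressed = zlib.compress(raw, 1)  # level 1 = fastest, lightest setting
    return raw, compressed

# Hypothetical map output: many values spread over a small set of keys.
records = [f"user{i % 50:04d}\t{i}\n" for i in range(10000)]
raw, compressed = light_compress_sorted(records)
print(len(raw), len(compressed))
```

After sorting, identical keys sit next to each other, so the repeated key prefixes are easy for even a cheap compressor to exploit.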


Snappy and LZO are not CPU intensive, which is important, as other map and reduce processes running at the same time will not be deprived of CPU time. In testing, we have seen that the performance of Snappy is generally comparable to LZO, with up to a 20% improvement in overall job time in some cases.


This use alone justifies installing Snappy, but there are other places Snappy can be used within Hadoop applications. 

For example, Snappy can be used for block compression in all the commonly-used Hadoop file formats, including Sequence Files, Avro Data Files, and HBase tables.


One thing to note is that Snappy is intended to be used with a container format, such as Sequence Files or Avro Data Files, rather than directly on plain text, because a Snappy-compressed plain-text file is not splittable and can't be processed in parallel using MapReduce.

This is different from LZO, where it is possible to index LZO-compressed files to determine split points so that they can be processed efficiently in subsequent jobs.


Please find the attached document for your reference.

Please let us know if you have any other concerns so we can help you out.