Compression is a tool for controlling extreme data volumes. The compression schemes available in Hadoop are called codecs.

A codec, short for compressor/decompressor, is the implementation of a compression/decompression algorithm. Codecs differ in both the speed with which they can compress and decompress data and the degree to which they can compress it. Some codecs also support something called splittable compression.

Splittable compression is an important concept in a Hadoop context. The way Hadoop works is that files are split if they’re larger than the file’s block size setting, and individual file splits can be processed in parallel by different mappers.
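As a rough illustration of the splitting described above, the sketch below counts how many splits a file would produce under a simplified model: one split per block-sized chunk. The function name count_splits and the sizes used are assumptions made for this example, and Hadoop's real split calculation takes additional settings into account.

```python
def count_splits(file_size_bytes: int, block_size_bytes: int) -> int:
    """Simplified model: one split per block-sized chunk of the file."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division, so a partial last block still counts as a split.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


MB = 1024 * 1024

# Illustrative values only: a 1 GB file stored with a 128 MB block size.
print(count_splits(1024 * MB, 128 * MB))  # 8
```

Under this model, each of those splits could be handed to a different mapper, provided the file format lets a mapper start reading at a split boundary.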

With most codecs, text file splits cannot be decompressed independently of other splits from the same file. Those codecs are said to be non-splittable, and MapReduce processing of a file compressed with one of them is limited to a single mapper.
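To see why a non-splittable file cannot be handed out piece by piece, the sketch below compresses some made-up text with Python's standard gzip module, cuts the compressed bytes in half, and tries to decompress each piece on its own. This is a stand-in for what happens at a block boundary, not Hadoop code, and the helper name can_decompress is hypothetical.

```python
import gzip
import zlib


def can_decompress(chunk: bytes) -> bool:
    """Return True if the chunk decompresses as a complete gzip stream."""
    try:
        gzip.decompress(chunk)
        return True
    except (OSError, EOFError, zlib.error):
        # Missing header, truncated stream, or corrupt deflate data.
        return False


# Sample text data; the content is invented for illustration.
text = "".join(f"{i},user{i % 97},click\n" for i in range(20000)).encode("utf-8")
compressed = gzip.compress(text)

# Cut the compressed bytes in the middle, as a block boundary might.
midpoint = len(compressed) // 2
first_half = compressed[:midpoint]
second_half = compressed[midpoint:]

print("whole file: ", can_decompress(compressed))
print("first half: ", can_decompress(first_half))
print("second half:", can_decompress(second_half))
```

Only the whole stream should decompress cleanly. The second half has no gzip header and begins in the middle of the compressed stream, which is the situation a second mapper would face.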

The following are some common codecs that are supported by the Hadoop framework. Be sure to choose the codec that most closely matches the demands of your particular use case (for example, for workloads where processing speed is important, choose a codec with high decompression speeds):

Gzip: A compression utility that was adopted by the GNU project, Gzip (short for GNU zip) generates compressed files that have a .gz extension. You can use the gunzip command to decompress files that were created by a number of compression utilities, including Gzip.

Bzip2: From a usability standpoint, Bzip2 and Gzip are similar. Bzip2 generates a better compression ratio than Gzip does, but it’s much slower. In fact, of all the available compression codecs in Hadoop, Bzip2 is by far the slowest.

If you’re setting up an archive that you’ll rarely need to query and space is at a high premium, then Bzip2 may be worth considering.
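The trade-off between Gzip and Bzip2 can be explored outside Hadoop with Python's standard gzip and bz2 modules. The sketch below is a rough, informal measurement on invented sample data, not a benchmark. The helper name measure is hypothetical, and results will vary with the data and the machine.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Time one compress/decompress round trip and report the compressed size."""
    start = time.perf_counter()
    packed = compress(data)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_seconds = time.perf_counter() - start

    if restored != data:
        raise ValueError(f"{name} round trip did not reproduce the input")
    return {
        "codec": name,
        "compressed_bytes": len(packed),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
    }


# Invented log-like text, used only to have something to compress.
sample = "".join(
    f"2024-01-01 host{i % 13} GET /item/{i % 501} status=200\n"
    for i in range(50000)
).encode("utf-8")

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]

print(f"original: {len(sample)} bytes")
for r in results:
    print(
        f"{r['codec']}: {r['compressed_bytes']} bytes, "
        f"compress {r['compress_seconds']:.3f}s, "
        f"decompress {r['decompress_seconds']:.3f}s"
    )
```

Running this on a sample of your own data gives a quick sense of whether the extra space saved by Bzip2 justifies the extra time.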

Snappy: The Snappy codec from Google provides modest compression ratios, but fast compression and decompression speeds. (In fact, it has the fastest decompression speeds, which makes it highly desirable for data sets that are likely to be queried often.)

The Snappy codec is integrated into Hadoop Common, a set of common utilities that supports other Hadoop subprojects. You can use Snappy as an add-on for versions of Hadoop that do not yet provide Snappy codec support.

LZO: Similar to Snappy, LZO (short for Lempel-Ziv-Oberhumer, the trio of computer scientists who came up with the algorithm) provides modest compression ratios, but fast compression and decompression speeds. LZO is licensed under the GNU General Public License (GPL).

LZO supports splittable compression, which enables the parallel processing of compressed text file splits by your MapReduce jobs. LZO needs to create an index when it compresses a file, because with variable-length compression blocks, an index is required to tell the mapper where it can safely split the compressed file. LZO is only really desirable if you need to compress text files.
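The role of the index can be sketched with a toy block-compressed format. The example below does not use LZO or its file format. It uses Python's standard zlib module as a stand-in, compresses each input block independently, and records the byte offset where each compressed block starts. The function names compress_with_index and read_block are hypothetical.

```python
import zlib


def compress_with_index(data: bytes, block_size: int):
    """Compress each input block on its own and record where it starts."""
    output = bytearray()
    index = []  # byte offsets of compressed blocks within the output
    for start in range(0, len(data), block_size):
        index.append(len(output))
        output += zlib.compress(data[start:start + block_size])
    return bytes(output), index


def read_block(compressed: bytes, index: list, block_number: int) -> bytes:
    """Decompress one block using only the index, without reading earlier blocks."""
    begin = index[block_number]
    if block_number + 1 < len(index):
        end = index[block_number + 1]
    else:
        end = len(compressed)
    return zlib.decompress(compressed[begin:end])


# Invented text data for illustration.
data = "".join(f"row {i}: value={i * i % 1009}\n" for i in range(30000)).encode("utf-8")
block_size = 64 * 1024

compressed, index = compress_with_index(data, block_size)

# Compressed blocks have different lengths, so the offsets are unevenly spaced.
print("block offsets:", index)

# A reader can jump straight to block 2 without touching blocks 0 and 1.
third_block = read_block(compressed, index, 2)
print(third_block == data[2 * block_size:3 * block_size])
```

Because the compressed blocks vary in length, a reader cannot guess where one begins. The recorded offsets are what tell each mapper where it can safely start.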

Hadoop Codecs

Codec    File Extension   Splittable?          Degree of Compression   Compression Speed
Gzip     .gz              No                   Medium                  Medium
Bzip2    .bz2             Yes                  High                    Slow
Snappy   .snappy          No                   Medium                  Fast
LZO      .lzo             No, unless indexed   Medium                  Fast