Input Format | Description | Key | Value |
TextInputFormat | Default format; reads lines of text files | The byte offset of the line | The line contents |
KeyValueInputFormat | Parses lines into key, value pairs | Everything up to the first tab character | The remainder of the line |
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined | user-defined |
The default InputFormat is the TextInputFormat. This treats each line of each input file as a separate record, and performs no parsing. This is useful for unformatted data or line-based records like log files.
A more interesting input format is the KeyValueInputFormat. This format also treats each line of input as a separate record. While the TextInputFormat treats the entire line as the value, the KeyValueInputFormat breaks the line itself into the key and value by searching for a tab character. This is particularly useful for reading the output of one MapReduce job as the input to another.
Finally, the SequenceFileInputFormat reads special binary files that are specific to Hadoop. These files include many features designed to allow data to be rapidly read into Hadoop mappers. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.