Buckets should be based upon as Queries to be used in Hive. Buckets are used to reduce the number of input files for processing your query.
Tables/Partitions can be further subdivided into Clusters or Buckets.
Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.
For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table.
Bucketing is helpful for 2 reasons :
•Enables more efficient queries
•Makes sampling more efficient
•Hash(column) MOD(number of buckets) –evenly distributed
CREATE TABLE students_bucket(name STRING,id INT,college STRING) PARTITIONED BY(country STRING) CLUSTERED BY (college) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ ;
INSERT OVERWRITE TABLE employees_bucketPARTITION(country=‘USA’) SELECT name,id,college from college.students_external;
SELECT * FROM students_bucket TABLESAMPLE(BUCKET 2 OUT OF 4);
Let discuss one more example :
When you do bucketing on your data based on particular column(userid), your data is stored in each bucket file is based on hashing of key modulus no of buckets.
For example, if you have 10 records in data in the following format.(original )
if you do buckets as 4 on the above data, your (partition) dir in hdfs will have four files wrt each bucket
bucket-0 will contain data
buckte-1 will contain data
It goes based on remainder of your userid
Coming to Query. "Select * from table1 where userid = 4
How many files will be processed?
Only 1 ie. bucket-0 file
It turn we reduce the number of files for MR using Hive.
We can do bucketing on more number of columns based on frequency of the columns in where clause of your queries.
Note: used 10 records just for explanation only, Buckets can be used even without partition.
We hope this will help you to understand the concept.