In my last Hive post, we introduced partitions and bucketing both of which allow you to horizontally slice data to make it more manageable and easy to query. Staying the course in this post we will introduce two more techniques to improve your experience in Hive through the use of indexes and views.
In the SQL Server world, indexes are a critical for efficiently retrieving rows from tables. Without an index, it is necessary to perform a full scan of all the data to retrieve the required rows. When an index is added the behavior acts more like a seek which is useful when dealing with large sets of data typically found in Hive.
Indexes in Hive function a lot like non-clustered indexes in SQL Server. Rather than containing the actual row data (as in a clustered index), Hive indexes contain the indexed column values and a pointer or offset to the location of the data.
The command to create an index in Hive is fairly straight-forward as seen below:
CREATE INDEX zip_index ON TABLE customer(zip) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
Note that the WITH DEFERRED REBUILD portion of the command prevents the index from immediately being built. To build the index you can issue the following command:
ALTER INDEX zip_index ON customer REBUILD;
After the command is issued, a MapReduce job is started and the index is created. The time the MapReduce process takes to run depends on the volume of data in the table being index and the process results in a new table being created. The table using the example above takes the name default__customer_zip_index__ and contains the zip a bucket, offsets as well as the country and regions columns that were set-up as the partition schema for the table (see the previous blog post HERE).
A couple of points to consider as you get started:
- The Hive Query Rewrite engine is rule based and not cost based. This means that it may not make a good or expected decision and it may be necessary to rewrite the query manually to use the index. If you find it necessary to rewrite a query to manually use the index, consider wrapping the query in a View (see below).
- There is not automatic index maintenance. You should plan for index rebuilds as you load data. Note that it is possible to rebuild an index for only a single partition which can be beneficial if you are loading data by partition.
- In addition to the CompactIndexHandler used above, BitMap indexes are supported as of version 0.8. A BitMap index is ideal for columns with a limited number of distinct values.
Another feature that will be familiar to those coming from the relational-database world is the View. For those who are not familiar with purpose or concept of a view their use within Hive is straight-forward. Views are a logical construct that can be used to simplify queries by either abstracting away complexities such as joins or sub-queries or by pre-filtered or restricting data. Unlike some RDMS implementations however, they never store data or are materialized.
To create a view, the following syntax is used:
CREATE VIEW ViewName AS SELECT <cols> FROM <table> WHERE <predicate>
A view is capable of using any valid select statement and can be used to filter the projection (columns) or it can contain a predicate (WHERE clause) to filter the results as seen below.
CREATE VIEW FloridaOrders AS SELECT * FROM customer JOIN order ON (customer.username = order.username) WHERE state = 'FL'
After you have created you view, it can be queried like any other table within Hive.
SELECT * FROM FloridaOrders WHERE amount > 1000
Another useful purpose for Views relates to the special data types supported within Hive. If you recall from my previous introductory blog on Hive data types, Hive differs from most traditional relational database systems in that it supports three set-based data types: array, map and struct.
Handling these data types as we discussed HERE can add to the complexity of a query since its necessary to use either the array or dot notation to access the set members. Views can be used to flatten the sets making them more accessible via query as seen below.
CREATE VIEW CoachRecords (name, wins, losses, ties) AS SELECT Struct.FirstName + ' ' + Struct.NickName + ' ' + Struct.LastName, record["wins"], record["losses"], record["ties"] FROM coaches
As we wrap-up, there is one important behavior to consider that differs from the RDBMS world. Hive Views are not schema-bound in anyway. This means that changes to the underlying table schema could cause the view to break. You should plan for an be aware of this issue when creating views in an evolving system.
In this post we introduced two more concepts that translate from the RDBMS world to Hive: indexes and views. These techniques when coupled with the partitioning and bucketing functionality discussed previously (HERE), provide performance and manageable features that you are most likely already familiar with. Knowing about and including these features correctly in your Hive implementation will simplify and provide your users with a better experience.
Till next time!