Apache Hive


Companies today deal with ever-increasing volumes of big data. Big data simply means a large volume of data of many different varieties. Processing it with traditional tools was almost impossible, and Apache Hadoop, an open source project of the Apache Software Foundation, emerged to solve the problem. Interest in the software began with its name: it was named after a toy elephant belonging to the son of its co-creator, Doug Cutting.

As an open source platform, Apache Hadoop has gained popularity for distributed storage and processing of large data sets, and it has made processing such data cost-effective. The platform covers functions such as data processing, data storage, data access and security. In short, its scalability, performance and flexibility have made it popular among organizations. Let's look at what Hive offers in relation to Hadoop.

Hadoop was developed to organize and store data of all shapes, sizes and formats. It acts as a container for heterogeneous data from a multitude of sources. Hive, on the other hand, is used by data analysts to query, analyse and summarize that data into usable business information. Apache Hive is an open source project maintained by volunteers under the Apache Software Foundation; it was initially developed by Facebook. Hive provides an SQL-like interface for querying data stored in the various databases and file systems that integrate with Hadoop. Its query language is called HiveQL, also known as Hive Query Language (HQL). Hive translates these SQL-like queries into MapReduce jobs that are executed on a Hadoop cluster. Hive scales well and is fast for large batch workloads.
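
To make the idea of translating an SQL-like query into a MapReduce job more concrete, here is a minimal sketch in plain Python. It mimics the map, shuffle and reduce steps that a query such as SELECT country, COUNT(*) FROM page_views GROUP BY country could be broken into. The table name, the sample rows and the function names are all hypothetical, and this is only an illustration of the general technique, not Hive's actual query planner.

```python
from collections import defaultdict

# Hypothetical rows of a "page_views" table: (user, country).
ROWS = [
    ("alice", "US"),
    ("bob", "IN"),
    ("carol", "US"),
    ("dave", "DE"),
    ("erin", "IN"),
    ("frank", "US"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed on the GROUP BY column."""
    for _user, country in rows:
        yield country, 1


def shuffle(pairs):
    """Collect emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the values of each key; summing the 1s gives COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(rows):
    """Run the three phases in order, like a single MapReduce job."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_country(ROWS))
```

On a real cluster the map and reduce steps would run in parallel across many machines, which is where the scalability comes from.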

Hive queries can be run in several ways: through the Hive shell (a command-line interface), or from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application. HiveQL looks much like conventional database code, and tables in Hive are organised similarly to those in a relational database. However, there are some differences because Hive is built on Hadoop processing. Because Hive performs long sequential scans, its queries can have relatively high latency, so it does not suit applications that demand fast response times. Hive is also primarily read-oriented, so it is not appropriate for workloads that involve a large share of writes. In general, the data structures of Hadoop constrain what Hive can do.

Within a Hive database, table data is serialized and every table has a corresponding Hadoop Distributed File System (HDFS) directory. Tables can be subdivided into smaller partitions, across which the data is distributed. Hive supports common data types such as BINARY, DOUBLE, FLOAT, INT, BOOLEAN, CHAR and DECIMAL.
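
The mapping from tables and partitions to HDFS directories can be sketched in a few lines of Python. The sketch assumes the commonly used default warehouse root /user/hive/warehouse and the column=value naming of partition subdirectories. Real deployments may configure a different root, and the table name, partition columns and helper functions below are hypothetical.

```python
from posixpath import join

# Assumed default warehouse root; real clusters may configure another path.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def table_dir(table, root=WAREHOUSE_ROOT):
    """Return the HDFS directory that holds a table's data."""
    return join(root, table)


def partition_dir(table, partition, root=WAREHOUSE_ROOT):
    """Return the directory of one partition.

    `partition` is an ordered list of (column, value) pairs; each pair
    becomes one nested "column=value" subdirectory under the table.
    """
    parts = [f"{column}={value}" for column, value in partition]
    return join(table_dir(table, root), *parts)


if __name__ == "__main__":
    print(table_dir("sales"))
    print(partition_dir("sales", [("year", 2020), ("country", "US")]))
```

Because each partition lives in its own directory, a query that filters on a partition column only needs to scan the matching directories rather than the whole table.
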
With HiveQL, it is possible to perform data analysis on large datasets. Note that Hive is designed for, and best suited to, traditional data warehousing tasks; it is not suited to online transaction processing.
