Windowing function in Apache Hive

Apache hive is a part of the Hadoop echo system that provides the functionality to create a data warehouse at the Hadoop distributed file system. let us Understand What is Hive and how we can work on windowing function in apache hive.

Firstly, the hive is not an OLTP tool rather it is used to process big data that are distributed at the Hadoop cluster apart from this the apache hive is OLAP (online analytical processing) tool of Hadoop. using apache hive you can create different types of the table like internal, external, partitioned, buckets or you can apply windowing function in Apache Hive to process data that is stored at apache hive warehouse.

A windowing function in apache hive is not a new concept as from the past decade we were using windowing functions features in various relational database management systems like MySQL and oracle apart from this, windowing function techniques are used to execute SQL queries at the top of the temporary inner table that is a window.

In this blog, we are going to cover all windowing functions in apache hive so let us start with what is windowing operations on tables.window is a temp table where you perform a couple of aggregate function like count on the window, sum on the window,avg on the window, etc.

Step-1: Create a table in the hive to better understand the functionalities of the windowing function.

raw data set for hive

Creating a table in the hive.

CREATE TABLE IF NOT EXISTS EMP
(
id int,
name string,
email string,
gender string,
salary int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE;

Loading Data into Table.

Step-2: Now load your data into table EMP

LOAD DATA LOCAL INPATH ‘/home/cloudera/Desktop/inputdata.csv’ OVERWRITE INTO TABLE EMP;

Step-3: Then verify your data using select * from EMP;

select operation in apache hive

Applying Window Functions on data sets

Performing window operation on the partitions as in the below command we have only two partitions 1-male 2-female so this query will divide all data into 2 parts and will perform count, sum, min, max,avg operation on part1 and part2.

Example Query:-

select name,gender
,count(*) over w as countsal
,sum(Salary) over w as sumsalary
,min(salary) over w as minsalary
,max(salary) over w as maxsalary
,avg(salary) over w as avgsalary
from emp
WINDOW w as (partition by gender);

Result.

partition in apache hive

Finding First and Last Value Over a Window:-

Now we are going to fetch the first and last value over a window.

Example Query:

select name,gender,first_value(salary) over(partition by gender) as first_high from emp;

First Value Result:-

First and last value in apache hive

Example Query:

select name,gender,last_value(salary) over(partition by gender) as last_high from emp;

Last Value Result:-

first and last value in hive

Finding Rank and dense Rank over a window.

Finding row number, rank and dense rank using window.

Example Query:

select name,salary,
ROW_NUMBER() OVER W as row_number,
RANK() OVER W as rank,
DENSE_RANK() OVER W as dens_erank
from emp
window W as (ORDER BY salary ASC);

Result:-

order by in hive

THANK YOU

Leave a Reply

Your email address will not be published. Required fields are marked *