Aggregating over groups and calculating summary statistics by group provide an informative angle to review the data. As examining the data at the disaggregated level, we might prefer including the summary statistics aside. Therefore, we need to merge a table of summary statistics at the group level back to the disaggregated data using group ID. Window functions are introduced in SQL to facilitate this process.
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. In contrast to regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row -- the rows retain their separate identities (MODE Tutorial Link).
This post summarizes the primary use of window functions by presenting practice examples.
The OVER Clause Syntax
The basic syntax of window function is to use SQL functions (either
aggregate functions such as SUM, COUNT, and
AVG or row-wise functions such as ROW_NUMBER()
and RANK()) together with the OVER clause.
1 | ---- OVER (ORDER BY) ---- |
The
PARTITION BYclause within theOVERclause divides the data into subsets by the values of selected columns.
GROUP BY +
JOIN = Window Functions
Here is a diagram to illustrate this
GROUP BY+JOIN process:
This can be completed by
pandas.DataFrame.transformin Python (Reference Link) orDT[, .SD, by]usingdata.tablepackage in R (Reference Link).
1 | ---- Window Functions ---- |
ROW_NUMBER(),
RANK(), and DENSE_RANK()
When labeling the row numbers, we are thinking of
ROW_NUMBER(), RANK(), and
DENSE_RANK().
RANK()vs.ROW_NUMBER()For example, if you order by
start_time, there might exist some rows with identical start times.RANK()gives the same rank to these rows, whereasROW_NUMBER()gives them different numbers in order.RANK()vs.DENSE_RANK()For example,
RANK()would give the identical rows a rank of 2, then might give 5 by skipping the ranks 3 and 4. In contrast,DENSE_RANK()would not skip any rank. It gives all the identical rows a rank of 2, followed by 3.
1 | ---- ROW_NUMBER() ---- |
Quartile, Quintile, and Percentile
Use the window function NTILE(# of buckets) to label the
subgroups ordered by selected columns.
1 | SELECT start_terminal, |
Window Alias
If you're planning to write several window functions in to the same
query, using the same window, you can create an alias. The
WINDOW clause, if included, should always come
after the WHERE clause.
1 | SELECT start_terminal, |
When
# of bucketsis greater than the number of records,NTILE()==ROW_NUMBER().
Comparing to The Preceding or Following Rows
You can use LAG or LEAD to create columns
that pull values from other rows.
1 | SELECT start_terminal, |
View / Make Comments