Aggregating over groups and calculating summary statistics by group provide an informative angle to review the data. As examining the data at the disaggregated level, we might prefer including the summary statistics aside. Therefore, we need to merge a table of summary statistics at the group level back to the disaggregated data using group ID. Window functions are introduced in SQL to facilitate this process.
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. In contrast to regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row -- the rows retain their separate identities (MODE Tutorial Link).
This post summarizes the primary use of window functions by presenting practice examples.
The OVER
Clause Syntax
The basic syntax of window function is to use SQL functions (either
aggregate functions such as SUM
, COUNT
, and
AVG
or row-wise functions such as ROW_NUMBER()
and RANK()
) together with the OVER
clause.
1 | ---- OVER (ORDER BY) ---- |
The
PARTITION BY
clause within theOVER
clause divides the data into subsets by the values of selected columns.
GROUP BY
+
JOIN
= Window Functions
Here is a diagram to illustrate this
GROUP BY
+JOIN
process:
This can be completed by
pandas.DataFrame.transform
in Python (Reference Link) orDT[, .SD, by]
usingdata.table
package in R (Reference Link).
1 | ---- Window Functions ---- |
ROW_NUMBER()
,
RANK()
, and DENSE_RANK()
When labeling the row numbers, we are thinking of
ROW_NUMBER()
, RANK()
, and
DENSE_RANK()
.
RANK()
vs.ROW_NUMBER()
For example, if you order by
start_time
, there might exist some rows with identical start times.RANK()
gives the same rank to these rows, whereasROW_NUMBER()
gives them different numbers in order.RANK()
vs.DENSE_RANK()
For example,
RANK()
would give the identical rows a rank of 2, then might give 5 by skipping the ranks 3 and 4. In contrast,DENSE_RANK()
would not skip any rank. It gives all the identical rows a rank of 2, followed by 3.
1 | ---- ROW_NUMBER() ---- |
Quartile, Quintile, and Percentile
Use the window function NTILE(# of buckets)
to label the
subgroups ordered by selected columns.
1 | SELECT start_terminal, |
Window Alias
If you're planning to write several window functions in to the same
query, using the same window, you can create an alias. The
WINDOW
clause, if included, should always come
after the WHERE
clause.
1 | SELECT start_terminal, |
When
# of buckets
is greater than the number of records,NTILE()
==ROW_NUMBER()
.
Comparing to The Preceding or Following Rows
You can use LAG
or LEAD
to create columns
that pull values from other rows.
1 | SELECT start_terminal, |
View / Make Comments