Skip to content

Commit

Permalink
Issue duckdb#4532: Secondary Window Orderings
Browse files Browse the repository at this point in the history
* Document new `ORDER BY` argument support for window functions
* Update aggregate documentation for `DISTINCT` and `ORDER BY` simplification

fixes: duckdb#4532
  • Loading branch information
Richard Wesley committed Jan 9, 2025
1 parent df819b9 commit d4e8928
Show file tree
Hide file tree
Showing 3 changed files with 81 additions and 36 deletions.
2 changes: 1 addition & 1 deletion BUILDING.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ scripts/serve-latest.sh
Visit <http://localhost:4000/docs/> to browse the website.
Note that to save time on building, the `serve-latest.sh` script only deploys the latest stable version and exclude the archives. To serve the full website with old versions included, run:
Note that to save time on building, the `serve-latest.sh` script only deploys the latest stable version and excludes the archives. To serve the full website with old versions included, run:
```bash
scripts/serve.sh
Expand Down
5 changes: 3 additions & 2 deletions docs/sql/functions/aggregates.md
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ Aggregates are functions that *combine* multiple rows into a single value. Aggre
### `DISTINCT` Clause in Aggregate Functions

When the `DISTINCT` clause is provided, only distinct values are considered in the computation of the aggregate. This is typically used in combination with the `count` aggregate to get the number of distinct elements; but it can be used together with any aggregate function in the system.
There are some aggregates that are insensitive to duplicate values (e.g., `min` and `max`) and for them this clause is parsed and ignored.

### `ORDER BY` Clause in Aggregate Functions

Expand Down Expand Up @@ -142,11 +143,11 @@ The table below shows the available general aggregate functions.
| [`histogram_exact(arg, elements)`](#histogram_exactarg-elements) | Returns a `MAP` of key-value pairs representing the requested elements and their counts. A catch-all element specific to the data-type is automatically added to count other elements when they appear, see [`is_histogram_other_bin`]({% link docs/sql/functions/utility.md %}#is_histogram_other_binarg). |
| [`last(arg)`](#lastarg) | Returns the last value of a column. This function is [affected by ordering](#order-by-clause-in-aggregate-functions). |
| [`list(arg)`](#listarg) | Returns a `LIST` containing all the values of a column. This function is [affected by ordering](#order-by-clause-in-aggregate-functions). |
| [`max(arg)`](#maxarg) | Returns the maximum value present in `arg`. |
| [`max(arg)`](#maxarg) | Returns the maximum value present in `arg`. This function is [unaffected by distinctness](#distinct-clause-in-aggregate-functions). |
| [`max(arg, n)`](#maxarg-n) | Returns a `LIST` containing the `arg` values for the "top" `n` rows ordered by `arg` descending. |
| [`max_by(arg, val)`](#max_byarg-val) | Finds the row with the maximum `val`. Calculates the `arg` expression at that row. This function is [affected by ordering](#order-by-clause-in-aggregate-functions). |
| [`max_by(arg, val, n)`](#max_byarg-val-n) | Returns a `LIST` containing the `arg` expressions for the "top" `n` rows ordered by `val` descending. |
| [`min(arg)`](#minarg) | Returns the minimum value present in `arg`. |
| [`min(arg)`](#minarg) | Returns the minimum value present in `arg`. This function is [unaffected by distinctness](#distinct-clause-in-aggregate-functions). |
| [`min(arg, n)`](#minarg-n) | Returns a `LIST` containing the `arg` values for the "bottom" `n` rows ordered by `arg` ascending. |
| [`min_by(arg, val)`](#min_byarg-val) | Finds the row with the minimum `val`. Calculates the `arg` expression at that row. This function is [affected by ordering](#order-by-clause-in-aggregate-functions). |
| [`min_by(arg, val, n)`](#min_byarg-val-n) | Returns a `LIST` containing the `arg` expressions for the "bottom" `n` rows ordered by `val` ascending. |
Expand Down
110 changes: 77 additions & 33 deletions docs/sql/functions/window_functions.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,24 +64,24 @@ The table below shows the available general window functions.

| Name | Description |
|:--|:-------|
| [`cume_dist()`](#cume_dist) | The cumulative distribution: (number of partition rows preceding or peer with current row) / total partition rows. |
| [`cume_dist([ORDER BY ordering])`](#cume_dist) | The cumulative distribution: (number of partition rows preceding or peer with current row) / total partition rows. |
| [`dense_rank()`](#dense_rank) | The rank of the current row *without gaps;* this function counts peer groups. |
| [`first_value(expr[ IGNORE NULLS])`](#first_valueexpr-ignore-nulls) | Returns `expr` evaluated at the row that is the first row (with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame. |
| [`lag(expr[, offset[, default]][ IGNORE NULLS])`](#lagexpr-offset-default-ignore-nulls) | Returns `expr` evaluated at the row that is `offset` rows (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) before the current row within the window frame; if there is no such row, instead return `default` (which must be of the Same type as `expr`). Both `offset` and `default` are evaluated with respect to the current row. If omitted, `offset` defaults to `1` and default to `NULL`. |
| [`last_value(expr[ IGNORE NULLS])`](#last_valueexpr-ignore-nulls) | Returns `expr` evaluated at the row that is the last row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame. |
| [`lead(expr[, offset[, default]][ IGNORE NULLS])`](#leadexpr-offset-default-ignore-nulls) | Returns `expr` evaluated at the row that is `offset` rows after the current row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) within the window frame; if there is no such row, instead return `default` (which must be of the Same type as `expr`). Both `offset` and `default` are evaluated with respect to the current row. If omitted, `offset` defaults to `1` and default to `NULL`. |
| [`nth_value(expr, nth[ IGNORE NULLS])`](#nth_valueexpr-nth-ignore-nulls) | Returns `expr` evaluated at the nth row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame (counting from 1); `NULL` if no such row. |
| [`ntile(num_buckets)`](#ntilenum_buckets) | An integer ranging from 1 to `num_buckets`, dividing the partition as equally as possible. |
| [`percent_rank()`](#percent_rank) | The relative rank of the current row: `(rank() - 1) / (total partition rows - 1)`. |
| [`rank_dense()`](#rank_dense) | The rank of the current row *without gaps. |
| [`rank()`](#rank) | The rank of the current row *with gaps;* same as `row_number` of its first peer. |
| [`row_number()`](#row_number) | The number of the current row within the partition, counting from 1. |

#### `cume_dist()`
| [`first_value(expr[ ORDER BY ordering][ IGNORE NULLS])`](#first_valueexpr-ignore-nulls) | Returns `expr` evaluated at the row that is the first row (with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame. |
| [`lag(expr[, offset[, default]][ ORDER BY ordering][ IGNORE NULLS])`](#lagexpr-offset-default-ignore-nulls) | Returns `expr` evaluated at the row that is `offset` rows (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) before the current row within the window frame; if there is no such row, instead return `default` (which must be of the Same type as `expr`). Both `offset` and `default` are evaluated with respect to the current row. If omitted, `offset` defaults to `1` and default to `NULL`. |
| [`last_value(expr[ ORDER BY ordering][ IGNORE NULLS])`](#last_valueexpr-ignore-nulls) | Returns `expr` evaluated at the row that is the last row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame. |
| [`lead(expr[, offset[, default]][ ORDER BY ordering][ IGNORE NULLS])`](#leadexpr-offset-default-ignore-nulls) | Returns `expr` evaluated at the row that is `offset` rows after the current row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) within the window frame; if there is no such row, instead return `default` (which must be of the Same type as `expr`). Both `offset` and `default` are evaluated with respect to the current row. If omitted, `offset` defaults to `1` and default to `NULL`. |
| [`nth_value(expr, nth[ ORDER BY ordering][ IGNORE NULLS])`](#nth_valueexpr-nth-ignore-nulls) | Returns `expr` evaluated at the nth row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame (counting from 1); `NULL` if no such row. |
| [`ntile(num_buckets[ ORDER BY ordering])`](#ntilenum_buckets) | An integer ranging from 1 to `num_buckets`, dividing the partition as equally as possible. |
| [`percent_rank([ORDER BY ordering])`](#percent_rank) | The relative rank of the current row: `(rank() - 1) / (total partition rows - 1)`. |
| [`rank_dense()`](#rank_dense) | The rank of the current row *without gaps*. |
| [`rank([ORDER BY ordering])`](#rank) | The rank of the current row *with gaps;* same as `row_number` of its first peer. |
| [`row_number([ORDER BY ordering])`](#row_number) | The number of the current row within the partition, counting from 1. |

#### `cume_dist([ORDER BY ordering])`

<div class="nostroke_table"></div>

| **Description** | The cumulative distribution: (number of partition rows preceding or peer with current row) / total partition rows. |
| **Description** | The cumulative distribution: (number of partition rows preceding or peer with current row) / total partition rows. If an `ORDER BY` clause is specified, the distribution is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | `DOUBLE` |
| **Example** | `cume_dist()` |

Expand All @@ -94,59 +94,59 @@ The table below shows the available general window functions.
| **Example** | `dense_rank()` |
| **Aliases** | `rank_dense()` |

#### `first_value(expr[ IGNORE NULLS])`
#### `first_value(expr[ ORDER BY ordering][ IGNORE NULLS])`

<div class="nostroke_table"></div>

| **Description** | Returns `expr` evaluated at the row that is the first row (with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame. |
| **Description** | Returns `expr` evaluated at the row that is the first row (with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame. If an `ORDER BY` clause is specified, the first row number is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | Same type as `expr` |
| **Example** | `first_value(column)` |

#### `lag(expr[, offset[, default]][ IGNORE NULLS])`
#### `lag(expr[, offset[, default]][ ORDER BY ordering][ IGNORE NULLS])`

<div class="nostroke_table"></div>

| **Description** | Returns `expr` evaluated at the row that is `offset` rows (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) before the current row within the window frame; if there is no such row, instead return `default` (which must be of the Same type as `expr`). Both `offset` and `default` are evaluated with respect to the current row. If omitted, `offset` defaults to `1` and default to `NULL`. |
| **Description** | Returns `expr` evaluated at the row that is `offset` rows (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) before the current row within the window frame; if there is no such row, instead return `default` (which must be of the Same type as `expr`). Both `offset` and `default` are evaluated with respect to the current row. If omitted, `offset` defaults to `1` and default to `NULL`. If an `ORDER BY` clause is specified, the lagged row number is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | Same type as `expr` |
| **Aliases** | `lag(column, 3, 0)` |

#### `last_value(expr[ IGNORE NULLS])`
#### `last_value(expr[ ORDER BY ordering][ IGNORE NULLS])`

<div class="nostroke_table"></div>

| **Description** | Returns `expr` evaluated at the row that is the last row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame. |
| **Description** | Returns `expr` evaluated at the row that is the last row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame. If omitted, `offset` defaults to `1` and default to `NULL`. If an `ORDER BY` clause is specified, the last row is determined within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | Same type as `expr` |
| **Example** | `last_value(column)` |

#### `lead(expr[, offset[, default]][ IGNORE NULLS])`
#### `lead(expr[, offset[, default]][ ORDER BY ordering][ IGNORE NULLS])`

<div class="nostroke_table"></div>

| **Description** | Returns `expr` evaluated at the row that is `offset` rows after the current row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) within the window frame; if there is no such row, instead return `default` (which must be of the Same type as `expr`). Both `offset` and `default` are evaluated with respect to the current row. If omitted, `offset` defaults to `1` and default to `NULL`. |
| **Description** | Returns `expr` evaluated at the row that is `offset` rows after the current row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) within the window frame; if there is no such row, instead return `default` (which must be of the Same type as `expr`). Both `offset` and `default` are evaluated with respect to the current row. If omitted, `offset` defaults to `1` and default to `NULL`. If an `ORDER BY` clause is specified, the leading row number is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | Same type as `expr` |
| **Aliases** | `lead(column, 3, 0)` |

#### `nth_value(expr, nth[ IGNORE NULLS])`
#### `nth_value(expr, nth[ ORDER BY ordering][ IGNORE NULLS])`

<div class="nostroke_table"></div>

| **Description** | Returns `expr` evaluated at the nth row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame (counting from 1); `NULL` if no such row. |
| **Description** | Returns `expr` evaluated at the nth row (among rows with a non-null value of `expr` if `IGNORE NULLS` is set) of the window frame (counting from 1); `NULL` if no such row. If an `ORDER BY` clause is specified, the nth row number is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | Same type as `expr` |
| **Aliases** | `nth_value(column, 2)` |

#### `ntile(num_buckets)`
#### `ntile(num_buckets[ ORDER BY ordering])`

<div class="nostroke_table"></div>

| **Description** | An integer ranging from 1 to `num_buckets`, dividing the partition as equally as possible. |
| **Description** | An integer ranging from 1 to `num_buckets`, dividing the partition as equally as possible. If an `ORDER BY` clause is specified, the ntile is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | `BIGINT` |
| **Example** | `ntile(4)` |

#### `percent_rank()`
#### `percent_rank([ORDER BY ordering])`

<div class="nostroke_table"></div>

| **Description** | The relative rank of the current row: `(rank() - 1) / (total partition rows - 1)`. |
| **Description** | The relative rank of the current row: `(rank() - 1) / (total partition rows - 1)`. If an `ORDER BY` clause is specified, the relative rank is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | `DOUBLE` |
| **Example** | `percent_rank()` |

Expand All @@ -159,19 +159,19 @@ The table below shows the available general window functions.
| **Example** | `rank_dense()` |
| **Aliases** | `dense_rank()` |

#### `rank()`
#### `rank([ORDER BY ordering])`

<div class="nostroke_table"></div>

| **Description** | The rank of the current row *with gaps;* same as `row_number` of its first peer. |
| **Description** | The rank of the current row *with gaps*; same as `row_number` of its first peer. If an `ORDER BY` clause is specified, the rank is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | `BIGINT` |
| **Example** | `rank()` |

#### `row_number()`
#### `row_number([ORDER BY ordering])`

<div class="nostroke_table"></div>

| **Description** | The number of the current row within the partition, counting from 1. |
| **Description** | The number of the current row within the partition, counting from 1. If an `ORDER BY` clause is specified, the row number is computed within the frame using the provided ordering instead of the frame ordering. |
| **Return Type** | `BIGINT` |
| **Example** | `row_number()` |

Expand All @@ -180,6 +180,50 @@ The table below shows the available general window functions.
All [aggregate functions]({% link docs/sql/functions/aggregates.md %}) can be used in a windowing context, including the optional [`FILTER` clause]({% link docs/sql/query_syntax/filter.md %}).
The `first` and `last` aggregate functions are shadowed by the respective general-purpose window functions, with the minor consequence that the `FILTER` clause is not available for these but `IGNORE NULLS` is.

## DISTINCT Arguments

All aggregate window functions support using a `DISTINCT` clause for the arguments. When the `DISTINCT` clause is
provided, only distinct values are considered in the computation of the aggregate. This is typically used in combination
with the `COUNT` aggregate to get the number of distinct elements; but it can be used together with any aggregate
function in the system. There are some aggregates that are insensitive to duplicate values (e.g., `min`, `max`) and for
them this clause is parsed and ignored.

```sql
-- Count the number of distinct users at a given point in time
SELECT COUNT(DISTINCT name) OVER (ORDER BY time) FROM sales;
-- Concatenate those distinct users into a list
SELECT LIST(DISTINCT name) OVER (ORDER BY time) FROM sales;
```

### ORDER BY Arguments

All aggregate window functions support using an `ORDER BY` argument clause that is *different* from the window ordering.
When the `ORDER BY` argument clause is provided, the values being aggregated are sorted before applying the function.
Usually this is not important, but there are some order-sensitive aggregates that can have indeterminate results (e.g.,
`mode`, `list` and `string_agg`). These can be made deterministic by ordering the arguments. For order-insensitive
aggregates, this clause is parsed and ignored.

```sql
-- Compute the modal value up to each time, breaking ties in favour of the most recent value.
SELECT MODE(value ORDER BY time DESC) OVER (ORDER BY time) FROM sales;
```

The SQL standard does not provide for using `ORDER BY` with general-purpose window functions, but we have extended all
of these functions (except `dense_rank`) to accept this syntax and use framing to restrict the range that the secondary
ordering applies to.

```sql
-- Compare each athlete's time in an event with the best time to date
SELECT event, date, athlete, time
first_value(time ORDER BY time DESC) OVER w AS record_time,
first_value(athlete ORDER BY time DESC) OVER w AS record_athlete,
FROM meet_results
WINDOW w AS (PARTITION BY event ORDER BY datetime)
ORDER BY ALL
```

Note that there is no comma separating the arguments from the `ORDER BY` clause.

## Nulls

All [general-purpose window functions](#general-purpose-window-functions) that accept `IGNORE NULLS` respect nulls by default. This default behavior can optionally be made explicit via `RESPECT NULLS`.
Expand Down Expand Up @@ -291,7 +335,7 @@ This distance can either be specified as an integral number of `ROWS`
or as a `RANGE` delta expression.
For a `RANGE` specification, there must be only one ordering expression,
and it has to support addition and subtraction (i.e., numbers or `INTERVAL`s).
The default frame is from `UNBOUNDED PRECEDING` to `UNBOUNDED FOLLOWING` when no `ORDER BY` clause is present and from `UNBOUNDED PRECEDING` to `CURRENT ROW` when an `ORDER BY` clause is present.
The default frame is from `UNBOUNDED PRECEDING` to `UNBOUNDED FOLLOWING` when no `ORDER BY` clause is present and from `UNBOUNDED PRECEDING` to `CURRENT ROW` when an `ORDER BY` clause is present.
It is invalid for a frame to start after it ends.
Using the [`EXCLUDE` clause](#exclude-clause), rows around the current row can be excluded from the frame.

Expand Down

0 comments on commit d4e8928

Please sign in to comment.