Dwi Wahyudi
Senior Software Engineer (Ruby, Golang, Java)
There is a time when we need to group our data by certain time period in PostgreSQL.
Overview
Let say we have a table named billings
which stores billing informations in our database. It has amount
column that stores the paid amount and paid_at
column that stores when the billing is paid.
We want to get some informations like:
- On each month, how many count of billings do we have?
- On each week, how much the billings we pay?
- On each day, what’s the average of the billings we pay?
We have several functions to pick from. DATE_TRUNC
, EXTRACT
and DATE_PART
.
DATE_TRUNC
Function
DATE_TRUNC
function basically truncates the timestamp from the field (by month, week, day or so on).
Let’s go back to first question:
On each month, how many count of billings do we have?
Here’s how we’re going to solve it.
SELECT
COUNT(*) AS count_all,
DATE_TRUNC('month', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date AS month
FROM
"billings"
WHERE
("billings"."paid_at" IS NOT NULL)
GROUP BY
DATE_TRUNC('month', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date
And here’s the result example:
count_all|month |
---------+----------+
432|2023-03-01|
14|2023-01-01|
30|2022-04-01|
9|2022-09-01|
7|2023-04-01|
10|2023-06-01|
12|2022-07-01|
90|2022-10-01|
44|2022-06-01|
20|2022-08-01|
31|2023-05-01|
65|2022-11-01|
183|2023-02-01|
91|2023-07-01|
10|2023-08-01|
3|2022-12-01|
12|2022-02-01|
108|2022-05-01|
Let’s go to the next question:
On each week, how much the billings we pay?
The grouping principle is the same, however we need to group the data by week.
SELECT
SUM("billings"."amount") AS sum_amount,
DATE_TRUNC('week', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date AS week_segment
FROM
"billings"
WHERE
("billings"."paid_at" IS NOT NULL)
GROUP BY
DATE_TRUNC('week', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date
The result will be something like this:
sum_amount |week_segment|
--------------+------------+
130.000000| 2023-01-01|
480.000000| 2022-10-23|
0.000000| 2022-04-17|
80.000000| 2022-06-05|
431.000000| 2023-06-11|
0.000000| 2023-04-16|
24.000000| 2023-06-25|
33.000000| 2022-08-21|
0.000000| 2022-06-12|
3000.000000| 2022-02-13|
20.000000| 2022-08-07|
1120.000000| 2023-01-15|
The same goes with the last question.
On each day, what’s the average of the billings we pay?
We just need to truncate it by day.
SELECT
AVG("billings"."amount") AS avg_amount,
DATE_TRUNC('day', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date AS day
FROM
"billings"
WHERE
("billings"."paid_at" IS NOT NULL)
GROUP BY
DATE_TRUNC('day', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date
EXTRACT
Function
We can use EXTRACT
function to get the specific number of timestamp date (specific number of month, day, etc).
SELECT
COUNT(1) AS count,
EXTRACT('year' FROM paid_at) AS year, EXTRACT('month' FROM paid_at) AS month
FROM
"billings"
WHERE
("billings"."paid_at" IS NOT NULL)
GROUP BY
EXTRACT('year' FROM paid_at), EXTRACT('month' FROM paid_at)
ORDER BY EXTRACT('year' FROM paid_at), EXTRACT('month' FROM paid_at);
The example result:
count|year|month|
-----+----+-----+
13|2022| 2|
120|2022| 4|
8|2022| 5|
12|2022| 6|
210|2022| 7|
180|2022| 8|
3|2022| 9|
DATE_PART
Function
DATE_PART
function is quite similar to EXTRACT
function.
SELECT
COUNT(1) AS count,
DATE_PART('year', paid_at) AS year, DATE_PART('month', paid_at) AS month
FROM
"billings"
WHERE
("billings"."paid_at" IS NOT NULL)
GROUP BY
DATE_PART('year', paid_at), DATE_PART('month', paid_at)
ORDER BY DATE_PART('year', paid_at), DATE_PART('month', paid_at);
Cyclical Period
DATE_PART
or EXTRACT
is preferrable if we want to see the cyclical data, we can group it with certain cycle period, so for example, group data by week day.
GROUP BY EXTRACT (DOW from paid_at AT TIME ZONE 'utc' AT TIME ZONE locations.timezone);
This will make the query return only 7 rows (from Sunday (0) to Saturday (6) represented by integer value). Data will be grouped by the week day, Sunday to Saturday.
We can for example group data cyclical by hour:
GROUP BY EXTRACT (hour from paid_at AT TIME ZONE 'utc' AT TIME ZONE locations.timezone);
This will make the query to return at most 24 rows. Data will be grouped by hour.
Missing Values
The problem with these 3 functions, is that any missing segment (segment that doesn’t have data) won’t be shown in the result, let say month August 2023 or first week of March 2022 don’t have the data (thus it won’t be aggregated by PostgreSQL, and thus won’t be shown in the result).
PostgreSQL will just aggregate (GROUP BY
) data that exist.
We can handle this by generating the segments in our application code and match them one by one or we can use GENERATE_SERIES
function.
SELECT GENERATE_SERIES('2023-01-1', '2023-12-1', INTERVAL '1 MONTH') AS start_of_months;
Which will return the result like this:
start_of_months |
-----------------------------+
2023-01-01 00:00:00.000 +0700|
2023-02-01 00:00:00.000 +0700|
2023-03-01 00:00:00.000 +0700|
2023-04-01 00:00:00.000 +0700|
2023-05-01 00:00:00.000 +0700|
2023-06-01 00:00:00.000 +0700|
2023-07-01 00:00:00.000 +0700|
2023-08-01 00:00:00.000 +0700|
2023-09-01 00:00:00.000 +0700|
2023-10-01 00:00:00.000 +0700|
2023-11-01 00:00:00.000 +0700|
2023-12-01 00:00:00.000 +0700|
With this function in mind, we can join the result of it with the aggregation function.
Here’s the example, we first get the earliest and latest paid_at
timestamp from billings
(week_values
). We use it to generate a series of weekly segments (week_segments
). We then get the aggregated values (weekly_counts
).
From week_segments
we LEFT JOIN to weekly_counts
to find out if each week segment has aggregated value or not.
WITH week_values AS (
SELECT DATE_TRUNC('week', min(paid_at)) AS earliest_week,
DATE_TRUNC('week', max(paid_at)) AS latest_week
FROM billings
),
week_segments AS (
SELECT GENERATE_SERIES(earliest_week, latest_week, '1 week'::interval) AS week FROM week_values
),
weekly_counts AS (
SELECT DATE_TRUNC('week', paid_at) AS week,
COUNT(*) AS count
FROM billings GROUP BY 1
)
SELECT week_segments.week, COALESCE(weekly_counts.count, 0) AS count FROM week_segments
LEFT OUTER JOIN weekly_counts ON week_segments.week = weekly_counts.week;