Dwi Wahyudi

Senior Software Engineer (Ruby, Golang, Java)


There is a time when we need to group our data by certain time period in PostgreSQL.

Overview

Let say we have a table named billings which stores billing informations in our database. It has amount column that stores the paid amount and paid_at column that stores when the billing is paid.

We want to get some informations like:

  • On each month, how many count of billings do we have?
  • On each week, how much the billings we pay?
  • On each day, what’s the average of the billings we pay?

We have several functions to pick from. DATE_TRUNC, EXTRACT and DATE_PART.

DATE_TRUNC Function

DATE_TRUNC function basically truncates the timestamp from the field (by month, week, day or so on).

Let’s go back to first question:

On each month, how many count of billings do we have?

Here’s how we’re going to solve it.

SELECT
	COUNT(*) AS count_all,
	DATE_TRUNC('month', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date AS month
FROM
	"billings"
WHERE
	("billings"."paid_at" IS NOT NULL)
GROUP BY
	DATE_TRUNC('month', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date

And here’s the result example:

count_all|month     |
---------+----------+
      432|2023-03-01|
       14|2023-01-01|
       30|2022-04-01|
        9|2022-09-01|
        7|2023-04-01|
       10|2023-06-01|
       12|2022-07-01|
       90|2022-10-01|
       44|2022-06-01|
       20|2022-08-01|
       31|2023-05-01|
       65|2022-11-01|
      183|2023-02-01|
       91|2023-07-01|
       10|2023-08-01|
        3|2022-12-01|
       12|2022-02-01|
      108|2022-05-01|

Let’s go to the next question:

On each week, how much the billings we pay?

The grouping principle is the same, however we need to group the data by week.

	SELECT
		SUM("billings"."amount") AS sum_amount,
		DATE_TRUNC('week', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date AS week_segment
	FROM
		"billings"
	WHERE
		("billings"."paid_at" IS NOT NULL)
	GROUP BY
		DATE_TRUNC('week', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date

The result will be something like this:

sum_amount    |week_segment|
--------------+------------+
    130.000000|  2023-01-01|
    480.000000|  2022-10-23|
      0.000000|  2022-04-17|
     80.000000|  2022-06-05|
    431.000000|  2023-06-11|
      0.000000|  2023-04-16|
     24.000000|  2023-06-25|
     33.000000|  2022-08-21|
      0.000000|  2022-06-12|
   3000.000000|  2022-02-13|
     20.000000|  2022-08-07|
   1120.000000|  2023-01-15|

The same goes with the last question.

On each day, what’s the average of the billings we pay?

We just need to truncate it by day.

	SELECT
		AVG("billings"."amount") AS avg_amount,
		DATE_TRUNC('day', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date AS day
	FROM
		"billings"
	WHERE
		("billings"."paid_at" IS NOT NULL)
	GROUP BY
		DATE_TRUNC('day', "billings"."paid_at"::timestamptz AT TIME ZONE 'Etc/UTC')::date

EXTRACT Function

We can use EXTRACT function to get the specific number of timestamp date (specific number of month, day, etc).

SELECT
	COUNT(1) AS count,
	EXTRACT('year' FROM paid_at) AS year, EXTRACT('month' FROM paid_at) AS month
FROM
	"billings"
WHERE
	("billings"."paid_at" IS NOT NULL)
GROUP BY
	EXTRACT('year' FROM paid_at), EXTRACT('month' FROM paid_at)
ORDER BY EXTRACT('year' FROM paid_at), EXTRACT('month' FROM paid_at);

The example result:

count|year|month|
-----+----+-----+
   13|2022|    2|
  120|2022|    4|
    8|2022|    5|
   12|2022|    6|
  210|2022|    7|
  180|2022|    8|
    3|2022|    9|

DATE_PART Function

DATE_PART function is quite similar to EXTRACT function.

SELECT
	COUNT(1) AS count,
	DATE_PART('year', paid_at) AS year, DATE_PART('month', paid_at) AS month
FROM
	"billings"
WHERE
	("billings"."paid_at" IS NOT NULL)
GROUP BY
	DATE_PART('year', paid_at), DATE_PART('month', paid_at)
ORDER BY DATE_PART('year', paid_at), DATE_PART('month', paid_at);

Cyclical Period

DATE_PART or EXTRACT is preferrable if we want to see the cyclical data, we can group it with certain cycle period, so for example, group data by week day.

GROUP BY EXTRACT (DOW from paid_at AT TIME ZONE 'utc' AT TIME ZONE locations.timezone);

This will make the query return only 7 rows (from Sunday (0) to Saturday (6) represented by integer value). Data will be grouped by the week day, Sunday to Saturday.

We can for example group data cyclical by hour:

GROUP BY EXTRACT (hour from paid_at AT TIME ZONE 'utc' AT TIME ZONE locations.timezone);

This will make the query to return at most 24 rows. Data will be grouped by hour.

Missing Values

The problem with these 3 functions, is that any missing segment (segment that doesn’t have data) won’t be shown in the result, let say month August 2023 or first week of March 2022 don’t have the data (thus it won’t be aggregated by PostgreSQL, and thus won’t be shown in the result).

PostgreSQL will just aggregate (GROUP BY) data that exist.

We can handle this by generating the segments in our application code and match them one by one or we can use GENERATE_SERIES function.

SELECT GENERATE_SERIES('2023-01-1', '2023-12-1', INTERVAL '1 MONTH') AS start_of_months;

Which will return the result like this:

start_of_months              |
-----------------------------+
2023-01-01 00:00:00.000 +0700|
2023-02-01 00:00:00.000 +0700|
2023-03-01 00:00:00.000 +0700|
2023-04-01 00:00:00.000 +0700|
2023-05-01 00:00:00.000 +0700|
2023-06-01 00:00:00.000 +0700|
2023-07-01 00:00:00.000 +0700|
2023-08-01 00:00:00.000 +0700|
2023-09-01 00:00:00.000 +0700|
2023-10-01 00:00:00.000 +0700|
2023-11-01 00:00:00.000 +0700|
2023-12-01 00:00:00.000 +0700|

With this function in mind, we can join the result of it with the aggregation function.

Here’s the example, we first get the earliest and latest paid_at timestamp from billings (week_values). We use it to generate a series of weekly segments (week_segments). We then get the aggregated values (weekly_counts).

From week_segments we LEFT JOIN to weekly_counts to find out if each week segment has aggregated value or not.

WITH week_values AS (
  SELECT DATE_TRUNC('week', min(paid_at)) AS earliest_week,
         DATE_TRUNC('week', max(paid_at)) AS latest_week
  FROM billings
),

week_segments AS (
  SELECT GENERATE_SERIES(earliest_week, latest_week, '1 week'::interval) AS week FROM week_values
),

weekly_counts AS (
  SELECT DATE_TRUNC('week', paid_at) AS week,
  COUNT(*) AS count
  FROM billings GROUP BY 1
)

SELECT week_segments.week, COALESCE(weekly_counts.count, 0) AS count FROM week_segments
LEFT OUTER JOIN weekly_counts ON week_segments.week = weekly_counts.week;