News

Welcome to End Point’s blog

Ongoing observations by End Point people

Saving time with generate_series()

I was giving a presentation once on various SQL constructs, and, borrowing an analogy I'd seen elsewhere, described PostgreSQL's generate_series() function as something you might use in places where, in some other language, you'd use a FOR loop. One attendee asked, "So, why would you ever want a FOR loop in a SQL query?" A fair question, and one that I answered using examples later in the presentation. Another such example showed up recently on a client's system where the ORM was trying to be helpful, and chose a really bad query to do it.

The application in question was trying to display a list of records, and allow the user to search through them, modify them, filter them, etc. Since the ORM knew users might filter on a date-based field, it wanted to present a list of years containing valid records. So it did this:

SELECT DISTINCT DATE_TRUNC('year', some_date_field) FROM some_table;

In fairness to the ORM, this query wouldn't be so bad if some_table only had a few hundred or thousand rows. But in our case it has several tens of millions. This query results in a sequential scan of each of those records, in order to build a list of, as it turns out, about fifty total years. There must be a better way...

The better way we chose turns out to be, in essence, this: find the years of the maximum and minimum date values in the date field, construct a list of all years between the minimum and maximum, inclusive, and see which ones exist in the table. This date field is indexed, so finding its maximum and minimum is very fast:

SELECT
    DATE_TRUNC('year', MIN(some_date_field)) AS mymin,
    DATE_TRUNC('year', MAX(some_date_field)) AS mymax
FROM some_table

Here's where the FOR loop idea comes in, though it's probably better described as an "iterator" rather than a FOR loop specifically: for each year between mymin and mymax inclusive, I want a database row. The analogy may not hold terribly well, but the technique is very useful, because it will create a list of all the possible years I might be interested in, and it will do it with just two scans of the some_date_field index, rather than a sequential scan of millions of rows.

SELECT
    generate_series(mymin::INTEGER, mymax::INTEGER) AS yearnum
FROM (
    SELECT
        DATE_TRUNC('year', MIN(some_date_field)) AS mymin,
        DATE_TRUNC('year', MAX(some_date_field)) AS mymax
    FROM some_table
) minmax_tbl

Now I simply have to convert these values to years, and see which ones exist in the underlying table:

SELECT
    yearbegin::timestamptz
FROM
    (
        SELECT
            yearnum * INTERVAL '1 year' + '0000-01-01'::date AS yearbegin
        FROM (
            SELECT
                generate_series(mymin::INTEGER, mymax::INTEGER) AS yearnum
            FROM (
                SELECT
                    DATE_TRUNC('year', MIN(some_date_field)) AS mymin,
                    DATE_TRUNC('year', MAX(some_date_field)) AS mymax
                FROM some_table
        ) yearnum_tbl
    ) beginend_tbl
WHERE
    EXISTS (
        SELECT 1 FROM some_table
        WHERE
            some_date_field BETWEEN yearbegin AND yearbegin + INTERVAL '1 year'
    )
ORDER BY yearbegin ASC
;

As expected, this probes the some_date_field index twice, to get the maximum and minimum date values, and then once for each year between those values. Because of some strangely-dated data in there, that means nearly 10,000 index probes, but that's still much faster than scanning the entire table.

4 comments:

umitanuki said...

generate_series() accepts timestamp as of 8.4. The expample should get shorter with it.

Sebastiaan Stok said...

A good friend of my wrote an article about generate_series().

Note, it is in Dutch.
But you can use Google Translate for that.

http://www.yapf.net/index.php/Generate_series

I used his article to write an very complex query to get the traffic usage of my webhosting customers.

I will post it later, since I can't reach the source from this computer.

Sebastiaan Stok said...

Better late then never.

"SELECT
SUM(log_entries.bytes_sent) AS traffic_use
,g.num AS stat_time
FROM
generate_series(1::integer, 24::integer) AS g(num)
LEFT JOIN
log_entries
ON
date_part('hour'::text, log_entries.request_time) = g.num
AND
date_part('day'::text, log_entries.request_time) = :day
AND
date_part('year'::text, request_time) = :year
AND
date_part('month'::text, request_time) = :month
AND
virtual_domain::text LIKE ANY (public.get_hostnames(:account_id))
GROUP BY
g.num,
date_part('year'::text, request_time),
date_part('month'::text, request_time)
ORDER BY
stat_time ASC"

Joshua Tolley said...

I'm glad to know you didn't forget it! :)