Welcome to End Point’s blog

Ongoing observations by End Point people

Reducing bloat without locking

It's not altogether uncommon to find a database where someone has turned off vacuuming, for a table or for the entire database. I assume people do this thinking that vacuuming is taking too much processor time or disk IO or something, and needs to be turned off. While this fixes the problem very temporarily, in the long run it causes tables to grow enormous and performance to take a dive. There are two ways to fix the problem: moving rows around to consolidate them, or rewriting the table completely. Prior to PostgreSQL 9.0, VACUUM FULL did the former; in 9.0 and above, it does the latter. CLUSTER is another suitable alternative, which also does the latter. Unfortunately all these methods require heavy table locking.
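Before reaching for any of these, it helps to know which tables are actually bloated. A rough sketch using the statistics collector (the dead-tuple counts are estimates, and the threshold here is arbitrary):

```sql
-- Spot bloat candidates: tables with many dead row versions.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
```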

Recently I've been experimenting with an alternative method -- sort of a VACUUM FULL Lite. Vanilla VACUUM can reduce table size when the pages at the end of a table are completely empty. The trick is to empty those pages of live data. You do that by paying close attention to the table's ctid column:

5432 josh@josh# \d foo
      Table "public.foo"
 Column |  Type   | Modifiers 
--------+---------+-----------
 a      | integer | not null
 b      | integer | 
Indexes:
    "foo_pkey" PRIMARY KEY, btree (a)

5432 josh@josh# select ctid, * from foo;
 ctid  | a | b 
-------+---+---
 (0,1) | 1 | 1
 (0,2) | 2 | 2
(2 rows)

The ctid is one of several hidden columns found in each PostgreSQL table. It shows up in query results only if you explicitly ask for it, and tells you two values: a page number, and a tuple number. Pages are numbered sequentially from zero, starting with the first page in the relation's first file, and ending with the last page in its last file. Tuple numbers refer to entries within each page, and are numbered sequentially starting from one. When I update a row, the row's ctid changes, because the update creates a new version of the row and leaves the old version behind (see this page for explanation of that behavior).
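The other hidden system columns can be queried the same way; for instance, against the foo table above (a sketch):

```sql
-- xmin and xmax record the transactions that created and deleted each row
-- version; tableoid identifies the table. None of these appear in SELECT *.
SELECT ctid, xmin, xmax, tableoid, a, b FROM foo;
```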

5432 josh@josh# update foo set a = 3 where a = 2;
UPDATE 1
5432 josh@josh*# select ctid, * from foo;
 ctid  | a | b 
-------+---+---
 (0,1) | 1 | 1
 (0,3) | 3 | 2
(2 rows)

Note the changed ctid for the second row. If I vacuum this table now, I'll see it remove one dead row version, from both the table and its associated index:

5432 josh@josh# VACUUM verbose foo;
INFO:  vacuuming "public.foo"
INFO:  scanned index "foo_pkey" to remove 1 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "foo": removed 1 row versions in 1 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "foo_pkey" now contains 2 row versions in 2 pages
DETAIL:  1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "foo": found 1 removable, 2 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
1 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

So given these basics, how can I make tables smaller? Let's build a bloated table:

5432 josh@josh# truncate foo;
TRUNCATE TABLE
5432 josh@josh*# insert into foo select generate_series(1, 1000);
INSERT 0 1000
5432 josh@josh*# delete from foo where a % 2 = 0;
DELETE 500
5432 josh@josh*# select max(ctid) from foo;
(1 row)
5432 josh@josh# vacuum verbose foo;
INFO:  vacuuming "public.foo"
INFO:  scanned index "foo_pkey" to remove 500 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "foo": removed 500 row versions in 4 pages

I've filled the table with 1000 rows, and then deleted every other row. The last tuple is on the fourth page (remember they're numbered starting with zero), but since half the table is empty space, I can probably squish it into three or maybe just two pages. I'll start by moving the tuples on the last page off to another page, by updating them:
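To see how the live tuples are actually spread across pages, you can group on the page number embedded in ctid. A sketch (the text-to-point cast is just a convenient way to split the page/tuple pair):

```sql
-- Count live tuples per heap page.
SELECT (ctid::text::point)[0]::int AS page, count(*) AS live_tuples
FROM foo
GROUP BY 1
ORDER BY 1;
```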

5432 josh@josh# begin;
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
5432 josh@josh*# update foo set a = a where ctid >= '(3,0)';
5432 josh@josh*# commit;

Here I'm not changing the row at all, but the tuples are moving around into dead space earlier in the table; this is apparent because the number of rows affected decreases. For the first update or two, there's room enough on the page to store all the new rows, but after a few updates they have to start moving to new pages. Eventually the row count goes to zero, meaning there are no rows on or after page #3, so vacuum can truncate that page:
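Rather than re-running the update by hand, the whole thing can be wrapped in a loop. A sketch as a plpgsql DO block (available in 9.0 and later), which conveniently also runs in a single transaction:

```sql
DO $$
DECLARE
  moved bigint;
BEGIN
  LOOP
    -- Keep rewriting whatever still lives on page 3 or beyond.
    UPDATE foo SET a = a WHERE ctid >= '(3,0)';
    GET DIAGNOSTICS moved = ROW_COUNT;
    -- Once nothing is left on those pages, a plain VACUUM can truncate them.
    EXIT WHEN moved = 0;
  END LOOP;
END
$$;
```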

5432 josh@josh# vacuum verbose foo;
INFO:  vacuuming ""

INFO:  "foo": truncated 4 to 3 pages

It's important to note that I did this all within a transaction. If I hadn't, there's a possibility that vacuum would have reclaimed some of the dead space made by the updates, so instead of moving to different pages, the tuples would have moved back and forth within the same page.

There remains one problem: this technique can't remove index bloat, and in fact all this tuple-moving causes more of it. I can't fix that completely, but in PostgreSQL 8.3 and later I can avoid creating too much new index bloat by updating an unindexed column instead of an indexed one, because the heap-only tuples (HOT) feature avoids modifying indexes if:

  1. the update touches only unindexed columns, and
  2. there's sufficient free space available for the tuple to stay on the same page.
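You can watch whether updates are taking the HOT path via the statistics collector (a sketch, again using the foo table):

```sql
-- n_tup_hot_upd counts updates that avoided touching the indexes.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'foo';
```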
Despite the index bloat caveat, this can be a useful technique to slim down particularly bloated tables without VACUUM FULL and its associated locking.


Heikki Linnakangas said...

When it was decided to change the way VACUUM FULL works, there was this idea of providing a contrib module that does what you're describing. See:
thread on pgsql-hackers

Joshua Tolley said...

Yeah, I didn't bring up that discussion at least in part because it was already a long post. I'd really like something like that to exist, especially if it can be told to run through only a portion of the available bloat. In at least some cases, full table locking would be fine if it was only for a short time, as opposed to the full time required to consolidate rows, or rewrite an entire table and all its indexes.

Greg Smith said...

To clarify that bit at the end, HOT helps reduce the scope of the problems with this trick, by avoiding bloat during some of the many updates here, but it can't completely eliminate the index bloat concerns. Referring to what you wrote for a second about when HOT kicks in: "...there's sufficient free space available for the tuple to stay on the same page". If the tuple stays on the same page, you didn't accomplish anything useful yet: the tuple will still be on the almost empty page you were trying to clean out. So eventually, to do what you want here, you must create a situation where HOT doesn't help.

Also, while this is a neat userland version of what had previously only been proposed as server-side code, your title is overpromising just a bit. In order for the database to do the "truncated x to y pages" trick, it still has to acquire an exclusive lock on the table, as pointed out in the "Recovering Disk Space" section of the documentation (and in vacuumlazy.c, where this message comes from). So this technique doesn't eliminate the need for such a lock; it just reduces the amount of time the lock has to be held. VACUUM FULL holds an exclusive lock the whole time it runs, whereas with this approach you only need one for the shorter period when the "lazy" regular VACUUM runs at the end. To reduce the size of a relation, you must at some point hold at least a brief exclusive lock on it; that can't be avoided.

Shanx said...

Did this ever get fixed? There has to be a better way than this. I now have large 10GB tables with about 6GB of wasted disk space. The jiggery pokery described in this article is all fancy and cute with small tables, where this kind of admin work can be performed episodically, but it's impractical for real world high performance environments. Any ideas?

Greg Sabino Mullane said...

Shanx, you might look into the pg_repack program: