Downstream Implications of Data Page Checksums

Now that Postgres 9.3 is all the rage, data page checksums are starting to see use in production. The feature isn't enabled by default during initdb, so you may want to double check the options that were used when you upgraded.
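
If you're not sure whether an existing cluster was initialized with checksums, pg_controldata will tell you. (The "main" data directory below is just how my desktop is laid out; point it at your own.) On this cluster it reports version 1, meaning checksums are on; a 0 means they're off, and the only way to change that is a fresh initdb and a reload of the data.

postgres@endpoint:~/9.3$ pg_controldata main | grep checksum
Data page checksum version:           1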

What? You have already upgraded to 9.3, right? No? Oh well, when you do get around to updating, keep an eye out for initdb's --data-checksums option, or just -k. To give the feature a try on my development desktop, after the initdb I created a table and loaded in some text data. The text values are just integers cast to text, so we can more easily spot them in the on-disk structure -- you'll see why in a moment. The table was loaded with a good amount of data, at least more than my shared_buffers setting:

postgres=# CREATE TABLE filler (txt TEXT PRIMARY KEY);
CREATE TABLE
postgres=# INSERT INTO filler SELECT generate_series::text FROM generate_series(-10000000,10000000);
INSERT 0 20000001
postgres=# \dt+
List of relations
 Schema |  Name  | Type  |  Owner   |  Size  | Description
--------+--------+-------+----------+--------+-------------
 public | filler | table | postgres | 761 MB |
(1 row)
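
(Why that much? Checksums are only verified when a page is read in from disk, so I wanted the table comfortably larger than shared_buffers -- which on this desktop is just whatever initdb picked, something like:)

postgres=# SHOW shared_buffers;
 shared_buffers 
----------------
 128MB
(1 row)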

There. Maybe a little more than I needed, but it works. The storage on this desktop is so much slower than the processor that, of course, I didn't notice any performance difference with checksums on. But on your nice and speedy server you might see the hit. Anyway, now to find the file on disk...

postgres=# SELECT relfilenode FROM pg_class WHERE relname = 'filler';
 relfilenode
-------------
       16390
(1 row)
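
(If you'd rather not piece the path together by hand, pg_relation_filepath() reports the whole relative path, database oid and all:)

postgres=# SELECT pg_relation_filepath('filler');
 pg_relation_filepath 
----------------------
 base/12066/16390
(1 row)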

postgres@endpoint:~/9.3$ dd bs=8192 count=1 skip=10 if=main/base/12066/16390 of=block
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.0161733 s, 507 kB/s

That relfilenode (plus the "postgres" database oid of 12066) corresponds to base/12066/16390, so I've taken a copy of block 10 in that file. And then introduced some "silent" corruption, the sort that might be seen if I had a scary storage driver, or if a cosmic ray hit the disk platter and flipped a bit:

postgres@endpoint:~/9.3$ sed -iorig 's/9998000/9999000/' block

postgres@endpoint:~/9.3$ diff -u <(hexdump -C block) <(hexdump -C blockorig)
--- /dev/fd/63    2013-10-18 16:35:22.963860942 -0400
+++ /dev/fd/62    2013-10-18 16:35:22.963860942 -0400
@@ -134,7 +134,7 @@
 00000850  98 00 01 00 02 09 18 00  13 2d 39 39 39 37 39 39  |.........-999799|
 00000860  39 00 00 00 00 00 00 00  02 00 00 00 00 00 00 00  |9...............|
 00000870  00 00 00 00 00 00 0a 00  97 00 01 00 02 09 18 00  |................|
-00000880  13 2d 39 39 39 39 30 30  30 00 00 00 00 00 00 00  |.-9999000.......|
+00000880  13 2d 39 39 39 38 30 30  30 00 00 00 00 00 00 00  |.-9998000.......|
 00000890  02 00 00 00 00 00 00 00  00 00 00 00 00 00 0a 00  |................|
 000008a0  96 00 01 00 02 09 18 00  13 2d 39 39 39 38 30 30  |.........-999800|
 000008b0  31 00 00 00 00 00 00 00  02 00 00 00 00 00 00 00  |1...............|

Yep, definitely right in the middle of a column value. Normal Postgres wouldn't have noticed at all, and that incorrect value could quietly creep into the results of queries expecting something different. Inject that corrupt page back into the heap table...

postgres@endpoint:~/9.3$ dd bs=8192 count=1 seek=10 conv=notrunc of=main/base/12066/16390 if=block
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000384083 s, 21.3 MB/s
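
(A word of caution if you're following along: if block 10 happens to still be sitting in shared_buffers, Postgres will keep serving the in-memory copy and never look at the damage -- and if that buffered copy is dirty, the next checkpoint will simply write it back over your handiwork. Restarting the server is the easy way to make sure the next read really does come from disk; on my setup that's just:)

postgres@endpoint:~/9.3$ pg_ctl -D main restart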

postgres=# SELECT count(*) FROM filler;
WARNING:  page verification failed, calculated checksum 14493 but expected 26981
ERROR:  invalid page in block 10 of relation base/12066/16390

... And our checksum-checking Postgres catches it, just as it's supposed to. And, obviously, we can't modify anything on that page either, as Postgres would need to read it into shared buffers before any tuples there could be modified.

postgres=# UPDATE filler SET txt ='Postgres Rules!' WHERE txt = '-9997999';
WARNING:  page verification failed, calculated checksum 14493 but expected 26981
ERROR:  invalid page in block 10 of relation base/12066/16390

The inability to even try to modify the corrupted data is what got me thinking about replicas. Assuming we're protecting against silent disk corruption (rather than Postgres bugs), nothing corrupted has made it into the WAL stream. So, naturally, the replica is fine.

postgres@endpoint:~/9.3$ psql -p 5439
psql (9.3.1)
Type "help" for help.
postgres=# SELECT ctid, * FROM filler WHERE txt IN ('-9997998', '-9997999', '-9998000', '-9998001');
   ctid   |   txt
----------+----------
 (10,153) | -9997998
 (10,152) | -9997999
 (10,151) | -9998000
 (10,150) | -9998001
(4 rows)

You'd probably be tempted to fail over to the replica at this point, which would be the Right thing to do. You are, after all, starting to see odd and (presumably) unexplained corruption in the on-disk state. You would be wise to switch off that hardware as soon as you can and investigate.

But Halloween is right around the corner, so let's give in to some Mad Scientist tendencies! And remember, only try this at home.

postgres@endpoint:~/9.3$ dd bs=8192 count=1 skip=10 if=replica/base/12066/16390 seek=10 conv=notrunc of=main/base/12066/16390
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000295787 s, 27.7 MB/s

With one assumption -- that the replica is caught up to the point where the primary first saw the page as corrupted -- the replica should be guaranteed to have an up-to-date copy of the page, even if other things on the master are changing and the replica's lagging behind.
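
One way to check that assumption, for instance, is to compare WAL positions: pg_current_xlog_location() on the primary against pg_last_xlog_replay_location() on the replica (those are the 9.3 function names, and the locations shown below are made up for illustration). As long as the replica has replayed up to or past the primary's current location, it has seen every change that was ever made to that page.

postgres@endpoint:~/9.3$ psql -p 5435 -c 'SELECT pg_current_xlog_location()'
 pg_current_xlog_location 
---------------------------
 0/6A1B2C38
(1 row)

postgres@endpoint:~/9.3$ psql -p 5439 -c 'SELECT pg_last_xlog_replay_location()'
 pg_last_xlog_replay_location 
-------------------------------
 0/6A1B2C38
(1 row)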

Above, we did a direct copy from the replica's version of that page back to the master...

postgres@endpoint:~/9.3$ psql -p 5435
psql (9.3.1)
Type "help" for help.

postgres=# SELECT ctid, * FROM filler WHERE txt IN ('-9997998', '-9997999', '-9998000', '-9998001');
   ctid   |   txt
----------+----------
 (10,150) | -9998001
 (10,151) | -9998000
 (10,152) | -9997999
 (10,153) | -9997998
(4 rows)

postgres=# UPDATE filler SET txt ='Postgres Rules!' WHERE txt = '-9997999';
UPDATE 1

Voila! Corruption fixed without even having to take the database down.
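
And to be thorough, a sequential scan over the whole table forces every page back through a read -- with a checksum verification for anything that has to come in from disk -- and should come back clean with the same 20000001 rows we loaded:

postgres=# SELECT count(*) FROM filler;
  count   
----------
 20000001
(1 row)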

3 comments:

Holger Amann said...

Thanks for this article!

Is there a chance to upgrade a larger database (~250GB) from 9.2 to 9.3 and enable data page checksums, but without doing a full pg_dump/restore -- by using pg_upgrade and maybe its hardlink feature to reduce downtime?

Josh Williams said...

Hi Holger,

Sadly, no. I don't think pg_upgrade supports enabling checksums while upgrading to 9.3; it would just throw an error.

If your schema would allow for it, I'd probably look at trigger-based replication to perform the upgrade with minimal downtime.

Greg Smith said...

Note that as part of the review of this feature I wrote a corruption testing program that automates introducing single-bit errors. Look for the examples using pg_corrupt in the thread at http://postgresql.1045698.n5.nabble.com/Enabling-Checksums-td5731320i80.html#a5740476

Eventually I'll get that pushed into a QA repo with some similar related code, but the program is pretty easy to use already.

And, no, you can't use pg_upgrade to move between different checksum settings yet. pg_upgrade only touches catalog information right now; it doesn't handle that sort of job at all yet. (It will have to eventually, but the problems involved are a mess.)