Welcome to End Point’s blog

Ongoing observations by End Point people

Filesystem I/O: what we presented

As mentioned last week, Gabrielle Roth and I presented results from tests run in the new Postgres Performance Lab. Our slides are available on Slideshare.

We tested eight core assumptions about filesystem I/O performance and presented the results to a room of filesystem hackers and a few database specialists. Some important things to remember about our tests: we were testing I/O only. No tuning had been done on the hardware, the filesystem defaults, or Postgres, and we did not take reliability into account at all. We will tune the database and filesystem defaults for our next round of tests.

The filesystems we tested were ext2, ext3 (with and without data journaling), xfs, jfs, and reiserfs.

Briefly, here are our assumptions, and the results we presented:

  1. RAID5 is the worst choice for a database. Our tests confirmed this, as expected.
  2. LVM incurs too much overhead to use. Our tests showed that for sequential or random reads on RAID0, LVM doesn't incur much more overhead than hardware or software RAID.
  3. Software RAID is slower. We saw the same result as for LVM: for sequential or random reads, there was not much additional overhead.
  4. Turning off 'atime' is a big performance gain. We didn't see a big improvement, but you do generally get 2-3% improvement "for free" by turning atime off on a filesystem.
  5. Partition alignment is a big deal. Our tests weren't able to prove this, but we still think it's a big problem. Here's one set of tests demonstrating the problem on Windows-based servers.
  6. Journaling filesystems will have worse performance than non-journaling filesystems. Turn the data journaling off on ext3, and you will see better performance than ext2. We polled the audience, and nearly all thought ext2 would have performed better than ext3. People in the room suggested that the difference was because of seek-bundling that's done in ext3, but not ext2.
  7. Striping doubles performance. Doubling performance is a best-case scenario, and not what we observed: throughput increased by about 35%.
  8. Your read-ahead buffer is big enough. The default read-ahead buffer size is 128K. Our tests, and an independent set of tests by another author, show that increasing the read-ahead buffer can provide a performance boost of about 75%. We saw the improvement level out at a buffer size of 8MB, with the bulk of the improvement occurring up to 1MB. We plan to test this further in the future.
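
If you want to try the read-ahead sizes from item 8 on your own hardware, note that on Linux `blockdev --setra` takes its value in 512-byte sectors rather than in kilobytes. Here is a minimal sketch of that conversion. The helper names and the `/dev/sdX` device are placeholders of ours, not part of the test setup, and the script only builds the command rather than running it:

```python
SECTOR_BYTES = 512  # blockdev --setra / --getra count in 512-byte sectors


def readahead_sectors(size_kib):
    """Convert a read-ahead size in KiB to 512-byte sectors."""
    return size_kib * 1024 // SECTOR_BYTES


def setra_command(device, size_kib):
    """Build (but do not run) a blockdev command for the given size.

    `device` is a placeholder such as /dev/sdX; substitute your own.
    """
    return ["blockdev", "--setra", str(readahead_sectors(size_kib)), device]


if __name__ == "__main__":
    # Sizes mentioned in the post: the 128K default, 1MB, and 8MB.
    for size_kib in (128, 1024, 8192):
        print(size_kib, "KiB ->", " ".join(setra_command("/dev/sdX", size_kib)))
```

Running the resulting command needs root, and the setting does not survive a reboot, so treat it as something to benchmark with rather than a permanent configuration.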

All the data from these tests is available on the Postgres Developers wiki.

Our hope is that someone in the Linux filesystem community takes up these tests and starts to run them on other hardware, and on a more regular basis. After the talk, three people expressed interest in running their own tests on our hardware! In the future, we plan to focus our testing mostly on Postgres performance.

Mark Wong and Gabrielle will be presenting this talk again, with a few new results, at the PostgreSQL Conference West.


Jon Jensen said...

Very interesting, Selena. I'm curious about your LVM part of the investigation. What LVM configuration were you using? You wrote:

Our test showed that for sequential or random reads on RAID0, LVM doesn't incur much more overhead than hardware or software RAID.

Does that mean you were using LVM to concatenate several physical volumes into one volume group, and testing performance on a logical volume in that group?

Sorry to be pedantic, but specifically I'm curious if there's any measurable decrease in performance when using LVM on a single disk vs. no LVM (but no software RAID either).

LVM is still useful in such cases, among other things for making atomic volume snapshots. I have never been able to detect a performance hit, but I suppose there could be, just due to using device mapper if nothing else.

Selena Deckelmann said...

Hi Jon,

We tested two disks in a striped LVM configuration. The relevant throughput results are here:

At LPC, one developer noted that LVM automatically adjusts the read-ahead buffer, which may have accounted for a significant portion of the performance gain. This is something we probably won't dig into further (but would love for other people to test), other than to recommend that people try increasing their read-ahead buffer size to something like 8MB and test performance on their hardware.

bushidoka said...

Very well done! So what should I use for my pgsql dbs? I just started a new job and am new to PG in fact. We are using RAID5 right now which I see is not so great! But we like the checksum disk on RAID5. So what's the best configuration to give us that kind of reliability?

Jon Jensen said...

For reliability and performance, RAID 1 and RAID 10 are good choices.