Perl 5 now on Git

It's awesome to see that the Perl 5 source code repository has been migrated from Perforce to Git, and is now active at http://perl5.git.perl.org/. Congratulations to all those who worked hard to migrate the entire version control history, all the way back to the beginning with Perl 1.0!

Skimming through the history turns up some fun things:

  • The last Perforce commit appears to have been on 16 December 2008.
  • Perl 5 is still under very active development! (It seems a lot of people are missing this simple fact, so I don't feel bad stating it.)
  • Perl 5.8.0 was released on 18 July 2002, and 5.6.0 on 23 March 2000. Those both seem so recent ...
  • Perl 5.000 was released on 17 October 1994.
  • Perl 4.0.00 was released 21 March 1991, and the last Perl 4 release, 4.0.36, was released on 4 February 1993. For having an active lifespan of only 4 or so years till Perl 5 became popular, Perl 4 code sure kicked around on servers a lot longer than that.
  • Perl 1.0 was announced by Larry Wall on 18 December 1987. He called Perl a "replacement" for awk and sed. That first release included 49 regression tests.
  • Some of the patches are from people whose contact information is long gone, rendered in Git commits as e.g. Dan Faigin, Doug Landauer <unknown@longtimeago>.
  • The modern Internet hadn't yet completely taken over, as evidenced by email addresses such as isis!aburt and arnold@emoryu2.arpa.
  • The first Larry Wall entry with email address larry@wall.org was 28 June 1988, though he sometimes continued to use his jpl.nasa.gov address after that too.
  • There are some weird things in the commit notices. For example, it's hard to believe the snippet of Perl code in the following change notice wasn't somehow mangled in the conversion process:
commit d23b30860e3e4c1bd7e12ed5a35d1b90e7fa214c
Author: Larry Wall <lwall@scalpel.netlabs.com>
Date:   Wed Jan 11 11:01:09 1995 -0800

   duplicate DESTROY
  
   In order to fix the duplicate DESTROY bug, I need to remove [the
   modified] lines from sv_setsv.
  
   Basically, copying an object shouldn't produce another object without an
   explicit blessing.  I'm not sure if this will break anything.  If Ilya
   and anyone else so inclined would apply this patch and see if it breaks
   anything related to overloading (or anything else object-oriented), I'd
   be much obliged.
  
   By the way, here's a test script for the duplicate DESTROY.  You'll note
   that it prints DESTROYED twice, once for , and once for .  I don't
   think an object should be considered an object unless viewed through
   a reference.  When accessed directly it should behave as a builtin type.
  
   #!./perl
  
    = new main;
    = '';
  
   sub new {
       my ;
       local /tmp/ssh-vaEzm16429/agent.16429 = bless $a;
       local  = ;      # Bogusly makes  an object.
       /tmp/ssh-vaEzm16429/agent.16429;
   }
  
   sub DESTROY {
       print "DESTROYED\n";
   }
  
   Larry

sv.c |    4 ----
1 files changed, 0 insertions(+), 4 deletions(-)

Yes, it really is that weird. Check it out for yourself.
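
If you want to poke around yourself, something like this should work (a quick sketch; the git:// clone URL is an assumption based on the gitweb address above):

    git clone git://perl5.git.perl.org/perl.git
    cd perl
    git show d23b30860e3e4c1bd7e12ed5a35d1b90e7fa214c   # the "duplicate DESTROY" change quoted above
    git log --reverse | head -50                        # start at Perl 1.0 and read forward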

The Easy Git summary information from eg info has some interesting trivia:

Total commits: 36647
Number of contributors: 926
Number of files: 4439
Number of directories: 657
Biggest file size, in bytes: 4176496 (Changes5.8)
Commits: 31178

And there's a nice new POD document describing how to work with the Perl repository using Git: perlrepository.

In other news, maintenance release Perl 5.8.9 is out, expected to be the last 5.8.x release. The change log shows most bundled modules have been updated.

Finally, use Perl also notes that Booking.com is donating $50,000 to further Perl development, specifically Perl 5.10 development and maintenance. They're also hosting the new Git master repository. Thanks!

Using YSlow to analyze website performance

While attending OSCON '08 I listened to Steve Souders discuss some topics from his O'Reilly book, High Performance Web Sites, and a new book that should drop in early 2009. Steve made the comment that 80%-90% of the performance of a site is in the delivery and rendering of the front end content. Many engineers tend to immediately look at the back end when optimizing and forget about the rendering of the page and how performance there affects the user's experience.

During the talk he demonstrated the Firebug plugin, YSlow, which he built to illustrate 13 of the 14 rules from his book. The tool shows where performance might be an issue and gives suggestions on which resources can be changed to improve performance. Some of the suggestions may not apply to all sites, but they can be used as a guide for the engineer to make an informed decision.

On a related note, Jon Jensen brought to our attention this blog posting about Google planning to incorporate landing page load time into its quality score for AdWords landing pages. With that in mind, front-end website performance will become even more important, and there may be a point one day where load times come into play when determining natural rank in addition to landing page scores.

Sometimes it's a silly hardware problem

I've been using Twinkle and Ekiga for SIP VoIP on Ubuntu 8.10 x86_64. That's been working pretty well.

However, I finally had to take some time to hunt down the source of a very annoying high-pitched noise coming from my laptop's sound system (external speaker and headset both). I have an Asus M50SA laptop with Intel 82801H (ICH8 Family) audio on Realtek ALC883. I first thought perhaps it was the HDMI cable going to an external monitor, or some other RF interference from a cable, but turning things off or unplugging them didn't make any difference.

Then I suspected there was some audio driver problem because the whine only started once the sound driver loaded at boot time. After trying all sorts of variations in the ALSA configuration, changing the options to the snd-hda-intel kernel module, I was at a loss and unplugged my USB keyboard and mouse.

It was the USB mouse! It's a laser-tracked mouse with little shielding on the short cable. Plugging it into either of the USB ports near the front of the computer caused the noise. The keyboard didn't matter.

At first I thought my other USB non-laser ball mouse didn't add any noise, but it did, just a quieter and lower-pitched one.

Then ... I discovered a third USB port near the back of the computer that I hadn't ever noticed. Plugging mice in there doesn't interfere with the audio. Sigh. Maybe this tale will save someone else some trouble.

In the process I also fixed a problem that was in software: the external speakers didn't mute when headphones were plugged in, as others have described as well. One of their solutions worked.

In /etc/modprobe.d/alsa-base add: "options snd-hda-intel model=targa-2ch-dig" and reboot. Or, if you dread rebooting as I do, exit all applications using audio, modprobe -r snd-hda-intel then modprobe snd-hda-intel. Finally, uncheck the "Headphones" checkbox in the sound control panel.
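
In shell terms, the fix looks roughly like this (a sketch of the steps above; the file path and module option are specific to this laptop and Ubuntu release):

    # Add the model option for the Realtek ALC883 described above
    echo 'options snd-hda-intel model=targa-2ch-dig' | sudo tee -a /etc/modprobe.d/alsa-base

    # Either reboot, or reload the driver after exiting all applications that use audio
    sudo modprobe -r snd-hda-intel
    sudo modprobe snd-hda-intel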

TrueCrypt whole-disk encryption for Windows

A few months ago I had a chance to use a new computer with Windows Vista on it. This was actually kind of a fun experience, because Windows 98 was the last version I regularly used myself, though I was at least mildly familiar with Windows 2000 and XP on others' desktops.

Since I've been using encrypted filesystems on Linux since around 2003, I've gotten used to the comfort of knowing a lost or stolen computer would mean only lost hardware, not worries about what may happen with the data on the disk. Linux-Mandrake was the first Linux distribution I recall offering an easy encrypted filesystem option during setup. Now Ubuntu and Fedora have it too.

I wanted to try the same thing on Windows, but found only folder-level encryption was commonly used out of the box. Happily, the open source TrueCrypt software introduced whole-disk system encryption for Windows with version 5. I've now used it with versions 6.0, 6.1, and 6.1a on three machines under Windows Vista and XP, and it really works well, with a few caveats.

The installation is smooth, and system encryption is really easy to set up if you don't have any other operating systems on the machine. It will even encrypt on the fly while you're still using the computer! It's faster if you exit any programs that would use the disk, but it still works under active use. Very impressive.

Some people have reported problems with logical (extended) partitions. Others have workarounds for dual-booting. I tried dual-booting GRUB with Windows Vista as per this blog post and the linked detailed instructions.

That seemed to work well, and Linux booted. Vista also started but then partway through the boot process, after the GUI started up, it noticed something had changed and it died with the largest red "ERROR" message I've ever seen. Microsoft makes impressive error messages!

I battled with dual-booting for a while but eventually gave up, as I was just playing around with it anyway. Sticking with TrueCrypt's recommended Windows-only configuration, everything's worked great. The additional CPU overhead for encryption and decryption is negligible, and becomes increasingly so with multi-core CPUs.

Everyone with a laptop should really be using encrypted filesystems. The peace of mind is well worth the minor initial work and the one extra passphrase to enter at boot time.

As a footnote to my catch-up on the state of Windows, it's really much easier to bear as a Unix and X Window System user with the now wide availability of open source software for Windows. I used 7zip, Audacity, Coolplayer, Cygwin, Firefox, Gimp, Git, Gnucash, Google Chrome, OpenOffice.org, Pidgin, Putty, Strawberry Perl, Vim, VirtualBox, VLC, WinMTR, WinPT (including GnuPG), WinSCP, Wireshark, and Xchat (from Silverex).

Oh, and also helpful was somebody's nice registry edit to remap the Caps Lock key as another Control key, so I don't go crazy. The somewhat abandoned WinPT takes some prodding to get working on a few customer machines I've set it up on, but otherwise all the open source software I tried worked well on Windows. I'm sure there's much more out there too. This UTOSC presentation's slides mention more.

However, it's still no replacement for a fully free system. So despite the brief investigation, I'll be sticking with Linux. :)

Parallel Inventory Access using PostgreSQL

Inventory management has a number of challenges. One of the more vexing issues with which I've dealt is that of forced serial access. We have a product with X items in inventory. We also have multiple concurrent transactions vying for that inventory. Under any normal circumstance, whether the count is a simple scalar or comprises any number of records (up to one record per quantity), the concurrent transactions are all going to home in on the same record, or set of records. In doing so, all transactions must wait and get their inventory serially, even when serial access isn't actually necessary.

If inventory is a scalar value, we don't have much hope of circumventing the problem. And, in fact, we wouldn't want to under that scenario because each transaction must reflect the part of the whole it consumed so that the next transaction knows how much is left to work with.

However, if we have inventory represented with one record = one quantity, we aren't forced to serialize in the same way. If we have multiple concurrent transactions vying for inventory, and the sum of the need is less than that available, why must the transactions wait at all? They would normally line up serially because, no matter what ordering you apply to the selection (short of random), it'll be the same ordering for each transaction (and even an increasing probability of conflict with random as concurrency increases). Thus, to all of them, the same inventory record looks the "most interesting" and, so, each waits for the lock from the transaction before it to resolve before moving on.

What we really want is for those transactions to attack the inventory like an Easter egg hunt. They may all make a dash for the "most interesting" egg first, but only one of them will get it. And, instead of the other transactions standing there, coveting the taken egg, we want them to scurry on unabated and look for the next "most interesting" egg to throw in their baskets.

We can leverage some PostgreSQL features to accomplish this goal. The key for establishing parallel access into the inventory is to use the row lock on the inventory records as an indicator of a "soft lock" on the inventory. That is, we assume any row-locked inventory will ultimately be consumed, but recognize that it might not be. That allows us to pass over locked inventory, looking for other inventory to fill the need; but if we find we don't have enough inventory for our need, those locked records indicate that we should take another pass and try again. Eventually, we either get all the inventory we need, or we have consumed all the inventory there is, meaning less than we asked for but with no locked inventory present.

We write a pl/pgsql function to do all the dirty work for us. The function has the following args:

  • Name of table on which we want to apply parallel access
  • Query that retrieves all pertinent records, and in the desired order
  • Integer number of records we ultimately want locked for this transaction.
The function returns a setof ctid. Using the ctid has the advantage that the function needs to know nothing about the composition of the table, and it provides exceedingly fast access back to the records of interest. Thus, the function can be applied to any table if desired and doesn't depend on properly indexed fields in the case of larger tables.

    CREATE OR REPLACE FUNCTION getlockedrows (
           tname TEXT,
           query TEXT,
           desired INT
       )
    RETURNS SETOF TID
    STRICT
    VOLATILE
    LANGUAGE PLPGSQL
    AS $EOR$
    DECLARE
       total   INT NOT NULL := 0;
       locked  BOOL NOT NULL := FALSE;
       myst    TEXT;
       myrec   RECORD;
       mytid   TEXT;
       found   TID[];
       loops   INT NOT NULL := 1;
    BEGIN
       -- Variables: tablename, full query of interest returning ctids of tablename rows, and # of rows desired.
       RAISE DEBUG 'Desired rows: %', desired;
       <<outermost>>
       LOOP
    /*
       May want a sanity limit here, based on loops:
       IF loops > 10 THEN
           RAISE EXCEPTION 'Giving up. Try again later.';
       END IF;
    */
           BEGIN
               total := 0;
               FOR myrec IN EXECUTE query
               LOOP
                   RAISE DEBUG 'Checking lock on id %',myrec.ctid;
                   mytid := myrec.ctid;
                   myst := 'SELECT 1 FROM '
                       || quote_ident(tname)
                       || ' WHERE ctid = $$'
                       || mytid
                       || '$$ FOR UPDATE NOWAIT';
                   BEGIN
                       EXECUTE myst;
                       -- If it worked:
                       total := total + 1;
                       found[total] := myrec.ctid;
                       -- quit as soon as we have all requested
                       EXIT outermost WHEN total >= desired;
                   -- It did not work
                   EXCEPTION
                       WHEN LOCK_NOT_AVAILABLE THEN
                           -- indicate we have at least one candidate locked
                           locked := TRUE;
                   END;
               END LOOP; -- end each row in the table
               IF NOT locked THEN
                   -- We have as many in found[] as we can get.
                   RAISE DEBUG 'Found % of the requested % rows.',
                       total,
                       desired;
                   EXIT outermost;
               END IF;
               -- We did not find as many rows as we wanted!
               -- But, some are currently locked, so keep trying.
               RAISE DEBUG 'Did not find enough rows!';
               RAISE EXCEPTION 'Roll it back!';
           EXCEPTION
               WHEN RAISE_EXCEPTION THEN
                   PERFORM pg_sleep(RANDOM()*0.1+0.45);
                   locked := FALSE;
                   loops := loops + 1;
           END;
       END LOOP outermost;
       FOR x IN 1 .. total LOOP
           RETURN NEXT found[x];
       END LOOP;
       RETURN;
    END;
    $EOR$
    ;
    

    The function makes a pass through all the records, attempting to row lock each one as it can. If we happen to lock as many as requested, we exit <<outermost>> immediately and start returning ctids. If we pass through all records without hitting any locks, we return the set even though it's less than requested. The calling code can decide how to react if there aren't as many as requested.

    To avoid artificial deadlocks, with each failed pass of <<outermost>> we raise an exception, causing the encompassing block to roll back. That is, with each failed pass, we start over completely instead of holding on to those records we've already locked. Once a run has finished, it's all or nothing.

    We also mix up the sleep times just a bit so any two transactions that happen to be locked into a dance precisely because of their timing will (likely) break the cycle after the first loop.

    Example of using our new function from within a pl/pgsql function:

    ...
       text_query := $EOQ$
    SELECT ctid
    FROM inventory
    WHERE sku = 'COOLSHOES'
       AND status = 'AVAILABLE'
    ORDER BY age, location
    $EOQ$
    ;
    
       OPEN curs_inv FOR
           SELECT inventory_id
           FROM inventory
           WHERE ctid IN (
                   SELECT *
                   FROM getlockedrows(
                       'inventory',
                       text_query,
                       3
                   )
           );
    
       LOOP
    
           FETCH curs_inv INTO int_invid;
    
           EXIT WHEN NOT FOUND;
    
           UPDATE inventory
           SET status = 'SOLD'
           WHERE inventory_id = int_invid;
    
       END LOOP;
    ...
    

    The risk we run with this approach is that our ordering will not be strictly enforced. In the above example, if it's absolutely critical that the sort on age and location never be violated, then we cannot run our access to the inventory in parallel. The risk comes if T1 grabs the first record, T2 only needs one and grabs the second, but T1 aborts for some other reason and never consumes the record it originally locked.

    Why is my function slow?

    I often hear people ask "Why is my function so slow? The query runs fast when I do it from the command line!" The answer lies in the fact that a function's query plans are cached by Postgres, and the plan derived by the function is not always the same as shown by an EXPLAIN from the command line. To illustrate the difference, I downloaded the pagila test database. To show the problem, we'll need a table with a lot of rows, so I used the largest table, rental, which has the following structure:

     pagila# \d rental
                               Table "public.rental"
         Column    |    Type    |                    Modifiers
     --------------+------------+---------------------------------------------------
      rental_id    | integer    | not null default nextval('rental_rental_id_seq')
      rental_date  | timestamp  | not null
      inventory_id | integer    | not null
      customer_id  | smallint   | not null
      return_date  | timestamp  |
      staff_id     | smallint   | not null
      last_update  | timestamp  | not null default now()
     Indexes:
         "rental_pkey" PRIMARY KEY (rental_id)
         "idx_unq_rental" UNIQUE (rental_date, inventory_id, customer_id)
         "idx_fk_inventory_id" (inventory_id)
    

    It only had 16044 rows, however, not quite enough to demonstrate the difference we need. So let's add a few more rows. The unique index means any new rows will have to vary in one of the three columns: rental_date, inventory_id, or customer_id. The easiest to change is the rental date. By changing just that one item and adding the table back into itself, we can quickly and exponentially increase the size of the table like so:

    INSERT INTO rental(rental_date, inventory_id, customer_id, staff_id)
      SELECT rental_date + '1 minute'::interval, inventory_id, customer_id, staff_id
      FROM rental;
    

    I then ran the same query again, but with '2 minutes', '4 minutes', '8 minutes', and finally '16 minutes'. At this point, the table had 513,408 rows, which is enough for this example. I also ran an ANALYZE on the table in question (this should always be the first step when trying to figure out why things are going slower than expected). The next step is to write a simple function that accesses the table by counting how many rentals have occurred since a certain date:

    DROP FUNCTION IF EXISTS count_rentals_since_date(date);
    
    CREATE FUNCTION count_rentals_since_date(date)
    RETURNS BIGINT
    LANGUAGE plpgsql
    AS $body$
      DECLARE
        tcount INTEGER;
      BEGIN
        SELECT INTO tcount
          COUNT(*) FROM rental WHERE rental_date > $1;
      RETURN tcount;
      END;
    $body$;
    

    Simple enough, right? Let's test out a few dates and see how long each one takes:

    pagila# \timing
    
    pagila# select count_rentals_since_date('2005-08-01');
     count_rentals_since_date
    --------------------------
                       187901
    Time: 242.923 ms
    
    pagila# select count_rentals_since_date('2005-09-01');
     count_rentals_since_date
    --------------------------
                         5824
    Time: 224.718 ms
    

    Note: all of the queries in this article were run multiple times first to reduce any caching effects. Those times appear to be about the same, but I know from the distribution of the data that the first query will not hit the index, while the second one should. Thus, when we try to emulate what the function is doing on the command line, the first effort often looks like this:

    pagila# explain analyze select count(*) from rental where rental_date > '2005-08-01';
                         QUERY PLAN
    --------------------------------------------------------------------------------
     Aggregate (actual time=579.543..579.544)
       Seq Scan on rental (actual time=4.462..403.122 rows=187901)
         Filter: (rental_date > '2005-08-01 00:00:00')
     Total runtime: 579.603 ms
    
    pagila# explain analyze select count(*) from rental where rental_date > '2005-09-01';
    
                         QUERY PLAN
    --------------------------------------------------------------------------------
     Aggregate  (actual time=35.133..35.133)
       Bitmap Heap Scan on rental (actual time=1.852..30.451)
         Recheck Cond: (rental_date > '2005-09-01 00:00:00')
         -> Bitmap Index Scan on idx_unq_rental (actual time=1.582..1.582 rows=5824)
             Index Cond: (rental_date > '2005-09-01 00:00:00')
     Total runtime: 35.204 ms
    
    

    Wow, that's a huge difference! The second query is hitting the index and using some bitmap magic to pull back the rows in a blistering time of 35 milliseconds. However, the same date, using the function, takes 224 ms - over six times as slow! What's going on? Obviously, the function is *not* using the index, regardless of which date is passed in. This is because the function cannot know ahead of time what the dates are going to be, but caches a single query plan. In this case, it is caching the 'wrong' plan.

    The correct way to see queries as a function sees them is to use prepared statements. This caches the query plan into memory and simply passes a value to the already prepared plan, just like a function does. The process looks like this:

    pagila# PREPARE foobar(DATE) AS SELECT count(*) FROM rental WHERE rental_date > $1;
    PREPARE
    
    pagila# EXPLAIN ANALYZE EXECUTE foobar('2005-08-01');
                    QUERY PLAN
    --------------------------------------------------------------
     Aggregate  (actual time=535.708..535.709 rows=1)
       ->  Seq Scan on rental (actual time=4.638..364.351 rows=187901)
             Filter: (rental_date > $1)
     Total runtime: 535.781 ms
    
    pagila# EXPLAIN ANALYZE EXECUTE foobar('2005-09-01');
                    QUERY PLAN
    --------------------------------------------------------------
     Aggregate  (actual time=280.374..280.375 rows=1)
       ->  Seq Scan on rental  (actual time=5.936..274.911 rows=5824)
             Filter: (rental_date > $1)
     Total runtime: 280.448 ms
    

    These numbers match the function, so we can now see the reason the function is running as slow as it does: it is sticking to the "Seq Scan" plan. What we want to do is to have it use the index when the given date argument is such that the index would be faster. Functions cannot have more than one cached plan, so what we need to do is dynamically construct the SQL statement every time the function is called. This costs us a small bit of overhead versus having a cached query plan, but in this particular case (and you'll find in nearly all cases), the overhead lost is more than compensated for by the faster final plan. Making a dynamic query in plpgsql is a little more involved than the previous function, but it becomes old hat after you've written a few. Here's the same function, but with a dynamically generated SQL statement inside of it:

    DROP FUNCTION IF EXISTS count_rentals_since_date_dynamic(date);
    
    CREATE FUNCTION count_rentals_since_date_dynamic(date)
    RETURNS BIGINT
    LANGUAGE plpgsql
    AS $body$
      DECLARE
        myst TEXT;
        myrec RECORD;
      BEGIN
        myst = 'SELECT count(*) FROM rental WHERE rental_date > ' || quote_literal($1);
        FOR myrec IN EXECUTE myst LOOP
          RETURN myrec.count;
        END LOOP;
      END;
    $body$;
    

    Note that we use the quote_literal function to take care of any quoting we may need. Also notice that we need to enter into a loop to run the query and then parse the output, but we can simply return right away, as we only care about the output from the first (and only) returned row. Let's see how this new function performs compared to the old one:

    pagila# \timing
    
    pagila# select count_rentals_since_date_dynamic('2005-08-01');
     count_rentals_since_date_dynamic
    ----------------------------------
                               187901
    Time: 255.022 ms
    
    pagila# select count_rentals_since_date('2005-08-01');
     count_rentals_since_date
    --------------------------
                       187901
    Time: 249.724 ms
    
    pagila# select count_rentals_since_date('2005-09-01');
     count_rentals_since_date
    --------------------------
                         5824
    Time: 228.224 ms
    
    pagila# select count_rentals_since_date_dynamic('2005-09-01');
     count_rentals_since_date_dynamic
    ----------------------------------
                                 5824
    Time: 6.618 ms
    

    That's more like it! Problem solved. The function is running much faster now, as it can hit the index. The take-home lessons here are:

    1. Always make sure the tables you are using have been analyzed.
    2. Emulate the queries inside a function by using PREPARE + EXPLAIN EXECUTE, not EXPLAIN.
    3. Use dynamic SQL inside a function to prevent unwanted query plan caching.

    Best practices for cron

    Crontab Best Practice

    Cron is a wonderful tool, and a standard part of every sysadmin's toolkit. Not only does it allow for precise timing of unattended events, but it has a straightforward syntax, and by default emails all output. What follows are some best practices for writing crontabs I've learned over the years. In the following discussion, 'cron' indicates the program itself, 'crontab' indicates the file changed by 'crontab -e', and 'entry' means a single timed action specified inside the crontab file. Cron best practices:

    * Version control

    This rule is number one for a reason. *Always* version control everything you do. It provides an instant backup, accountability, easy rollbacks, and a history. Keeping your crontabs in version control is slightly more work than normal files, but all you have to do is pick a standard place for the file, then export it with crontab -l > crontab.postgres.txt. I prefer RCS for quick little version control jobs like this: no setup required, and everything is in one place. Just run: ci -l crontab.postgres.txt and you are done. The name of the file should be something like the example shown, indicating what it is (a crontab file), which one it is (belongs to the user 'postgres'), and what format it is in (text).
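
    For example, the whole workflow is just a couple of commands (using the postgres user's crontab mentioned above):

     crontab -l > crontab.postgres.txt   # export the live crontab
     ci -l crontab.postgres.txt          # check it into RCS, keeping a working copy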

    You can even run another cronjob that compares the current crontab for each user with the version-controlled copy and mails an alert and/or checks it in automatically when there is a difference.
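
    A hypothetical entry to do that hourly check might look like this (the path and alert address are made up for illustration):

     0 * * * * crontab -l | diff -q /home/postgres/crontab.postgres.txt - || echo 'live crontab differs from the checked-in copy' | mail -s 'crontab drift' admin@example.com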

    * Keep it organized

    The entries in your crontab should be in some sort of order. What order depends on your preferences and on the nature of your entries, but some options might include:

    • Put the most important at the top.
    • Put the ones that run more often at the top.
    • Order by time they run.
    • Order by job groups (e.g. all entries dealing with the mail system).

    I generally like to combine the above entries, such that I'll put the entries that run the most often at the top. If two entries happen at the same frequency (e.g. once an hour), then I'll put the one that occurs first in the day (e.g. 00:00) at the top of the list. If all else is still equal, I order them by priority. Whatever you do, put a note at the top of your crontab explaining the system used in the current file.

    * Always test

    It's very important to test out your final product. Cron entries have a nasty habit of working from the command line, but failing when called by cron, usually due to missing environment variables or path problems. Don't wait for the clock to roll around when adding or changing an entry - test it right away by making it fire 1-2 minutes into the future. Of course, this is only after you have tested it by creating a simple shell script and/or running it from the command line.
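
    For example, if it is currently 14:05, a temporary test entry might look like the following (hypothetical script name; restore the real schedule once it has fired):

     7 14 * * * /usr/local/bin/prune_old_backups.sh --verbose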

    In addition to testing normal behavior, make sure you test all possible failure and success scenarios as well. If you have an entry that deletes all files older than a day in a certain directory, use the touch command to age some files and verify they get deleted. If your command only performs an action when a rare, hard-to-test criterion is met (such as a disk being 99% full), tweak the parameters so it will pass (such as setting the previous example to 5%).

    Once it's all working, set the time to normal and revert any testing tweaks you made. You may want to make the output verbose as a final 'live' test, and then make things quiet once it has run successfully.

    * Use scripts

    Don't be afraid to call external scripts. Anything even slightly complex should not be in the crontab itself, but inside of an external script called by the crontab. Make sure you name the script something very descriptive, such as flush_older_iptables_rules.pl. While a script means another separate dependency to keep track of, it offers many advantages:

    • The script can be run standalone outside of cron.
    • Different crontabs can all share the same script.
    • Concurrency and error handling are much easier.
    • A script can filter output and write cleaner output to log files.

    * Use aliases

    Use aliases (actually environment variables, but it's easier to call them aliases) at the top of your crontab to store any commands, files, directories, or other things that are used throughout your crontab. Anything that is complex or custom to your site/user/server is a good candidate to make an alias of. This has many advantages:

    • The crontab file as a whole is easier to read.
    • Entries are easier to read, and allow you to focus on the "meat" of the entry, not the repeated constants.
    • Similar aliases grouped together allow for easier spotting of errors.
    • Changes only need to be made in one place.
    • It is easier to find and make changes.
    • Entries can be more easily re-used and cut-n-pasted elsewhere.

    Example:

    PSQL_MASTER='/usr/local/bin/psql -X -A -q -t -d master'
     PSQL_SLAVE='/usr/local/bin/psql -X -A -q -t -d slave'
    
    */15 * * * * $PSQL_MASTER -c 'VACUUM pg_listener'
    */5 * * * * $PSQL_SLAVE -c 'VACUUM pg_listener' && $PSQL_SLAVE -c 'VACUUM pg_class'
    
    

    * Forward emails

    In addition to using non-root accounts whenever possible, it is also very important to make sure that someone is actively receiving emails for each account that has cronjobs. Email is the first line of defense for things going wrong with cron, but all too often I'll su into an account and find that it has 6000 messages, all of them from cron indicating that the same problem has been occurring over and over for weeks. Don't let this happen to you - learn about the problem the moment it starts happening by making sure the account is either actively checked, or set up a quick forward to one that is. If you don't want to get all the mail for the account, set up a quick filter - the output of cron is very standard and easy to filter.
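
    With Vixie cron and most derivatives, the simplest approach is a MAILTO line at the top of the crontab (the address here is just an example):

     MAILTO=ops@example.com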

    * Document everything

    Heavily document your crontab file. The top line should indicate how the entries are organized, and perhaps have a line for $Id: cronrox.html,v 1.1 2008/12/08 13:41:31 greg Exp greg $, if your version control system uses that. Every entry should have at a minimum a comment directly above it explaining how often it runs, what it does, and why it is doing it. A lot of this may seem obvious and duplicated information, but it's invaluable. People not familiar with crontab's format may be reading it. People not as familiar as you with the flags to the 'dd' command will appreciate a quick explanation. The goal is to have the crontab in such a state that your CEO (or anyone else on down) can read and understand what each entry is doing.

    * Avoid root

    Whenever possible, use an account other than root for cron entries. Not only is it desirable in general to avoid using root, it should be avoided because:

    • The root user probably already gets lots of email, so important cron output is more likely to be missed.
    • Entries should belong to the account responsible for that service, so Nagios cleanup jobs should be in the Nagios user's crontab. If rights are needed, consider granting specific sudo permissions.
    • Because root is a powerful account, it's easier to break things or cause big problems with a simple typo.

    * Chain things together

    When possible, chain items together using the && operator. Not only is this a good precondition test, but it allows you to control concurrency, and it creates fewer processes than separate entries.

    Consider these two examples:

    ## Example 1:
    30 * * * * $PSQL -c 'VACUUM abc'
    32 * * * * $PSQL -c 'ANALYZE abc'
    32 * * * * $PSQL -c 'VACUUM def'
    
    ## Example 2:
    30 * * * * $PSQL -c 'VACUUM abc' && $PSQL -c 'VACUUM def' && $PSQL -c 'ANALYZE abc'
    

    The first example has many problems. First, it creates three separate cron processes. Second, the ANALYZE on table abc may end up running while the VACUUM is still going on - not a desired behavior. Third, the second VACUUM may start before the previous VACUUM or ANALYZE has finished. Fourth, if the database is down, there are three emailed error reports going out, and three errors in the Postgres logs.

    The second example fixes all of these problems. The second VACUUM and the ANALYZE will not run until the previous actions are completed. Only a single cron process is spawned. If the first VACUUM encounters a problem (such as the database being down), the other two commands are not even attempted.

    The only drawback is that you must make sure not to stick very important items at the end of a chain, where they may not run if a previous command does not complete successfully (or just takes too long to be useful). A better way around this is to put all the complex interactions into a single script, which can allow you to run later actions even if some of the previous ones failed, with whatever logic you want to control it all.

    * Avoid redirects to /dev/null

    Resist strongly the urge to add 2>/dev/null to the end of your entries. The problem with such a redirect is that it is a very crude tool that removes *all* error output, both the expected (what you are probably trying to filter out) and the unexpected (the stuff you probably do not want filtered out). Turning off the error output negates one of the strongest features of cron - emailing of output.

    Rather than using 2>/dev/null or >/dev/null, make the actions quiet by default. Many commands take a -q, --quiet, or --silent option. Use Unix tools to filter out known noise. If all else fails, append the output to a logfile, so you can come back and look at things later when you realize your entry is not working the way you thought it was.
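
    For instance, a hypothetical entry that drops one known, harmless warning but lets anything unexpected reach cron's email:

     45 3 * * * /usr/local/bin/rotate_archives.sh 2>&1 | grep -v 'skipping optional config file'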

    Better yet, call an external script. It's well worth the extra few minutes to whip up a simple script that parses the error output and filters out the known noise. That way, the unknown noise (i.e. actual errors) is mailed out, as it should be.

    * Don't rely on email

    Unfortunately, cron emails all output to you by default - both stdout and stderr. This means that the output tends to be overloaded - both informational messages and errors are sent. It's too easy for the error messages to get lost if you tend to receive many informational cron messages. Even well-intentioned messages tend to cause problems over time, as you grow numb (for example) to the daily message showing you the output of a script that runs at 2 AM. After a while, you stop reading the body of the message, and then you mentally filter them away when you see them - too much mail to read to look that one over. Unfortunately, that's when your script fails and cron sends an error message that is not seen.

    The best solution is to reserve cron emails for when things go wrong. Thus, an email from cron is a rare event and will very likely be noticed and taken care of. If you still need the output from stdout, you can append it to a logfile somewhere. A better way, but more complex, is to call an external script that can send you an email itself, thus allowing control of the subject line.
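
    A hypothetical entry along those lines: informational stdout is appended to a logfile, while anything written to stderr still triggers a cron email:

     15 2 * * * /usr/local/bin/nightly_report.sh >> /var/log/nightly_report.log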

    * Avoid passwords

    Don't put passwords into your crontab. Not only is it a security risk (crontab itself, version control files, ps output), but it decentralizes the information. Use the standard mechanisms when possible. For Postgres connections, this means a .pgpass or pg_service.conf file. For ftp and others, the .netrc file. If all else fails, call a script to perform the action, and have it handle the passwords.

    * Use full paths

    Both for safety and sanity, use the full paths to all commands. This is quite easy to do when using aliases, and allows you to also add standard flags as well (e.g. /usr/bin/psql -q -t -A -X). Of course, you can probably get away with not giving the full path to very standard commands such as 'cp' - few sysadmins are *that* paranoid. :)

    * Conditionally run

    Don't run a command unless you have to. This also prevents errors from popping up. Generally, you only want to do this when you know there is a chance a command will not *need* to run, and you don't care if it doesn't in that case. For example, on a clustered system, test for a directory indicating that the node in question is active. You also want to account for the possibility that the previous cron entry of the same kind is still running. The simplest way to do this is with a custom PID file, perhaps in /var/run.
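
    Two hypothetical entries illustrating both ideas: the first runs only when a marker directory says this node is active, and the second skips the run if the script's own PID file from a previous run is still present:

     */10 * * * * test -d /var/run/cluster/active && /usr/local/bin/sync_inventory.sh
     */5 * * * * test ! -f /var/run/refresh_cache.pid && /usr/local/bin/refresh_cache.sh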

    * Use /etc/cron.* when appropriate

    Consider using the system cron directories for what they were designed for: important system-wide items that run at a regular interval (cron.daily, cron.hourly, cron.monthly, cron.weekly). Personally, I don't use these: for one thing, it's not possible to put them directly into version control.

    * Efficiency rarely matters

    Don't go overboard making your commands efficient and/or clever. Cronjobs run at most once a minute, so it's usually better to be clear and precise, rather than quick and short.

    * When in doubt, run more often

    Don't be afraid to run things more often than is strictly needed. Most of the jobs that crontab ends up doing are simple, inexpensive, and mundane. Yet they are also very important and sorely missed when not run. Rather than running something once a day because it only *needs* to be run once a day, consider running it twice a day. That way, if the job fails for some reason, it still has another chance to meet the minimum once a day criteria. This rule does not apply to all cronjobs, of course.


    Cron future

    Some things I'd like to see cron do someday:

    • Better granularity than a minute.
    • Built in detection of previously running cronjobs.
    • Rescheduling of missed jobs ala fcron (and most other fcron features as well)
    • Better filtering.
    • Cron entries based on real files in the user's home directory.

    Creating a PL/Perl RPM linked against a custom Perl build

    My and everyone else's Planet PostgreSQL blogs have been down for some time now. I recently needed to refer to a post I made on March 7, 2007, showing how to build a PL/Perl RPM linked against a custom Perl build. Of course it wasn't there! The archive.org Wayback Machine came to my rescue.

    A few things have changed since that time, so I've reworked it here, updated for local Perl 5.10.0 built into RPMs:

    We sometimes have to install a custom Perl build without thread support, and with some specific newer and/or older versions of CPAN modules, and we don't want to affect the standard distribution Perl that lives in /usr/bin/perl and /usr/lib/perl5. We use standard PGDG RPMs to install PostgreSQL. We also use PL/Perl, and want PL/Perl to link against our custom Perl build in /usr/local/bin and /usr/local/lib/perl5.

    It's easy to achieve this with a small patch to the source RPM spec file:

    --- postgresql-8.3.spec 2008-10-31 17:34:34.000000000 +0000
    +++ postgresql-8.3.custom.spec  2008-11-30 02:10:09.000000000 +0000
    @@ -315,6 +315,7 @@
     CFLAGS=`echo $CFLAGS|xargs -n 1|grep -v ffast-math|xargs -n 100`
     
     export LIBNAME=%{_lib}
    +export PATH=/usr/local/bin:$PATH
     %configure --disable-rpath \
     %if %beta
        --enable-debug \
    @@ -322,6 +323,7 @@
     %endif
     %if %plperl
        --with-perl \
    +   --with-libraries=/usr/local/lib64/perl5/5.10.0/x86_64-linux/CORE/libperl.so \
     %endif
     %if %plpython
        --with-python \
    

    Since we build RPMs of our local Perl, we want this PL/Perl RPM to depend on them, so we make this additional patch before building:

    --- postgresql-8.3.spec 2008-10-31 17:34:34.000000000 +0000
    +++ postgresql-8.3.custom.spec  2008-11-30 02:10:09.000000000 +0000
    @@ -100,7 +100,7 @@
     Patch6:        postgresql-perl-rpath.patch
     Patch8:        postgresql-prefer-ncurses.patch
     
    -Buildrequires: perl glibc-devel bison flex 
    +Buildrequires: local-perl = 4:5.10.0, glibc-devel, bison, flex
     Requires:  /sbin/ldconfig initscripts
     
     %if %plpython
    @@ -227,7 +227,7 @@
     %package plperl
     Summary:   The Perl procedural language for PostgreSQL
     Group:     Applications/Databases
    -Requires:  postgresql-server = %{version}-%{release}
    +Requires:  postgresql-server = %{version}-%{release}, local-perl = 4:5.10.0
     %ifarch ppc ppc64
     BuildRequires:  perl-devel
     %endif
    

    After applying the patch(es) (adjusted for your own custom Perl build, of course), rebuild the RPM, and install the postgresql-plperl (8.2 or newer) or postgresql-pl (8.1 and earlier) RPM. With a "service postgresql restart", you're ready to go.
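
    The rebuild and install boil down to something like this (a sketch; the spec and package file names and paths will vary with your build environment and architecture):

     rpmbuild -ba postgresql-8.3.custom.spec
     rpm -Uvh RPMS/x86_64/postgresql-plperl-8.3*.rpm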

    For pre-built PostgreSQL 8.3 RPMs that link against unthreaded local Perl 5.10.0, for Red Hat Enterprise Linux 5 x86_64, see the packages.endpoint.com Yum repositories.

    Multiple reverse DNS pointers per IP address

    I recently ran across an IP address that had two PTR (reverse DNS) records in DNS. I've always thought that each IP address is limited to only a single PTR record, and I've seen this rule enforced by many ISPs, but I don't remember ever seeing it conclusively stated.

    I was going to note the problem to the responsible person but thought it'd be good to test my assumption first. Lo and behold, it's not true. The Wikipedia "Reverse DNS lookup" page and a source it cites, an IETF draft on reverse DNS, note that multiple PTR records per IP address have always been allowed.
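
    Checking is a one-liner with dig; a lookup against an address with two PTR records looks something like this (the IP and hostnames are made up):

     $ dig -x 203.0.113.7 +short
     mail.example.com.
     www.example.com.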

    There is apparently plenty of software out there that can't properly deal with more than one PTR record per IP address, and with too many PTR records, a DNS query response will no longer fit inside a single UDP packet, forcing a TCP response instead, which can cause trouble of its own. And as I noted, many ISPs won't allow more than one PTR record, so in those cases it's an academic question.

    But it's not invalid, and I saved myself and someone else a bit of wasted time by doing a quick bit of research. It was a good reminder of the value of checking assumptions.

    OpenSQL Camp 2008

    I attended the OpenSQL Camp last weekend, which ran Friday night to Sunday, November 14-16th. This was the first "unconference" I had been to, and Baron Schwartz did a great job in pulling this all together. I drove down with Bruce Momjian who said that this is the first cross-database conference of any kind since at least the year 2000.

    The conference was slated to start at 6 pm, and Bruce and I arrived at our hotel a few minutes before then. Our hotel was at one end of the Charlottesville Downtown Mall, and the conference was at the other end, so we got a quick walking tour of the mall. Seems like a great place - lots of shops, people walking, temporary booths set out, outdoor seating for the restaurants. It reminded me a lot of Las Ramblas, but without the "human statue" performance artists. Having a hotel within walking distance of a conference is a big plus in my book, and I'll go out of my way to find one.


    The first night was simply mingling with other people and designing the next day's sessions. There was a grid of talk slots on a wall, with large sticky notes stuck to some of them to indicate already-scheduled sessions. Next to the grid were two sections, where people added sticky notes for potential lightning talks, and for potential regular talks. There were probably about 20 of each type of talk by the end of the night. The idea was to put a check next to any talk you were interested in, although I don't think everyone really got the message about that, judging by the number of checks vs. the number of people. At one point, we gathered in a circle and gave a quick 5 word introduction about ourselves. Mine was "Just Another Perl Postgres Hacker." There were probably around 50-60 or so people there, and the vast majority were from Sun/MySQL. A smaller group of people were non-Sun MySQL people, such as Baron and Sheeri. Coming in at a minority of two was Bruce and myself, representing Postgres (although Saturday saw our numbers swell to three, with the addition of Kelly McDonald). However, the smallest minority was the SQLite contingent, consisting solely of Dr. Richard Hipp (whom it was great to meet in person). Needless to say, I met a lot of MySQL people at this conference! All were very friendly and receptive to Bruce and myself, and it did feel mostly like an open source database conference rather than a MySQL one. Seven of the twenty one talks were by non-MySQL people, which means we were technically overrepresented. Or had more interesting talks! ;)

    After heading back to the room and reviewing my notes before bed, I got up the next day and caught the keynote, given by Brian Aker, about the future of open-source databases. Thanks for the Skype/Postgres shout out, Brian! :) A comment by Jim Starkey at the end of the talk led to an interesting discussion on bot nets, the current kings of cloud computing.

    My talk on MVCC was the first talk of the day, which of course means lots of technical difficulties. As usual, my laptop refused to cooperate with the overhead projector. In anticipation of this, I had copied the presentation in PDF format to a USB disk, and ended up using someone else's Mac laptop to give the presentation. (I don't remember whose it was, but thank you!) I've given the talk before, but this was a major rewrite to suit the audience: much less Postgres-specific material, and some details about how other systems implement MVCC, as well as the advantages and disadvantages of both ways. Both Oracle and InnoDB update the actual value on disk, and save changes elsewhere, optimistically assuming that a rollback won't happen. This makes a rollback expensive, as the old diffs must be looked up and applied to the main table. Postgres is pessimistic: rollbacks are not as expensive, as we simply add an entire new row on update, and a rollback simply marks it as no longer valid. Both ways involve some sort of cleaning up of old rows, and handle tradeoffs in different ways. There were some interesting discussions during and after the talk, as Jim Starkey and Ann Harrison weighed in on how other systems (Falcon and Firebird) perform MVCC, and the costs and tradeoffs involved. After the talk, I had some interesting talks with Ann about garbage collection and vacuuming in general.

    The next talk was by Dr. Hipp, entitled "How SQL Database Engines Work", which was fascinating as it gave a glance into the inner workings and philosophy of SQLite, whose underlying assumptions about power usage, memory, transactions, portability, and resource usage are radically different from most other database systems. Again there were some interesting discussions about certain slides from the audience within the talk.

    The competing talk for that time slot was "Libdrizzle" by Eric Day. While I missed this talk, I did get to talk to him the night before about libdrizzle, among other things. Patrick Galbraith and I tried to explain the monstrosity that is XS to Eric (as he and I maintain DBD::mysql and DBD::Pg respectively), and Eric showed us how PHP does something similar.

    My DBIx::Cache talk was sabotaged by Bruce having a better session at the same time, so I attended that instead of giving mine. I'll post the slides for the DBIx::Cache talk on the OpenSQL Camp wiki soon, however. I liked Bruce's talk ("Moving Application Logic Into the Database"), mostly because he was preaching to the choir when talking about putting business logic into the database. There was an interesting discussion about the borrowing of LIMIT and OFFSET from MySQL and putting it into Postgres, and we even helped Richard figure out that he was unknowingly supporting the broken and deprecated Postgres "comma-comma" syntax. Bruce's talk was very polished and interesting. I suspect he may have given talks before. :)

    Lunch was catered in, and I talked to many people while eating lunch, indeed over the conference itself. Apparently MySQL 5.1 is finally going to be released, this time for sure, according to first Giuseppe and then Dups. Post-lunch were the lightning talks, which I normally would not miss, but their overall MySQL-centricness and my interest in another session, entitled "MySQL Unconference" by Sheeri K. Cabral, drew me away. Bruce, Sheeri, Giuseppe Maxia, and myself talked about the details of such a conference. It was a very interesting perspective: MySQL has the problem of a "one company, and no community" perception, while Postgres suffers from an "all community, and no company" perception. Neither perception is accurate, of course, but there are some seeds of truth to both.

    Bruce's second presentation, "Postgres Talks", turned into mostly a wide-ranging discussion between those present (myself, Bruce, Ann, Kelly, Richard, others?) about materialized views, vacuum, building query trees, and other topics.

    I bailed out on my fellow Postgres talk "Postgres Extensions" by Kelly McDonald (sorry Kelly). I had already picked his brain about it earlier, so I felt not too much guilt in attending "Atomic Commit In SQLite" by Dr. Hipp. Again, it's fascinating to see things from the SQLite perspective. Not only technically, but how their development is structured is different as well.

    I was not feeling well, so I ran back to the hotel to drop off my backpack with super-heavy laptop inside, and thus missed my next planned talk, "Unix Command Line Productivity Tips". If anyone went and can pass on some tips in the comments below, please do so! :)

    The final talk I went to was "Join-Fu" by Jay Pipes. I honestly had no idea what this talk would be about, but I actually found it very interesting (and entertaining). Jay is a great speaker, and is not shy about pointing out some of MySQL's weaknesses. The talk was basically a collection of best practices for MySQL, and I actually learned not only things about MySQL I can put to use, but things to apply to Postgres as well. He spent some time on the MySQL query cache as well, which is particularly interesting to me as I'd love to see Postgres get something similar (and until then, people can use DBIx::Cache of course!).

    After the final set of presentations was more mingling, eating of some pizza with funky toppings, and planning for the next day's hackathon. All the proposed ideas were MySQL-specific, as to be expected, but Bruce and I actually got some work done that night by looking over the pg_memcached code, prompted by Brian. I had looked it over a little bit a few months ago, but Bruce and I managed to fix a bug and, more importantly, found other people to continue working on it. Don't forget to take the credit when they finish their work, Bruce! :)

    All in all, a great time. I would have liked to see the presentations stretched out over two days, and to have seen a greater Postgres turnout, but there's always next year. Thanks to Baron for creating a unique event!

    Google Sponsored AFS Hack-A-Thon

    Day One:

    Woke up an hour early, due to having had a bit of confusion as to the start time (the initial email was a bit optimistic as to what time AFS developers wanted to wake up for the conference).

    Met up with Mike Meffie (an AFS Developer of Sine Nomine) and got a shuttle from the hotel to the 'Visitors Lobby'; only to find out that each building has a visitors lobby. One neat thing, Google provides free bikes (beach cruisers) to anyone who needs them. According to the receptionist, any bike that isn't locked down is considered public property at Google. However, it's hard to pedal a bike and hold a briefcase; so off we went hiking several blocks to the correct building. Mike was smart enough to use a backpack, but hiked with me regardless.

    The food was quite good, a reasonably healthy breakfast including fresh fruit (very ripe kiwi, and a good assortment). The coffee was decent as well! After much discussion, it was decided that Mike & I would work towards migrating the community CVS repository over to git. Because git sees the world as 'patch sets' instead of just individual file changes, migrating it from a view of the 'Deltas' makes the most sense. The new git repo. (when complete) should match 1:1 to the Delta history. There was a good amount of teasing as to whether Mike and I could make any measurable progress in 2 days. Derrick was able to provide pre-processed delta patches and the bare CVS repo. (though we spent a good amount of the day just transferring things around and determining what machine should be used for development).

    Lunch (rather tasty sandwiches) and after lunch snacks were provided; Google definitely doesn't skimp on the catering. Made good progress for one day of combined work, we now have a clear strategy for processing the deltas and initial code that is showing strong promise. Much teasing ensued that Mike & I should not be allowed to eat if we did not have the git repo. ready for use. Dinner was a big group affair of food, beer, and Kerberos.

    Day Two:

    After arriving with Mike Meffie via the shuttle, we found out that Tom Keiser (also of Sine Nomine) had been left behind! The shuttle driver was kind enough to go pick up Tom (who ended up at a related, but different hotel than the conference recommended) and bring him for questioning (or development, as the case may be). Determined that the major issue in applying the deltas was simply due to inconsistencies in what the 'base' import should consist of... After several rounds of cleanup, all but a few of the deltas (and those were fixed by hand) applied cleanly!

    On the food side, Google outdid itself with these cornbread 'pizzas' that were extremely good. Once we started having a few branches to play with, things came together quickly... generating much buzz and excitement (at least, for us). We all split off for dinner, with a few of us escorting Tom to his train then getting some Indian food (on a rather busy day, as it was the 'Festival of Lights').

    In Conclusion:

    We were able to get a clean specification with consensus for how we want to produce the public git repository. The specifications are even available on the OpenAFS wiki. The tools (found at '/afs/sinenomine.net/public/openafs/projects/git_work/') to produce this repo. are all in a rough working form, with only the 'merge' tool still needing some development effort. All of these efforts were definitely facilitated by Google providing a comfortable work environment, a solid internet connection and good food to keep us fueled through it all.

    Things to do now:

    • Clean up and document the existing tools
    • Improve the merge process to simplify folding the branches
    • Actually produce the Git repository
    • Validate the consistency of the Git repository against the CVS repository
    • Determine how tags are to be ported over and apply them
    • Publish repo. publicly

    10,000 databases on a PostgreSQL cluster

    One of our clients wanted confirmation that PostgreSQL will have no problem handling 1000 or 2000 databases on a single database cluster. I remember testing some years ago, probably on Postgres 7.2 or 7.3, creating 1000 or so databases and finding that it worked fine. But that was a long time ago, the software has changed, and I thought I should make sure my old experiment results still stand.

    There's a PostgreSQL FAQ question, "What is the maximum size for a row, a table, and a database?" but no mention of the maximum number (or more importantly, the maximum practical number) of databases per cluster. So I threw together a test script to create 10,000 databases, each with a random 1-5 tables of 2 columns each (INTEGER and TEXT), and each table getting a random 1-10 inserts of random data, up to 100 or so characters in the TEXT field.
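
    The original script isn't reproduced here, but a minimal sketch of the same idea might look like this (assuming DBD::Pg, a local cluster, and permission to run CREATE DATABASE; the database and table names are made up):

        #!/usr/bin/perl
        # Create 10,000 small databases, each with a few random tables and rows.
        use strict;
        use warnings;
        use DBI;

        my $admin = DBI->connect('dbi:Pg:dbname=postgres', '', '',
                                 { RaiseError => 1, AutoCommit => 1 });

        for my $n (1 .. 10_000) {
            $admin->do("CREATE DATABASE play$n");

            my $dbh = DBI->connect("dbi:Pg:dbname=play$n", '', '', { RaiseError => 1 });
            for my $t (1 .. 1 + int rand 5) {               # 1-5 tables per database
                $dbh->do("CREATE TABLE t$t (id INTEGER, txt TEXT)");
                my $ins = $dbh->prepare("INSERT INTO t$t VALUES (?, ?)");
                for my $i (1 .. 1 + int rand 10) {          # 1-10 rows per table
                    my $txt = join '', map { chr(97 + int rand 26) } 1 .. int rand 100;
                    $ins->execute($i, $txt);
                }
            }
            $dbh->disconnect;
        }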

    I ran the test on PostgreSQL 8.1, the default that ships with Red Hat Enterprise Linux 5 (x86_64). The hardware was a desktop-class HP with an Intel Core 2 @ 1.86 GHz that wasn't always idle.

    The short answer: Postgres 8.1 handles 10,000 databases just fine. \l in psql generates a long list of databases, of course, but returns quickly enough. Ad-hoc concurrency testing was fine. Running queries, inserts, etc. on a handpicked group of the various play databases worked fine, including while new databases were being created. During the creation process, the last database creates seemed about as fast as the first. The whole run took 2.75 hours.

    This is all hardly a big surprise, but maybe by documenting it I'll save someone the bother of running their own test in the future.

    Addendum: The actual limit on this platform is probably 31995 databases, because each database occupies a subdirectory in data/base/ and the ext3 filesystem has a limit of 31998 sub-directories per one directory, stemming from its limit of 32000 links per inode. The other 5 would be ., .., template0, template1, and postgres. (Thanks, Wikipedia.)

    Varnish, Radiant, etc.

    As my colleague Jon mentioned, the Presidential Youth Debates launched its full debate content this week. And, as Jon also mentioned, the mix of tools involved was fairly interesting:

    Our use of Postgres for this project was not particularly special, and is simply a reflection of our using Postgres by default. So I won't discuss the Postgres usage further (though it pains me to ignore my favorite piece of the software stack).

    Radiant

    Dan Collis-Puro, who has done a fair amount of CMS-focused work throughout his career, was the initial engineer on this project and chose Radiant as the backbone of the site. He organized the content within Radiant, configured the Page Attachments extension for use with Amazon's S3 (Simple Storage Service), and designed the organization of videos and thumbnails for easy administration through the standard Radiant admin interface. Furthermore, prior to the release of the debate videos, Dan built a question submission and moderation facility as a Radiant extension, through which users could submit questions that might ultimately get passed along to the candidates for the debate.

    In the last few days prior to launch, it fell to me to get the new debate materials into production, and we had to reorganize the way we wanted to lay out the campaign videos and associated content. Because the initial implementation relied purely on conventions in how page parts and page attachments are used, the reorganization was straightforward; it was not the sort of thing that required code tweaks and the like, and was managed purely through the CMS. It ended up being quite -- dare I say it? -- an agile solution. (Agility! Baked right in! Because it's opinionated software! Where's my Mac? It just works! Think Same.)

    For managing small, simple, straightforward sites, Radiant has much to recommend it. For instance:

    • the hierarchical management of content/pages is quite effective and intuitive
    • a pretty rich set of extensions (such as page attachments)
    • the "filter" option on content is quite handy (switch between straight text, fckeditor, etc.) and helpful
    • the Radiant tag set for basic templating/logic is easy to use and understand
    • the general resources available for organizing content (pages, layouts, snippets) enable and readily encourage effective reuse of content and/or presentation logic

    That said, there are a number of things for which one quickly longs within Radiant:

    • In-place editing user interface: an administrative mode of viewing the site in which editing tools would show in-place for the different components on a given page. This is not an uncommon approach to content management. The fact that you can view the site in one window/tab and the admin in another mitigates the pain of not having this feature to a healthy extent, but ease of use undoubtedly suffers nevertheless.
    • Radiant offers different publishing "states" for any given page ("draft", "published", "hidden", etc.), and only publicly displays pages in the "published" state in production. This is certainly helpful, but it is no substitute for versioning of resources: there is no way to have a staging version of a given page that is exposed to administrative users only, at the same URL as the published version. To get around this, one needs to make an entirely different page that will replace the published page when you're ready. While it's possible to work around the problem in this manner, it clutters up the set of resources in the CMS admin UI, and doesn't fit well with the hierarchical nature of the system; the staging version of a page can't have the same children as the published version, so any staging involving more than one level of edits is problematic and awkward. That leaves quite a lot to be desired: any engineer who has ever done all development on a production site (no development sites) and then moved to version-controlled systems knows full well that working purely against a live system is extremely painful. Content management is no different.
    • The page attachments extension, while quite handy in general, has configuration information (such as file size limits and the attachment_fu storage backend to use) hard-coded into its PageAttachment model class definition, rather than abstracting it into YAML files. Furthermore, it's all or nothing: you can apparently only use one storage backend, rather than having the flexibility of choosing different storage backends by the content type of the file attached, or choosing manually when uploading the file, etc. The result in our case is that all page attachments go to Amazon S3, even though videos were the only thing we really wanted to have in S3 (bandwidth on our server is not a concern for simple images and the like).

    The in-place editing UI features could presumably be added to Radiant given a reasonable degree of patience. The page attachment criticisms also seem achievable. The versioning, however, is a more fundamental issue. Many CMSes attempt to solve this problem in many different ways, and ultimately things tend to get unpleasant. I tend to think that CMSes would do well to learn from version control systems like Git in their design, and beyond that, to integrate with Git: dump the content down to some intelligent serialized format and integrate with git branching, checkin, checkout, pushing, etc. That dandy, glorious future is not easily realized.

    To be clear: Radiant is a very useful, effective, straightforward tool; I would be remiss not to emphasize that the things it does well are more important than the areas that need improvement. As is the case with most software, it could be better. I'd happily use/recommend it for most content management cases I've encountered.

    Amazon S3

    I knew it was only a matter of time before I got to play with Amazon S3. Having read about it, I felt like I pretty much knew what to expect. And the expectations were largely correct: it's been mostly reliable, fairly straightforward, and its cost-effectiveness will have to be determined over time. A few things did take me by surprise, though:

    • The documentation on certain aspects, particularly the logging, is fairly uninspiring. It could be a lot worse. It could also be a lot better. Given that people pay for this service, I would expect it to be documented extremely well. Of course, given the kind of documentation Microsoft routinely spits out, this expectation clearly lacks any grounding in reality.
    • Given that the storage must be distributed under the hood, making usage information aggregation somewhat complicated, it's disappointing that Amazon doesn't give any interface for capping usage for a given bucket. It's easy to appreciate that Amazon wouldn't want to be on the hook over usage caps when the usage data comes in from multiple geographically-scattered servers, presumably without any guarantee of serialization in time. Nevertheless, it's a totally lame problem. I have reason to believe that Amazon plans to address this soon, for which I can only applaud them.

    So, yeah, Amazon S3 has worked fine and been fine and generally not offended me overmuch.

    Varnish

    The Presidential Youth Debate project had a number of high-profile sponsors potentially capable of generating significant usage spikes. Given the simplicity of the public-facing portion of the site (read-only content, no forms to submit), scaling out with a caching reverse proxy server was a great option. Fortunately, Varnish makes it pretty easy; basic Varnish configuration is simple, and putting it in place took relatively little time.

    Why go with Varnish? It's designed from the ground up to be fast and scalable (check out the architecture notes for an interesting technical read). The time-based caching of resources is a nice approach in this case; we can have the cached representations live for a couple of minutes, which effectively takes the load off of Apache/Rails (we're running Rails with Phusion Passenger) while refreshing frequently enough for little CMS-driven tweaks to percolate up in a timely fashion. Furthermore, it's not a custom caching design, instead relying upon the fundamentals of caching in HTTP itself. Varnish, with its Varnish Configuration Language (VCL), is extremely flexible and configurable, allowing us to easily do things like ignore cookies, normalize domain names (though I ultimately did this in Apache), normalize the annoying Accept-Encoding header values, etc. Furthermore, if the cache interval is too long for a particular change, Varnish gives you a straightforward, expressive way of purging cached representations, which came in handy on a number of occasions close to launch time.

    A number of us at End Point have been interested in Varnish for some time. We've made some core patches: JT Justman tracked down a caching bug when using Edge-Side Includes (ESI), and Charles Curley and JT have done some work to add native gzip/deflate support in Varnish, though that remains to be released upstream. We've also prototyped a system relying on ESI and message-driven cache purging for an up-to-date, high-speed, extremely scalable architecture. (That particular project hasn't gone into production yet due to the degree of effort required to refactor much of the underlying app to fit the design, though it may still come to pass next year -- I hope!)

    Getting Varnish to play nice with Radiant was a non-issue, because the relative simplicity of the site feature set and content did not require specialized handling of any particular resource: one cache interval was good for all pages. Consequently, rather than fretting about having Radiant issue Cache-Control headers on a per-page basis (which might have been fairly unpleasant, though I didn't look into it deeply; eventually I'll need to, having gotten modestly hooked on Radiant and less modestly hooked on Varnish), the setup was refreshingly simple:

    • The public site's domain disallows all access to the Radiant admin, meaning it's effectively a read-only site.
    • The public domain's Apache container always issues a couple of cache-related headers:
      Header always set Cache-Control "public, max-age=120"
      Header always set Vary "Accept-Encoding"
      The Cache-Control header tells clients (Varnish in this case) that it's acceptable to cache representations for 120 seconds, and that all representations are valid for all users ("public"). We can, if we want, use VCL to clean this out of the representation Varnish passes along to clients (i.e. browsers) so that browsers don't cache automatically, instead relying on conditional GET. The Vary header tells caching clients (again, primarily Varnish here) to consider the "Accept-Encoding" header value of a request when keying cached representations. (A quick way to verify this behavior is sketched just after this list.)
    • An entirely separate domain exists that is not fronted by Varnish and allows access to the Radiant admin. We could have it fronted by Varnish with caching deactivated, but the configuration we used keeps things clean and simple.
    • We use some simple VCL to tell Varnish to ignore cookies (in case of Rails sessions on the public site), to normalize the Accept-Encoding header value to one of "gzip" or "deflate" (or none at all) to avoid caching different versions of the same representation due to inconsistent header values submitted by competing browsers.
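
    As a quick sanity check that the caching is behaving as intended, it's handy to fetch the same URL twice and watch the caching headers; on a hit, the Age header should grow, and Varnish normally adds an X-Varnish header as well. A throwaway sketch (the URL is a placeholder):

        #!/usr/bin/perl
        # Fetch a URL twice through the Varnish-fronted domain and print the
        # headers relevant to caching.
        use strict;
        use warnings;
        use LWP::UserAgent;

        my $url = shift || 'http://www.example.com/';
        my $ua  = LWP::UserAgent->new;

        for my $try (1, 2) {
            my $res = $ua->get($url, 'Accept-Encoding' => 'gzip');
            printf "request %d: %s\n", $try, $res->status_line;
            printf "  %s: %s\n", $_, ($res->header($_) || '')
                for qw(Cache-Control Vary Age X-Varnish);
        }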

    Getting all that sorted was, as stated, refreshingly easy. It was a little less easy, surprisingly, to deal with logging. The main Varnish daemon (varnishd) logs to a shared memory block. The logs just sit there (and presumably eventually get overwritten) unless consumed by another process. A varnishlog utility, which can be run as a one-off or as a daemon, reads in the logs and outputs them in various ways. Furthermore, a varnishncsa utility outputs logging information in an Apache/NCSA-inspired "combined log" format (though it includes full URLs in the request string rather than just the path portion, presumably due to the possibility of Varnish fronting many different domains). Neither one of these is particularly complicated, though the varnishlog output is reportedly quite verbose and may need frequent rotation. When run in daemon mode, both will re-open the log file to which they write upon receiving SIGHUP, meaning they'll play nicely with log rotation routines. I found myself repeatedly wishing, however, that they both interfaced with syslog.

    So, I'm very happy with Varnish at this point. Being a jerk, I must nevertheless publicly pick a few nits:

    • Why no syslog support in the logging utilities? Is there a compelling argument against it (I haven't encountered one, but admittedly I haven't looked very hard), or is it simply a case of not having been handled yet?
    • The VCL snippet we used for normalizing the Accept-Encoding header came right off the Varnish FAQ, and seems to be a pretty common case. I wonder if it would make more sense for this to be part of the default VCL configuration requiring explicit deactivation if not desired. It's not a big deal either way, but it seems like the vast majority of deployments are likely to use this strategy.

    That's all I have to whine about, so either I'm insufficiently observant or the software effectively solves the problem it set out to address. These options are not mutually exclusive.

    I'm definitely looking forward to further work with Varnish. This project didn't get into ESI support at all, but the native ESI support, combined with the high-performance caching, seems like a real win, potentially allowing for simplification of resource design in the application server, since documents can be constructed by the edge server (Varnish in this case) from multiple components. That sort of approach to design calls into question many of the standard practices seen in popular (and unpopular) application servers (namely, high-level templating with "pages" fitting into an overall "layout"), but could help engineers maintain component encapsulation and think through more effectively the URL space, resource privacy and scoping considerations (whether or not a resource varies per user, by context, etc.), and so on. But I digress. Shocking.

    Ohio Linux Fest AFS Hackathon

    The one-day Ohio Linux Fest AFS Hackathon flew by in a hurry. Those new to OpenAFS got started converting some commented source code into Doxygen-usable format, both to improve the code documentation and to get a feel for some of the subsystems in OpenAFS. Several of the developers took advantage of the time to analyze some outstanding issues in Demand Attach (DAFS). We also worked on some vldb issues and had several good conversations about AFS roadmaps, Rx OSD, the migration from CVS to Git, and the upcoming Google-sponsored AFS hackathon.

    The Doxygen work gave those new to the OpenAFS code a chance to look under the covers of OpenAFS. Doxygen produces pretty nice output from simple formatting commands, so it's really just a matter of making comments follow some basic rules. Sample Doxygen output (from some previous work) can be seen here, and some of the new Doxygen changes made to OpenAFS are already upstream.

    The Demand Attach work focused on the interprocess communications pieces, namely the FSSYNC & SALVSYNC channels, specifying requirements and outlining the approaches for implementing bi-directional communications so that the failure of one process would not leave a volume in an indeterminate state. Some coding was done to address some specific locking issues, but the design and implementation of better interprocess volume state management is still an open issue.

    The OpenAFS Roadmap discussion revolved around 3 major pieces: CVS to Git conversion, Demand Attach, and Rx OSD. DAFS is in the 1.5.x branch currently, but Rx OSD is not. The general consensus was that DAFS plus some of Rx OSD might be able to go into a stable 1.6 release in Q1 of 2009, which would also let the Windows and Unix stable branches merge back together.

    However, the major goal in the short term is to get the CVS to Git migration done to make development more streamlined. Derrick Brashear, Mike Meffie, and Fabrizio Manfredi are all working on this.

    The 1.6 merge, DAFS, and Rx OSD are all still very much works in progress in terms of getting them into a stable release together. While DAFS and Rx OSD have individually been used in production by some OpenAFS installations, there is a lot more work to be done to get them integrated into a stable OpenAFS release.

    Overall, the hackathon went very well, with some new AFS developers trained, and some progress made on existing projects. Many thanks to the Ohio Linux Fest for their support, and to Mike Meffie specifically for his efforts in coordinating the hackathon.

    Walden University Presidential Youth Debate goes live

    This afternoon was the launch of Walden University's Presidential Youth Debate website, which features 14 questions and video responses from Presidential candidates Barack Obama and John McCain. The video responses are about 44 minutes long overall.

    The site has a fairly simple feature set but is technologically interesting for us. It was developed by Dan Collis-Puro and Ethan Rowe using Radiant, PostgreSQL, CentOS Linux, Ruby on Rails, Phusion Passenger, Apache, Varnish, and Amazon S3.

    Nice work, guys!

    WhereCampPDX Conference Report

    Over the weekend, geo-location hacking geeks converged in Portland, OR for the first WhereCampPDX. Topics of discussion ranged from where to find fruit in urban areas and the creepiness of location-aware dating software, to disaster management and using personal location data in times of crisis. I was part of the planning team, and was happy and proud that we brought together nearly 100 people over the course of three days.

    WhereCampPDX is an unconference -- no sessions were planned in advance and all the participants were charged with responsibility for their own experience. This is an Open Spaces model for managing large group meetings. Photos of all the session topics are available here.

    Many conversations were documented at the drop.io site. Some groups have decided to continue to meet; in particular, the What Now, Portland? group had an intense few hours. One participant thought she was going to spend Sunday morning playing Cruel2BKind out in the Saturday Market pavilion, but ended up so engaged and deep in discussion that she never left her chair. I came away inspired that people were able to talk about their passions and totally lose track of time.

    One topic that came up repeatedly was crisis management and how locative tech might be able to help in the event of a disaster. This is an emerging topic and there's a lot of energy around it - expect a mini-conference this spring on how to bridge the gap between people and the technologies that could help them.

    And people who work hard together should play hard too! Check out this video of the PacManhattan game and other great coverage of the conference on WebMonkey.com.

    PostgreSQL Conference West 2008 Report


    I attended the PostgreSQL Conference West and had a great time again this year. My photos of the event are up here.

    In addition, I shot some footage of the event in an attempt to highlight the benefits of the conference, Postgres itself, and the community strengths. I'm looking for a talented Editor willing to donate time; if none volunteer then I'll probably do it in January. My guess is that there will be several web sites willing to host it for free when it's done.

    The Code Sprint was really interesting. Selena Deckelmann gave everyone a lot of ideas to get the most out of the time available for hacking code. At regular intervals, each team shared the progress they made and received candy as a reward. It was neat to see other people hacking on and committing changes to the Postgres source tree in meatspace.

    Bruce Momjian's Postgres training covered a wide gamut of information about Postgres. He polled everyone in the room for their particular needs, which varied from administration to performance, then tailored the training to cover information relating to those needs in particular detail. Those who attended reported that they learned a great deal of new information from the training. From here, a lot of folks went out to continue interacting with Postgres people, but I headed for home.

    Windowing Functions were covered by David Fetter in a talk that addressed ways to make OLAP easier with new features coming to Postgres 8.4. Functionality that used to be slow and difficult in client-side applications can be handled easily right in the database. I made a note to check this out when 8.4 hits the streets.
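
    As a taste of what that will look like once 8.4 is out, here's a generic window function query (not an example from the talk; the table and columns are made up), wrapped in a bit of DBI:

        #!/usr/bin/perl
        # Rank salespeople within each region by total sales -- the kind of thing
        # that previously meant self-joins or client-side post-processing.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });

        my $rows = $dbh->selectall_arrayref(q{
            SELECT region, salesperson, total_sales,
                   rank() OVER (PARTITION BY region ORDER BY total_sales DESC) AS region_rank
            FROM sales_summary
        });

        printf "%-10s %-15s %12s  rank %d\n", @$_ for @$rows;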

    Jesse Young spoke about using Linux-HA + DRBD to build high availability Postgres clusters. It is working very well for him in over 30 different server installations; he proved this by taking down a production server in the middle of the presentation and demonstrating the rapid transition to the failover server. Just set-it-and-forget-it. I was able to weigh the advantages and disadvantages compared to other clustering options such as shared disk (e.g. GFS) and Postgres-specific replication options (Slony, Postgres Replicator, Bucardo, etc.).

    In his talk, PostgreSQL Optimizer Exposed, Tom Raney delved into a variety of interesting topics. He described the general workings of the optimizer, then showed the various plans that are evaluated for an example query, how each plan was measured for cost, and why the cost varied. He uncovered several interesting facts, such as demonstrating that the Materialization step (pushing sorts that are too large for memory to disk) doesn't increase the cost associated with that plan. Tom Lane explained that this would rarely, if ever, affect real-world results, but it is the kind of information made obvious by the Visual Planner and hidden by textual EXPLAIN ANALYZE. Tom Raney also demonstrated a three-fold cost difference (in one case) between the clustered index and the other plans. Optimizing query performance is one of my favorite pastimes, so I enjoyed this talk a lot.

    I learned a bit about what was going on in Postgres community organizations during Joshua Drake's talk, "PostgreSQL - The happening since EAST". The PostgreSQL.US and other organizations are doing a lot to increase awareness of Postgres in education, government, and business, and among developers. The point was made that we should do as much as we can to reach out to the widely prevalent PHP applications and to web hosting providers.

    Common Table Expressions (CTEs) were given a good explanation by David Fetter in his talk about doing Trees and More in SQL. Having worked on Nested Set and Adjacency List models, I was very interested in this new feature coming to 8.4. Starting with a simple recursive example, David expanded it slide by slide until he had built and executed a brute-force solution to a Traveling Salesman Problem (for a small number of cities in Italy) using only plain SQL. I'm excited to try this out and measure the performance.
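
    For a sense of the basic shape (my own minimal example against a hypothetical adjacency-list "parts" table, not one of David's slides), a recursive CTE to walk a tree looks like this in 8.4:

        #!/usr/bin/perl
        # Walk an adjacency-list tree with WITH RECURSIVE and print it indented.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });

        my $tree = $dbh->selectall_arrayref(q{
            WITH RECURSIVE subparts AS (
                SELECT id, parent_id, name, 1 AS depth
                FROM parts
                WHERE parent_id IS NULL            -- start at the root(s)
              UNION ALL
                SELECT p.id, p.parent_id, p.name, s.depth + 1
                FROM parts p
                JOIN subparts s ON p.parent_id = s.id
            )
            SELECT name, depth FROM subparts ORDER BY depth, name
        });

        print '  ' x ($_->[1] - 1), $_->[0], "\n" for @$tree;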

    Mark Wong & Gabrielle Roth presented the results of testing that they completed. Selena also covered that information in her post about Testing filesystems and RAID. After that we talked Perl on the way to the Portland Paramount for the party.

    On Sunday, I sat in on "Developing a PL (Procedural Language) for PostgreSQL", by Joshua Tolley, as he carefully explained the parts and steps involved. LOLCODE humor peppered the useful information on the related Postgres libraries, Bison usage, and pitfalls.

    I was glad to see Catalyst boosted in Matt Trout's presentation. He very quickly covered the design and advantages of Catalyst, DBIx::Class, and Postgres as they related to the implementation of a high-profile and complex web site. It was very informative to see the class structure for the View model, which gave me several ideas to use in my own development. He demonstrated how a complex 56-way join was coded in very brief and comprehensible Perl code, relying on the underlying modules to do the heavy lifting. The EXPLAIN tree was so large that it couldn't fit on the screen even in a microscopic font, and even with very large data sets, the Postgres optimizer found a way to return the results in one tenth of a second. Matt also demonstrated several flaws in his design, such as how his use of rules to implement updatable views caused multi-row updates to be slower than the equivalent trigger-based system. I use Catalyst for several projects, but I think Interchange still has more advantages. I'm definitely going to take another look at DBIx::Class.
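
    To give a flavor of why DBIx::Class keeps that kind of code so brief (a generic illustration with a made-up schema, nothing to do with Matt's actual application), a multi-table join is just a declared relationship plus a search:

        #!/usr/bin/perl
        # Join orders to customers and line items in one terse resultset; the SQL
        # (including the joins) is generated by DBIx::Class underneath.
        use strict;
        use warnings;
        use MyApp::Schema;   # hypothetical DBIx::Class schema class

        my $schema = MyApp::Schema->connect('dbi:Pg:dbname=myapp', '', '');

        my $orders = $schema->resultset('Order')->search(
            { 'customer.country' => 'US' },
            {
                prefetch => [ 'customer', { items => 'product' } ],  # implies the joins
                order_by => { -desc => 'me.created_at' },
            }
        );

        for my $order ($orders->all) {
            printf "%s ordered %s\n", $order->customer->name,
                join(', ', map { $_->product->name } $order->items);
        }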

    Before lunch, I asked if I could shoot a group photo, so we went to the park. Several people were not in attendance, and I didn't want to take more than a minute or two, so the shots are not as good as I would have liked. Next time I'll ask if we can plan some time for arranging the group. At lunch I had a great time talking to fellow Postgres developers and learning more about their work.

    Lightning Talks followed lunch and included a variety of interesting topics. One of my favorites was "I can't do that" by Matt Trout. He explained how wrong it is to believe you can't contribute something to Postgres or any other open source project. If you think your code will be incomplete or buggy, do it anyway, because it may prompt someone else to work on it, or to scrap yours and do it right. Don't think you can't contribute to documentation because of your unfamiliarity with the system; that's exactly the advantage you have for documentation contributions: those who need the docs are in exactly your shoes.

    Matt also gave the closing talk, "Perl 5 is Alive!", which was a concise, water-tight presentation of Perl 5's superiority over other development environments, citing CPAN and job statistics that demonstrate its growing popularity.

    Some attendees went out afterwards to finish the conference over a drink. I slept about 11 hours straight to recover from the whirlwind of weekend activity. Overall I'm grateful for the opportunity to interact with the community again and I'm excited for what the future has in store for Postgres.

    Spree 0.4.0 Released

    Spree 0.4.0 was officially released today. Spree is a complete open source ecommerce platform written for Ruby on Rails. While Spree technically works "out of the box" as a fully functional store, it is really intended to serve as a strong foundation for a custom commerce solution. Like Rails, Spree is considered to be "opinionated software", and it does not seek to solve 100% of the commerce needs of all possible clients. Developers are able to provide the missing functionality by using the powerful extension system.

    The current release of Spree contains many significant improvements over the previous 0.2.0 release. Some of the highlights include:

    • Rails 2.1 support
    • SEO improvements
    • Security enhancements
    • Public assets for extensions
    • Mailer templates for extensions
    • VAT inclusive pricing
    • Taxonomy

    Most open source projects in the Rails space are maintained by a single individual and tend to be limited in scope. For Spree we seek to create a large and healthy open source community similar to the ones found in more mature languages and frameworks. The Spree project has received contributions from over twenty different developers and has been translated into five additional languages.

    I created Spree in late 2007, and End Point later became an official sponsor of the Spree project and employs me and several other Spree contributors.

    64-bit Windows naming fun

    At OSNews.com the article Windows x64 Watch List describes some of the key differences between 64-bit and 32-bit Windows. It's pretty interesting, and mostly pretty reasonable. But this one caught my eye:

    There are now separate system file sections for both 32-bit and 64-bit code

    Windows x64's architecture keeps all 32-bit system files in a directory named "C:\WINDOWS\SysWOW64", and 64-bit system files are placed in the oddly-named "C:\WINDOWS\system32" directory. For most applications, this doesn't matter, as Windows will re-direct all 32-bit files to use "SysWOW64" automatically to avoid conflicts.

    However, anyone (like us system admins) who depend on VBScripts to accomplish tasks, may have to directly reference "SysWOW64" files if needed, since re-direction doesn't apply as smoothly.

    I've been using 64-bit Linux since 2005 and found there to be a bit of a learning curve, with distributors taking different approaches to supporting 32-bit libraries and applications on a 64-bit operating system.

    The Debian Etch approach is to treat the 64-bit architecture as "normal", for lack of a better word, with 64-bit libraries residing in /lib and /usr/lib as always. It's recommended to run a 32-bit chroot, with important libraries in the ia32-libs package going into /emul/ia32-linux. Ubuntu is similar, but its ia32-libs package puts its files into /usr/lib32.

    The Red Hat approach called "multilib" keeps 32-bit libraries in /lib and /usr/lib with new 64-bit libraries living in /lib64 and /usr/lib64. (I mentioned this a while back while discussing building a custom Perl on 64-bit Red Hat OSes.)

    Each way has its tradeoffs, and causes a bit of trouble. That's just the cost of dealing with multiple architectures in a single running OS, where no such support was previously needed.

    But the Windows way? Putting your 32-bit libraries in C:\WINDOWS\SysWOW64 and your 64-bit libraries in C:\WINDOWS\system32? It hurts to see the names be exactly backwards. That's really tops for confusion.

    Filesystem I/O: what we presented

    As mentioned last week, Gabrielle Roth and I presented results from tests run in the new Postgres Performance Lab. Our slides are available on Slideshare.

    We tested eight core assumptions about filesystem I/O performance and presented the results to a room of filesystem hackers and a few database specialists. Some important things to remember about our tests: we were testing I/O only - no tuning had been done on the hardware, the filesystem defaults, or Postgres - and we did not take reliability into account at all. Tuning the database and filesystem defaults will be done for our next round of tests.

    Filesystems we tested were ext2, ext3 (with or without data journaling), xfs, jfs, and reiserfs.

    Briefly, here are our assumptions, and the results we presented:

    1. RAID5 is the worst choice for a database. Our tests confirmed this, as expected.
    2. LVM incurs too much overhead to use. Our test showed that for sequential or random reads on RAID0, LVM doesn't incur much more overhead than hardware or software RAID.
    3. Software RAID is slower. Same result as LVM for sequential or random reads.
    4. Turning off 'atime' is a big performance gain. We didn't see a big improvement, but you do generally get 2-3% improvement "for free" by turning atime off on a filesystem.
    5. Partition alignment is a big deal. Our tests weren't able to prove this, but we still think it's a big problem. Here's one set of tests demonstrating the problem on Windows-based servers.
    6. Journaling filesystems will have worse performance than non-journaling filesystems. Turn the data journaling off on ext3, and you will see better performance than ext2. We polled the audience, and nearly all thought ext2 would have performed better than ext3. People in the room suggested that the difference was because of seek-bundling that's done in ext3, but not ext2.
    7. Striping doubles performance. Doubling performance is a best-case scenario, and not what we observed. Throughput increased about 35%.
    8. Your read-ahead buffer is big enough. The default read-ahead buffer size is 128K. Our tests, and an independent set of tests by another author, confirm that increasing read-ahead buffers can provide a performance boost of about 75%. We saw improvement leveling out when the buffer is sized at 8MB, with the bulk of the improvement occurring up to 1MB. We plan to test this further in the future. (A sketch of how to adjust this setting follows this list.)
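
    For reference, here is a rough sketch of bumping the read-ahead setting on a block device; the device name is a placeholder, the value is in 512-byte sectors (16384 sectors = 8MB), and you'd want to benchmark before and after on your own workload rather than take our numbers as given:

        #!/usr/bin/perl
        # Show the current read-ahead, raise it to 8MB, and show it again.
        # Requires root; device path is a placeholder.
        use strict;
        use warnings;

        my $dev     = shift || '/dev/sda';
        my $sectors = 16_384;                       # 16384 * 512 bytes = 8MB

        chomp(my $before = qx(blockdev --getra $dev));
        system('blockdev', '--setra', $sectors, $dev) == 0
            or die "blockdev --setra failed for $dev\n";
        chomp(my $after = qx(blockdev --getra $dev));

        print "read-ahead on $dev: $before -> $after sectors\n";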

    All the data from these tests is available on the Postgres Developers wiki.

    Our hope is that someone in the Linux filesystem community takes up these tests and starts to produce them for other hardware, and on a more regular basis. Three people from the talk were interested in running their own tests on our hardware! In the future, we plan to focus our testing mostly on Postgres performance.

    Mark Wong and Gabrielle will be presenting this talk again, with a few new results, at the PostgreSQL Conference West.