Perl 5 now on Git

It's awesome to see that the Perl 5 source code repository has been migrated from Perforce to Git, and is now active at http://perl5.git.perl.org/. Congratulations to all those who worked hard to migrate the entire version control history, all the way back to the beginning with Perl 1.0!

Skimming through the history turns up some fun things:

  • The last Perforce commit appears to have been on 16 December 2008.
  • Perl 5 is still under very active development! (It seems a lot of people are missing this simple fact, so I don't feel bad stating it.)
  • Perl 5.8.0 was released on 18 July 2002, and 5.6.0 on 23 March 2000. Those both seem so recent ...
  • Perl 5.000 was released on 17 October 1994.
  • Perl 4.0.00 was released 21 March 1991, and the last Perl 4 release, 4.0.36, was released on 4 February 1993. For having an active lifespan of only 4 or so years till Perl 5 became popular, Perl 4 code sure kicked around on servers a lot longer than that.
  • Perl 1.0 was announced by Larry Wall on 18 December 1987. He called Perl a "replacement" for awk and sed. That first release included 49 regression tests.
  • Some of the patches are from people whose contact information is long gone, rendered in Git commits as e.g. Dan Faigin, Doug Landauer <unknown@longtimeago>.
  • The modern Internet hadn't yet completely taken over, as evidenced by email addresses such as isis!aburt and arnold@emoryu2.arpa.
  • The first Larry Wall entry with the email address larry@wall.org was on 28 June 1988, though he continued to use his jpl.nasa.gov address sometimes after that too.
  • There are some weird things in the commit notices. For example, it's hard to believe the snippet of Perl code in the following change notice wasn't somehow mangled in the conversion process:
commit d23b30860e3e4c1bd7e12ed5a35d1b90e7fa214c
Author: Larry Wall <lwall@scalpel.netlabs.com>
Date:   Wed Jan 11 11:01:09 1995 -0800

   duplicate DESTROY
  
   In order to fix the duplicate DESTROY bug, I need to remove [the
   modified] lines from sv_setsv.
  
   Basically, copying an object shouldn't produce another object without an
   explicit blessing.  I'm not sure if this will break anything.  If Ilya
   and anyone else so inclined would apply this patch and see if it breaks
   anything related to overloading (or anything else object-oriented), I'd
   be much obliged.
  
   By the way, here's a test script for the duplicate DESTROY.  You'll note
   that it prints DESTROYED twice, once for , and once for .  I don't
   think an object should be considered an object unless viewed through
   a reference.  When accessed directly it should behave as a builtin type.
  
   #!./perl
  
    = new main;
    = '';
  
   sub new {
       my ;
       local /tmp/ssh-vaEzm16429/agent.16429 = bless $a;
       local  = ;      # Bogusly makes  an object.
       /tmp/ssh-vaEzm16429/agent.16429;
   }
  
   sub DESTROY {
       print "DESTROYED\n";
   }
  
   Larry

sv.c |    4 ----
1 files changed, 0 insertions(+), 4 deletions(-)

Yes, it really is that weird. Check it out for yourself.

The Easy Git summary information from eg info has some interesting trivia:

Total commits: 36647
Number of contributors: 926
Number of files: 4439
Number of directories: 657
Biggest file size, in bytes: 4176496 (Changes5.8)
Commits: 31178

And there's a nice new POD document describing how to work with the Perl repository using Git: perlrepository.

In other news, maintenance release Perl 5.8.9 is out, expected to be the last 5.8.x release. The change log shows most bundled modules have been updated.

Finally, use Perl also notes that Booking.com is donating $50,000 to further Perl development, specifically Perl 5.10 development and maintenance. They're also hosting the new Git master repository. Thanks!

Using YSlow to analyze website performance

While attending OSCON '08 I listened to Steve Souders discuss some topics from his O'Reilly book, High Performance Web Sites, and a new book due out in early 2009. Steve made the point that 80%-90% of a site's performance lies in the delivery and rendering of the front-end content. Many engineers tend to look immediately at the back end when optimizing and forget about the rendering of the page and how performance there affects the user's experience.

During the talk he demonstrated the Firebug plugin, YSlow, which he built to illustrate 13 of the 14 rules from his book. The tool shows where performance might be an issue and gives suggestions on which resources can be changed to improve performance. Some of the suggestions may not apply to all sites, but they can be used as a guide for the engineer to make an informed decision.

On a related note, Jon Jensen brought to our attention this blog posting reporting that Google is planning to incorporate landing page load time into its quality score for AdWords landing pages. With that in mind, front-end website performance will become even more important, and one day load times may come into play in determining natural rank as well as landing page scores.

Sometimes it's a silly hardware problem

I've been using Twinkle and Ekiga for SIP VoIP on Ubuntu 8.10 x86_64. That's been working pretty well.

However, I finally had to take some time to hunt down the source of a very annoying high-pitched noise coming from my laptop's sound system (external speaker and headset both). I have an Asus M50SA laptop with Intel 82801H (ICH8 Family) audio on Realtek ALC883. I first thought perhaps it was the HDMI cable going to an external monitor, or some other RF interference from a cable, but turning things off or unplugging them didn't make any difference.

Then I suspected there was some audio driver problem because the whine only started once the sound driver loaded at boot time. After trying all sorts of variations in the ALSA configuration, changing the options to the snd-hda-intel kernel module, I was at a loss and unplugged my USB keyboard and mouse.

It was the USB mouse! It's a laser-tracked mouse with little shielding on the short cable. Plugging it into either of the USB ports near the front of the computer caused the noise. The keyboard didn't matter.

At first I thought my other USB non-laser ball mouse didn't add any noise, but it did, just a quieter and lower-pitch noise.

Then ... I discovered a third USB port near the back of the computer that I hadn't ever noticed. Plugging mice in there doesn't interfere with the audio. Sigh. Maybe this tale will save someone else some trouble.

In the process I also fixed a problem that was in software: the external speakers didn't mute when headphones were plugged in, as others have described as well. One of their solutions worked.

In /etc/modprobe.d/alsa-base, add "options snd-hda-intel model=targa-2ch-dig" and reboot. Or, if you dread rebooting as I do, exit all applications using audio, run modprobe -r snd-hda-intel, then modprobe snd-hda-intel. Finally, uncheck the "Headphones" checkbox in the sound control panel.
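
Roughly, the whole sequence looks like this. It's just a sketch of the steps above, assuming the Ubuntu path mentioned and sudo for root access:

# append the model option to the ALSA configuration (path as on Ubuntu 8.10)
echo 'options snd-hda-intel model=targa-2ch-dig' | sudo tee -a /etc/modprobe.d/alsa-base

# after exiting everything that uses audio, reload the driver instead of rebooting
sudo modprobe -r snd-hda-intel
sudo modprobe snd-hda-intel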

TrueCrypt whole-disk encryption for Windows

A few months ago I had a chance to use a new computer with Windows Vista on it. This was actually kind of a fun experience, because Windows 98 was the last version I regularly used myself, though I was at least mildly familiar with Windows 2000 and XP on others' desktops.

Since I've been using encrypted filesystems on Linux since around 2003, I've gotten used to the comfort of knowing a lost or stolen computer would mean only lost hardware, not worries about what may happen with the data on the disk. Linux-Mandrake was the first Linux distribution I recall offering an easy encrypted filesystem option during setup. Now Ubuntu and Fedora have it too.

I wanted to try the same thing on Windows, but found only folder-level encryption was commonly used out of the box. Happily, the open source TrueCrypt software introduced whole-disk system encryption for Windows with version 5. I've now used it with versions 6.0, 6.1, and 6.1a on three machines under Windows Vista and XP, and it really works well, with a few caveats.

The installation is smooth, and system encryption is really easy to set up if you don't have any other operating systems on the machine. It will even encrypt on the fly while you're still using the computer! It's faster if you exit any programs that would use the disk, but it still works under active use. Very impressive.

Some people have reported problems with logical (extended) partitions. Others have workarounds for dual-booting. I tried dual-booting GRUB with Windows Vista as per this blog post and the linked detailed instructions.

That seemed to work well, and Linux booted. Vista also started but then partway through the boot process, after the GUI started up, it noticed something had changed and it died with the largest red "ERROR" message I've ever seen. Microsoft makes impressive error messages!

I battled with dual-booting for a while but eventually gave up, as I was just playing around with it anyway. Sticking with TrueCrypt's recommended Windows-only configuration, everything's worked great. The additional CPU overhead for encryption and decryption is negligible, and becomes increasingly so with multi-core CPUs.

Everyone with a laptop should really be using encrypted filesystems. The peace of mind is well worth the minor initial work and the one extra passphrase to enter at boot time.

As a footnote to my catch-up on the state of Windows, it's really much easier to bear as a Unix and X Window System user with the now wide availability of open source software for Windows. I used 7zip, Audacity, Coolplayer, Cygwin, Firefox, Gimp, Git, Gnucash, Google Chrome, OpenOffice.org, Pidgin, Putty, Strawberry Perl, Vim, VirtualBox, VLC, WinMTR, WinPT (including GnuPG), WinSCP, Wireshark, and Xchat (from Silverex).

Oh, and also helpful was somebody's nice registry edit to remap the Caps Lock key as another Control key, so I don't go crazy. The somewhat abandoned WinPT takes some prodding to get working on a few customer machines I've set it up on, but otherwise all the open source software I tried worked well on Windows. I'm sure there's much more out there too. This UTOSC presentation's slides mention more.

However, it's still no replacement for a fully free system. So despite the brief investigation, I'll be sticking with Linux. :)

Parallel Inventory Access using PostgreSQL

Inventory management has a number of challenges. One of the more vexing issues I've dealt with is that of forced serial access. We have a product with X items in inventory, and multiple concurrent transactions vying for that inventory. Under any normal circumstance, whether the count is a simple scalar or is comprised of any number of records (up to one record per quantity), the concurrent transactions are all going to home in on the same record, or set of records. In doing so, all transactions must wait and take their inventory serially, even when serializing isn't actually necessary.

If inventory is a scalar value, we don't have much hope of circumventing the problem. And, in fact, we wouldn't want to under that scenario because each transaction must reflect the part of the whole it consumed so that the next transaction knows how much is left to work with.

However, if we have inventory represented with one record = one quantity, we aren't forced to serialize in the same way. If we have multiple concurrent transactions vying for inventory, and the sum of the need is less than that available, why must the transactions wait at all? They would normally line up serially because, no matter what ordering you apply to the selection (short of random), it'll be the same ordering for each transaction (and even an increasing probability of conflict with random as concurrency increases). Thus, to all of them, the same inventory record looks the "most interesting" and, so, each waits for the lock from the transaction before it to resolve before moving on.

What we really want is for those transactions to attack the inventory like an Easter egg hunt. They may all make a dash for the "most interesting" egg first, but only one of them will get it. And, instead of the other transactions standing there coveting the taken egg, we want them to scurry on unabated and look for the next "most interesting" egg to throw in their baskets.

We can leverage some PostgreSQL features to accomplish this goal. The key for establishing parallel access into the inventory is to use the row lock on the inventory records as an indicator of a "soft lock" on the inventory. That is, we assume any row-locked inventory will ultimately be consumed, but recognize that it might not be. That allows us to pass over locked inventory, looking for other inventory to fill the need; but if we find we don't have enough inventory for our need, those locked records indicate that we should take another pass and try again. Eventually, we either get all the inventory we need, or we have consumed all the inventory there is, meaning less than we asked for but with no locked inventory present.

We write a pl/pgsql function to do all the dirty work for us. The function has the following args:

  • Name of the table on which we want to apply parallel access
  • Query that retrieves all pertinent records, in the desired order
  • Integer number of records we ultimately want locked for this transaction

The function returns a setof ctid. Using the ctid has the advantage that the function needs to know nothing about the composition of the table, while providing exceedingly fast access back to the records of interest. Thus, the function can be applied to any table if desired and doesn't depend on properly indexed fields in the case of larger tables.

    CREATE OR REPLACE FUNCTION getlockedrows (
           tname TEXT,
           query TEXT,
           desired INT
       )
    RETURNS SETOF TID
    STRICT
    VOLATILE
    LANGUAGE PLPGSQL
    AS $EOR$
    DECLARE
       total   INT NOT NULL := 0;
       locked  BOOL NOT NULL := FALSE;
       myst    TEXT;
       myrec   RECORD;
       mytid   TEXT;
       found   TID[];
       loops   INT NOT NULL := 1;
    BEGIN
       -- Variables: tablename, full query of interest returning ctids of tablename rows, and # of rows desired.
       RAISE DEBUG 'Desired rows: %', desired;
       <<outermost>>
       LOOP
    /*
       May want a sanity limit here, based on loops:
       IF loops > 10 THEN
           RAISE EXCEPTION 'Giving up. Try again later.';
       END IF;
    */
           BEGIN
               total := 0;
               FOR myrec IN EXECUTE query
               LOOP
                   RAISE DEBUG 'Checking lock on id %',myrec.ctid;
                   mytid := myrec.ctid;
                   myst := 'SELECT 1 FROM '
                       || quote_ident(tname)
                       || ' WHERE ctid = $$'
                       || mytid
                       || '$$ FOR UPDATE NOWAIT';
                   BEGIN
                       EXECUTE myst;
                       -- If it worked:
                       total := total + 1;
                       found[total] := myrec.ctid;
                       -- quit as soon as we have all requested
                       EXIT outermost WHEN total >= desired;
                   -- It did not work
                   EXCEPTION
                       WHEN LOCK_NOT_AVAILABLE THEN
                           -- indicate we have at least one candidate locked
                           locked := TRUE;
                   END;
               END LOOP; -- end each row in the table
               IF NOT locked THEN
                   -- We have as many in found[] as we can get.
                   RAISE DEBUG 'Found % of the requested % rows.',
                       total,
                       desired;
                   EXIT outermost;
               END IF;
               -- We did not find as many rows as we wanted!
               -- But, some are currently locked, so keep trying.
               RAISE DEBUG 'Did not find enough rows!';
               RAISE EXCEPTION 'Roll it back!';
           EXCEPTION
               WHEN RAISE_EXCEPTION THEN
                   PERFORM pg_sleep(RANDOM()*0.1+0.45);
                   locked := FALSE;
                   loops := loops + 1;
           END;
       END LOOP outermost;
       FOR x IN 1 .. total LOOP
           RETURN NEXT found[x];
       END LOOP;
       RETURN;
    END;
    $EOR$
    ;
    

    The function makes a pass through all the records, attempting to row lock each one as it can. If we happen to lock as many as requested, we exit <<outermost>> immediately and start returning ctids. If we pass through all records without hitting any locks, we return the set even though it's less than requested. The calling code can decide how to react if there aren't as many as requested.

    To avoid artificial deadlocks, on each failed pass of <<outermost>> we raise an exception, which rolls back the work of the encompassing block. That is, with each failed pass, we start over completely instead of holding on to the records we've already locked. Once a run has finished, it's all or nothing.

    We also mix up the sleep times just a bit so any two transactions that happen to be locked into a dance precisely because of their timing will (likely) break the cycle after the first loop.

    Example of using our new function from within a pl/pgsql function:

    ...
       text_query := $EOQ$
    SELECT ctid
    FROM inventory
    WHERE sku = 'COOLSHOES'
       AND status = 'AVAILABLE'
    ORDER BY age, location
    $EOQ$
    ;
    
       OPEN curs_inv FOR
           SELECT inventory_id
           FROM inventory
           WHERE ctid IN (
                   SELECT *
                   FROM getlockedrows(
                       'inventory',
                       text_query,
                       3
                   )
           );
    
       LOOP
    
           FETCH curs_inv INTO int_invid;
    
           EXIT WHEN NOT FOUND;
    
           UPDATE inventory
           SET status = 'SOLD'
           WHERE inventory_id = int_invid;
    
       END LOOP;
    ...
    

    The risk we run with this approach is that our ordering will not be strictly enforced. In the above example, if it's absolutely critical that the sort on age and location never be violated, then we cannot run our access to the inventory in parallel. The risk comes if T1 grabs the first record, T2 only needs one and grabs the second, but T1 aborts for some other reason and never consumes the record it originally locked.

    Why is my function slow?

    I often hear people ask "Why is my function so slow? The query runs fast when I do it from the command line!" The answer lies in the fact that a function's query plans are cached by Postgres, and the plan derived by the function is not always the same as shown by an EXPLAIN from the command line. To illustrate the difference, I downloaded the pagila test database. To show the problem, we'll need a table with a lot of rows, so I used the largest table, rental, which has the following structure:

    pagila# \d rental
                                    Table "public.rental"
        Column    |    Type    |                     Modifiers
    --------------+------------+---------------------------------------------------
     rental_id    | integer    | not null default nextval('rental_rental_id_seq')
     rental_date  | timestamp  | not null
     inventory_id | integer    | not null
     customer_id  | smallint   | not null
     return_date  | timestamp  |
     staff_id     | smallint   | not null
     last_update  | timestamp  | not null default now()
    Indexes:
        "rental_pkey" PRIMARY KEY (rental_id)
        "idx_unq_rental" UNIQUE (rental_date, inventory_id, customer_id)
        "idx_fk_inventory_id" (inventory_id)
    

    It only had 16044 rows, however, not quite enough to demonstrate the difference we need. So let's add a few more rows. The unique index means any new rows will have to vary in one of the three columns: rental_date, inventory_id, or customer_id. The easiest to change is the rental date. By changing just that one item and adding the table back into itself, we can quickly and exponentially increase the size of the table like so:

    INSERT INTO rental(rental_date, inventory_id, customer_id, staff_id)
      SELECT rental_date + '1 minute'::interval, inventory_id, customer_id, staff_id
      FROM rental;
    

    I then ran the same query again, but with '2 minutes', '4 minutes', '8 minutes', and finally '16 minutes'. At this point, the table had 513,408 rows, which is enough for this example. I also ran an ANALYZE on the table in question (this should always be the first step when trying to figure out why things are going slower than expected). The next step is to write a simple function that accesses the table by counting how many rentals have occurred since a certain date:

    DROP FUNCTION IF EXISTS count_rentals_since_date(date);
    
    CREATE FUNCTION count_rentals_since_date(date)
    RETURNS BIGINT
    LANGUAGE plpgsql
    AS $body$
      DECLARE
        tcount INTEGER;
      BEGIN
        SELECT INTO tcount
          COUNT(*) FROM rental WHERE rental_date > $1;
      RETURN tcount;
      END;
    $body$;
    

    Simple enough, right? Let's test out a few dates and see how long each one takes:

    pagila# \timing
    
    pagila# select count_rentals_since_date('2005-08-01');
     count_rentals_since_date
    --------------------------
                       187901
    Time: 242.923 ms
    
    pagila# select count_rentals_since_date('2005-09-01');
     count_rentals_since_date
    --------------------------
                         5824
    Time: 224.718 ms
    

    Note: all of the queries in this article were run multiple times first to reduce any caching effects. Those times appear to be about the same, but I know from the distribution of the data that the first query will not hit the index, but the second one should. Thus, when we try and emulate what the function is doing on the command line, the first effort often looks like this:

    pagila# explain analyze select count(*) from rental where rental_date > '2005-08-01';
                         QUERY PLAN
    --------------------------------------------------------------------------------
     Aggregate (actual time=579.543..579.544)
       Seq Scan on rental (actual time=4.462..403.122 rows=187901)
         Filter: (rental_date > '2005-08-01 00:00:00')
     Total runtime: 579.603 ms
    
    pagila# explain analyze select count(*) from rental where rental_date > '2005-09-01';
    
                         QUERY PLAN
    --------------------------------------------------------------------------------
     Aggregate  (actual time=35.133..35.133)
       Bitmap Heap Scan on rental (actual time=1.852..30.451)
         Recheck Cond: (rental_date > '2005-09-01 00:00:00')
         -> Bitmap Index Scan on idx_unq_rental (actual time=1.582..1.582 rows=5824)
             Index Cond: (rental_date > '2005-09-01 00:00:00')
     Total runtime: 35.204 ms
    
    

    Wow, that's a huge difference! The second query is hitting the index and using some bitmap magic to pull back the rows in a blistering time of 35 milliseconds. However, the same date, using the function, takes 224 ms - over six times as slow! What's going on? Obviously, the function is *not* using the index, regardless of which date is passed in. This is because the function cannot know ahead of time what the dates are going to be, but caches a single query plan. In this case, it is caching the 'wrong' plan.

    The correct way to see queries as a function sees them is to use prepared statements. This caches the query plan into memory and simply passes a value to the already prepared plan, just like a function does. The process looks like this:

    pagila# PREPARE foobar(DATE) AS SELECT count(*) FROM rental WHERE rental_date > $1;
    PREPARE
    
    pagila# EXPLAIN ANALYZE EXECUTE foobar('2005-08-01');
                    QUERY PLAN
    --------------------------------------------------------------
     Aggregate  (actual time=535.708..535.709 rows=1)
       ->  Seq Scan on rental (actual time=4.638..364.351 rows=187901)
             Filter: (rental_date > $1)
     Total runtime: 535.781 ms
    
    pagila# EXPLAIN ANALYZE EXECUTE foobar('2005-09-01');
                    QUERY PLAN
    --------------------------------------------------------------
     Aggregate  (actual time=280.374..280.375 rows=1)
       ->  Seq Scan on rental  (actual time=5.936..274.911 rows=5824)
             Filter: (rental_date > $1)
     Total runtime: 280.448 ms
    

    These numbers match the function, so we can now see the reason the function is running as slow as it does: it is sticking to the "Seq Scan" plan. What we want to do is to have it use the index when the given date argument is such that the index would be faster. Functions cannot have more than one cached plan, so what we need to do is dynamically construct the SQL statement every time the function is called. This costs us a small bit of overhead versus having a cached query plan, but in this particular case (and you'll find in nearly all cases), the overhead lost is more than compensated for by the faster final plan. Making a dynamic query in plpgsql is a little more involved than the previous function, but it becomes old hat after you've written a few. Here's the same function, but with a dynamically generated SQL statement inside of it:

    DROP FUNCTION IF EXISTS count_rentals_since_date_dynamic(date);
    
    CREATE FUNCTION count_rentals_since_date_dynamic(date)
    RETURNS BIGINT
    LANGUAGE plpgsql
    AS $body$
      DECLARE
        myst TEXT;
        myrec RECORD;
      BEGIN
        myst = 'SELECT count(*) FROM rental WHERE rental_date > ' || quote_literal($1);
        FOR myrec IN EXECUTE myst LOOP
          RETURN myrec.count;
        END LOOP;
      END;
    $body$;
    

    Note that we use the quote_literal function to take care of any quoting we may need. Also notice that we need to enter into a loop to run the query and then parse the output, but we can simply return right away, as we only care about the output from the first (and only) returned row. Let's see how this new function performs compared to the old one:

    pagila# \timing
    
    pagila# select count_rentals_since_date_dynamic('2005-08-01');
     count_rentals_since_date_dynamic
    ----------------------------------
                               187901
    Time: 255.022 ms
    
    pagila# select count_rentals_since_date('2005-08-01');
     count_rentals_since_date
    --------------------------
                       187901
    Time: 249.724 ms
    
    pagila# select count_rentals_since_date('2005-09-01');
     count_rentals_since_date
    --------------------------
                         5824
    Time: 228.224 ms
    
    pagila# select count_rentals_since_date_dynamic('2005-09-01');
     count_rentals_since_date_dynamic
    ----------------------------------
                                 5824
    Time: 6.618 ms
    

    That's more like it! Problem solved. The function is running much faster now, as it can hit the index. The take-home lessons here are:

    1. Always make sure the tables you are using have been analyzed.
    2. Emulate the queries inside a function by using PREPARE + EXPLAIN EXECUTE, not EXPLAIN.
    3. Use dynamic SQL inside a function to prevent unwanted query plan caching.

    Best practices for cron

    Crontab Best Practice

    Cron is a wonderful tool and a standard part of every sysadmin's toolkit. Not only does it allow for precise timing of unattended events, but it has a straightforward syntax and by default emails all output. What follows are some best practices for writing crontabs I've learned over the years. In the following discussion, 'cron' indicates the program itself, 'crontab' indicates the file changed by 'crontab -e', and 'entry' means a single timed action specified inside the crontab file. Cron best practices:

    * Version control

    This rule is number one for a reason. *Always* version control everything you do. It provides an instant backup, accountability, easy rollbacks, and a history. Keeping your crontabs in version control is slightly more work than normal files, but all you have to do is pick a standard place for the file, then export it with crontab -l > crontab.postgres.txt. I prefer RCS for quick little version control jobs like this: no setup required, and everything is in one place. Just run: ci -l crontab.postgres.txt and you are done. The name of the file should be something like the example shown, indicating what it is (a crontab file), which one it is (belongs to the user 'postgres'), and what format it is in (text).

    You can even run another cronjob that compares the current crontab for each user with the version-controlled copy, and mails an alert and/or checks it in automatically when there is a difference. A sketch of such an entry is shown below.
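
    Something along these lines could do the comparison. This is only a sketch, and the location of the checked-in file is an example; cron mails the diff only when the two copies differ:

    ## Once an hour, mail out a diff if the live crontab has drifted from the checked-in copy (example path)
    0 * * * * crontab -l | diff -u /home/postgres/crontab.postgres.txt -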

    * Keep it organized

    The entries in your crontab should be in some sort of order. What order depends on your preferences and on the nature of your entries, but some options might include:

    • Put the most important at the top.
    • Put the ones that run more often at the top.
    • Order by time they run.
    • Order by job groups (e.g. all entries dealing with the mail system).

    I generally like to combine the above approaches, such that I'll put the entries that run most often at the top. If two entries happen at the same frequency (e.g. once an hour), then I'll put the one that occurs first in the day (e.g. 00:00) at the top of the list. If all else is still equal, I order them by priority. Whatever you do, put a note at the top of your crontab explaining the system used in the current file.

    * Always test

    It's very important to test out your final product. Cron entries have a nasty habit of working from the command line, but failing when called by cron, usually due to missing environment variables or path problems. Don't wait for the clock to roll around when adding or changing an entry - test it right away by making it fire 1-2 minutes into the future. Of course, this is only after you have tested it by creating a simple shell script and/or running it from the command line.
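
    For instance, if it's currently 11:58, a temporary entry like this sketch (the script name is hypothetical) fires two minutes out; once it behaves, restore the real schedule:

    ## Temporary test run at 12:00 -- revert to the real schedule afterwards (example script)
    0 12 * * * /usr/local/bin/prune_old_backups.sh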

    In addition to testing normal behavior, make sure you test all possible failure and success scenarios as well. If you have an entry that deletes all files older than a day in a certain directory, use the touch command to age some files and verify they get deleted. If your command only performs an action when a rare, hard-to-test criterion is met (such as a disk being 99% full), tweak the parameters so it will pass (for example, setting the threshold in the previous example to 5%).

    Once it's all working, set the time to normal and revert any testing tweaks you made. You may want to make the output verbose as a final 'live' test, and then make things quiet once it has run successfully.

    * Use scripts

    Don't be afraid to call external scripts. Anything even slightly complex should not be in the crontab itself, but inside of an external script called by the crontab. Make sure you name the script something very descriptive, such as flush_older_iptables_rules.pl. While a script means another separate dependency to keep track of, it offers many advantages:

    • The script can be run standalone outside of cron.
    • Different crontabs can all share the same script.
    • Concurrency and error handling are much easier.
    • A script can filter output and write cleaner output to log files.

    * Use aliases

    Use aliases (actually environment variables, but it's easier to call them aliases) at the top of your cron script to store any commands, files, directories, or other things that are used throughout your crontab. Anything that is complex or custom to your site/user/server is a good candidate to make an alias of. This has many advantages:

    • The crontab file as a whole is easier to read.
    • Entries are easier to read, and allow you to focus on the "meat" of the entry, not the repeated constants.
    • Similar aliases grouped together allow for easier spotting of errors.
    • Changes only need to be made in one place.
    • It is easier to find and make changes.
    • Entries can be more easily re-used and cut-n-pasted elsewhere.

    Example:

    PSQL_MASTER='/usr/local/bin/psql -X -A -q -t -d master'
    PSQL_SLAVE='/usr/local/bin/psql -X -A -q -t -d slave'
    
    */15 * * * * $PSQL_MASTER -c 'VACUUM pg_listener'
    */5 * * * * $PSQL_SLAVE -c 'VACUUM pg_listener' && $PSQL_SLAVE -c 'VACUUM pg_class'
    
    

    * Forward emails

    In addition to using non-root accounts whenever possible, it is also very important to make sure that someone is actively receiving emails for each account that has cronjobs. Email is the first line of defense for things going wrong with cron, but all too often I'll su into an account and find that it has 6000 messages, all of them from cron, indicating that the same problem has been occurring over and over for weeks. Don't let this happen to you - learn about a problem the moment it starts happening by making sure the account is either actively checked, or set up a quick forward to one that is. If you don't want to get all the mail for the account, set up a quick filter - the output of cron is very standard and easy to filter.
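
    One simple approach, assuming a cron that honors the MAILTO variable (Vixie cron does) and a hypothetical alerts address, is to set it at the top of the crontab:

    ## Send all cron output for this crontab to a monitored address (example address)
    MAILTO=alerts@example.com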

    * Document everything

    Heavily document your crontab file. The top line should indicate how the entries are organized, and perhaps have a line for $Id: cronrox.html,v 1.1 2008/12/08 13:41:31 greg Exp greg $, if your version control system uses that. Every entry should have at a minimum a comment directly above it explaining how often it runs, what it does, and why it is doing it. A lot of this may seem obvious and duplicated information, but it's invaluable. People not familiar with crontab's format may be reading it. People not as familiar as you with the flags to the 'dd' command will appreciate a quick explanation. The goal is to have the crontab in such a state that your CEO (or anyone else on down) can read and understand what each entry is doing.

    * Avoid root

    Whenever possible, use some account other than root for cron entries. Not only is it desirable in general to avoid using root, it should be avoided because:

    • The root user probably already gets lots of email, so important cron output is more likely to be missed.
    • Entries should belong to the account responsible for that service, so Nagios cleanup jobs should be in the Nagios user's crontab. If rights are needed, consider granting specific sudo permissions.
    • Because root is a powerful account, it's easier to break things or cause big problems with a simple typo.

    * Chain things together

    When possible, chain items together using the && operator. Not only is this a good precondition test, but it allows you to control concurrency and creates fewer processes than separate entries.

    Consider these two examples:

    ## Example 1:
    30 * * * * $PSQL -c 'VACUUM abc'
    32 * * * * $PSQL -c 'ANALYZE abc'
    32 * * * * $PSQL -c 'VACUUM def'
    
    ## Example 2:
    30 * * * * $PSQL -c 'VACUUM abc' && $PSQL -c 'VACUUM def' && $PSQL -c 'ANALYZE abc'
    

    The first example has many problems. First, it creates three separate cron processes. Second, the ANALYZE on table abc may end up running while the VACUUM is still going on - not a desired behavior. Third, the second VACUUM may start before the previous VACUUM or ANALYZE has finished. Fourth, if the database is down, there are three emailed error reports going out, and three errors in the Postgres logs.

    The second example fixes all of these problems. The second VACUUM and the ANALYZE will not run until the previous actions are completed. Only a single cron process is spawned. If the first VACUUM encounters a problem (such as the database being down), the other two commands are not even attempted.

    The only drawback is that you must make sure not to stick very important items at the end of a chain, where they may not run if a previous command does not complete successfully (or just takes too long to be useful). A better way around this is to put all the complex interactions into a single script, which can allow you to run later actions even if some of the previous ones failed, with whatever logic you want to control it all.

    * Avoid redirects to /dev/null

    Resist strongly the urge to add 2>/dev/null to the end of your entries. The problem with such a redirect is that it is a very crude tool that removes *all* error output, both the expected (what you are probably trying to filter out) and the unexpected (the stuff you probably do not want filtered out). Turning off the error output negates one of the strongest features of cron - emailing of output.

    Rather than using 2>/dev/null or >/dev/null, make the actions quiet by default. Many commands take a -q, --quiet, or --silent option. Use Unix tools to filter out known noise. If all else fails, append the output to a logfile, so you can come back and look at things later when you realize your entry is not working the way you thought it was.

    If all else fails, call an external script. It's well worth the extra few minutes to whip up a simple script that parses the error output and filters out the known noise. That way, the unknown noise (i.e. real errors) is mailed out, as it should be.
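
    A wrapper along these lines is usually enough. This is only a sketch, and both the job script and the noise pattern are hypothetical:

    #!/bin/sh
    # run the real job (example script); drop the one known harmless warning (example pattern),
    # and let anything else pass through so cron emails it
    /usr/local/bin/nightly_sync.sh 2>&1 | grep -v 'known harmless warning'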

    * Don't rely on email

    Unfortunately, cron emails all output to you by default - both stdout and stderr. This means that the output tends to be overloaded - both informational messages and errors are sent. It's too easy for the error messages to get lost if you tend to receive many informational cron messages. Even well-intentioned messages tend to cause problems over time, as you grow numb (for example) to the daily message showing you the output of a script that runs at 2 AM. After a while, you stop reading the body of the message, and then you mentally filter them away when you see them - too much mail to read to look that one over. Unfortunately, that's when your script fails and cron sends an error message that is not seen.

    The best solution is to reserve cron emails for when things go wrong. Thus, an email from cron is a rare event and will very likely be noticed and taken care of. If you still need the output from stdout, you can append it to a logfile somewhere. A better way, but more complex, is to call an external script that can send you an email itself, thus allowing control of the subject line.

    * Avoid passwords

    Don't put passwords into your crontab. Not only is it a security risk (crontab itself, version control files, ps output), but it decentralizes the information. Use the standard mechanisms when possible. For Postgres connections, this means a .pgpass or pg_service.conf file. For ftp and others, the .netrc file. If all else fails, call a script to perform the action, and have it handle the passwords.
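
    For reference, a ~/.pgpass file contains one line per connection in the format hostname:port:database:username:password, and libpq ignores it unless its permissions are 0600 (or stricter). The values below are made up:

    # hostname:port:database:username:password  (example values)
    dbserver.example.com:5432:master:postgres:s3kr1t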

    * Use full paths

    Both for safety and sanity, use the full paths to all commands. This is quite easy to do when using aliases, and it also lets you add standard flags (e.g. /usr/bin/psql -q -t -A -X). Of course, you can probably get away with not giving the full path to very standard commands such as 'cp' - few sysadmins are *that* paranoid. :)

    * Conditionally run

    Don't run a command unless you have to. This also prevents errors from popping up. Generally, you only want to do this when you know there is a chance a command will not *need* to run, and you don't care if it doesn't in that case. For example, on a clustered system, test for a directory indicating that the node in question is active. You also want to account for the possibility that the previous cron entry of the same kind is still running. The simplest way to do this is with a custom PID file, perhaps in /var/run.
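
    A crude but effective version of the PID file check might look like the following. It's only a sketch, and the paths, the active-node marker directory, and the job script are all hypothetical:

    #!/bin/sh
    # skip quietly unless this node is the active one in the cluster (example marker directory)
    [ -d /var/run/cluster_active ] || exit 0

    # skip if the previous run of this job is still going (example PID file)
    PIDFILE=/var/run/nightly_sync.pid
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        exit 0
    fi

    echo $$ > "$PIDFILE"
    trap 'rm -f "$PIDFILE"' EXIT
    /usr/local/bin/nightly_sync.sh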

    * Use /etc/cron.* when appropriate

    Consider using the system cron directories for what they were designed for: important system-wide items that run at a regular interval (cron.daily, cron.hourly, cron.monthly, cron.weekly). Personally, I don't use these: for one thing, it's not possible to put them directly into version control.

    * Efficiency rarely matters

    Don't go overboard making your commands efficient and/or clever. Cronjobs run at most once a minute, so it's usually better to be clear and precise rather than quick and short.

    * When in doubt, run more often

    Don't be afraid to run things more often than is strictly needed. Most of the jobs that crontab ends up doing are simple, inexpensive, and mundane, yet they are also very important and sorely missed when not run. Rather than running something once a day because it only *needs* to be run once a day, consider running it twice a day. That way, if one run fails for some reason, the job still has another chance to meet the minimum once-a-day requirement. This rule does not apply to all cronjobs, of course.


    Cron future

    Some things I'd like to see cron do someday:

    • Better granularity than a minute.
    • Built in detection of previously running cronjobs.
    • Rescheduling of missed jobs à la fcron (and most other fcron features as well).
    • Better filtering.
    • Cron entries based on real files in the user's home directory.