File test comparison table for shell, Perl, Ruby, and Python

A few days ago, my co-worker Richard asked how in Python you would do the -x Bourne shell and Perl file test that checks whether a file is executable. This is (for me, at least) a really commonly used function but one I hadn't needed to do yet in Python, so I looked it up.
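
For example, a minimal Python sketch of that check (note that os.access() tests against the real uid/gid, whereas Perl's -x uses the effective uid):

import os

print(os.access('/usr/bin/perl', os.X_OK))   # True if the file exists and is executable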

That wasn't so hard to find, but then I wondered about the other shell and Perl file tests that I use all the time. Finding equivalents for those was harder than I expected. A web search didn't turn much up aside from language holy wars and limited answers, but I didn't find any exhaustive list.

So I made my own. Below is a table comparing file test operators in the Bourne shell-compatible shells bash, ksh, and zsh; Perl's expanded set; Ruby's tests, which were originally derived from Perl's; and equivalent Python code.

There are still some blanks where I didn't find a good equivalent. Of course I'm sure it's possible with enough custom logic to achieve the same end, but I have tried to stick with relatively simple formulations using built-in functions for now. I'll be happy to fill in the blanks if any readers make suggestions.

Performance notes on avoiding multiple stats of the same file:

  • Starting with Perl 5.9.1, file tests can be "stacked" and will use a single stat for all tests, e.g. -f -x file. In older versions of Perl you can do -f file && -x _ instead.
  • Ruby's File::Stat class can be used to cache a stat for multiple tests.
  • Python's os.stat(file).st_mode can be stored and used for multiple tests.
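
For example, a minimal Python sketch of that last point, reusing one stat() call for several tests (the path is just an example):

import os, stat

st_mode = os.stat('/etc/passwd').st_mode      # one stat call...
is_regular = stat.S_ISREG(st_mode)            # ...reused for several tests
is_executable = bool(st_mode & stat.S_IXUSR)
is_setuid = bool(st_mode & stat.S_ISUID)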

Unless otherwise specified, these tests follow symbolic links and operate on the target of the link, rather than the link itself.

All tests return boolean true or false unless otherwise noted.

Test | bash/ksh/zsh | Perl | Ruby | Python
---- | ------------ | ---- | ---- | ------
File is readable by effective uid/gid | | -r 'file' | test ?r, 'file'; File.readable?('file') | 
File is writable by effective uid/gid | | -w 'file' | test ?w, 'file'; File.writable?('file') | 
File is executable by effective uid/gid | | -x 'file' | test ?x, 'file'; File.executable?('file') | 
File is owned by effective uid | -O file | -o 'file' | test ?o, 'file'; File.owned?('file') | os.stat('file').st_uid == os.geteuid()
File is owned by the effective gid | -G file | (stat('file'))[5] == $) | test ?G, 'file'; File.grpowned?('file') | os.stat('file').st_gid == os.getegid()
File is readable by real uid/gid | -r file | -R 'file' | test ?R, 'file'; File.readable_real?('file') | os.access('file', os.R_OK)
File is writable by real uid/gid | -w file | -W 'file' | test ?W, 'file'; File.writable_real?('file') | os.access('file', os.W_OK)
File is executable by real uid/gid | -x file | -X 'file' | test ?X, 'file'; File.executable_real?('file') | os.access('file', os.X_OK)
File is owned by real uid | | -O 'file' | test ?O, 'file' | os.stat('file').st_uid == os.getuid()
File exists | -e file; -a file | -e 'file' | test ?e, 'file'; File.exist?('file') | os.path.exists('file')
File has zero size (is empty) | -f file -a ! -s file | -z 'file' | test ?z, 'file'; File.zero?('file') | os.path.getsize('file') == 0; os.stat('file').st_size == 0
File exists and has size greater than zero | -s file | -s 'file' (boolean, and returns size in bytes) | test ?s, 'file' (returns nil if the file doesn't exist or has zero size, otherwise the size); File.size?('file') (same) | os.path.getsize('file') > 0; os.stat('file').st_size > 0
File exists, return size in bytes | | -s 'file' | File.size('file') | os.path.getsize('file'); os.stat('file').st_size
File is a plain file | -f file | -f 'file' | test ?f, 'file'; File.file?('file') | os.path.isfile('file'); stat.S_ISREG(os.stat('file').st_mode)
File is a directory | -d file | -d 'file' | test ?d, 'file'; File.directory?('file') | os.path.isdir('file'); stat.S_ISDIR(os.stat('file').st_mode)
File is a symbolic link | -h file; -L file | -l 'file' | test ?l, 'file'; File.symlink?('file') | os.path.islink('file'); stat.S_ISLNK(os.lstat('file').st_mode)
File is a named pipe (FIFO) | -p file | -p 'file' (can also be used on a filehandle) | test ?p, 'file'; File.pipe?('file') | stat.S_ISFIFO(os.stat('file').st_mode)
File is a socket | -S file | -S 'file' | test ?S, 'file'; File.socket?('file') | stat.S_ISSOCK(os.stat('file').st_mode)
File is a block special file | -b file | -b 'file' | test ?b, 'file'; File.blockdev?('file') | stat.S_ISBLK(os.stat('file').st_mode)
File is a character special file | -c file | -c 'file' | test ?c, 'file'; File.chardev?('file') | stat.S_ISCHR(os.stat('file').st_mode)
File type (returns string 'file', 'directory', 'characterSpecial', 'blockSpecial', 'fifo', 'link', 'socket', or 'unknown') | | | File.ftype('file') | 
Filehandle or descriptor is opened to a tty | -t fd | -t $fh | fd.isatty; fd.tty? | os.isatty(fd)
File has setuid bit set | -u file | -u 'file' | test ?u, 'file'; File.setuid?('file') | os.stat('file').st_mode & stat.S_ISUID
File has setgid bit set | -g file | -g 'file' | test ?g, 'file'; File.setgid?('file') | os.stat('file').st_mode & stat.S_ISGID
File has sticky bit set | -k file | -k 'file' | test ?k, 'file'; File.sticky?('file') | os.stat('file').st_mode & stat.S_ISVTX
File is an ASCII text file (heuristic guess) | | -T 'file' | | 
File is a "binary" file (opposite of -T) | | -B 'file' | | 
File modification time | | (stat('file'))[9]; -M 'file' (script start time minus file modification time, in days) | test ?M, 'file' (returns Time object); File.mtime('file') (same) | os.stat('file').st_mtime
File access time | | (stat('file'))[8]; -A 'file' (script start time minus file access time, in days) | test ?A, 'file' (returns Time object); File.atime('file') (same) | os.stat('file').st_atime
Inode change time (Unix) | | (stat('file'))[10]; -C 'file' (script start time minus inode change time, in days) | test ?C, 'file' (returns Time object); File.ctime('file') (same) | os.stat('file').st_ctime
File has been modified since it was last read | -N file | | | 
file1 is newer (according to modification date) than file2, or file1 exists and file2 does not | file1 -nt file2 | (stat('file1'))[9] > (stat('file2'))[9] | test ?>, 'file1', 'file2' | os.path.exists('file1') and (not os.path.exists('file2') or os.stat('file1').st_mtime > os.stat('file2').st_mtime)
file1 is older than file2, or file2 exists and file1 does not | file1 -ot file2 | (stat('file1'))[9] < (stat('file2'))[9] | test ?<, 'file1', 'file2' | os.path.exists('file2') and (not os.path.exists('file1') or os.stat('file1').st_mtime < os.stat('file2').st_mtime)
file1 and file2 refer to the same device and inode numbers | file1 -ef file2 | join(':', (stat('file1'))[0,1]) eq join(':', (stat('file2'))[0,1]) | test ?-, 'file1', 'file2' | os.path.samefile('file1', 'file2')
file1 and file2 have the same modification times | | (stat('file1'))[9] == (stat('file2'))[9] | test ?=, 'file1', 'file2' | os.stat('file1').st_mtime == os.stat('file2').st_mtime

Complete details are in the manuals for each language.

Interchange news

Tomorrow we'll be having an Interchange community meeting on IRC. All Interchange users and any other interested parties are invited to participate.

Also, just recently, End Point's own David Christensen joined the Interchange Development Group and became a core committer. Congratulations, David, and keep up the good work!

Perl's Scalar::Util::dualvar

I just came across this fun Perl function that I can't think of a (good) use for, but have to share.

In the Scalar::Util module is the function dualvar:

dualvar NUM, STRING

Returns a scalar that has the value NUM in a numeric context and the value STRING in a string context.

    $foo = dualvar 10, "Hello";
    $num = $foo + 2;                    # 12
    $str = $foo . " world";             # Hello world

Using that in the right place could lead a future programmer down some fun debugging paths!
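
For what it's worth, Perl itself uses the same trick: the special $! variable acts as a dualvar, yielding the errno number in numeric context and the error message in string context. A quick sketch imitating that (the number/message pair is just an example):

use Scalar::Util qw(dualvar);

my $err = dualvar 2, "No such file or directory";
printf "errno=%d, message=%s\n", $err, $err;
# errno=2, message=No such file or directory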

Defining variables for rpmbuild

RPM spec files offer a way to define and test build variables with a directive like this:

%define <variable> <value>

Sometimes it's useful to override such variables temporarily for a single build, without modifying the spec file, which would make the changed variable appear in the output source RPM. For some reason, how to do this has been hard for me to find in the docs and hard for me to remember, despite its simplicity.

Here's how. For example, to override the standard _prefix variable with value /usr/local:

rpmbuild -ba SPECS/$package.spec --define '_prefix /usr/local'
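
To double-check what a macro will expand to, rpm accepts --define and --eval together, and multiple --define options can be passed to a single rpmbuild run (a sketch; 'myvariable' is just a placeholder):

rpm --define '_prefix /usr/local' --eval '%{_prefix}'      # prints /usr/local
rpmbuild -ba SPECS/$package.spec --define '_prefix /usr/local' --define 'myvariable myvalue'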

Two quick tips: egrep & SQL dumps, VIM and deleting things that don't match

Sometimes, I just don't want to restore a full SQL dump. The restore might take too long, and maybe I just want a small subset of the records anyway.

I was in exactly this situation the other day - faced with a 10+ hour restore process, it was way faster to grep out the records and then push them into the production databases, than to restore five different versions.

So! egrep and vim to the rescue!

In my case, the SQL dump was full of COPY commands, and I had a username that was used as a partial-key on all the tables I was interested in. So:

egrep "((^COPY)|username)" PostgresDump.sql > username.out

I get a pretty nice result from this. But there are some records I'm not so interested in that got mixed in, so I opened the output file in vim and turned line numbers on (:set number).

The first thing that I do is insert the '\.' needed to tell Postgres that we're at the end of a COPY statement.

:2,$s/^COPY/\\\.^V^MCOPY/

The '^V^M' is typed as Ctrl-V Ctrl-M and inserts a literal '^M' (a carriage return, which vim treats as a line break in the replacement text). The '2' starts the substitution command on the second line rather than the first, since the first COPY statement (which, in my case, was on the first line) doesn't need a '\.' before it.

Next, I want to strip out any records that the egrep found that I really don't want to insert into the database:

:.,2000g!/stuff_i_wanna_keep/d

Broken down:

  • '.,2000' - start from the current line and apply the command through line 2000
  • 'g!' - find lines that do not match the following regular expression
  • '/stuff_i_wanna_keep/' - the regular expression
  • 'd' - delete what you find

I also use the ':split' command to divide my vim screen. This lets me look at both the start of a series of records as well as the end, and most importantly find the line number for where I want to stop my line deletion command.

I also add a 'BEGIN;' and 'ROLLBACK;' to the file to run tests on the script before applying to the database.
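
The wrapped file ends up looking something like this (the table and columns here are placeholders for whatever the dump contains):

BEGIN;

COPY some_table (id, username) FROM stdin;
42	exampleuser
\.

ROLLBACK;  -- switch to COMMIT once the test run looks right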

Once I got the system down, I was able to pull and process about 3000 records I needed out of a 15 GB dump file in about 5 minutes. Testing and finally applying the records took another 10 minutes.

Text sequences

Somebody recently asked on the Postgres mailing list about "Generating random unique alphanumeric IDs". While there were some interesting solutions given, from a simple Pl/pgsql function to using mathematical transformations, I'd like to lay out a simple and powerful solution using Pl/PerlU.

First, to paraphrase the original request, the poster needed a table to have a text column be its primary key, and to have a five-character alphanumeric string used as that key. Let's knock out a quick function using Pl/PerlU that solves the generation part of the question:

DROP FUNCTION IF EXISTS nextvalalpha(TEXT);
CREATE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
LANGUAGE plperlu
AS $_$
  use strict;
  my $numchars = 5;
  my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
  my $value = join '' => @chars[map{rand @chars}(1..$numchars)];
  return $value;
$_$;

Pretty simple: it pulls a number of random characters from a list (with some commonly confused letters and numbers removed) and returns them as a string:

greg=# SELECT nextvalalpha('foo');
 nextvalalpha
--------------
 MChNf
(1 row)

greg=# SELECT nextvalalpha('foo');
 nextvalalpha
--------------
 q4jHm
(1 row)

So let's set up our test table. Since Postgres can use many things as column DEFAULTs, including user-defined functions, this is pretty straightforward:

DROP TABLE IF EXISTS seq_test;
CREATE TABLE seq_test (
  id    VARCHAR(5) NOT NULL DEFAULT nextvalalpha('foo'),
  city  TEXT,
  state TEXT
);

A quick test shows that the id column is auto-populated with some random values:

greg=# PREPARE abc(TEXT,TEXT) AS INSERT INTO seq_test(city,state)
greg-# VALUES($1,$2) RETURNING id;

greg=# EXECUTE abc('King of Prussia', 'Pennsylvania');
  id
-------
 9zbsd
(1 row)

INSERT 0 1

greg=# EXECUTE abc('Buzzards Bay', 'Massachusetts');
  id
-------
 4jJ5D
(1 row)

INSERT 0 1

So far so good. But while those returned values are random, they are not in any way unique, which a primary key needs to be. First, let's create a helper table to keep track of which values we've already seen. We'll also track the 'name' of the sequence as well, to allow for more than one unique set of sequences at a time:

DROP TABLE IF EXISTS alpha_sequence;
CREATE TABLE alpha_sequence (
  sname TEXT,
  value TEXT
);
CREATE UNIQUE INDEX alpha_sequence_unique_value ON alpha_sequence(sname,value);

Now we tweak the original function to use this new table.

CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  use strict;
  my $sname = shift;
  my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
  my $numchars = 5;
  my $toomanyloops = 10000; ## Completely arbitrary pick
  my $loops = 0;

  my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = $1 AND value = $2';
  my $sth = spi_prepare($SQL, 'text', 'text');

  my $value = '';
  SEARCHING:
  {
    ## Safety valve
    if ($loops++ >= $toomanyloops) {
      die "Could not find a unique value, even after $toomanyloops tries!\n";
    }
    ## Build a new value, then test it out
    $value = join '' => @chars[map{rand @chars}(1..$numchars)];
    my $count = spi_exec_prepared($sth,$sname,$value)->{processed};
    redo if $count >= 1;
  } 

  ## Store it and commit the change
  $SQL = 'INSERT INTO alpha_sequence VALUES ($1,$2)';
  $sth = spi_prepare($SQL, 'text', 'text');
  spi_exec_prepared($sth,$sname,$value);
  return $value;
$_$;

Alright, that seems to work well, and prevents duplicate values. Or does it? Recall that one of the properties of sequences in Postgres is that they live outside of the normal MVCC rules. In other words, once you get a number via a call to nextval(), nobody else can get that number again (even you!) - regardless of whether you commit or rollback. Thus, sequences are guaranteed unique across all transactions and sessions, even if used for more than one table, called manually, etc. Can we do the same with our text sequence? Yes!

For this trick, we'll need to ensure that we only return a new value if we are 100% sure it is unique. We also need to record the value returned, even if the transaction that calls it rolls back. In other words, we need to make a small 'subtransaction' that commits, regardless of the rest of the transaction. Here's the solution:

CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  use strict;
  use DBI;
  my $sname = shift;
  my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
  my $numchars = 5;
  my $toomanyloops = 10000;
  my $loops = 0;

  ## Connect to this very database, but with a new session
  my $port = spi_exec_query('SHOW port')->{rows}[0]{port};
  my $dbname = spi_exec_query('SELECT current_database()')->{rows}[0]{current_database};
  my $dbuser = spi_exec_query('SELECT current_user')->{rows}[0]{current_user};
  my $dsn = "dbi:Pg:dbname=$dbname;port=$port";
  my $dbh = DBI->connect($dsn, $dbuser, '', {AutoCommit=>1,RaiseError=>1,PrintError=>0});

  my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = ? AND value = ?';
  my $sth = $dbh->prepare($SQL);

  my $value = '';
  SEARCHING:
  {
    ## Safety valve
    if ($loops++ >= $toomanyloops) {
      die "Could not find a unique value, even after $toomanyloops tries!\n";
    }
    ## Build a new value, then test it out
    $value = join '' => @chars[map{rand @chars}(1..$numchars)];
    my $count = $sth->execute($sname,$value);
    $sth->finish();
    redo if $count >= 1;
  } 

  ## Store it and commit the change
  $SQL = 'INSERT INTO alpha_sequence VALUES (?,?)';
  $sth = $dbh->prepare($SQL);
  $sth->execute($sname,$value); ## Does a commit

  ## Only now do we return the value to the caller
  return $value;
$_$;

What's the big difference between this one and the previous version? Rather than examine the alpha_sequence table in our /current/ session, we figure out who and where we are, and make a completely separate connection to the same database using DBI. Then we find an unused value, INSERT that value into the alpha_sequence table, and commit that change outside of our current transaction. Only then do we return the value to the caller.

Postgres sequences also have a currval() function, which returns the last value returned via a nextval() for that sequence in the current session. The lastval() function is similar, but it returns the value from the most recent nextval() call, regardless of which sequence was used. We can make versions of these easily enough, because Pl/Perl functions have a built-in shared hash named '%_SHARED'. Thus, we'll add two new lines to the end of the function above:

...
  $sth->execute($sname,$value); ## Does a commit
  $_SHARED{nva_currval}{$sname} = $value;
  $_SHARED{nva_lastval} = $value;
...

Then we create a simple function to display that value, as well as throw an error if called too early - just like currval() does:

DROP FUNCTION IF EXISTS currvalalpha(TEXT);
CREATE FUNCTION currvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  my $sname = shift;
  if (exists $_SHARED{nva_currval}{$sname}) {
    return $_SHARED{nva_currval}{$sname};
  }
  else {
    die qq{currval of text sequence "$sname" is not yet defined in this session\n};
  }
$_$;

Now the lastval() version:

DROP FUNCTION IF EXISTS lastvalalpha();
CREATE FUNCTION lastvalalpha()
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  if (exists $_SHARED{nva_lastval}) {
    return $_SHARED{nva_lastval};
  }
  else {
    die qq{lastval (text) is not yet defined in this session\n};
  }
$_$;

For the next tests, we'll create a normal (integer) sequence, and see how it acts compared to our newly created text sequence:

DROP SEQUENCE IF EXISTS newint;
CREATE SEQUENCE newint START WITH 42;

greg=# SELECT lastval();
ERROR: lastval is not yet defined in this session

greg=# SELECT currval('newint');
ERROR:  currval of sequence "newint" is not yet defined in this session

greg=# SELECT nextval('newint');
 nextval
---------
      42
(1 row)

greg=# SELECT currval('newint');
 currval
---------
      42

greg=# SELECT lastval();
 lastval
---------
      42
greg=# SELECT lastvalalpha();
ERROR: error from Perl function "lastvalalpha": lastval (text) is not yet defined in this session

greg=# SELECT currvalalpha('newtext');
ERROR:  error from Perl function "currvalalpha": currval of text sequence "newtext" is not yet defined in this session

greg=# SELECT nextvalalpha('newtext');
 nextvalalpha
--------------
 rRwJ6

greg=# SELECT currvalalpha('newtext');
 currvalalpha
--------------
 rRwJ6

greg=# SELECT lastvalalpha();
 lastvalalpha
--------------
 rRwJ6

There is one more quick optimization we could make. Since the %_SHARED hash is available across our session, there is no need to do anything in the function more than once if we can cache it away. In this case, we'll cache away the server information we look up, the database handle, and the prepares. Our final function looks like this:

CREATE OR REPLACE FUNCTION nextvalalpha(TEXT)
RETURNS TEXT
SECURITY DEFINER
LANGUAGE plperlu
AS $_$
  use strict;
  use DBI;
  my $sname = shift;
  my @chars = split // => qw/abcdefghijkmnpqrstwxyzABCDEFGHJKLMNPQRSTWXYZ23456789/;
  my $numchars = 5;
  my $toomanyloops = 10000;
  my $loops = 0;

  ## Connect to this very database, but with a new session
  if (! exists $_SHARED{nva_dbi}) {
    my $port = spi_exec_query('SHOW port')->{rows}[0]{port};
    my $dbname = spi_exec_query('SELECT current_database()')->{rows}[0]{current_database};
    my $dbuser = spi_exec_query('SELECT current_user')->{rows}[0]{current_user};
    my $dsn = "dbi:Pg:dbname=$dbname;port=$port";
    $_SHARED{nva_dbi} = DBI->connect($dsn, $dbuser, '', {AutoCommit=>1,RaiseError=>1,PrintError=>0});
    my $dbh = $_SHARED{nva_dbi};
    my $SQL = 'SELECT 1 FROM alpha_sequence WHERE sname = ? AND value = ?';
    $_SHARED{nva_sth_check} = $dbh->prepare($SQL);
    $SQL = 'INSERT INTO alpha_sequence VALUES (?,?)';
    $_SHARED{nva_sth_add} = $dbh->prepare($SQL);
  }


  my $value = '';
  SEARCHING:
  {
    ## Safety valve
    if ($loops++ >= $toomanyloops) {
      die "Could not find a unique value, even after $toomanyloops tries!\n";
    }
    ## Build a new value, then test it out
    $value = join '' => @chars[map{rand @chars}(1..$numchars)];
    my $count = $_SHARED{nva_sth_check}->execute($sname,$value);
    $_SHARED{nva_sth_check}->finish();
    redo if $count >= 1;
  } 

  ## Store it and commit the change
  $_SHARED{nva_sth_add}->execute($sname,$value); ## Does a commit
  $_SHARED{nva_currval}{$sname} = $value;
  $_SHARED{nva_lastval} = $value;
  return $value;
$_$;

Having the ability to reach outside the database in Pl/PerlU - even if simply to go back in again! - can be a powerful tool, and allows us to do things that might otherwise seem impossible.

lessons = colors.find_all {|c| c.kind_of? Blue}

As noted in this nice article at Slate.com, the much-loved "Kind of Blue" celebrated its 50th anniversary recently. In the article, Fred Kaplan asks, and attempts to answer, "what makes this album so great?"

As somebody who has made a point of introducing his daughter (at infancy) to Miles Davis, who succumbed to the allure of jazz thanks in no small part to this very album, I would be remiss in my duties as a Caring Human not to blog about it. And yet this is a corporate blog for a consulting company that does lots of stuff with free software (as demonstrated so effectively here). What to do?

Fortunately, there's something we in the software world can learn from this album and the lessons Mr. Kaplan derives from it. Let's look at one key paragraph:

So Kind of Blue sounded different from the jazz that came before it. But what made it so great? The answer here is simple: the musicians. Throughout his career, certainly through the 1950s and '60s, Miles Davis was an instinctively brilliant recruiter; a large percentage of his sidemen went on to be great leaders, and these sidemen—especially Evans, Coltrane, and Adderley—were among his greatest. They came to the date, were handed music that allowed them unprecedented freedom (to sing their "own song," as Russell put it), and they lived up to the challenge, usually on the first take; they had a lot of their own song to sing.

How does one write music that provides the players "unprecedented freedom"? In this case, it meant giving them musical forms with relatively few harmonic changes; instead of four, or two, or even one chord per measure, a given chord/scale would last for several measures at a time; the improviser could thus think horizontally (melodically) rather than vertically (harmonically) and be freer in the choice of notes employed.

So of course the quality of the musicians mattered; the musical material was fairly bare, by standards of the day, and the musicians really needed to have something interesting to offer to make something worthwhile of it. As noted, these musicians did.

Let's derive two lessons from this:

  1. Given fairly sparse requirements and room to interpret intelligently, great things can happen
  2. Those great things will depend on the people you hire

Neither notion is particularly startling or insightful. But let's combine them with the next observation from Mr. Kaplan:

The album's legacy is mixed, precisely for this [freedom]. It opened up a whole new path of freedom to jazz musicians: Those who had something to say thrived; those who didn't, noodled. That's the dark side of what Miles Davis and George Russell (and, a few months later, Ornette Coleman, in his own even-freer style of jazz) wrought: a lot of noodling—New Age noodling, jazz-rock-fusion noodling, blaring-and-squealing noodling—all of it baleful, boring, and deadly (literally deadly, given the rise of tight and riveting rock 'n' roll). Some of their successors confused freedom with just blowing whatever came into their heads, and it turned out there wasn't much there.

Let's now note that the musicians on "Kind of Blue" were not inexperienced, and in some cases not markedly young (though by no means "old" or even middle-aged). Davis and Coltrane both were around 32 or 33 (my age; remind me to denigrate myself when I'm done); Adderley, Evans, and Cobb were right around 30. The youngest, Paul Chambers, was around 24, but had played with Davis for four or five years by that time. All of them had been working musicians for years -- in some cases many years -- and had worked with a variety of illustrious colleagues. Given that such musicians typically start musical training at a young age, play their instrument all through their teen years, college years, etc., and often begin working professionally in college if not high school, it's probably fair to say that their respective experience within their profession absolutely dwarfs that of the typical similarly-aged early-21st-century technology employee.

All the people on that recording spoke a common musical syntax. They didn't invent a new syntax by simplifying the tunes; they applied their common knowledge and experience to a new situation and made something beautiful. A ten year-old pianist armed with fingering patterns for the minor keys of D and E-flat could have played in time with the band on "So What". That pianist would not have known what to do with the fourths-laden voicings for the bass/piano call and response at the head of the tune. That pianist would not have known how to interact with Paul Chambers' beautiful bass lines, because he/she would lack the melodic and formal grammar, and the harmonic vocabulary, to know what the bass was saying, let alone conceive of an intelligent response. It took Bill Evans to do these particular things in such a stirring, striking manner. (And Wynton Kelly on the blues; let's not overlook that, because his solo on "Freddie" is burning.)

And yet, in addition to this common syntax, they each brought something different -- akin to an accent, or regional dialect. Evans' Ravel/Debussy-tinged harmonic voicings are perhaps like a more expansive lexicon. The intersection of their respective approaches was most important, but the differences made for a more colorful, inspired union.

So, when you're starting a software project, who do you want? Plenty of people out there can play a few chords in time. Do you want Miles' band? Or would you perhaps prefer these special guys? Who would you trust to create the customized experience that people will return to time and time again, today, tomorrow, next year?

It's not unreasonable to make these comparisons. A software project involves art and craft. The less clear the requirements, the more apparent this becomes. Relatively few projects start with a vision of such clarity that the role of technical interpretation is diminished. Even the most precise specification requires improvisation; if the spec provided the literal instructions on what functions to write and how to implement them, we would call it "code" and roll it out for you. For most projects, one must simply assume a profound need for deep technical interpretation; the only real question is how much business interpretation you need first.

Jazz performance is sometimes popularly construed as something like magic; something that can only be explained by "talent" and is fundamentally unfathomable. I remember my great aunt, who came to see my trio perform at a jazz festival in Vermont many years ago, saying "the music is just inside of you!" Well, yes, and the effort to get that little bit of music inside was colossal. There's nothing magical or unfathomable about it; it's hard work that takes years of dedicated, focused, consistent effort. It takes ear training, theory training, musical memory development, instrumental technique, improvisational technique, self-assessment and analysis, etc. It is a cultivated tradition of hard work, hard-won experience, that celebrates knowledge gained and skills acquired.

Sound familiar?

Why am I spending so much time on the obvious? Because apparently it's not obvious. It's really not obvious at all.

The past few years brought us some great advancements in the webapp framework space, making it easier to design decent-quality web applications with a reasonable separation of concerns. Yet, rather like "Kind of Blue", they open up this idea that "Hey, anybody can do it!" And it's true; most anybody can write an app that achieves a minimal separation of presentation, business, and data logic. But what of it? The first problem is a culture that ignores the rest of the software stack, because of a belief that everything that matters is in the application layer. A decent object-relational mapper can encourage reasonably good data modeling practices at the relational/structural level, but effectively encourages a culture of database ignorance and lowest-common-denominator thinking. But most problematic of all: the simple idea that because somebody knows framework X, that somebody can solve your problem.

I can play three chords, and you will love them forever.

If your problem was easy to solve, you wouldn't need said somebody; the difficulty in your problem isn't the framework, it's the business logic and the proper modeling thereof. In which case, you're best off with people who have built systems both simple and complex, big and small, and thus have a wealth of experience to call upon when interpreting their way to a reasonable technical solution to your problem.

The technical execution of the loose sketches that made up the material on "Kind of Blue" did not pose any particular challenge for Davis and company; the real challenge was in making something worthwhile from it. The answer did indeed lie in the people, and a willingness to let those people exercise their own discretion in pursuit of a common vision, drawing on years of hard-earned experience to find the best expression possible within the established framework.

Debugging prepared statements

I was recently handed the all-too-familiar DBA task of answering "why is this script running so slow?". After figuring out exactly which script it was and where it was running from, I narrowed down the large number of SQL commands it was issuing to one particularly slow one, which looked something like this in the pg_stat_activity view:

current_query 
-------------
SELECT DISTINCT id
FROM containers
WHERE code LIKE $1

Although each run of the query was too quick to measure just by watching pg_stat_activity, it did show up quite often. So it was likely slow *and* being called many times in a loop somewhere. The use of 'LIKE' always throws a yellow flag, so those factors encouraged me to look closer at the query.

While the table in question did have an index on the 'code' column, it was not being used. This is because LIKE (on non-C locale databases) cannot use normal indexes - it needs a simpler character-by-character comparison. In Postgres, you can achieve this by using some of the built-in operator classes when creating an index. More details can be found in the documentation on operator classes. What I ended up doing was using text_pattern_ops:

SET maintenance_work_mem = '2GB';

CREATE INDEX CONCURRENTLY containers_code_textops
  ON containers (code text_pattern_ops);

Since this was on a production system (yes, I tested on a QA box first!), the CONCURRENTLY phrase ensured that the index did not block any reads or writes on the table while the index was being built. Details on this awesome option can be found in the docs on CREATE INDEX.

After the index was created, the following test query went from 800ms to 0.134ms!:

EXPLAIN ANALYZE SELECT * FROM containers WHERE code LIKE 'foobar%';

I then created a copy of the original script, stripped out any parts that made changes to the database, added a rollback to the end of it, and tested the speed. Still slow! Recall that the original query looked like this:

SELECT DISTINCT id
FROM containers
WHERE code LIKE $1

The $1 indicates that this is a prepared query. This leads us to the most important lesson of this post: whenever you see that a prepared statement is being used, it's not enough to test with a normal EXPLAIN or EXPLAIN ANALYZE. You must emulate what the script (e.g. the database driver) is really doing. So from psql, I did the following:

PREPARE foobar(text) AS SELECT DISTINCT id FROM containers WHERE code LIKE $1;
EXPLAIN ANALYZE EXECUTE foobar('foobar%');

Bingo! This time, the new index was *not* being used. This is the great trade-off of prepared statements - while they let the server parse and plan the query only once, the planner cannot anticipate what you might pass in as an argument, so it makes the best generic plan possible. Thus, your EXPLAIN of the same query using literals rather than placeholders via PREPARE may look very different.

While it's possible to make workarounds at the database level for the problem of prepared statements using the "wrong" plan, in this case it was simply easier to tell the existing script not to use prepared statements at all for this one query. As the script was using DBD::Pg, the solution was to simply use the pg_server_prepare attribute like so:

$dbh->{pg_server_prepare} = 0;
my $sth = $dbh->prepare('SELECT DISTINCT id FROM containers WHERE code LIKE ?');
$dbh->{pg_server_prepare} = 1;
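
As a side note, DBD::Pg can also disable server-side preparing for just one statement by passing the attribute to prepare() itself, which avoids toggling the database handle back and forth (a sketch of the same query):

my $sth = $dbh->prepare(
    'SELECT DISTINCT id FROM containers WHERE code LIKE ?',
    { pg_server_prepare => 0 }   ## applies only to this statement handle
);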

The effect of this inside of DBD::Pg is that instead of using PQprepare and then PQexecPrepared for each call to $sth->execute(), DBD::Pg will, for every call to $sth->execute(), quote the parameter itself, build a string containing the original SQL statement and the quoted literal, and send it to the backend via PQexec. Normally not something you want to do, but the slight overhead of doing it that way was completely overshadowed by the speedup of using the new index.

The final result: the script that used to take over 6 hours to run now only takes about 9 minutes to complete. Not only are the people using the script much happier, but it means less load on the database.

Site Search on Rails

I was recently tasked with implementing site search using a commercially available site search application for one of our clients (Gear.com). The basic implementation requires that a SOAP request be made and the XML data returned be parsed for display. The SOAP request contains basic search information, and additional information such as product pagination and sort by parameters. During the implementation in a Rails application, I applied a few unique solutions worthy of a blog article. :)

The first requirement I tackled was to design the web application in a way that produced search engine friendly canonical URLs. I used Rails routing to implement a basic search:

map.connect ':id', :controller => 'basic', :action => 'search'

Any simple search path would be sent to the basic search query that performed the SOAP request followed by XML data parsing. For example, http://www.gear.com/s/climb is a search for "climb" and http://www.gear.com/s/bike for "bike".

After the initial search, a user can refine the search by brand, merchant, category or price, or choose to sort the items, select a different page, or modify the number of items per page. I chose to force the order of refinement; for example, brand and merchant ordering was constrained with the following Rails routes:

map.connect ':id/brand/:rbrand', :controller => 'basic', :action => 'search' 
map.connect ':id/merch/:rmerch', :controller => 'basic', :action => 'search' 
map.connect ':id/brand/:rbrand/merch/:rmerch', :controller => 'basic', :action => 'search' 

Rather than allowing the refinement parameters to appear in any order in the URL, such as http://www.gear.com/s/climb/brand/Arcteryx/merch/Altrec and http://www.gear.com/s/climb/merch/Altrec/brand/Arcteryx, the order of search refinement is always limited to the Rails routes specified above, so only the former URL is valid in this example.

For example, http://www.gear.com/s/climb/brand/Arcteryx/merch/Altrec is a valid URL for Arcteryx Altrec climb, http://www.gear.com/s/climb/brand/Arcteryx for Arcteryx climb, and http://www.gear.com/s/climb/merch/Altrec for Altrec climb.

All URLs on any given search result page are built with a single Ruby method to force the refinement and parameter order. The method input requires the existing refinement values, the new refinement key, and the new refinement value. The method builds a URL with all previously existing refinement values and adds the new refinement value. Rather than generating millions of URLs with the various refinement combinations of brand, merchant, category, price, items per page, pagination number, and sort method, this logic minimizes duplicate content. The use of Rails routes and the chosen URL structure also creates search engine friendly URLs that can be targeted for traffic. Below is example pseudocode with the URL-building method:

def build_url(parameters, new_key, new_value)
  # set url to basic search information
  # append brand info to url if parameters[:brand] exists or if new_key is brand
  # append merchant info to url if parameters[:merchant] exists or if new_key is merchant
  # append category info to url if parameters[:cat] exists or if new_key is cat
  # ...
end
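
To make that concrete, here is a minimal Ruby sketch of how such a method might look; the refinement keys, the '/s/' prefix, and the :search parameter are illustrative rather than the production code:

def build_url(parameters, new_key, new_value)
  refinements = parameters.merge(new_key => new_value)
  url = "/s/#{parameters[:search]}"
  [:brand, :merch, :cat, :price].each do |key|
    url << "/#{key}/#{refinements[key]}" if refinements[key]
  end
  url
end

# build_url({:search => 'climb', :brand => 'Arcteryx'}, :merch, 'Altrec')
#   => "/s/climb/brand/Arcteryx/merch/Altrec"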

The next requirement I encountered was breadcrumb functionality. Breadcrumbs are an important usability feature that provide the ability to navigate backwards in search and refinement history. Because of the canonical URL solution described above, the URL could not be used to indicate the search refinement history. For example, http://www.gear.com/s/climb/brand/Arcteryx/merch/Altrec does not indicate whether the user had refined by brand then merchant, or by merchant then brand. I investigated a few solutions having implemented similar breadcrumb functionality for other End Point clients, including appending the '#' (hash or relative url) to the end of the URL with details of the user refinement path, using JavaScript to set a cookie containing the user refinement path whenever a link was clicked, and using a session variable to track the user refinement path. In the end, I found it easiest to use a single session variable to track the user refinement path. The session variable contained all information needed to display the breadcrumb with a bit of parsing.

For example, for the URL mentioned above, the session variable of 'brand-Arcteryx:merch-Altrec' would yield the breadcrumb: "Your search: climb > Arcteryx > Altrec" And the session variable 'merch-Altrec:brand-Arcteryx' would yield the breadcrumb: "Your search: climb > Altrec > Arcteryx". I could have used more than one session variable, but this solution worked out to be simple and comprised less than 10 lines of code.
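
A rough Ruby sketch of that parsing (the session key name and helper are hypothetical):

# session[:refinement_path] might hold "brand-Arcteryx:merch-Altrec"
def breadcrumb(search_term, refinement_path)
  crumbs = refinement_path.to_s.split(':').map do |entry|
    key, value = entry.split('-', 2)
    value   # a fuller version might map key/value to a display name here
  end
  "Your search: #{([search_term] + crumbs).join(' > ')}"
end

# breadcrumb('climb', 'brand-Arcteryx:merch-Altrec')
#   => "Your search: climb > Arcteryx > Altrec"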

Another interesting necessity was determining the best way to parse the XML data. I researched several XML parsers including XmlSimple, Hpricot, REXML, and libxml. About a year ago, John Nunemaker reported on some benchmark testing of several of these packages (Parsing XML with Ruby). After some investigative work, I chose Hpricot because it was very easy to implement complex selectors that reminded me of jQuery selectors (which are also easy to use). The interesting thing I noticed throughout the implementation was that the refinement parsing took much more time than the actual product parsing and formatting. For Gear.com, the number of products returned ranges from 20-60, and products were quickly parsed. The number of refinements returned ranged from very small for a specific search like "Moccasym" (4 refinement options) to much larger for a general search like "jacket" (50+ refinement options). If performance becomes an issue in the future, I can further investigate the use of libxml-ruby or other Ruby XML parsing tools that may improve performance.

A final point of interest was the decision to tie the Rails application to the same database that drives the product pages (which was easily done). This decision was made to allow access to frontend taxonomy information for product categorization. For example, if a user chooses to refine a search by a category (jacket in Kids Clothing), the Rails app can retrieve all the taxonomy information for that category, such as the display name, the number of products in that category, subcategories, and sub-subcategories. This may be important information for additional features, such as providing the ability to view the subcategories in this category or view other products in this category that aren't shown in the search results.

I was happy to see the success of this project after working through the deliverables. Future work includes integration of additional search features common to many site search packages, such as implementing refinement by color and size, or retrieving recommended products or best sellers.

Learn more about End Point's Ruby on Rails development.

Google Voice first impressions

I've been using Google Voice on and off for about a week now, and wanted to share my first impressions.

Google Voice is still available only by invite, but I only had to wait about 3 days to get access. Signup was quick and easy, and doesn't cost any money. The hardest part was selecting a phone number -- you can choose one from most anywhere in the United States, but as far as I know you can't change it later, at least not easily. I finally got the Montana phone number I've always wanted!

Simultaneous multiple call forwarding: The feature I was most aware of beforehand is having your Google Voice number forward calls to as many other numbers as you want. This is actually better than I expected, because it calls the other numbers simultaneously, not in series, so there's very little delay to the caller compared with directly calling. Whichever phone you answer first and tell the robot lady "1" is the one that takes the call.

Android integration (not VoIP): I was most excited to use Google Voice as a VoIP client on my Android phone, but that is a feature I apparently imagined all by myself. The Android Google Voice client still uses cell phone minutes and goes through the cell network, but it routes the calls through Google's infrastructure and shows caller ID as your Google Voice number. The point of this is to save money on international calls because you are only making a domestic call to your Google Voice number as far as your cell phone provider is concerned, and then Google routes the international call and charges you a few cents a minute. (This is the only part of Google Voice that costs money at the moment, as far as I know.) That's a neat feature, but what I really wanted was a completely VoIP client that I can use on wifi or a 3G data network to make calls entirely outside the cell phone network. Any Google Voice developers reading: Here's my enthusiastic vote for that! I would use it all the time if it existed.

Google Contacts integration: Because I'm using an Android phone and its native contact management, my contacts are synced with my Google account and thus appear in Google Voice automatically. That is convenient. It means the GOOG has all my contact info. That's more or less the case with anyone routing phone calls for me anyway.

Free SMS: Google Voice does free SMS text messages, both sending and receiving, even if you have no cell phone. If you do have a cell phone configured with it, it also forwards incoming SMS to your phone. Very handy. Every cell phone SMS someone has sent me has made it, but for some reason I haven't been able to get Skype's SMS to go to Google Voice at all. That keeps me for now from having SkypeOut calls show a Google Voice number as the caller ID, unfortunately.

Free domestic long distance: Another nice feature I hadn't known about before is Google Voice's ability to initiate two outgoing calls, one to one of my configured phones and the other to anyone else, and connect us. This makes it possible to call whomever you want without using any long distance. A feature I haven't tried yet is having others call the Google Voice number and conferencing them in.

Volume attenuation problem: The biggest problem I've noticed with Google Voice call quality so far is a serious attenuation of sound volume. When talking one-on-one with someone it's tolerable, and many people may not notice it. But when I called into a conference call with Google Voice people could barely hear me at all, when I talked loudly. I could hear them ok, but they couldn't hear me. I use that conference call facility all the time with regular phones, cell phones, and SkypeOut, and have no problems there. If Google could somehow boost the call volume to compensate for the extra connections, that might help. But for now I simply can't use it for conference calls, where I spend a fair amount of time every week.

Silent participants dropped: The second-biggest problem is that Google Voice apparently intentionally drops the call without warning whenever one side is silent for more than a few minutes. I haven't seen this documented anywhere, and Google doesn't tell you in any way, but it's pretty clear that's what's happening. As long as both sides are talking fairly frequently it's fine, but on conference calls I often have to stay muted for many minutes while others are talking, and then I'm unceremoniously dropped. So that's strike two against using Google Voice for conference calls right now.

Voice mail transcription: Voice mail left at Google Voice is available via the Android app and the web app, and can be emailed to you, but most amazingly, if it's in English, it is automatically transcribed and sent via SMS text message! It was hard to imagine this would work very well. I had Kiel test it for me. Compare his voice mail recording to the transcription:

A little cutie of the corner told me to leave you not Grandmama call messages and so the core in based cheese card test, kittens From, trucking, and you know from getting to fat until all the super caliper at July 6th and if that's not ballistic. I don't even know it's realistic. Because let's just face it. Corn is, where's that. See you.

Note that it displays the transcription lighter or darker depending on its level of confidence in each word. The transcription is not perfect, and Kiel deliberately chose words and phrases to be difficult to decode, but it's definitely close enough to give an idea of what the message is about without having to listen to the recording. Very useful if you're somewhere you can only peek at the phone for a few moments and can decide if the message warrants excusing yourself from the room or not.

Listen in: I haven't used this yet, but can imagine I would. You can let people start leaving voice mail but listen to it in real time and decide to take the call, just as with old answering machines.

Switch phones: I haven't used this yet either, but you can switch phones during a call from a land line to a cell phone or vice versa.

In summary, it is not the be-all and end-all of phone services that I may have hoped for, but it's a very useful addition to the world of phone services, is easy to get going with, and is mostly free of charge. It'll be fun to see where Google takes this.

Perl+Postgres: changes in DBD::Pg 2.15.1

DBD::Pg, the Perl interface to Postgres, recently released version 2.15.1. The last two weeks have seen a quick flurry of releases: 2.14.0, 2.14.1, 2.15.0, and 2.15.1. Per the usual versioning convention, the numbers on the far right (in this case the "dot one" releases) were simply bug fixes, while 2.14.0 and 2.15.0 introduced API and/or major internal changes. Some of these changes are explained below.

From the Changes file for 2.15.0:

CHANGE:
 - Allow execute_array and bind_param_array to take oddly numbered items, 
   such that DBI will make missing entries undef/null (CPAN bug #39829) [GSM]

The Perl Database Interface (DBI) has a neat feature that allows you to execute many sets of items at one time, known as execute_array. The basic format is to pass in a list of arrays, one array per placeholder, with each array holding that placeholder's values for every execution of the query. For example:

## Create a simple test table with two columns
$dbh->do('DROP TABLE IF EXISTS people');
$dbh->do('CREATE TABLE people (id int, fname text)');

## Pass in all ids as a single array
my @numbers = (1,2,3);

## Pass in all names as a single array
my @names = ("Garrett", "Viktoria", "Basso");

## Prepare the statement
my $sth = $dbh->prepare('INSERT INTO people VALUES (?, ?)');

## Execute the statement multiple times (three times in this case)
$sth->execute_array(undef, \@numbers, \@names);
## (the first argument is an optional argument hash which we don't use here)

## Pull back and display the rows from our new table
$SQL = 'SELECT id, fname FROM people ORDER BY fname';
for my $row (@{$dbh->selectall_arrayref($SQL)}) {
    print "Found: $row->[0] : $row->[1]\n";
}

$ perl testscript.pl
Found: 3 : Basso
Found: 1 : Garrett
Found: 2 : Viktoria

In 2.15.0, we loosened the requirement that all of the arrays contain the same number of items. Per the DBI spec, any "missing" items are considered undef, which maps to a SQL NULL. Thus:

$dbh->do('DROP TABLE IF EXISTS people');
$dbh->do('CREATE TABLE people (id int, fname text)');

## Note that this time there are only two ids given, not three:
my @numbers = (1,2);
my @names = ("Garrett", "Viktoria", "Basso");
my $sth = $dbh->prepare("INSERT INTO people VALUES (?, ?)");

$sth->execute_array(undef, \@numbers, \@names);

## Show a question mark for any null ids
$SQL = q{
SELECT CASE WHEN id IS NULL THEN '?' ELSE id::text END, fname 
FROM people ORDER BY fname
};
for my $row (@{$dbh->selectall_arrayref($SQL)}) {
    print "Found: $row->[0] : $row->[1]\n";
}

$ perl testscript2.pl
Found: ? : Basso
Found: 1 : Garrett
Found: 2 : Viktoria

Also note that bind_param_array is an alternate way to add the list of arrays before the execute is called. This is similar in concept to a regular execute: if you bind the values first, you can call execute without any arguments:

...
$sth->bind_param_array(1, \@numbers);
$sth->bind_param_array(2, \@names);
$sth->execute_array(undef);
...

CHANGE:
 - Use PQexecPrepared even when no placeholders (CPAN bug #48155) [GSM]

Sending queries to Postgres via DBD::Pg usually involves two steps: prepare and execute. The prepare is done one time, while the execute can be called many times, often with different arguments. Previously, DBD::Pg would call PQexec for queries that had no placeholders. However, the ability to handle placeholders smoothly is only one advantage of using server-side prepares in Postgres. The other advantage is that Postgres only has to parse the query a single time, in the initial prepare. In 2.15.0, we use PQexecPrepared for all queries, whether they have placeholders or not. The upshot is that multiple calls to the execute() function will be a little bit faster, and that we only use PQexec when we really have to.
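
In other words, the usual prepare-once, execute-many pattern now gets the benefit of a server-side prepared statement even when the query text itself has no placeholders (a sketch reusing the people table from above):

my $sth = $dbh->prepare('SELECT id, fname FROM people ORDER BY fname');
for (1..3) {
    $sth->execute();   ## each execute() goes through PQexecPrepared, reusing the parsed query
    while (my ($id, $fname) = $sth->fetchrow_array) {
        print "Found: $id : $fname\n";
    }
}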


CHANGE:
 - Fix quoting of booleans to respect more Perlish variants (CPAN bug #41565) [GSM]

In previous versions, the mapping of Perl vars to booleans was very simple, doing only a literal 0/1 mapping. However, Perl's notion of "truth" is richer than that. We can now do things like this:

use DBD::Pg qw(:pg_types);  ## exports constants such as PG_BOOL

for my $name ('0', '1', '0E0', '0 but true', 'F', 'T', 'TRUE', 'false') {
  printf qq{Value '%s' is %s\n}, $name, $dbh->quote($name, {pg_type => PG_BOOL});
}

$ perl testscript3.pl
Value '0' is FALSE
Value '1' is TRUE
Value '0E0' is TRUE
Value '0 but true' is TRUE
Value 'F' is FALSE
Value 'T' is TRUE
Value 'TRUE' is TRUE
Value 'false' is FALSE

CHANGE:
  - Return ints and bools-cast-to-number from the db as true Perlish numbers.
    (CPAN bug #47619) [GSM]

This one is a little more subtle. When a value is returned from the database, it gets mapped back to a string. So even if the value in the database came from an INTEGER column, by the time it made its way back to your Perl script it was a string that happened to hold an integer value. DBD::Pg now attempts to cast some types to their Perl equivalents. This is normally hard to see without peering inside the Perl internals, but using Data::Dumper can show you the difference:

use Data::Dumper;

## Ask Postgres to return a string and an integer
$SQL = 'SELECT 123::text, 123::integer';
$info = $dbh->selectall_arrayref($SQL)->[0];
print Dumper $info;

## Older versions of DBD::Pg give:
$VAR1 = [
          '123',
          '123'
        ];

## The new and improved version gives:
$VAR1 = [
          '123',
          123
        ];

A small difference, but not an unimportant one - this change came about through a bug report, as the old behavior was causing problems when DBD::Pg was interacting with JSON::XS. Special thanks to Tim Bunce (author of DBI, maintainer of the amazing NYTProf, and all-around Perl guru), who found an important bug in this solution in 2.14.0, which led to the quick release of 2.14.1. Lesson learned: don't try converting ints to floats via sv_setnv.


Most of the other changes in 2.14 and 2.15 are bug fixes of one sort or another. To keep up with the changes or to talk about the project more, please join the mailing list.

More PostgreSQL and SystemTap

Recently I've been working on a database with many multi-column indexes, and I've wondered how often all the columns of each index were actually used. Many of the indexes in question are primary key indexes, and I need all the columns to guarantee uniqueness, but for non-unique indexes, it would make sense to remove as many columns from the index as possible. Especially with PostgreSQL 8.3 or greater, where I can take advantage of heap-only tuples[1], leaving columns out of the index would be a big win. PostgreSQL's statistics collector will already tell me how often an index is scanned. That shows up in pg_stat_all_indexes. But for a hypothetical index scanned 100 times, there's no way to know how many of those 100 scans used all the columns of the index, or, for instance, just the first column.

First, an example. I'll create a table with three integer columns, and fill it with random data:

5432 josh@josh# CREATE TABLE a (i INTEGER, j INTEGER, k INTEGER);
CREATE TABLE
5432 josh@josh*# INSERT INTO a SELECT i, j, k FROM (SELECT FLOOR(RANDOM() * 10) AS i, FLOOR(RANDOM() * 100) AS j, FLOOR(RANDOM() * 1000) AS k, GENERATE_SERIES(1, 1000)) f;
INSERT 0 1000
5432 josh@josh*# CREATE INDEX a_ix ON a (i, j, k);
CREATE INDEX
5432 josh@josh*# ANALYZE a;
ANALYZE
5432 josh@josh*# COMMIT;
COMMIT

This leaves me with a three-column index on 1000 rows of the following:

5432 josh@josh*# SELECT * FROM a LIMIT 10;
 i | j  |  k  
---+----+-----
 3 |  6 | 380
 7 | 94 | 933
 1 | 73 | 326
 2 | 86 | 224
 2 | 59 | 336
 9 | 44 | 220
 9 | 48 | 694
 3 | 27 | 268
 3 |  0 | 410
 8 | 25 | 337
(10 rows)

Now I need to make a query that will use the index. That's easy enough, with these two queries. As shown by the index condition, the first query uses all three columns of the index, and the second, only two.

5432 josh@josh# EXPLAIN SELECT * FROM a WHERE i > 8 AND j > 80 AND k > 800;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Bitmap Heap Scan on a  (cost=5.64..10.74 rows=4 width=12)
   Recheck Cond: ((i > 8) AND (j > 80) AND (k > 800))
   ->  Bitmap Index Scan on a_ix  (cost=0.00..5.64 rows=4 width=0)
         Index Cond: ((i > 8) AND (j > 80) AND (k > 800))
(4 rows)

5432 josh@josh*# EXPLAIN SELECT * FROM a WHERE i > 8 AND j > 80;
                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on a  (cost=5.37..10.67 rows=20 width=12)
   Recheck Cond: ((i > 8) AND (j > 80))
   ->  Bitmap Index Scan on a_ix  (cost=0.00..5.36 rows=20 width=0)
         Index Cond: ((i > 8) AND (j > 80))
(4 rows)

Inside PostgreSQL, these queries result in a call to _bt_first() inside src/backend/access/nbtree/nbtsearch.c. This function takes two parameters: an IndexScanDesc object called scan, which describes the index to scan, the key to look for, and some other things; and a ScanDirection parameter to tell _bt_first() which direction to scan the index. It's this call that tells the statistics collector about each index scan, and it's this call that we'll instrument with SystemTap. I'm interested in the value in scan->numberOfKeys, which tells me how many of the index's keys will be considered in each scan. SystemTap makes getting this information really easy. I gave an introduction to SystemTap and using it with PostgreSQL in an earlier post; the following assumes familiarity with that material.

Since PostgreSQL doesn't come with a DTrace probe built into the _bt_first() function, I'll use SystemTap's ability to probe directly into a function. Conveniently, SystemTap also allows access to the values of variables in the function's memory space at runtime. Note that the technique shown below requires a PostgreSQL binary built with --enable-debug. Without debug information in the binary, different techniques are used, and the information is harder to get.

The test script I used is as follows:

probe process("/usr/local/pgsql/bin/postgres").function("_bt_first")
{
          /* Time of call */
        printf ("_bt_first at time %d\n", get_cycles())
          /* Number of scan keys */
        printf("%d scan keys\n", $scan->numberOfKeys)
          /* OID of index being scanned */
        printf("%u index oid\n\n", $scan->indexRelation->rd_id)
}

Note that the script above accesses variables in the _bt_first() function just as standard C functions would. The script has the following output:

[josh@localhost ~]$ sudo /usr/local/bin/stap -v test.d
Pass 1: parsed user script and 59 library script(s) in 130usr/70sys/196real ms.
Pass 2: analyzed script: 2 probe(s), 3 function(s), 0 embed(s), 0 global(s) in 50usr/50sys/103real ms.
Pass 3: translated to C into "/tmp/stapzCCwZE/stap_1854c2da59908c3e3633d6385ca6ce52_2782.c" in 120usr/80sys/209real ms.
Pass 4, preamble: (re)building SystemTap's version of uprobes.
Pass 4: compiled C into "stap_1854c2da59908c3e3633d6385ca6ce52_2782.ko" in 2240usr/3270sys/8102real ms.
Pass 5: starting run.
_bt_first at time 49379911010213
1 scan keys
2703 index oid

_bt_first at time 49379982691988
1 scan keys
2684 index oid

_bt_first at time 49379987397126
1 scan keys
2684 index oid

You'll note several indexes get scanned immediately. These are indexes from the PostgreSQL catalog. The index we created above has OID 16388. First, I'll run the query with three scan keys, followed by the query with two keys:

_bt_first at time 50357469430819
3 scan keys
16388 index oid

_bt_first at time 50363763650571
2 scan keys
16388 index oid

As expected, SystemTap reported first three and then two scan keys used, along with the OID of the a_ix index I created. With a technique like this I could, at least theoretically, get an exact usage profile for each index, and determine whether they need all the columns they have.
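
Taking this one step further, here is a sketch of a script that tallies scans per index OID and key count rather than printing each call, so a usage profile accumulates until the probe is stopped:

global scans

probe process("/usr/local/pgsql/bin/postgres").function("_bt_first")
{
        scans[$scan->indexRelation->rd_id, $scan->numberOfKeys] <<< 1
}

probe end
{
        foreach ([oid, nkeys] in scans)
                printf("index %u: %d scan(s) with %d key(s)\n",
                       oid, @count(scans[oid, nkeys]), nkeys)
}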

[1] See, for example, this page.

Look Ma, I made an iPhone Enterprise Application!

One of our clients recently came to us for help with implementing an iPhone Enterprise application. This particular application involved deploying signed mail configurations for customers with iPhones. It was primarily a server-side application, although the front-end interface was created using Apple's DashCode development tools. Although this was a web application, the DashCode integration let us build the interface so that it looked and behaved like a native application on the iPhone. Client-side validation was performed in a way that, for all intents and purposes, appeared native.

The backend was a traditional CGI script which generated the .mobileconfig configuration payloads to easily integrate the customer's mail server information into the iPhone's Mail application. It was written to support any number of accounts deployed per customer, and the resulting payload was signed and verified with the customer's PEM key.

We integrated openssl's pkcs12 support to transparently sign the generated dynamic mobileconfig. This was keyed off of the client's deployment key, so all of the generated keys were automatically indicated as trusted and registered by the client when installed on the iPhone.
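
For reference, the signing step can be done with openssl's smime subcommand once the certificate and key have been pulled out of the customer's PKCS#12 bundle; this is a sketch with placeholder file names, not the exact commands used on the project:

# extract the signing certificate and private key from the .p12 bundle
openssl pkcs12 -in customer.p12 -clcerts -nokeys -out signer.pem
openssl pkcs12 -in customer.p12 -nocerts -nodes -out signer.key

# sign the generated payload so the iPhone displays the profile as verified
openssl smime -sign -signer signer.pem -inkey signer.key \
    -nodetach -outform der \
    -in unsigned.mobileconfig -out signed.mobileconfig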

Action shots: