PostgreSQL tip: using pg_dump to extract a single function

A common task that comes up in PostgreSQL is the need to dump or edit a specific function. While ideally you're using DDL files and version control (hello, git!) to manage your schema, you don't always have the luxury of working in such a controlled environment. Recent versions of psql have the \ef command to edit a function from within your favorite editor, but this is only available from version 8.4 onward.

An alternate approach is to use the following invocation:

  pg_dump -Fc -s | pg_restore -P 'funcname(args)'

The -s flag is the short form of --schema-only; i.e., we don't waste time/space dumping the data. The -P flag tells pg_restore to extract only the function with the given signature.

As always, there are some caveats: the function name must be spelled out explicitly, using the full type names as they appear in the dump's custom format (i.e., you must use 'foo_func(integer)' instead of 'foo_func(int)'). You can always see a list of all of the available functions by using the command:

  pg_dump -Fc -s | pg_restore -l | grep FUNCTION
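
Putting it all together, extracting a single function from a database named mydb (the database and function names here are hypothetical) looks like this:

  pg_dump -Fc -s mydb | pg_restore -P 'foo_func(integer)' > foo_func.sql

The resulting file contains the CREATE FUNCTION statement, ready to edit and load back in with psql.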

Slony: Cascading Subscriptions

Sometimes you run into a situation where you need to replicate one dataset to many machines in multiple data centers, with different costs associated with sending to each (either real costs, as in bandwidth, or virtual costs, as in the amount of time it takes to transmit to each machine). Defining a Slony cluster to handle this is easy, as you can specify the topology and the paths taken to replicate any changes.

    Basic topology:
  • Data center A, with machines A1, A2, A3, and A4.
  • Data center B, with machines B1, B2, B3, and B4.
  • Data center C, with machines C1, C2, C3, and C4.


Figure 1: Non-cascaded slony replication nodes/pathways.

Node A1 is the master, which propagates its changes to all other machines. In the simple setup, A1 would push all of its changes to each node; however, if data centers B and C have high costs associated with transfer to their nodes, you end up transferring 4x the data needed for each data center. (We are assuming that traffic on the local subnet at each data center is cheap and fast.)

The basic idea, then, is to push the changes only once to each data center, and let the "master" machine in that data center push the changes out to the others there. This reduces traffic from the master to each data center, and removes any other costs associated with pushing to every node.


Figure 2: Cascaded slony replication nodes/pathways

Let's look at an example configuration:

cluster_init.sh:
    #!/bin/bash

    # admin node definitions and other slony-related information are
    # stored in our preamble file.  This will define the $PREAMBLE
    # environment variable that contains basic information common to all
    # Slony-related scripts, such as slony cluster name, the nodes
    # present, and how to reach them to install slony, etc.

    . slony_preamble.sh

    slonik <<EOF
    $PREAMBLE

    init cluster ( id = 1, comment = 'A1' );

    store node (id=2,  comment = 'A2', event node=1);
    store node (id=3,  comment = 'A3', event node=1);
    store node (id=4,  comment = 'A4', event node=1);
    store node (id=5,  comment = 'B1', event node=1);
    store node (id=6,  comment = 'B2', event node=1);
    store node (id=7,  comment = 'B3', event node=1);
    store node (id=8,  comment = 'B4', event node=1);
    store node (id=9,  comment = 'C1', event node=1);
    store node (id=10, comment = 'C2', event node=1);
    store node (id=11, comment = 'C3', event node=1);
    store node (id=12, comment = 'C4', event node=1);

    # Note: each conninfo below is the connection string the *client* node
    # uses to reach the *server* node named in the same statement.

    # pathways from A1 -> A2, A3, A4 and back
    store path (server = 1, client = 2, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 1, client = 3, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 1, client = 4, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 2, client = 1, conninfo = 'dbname=data host=node2.datacenter-a.com');
    store path (server = 3, client = 1, conninfo = 'dbname=data host=node3.datacenter-a.com');
    store path (server = 4, client = 1, conninfo = 'dbname=data host=node4.datacenter-a.com');

    # pathway from A1 -> B1 and back
    store path (server = 1, client = 5, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 5, client = 1, conninfo = 'dbname=data host=node1.datacenter-b.com');

    # pathways from B1 -> B2, B3, B4 and back
    store path (server = 5, client = 6, conninfo = 'dbname=data host=node1.datacenter-b.com');
    store path (server = 5, client = 7, conninfo = 'dbname=data host=node1.datacenter-b.com');
    store path (server = 5, client = 8, conninfo = 'dbname=data host=node1.datacenter-b.com');
    store path (server = 6, client = 5, conninfo = 'dbname=data host=node2.datacenter-b.com');
    store path (server = 7, client = 5, conninfo = 'dbname=data host=node3.datacenter-b.com');
    store path (server = 8, client = 5, conninfo = 'dbname=data host=node4.datacenter-b.com');

    # pathway from A1 -> C1 and back
    store path (server = 1, client = 9, conninfo = 'dbname=data host=node1.datacenter-a.com');
    store path (server = 9, client = 1, conninfo = 'dbname=data host=node1.datacenter-c.com');

    # pathways from C1 -> C2, C3, C4 and back
    store path (server = 9, client = 10, conninfo = 'dbname=data host=node1.datacenter-c.com');
    store path (server = 9, client = 11, conninfo = 'dbname=data host=node1.datacenter-c.com');
    store path (server = 9, client = 12, conninfo = 'dbname=data host=node1.datacenter-c.com');
    store path (server = 10, client = 9, conninfo = 'dbname=data host=node2.datacenter-c.com');
    store path (server = 11, client = 9, conninfo = 'dbname=data host=node3.datacenter-c.com');
    store path (server = 12, client = 9, conninfo = 'dbname=data host=node4.datacenter-c.com');

    EOF

As you can see in the initialization script, we're defining the basic topology for the cluster: each individual node, and the paths that Slony will use to communicate events and other status. Since Slony needs to communicate status in both directions, we define the paths for each edge both ways. In particular, we've defined pathways from A1 to each of the other A nodes, from A1 to B1 and C1, and from B1 and C1 to each of their respective nodes.
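
The preamble file itself isn't shown above; as a rough sketch only (the cluster name and layout are hypothetical, and the real file is site-specific), it might look something like this:

    #!/bin/bash
    # slony_preamble.sh -- hypothetical sketch of the shared preamble
    PREAMBLE="
    cluster name = my_cluster;
    node 1 admin conninfo = 'dbname=data host=node1.datacenter-a.com';
    node 2 admin conninfo = 'dbname=data host=node2.datacenter-a.com';
    # ... one 'node N admin conninfo' line for each remaining node (3 through 12) ...
    "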

Now it's a matter of defining the replication sets and describing the subscriptions for each. We will use something like the following for our script:

cluster_define_set1.sh:
    #!/bin/bash

    # reusing our standard cluster information
    . slony_preamble.sh

    slonik <<EOF
    $PREAMBLE

    create set ( id = 1, origin = 1, comment = 'set 1' );

    set add table ( set id = 1, origin = 1, id = 1, fully qualified name = 'public.table1');
    set add table ( set id = 1, origin = 1, id = 2, fully qualified name = 'public.table2');
    set add table ( set id = 1, origin = 1, id = 3, fully qualified name = 'public.table3');

    EOF

Here we've defined the tables that we want replicated from A1 to the entire cluster; there is nothing specific to this particular scenario that we need to consider.

cluster_subscribe_set1.sh:
    #!/bin/bash

    # reusing our standard cluster information
    . slony_preamble.sh

    slonik <<EOF
    $PREAMBLE

    # define our forwarding subscriptions (i.e., A1 -> B1, C1)
    subscribe set ( id = 1, provider = 1, receiver = 5, forward = yes);
    subscribe set ( id = 1, provider = 1, receiver = 9, forward = yes);

    # define the subscriptions for each of the datacenter sets
    # A1 -> A2, A3, A4
    subscribe set ( id = 1, provider = 1, receiver = 2, forward = no);
    subscribe set ( id = 1, provider = 1, receiver = 3, forward = no);
    subscribe set ( id = 1, provider = 1, receiver = 4, forward = no);

    # B1 -> B2, B3, B4
    subscribe set ( id = 1, provider = 5, receiver = 6, forward = no);
    subscribe set ( id = 1, provider = 5, receiver = 7, forward = no);
    subscribe set ( id = 1, provider = 5, receiver = 8, forward = no);

    # C1 -> C2, C3, C4
    subscribe set ( id = 1, provider = 9, receiver = 10, forward = no);
    subscribe set ( id = 1, provider = 9, receiver = 11, forward = no);
    subscribe set ( id = 1, provider = 9, receiver = 12, forward = no);

    EOF

The key point here is that the provider and receiver specified for each subscription determine how replication flows. For the subscription to any cascade point (i.e., B1 and C1), you need the 'forward = yes' parameter to ensure that events properly cascade to the sub-nodes. For all of the other nodes' subscriptions, set 'forward = no'.

In an actual deployment of this setup, you would want to wait for the subscriptions A1 -> B1 and A1 -> C1 to complete successfully before subscribing the sub-nodes. Additionally, this solution assumes high availability between nodes and does not address failure of particular machines; in particular, A1, B1, and C1 are key to maintaining full replication.
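
One way to confirm that the cascade points are caught up is to watch Slony's sl_status view on the origin before subscribing the sub-nodes. A rough sketch, assuming the cluster schema is named _my_cluster (adjust to your own cluster name):

  # run against the origin (A1); subscribe nodes 6-8 and 10-12 only once
  # nodes 5 and 9 show negligible lag
  psql -d data -h node1.datacenter-a.com -c \
    "SELECT st_received, st_lag_num_events, st_lag_time FROM _my_cluster.sl_status;"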

Postgres: Hello git, goodbye CVS

It looks like 2010 *might* be the year that Postgres officially makes the jump to git. Currently, the project uses CVS, with a script that moves things to the now-canonical Postgres git repo at git.postgresql.org. This script has been causing problems, and continues to do so, as CVS is not atomic. Once the project flips over, CVS will still be available, but CVS will be the slave and git the master, to put things in database terms. The conversion from git to CVS is trivial compared to the other way around, so there is no reason Postgres cannot continue to offer CVS access to the code for those unwilling or unable to use git.

On that note, I'm happy to see that the number of developers and committers who are using git - and publicly stating their happiness with doing so - has grown sharply in the last couple of years. Peter Eisentraut (with some help from myself) set up git.postgresql.org in 2008, but interest at that time was not terribly high, and there was still a lingering question of whether git was really the replacement for CVS, or if it would be some other version control system. There is little doubt now that git is going to win. Not only for the Postgres project, but across the development world in general (both open and closed source).

To drive the point home, Andrew has announced he is working on git integration with the Postgres build farm. Of course, I submitted a patch to do just that back in March 2008, but I was ahead of my time :). Besides, mine was a simple proof of concept, while it sounds like Andrew is actually going to do it the right way. Go Andrew!

Of all the projects I work on, the great majority are using git now. We've been using git at End Point as our preferred VCS for both internal projects and client work for a while now, and are very happy with our choice. There is only one other project I work on besides Postgres that uses CVS, but it's a small project. I don't know of any other project of Postgres' size that is still using CVS (anyone know of any?). Even emacs recently switched away from CVS, although they went with bazaar instead of git for some reason. Subversion is still being used by a substantial minority of the projects I'm involved with, mostly due to the historical fact that there was a window of time in which CVS was showing its limitations, but subversion was the only viable option. Sure would be nice if perl.org would offer git for Perl modules, as they do for subversion currently (/hint). Finally, there are a few of my projects that use something else (mercurial, monotone, etc.). Overall, git accounts for the lion's share of all my projects, and I'm very happy about that. There is a very steep learning curve with git, but the effort is well worth it.

If you want to try out git with the Postgres project, first start by installing git. Unfortunately, git is still new enough, and actively developed enough, that it may not be available in your distro's packaging system, or worse, the version available may be too old to be useful. Anything older than 1.5 should *not* be used, period, and 1.6 is highly preferred. I'd recommend taking the trouble to install from source if your packaged git is older than 1.6. Once installed, here are the steps to clone the Postgres repo.

git clone git://git.postgresql.org/git/postgresql.git postgres

This step may take a while, as git is basically putting the entire Postgres project on your computer - history and all! It took me three and a half minutes to run, but your time may vary.

Once that is done, you'll have a directory named "postgres". Change to it, and you can now poke around in the code, just like with CVS, but without all the ugly CVS directories. :)
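
From there, a few everyday git commands are enough to start exploring (nothing here is specific to the Postgres repo):

  cd postgres
  git log --pretty=oneline -10   # a quick look at recent commit history
  git branch -a                  # list local and remote branches
  git grep -n 'heap_insert'      # search the whole tree, no CVS cruft in the way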

For more information, check out the "Working with git" page on the Postgres wiki.

Here's to 2010 being the year Postgres finally abandons CVS!

Blog versus Forum, Blogger versus WordPress in Ecommerce

Today, Chris sent me an email with two questions for one of our ecommerce clients:

  • For ecommerce client A, should a forum or blog be added?
  • For ecommerce client A, should the client use Blogger or WordPress if they add a blog?

These are relevant questions for all of our clients because forums and blogs can provide value to a static site or ecommerce site. I answered Chris's questions and thought I'd expand on the answers a bit in a brief article.

First, a rundown comparing the pros and cons of blog versus forum:

Blog pros:
  • Content tends to be more organized.
  • Content can be built to be more targeted for search.
  • Content can be syndicated easily.

Blog cons:
  • User generated content will remain minimal if comments are the only form of user generated content in a blog.
  • If internal staff is responsible for authoring content, you can't write as much content as users can contribute.

Forum pros:
  • There can be much more content because users are contributing content.
  • Since there is more user generated content, it has the potential to cover more of the long tail in search.
  • There is more potential for user involvement and encouragement to build and contribute to a community.

Forum cons:
  • A forum requires management to prevent user spam.
  • A forum requires organization to maintain usability and search engine friendliness.

If we assume that it takes the same amount of effort to write articles as it does to manage user generated content, the decision comes down to whether or not you want user contributions to be part of the content. If the effort involved in writing content or managing user generated content differs, the decision should be based on how much effort the site owners want to invest. Other opportunities for user generated content include product reviews and user Q&A.

Next, a rundown comparing the pros and cons of Blogger versus self-hosted WordPress:

 
Blogger pros:
  • There are a decent number of widgets available to integrate into a Blogger instance.
  • Fast Google indexing of content may result, since the content is hosted by Google.
  • There is a decent search implementation on Blogger.
  • A Blogger instance is very easy to create and easy to use.

Blogger cons:
  • The Blogger taxonomy system is limited (using labels), and label pages are blocked in robots.txt, which limits indexation of and search traffic to the label pages.
  • Blogger does not allow for a flexible URL structure. Once an article is published, the URL does not change even if the title is changed.
  • Developers must be familiar with the Blogger template language to customize the template.
  • With Blogger, a blog can't be hosted at http://www.yoursite.com/blog/; it can be hosted at http://blog.yoursite.com/. While this results in a strong subdomain, it does not strengthen your domain for search through external links to the blog.

Self-hosted WordPress pros:
  • There is a very large feature set available through the WordPress plugin community.
  • Self-hosted WordPress blogs are relatively easy to set up. Many hosting platforms include WordPress installation and setup at the click of a button.
  • WordPress gives you control over the URL structure (articles, categories, tags) through permalinks.
  • Self-hosted WordPress can live at www.yoursite.com/blog/, which can strengthen your domain value in search through external links.
  • WordPress has a very flexible taxonomy system.

Self-hosted WordPress cons:
  • Self-hosted WordPress requires your own hosting, setup, and installation.
  • Self-hosted WordPress requires management of upgrades and plugins. Plugins may require code changes to the template files.
  • Self-hosted WordPress allows you to select existing themes, but you must be familiar with the WordPress template structure if you want a custom blog look.

The decision to create a Blogger blog or install a WordPress blog will depend on resources such as engineering or designer involvement. A self-hosted blog solution will likely provide a larger feature set and more flexibility, but it also requires more time to enhance, manage, and maintain the software. A hosted blog solution such as Blogger will be easy to set up and maintain, but it has disadvantages because it is a less flexible solution. I didn't discuss a WordPress-hosted solution because I'm not very familiar with that type of setup; however, I believe the WordPress-hosted solution limits the use of plugins and themes.

For our ecommerce clients, installing a self-hosted WordPress instance on top of their Spree or Interchange ecommerce site has been relatively simple. For another one of our clients, we developed a Radiant plugin to integrate Blogger article links into their site, which has worked well to fit their needs.

SEO 2010 Trends and Strategies

Yesterday I attended SEOmoz's webinar titled "SEO Strategies for 2010". It presented some interesting facts, comments, and resources for SEO in 2010 that I thought I'd highlight:

  • Mobile browser search
    • Mobile search and ecommerce will be a large area of growth in 2010.
    • Google Webmaster Tools allows you to submit mobile sitemaps, which can help battle duplicate content between non-mobile and mobile versions of site content. Another way to handle duplicate content would be to write semantic HTML that allows sites to serve non-mobile and mobile CSS.
  • Social Media: Real Time Search
    • Real-time search made its presence felt in 2009. The involvement of Twitter in search is evolving.
    • Tracking and monitoring on URL shortening services should be set up to measure traffic and benefit from Twitter.
    • Dan Zarrella published research on The Science of Retweeting. This is an interesting resource with fascinating statistics on retweets.
  • Social Media: Facebook's Dominance
    • Recent research by comScore has shown that 5.5% of all time on the web is spent in Facebook.
    • Facebook has very affordable advertising, and it has so much demographic and psychographic data that sites can deliver very targeted advertisements.
    • Facebook shouldn't be ignored as a potential business network, but metrics should be put in place to determine the value it brings.
  • Social Media: Shifting LinkGraph
    • In the past, sites received links from blogs, which became a factor in the sites' popularity rankings in search. Now, linking has shifted to microblogging platforms such as Twitter and other social media. Some folks are stingier about passing links through sites rather than social media. It's interesting to observe how links and information are passed through the web and consider how this can affect search.
  • Bing
    • Despite the fact that Google is responsible for a large percentage of search, Bing shouldn't be ignored.
    • Bing has shown some differences in ranking such as being less sensitive to TLDs (.info, .cc, .net, etc.), and giving more weight to sites with keywords in the domain than other search engines.
  • Other
    • Personalized search is on the rise. This is something to pay attention to, but hard to measure.
    • QDF (query deserves freshness), a search factor related to the freshness of content, has led to search engines indexing content faster. 2010 search strategies recommend becoming a news source to improve search performance.
    • Local search is definitely something to be aware of in 2010. Google's Place Rank algorithm is similar to the PageRank algorithm - it looks at specific location or local attributes as a factor in local search.

I found that a recurring theme of the discussion was having not just metrics, but the right metrics, such as conversion and engagement. Testing any of the recommendations above (improving your mobile browsing, getting involved in social media, optimizing for Bing) should be measured against conversion to determine the value of the effort. Multivariate or A/B testing was also recommended for testing local search optimization and other efforts.

Splitting Postgres pg_dump into pre and post data files

I've just released a small Perl script that has helped me solve a specific problem with Postgres dump files. When you use pg_dump or pg_dumpall, it outputs things in the following order, per database:

  1. schema creation commands (e.g. CREATE TABLE)
  2. data loading command (e.g. COPY tablename FROM STDIN)
  3. post-data schema commands (e.g. CREATE INDEX)

The problem is that using the --schema-only flag outputs the first and third sections into a single file. Hence, if you load that file and then load a separate --data-only dump, the data load can be very slow, as all the constraints, indexes, and triggers are already in place. The split_postgres_dump script breaks the dump file into two segments, a "pre" and a "post". (It doesn't handle a file with a data section yet, only a --schema-only version.)

Why would you need to do this instead of just using a full dump? Some reasons I've found include:

  • When you need to load the data more than once, such as debugging a data load error.
  • When you want to stop after the data load step (which you can't do with a full dump)
  • When you need to make adjustments to the schema before the data is loaded (seen quite a bit on major version upgrades)

Usage is simply ./split_postgres_dump.pl yourdumpfile.pg, which will then create two new files, yourdumpfile.pg.pre and yourdumpfile.pg.post. It doesn't produce perfectly formatted files, but it gets the job done!
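
A typical round trip with the split files might look like this (the database name and the separate data dump file are hypothetical):

  ./split_postgres_dump.pl yourdumpfile.pg
  psql -d newdb -f yourdumpfile.pg.pre    # tables, functions, and other objects
  psql -d newdb -f yourdata.sql           # your separate --data-only dump
  psql -d newdb -f yourdumpfile.pg.post   # indexes, constraints, and triggers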

It's a small script, so it has no bug tracker, git repo, etc. but it does have a small wiki page at http://bucardo.org/wiki/Split_postgres_dump from which you can download the latest version.

Future versions of pg_dump will allow you to break things into pre and post data sections with flags, but until then, I hope somebody finds this script useful.

Update: There is now a git repo:
git clone git://bucardo.org/split_postgres_dump.git

Gathering server information with boxinfo

I've just publicly released another Postgres-related script, this one called "boxinfo". Basically, it gathers information about a box (server), hence the catchy and original name. It outputs the information it finds into an HTML page, or into a MediaWiki formatted page.

The goal of boxinfo is to have a simple, single script that quickly gathers important information about a server into a web page, so that you can get a quick overview of what is installed on the server and how things are configured. It's also useful as a reference page when you are trying to remember which server it was that had Bucardo version 4.5.0 installed and was running pgbouncer.

As we use MediaWiki internally here at End Point (running with a Postgres backend, naturally), the original (and default) format is HTML with some MediaWiki specific items inside of it.

Because it is meant to run on as wide a range of boxes as possible, it's written in Perl. While we've run into a few boxes over the years that did not have Perl installed, the number lacking any other language you might choose (except perhaps sh) is much greater. It requires no other Perl modules, and simply makes a lot of system calls.

Various information about the box is gathered. System-wide things such as mount points, disk space, schedulers, and packaging systems are gathered first, along with versions of many common Unix utilities. We also gather information on some programs where more than just the version number is important, such as puppet, heartbeat, and lifekeeper. Of course, we also go into a great amount of detail about all the installed Postgres clusters on the box.

The program tries its best to locate every active Postgres cluster on the box, and then gathers information about it, such as where pg_xlog is linked to, any contrib modules installed, any interesting configuration variables from postgresql.conf, the size of each database, and lots of detailed information about any Slony or Bucardo configurations it finds.

The main page for it is on the Bucardo wiki at http://bucardo.org/wiki/Boxinfo. That page details the various command line options and should be considered the canonical documentation for the script. The latest version of boxinfo can be downloaded from that page as well. For any enhancement requests or problems to report, please visit the bug tracker at http://bucardo.org/bugzilla/.

What exactly does the output look like? We've got an example on the wiki showing the sample output from a run against my laptop. Some of the items were removed, but it should give you an idea of what the script can do, particularly with regards to the Postgres information: http://bucardo.org/wiki/Boxinfo/Example

The script is still a little rough, so we welcome any patches, bug reports, requests, or comments. The development version can be obtained by running:

git clone git://bucardo.org/boxinfo.git

Rails Ecommerce with Spree: Customizing with Hooks Comments

Yesterday, I went through some examples of hook and theme implementation in Spree, an open source Ruby on Rails ecommerce platform. I decided to follow up with closing thoughts and comments today.

I only spent a few hours working with the new Spree edge code (version 0.9.5), but I was relatively happy with the Spree theme and hook implementation, as it does a better job of decoupling extension views from Spree core functionality and views. However, I found several issues that are potential areas for improvement in this release or releases to come.

Theme too clunky?

One concern I have is that the entire "views" directory from SPREE_ROOT/app was moved into the theme with this theme-hook work (all of the "V" in MVC). Yesterday, I discussed how WordPress has designed a successful theme and plugin interaction, and one thing I mentioned was that a WordPress theme is lightweight, composed of several customer-facing PHP files (index, single post page, archive pages, search result page). Moving all of the Spree core views to the theme presents a couple of issues, in my opinion:

  • A developer who jumps into theme development is immediately met with more than 50 files in the theme directory to understand and work with. What you may notice from my tutorial yesterday is that I actually changed the look of Spree through an extension rather than creating a new theme: I believe there is better separation between my custom design and the Spree core when the custom styling lives in the extension, rather than in a new theme created by copying over 50+ files to edit. I'm also more comfortable manipulating the appearance with CSS than editing and maintaining those files. And the next time the Spree core and default template are updated, I don't have to worry about copying all the theme files into my custom theme and managing modifications. I think over time, Spree should aim to improve separation of theme views and core views and simplify the theme views.
  • The new default Spree theme includes the admin views. Spree developers and users are probably more interested in changing and modifying customer-facing pages than admin pages. I believe that Spree should focus on developing a strong admin interface and assume that only more advanced developers will need to override the admin views. The admin views would contain a set of predefined core hooks to add tabs and reports. Rather than having a theme with all of the Rails views, the theme should be a lightweight collection of files that are likely to be edited by users, and the Spree core should include files that are less likely to be modified (and in theory, have an awesome admin interface that would only be extended with additional reports or additional fields for object updates and edits).

Theme-Hook Decoupling?


Extension views or text are hooked through the hooks defined in the theme.

Another big concern I have is the tight coupling between Spree themes and hooks. All of the hooks are defined in the Spree theme. If someone were to switch from one theme to another, there is the potential for functionality to be lost if consistency between theme hooks isn't enforced. This issue piggybacks off of the first issue: I think the Spree core should have control of all the admin views and admin hooks. It would be great to see the views simplified or refactored and allow Spree core to control and instantiate many hooks. I think it's great to provide the flexibility to instantiate hooks in themes, but I think the core code (admin, especially) should be more opinionated and contain its own set of views with hooks that would likely be overridden less frequently.


A more ideal approach to decouple appearance and functionality would require hooks to be defined in the Spree core.

Conclusion

In the tutorial, I also didn't address extending core functionality with models and controllers in extensions. The logic discussed in the articles Rails Ecommerce Product Optioning in Spree and Rails Approach for Spree Shopping Cart Customization should work with some view modifications to use existing hooks instead of overriding core views.


A screenshot of the tutorial app in yesterday's article.

Despite the issues mentioned above, I think that the hook and theme work in the upcoming Spree 0.9.5 release is a big step in the right direction to improve the customization building blocks of Spree. It was mentioned in yesterday's article that the release hasn't been made official, but several developers have expressed an interest in working with the code. Hopefully the final kinks of theme and hook implementation will be worked out and the new release will be announced soon. Over time, the hook and theme implementation will advance and more examples and documentation will become available.

Learn more about End Point's general Rails development and Rails shopping cart development.

Rails Ecommerce with Spree: Customizing with Hooks Tutorial

In the last couple months, there's been a bit of buzz around theme and hook implementation in Spree. The Spree team hasn't officially announced the newest version 0.9.5, but the edge code is available at http://github.com/railsdog/spree and developers have been encouraged to work with the edge code to check out the new features. Additionally, there is decent documentation here about theme and hook implementation. In this article, I'll go through several examples of how I would approach site customization using hooks in the upcoming Spree 0.9.5 release.

Background

In the spree-user Google group, I've been a big proponent of the way WordPress implements themes, plugins, and hooks. The idea behind WordPress themes is that a theme includes a set of PHP files containing the display logic, HTML, and CSS for the customer-facing pages:

  • index
  • a post page
  • archive pages (monthly, category, tag archives)
  • search result page
  • etc.

In many cases, themes include sections (referred to as partial views in Rails), or components that are included in multiple template pages. An example of such a partial view is the sidebar, which is likely to be included in all of the page types mentioned above. The WordPress theme community is abundant; there are many free and paid themes available.

The concept behind WordPress plugins is much like Spree extension functionality - a plugin adds modular functionality to your site that is decoupled from the core. Judging from the popularity of the WordPress plugin community, WordPress has done a great job designing the Plugin API. In most cases, the Plugin API is used to extend or override core functionality and add to the views without having to update the theme files themselves. An example of using the WordPress Plugin API to add an action to the wp_footer hook is accomplished with the following code:

/* inside plugin */
function add_footer_text() {
    echo '<p>Extra Footer Text!!</p>';
}
add_action('wp_footer', 'add_footer_text');

WordPress themes and plugins with hooks are the building blocks of WordPress: with them, you piece together the appearance and functionality for your site. I reference WordPress as a basis of comparison for Spree, because like WordPress users, Spree users aim to piece together the appearance and functionality for their site. One thing to note is that the hook implementation in Spree is based on hook implementation in Redmine.

Spree Code

I grabbed the latest code at http://github.com/railsdog/spree. After examining the code and reviewing the SpreeGuides documentation, the first thing I learned is that there are four ways to work with hooks:

  • insert before a hook component
  • insert after a hook component
  • replace a hook component's contents
  • remove a hook component

The next thing I researched was the hook components, or elements. Below are the specific hook locations, which are more meaningful if you are familiar with the Spree views. The hooks are merely wrapped around parts of pages (or layouts) like the product page, product search, and homepage. Any of the methods listed above can act on any of the components listed below.

  • layout: inside_head, sidebar
  • homepage: homepage_sidebar_navigation, homepage_products
  • product search: search_results
  • taxon: taxon_side_navigation, taxon_products, taxon_children
  • view product: product_description, product_properties, product_taxons, product_price, product_cart_form, inside_product_cart_form
  • etc.

After I spent time figuring out the hook methods and components, I was ready to do stuff. First, I got Spree up and running (refer to the SpreeGuides for more information):


Spree startup with seed data and images.

Next, I updated the product list with a few pretend products. Let's take a quick look at the site with the updated products:


Spree with new product data for test site.

Example #1: Replace the logo and background styling.

First, I created an extension. Spree's extensions are roughly based on Radiant's extension system, and it's relatively simple to get an extension up and running with the following command (and a server restart).

script/generate extension StephsPhotos

Next, I wanted to try out the insert_after method to append a stylesheet to the default theme inside the <head> html element. I also wanted to remove the sidebar because my test site only has 8 products (lame!) and I don't need sidebar navigation. This was accomplished with the following changes:

  • First, I added the insert_after hook to add a view that contains my extra stylesheet. I also added the remove hook to remove the sidebar element:
    # RAILS_ROOT/vendor/extensions/stephs_photos/stephs_photos_hooks.rb
    insert_after :inside_head, 'shared/styles'
    remove :sidebar
    
  • Next, I added a new view in the extension to include the new stylesheet.
    # RAILS_ROOT/vendor/extensions/stephs_photos/app/views/shared/_styles.erb
    <link type="text/css" rel="stylesheet" href="/stylesheets/stephs_photos.css">
    
  • Next, I created a new stylesheet in the extension.
    /* RAILS_ROOT/vendor/extensions/stephs_photos/public/stylesheets/stephs_photos.css */
    body { background: #000; }
    body.two-col div#wrapper { background: none; }
    a, #header a { color: #FFF; text-decoration: none; }
    
    ul#nav-bar { width: 280px; line-height: 30px; margin-top: 87px; font-size: 1.0em; }
    ul#nav-bar li form { display: none; }
    
    .container { width: 750px; }
    #wrapper { padding-top: 0px; }
    
    .product-listing li { background: #FFF; height: 140px; }
    .product-listing li a.info { background: #FFF; }
    
    body#product-details div#wrapper { background: #000; }
    body#product-details div#content, body#product-details div#content h1 { color: #FFF; margin-left: 10px; }
    #taxon-crumbs { display: none; }
    #product-description { width: 190px; border: none; }
    .price.selling { color: #FFF; }
    #product-image #main-image { min-height: 170px; }
    
    /* Styling in this extension only applies to product and main page */
    
    div#footer { display: none; }
    

One more small change was required to update the logo via a Spree preference. I set the logo preference to a new logo image and uploaded the logo to RAILS_ROOT/vendor/extensions/stephs_photos/public/images/.

# RAILS_ROOT/vendor/extensions/stephs_photos/stephs_photos_extension.rb
def activate
 AppConfiguration.class_eval do
   preference :logo, :string, :default => 'stephs_photos.png'
 end
end

After restarting the server, I was happy with the new look for my site accomplished using the insert_after and remove methods:


New look for Spree accomplished with several small changes.

Note: You can also add a stylesheet with the code shown below. However, I wanted to use the hook method described above for this tutorial.

def activate
  AppConfiguration.class_eval do 
    preference :stylesheets, :string, :default => 'styles'
  end
end 
Example #2: Use insert_before to insert a view containing Spree core functionality.

The next requirement I imagined was adding promo functionality to the product listing page. I wanted to use core Spree logic to determine which promo image to show. The first promo image would offer a 10% discount to users who were logged in. The second promo image would offer a 15% discount to users who weren't logged in, for creating an account. I completed the following changes for this work:

  • First, I added the insert_before method to add the promo view before the homepage_products component, the component that lists the products on the homepage.
    # RAILS_ROOT/vendor/extensions/stephs_photos/stephs_photos_hooks.rb
    insert_before :homepage_products, 'shared/stephs_promo'
    
  • Next, I added the view using core Spree user functionality.
    # RAILS_ROOT/vendor/extensions/stephs_photos/app/views/shared/_stephs_promo.erb
    <% if current_user -%>
    <img src="/images/promo10.png" alt="10 off" />
    <% else -%>
    <img src="/images/promo15.png" alt="15 off" />
    <% end -%>
    
  • Finally, I uploaded my promo images to RAILS_ROOT/vendor/extensions/stephs_photos/public/images/

After another server restart and homepage refresh, I tested the logged in and logged out promo logic.


vs.


Spree core functionality used to display two different promo images inside a partial view.

Note: The promo coupon logic that computes the 10% or 15% off was not included in this tutorial.

Example #3: Use replace method to replace a component on all product pages.

In my third example, I imagined that I wouldn't have time to manage product descriptions when I was rich and famous. I decided to use the replace hook to replace the product description on all product pages. I completed the following steps for this change:

  • First, I added the replace method to replace the :product_description component with a rails partial view.
    # RAILS_ROOT/vendor/extensions/stephs_photos/stephs_photos_hooks.rb
    replace :product_description, 'shared/generic_product_description'
    
  • Next, I created the view with the generic product description.
    # RAILS_ROOT/vendor/extensions/stephs_photos/app/views/shared/_generic_product_description.erb
    all prints are 4x6 matte prints.<br />
    all photos ship in a folder.
    

After yet another server restart and product refresh, I tested the generic product description using the replace hook.


The replace hook was used to replace product descriptions on all product pages.

Intermission

OK, so hopefully you see the trend:

  1. Figure out which component you want to insert before, insert after, replace, or remove.
  2. Modify extension_name_hooks.rb to include your hook method (and pass the view, if necessary).
  3. Create the new view in your extension.
  4. Restart server and be happy!

I'll note a couple other examples below.

Example #4: Bummer that there's no footer component

In the next step, I intended to add copyright information to the site's footer. I was disappointed to find that there was no hook wrapped around the footer component. So, I decided not to care for now. But in the future, my client (me) may make this a higher priority and the options for making this change might include:

  • Clone the default template and modify the template footer partial view.
  • Clone the default template, create a hook to wrap around the footer component, add the changes via a hook in an extension.
  • Add a view in the extension that overrides the theme footer view.
Example #5: Add text instead of partial view.

Since I couldn't add copyright information below the footer, I decided to add it after the inside_product_cart_form component using the insert_after hook. But since it's a Friday at 5:30pm, I'm too lazy to create a view, so instead I'll just add text for now, with the following addition to the extension hooks file:

# RAILS_ROOT/vendor/extensions/stephs_photos/stephs_photos_hooks.rb
insert_after :inside_product_cart_form, :text => '<p>&copy; stephpowell. all rights reserved.</p>'

Server restart, and I'm happy, again:


Text, rather than a partial view, was appended via a hook.

Hopefully my examples were exciting enough for you. There's quite a lot you can do with the hook methods, and over time more documentation and examples will become available through the Spree site, but I wanted to present a few very simple examples of my approach to customization in Spree. I've uploaded the extension to http://github.com/stephskardal/stephs_photos for this article.

Tomorrow, I'm set to publish closing thoughts and comments on the hook implementation since this article is now too long for a blog post. Stay tuned.

Learn more about End Point's general Rails development and Rails shopping cart development.

Postgres Upgrades - Ten Problems and Solutions

Upgrading between major versions of Postgres is a fairly straightforward affair, but Murphy's law often gets in the way. Here at End Point we perform a lot of upgrades, and the following list explains some of the problems that come up, either during the upgrade itself, or afterwards.

When we say upgrade, we mean going from an older major version to a newer major version. We've (recently) migrated client systems as old as 7.2 to as new as 8.4. The canonical way to perform such an upgrade is to simply do:

pg_dumpall -h oldsystem > dumpfile
psql -h newsystem -f dumpfile

The reality can be a little more complicated. Here are the top ten gotchas we've come across, and their solutions. The more common and severe problems are at the top.

1. Removal of implicit casting

Postgres 8.3 removed many of the "implicit casts", meaning that many queries that used to work on previous versions now gave an error. This was a pretty severe regression, and while it is technically correct to not have them, the sudden removal of these casts has caused *lots* of problems. Basically, if you are going from any version of PostgreSQL 8.2 or lower to any version 8.3 or higher, expect to run into this problem.

Solution: The best way of course is to "fix your app", which means specifically casting items to the proper datatype, for example writing "123::int" instead of "123". However, it's not always easy to do this - not only can finding and changing all instances across your code base be a huge undertaking, but the problem also exists for some database drivers and other parts of your system that may be out of your direct control. Therefore, the other option is to add the casts back in. Peter Eisentraut posted a list of casts that restore some of the pre-8.3 behavior. Do not just apply them all, but add in the ones that you need. We've found that the first one (integer AS text) solves 99% of our clients' casting issues.
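
As a hedged illustration only (adapted from memory of that list, with a hypothetical database name; check Peter's original post before applying anything), restoring the integer-to-text cast looks roughly like this:

  psql -d yourdb <<'SQL'
  CREATE FUNCTION pg_catalog.text(integer) RETURNS text STRICT IMMUTABLE
      LANGUAGE SQL AS 'SELECT textin(int4out($1));';
  CREATE CAST (integer AS text) WITH FUNCTION pg_catalog.text(integer) AS IMPLICIT;
  SQL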

2. Encoding issues (bad data)

Older databases frequently were not careful about their encoding, and ended up using the default "no encoding" mode of SQL_ASCII. Often this was because nobody was thinking about, or worrying about, encoding issues when the database was first being designed. Flash forward years later, and people want to move to something better than SQL_ASCII, such as the now-standard UTF-8. The problem is that SQL_ASCII accepts everything without complaint, and this can cause your migration to fail, as the data will not load into the new database with a different encoding. (Also note that even UTF-8 to UTF-8 may cause problems, as it was not until Postgres version 8.1 that UTF-8 input was strictly validated.)

Solution: The best remedy is to clean the data on the "old" database and try the migration again. How to do this depends on the nature of the bad data. If it's just a few known rows, manual updates can be done. Otherwise, we usually write a Perl script to search for invalid characters and replace them. Alternatively, you can pipe the data through iconv in the middle of the upgrade. If all else fails, you can always fall back to SQL_ASCII on the new database, but that should really be a last resort.
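
A sketch of the iconv approach, assuming the old data is really Latin-1 and the new database is UTF-8 (the -c flag silently drops anything it cannot convert, so use it knowingly):

  pg_dump -h oldsystem olddb | iconv -f LATIN1 -t UTF-8 -c | psql -h newsystem -d newdb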

3. Time

Since the database is almost always an integral part of the business, minimizing the time it is unavailable for use is very important. People tend to underestimate how much time an upgrade can take. (Here we are talking about the actual migration, not the testing, which is a very important step that should not be neglected.) Creating the new database and schema objects is very fast, of course, but the data must be copied row by row, and then all the constraints and indexes created. For large databases with many indexes, the index creation step can take longer than the data import!

Solution: The first step is to do a practice run on hardware as similar to production as possible, to get an idea of how long it will take. If this time period does not comfortably fit within your downtime window (and by comfortably, I mean add 50% to account for Murphy), then another solution is needed. The easiest way is to use a replication system like Bucardo to "pre-populate" the static part of the database, so that the final migration only involves a small percentage of your database. It should also be noted that recent versions of Postgres can speed things up by using the "-j" flag to the pg_restore utility, which allows some of the restore to be done in parallel.
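
A sketch of that last approach: dump in the custom format, then restore with several parallel jobs (the -j flag requires pg_restore 8.4 or newer; the job count and names are made up):

  pg_dump -Fc -h oldsystem mydb > mydb.dump
  pg_restore -j 4 -h newsystem -d mydb mydb.dump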

4. Dependencies

When you upgrade Postgres, you're upgrading the libraries as well, which many other programs (e.g. database drivers) depend on. Therefore, it's important to make sure everything else relying on those libraries still works. If you are installing Postgres with a packaging system, this is usually not a problem as the dependencies are taken care of for you.

Solution: Make sure your test box has all the applications, drivers, cron scripts, etc. that your production box has, and make sure that each of them either works with the new version or has a sane upgrade plan. Note: Postgres may have some hidden indirect dependencies as well. For example, if you are using PL/PerlU, make sure that any external modules used by your functions are installed on the box.

5. Postgres contrib modules

Going from one version of Postgres to another can introduce some serious challenges when it comes to contrib modules. Unfortunately, they are not treated with the same level of care as the Postgres core is. To be fair, most of them will continue to just work, simply by doing a "make install" on the new database before attempting to import. Some modules, however, have functions that no longer exist. Some are not 100% forward compatible, and some even lack important pieces such as uninstall scripts.

Solution: Solving this depends quite a bit on the exact nature of the problem. We've done everything from carefully modifying the --schema-only output, to modifying the underlying C code and recompiling the modules, to removing them entirely and getting the functionality in other ways.

6. Invalid constraints (bad data)

Sometimes when upgrading, we find that the existing constraints are not letting the existing data back in! This can happen for a number of reasons, but basically it means that you have invalid data. It can be mundane (a check constraint is missing a potential value) or more serious (multiple rows with the same primary key value).

Solution: The best bet is to fix the underlying problem on the old database. Sometimes this means a few rows, but sometimes (as in the case of duplicate primary keys) it indicates an underlying hardware problem (e.g. bad RAM). In the latter case, the damage can be very widespread, and your simple upgrade plan has now turned into a major damage control exercise (but aren't you glad you found such a problem now rather than later?). Detecting and preventing such problems is a topic for another day. :)

7. tsearch2

This is a special case for the contrib module situation mentioned above. The tsearch2 module first appeared in version 7.4, and was moved into core of Postgres in version 8.3. While there was a good attempt at providing an upgrade path, upgrades can still cause an occasional issue.

Solution: Sometimes the only real solution is to edit the pg_dump output by hand. If you are not using tsearch in that many places (e.g. just a few indexes or columns on a couple of tables), you can also simply remove it before the upgrade, then add it back in afterwards.

8. Application behavior

In addition to the implicit casting issues above, applications sometimes have bad behaviors that were tolerated in older versions of Postgres, but are not anymore. A typical example is writing queries without explicitly naming all of the tables in the FROM clause.

Solution: As always, fixing the app is the best approach. However, for some things you can also flip a compatibility switch inside postgresql.conf; in the example above, one would change "add_missing_from" from its default of 'off' to 'on'. This should be considered an option of last resort.
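
The compatibility switch itself is a one-line change on the new server (note that this setting is deprecated and will eventually disappear):

  # postgresql.conf
  add_missing_from = on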

9. System catalogs

Seldom does a major update go by without a change to the system catalogs, the low-level metadata tables used by Postgres to describe everything in the database. Sometimes programs rely on the catalogs looking a certain way.

Solution: Most programs, if they use the system catalogs directly, are careful about it, and upgrading the program version often solves the problem. At other times, we've had to rewrite the program right then and there, either by having it abstract out the information (for example, by using the information_schema views), or (less preferred) by adding conditionals to the code to handle multiple versions of the system catalogs.

10. Embedded data

This is a rare but annoying problem: triggers on a table rely on certain data being in other tables, such that doing a --schema-only dump before a --data-only dump will always fail when importing.

Solution: The easiest way is to simply use pg_dumpall, whose output loads the schema, then the data, then the constraints and indexes. However, this may not be possible if you have to separate things for other reasons (such as contrib module issues). In that case, you can break the --schema-only pg_dump output into pre and post segments. We have a script that does this for us, and it is also slated to become an option for pg_dump in the future.

That's the list! If you've seen other things, please make a note in the comments. Don't forget to run a database-wide ANALYZE after importing into your new database, as the table statistics are not carried across when using pg_dump.
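
That final step can be as simple as:

  psql -d newdb -c 'ANALYZE VERBOSE;'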

Postgres SQL Backup Gzip Shrinkage, aka DON'T PANIC!!!

I was investigating a recent Postgres server issue where we had discovered that one of the RAM modules on the server in question had gone bad. Unsurprisingly, one of the things we looked at was the possibility of having to do a restore from a SQL dump: if there had been any corruption in the data directory, a base backup would potentially have been subject to the same errors that we were trying to avoid by restoring.

As it was already the middle of the night (does anyone ever have a server emergency during normal business hours?), my investigations were hampered by my lack of sleep.

If there had been data directory corruption, the pg_dump process would likely have failed partway through the backup, and we'd expect the dumps to be truncated. Ideally this wasn't the case; memory testing had not shown the DIMM to be bad, but the sensor had alerted us all the same.

I logged into the backup server and looked at the backup dumps; from the alerts that we'd gotten, the memory was flagged bad on January 3. I listed the files, and noticed the following oddity:

 -rw-r--r-- 1 postgres postgres  2379274138 Jan  1 04:33 backup-Jan-01.sql.gz
 -rw-r--r-- 1 postgres postgres  1957858685 Jan  2 09:33 backup-Jan-02.sql.gz

Well, this was disconcerting. The memory event had taken place on the 3rd, but there was a large drop in size of the dumps between January 1st and January 2nd (more than 400MB of *compressed* output, for those of you playing along at home). This indicated that either the memory event took place earlier than recorded, or something somewhat catastrophic had happened to the database; perhaps some large deletion or truncation of some key tables.

Racking my brains, I tried to come up with an explanation: we'd had a recent maintenance window between January 1 and January 2, during which we'd scheduled a CLUSTER/REINDEX to reclaim some of the bloat in the database itself. But this would only reduce the size of the data directory; the amount of live data would have stayed the same or increased modestly.

Obviously we needed to compare the two files in order to determine what had changed between the two days. I tried:

 diff <(zcat backup-Jan-01.sql.gz | head -2300) <(zcat backup-Jan-02.sql.gz | head -2300)

Based on my earlier testing, this was the offset in the SQL dumps which defined the actual schema for the database excluding the data; in particular I was interested to see if there had been (say) any temporarily created tables which had been dropped during the maintenance window. However, this showed only minor changes (updates to default sequence values). It was time to do a full diff of the data to try and see if some of the aforementioned temporary tables had been truncated or if some catastrophic deletion had occurred or...you get the idea. I tried:

 diff <(zcat backup-Jan-01.sql.gz) <(zcat backup-Jan-02.sql.gz)

However, this approach fell down when diff ran out of memory. We decided to unzip the files and manually diff them, in case the problem had something to do with the parallel unzips, and here was a mystery: after unzipping the dumps in question, we saw the following:

 -rw-r--r-- 1 root root 10200609877 Jan  8 02:19 backup-Jan-01.sql
 -rw-r--r-- 1 root root 10202928838 Jan  8 02:24 backup-Jan-02.sql

The uncompressed versions of these files showed sizes consistent with slow growth; the Jan 02 backup was slightly larger than the Jan 01 backup. This was really weird! Was there some threshold in gzip where, given a particular file size, it switched to a different compression algorithm? Had someone tweaked the backup script to gzip with a different compression level? Had I just gone delusional from lack of sleep? Since gzip can operate on streams, the first option seemed unlikely, and is something I would have heard about before. I verified that the arguments to gzip in the backup job had not changed, so that took the second choice off the table. Which left the last option, but I had the terminal scrollback history to back me up.

We finished the rest of our work that night, but the gzip oddity stuck with me through the next day. I was relating the oddity of it all to a co-worker when insight struck: since we'd CLUSTERed the tables, similar data (in the form of the tables' multi-part primary keys) had been reorganized onto the same database pages. When pg_dump read and wrote out the data in page order, gzip had that much more similarity in the same neighborhood to work with, which resulted in the dramatic decrease in the size of the compressed dumps.

So the good news was that CLUSTER will save you space in your SQL dumps as well (if you're compressing them); the bad news was that it took an emergency situation and a near heart attack for this engineer to figure it all out. Hope I've saved you the trouble... :-)

DevCamps on different systems, including Plesk, CPanel and ISPConfig

In the last few months I've been busy setting up DevCamps for several of our newer clients. DevCamps is an open source development environment system that, once set up, makes it easy to spin up and tear down a development environment for a specific site/code base.

I've done many camps setups, and you tend to run into surprises from system to system, but what was most interesting and challenging about these latest installs was that they had to be done on systems running Plesk, CPanel, and ISPConfig. Some of the differences between a normal deployment and one on these platforms:

  • On the Plesk system there was a hardened Linux environment called 'Atomic Secured Linux', which includes grsecurity. One of its restrictions, Trusted Path Execution (TPE), required the camp bin scripts to be owned by root and the bin directory to not be writable by other groups; otherwise the scripts would fail to run (see the sketch after this list).
  • Permissions are a mixed bag: we typically set all of the files to be owned by the site owner, but in Plesk the files need to belong to special groups such as psacln.
  • On the CPanel system we needed to move Interchange's admin images to a different directory, since CPanel includes Interchange and aliases /interchange/ and /interchange-5/ to a central location that we would not be using.
  • On ISPConfig and Plesk the home directories of the sites are in different places, which required deploying the code in such places as /var/www/clients/client/user/domain.com or /var/www/vhosts/domain.com.
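For the TPE restriction in particular, the fix boiled down to tightening ownership and permissions on the camp bin directory, along these lines (the path is illustrative, not the actual layout):

 chown -R root:root /home/camp/bin    # TPE requires the scripts to be root-owned
 chmod go-w /home/camp/bin            # and the directory must not be group- or world-writable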

In the end we were able to get DevCamps to run properly on these various platforms both in development and production. If you are starting a new project or working on an existing project and could use a strong development environment, consider DevCamps.

SSHFS and ServerAliveInterval

If you're using SSHFS (as I have been recently, since OpenVPN started crashing frequently on my OpenBSD firewall), note that the ServerAliveInterval option for SSH can have a significant impact on the stability of your mounts.

I set it to 10 seconds on my system and have been happy with the results so far. It could probably safely go considerably higher than that.
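For example (the host, user, and paths here are placeholders):

 # pass the SSH keepalive option straight through sshfs
 sshfs -o ServerAliveInterval=10 user@remotehost:/var/backups /mnt/backups

 # or set it once in ~/.ssh/config so every connection to that host gets it:
 #   Host remotehost
 #       ServerAliveInterval 10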

It's not on by default, which leaves the stability of your SSH tunnels up to the success of TCP keepalive (which is on by default). On my wireless network, that alone has not been sufficient.

End Point Blogging Stats Year In Review

Many of the blogs I follow recently published a list of their most popular articles from 2009, so I thought we should too. End Point started this blog in July of 2008 and has since backported older technical articles and company news to it, which is why you may see older articles in the archives. In 2009, we published just over 150 articles.

Here's a breakdown of article categories (the categorization was a bit difficult: "Tips, Community, Etc." was a fallback category for any articles that didn't have more than a few similar articles):

In 2009, the top 10 articles with the highest number of unique visitors were:

The top 10 "Top Pages", according to a tool provided by SEOmoz, which is calculated using SEOmoz popularity metrics based on external links and quality of external linking pages to the url, are:

It would probably be more interesting to normalize the data above by the number of days each article has been published, to get an idea of which articles had the most impact. This is likely similar to the normalization that search engines employ in their popularity algorithms.

The lamest articles, ordered by the number of unique visitors ascending, were:

Phew - none of my articles were on that list :)

I should probably note that those aren't really lame articles. They likely didn't get a ton of traffic because they were published around the time we started the blog, when it received little traffic, and because they probably weren't promoted through social media the way some of our recent articles have been.

But what does this all mean?

This past year, I've approached blogging as a sales or marketing tool to get traffic from potential clients or to inform existing clients of the variety of services we provide. So, I was interested in comparing traffic between our blog and main site. The graph below shows the total page views for the main site and blog:

And I was interested in looking at the amount of referral traffic (page views again) our main site receives from the blog:

It's pretty cool that in the last few months, our blog traffic has been 2 to 3 times that of our main web site. It's also interesting that 1/4 to 1/3 of the referral traffic to our main site came from our blog over the same period. Referral traffic makes up roughly one third of our main site traffic, so about 10% of our total traffic comes from blog referrals - which is fairly effective.

There are many other metrics I'm interested in, specifically segmenting our traffic into categories similar to the blog categories in the pie chart above to examine which topics bring us the most traffic. I hope we continue to blog about topics that bring us unique (and converting) traffic. But I'll save discussion of those metrics for another blog article, one that hopefully won't make the "2010 Lamest Blog Articles" list :)

Common Topics in Scalability

It rarely makes sense for a startup business to tackle scalability questions from the outset, because doing so raises the cost and complexity of development and operational support while solving problems that a startup doesn't typically need solved yet (e.g. handling tons of users). From a cost-effectiveness perspective, it makes sense to solve these problems when the need actually becomes evident.

That said, systems can be designed with scalability in mind such that the system easily lends itself to incremental changes that scale up its capacity.

The basic principles and techniques that typically come up in scalability discussions:

  • horizontal scalability: a particular component is "horizontally scalable" if you can deploy redundant instances of that component in parallel; this is the ultimate scalability win, as it means you can readily add raw processing power for that component at low cost.
  • vertical scalability: the practice of increasing a single component's power/capacity to improve performance/throughput is referred to as "vertically scaling" that component. From the layperson's perspective, this is the most easily-understood technique, as it effectively amounts to buying a faster server, adding RAM to an existing server, etc.
  • caching: caching can be about raw speed, but it is more important from a scalability perspective; caches reduce the overall work a system needs to do for any given request, and can simplify a request to the point of entirely eliminating the overhead of involving multiple components (for instance, a cache in the application layer can potentially eliminate any calls to the database for a given request)

Horizontal scalability brings the biggest win in terms of potential system capacity, as it allows you to handle increased demand (whether that demand is visitors, orders, etc.) simply by adding more servers to your infrastructure. Virtualization and its logical result, cloud hosting, make this kind of infrastructure expansion simpler and often more cost-effective than ever before.

Some examples of horizontal scalability:

  • Running your appserver (Rails, Django, etc.) on multiple servers, with a load balancer distributing traffic more-or-less evenly across those servers
    • The entire application tier is scaled horizontally and can expand/contract as needed.
    • Session management becomes an issue; sessions either get moved to a shared component like the database, or server affinity techniques are used with the load balancer such that a single client/user always hits the same application server, so sessions can be file-based on each app server.
    • The database likely becomes the bottleneck for system capacity and performance.
  • Master/slave database replication, with multiple slave databases fronted by a load balancer distributing database reads across the slaves.
    • Database reads, which in typical webapps account for the bulk of database queries, can be spread across multiple servers and are thus scaled horizontally.
    • The total number of slave databases is likely limited by the capacity of the master; each additional slave adds some overhead to the master, so diminishing returns eventually kick in.

The ease with which a given component can be scaled horizontally largely comes down to how it manages state, or, in other words: how it manages the data that it works with.

Application servers are generally designed to be "stateless", which effectively means that the response for a given request is not dependent on a previous request (though the idea of the "session" is obviously a big exception here). Due to this stateless nature, it's usually cheap and easy to run your application server in a horizontally-scalable configuration.

By contrast, relational databases are all about state: the entire point of the database is to act as the arbiter of state, maintain that data on disk, and answer questions about that data. We typically expect the database to tell the truth and give a consistent answer about each piece of data. The consistency expectation leads to the need for each piece of data to have one and only one canonical location, which means it cannot be scaled across multiple servers. You can scale copies of data across multiple servers (as done in master/slave replication), but the True Value for a bit of state has to live in one place. Therefore, master databases are generally cut off from the glories of horizontal scalability. (Note: "sharding" offers a way to scale writes, but it doesn't make your database literally horizontally scalable; any given datum still has one canonical location which is limited to vertical scalability per usual).

Enter caching. When you cannot horizontally scale a given component, you can instead store and reuse copies of the results of that component's operations, to reduce the overall work done by that component. Most modern application server frameworks provide a variety of helpful caching tools right out of the box, and a good cache strategy can squeeze a great deal of throughput out of a simple architecture. However, caching is bigger than the app tier, as the examples show:

  • HTTP caching: HTTP clients (i.e. web browsers) can cache the resources they request from your system, if you give them the proper instructions on how to do so (via various HTTP headers). At a minimum, browsers ought to be able to cache the images, CSS, and JavaScript files that make up your site's large scale visual design, which means they don't need to request those files repeatedly.
  • HTTP caching redux: HTTP caching reverse proxies (Varnish, Squid, etc.) can sit between your web/app tier and your users' browsers, and can cache resources from your site based on HTTP headers and other configurable aspects; when users request a resource, they hit the HTTP reverse proxy first, and if the resource is available in the cache, the cached version is used. This means the user gets a faster response and your actual application does less work.
  • Page caching: By caching full copies of your dynamically-generated resources (web pages), your system can see enormous scalability gains. This can fall logically under HTTP caching, under application-tier caching, or somewhere in between; the important point is to consider caching an entire dynamic page in some manner, as it brings both big performance wins and potential design complexities/constraints.
  • Application tier caching: using simple file-based caches, or scalable, shared, distributed cache layers like memcached or redis, your application can cache the results of expensive queries, frequently-used operations/widgets, etc. and thus reuse the fruits of its own labors; this can reduce the computational cost for handling any given request and thus improve both raw request speed and overall throughput (scalability).
  • Database replication: though not typically referred to as "caching", the classic master/slave database replication strategy is effectively a very fancy cache. By pushing out copies of data in near real time, replication lets your master database server do less work while still giving your application servers highly accurate results.
  • Controlled denormalization: within your database, you can use triggers and the like to manage denormalized data, allowing frequently-used or expensive calculations to be cached as part of the database schema. This lets the application rely on such calculations at lower cost. Materialized views fit within this category (a bare-bones sketch follows this list).
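As an example of that last bullet, here is a bare-bones sketch of trigger-maintained denormalization: keeping a per-order total up to date whenever its line items change. The database, tables, and columns are hypothetical, and real code would also need an AFTER DELETE trigger (using OLD.order_id) plus more careful attention to concurrency:

 psql -d appdb -c '
 CREATE FUNCTION refresh_order_total() RETURNS trigger LANGUAGE plpgsql AS $$
 BEGIN
     UPDATE orders
        SET total = (SELECT COALESCE(SUM(quantity * price), 0)
                       FROM order_lines
                      WHERE order_id = NEW.order_id)
      WHERE id = NEW.order_id;
     RETURN NULL;   -- the return value is ignored for AFTER triggers
 END;
 $$;

 CREATE TRIGGER order_lines_refresh_total
     AFTER INSERT OR UPDATE ON order_lines
     FOR EACH ROW EXECUTE PROCEDURE refresh_order_total();
 '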

These caching examples/categories vary in complexity of implementation, but they share a common principle: increase system capacity by reusing the system's work. The use of caching naturally involves trade-offs, but in the simple case, a straightforward expiry-based cache can have a dramatic impact on performance. A sluggish webapp can get a second wind by wrapping the most common database-driven components within a timed cache. For new development, for which your caching strategy can factor in at design time, caching can yield great performance/scalability with extremely high accuracy. In particular, the refresh-on-write strategy (in which the code paths in your app responsible for changing state are also responsible for updating related caches) can be a huge scalability win (this was exactly the strategy we used -- with great results -- for Backcountry.com's SteepandCheap site in fall of 2007 and the initial launch of their product Q&A features in early 2008).

Ideally, a good caching strategy does not merely reuse the results of earlier work, but in fact cuts out calls to other services; good caching in the application tier may mean that most requests do not need to involve the database at all; good HTTP caching techniques cut down the number of requests that involve your application server. And so on.
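A quick way to check how cache-friendly a given resource already is: fetch just its response headers and look for the relevant ones (the URL is a placeholder):

 curl -sI http://www.example.com/images/logo.png | grep -iE 'cache-control|expires|etag|last-modified'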

Beyond these topics, scalability benefits can frequently come from:

  • query optimization: poor database usage can result in slow page load times and excess I/O, and can overload the database unnecessarily; query optimization can have a significant impact there. This is rather like a corollary to vertically scaling your database: the database scales better relative to the hardware because its demands on that hardware are reduced.
  • data management optimization: big scalability gains can often come from revising the structure of the data involved at the scalability pinch points; for instance, keeping a single inventory count for a given SKU is a scalability bottleneck compared to keeping one inventory item record per unit of inventory for that SKU. The former results in lock contention in high-order-volume situations, while the latter minimizes such locking and prevents one user's order from blocking another's.
  • application database usage: related to query optimization is the question of how the application structures its queries. In an age where people increasingly interact with their datasources through an object-relational mapper (ORM), one commonly finds cases in which the application issues per-record queries within a loop, meaning that a set of N records yields N+1 queries (1 query for the initial set, 1 query per iteration). This kind of database usage brings unnecessary I/O overhead and leads to sluggish app performance; rewrite the queries or tweak the ORM as necessary so that the first query loads all the data, or so that a second query fetches the extra data for all members of the first query's result set (ORMs will often do the second strategy for you if you tell them what you want). See the sketch after this list.
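To make that last bullet concrete, here is the shape of the problem and of the fix in plain SQL; the table and column names are made up, and the same idea applies whatever ORM sits in front of the database:

 # the N+1 pattern a naive ORM loop produces: one round trip per related row
 #   SELECT * FROM orders WHERE customer_id = 7;
 #   SELECT * FROM products WHERE id = 101;
 #   SELECT * FROM products WHERE id = 102;
 #   ...and so on, once per order line
 # fetching the same data in a single query instead:
 psql -d appdb -c '
 SELECT o.id, ol.quantity, p.name, p.price
   FROM orders o
   JOIN order_lines ol ON ol.order_id = o.id
   JOIN products p ON p.id = ol.product_id
  WHERE o.customer_id = 7;
 '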

The unfortunately-named "NoSQL" movement has a lot of exciting possibilities for systems that need to handle large volumes of data, or scale out to very high write volumes (I find both Cassandra and HBase particularly interesting). However, unless you know from the outset that your dataset size or write volume will push the limits of the traditional relational database, or that you need distributed operations for availability purposes, the NoSQL offerings are quite possibly counterproductive. These solutions typically offer some particular advantage in exchange for a trade-off of some kind; for small businesses with a rapidly-evolving sense of self and of the problem space, the traditional RDBMS brings a huge amount of flexibility in how one works with data, as well as a well-understood operational paradigm (for high availability, reliability, backups, etc.). The "Big Data" benefits of NoSQL probably don't apply to such systems, and the ease with which RDBMSes handle arbitrarily-complex queries allows the growing business/system to develop iteratively and arrive at a better understanding of the true underlying needs. The trade-offs themselves are beyond the scope of these musings, so I won't give them further consideration here; however, the NoSQL ecosystem is good to know about and can certainly factor into scalability discussions depending on your use cases. Several of these tools may in fact fit well within a caching strategy, rather than as an authoritative datasource.

Here are some general rules of thumb I would recommend if you want to be ready to scale up when the need arises (as is the whole point of these ramblings):

  • Write well-organized code: Maintain separation of concerns in your application code, maximize code reuse, and keep your classes, modules, functions, etc. limited in their scope. This keeps each individual code piece simple, maximizes the ease with which things like caching can be introduced, and minimizes the code paths that have to be changed in order to optimize the performance of any given component.
  • Keep your canonical data normalized: This could probably be stated instead as "model your data relationally", but that assumes an RDBMS, which isn't necessarily your only datastore. In any case, normalized data is easier to manage over the life of a system, and just as better-organized code is more-easily optimized for performance/scalability than poorly-organized code, well-organized data is more-easily rearranged or cached, etc. Introduce denormalization in a targeted, needs-based manner, but don't let that denormalization ever be considered part of your canonical dataset; denormalized data is broken data that you should be able to throw out and rebuild at any time.
  • Avoid session usage: Sessions are handy, but they in fact violate the statelessness of the application server, introduce one point of complexity (though a manageable point) for horizontal scaling, potentially introduce database scaling problems (if you're using database-backed sessions), etc. If the state matters enough to keep it from one request to the next, consider modeling it and storing it properly in the database. For state that just can't go in the database and you just can't live without...
  • Use the client: The majority of clients for the majority of webapps are web browsers. Those web browsers typically have storage (cookies for now, and more options coming) and frequently have programming capacity (JavaScript). Use it! In particular, cookies can take the role of the user session, and potentially eliminate (or at least reduce) the need for server-side session state. Work done by the client is work not done by your servers.
  • Think RESTfully: While you don't necessarily need to literally design a RESTful application, be mindful of the RESTful design principles and keep your URLs well-designed. Critically, don't allow the same URL to front wildly-varying resources; such resources are the bane of page/HTTP caching.
  • Be mindful of user-specific resources: Any resource, document, etc. that you serve with user (or session) specific information therein is a resource that is potentially more difficult to cache effectively. Consider crafting your URLs such that the URL is per-user, if that can fit your problem-space effectively (this allows HTTP or page-level caching to still be an option). If common resources (like a "home page") need to show per-user details, consider using the client (cookies, JavaScript) to encapsulate the user-specific stuff such that the common resource itself is identical for all users (and thus also remains open to HTTP or page-level caching).

Note: the original posting said "Keep Canonical Data Denormalized" when I meant "Normalized"; I've corrected this in the above text. -- Ethan

State of the Postgres project

It's been interesting watching the MySQL drama unfold, but I have to take issue when people start trying to drag Postgres into it again by spreading FUD (Fear, Uncertainty, and Doubt). Rather than simply rebut the FUD, I thought this was a good opportunity to examine the strength of the Postgres project.

Monty recently espoused the following in a blog comment:

"...This case is about ensuring that Oracle doesn't gain money and market share by killing an Open Source competitor. Today MySQL, tomorrow PostgreSQL. Yes, PostgreSQL can also be killed; To prove the case, think what would happen if someone managed to ensure that the top 20 core PostgreSQL developers could not develop PostgreSQL anymore or if each of these developers would fork their own PostgreSQL project."

Later on in his blog he raises the same theme again, with a bit more detail:

"Note that not even PostgreSQL is safe from this threat! For example, Oracle could buy some companies developing PostgreSQL and target the core developers. Without the core developers working actively on PostgreSQL, the PostgreSQL project will be weakened tremendously and it could even die as a result."

Is this a valid concern? It's easy enough to dismiss, considering the Sturm und Drang of Monty's recent posts, but I think it is worth looking into seriously. Specifically, is the Postgres project capable of withstanding a direct threat from a large company with deep pockets (e.g. Oracle)?

To get to the answer, let's run some numbers first. Monty mentions the "top 20" Postgres developers. If we look at the community contributors page, we see that there are in fact 25 major developers listed, as well as 7 core members, so 20 would indeed be a significant chunk of that page. To dig deeper, I pulled the Postgres project's CVS logs for 2009 and ran some scripts against them. The 9185 commits were spread across 16 different people, and about 16 other people were mentioned in the commit notes as having contributed in some way (e.g. a patch from a non-committer). So again, Monty's number of 20 looks like a pretty good approximation.
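If you want to run a similar tally yourself against a CVS checkout, something along these lines will do (this is only a sketch, not the exact scripts used for the numbers above, and it counts per-file revisions, so treat the absolute numbers with a grain of salt):

 cvs log -d "2009-01-01<2010-01-01" 2>/dev/null \
     | grep '^date:' | sed 's/.*author: \([^;]*\);.*/\1/' \
     | sort | uniq -c | sort -rn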

However (and you knew there was a however), the catch lies in actually being able to stop 20 of those people from working on Postgres. There are basically two ways to do this: Oracle could buy out a company, or it could hire (buy out) a person. The first problem is that the Postgres community is very widely distributed. If you look at the people on the community contributors page, you'll see that those 32 people work for 24 different companies. Further, no one company holds sway: the median is one developer per company, and the high-water mark is a mere three developers at a single company. All of this is much better than it was years ago, both in the total number of developers and in their distribution.

The next fly in the ointment is that buying out a company is not always easy to do, despite the size of your pockets. Many companies on that list are privately held and will not sell. Even if you did buy out the company, there is no way to prevent the people working there from then moving to a different company. Finally, buying out some companies just isn't possible, even if you are Oracle, because there are some big names on the list of people employing major Postgres developers: Google, Red Hat, Skype, and SRA. Then of course there is NTT, which is a really, really big company (larger than Oracle). NTT's Postgres developers are not always as visible as some of the English-speaking ones, but NTT employs a lot of people to work on Postgres (which is extremely popular in Japan).

The second way is hiring people directly. However, people cannot always be bought off. Sure, some of the developers might choose to leave if Oracle offered them $20 million, but not all of them (Larry, I might go for $19 million, call me :). Even if they did leave, the depth of the Postgres community should not be underestimated. For every "major developer" on that page, there are many others who read the lists and know the code well, but just haven't, for one reason or another, made it onto that list. At a rough guess, I'd say there are a couple hundred people in the world who could make commits to the Postgres source code. Would all of them be as fast or effective as some of the existing people? Perhaps not, but the point is that it would be nigh impossible to thin the pool fast enough to make a dent.

The project's email lists are as strong as ever, to the point that I find it hard to keep up with the traffic, a problem I did not have a few years ago. The number of conferences, and the number of people attending each one, are growing rapidly, and there is great demand for people with Postgres skills. The number of projects using Postgres, or offering it as an alternative database backend, is constantly growing. It's no longer difficult to find a hosting provider that offers Postgres in addition to MySQL. Most important of all, the project continues to regularly release stable new versions. Version 8.5 will probably be released in 2010.

In conclusion, the Postgres project is in great shape, thanks to the depth and breadth of the community (and of the developer subset within it). There is no danger of Postgres going the MySQL route: the PG developers are spread across a number of businesses, the code (and documentation!) is BSD-licensed, and no one firm holds sway over the project.