Welcome to End Point's blog

Ongoing observations by End Point people.

Search Engine Optimization Thoughts

Search engine optimization and search engine marketing cover a wide range of opportunities to improve a website's search engine traffic. When performing a search engine site review, here are several questions that I would investigate. Note that the questions are biased towards technical search engine optimization efforts. Some of the questions provide links to help define common search engine optimization terms. Although this is not typical End Point blog fashion, the answers to these questions can potentially lead to search engine optimization improvements.

Technical Topics

  • Do the pages indexed by major search engines accurately represent the site's content?
  • Are there duplicate index pages indexed?
  • Are there old index pages or domains that aren't redirected?
  • Are there pages missing from major search engine indexes?
  • Are there too many pages in major search engine indexes?
  • Are 301 redirects used permanently on the site?
  • Can rel="canonical" or the use of 301s be applied as a temporary solution to fix duplicate content issues?
  • Is there low hanging fruit to fix duplicate content issues?
  • Is there low hanging fruit to fix duplicate content generated by external links?
  • Are there on site duplicate content issues that can be remedied with '#' (relative link)?
  • Can external affiliate link structures be modified to use '#' (relative link) instead of query strings?
  • Is there a trend that pages deep in a site's structure don't get indexed?
  • Is there a trend that a certain group of pages don't get indexed?
  • Does any duplicate content originate from links in on site javascript?
  • Are there any pages indexed that don't need to be indexed?
  • Is there a robots.txt file?
  • Can the use of rel="nofollow" on site help direct link juice?
  • Do url structures follow standard guidelines?
  • Is there pagination on site that produces an ever-increasing crawl depth?
  • Can navigation be modified to reduce on site pagination?
  • Is there a sitemap?
  • Would a sitemap provide an alternate way to index pages on a site?
  • Is there a mechanism that causes creation of new urls and deletion of old urls (example: items going out of stock)?
  • Is there a mechanism that causes urls to fluctuate in and out of navigation (example: items going in and out of stock)?
  • Does the website owner have a way to monitor bot crawl data and trends?
  • Are there Google Webmaster Tools errors that need to be addressed (examples: duplicate page titles, meta data)?
  • Are there any problems with server response (examples: frequent server down time)?
  • Are there frequent 404 responses?
  • Can relevant, helpful content be added to the 404 page?
  • Are 302 (temporary redirects) ever served when 301s should be served?
  • Do any pages have excessive content that isn't indexed by major search engines?
  • Can pages with excessive content be divided into separate pages that target different keywords?
  • How does the YSlow score compare to competitor sites?
  • Are there low hanging performance improvements that can be made?
  • Is there excessive inline javascript or css?

Keyword Topics

  • Does the website owner have a set of defined target keywords?
  • Is the site architecture built around the target keywords?
  • Are on page elements appropriately targeting keywords?
  • Do images have alt text that targets relevant keywords?
  • Are page titles targeting keywords appropriately (example: Keyword - Category - Website Name)?
  • Is there a mechanism that allows for page titles, headers, and content to be refreshed frequently?
  • Does the site target short tail and long tail content?
  • Does the website owner have a separate site that generates relevant content and produces inbound links?
  • Does the website owner have a tool to measure search engine traffic over time?
  • Does the website owner have a tool to measure incoming links?
  • Does the website owner have a tool to measure incoming link anchor text trends?

User Generated Content Topics

  • Does the site have user generated content?
  • Would building content around user generated content lead to an increase of long tail search engine traffic?
  • Is there a way to rank user generated content by quality?
  • Does the site have a social community that encourages production of user generated content?

I hope to address some of these topics in depth in the future as part of this blog.

Learn more about End Point's technical SEO services.

osCommerce dead and reborn

Here's some interesting history of osCommerce. Sounds like it's been quite the ride for its users. Congrats to those who finally forked it and started making releases.

Replicate only parts of my table

A day or two ago in #slony, someone asked if Slony would replicate only selected columns of a table. A natural response might be to create a view containing only the columns you're interested in, and have Slony replicate that. But Slony is trigger-based -- the only reason it knows there's something to replicate is because a trigger has told it so -- and you can't have a trigger on a view. So that won't work. Greg chimed in to say that Bucardo could do it, and mentioned a Bucardo feature I'd not yet noticed. Bucardo is trigger-based, like Slony, so defining a view won't work. But it allows you to specify a special query string for each table you're replicating. This query is called a "customselect", and can serve to limit the columns you replicate, transform the rows as they're being replicated, etc., and probably a bunch of other stuff I haven't thought of yet. A simple example:
  1. Create a table in one database as follows:
    CREATE TABLE synctest (
       id INTEGER PRIMARY KEY,
       field1 TEXT,
       field2 TEXT,
       field3 TEXT
    );
    
  2. Also create this table in the replication destination database; Bucardo won't replicate schema changes or database structure.
  3. Tell Bucardo about the table. I won't give the SQL here because it's already available in the Bucardo documentation. Suffice it to say you need to tell the goat table about a customselect query. For my testing, I used 'SELECT id, field1 FROM synctest'. Note that the fields returned by this query must
    • Include all the primary key fields from the table. Bucardo will complain if it can't find the primary key in the results of the customselect query.
    • Return field names matching those of the table. This means, for example, that if you somehow transform the contents of a field, you need to make sure the query explicitly names the results something Bucardo can recognize, e.g. 'SELECT id, do_some_transformation(field1) AS field1 FROM synctest'
  4. Tell the sync to use the custom select statements by setting the 'usecustomselect' field in the sync table to TRUE for the sync in question
  5. Fire up Bucardo and see the results. Here's my source table:
    58921 josh@bucardo_test# select * from synctest ;
     id |  field1  | field2 | field3  
    ----+----------+--------+---------
      1 | alpha    | bravo  | charlie
      2 | delta    | echo   | foxtrot
      3 | hotel    | india  | juliet
      4 | kilo     | lima   | mike
      5 | november | oscar  | papa
      6 | romeo    | sierra | tango
      7 | uniform  | victor | whiskey
      8 | xray     | yankee | zulu
    (8 rows)
    
    ...and here's my destination table...
    58922 josh@bucardo_test# select * from synctest;
     id |  field1  | field2 | field3 
    ----+----------+--------+--------
      1 | alpha    |        | 
      2 | delta    |        | 
      3 | hotel    |        | 
      4 | kilo     |        | 
      5 | november |        | 
      6 | romeo    |        | 
      7 | uniform  |        | 
      8 | xray     |        | 
    (8 rows)
    
Note that at least for now, customselect works only with fullcopy sync types. Also, the destination table must match the source table in structure, even if you're not going to copy all the fields. That is, even though I'm only replicating the 'id' and 'field1' fields in the example above, the destination table needs to contain all the fields in the source table. This is one of Bucardo's TODO items...
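For reference, here is roughly what steps 3 and 4 boil down to when run against the bucardo configuration database. Treat this strictly as a sketch: the customselect and usecustomselect columns are described above, but the columns I use to pick out the right rows (tablename and name) and the sync name itself are assumptions, so check the Bucardo documentation for your version before copying anything.

# Assumptions: the goat row is matched by a 'tablename' column and the sync
# by a 'name' column; 'synctest_sync' is a made-up sync name.
psql -d bucardo -c "UPDATE goat SET customselect = 'SELECT id, field1 FROM synctest' WHERE tablename = 'synctest';"
psql -d bucardo -c "UPDATE sync SET usecustomselect = TRUE WHERE name = 'synctest_sync';"

After changing either setting, restart Bucardo so the sync picks up the new configuration.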

Bare git repositories and newspapers

During a recent discussion about git, I realized yet again that previous knowledge of a Version Control System (VCS) actively hinders understanding of git; this is especially true when trying to understand the difference between bare and non-bare repositories.

An analogy might be helpful: assume a modern newspaper, where the actual contents of the physical pages are stored in a database; i.e., the database might store contents of articles in one table, author information in another, page layout information in yet another table, and information on how an edition is built in yet another table, or perhaps in an external program. Any particular edition of the paper just happens to be a particular instantiation of items that live in the database.

Suppose an editor walks in and tells the staff "Create a special edition that consists of the front pages of the past week's papers." That edition could easily be created by taking all the front page articles from the past week from the database. No new content would be needed in the content tables themselves, just some metadata changes to label the new edition and description of how to build it.

One could consider the database, then, to be the actual newspaper.

Let's apply that analogy to git:

A git repository is the newspaper database. A particular git branch is the equivalent of a particular day's paper: e.g., the edition for February 5, 2009 consisting of a set of articles, glued together by a layout specification, tied to a label 'February 5, 2009'. In git terms, that would be blobs of data, glued together by references, perhaps labeled by either a branch or a tag.

A bare git repository, then, is the newspaper database itself, not a huge stack of all the editions ever printed. That's a sharp contrast to some other VCSs, where a repository is the first edition ever printed, with diffs stored on top of that. Running git clone is equivalent to copying all the tables of the database. Doing a git checkout of a branch is the equivalent of asking the newspaper factory to read in the metadata and content from the database and produce a physical paper instance of the newspaper.
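For the command-line minded, here's a quick way to see the difference for yourself (the repository URL is just a placeholder):

# the URL below is a placeholder; substitute any repository you can reach
git clone --bare git://example.com/project.git project.git
ls project.git     # HEAD, config, objects/, refs/ ... the "database", with no working files

git clone git://example.com/project.git project
ls project         # your working files, with the database tucked away in project/.git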

Puppet PDX will meet on 2/27/09

Even zombies like puppets!

If you live in or near Portland, OR, come join us for our first Puppet PDX meeting. We're meeting from 6-7pm, at Paddy's on SW 1st and Yamhill in Downtown Portland. It's right on the MAX line.

Inspired by Puppet and the Reductive Labs team, we're gathering people interested in all things related to configuration management. (Not sock puppet-making, sorry!) Cfengine user? Thinking about trying AutomateIt? Just have a pile of obsessively managed scripts? Come on down! We'll discuss tools, best practices, and generally how to make your systems run so well you can get to the pub by 4 o'clock.

This is the first meetup. Hopefully we can get organized, get to know each other, and decide on what the goals of the group should be.

So if you are interested in automation, configuration management, cloud computing, and large scale computing environments, come join us for a few drinks and some lively chatter.

Please RSVP by sending an email to puppetpdx@reductivelabs.com so that we can get a bigger room if needed.

Announcing Release of PostgreSQL System Impact (PGSI) Log Analyzer

The PostgreSQL System Impact (PGSI) log analyzer is now available at http://bucardo.org/wiki/Pgsi.

System Impact (SI) is a measure of the overall load a given query imposes on a server. It is expressed as a percentage of a query's average duration over its average interval between successive calls.
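Put as a formula, that definition amounts to SI = 100 × (average duration ÷ average interval between successive calls). For example, a query that averages 50 ms per execution and runs, on average, every 200 ms would score an SI of 25, meaning it keeps the server busy roughly a quarter of the time.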

Queries are collected into canonical form with respect to literals and bind params; further, IN lists of varying cardinality are collapsed. Thus, queries that differ only in argument composition will be collected together in the evaluation. However, logically equivalent queries that differ in any other manner of structure (say, two comparisons joined by AND that are transposed) will be seen as distinct.

The goal of SI is to identify those queries most likely to cause performance degradation on the database during heaviest traffic periods. Focusing exclusively on the least efficient queries can hide relatively fast-running queries that saturate the system more because they are called far more frequently. By contrast, focusing only on the most-frequently called queries will tend to emphasize small, highly optimized queries at the expense of slightly less popular queries that spend much more of their time between successive calls in an active state. These are often smaller queries that have failed to be optimized and punish a system severely under heavy load.

PGSI requires full PostgreSQL logging through syslog with a prescribed format. Specifically, log_statement must be 'all' and log_duration must be 'on'. Given a continuous log interval of any duration, PGSI will calculate reports in wiki-ready format with the following data over that interval:

  • First line defines suggested wiki page name for the given report
  • Log interval over which the report applies
  • SI, sorted from worst to best
  • Average duration of execution for the canonical query
  • Total count of times canonical query was executed
  • Average interval between successive executions
  • Standard deviation of the duration
  • Display of the canonical query
  • List of log entries for best- and worst-duration instances of the canonical query (only if report was generated using the --offenders option).

PGSI can be downloaded in tar.gz format or can be accessed from Git, its version-control system. To obtain it from git, run:

git clone http://bucardo.org/pgsi.git/

Contributions are welcome. Send patches (unified output format, please) to mark@endpoint.com.

Testing in the Web Environment

Introduction

Testing is an important part of good software engineering practices. In fact, it can be said that it is at once the most important and yet the most neglected part of software engineering. Testing methodology for software engineering developed out of its hardware engineering roots: software was defined in terms of its inputs and outputs, and testing was similarly defined in terms of applied inputs and expected outputs.

However, software testing is more complex than that, because software almost always incorporates "state" or memory that affects subsequent operations. For instance, consider the following pseudocode:

if (VALUE is not defined)
then
VALUE := 1.0
fi
FRACTION := 1.0 / VALUE

In this simple case, the code fragment will always operate correctly on its first execution, but a later execution may fail with a division by zero if some other code has since set VALUE to zero, because the assignment of 1.0 happens only when VALUE is undefined.

Testing web applications involves planning for this kind of memory, because in essence a web application runs within a larger program (the web server and perhaps the application server) and may inherit state from the environment, or indeed may preserve its own state from one page reference to the next.

In addition, web applications involve human factors.

  • Does the application "display correctly" (whatever that means)?
  • Does the page load "quickly enough"?
  • Do dynamic elements (e.g. Ajax) respond appropriately?

Such factors are harder to measure than verifying that a sales tax calculation returns an accurate number.

For these reasons, we turn to web application testing frameworks. Loosely defined, these frameworks provide either a substitute for, or an interface to, a web browser that is under programmatic control. So for instance, a test script can invoke the web application via URL just as a browser would. Then it can test for page content or metadata (title, etc.), and even in some cases access embedded media such as image files. The framework provides a way to operate the web application: through it, the test script can submit forms, click on objects, respond to dynamic events such as JavaScript alerts, and even operate the browser in other ways: navigating via the "Back" button, saving files, etc.

Using such frameworks, the software engineer can automate the testing process. The application's performance can be defined in terms of the test scripts that it passes, so that modifications to the application (new functionality or bug repairs) can be validated against the existing tests (regression testing).

In this article, I'll briefly survey several approaches to web application testing frameworks that are in use or under study at End Point.

WWW::Mechanize

The first framework is a Perl module called "WWW::Mechanize", and its associated extension "Test::WWW::Mechanize". This framework provides an object-oriented interface to an HTTP connection which allows a test script, written in Perl, to perform operations on a web site much like a browser, and to test the results in various ways. By way of example, here is a script that operates on the End Point website:

use strict;
use Test::WWW::Mechanize;
use Test::More tests => 4;
my $mech = Test::WWW::Mechanize->new();
$mech->get_ok('http://www.endpoint.com', 'Home page fetched');
$mech->title_like(qr/End Point/, 'Page mentions us');
$mech->follow_link_ok({ text_regex => qr/Team Bios/ }, 'Found team bios');
$mech->content_contains('Jeff Boes', 'Author was mentioned');

This test declares that we will run four tests. It initializes the test framework with a call to the "new" method. Then it executes the four tests, annotating each one with a message that lets us identify which test failed by a human-friendly string rather than a bare number.

The first test just checks that the framework can retrieve the home page; failure would be caused by a server problem, DNS failure, etc. The second test just verifies that the page title contains a particular text pattern (the name of our company). The third test finds a link (in this case, based on a pattern of text in the link; we could also locate a link by URL, for example) and verifies that the framework can follow the link. The fourth and final test verifies that the author's name appears on the page.
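A script like this is an ordinary Perl program emitting standard TAP output, so it can be run directly or under the usual test harness (the filename here is only an example):

# assuming the script above was saved as endpoint.t
perl endpoint.t
prove -v endpoint.t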

From simple building blocks like this, more and more complex tests can be built up. Through the underlying framework, a test script can:

  • set and retrieve form field values, including checkboxes and selectors
  • submit forms
  • set and retrieve cookie values
  • analyze images
  • provide credentials for HTTP Basic Authentication (for password-protected sites)

End Point has used this approach with success. For example, the order and checkout process on CityPass uses a sequence of tests designed to place orders for every product offered, in various combinations. The test script makes a connection to the site's PostgreSQL database allowing it to compare the resulting order receipts with the matching database entries.

The major failing of the Mechanize family is that JavaScript is not supported. Thus, this framework is not suitable for testing pages for which major parts of the functionality are provided through JavaScript.

HTTP::Recorder

This framework, another Perl module, is really a system for constructing test scripts for use with WWW::Mechanize. It doesn't offer any testing facility on its own; instead, it is designed to sit between a browser and a web application, recording the mouse clicks and keystrokes made, and emitting a test script that can then be fed through WWW::Mechanize (perhaps after suitable manual adjustment).

Again, this system doesn't recognize, operate on, or record JavaScript events, so it's not as useful for testing sites with large amounts or critical sections of JavaScript.

Selenium

Selenium is a framework rather unlike the previous entries, although from the view of the programmer developing a test script or suite, it doesn't seem that much different. Selenium has several components; the one that interests us most for this particular survey is "Selenium RC" (Remote Control). This component services requests from a test script written much like the WWW::Mechanize scripts. The Selenium RC server will start up a browser and translate test script requests into actual mouse and keyboard events on the controlled browser.

Selenium works with several different browsers, such as Firefox and Microsoft Internet Explorer. For the vast majority of test scripts, the only change required to switch from testing one browser platform to another is to change a single line in the initial server request.

Selenium works with JavaScript events and functionality. You can, for instance, test JavaScript "onmouseover" events, or field validation through "onchange" or "onsubmit". Your test scripts can check for JavaScript alerts and respond to them, and behave in nearly every way just as a real user would, sitting in front of a real browser.

Selenium RC is implemented as a Java application, which means that its environment must include a Java installation (JVM).

The drawback of Selenium is that since it must be run in an environment that includes a browser and window display system, you'll almost certainly need to run your test script on a workstation, or a server with all the windowing software installed.

Other approaches

  • OpenSTA (Open System Testing Architecture) is more of a heavy-load testing framework, although it does provide a scripted setup.
  • Usability testing environments such as WAUTER are designed to observe and record end-user actions (such as scrolling and mouse clicks) for later analysis.

Hot topic: user groups!

Theodore Ts'o, Linux filesystem hacker, recently asked:

Do you attend or participate in a LUG? How active is it? What do you get out of participating in a LUG?

I attend and participate in my local LUG, PLUG. I've attended meetings since about 2000, and presented once to a group of about 45 people. They meet every month, and have an additional "advanced topics" meeting every month, which I've presented at as well.

What I get out of PLUG has changed over the years. First I was attending to meet other Linux enthusiasts and learn things. Now I go to meetings to see my friends, swap war stories, and to present on topics of interest to me. My goals are less about general networking, and more about directly connecting to, sharing information with and learning from people I already know, who are for the most part my friends and colleagues.

The comment thread on Ted's post is kind of long, and there were lots of interesting comments. I decided to respond to some requests for howtos and weigh in with my opinion on the rise of user groups:

My experience has been that user groups (not just Linux User Groups) are exploding in number, but that the total number of people in each group is relatively small - like somewhere between 15-30 people (or less).

It’s so easy to throw up a wiki page, or create a mailing list, and start meeting to talk about things weekly or monthly. Portland, OR has a unique culture developing around this, and the manic activity around it is somewhat documented in our community’s custom-developed calendaring system — http://www.calagator.org.

I gave a user groups talk at FOSDEM over the weekend (http://tr.im/fjbl), and Gabrielle Roth and I produced a printable pamphlet for OSCON last year that details how you can run your own user group: http://tr.im/fn81

We need to convert that printable thing into a wiki page, but you know, there’s never enough time in the day :)

One interesting thing I noticed about the FOSDEM crowd was that the audience was intrigued by the idea of creating an ad hoc group, without a formal leadership structure or a non-profit/legal status established. Several people commented to me after the talk that this seemed like a great way to get people interacting with one another without all of the pain of establishing a non-profit. Not sure if other countries/cultures have the same issues (favoring formal organization, over the informal, sometimes chaotic groups I tend to be interested in and work with).

Oh, and someone in the audience asked specifically about non-hierarchical organizations, and groups that truly have no leader. That was some food for thought — I’ve seen a couple groups operate that way, but there tended to be forks in the community after a couple years. Not sure that’s really a bad thing, but certainly an interesting part of user group culture and something I would like to blog/speak about more in the future.

Test::Database Postgres support

At our recent company meeting, we organized a 'hackathon' at which the company was split into small groups to work on specific projects. My group was Postgres-focused and we chose to add Postgres support to the new Perl module Test::Database.

This turned out to be a decent sized task for the few hours we had to accomplish it. The team consisted of myself (Greg Sabino Mullane), Mark Johnson, Selena Deckelmann, and Josh Tolley. While I undertook the task of downloading the latest version and putting it into a local git repository, others were assigned to get an overview of how it worked, examine the API, and start writing some unit tests.

In a nutshell, the Test::Database module provides an easy interface for creating and destroying test databases. This can be a non-trivial task on some systems, so putting it all into a module makes sense (and has the added benefit of keeping everyone from reinventing this particular wheel). Once we had a basic understanding of how it worked, we were off.

While all of our tasks overlapped to some degree, we managed to get the job done without too much trouble, and in a fairly efficient manner. We made a new file for Postgres, added in all the required API methods, wrote tests for each one, and documented everything as we went along. The basic method for creating a test database is to use the initdb program to create a new Postgres cluster, then modify the cluster to use a local Unix socket in the newly created directory (thereby completely side-stepping the problem of an already-occupied port). Then we can start up the new cluster via the pg_ctl command and create a new database.
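In shell terms, the steps the module automates look roughly like this; the paths are purely illustrative, and the module itself takes care of all the details:

initdb -D /tmp/testdb_cluster                      # create a throwaway cluster
# in /tmp/testdb_cluster/postgresql.conf, disable TCP and keep the socket local:
#   listen_addresses = ''
#   unix_socket_directory = '/tmp/testdb_cluster'
pg_ctl -D /tmp/testdb_cluster -l /tmp/testdb_cluster/logfile start
createdb -h /tmp/testdb_cluster testdb             # -h with a directory uses the Unix socket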

At the end of the day, we had a working module that passed all of its tests. We combined our git patches into a single patch and mailed it to the author of the module, so hopefully you'll soon see a new version of Test::Database with Postgres support!

David Mamet on software development

I recently read the book True and False: Heresy and Common Sense for the Actor (1997) by David Mamet. (A gift from Uncle Steve. Thanks!) Mamet’s film The Spanish Prisoner is one of my favorites, and I liked Oleanna, Things Change, and Glengarry Glen Ross a lot too. Wait, looking over his writing credits at IMDb, I see I’d forgotten The Winslow Boy, State and Main, and Wag the Dog, which were good too.

Enough of the fanboyism. This little book of his is packed with what sounds like good advice for actors. Not having been an actor in anything more than a school play or two, I can only imagine, but I was surprised to find myself nodding in agreement during many passages because they applied so well to my field, software development.

I’ve selected a few quotations that seemed especially apt:

I was once at a marriage ceremony where the parties swore “to try to be faithful, to try to be considerate ...” That marriage was, of course, doomed. Any worthwhile goal is difficult to accomplish. To say of it “I’ll try” is to excuse oneself in advance. Those who respond to our requests with “I’ll try” intend to deny us, and call on us to join in the hypocrisy—as if there were some merit in intending anything other than accomplishment. (p. 34)

This reminded me of Richard Stallman’s take on the same subject:

Yoda’s philosophy (“There is no ‘try’”) sounds neat, but it doesn’t work for me. I have done most of my work while anxious about whether I could do the job, and unsure that it would be enough to achieve the goal if I did. But I tried anyway, because there was no one but me between the enemy and my city. Surprising myself, I have sometimes succeeded. (from the essay, The GNU Project)

Though they appear to be contradictory, I think both approaches can apply depending on circumstances and I’ve employed each at various times. Whoa, there ... nuance.

I don’t have comments about the rest of these excerpts:

The simple performance of the great deed, onstage or off, is called “heroism.” The person who will not be swayed, who perseveres no matter what—that hero has the capacity to inspire us, to suggest that we reexamine our self-imposed limitations and try again. (p. 13)

The middle-class work ethic: “But I did my preparation. It is not my fault if the truth of the moment does not conform.” That ethic is not going to avail. Nobody cares how hard you worked. Nor should they.
   Acting, which takes places for an audience, is not as the academic model would have us believe. It is not a test. It is an art, and it requires not tidiness, not paint-by-numbers intellectuality, but immediacy and courage. (p. 32)

Let us learn acceptance. This is one of the greatest tools an actor can have. The capacity to accept: to wish things to happen as they do. It is the root of all happiness in life, and it is the root of wisdom for an actor. Acceptance. Because the capacity to accept derives from the will and the will is the source of character. Applying our intention to use only one meaning for words, character is the same onstage and off. It is habitual action. ...
   The habit of cheerful acceptance is an aide in the greater life in the theatre, too, because it induces truthful consideration: “The world is as it is, what can I do about it?” ... (pp. 70-71)

Carve the big tasks up into small tasks and perform these small tasks. (p. 76)

The more you are concerned with yourself, the less you are worthy of note.
   The more a person’s concentration is outward, the more naturally interesting that person becomes. ...
   The person with attention directed outward becomes various and provocative. The person endeavoring to become various and provocative is stolid and unmoving. (p. 95)

“Luck,” in one’s business dealings, and “talent,” its equivalent onstage, seem to reward those with an active and practicable philosophy. ... [H]ard work and perseverance will be rewarded. (p. 99)

In the theatre, as in other endeavors, correctness in the small is the key to correctness in the large. (p. 101)

Be generous to others. ... There is certainly something you can correct or improve in yourself today—over which you have control. That habit will make you strong. Yearning to correct or amend something in someone else will make you petty. ...
   Cultivate the habit of mutuality. Create with your peers, and you are building a true theatre. When you desire and strive to rise from the ranks rather than with the ranks, you are creating divisiveness and loneliness in yourself, in the theatre, and in the world. (pp. 102-103)

It is not a sign of ignorance not to know the answers. But there is great merit in facing the questions. (p. 127)

That’s enough for now, but there’s much more in the book that I found useful and applicable to other areas of life as well. It’s a quick read and I really enjoyed it.

NY Puppet Users Group meeting

Several of us from End Point attended the first NY Puppet Users Group meeting last night, and we had a great time. Brian Gupta and Larry Ludwig did an excellent job coordinating, even though the weather and the venue didn't cooperate. After a little shuffling, we found ourselves at The Perfect Pint, where we had a large table in a fairly quiet room where we could meet each other and talk about Puppet. The group was split fairly evenly between those who have used Puppet and those who are considering it: everyone there understood the need for solid configuration management (CM), so much of the discussion (at least at my end of the table) centered on
  • comparisons of different CM systems
  • challenges in getting Puppet on legacy systems (e.g., that may not even have Ruby)
  • features people would like to see in Puppet
  • how to make the business case for Puppet
All in all it was a great time, and I really appreciated Brian and Larry's leadership, as well as their flexibility. Not only did they move the venue at the last minute to better accommodate the group, but they were super hosts in coordinating the date and time: several of us were only in NY for a short time, but they managed to juggle things around so that as many as possible could attend. That flexibility says a lot for the Puppet community, and we're looking forward to participating in future meetings.

Slony1-2.0.0 + PostgreSQL 8.4devel

Many people use Slony to replicate PostgreSQL databases in various interesting ways. Slony is a bit tough to get used to, but works very well, and can be found in important places at a number of high-load, high-profile sites. A few weeks back I set up Slony1-2.0.0 (the latest release) with a development version of PostgreSQL 8.4, and kept track of the play-by-play, as follows:

Starting Environment

On this machine, PostgreSQL is installed from the CVS tree. I updated the tree and reinstalled just to have a well-known starting platform (output of each command has been removed for brevity).
jtolley@uber:~/devel/pgsql$ make distclean
jtolley@uber:~/devel/pgsql$ cvs update -Ad
jtolley@uber:~/devel/pgsql$ ./configure --prefix=/home/jtolley/devel/pgdb
jtolley@uber:~/devel/pgsql$ make
jtolley@uber:~/devel/pgsql$ make install
jtolley@uber:~/devel/pgsql$ cd ../pgdb
jtolley@uber:~/devel/pgdb$ initdb data
jtolley@uber:~/devel/pgdb$ pg_ctl -l ~/logfile -D data start
The --prefix option in ./configure tells PostgreSQL where to install itself. Slony uses a daemon called slon to do its work, and slon connects to a database over TCP, so I needed to configure PostgreSQL to allow TCP connections by editing postgresql.conf appropriately and restarting PostgreSQL.

Installing Slony

Next, I downloaded slony1-2.0.0.tar.bz2 and checked its MD5 checksum:
jtolley@uber:~/downloads$ wget http://www.slony.info/downloads/2.0/source/slony1-2.0.0.tar.bz2
--2008-12-17 11:29:54--  http://www.slony.info/downloads/2.0/source/slony1-2.0.0.tar.bz2
Resolving www.slony.info... 207.173.203.170
Connecting to www.slony.info|207.173.203.170|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 909567 (888K) [application/x-tar]
Saving to: `slony1-2.0.0.tar.bz2'
...
jtolley@uber:~/downloads$ md5sum slony1-2.0.0.tar.bz2 
d0c4955f10fe8efb7f4bbacbe5ee732b  slony1-2.0.0.tar.bz2
The MD5 checksum matches the one given on the slony website, so we can continue. First, I unzipped the download into /home/jtolley/devel/slony1-2.0.0. Now we need to configure and build the source. Again, the output of each command has been removed for brevity.
jtolley@uber:~/devel/slony1-2.0.0$ ./configure --with-pgconfigdir=/home/jtolley/devel/pgdb/bin/ \
> --prefix=/home/jtolley/devel/slony --with-perltools=/home/jtolley/devel/pgdb/bin
jtolley@uber:~/devel/slony1-2.0.0$ make
jtolley@uber:~/devel/slony1-2.0.0$ make install
The configure options tell slony where to find pg_config, a program that reports the locations of various important PostgreSQL libraries and other components, where to install slony, and where to put slony's perl-based toolset, which we'll use later. I also added /home/jtolley/devel/slony/bin to my PATH.

Setting Up Replication

Configuring PostgreSQL

The Slony documentation demonstrates setting up a database with pgbench and replicating it to another database. This document demonstrates the same thing. We'll create a slony user, databases pgbench and pgbenchslave, use the pgbench utility to create a schema, and then copy that schema to pgbenchslave. We'll then set up slony to replicate changes in pgbench to pgbenchslave. Each slony database needs a slon process connected to it using a superuser account. First, we'll create the superuser account, called slony, and a pair of databases called pgbench and pgbenchslave:
jtolley@uber:~/devel/pgdb$ createuser -sP slony
Enter password for new role: 
Enter it again: 
jtolley@uber:~/devel/pgdb$ createdb pgbench
jtolley@uber:~/devel/pgdb$ createdb pgbenchslave
Now we'll create some schema objects in the pgbench database using the pgbench utility (the output from pgbench isn't shown here):
jtolley@uber:~/devel/pgdb$ pgbench -i -s 1 pgbench
Slony requires PL/pgSQL, so we'll install it now, in both databases:
jtolley@uber:~/devel/pgdb$ for i in pgbench pgbenchslave ; do createlang plpgsql $i ; done
Note: here we have to deviate from what older versions of slony expected. Slony requires every replicated table to have a primary key, and used to be able to create keys for tables that didn't otherwise have them, if instructed to do so. As of version 2.0.0 that's no longer possible, perhaps because in most cases it was a bad idea for users to do it anyway. So we have to make sure each table has a primary key. The pgbench schema consists of four tables, called accounts, branches, tellers, and history. Of these four, history doesn't have a primary key, so we need to create one. Here's how I did it:
jtolley@uber:~/devel/pgdb$ psql -Xc "alter table history add id serial primary key;" pgbench
NOTICE:  ALTER TABLE will create implicit sequence "history_id_seq" for serial column "history.id"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "history_pkey" for table "history"
ALTER TABLE
Note the -X in the call to psql; this prevents the "\set AUTOCOMMIT off" setting in my psqlrc file from taking effect, so I didn't have to add a "commit" command to the stuff I send psql. Now that our schema is set up properly, let's copy it from pgbench to pgbenchslave. In this case we want to replicate all tables, so we'll copy everything.
jtolley@uber:~/devel/pgdb$ pg_dump -s pgbench | psql -X pgbenchslave

Configuring Slony's altperl Scripts

Now we're ready to set up slony, and we'll make use of slony's altperl scripts to do most of the configuration grunt work for us. To make altperl work, we need to set up slon_tools.conf. A sample already lives in slony/etc/slon_tools.conf-sample.
jtolley@uber:~/devel/pgdb$ cd ../slony/etc/
jtolley@uber:~/devel/slony/etc$ cp slon_tools.conf-sample slon_tools.conf
jtolley@uber:~/devel/slony/etc$ vim slon_tools.conf
This file defines nodes and sets, and is written in Perl. First, a group of nodes makes a slony cluster, which is a named object. You can set that name with the $CLUSTER_NAME parameter. We also need a directory where log information will be written, which goes in the $LOGDIR parameter. In this case, I've set it to "/home/jtolley/devel/slony/log", which I've manually created. The slon daemons need write access to this directory; since they'll be running as me on this machine, that's fine. Next we add all the nodes. In this case, there are only two nodes, defined as follows:
    add_node(node     => 1,
             host     => 'localhost',
             dbname   => 'pgbench',
             port     => 5432,
             user     => 'slony',
             password => 'slony');
    add_node(node     => 2,
             host     => 'localhost',
             dbname   => 'pgbenchslave',
             port     => 5432,
             user     => 'slony',
             password => 'slony');
I had to remove definitions of nodes 3 and 4 from the sample configuration. Now we define replication sets. This involves defining tables and the unique, not null indexes slony can use as a primary key. If the table has an explicitly defined primary key, slony will use it automatically. Because of our modifications to the history table above, all four of our tables have primary keys, so this part is simple. All our table names go in the pkeyedtables element of the $SLONY_SETS hash, as follows:
        "pkeyedtables" => [     
                                'public.accounts',
                                'public.tellers',
                                'public.history',
                                'public.branches'
                           ],
We don't have any tables without primary keys, so we don't need the keyedtables element, and Slony no longer creates serial indexes for you as of v2.0.0, so we can delete the serialtables element. We do need to replicate the history_id_seq we created as part of the history table's primary key, so add that to the sequences element, as follows:
        "sequences" => ['history_id_seq' ],
Finally, remove the sample configuration for set 2, and save the file.

Generating Slonik Configuration

Now that we've configured the altperl stuff, we can use it to generate scripts that will be passed to slonik, which will actually set things up.
jtolley@uber:~/devel/pgdb$ slonik_init_cluster > initcluster
jtolley@uber:~/devel/pgdb$ slonik_create_set 1 > createset
jtolley@uber:~/devel/pgdb$ slonik_subscribe_set 1 2 > subscribeset
This creates three files, each containing slonik code to set up a cluster and get it running. If you tried to use the serialtables stuff, you'll run into problems here with new versions of slony (not that I had that problem or anything...). Note that the arguments to slonik_subscribe_set differ from those given in the documentation. This script requires two arguments: the set you're interested in, and the node that's subscribing to it.

Starting Everything Up

We're ready to do real work. Tell slonik to initialize the cluster:
jtolley@uber:~/devel/pgdb$ slonik < initcluster 
:6: Possible unsupported PostgreSQL version (80400) 8.4, defaulting to 8.3 support
:6: could not open file /home/jtolley/devel/slony/share/postgresql/slony1_base.sql
The complaints about version 8.4 aren't surprising, as I'm using bleeding-edge PostgreSQL. But I think I had something wrong with my directories when I built slony. The files in question ended up in /home/jtolley/devel/pgdb/share/postgresql, so I did this:
jtolley@uber:~/devel/pgdb$ mkdir ../slony/share
jtolley@uber:~/devel/pgdb$ cp -r share/postgresql/ ../slony/share/
jtolley@uber:~/devel/pgdb$ slonik < initcluster 
:6: Possible unsupported PostgreSQL version (80400) 8.4, defaulting to 8.3 support
:9: Possible unsupported PostgreSQL version (80400) 8.4, defaulting to 8.3 support
:10: Set up replication nodes
:13: Next: configure paths for each node/origin
:16: Replication nodes prepared
:17: Please start a slon replication daemon for each node
This looks right, so the next step is to start the slon daemon for each node:
jtolley@uber:~/devel/pgdb$ slon_start 1
Invoke slon for node 1 - /home/jtolley/devel/slony/bin/slon -s 1000 -d2 replication 'host=localhost dbname=pgbench user=slony port=5432 password=slony' > /home/jtolley/devel/slony/log/slony1/node1/pgbench-2008-12-17_12:33:18.log 2>&1 &
Slon successfully started for cluster replication, node node1
PID [24745]
Start the watchdog process as well...
jtolley@uber:~/devel/pgdb$ syntax error at /home/jtolley/devel/pgdb/bin/slon_watchdog line 47, near "open "
Execution of /home/jtolley/devel/pgdb/bin/slon_watchdog aborted due to compilation errors.
Slony shipped a bug in slon_watchdog -- line 46 needs to have a semicolon at the end.
jtolley@uber:~/devel/pgdb$ pkill slon
jtolley@uber:~/devel/pgdb$ vim ../pgdb/bin/slon_watchdog
Change line 46 to read:
    my ($logfile) = "$LOGDIR/slon-$dbname-$node.err";
... and try again:
jtolley@uber:~/devel/pgdb$ slon_start 1
Invoke slon for node 1 - /home/jtolley/devel/slony/bin/slon -s 1000 -d2 replication 'host=localhost dbname=pgbench user=slony port=5432 password=slony' > /home/jtolley/devel/slony/log/slony1/node1/pgbench-2008-12-17_12:35:29.log 2>&1 &
Slon successfully started for cluster replication, node node1
PID [24918]
Start the watchdog process as well...
jtolley@uber:~/devel/pgdb$ slon_start 2
Invoke slon for node 2 - /home/jtolley/devel/slony/bin/slon -s 1000 -d2 replication 'host=localhost dbname=pgbenchslave user=slony port=5432 password=slony' > /home/jtolley/devel/slony/log/slony1/node2/pgbenchslave-2008-12-17_12:35:31.log 2>&1 &
Slon successfully started for cluster replication, node node2
PID [24962]
Start the watchdog process as well...
Now we need to create the cluster and subscribe:
jtolley@uber:~/devel/pgdb$ slonik < createset 
:16: Subscription set 1 created
:17: Adding tables to the subscription set
:21: Add primary keyed table public.accounts
:25: Add primary keyed table public.tellers
:29: Add primary keyed table public.history
:33: Add primary keyed table public.branches
:36: Adding sequences to the subscription set
:40: Add sequence public.history_id_seq
:41: All tables added
jtolley@uber:~/devel/pgdb$ slonik < subscribeset 
:10: Subscribed nodes to set 1

Watching It Work

Now we can make it do something interesting. First, start watching the logs. They live in /home/jtolley/devel/slony/log/slony1, and we can watch them like this, since there aren't too many log files involved:
jtolley@uber:~/devel/slony/log/slony1$ find . -type f | xargs tail -f
This shows lots of log info. If you want to see more, run another pgbench instance:
jtolley@uber:~/devel/pgdb$ pgbench -s 1 -c 5 -t 1000 pgbench
For extra credit, add another table to the replication set, get it replicated, and manually insert data. See if the new data come across.

Using cron and psql to transfer data across databases

I recently had to move information from one database to another in an automated fashion. I centralized some auditing information such that specific information about each database in the cluster could be stored in a single table, inside a single database. While I still needed to copy the associated functions and views to each database, I was able to make use of the new "COPY (SELECT ...) TO" feature to do it all in one step via cron.

At the top of the cron script, I added two lines defining the database I was pulling the information from ("alpha"), and the database I was sending the information to ("postgres"):

PSQL_ALPHA='/usr/bin/psql -X -q -t -d alpha'
PSQL_POSTGRES='/usr/bin/psql -X -q -t -d postgres'

From left to right, the options tell psql to not use any psqlrc file found (-X), to be quiet in the output (-q), to print tuples only and no header/footer information (-t), and the name of the database to connect to (-d).

The cron entry that did the work looked like this:

*/5 * * * * (echo "COPY audit_mydb_stats FROM STDIN;" && $PSQL_ALPHA -c "COPY (SELECT *, current_database(), now(), round(date_part('epoch'::text, now())) FROM audit_mydb_stats()) TO STDOUT" && echo "\\.") | $PSQL_POSTGRES -f -

From left to right, the command does this:

  • Run once every five minutes.
  • Take the entire output of the first parenthesized command and pipe it to the second command.
  • We build a complete COPY command to feed to the second database.
    • First, we echo the line that tells it where to store the data (COPY ... FROM STDIN)
    • Next, we run the 'COPY TO' command on the first database, which, instead of dumping a table, outputs the results of a function, plus three other columns indicating the current database, the current time and the current time as an epoch value.
    • After all the data is dumped out, we echo a "backslash dot" to indicate the end of the copied data
  • All of this is now piped to the second database by calling psql with a -f argument, indicating that we are reading from a file. In this case, the file is stdin via the newly opened pipe, indicated by a single dash after the -f argument.

This allowed me to simply move the data from one database to the other, with a transformation in the middle, neatly avoiding any need to make changes to either the function's output or the columns on the target table.

Git it in your head

Git is an interesting piece of software. For some, it comes pretty naturally. For others, it's not so straightforward.

Comprehension and appreciation of Git are not functions of intellectual capacity. However, the lack of comprehension/appreciation may well indicate one of the following:

  1. Mistakenly assuming that concepts/procedures from other VCSes (particularly non-distributed "traditional" ones like CVS or Subversion) are actually relevant when using Git
  2. Not adequately appreciating the degree to which Git's conception of content and history represent a logical layer, as opposed to implementation details

CVS and Subversion both invite the casual user to basically equate the version control repository and all operations around it to the file system itself. They ask you to understand how files and directories are treated and tracked within their respective models, but that model is basically oriented around files and directories, period. Yes, there are branches and tags. Branches in particular are entirely inadequate in both systems. They don't really account for branching as a core possibility that should be structured into the logical model itself; consequently, both systems can keep things simple (the model basically amounts to files and directories), neither one challenges the user mentally, and neither one does much for you when you have real problems to solve that involve branching.

With such a low barrier to entry, where the logical model is barely distinguishable for day-to-day use from the file system, it's easy for engineers to think of the VCS as a taken-for-granted utility, that should "Just Work" and be really easy and not challenge assumptions, etc. Then Git comes along and punishes anyone who takes that view; if you try to treat Git as a simple utility to drop in the place of CVS/SVN, you will eventually suffer. The engineer must grasp the logical layer in order to make effective use of the tool on anything beyond the shallowest of levels.

So, here's the deal: when they say that Git tracks content, not files, they mean it. They're telling you that Git isn't just a nice versioned history of your file system. Rather, Git offers you a deal: you take an hour or two to learn its logical layer (its object model, in a sense), and in exchange Git will give you branching and distributed workflow as a basic way of life. It's a good deal.

Consequently, those new to Git or those having trouble with Git may do well to throw out any assumptions. Instead, memorize this:

  • Objects are just things stored in Git that have a type, some data, some Git-oriented headers, and a unique ID consisting of a SHA1 hash of the components of the object. There aren't very many object types to learn
  • Blobs are simple objects that just contain some data and could be thought of as "leaves" on a tree. They might represent text files, binary files, symlinks, etc. But they are "blobs", not files, because the blob only represents the data, not any real-world identity of that data ("content, not files")
  • Trees are objects that contain a list of blob (and other tree) ids, each paired with some properties and a real-world identity (a name, relative to the tree). File system directories map to trees, but they aren't the same thing.
  • Commits are objects that reference a single tree object (the top level tree of the repository) and some arbitrary number of parent commits.
  • Refs aren't standard Git objects, in that they aren't storing versioned data or anything like that; rather, they are simply named pointers, each of which references a particular commit object. Branches are refs. Magic things like HEAD are refs. Again, all the ref needs to do is specify a particular commit object.

That's really not all that much stuff to remember. Then, before thinking about how it relates to files, think through the implications of the object model above. You'll never get how it works with the file system if you don't get it as a standalone model first:

  • Blobs aren't versioned. They are standalone pieces of content that are referenced by trees. They represent content state, period.
  • A tree's identity is determined by the content of its blobs and their identity relative to that tree
  • Consequently, two trees may ultimately only have one different blob (tree A has blob X under branch Z while tree B instead has blob Y under branch Z), but they are two unique trees, that happen to reference some arbitrary number of common trees/blobs (the member trees/blobs that are the same between both will literally be "the same" between both, as they are identified by SHA1)
  • Since the state of the tree determines its identity, it's easy for Git to determine where differences between two trees occur.
  • Since the tree and the parent commits make up a commit object, it's easy for Git to determine whether or not a specific state (combination of tree state and revision history) exists in a given history; this is what allows for flexible branching and distributed operations.
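One way to make the model concrete is to poke at it with git's plumbing commands, inside any repository you already have (the README path below is just an example of a file that might exist in yours):

git cat-file -t HEAD          # "commit": HEAD is a ref that resolves to a commit object
git cat-file -p HEAD          # the commit itself: tree id, parent commit(s), author, message
git cat-file -p HEAD^{tree}   # the top-level tree: blob/tree ids paired with names
git cat-file -p HEAD:README   # the raw content of the blob currently named README
git show-ref --heads          # branches are just refs: names pointing at commit ids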

Once you get all that in your head, then map it to the filesystem.

  • Files map to blobs
  • Directories map to trees
  • Your "checkout" is the mapping of a particular commit's tree to your file system. That's where your working tree starts.
  • Changing a file means introducing a new blob, which introduces a new tree, which cascades up to the top of your working tree. That's how the magic happens.

Study the object model, the logical layer, whatever you want to call it. It's not an implementation detail; it's the model that makes everything possible. You have to understand Git's concept of revision history and whatnot if you're going to make it work for you. Just like you need to learn something new and idiomatic whenever you pick up a new piece of sophisticated software.