End Point


Welcome to End Point’s blog

Ongoing observations by End Point people

An Odd Reason Why I Like Google Wave

Others have noted reasons why Google might have decided Wave is a failure, but for me the most significant reason is that it wasn't open to all interested parties when it was announced at Google I/O 2009. With the limited sandbox, followed by the limited number of accounts given out, there was no chance for the network effect to kick in, and without that, a communication tool, no matter how useful, will not gain much traction.

I still use Google Wave, along with several of my co-workers. Wave has great real-time collaborative editing, and for me fills a useful niche between email, wiki, and instant messaging.

However, every once in a while a completely different advantage hits me: escaping the absurdly, almost comically bad threading and quoting in regular email. This is both a technical problem, with email clients that handle quoting stupidly, and a problem of conventions, with people quoting haphazardly, quoting too much, or trimming and line-wrapping in wacky ways. To say nothing of multipart email with equally hideous plain text and bloated HTML parts.

I munged the text of the following message from a mailing list to use as an example. It's not the most egregious mess of quoting I've seen, and it had no pile of attachments as sometimes happens, but it shows how messy things can get.

In Wave, replies and quoting were quite a bit better. Not perfect, but nothing like this. If Wave goes away, I'm not sure when we'll see another candidate to compete with email that is likewise an open protocol with a solid reference implementation that's free of charge to use. It's a shame.

Date: Mhx, 3 Hqi 1499 37:76:16 +1546
From: Pswr iwo zvs Aburf 
To: gdgmguezfet-list@lykkkafnnj.wjo
Subject: Re: [gdgmguezfet-list] Lgj Cjroiojezg: Jmxc Mezfcfeqh

> -----Original Message-----
> From: gnszgqzspfx-whto-zhnmchx@cpevwgofsv.isj [mailto:iiyqktwjujg-fnlq-
> evkvpty@eczdtgakvm.oac] Jl Waopuq Vq Exho Jxprx
> Sent: Ojamlrtxg, Hkzn 16, 2906 41:72 AE
> To: gdgmguezfet-list@lykkkafnnj.wjo
> Subject: Re: [gdgmguezfet-list] Bqr Dqjdiarbbf: Usdq Frweivcmz
> Turlmyx Mhek rhh gye Llkat (utbm@6pgbr.bfb):
> > > -----Original Message-----
> > > From: iilxvyjwayx-bllf-ptrecjc@heaukjmdkq.htj [mailto:rrnjpeneara-
> ooxd-
> > > xalsxai@hxugschjny.xsq] Re Fvbihv Yz Fhlg Hmcrf
> > > Sent: Eezdaevyk, Pgyh 99, 9587 0:76 QQ
> > > To: gdgmguezfet-list@lykkkafnnj.wjo
> > > Subject: Re: [gdgmguezfet-list] Kld Jzecudkxki: Dstw Tyjekgeah
> > >
> > > Tdhdojq Atrg wbf mvf Cudfw (wphe@8gwbs.lgu):
> > > > > -----Original Message-----
> > > > > From: wrlsvvealuw-wsod-hwwvjdo@ihgudlwfrk.kgs
> [mailto:ckajpyzyzoi-
> > > ccnq-
> > > > > rwmpxyk@ronwonbthg.air] Jn Zubqus Sr Sohf Ovswrrbhd
> > > > > Sent: Ndasppj, Banc 94, 8862 2:10 KH
> > > > > To: gdgmguezfet-list@lykkkafnnj.wjo
> > > > > Subject: Re: [gdgmguezfet-list] Pbu Zntzwmgimt: Gxel Wdjpzgemo
> > > > >
> > > > > Sonpxm Kjluxxmr (Itqeg) wrote:
> > > > > > Yxhw xb oftwdhs gtiogosk qkv opusb (yfrtg qxegsju rem
> qnhmsoox)
> > > os
> > > > > bkv
> > > > > > ptqa mq vlyd edtomyuz eikclpf:
> > > > > >
> > > > > > kptd://qfx.bdtqmqabro.chk/opnamgt/oikgome/cwppzfezxru-
> fdpf/1149-
> > > > > Smxw/9
> > > > > > 74694.txke
> > > > > >
> > > > > > Lyvuivy, Ucqg!
> > > > >
> > > > > Dpynbi Unpnk!
> > > > >
> > > > > Gl xu syc iqh.
> > > > >
> > > > > S oztdn agm hagxp X okqf hazgjohvhx kijip lxpq djna.  Te ffc
> myuoxy
> > > > > kjisroyk cb x wdvrg jp lkdeqqw gf c xdx bpfs sicpb giohed cz
> paz
> > > ntrtb
> > > > > boqsnoxc cazeupca msxscw dqqdv piuvapku paahujdimn bjb afueakh.
> > > > >
> > > > > Eefj ik ss ejmlkis edchi inkfg puiqovazf y gflmel ol etnvgxyz
> fl
> > > gxp
> > > > > fnicp fndjc.  Z xjqb zgnb asfpjjm ht kp aiiupstbsh rijcacj
> xewre
> > > qof.
> > > >
> > > > M talpk mpoj bzprpown oywg we zxz qtu fjeb yydclxzr uts gjxp
> > > > CynkyOmnareHvmdyl hlclxaxvs pw cn ehqxfd krqs gfeq bwn ywzynpo
> > > qheelwu busk
> > > > amlptm, ao qxpb qqal udh yanqoin
> > > >
> > > > HboiyErxaftByifwd  prka cmu
> > > >
> > >
> > > Hhmtun lujf emfcfwlysstn, qoplid jpt mdnfb imbuyy. A pr rhb xl
> > > oulydlqvu byxo
> > > mjfossxp zsm neqna uxvjou sby b odo ibuc.
> > >
> > > X zagkk v toevkk vpwfmon idirul dx:
> > >
> > >   MfIrpktz vmqklb jolquatuzjqz ngmffe
> > >
> > > dcff ri kdrice xcem vx g fjuxp qnw ekmylkjs nhdrjtlj. Zh osuvxnx
> vpe
> > > jedp khbfuhibx jlupzdsnwen jv ljkhy kqoyeh, ncoh bgr uok ylas.
> >
> > Bvy iopef kuje's aiec mcf tlvyp msvsw hl fti ghqqn lwff? Crdwj qrgi
> gpnavlq
> > hljshpg qgq kocel knrp kgmshwlbms cq wztz zkbu mbpwq ph pynzqpsd
> ahjjy,
> > nplwl gms xaftc vtw srri uw nzbv dv tgt noo hnfmokm p katlc xmskoss $
> > mukgegrqu pogr.
> Scciwgd kj zjfbrecy miu yscw npzmb iu. IbTibflw * tkckv rchxz cyhu. 5-)
> >
> >
> > > U fbfaejarxux pd eexpwcme so=wai-yyh-zjoocg + [mzed-rxjf] gq n
> > > xollrj slmmiq ffe suhvjcjyam LfFzhvap ja wxqd eviaql se rmm ic W uj
> > > dcjtmbyss. Xtbvmmhl lkzsv sxfegic bd lho devcb bcmlh'h xmzi xn js
> uweoh
> > > jtucuu rwpmtme zhprrwd zc tnjhdrqz ergbp rkrzb xzjypfz snrzaqhti.
> > >
> > > Duodc
> > > > ivzl vq fjicz aof bxg mzemgedp kscj gbtzcjf. Ugd ti vnh ptoa
> bfnshu
> > > rq k
> > > > xbkhi ' quc jzvm xfkfqcmx ' pk qgo om hoszfrv dxbc cmw cbhfsa fwb
> ntc
> > > bezs-bkjdnw zcs gb. 0-)
> >
> > Phkq ynohufr jtwq xe pqzyco lkoqn qzvm ggcfqfeqxh hqgw pknxgoma
> wtblpu
> > lisdkw xw atok?
> Ykxbr auksxkx auujvb. Qbefjsg kemw odhe e wsmmdvpy lhfjd jz kne
> "pqynmmwp
> bjoqrd".

Gaosgf'm aoih tu tw vhqwkl zp dotgpa:
GpclzPtypqeVknqkp owcrmhgm

> >
> > Fzjhf vor hdabr sxzwtylu ox hze xdabn vdo tyzssjor fh rnnol gp ny i
> arhnfo
> > bvv fd ccvjz jlki xfzu ebcwt tr htbatha zmro zb jh mxnz nrugdz o yap
> keor
> > nahe vdvfh xog?
> Fp pmg qinel okk qrfju utzmmuuib, dnsy. Zn zqh jfccu gji hqkt ba eiu
> Xsomgqxyldd, edtz.
> > Fbf qjh segbu tjb lsfwcewqmh fs qimdvdxdp, mzb xxdl maf ebgujoheazk
> mgis? ...
> --
> Dbnq Chiauz
> Zrwintky -- Amrtws Mgztvpfpbwp Jfocmutnso    http://arn.oosbscvt.inw/
> eocsk +5.415.700.6883  
> Nmejbtdh vxani: Nguu xh cwziz bqjtb.
> _______________________________________________
> gdgmguezfet-list mailing list
> gdgmguezfet-list@lykkkafnnj.wjo
> http://ivd.xitxowuodc.eci/vccqskc/xettnigw/bliyqvhrdse-list


Aaaah! My eyes!

Learning Spree: 10 Intro Tips

In climbing the learning curve of Spree development, here are some observations I've made along the way:

  1. Hooks make view changes easier — I was surprised at how fast I could implement certain kinds of changes because Spree's hook system allowed me to inject code without requiring overriding a template or making a more complicated change. Check out Steph's blog entries on hooks here and here, and the Spree documentation on hooks and themes.
  2. Core extensions aren't always updated — One of the biggest surprises I found while working with Spree is that some Spree core extensions aren't maintained with each release. My application used the Beanstream payment gateway. Beanstream authorizations (without capture) and voids didn't work out of the box with Spree 0.11.0.
  3. Calculators can be hard to understand — I wrote a custom shipping calculator and used calculators with coupons for the project and found that the data model for calculators was a bit difficult to understand initially. It took a bit of time for me to be comfortable using calculators in Spree. Check out the Spree documentation on calculators for more details.
  4. Plugins make the data model simpler after learning what they do — I interacted with the plugins resource_controller, state_machine, and will_paginate in Spree. All three simplified the models and controllers interface in Spree and made it easier to identify the core behavior of Spree models and controllers.
  5. Cannot revert migrations — Spree disables the ability to revert migrations due to complications with extensions, which makes it difficult to undo simple database changes. This is more of a slight annoyance, but it complicated some aspects of development.
  6. Coupons are robust, but confusing — Like calculators, the data model for coupons is a bit confusing to learn but it seems as though it's complicated to allow for robust implementations of many kinds of coupons. Spree's documentation on coupons and discounts provides more information on this topic.
  7. Solr extension works well — I replaced Spree's core search algorithm in the application to allow for customization of the indexed fields and to improve search performance. I found that the Solr extension for Spree worked out of the box very well. It was also easy to customize the extension to perform indexation on additional fields. The only problem is that the Solr server consumes a large amount of system resources.
  8. Products & Variants — Another thing that was a bit strange about Spree is that every product has at least one variant, referred to as the master variant, which is used for baseline pricing information. Spree's data model was foreign to me, as most ecommerce systems I've worked with have had a much different product and variant data model.
  9. Routing — One big hurdle I experienced while working with Spree was how Rails routing worked. This probably stemmed from my inexperience with the resource_controller plugin, or from the fact that one of the first times I worked with Rails routing was to create routes for a nested resource. Now that I have learned how routing works and how to use it effectively, I believe it was well worth the initial struggle.
  10. Documentation & Community — I found that the documentation for Spree was somewhat helpful at times, but the spree-user Google group was more helpful. For instance, I got a response on Beanstream payment gateway troubleshooting from the Spree extension author fairly quickly after asking on the mailing list.
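To make the hook mechanism in point 1 more concrete, here is a toy sketch of the general idea: a registry that lets extensions inject markup after named points in core views without overriding whole templates. This is an invented illustration of the pattern, not Spree's actual implementation; all class and method names here are hypothetical.

```ruby
# Toy sketch of a view-hook registry (invented names; not Spree's real code).
class HookRegistry
  def initialize
    # hook name => list of blocks that render extra content
    @hooks = Hash.new { |h, k| h[k] = [] }
  end

  # An extension registers content to appear after a named hook point.
  def insert_after(hook_name, &renderer)
    @hooks[hook_name] << renderer
  end

  # Core view rendering: emit the original content, then any hooked content.
  def render(hook_name, original)
    original + @hooks[hook_name].map(&:call).join
  end
end

hooks = HookRegistry.new

# An "extension" injects a promo banner without touching the template:
hooks.insert_after(:product_description) { "<div class='promo'>Sale!</div>" }

puts hooks.render(:product_description, "<p>A nice product</p>")
# => prints "<p>A nice product</p><div class='promo'>Sale!</div>"
```

The appeal of the real thing is the same as in this sketch: the core template never changes, so upgrades don't clobber your customizations.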

I believe that Spree is an interesting project with a somewhat unusual approach to providing a shopping cart solution. Spree's approach of trying to implement 90% of a shopping cart system is very different from some other shopping cart systems that overload the code base to support many features. The 90% approach made some things easier and some things harder to do. Things like hooks and extensions make it far easier to customize than I expected it would be, and that also seems to help avoid the build-up of spaghetti code that comes from implementing a lot of features. However, the "90%" solution seems to make some things, like calculators, a bit harder to understand when getting started with Spree, since the implementation is kept general and robust to allow for customization.

Hopefully Useful Techniques for Git Rebase

I recently had to spend a few hours merging Git branches to get a development branch in line with the master branch. While it would have been a lot better to do this more frequently along the way (which I'll do going forward), I suspect that plenty of people find themselves in this position occasionally.

The work done in the development branch represents significant new design/functionality that refactors a variety of older components. My preference was to use a rebase rather than a merge, to keep the commit history clean and linear and, more critically, because the work we're doing really can be thought of as being "applied to" the master branch.

No doubt there are a variety of strategies to apply here. This worked for me and perhaps it'll help someone else.

Some Key Concerns for a Big Rebase

Beyond the obvious concern of having sufficient knowledge of the application itself, so that you can make intelligent choices with respect to the code, there are a number of key operational concerns specific to rebase itself. This list is not exhaustive, but it is a reasonable set of key considerations to keep in mind.

  1. Rebase is destructive

    Remember what you're doing! While a merge literally combines two or more revision histories, a rebase takes a chunk of revision history and applies it on top of another related history. It's like a cherry-pick on steroids (really nice, friendly steroids that provoke neither rage nor senate hearings): each commit gets logically applied on top of the specified head, and as such gets rewritten. The commits are not the same afterwards. The history of your working tree's branch is rewritten.

    So, before you rebase, protect yourself: Make sure you have more than one reference (either a branch or a tag) pointing to your current work.

  2. Conflict resolution can bring about bugs

    When resolving merge conflicts along the way, you'll need to manually inspect things to try to figure out the right path forward. If it's been a while since you merged/rebased, you may find that merge conflict resolution is not so simple: rather than picking one version or the other, you're literally merging them in some logical manner. You may end up writing new code, in other words.

    Because you are involved and you are a mammal, there is a decent possibility that you will screw this up.

    So, again, protect yourself: Look at what's coming before you rebase and take note of likely conflict resolution points.

  3. Things go wrong and an abort can be necessary

    Sometimes it becomes quite clear that a mistake has been made along the way, and you need to bail out and regroup. If you're doing a gigantic rebase in one big shot, this can happen after you're 15, 45, 90, or 120+ minutes into the task. Do you really want to have to go all the way back to the beginning of your rebase excursion and start fresh?

    Don't let this happen. When approaching the rebase, show humility, expect things to go wrong, and embrace a strategy that lets you recover from mistakes:

    Break the rebase into smaller chunks and proceed through them incrementally

  4. You may not immediately know that something went wrong

    Unless the code base is pretty trivial or you are 100% committed to that code base all the time, it is unlikely that you'll be completely on top of everything that's happened in both revision histories. You can test the stuff you know, you can run test suites, etc., but it's critical to work defensively.

    Prepare for the possibility of delayed mistake revelation: Keep track of what you do as you go
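The "protect yourself" advice from concern 1 is cheap to follow in practice. Here is a throwaway sketch (the repository, file names, and ref names are all invented for illustration):

```shell
# Sketch: before rebasing, leave extra refs pointing at the current work.
# Everything here happens in a throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"

echo base > file && git add file && git commit -qm "base"
git checkout -q -b shiny
echo work > shiny_file && git add shiny_file && git commit -qm "shiny work"

# More than one reference now points at shiny's tip; even if a rebase
# rewrites shiny, the old commits stay reachable from these refs:
git branch shiny_backup shiny
git tag shiny_pre_rebase shiny
```

If the rebase goes badly, `git reset --hard shiny_backup` (or the tag) gets you back to exactly where you started.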

Addressing the Concerns

The technique I've come to use to address the stated concerns is fairly simple to learn, understand, and apply in practice. It's iterative in nature and is therefore Agile and therefore grants me a sense of personal validation, which is very, very important.

For a real-world use case, you'll probably want to use more helpful, specific branch and tag names than this. The names in this discussion are deliberately simple for illustrative purposes.

Say you have a master branch which represents the canonical state of the code base. You've been working on the shiny branch where everything is more awesome. But shiny really needs to keep up with master, it's been a while, and so you want to rebase shiny onto master.

We're going to have the following things:

  • Multiple stages of rebasing, leading incrementally from shiny to the full rebase of shiny on master.
  • A "target" for each stage: the commit from master onto which you're rebasing the work from shiny
  • A tag providing an intuitive name for each target
  • A branch providing the revision history for each stage

Given those things, we can follow a simple process:

  1. Make a branch from the latest shiny named for the next stage (i.e. from shiny we make shiny_rebase_01, from shiny_rebase_01 we make shiny_rebase_02, and so on).

    When you're just starting the rebase, this might mean:

    [you@yours repo] git checkout -b shiny_rebase_01 shiny
    But for the next iteration, you would have shiny_rebase_01 checked out, and use it as your starting place:
    # The use of "shiny_rebase_01" is implied assuming our previous checkout above
    [you@yours repo] git checkout -b shiny_rebase_02
    # A subsequent one, again assuming we're on our most recent stage's branch already
    [you@yours repo] git checkout -b shiny_rebase_03
    And so on.

    This addresses concerns 1, 3, and 4: you're protecting yourself against rebase's inherent destructiveness by always working on new branches; you're facilitating the staging of work in smaller chunks; and you're keeping track of your work by having a separate branch representing the state of each change.

  2. Review the revision history of master, look for commits likely to contain significant conflicts or representing significant inflection points, and pick your next target commit around them; if you have a pile of simple commits, you might want the target to be the last such simple commit prior to a big one, for instance. If you have a bunch of big hairy commits you may want each to be its own target/stage, etc. Use your knowledge of the app.

    The git whatchanged command is very useful for this, as by default it lists the files changed in a commit, which is the right granularity for this kind of work. You want to quickly scan the history for commits that affect files you know to be affected by your work in shiny, because they will be a source of conflict resolution points. You don't want to look at the full diff output of git log -p for this purpose; you simply want to identify likely conflict points where manual intervention will be required, where things may go wrong. After having identified such points, you can of course dig into the full diffs if that's helpful.

    Make your life easy by using the last target tag as the starting place for this review, so you only wade through the commits on master that are relevant to the current rebase stage (since the last target tag is where your branches diverge, it's where the rebase will start from).

    At this point you may say "but I don't have a last target tag!" The first time through, you won't have one because you haven't done an iteration yet. So for the first time, you can start from where git rebase itself would start:

    [you@yours repo] git whatchanged `git merge-base master shiny`..master

    But subsequent iterations will have a tag to reference (see the next step), so the next couple times through might look like:

    [you@yours repo] git whatchanged shiny_rebase_target_01..master
    [you@yours repo] git whatchanged shiny_rebase_target_02..master


    This is addressing items 2 and 3: we're looking at what's coming before we leap, and structuring our work around the points where things are likely to be inconvenient, difficult, etc.

  3. Having identified the commit you want to use as your next rebasing point, make a tag for it. Name the tags consistently, so they reflect the stage to which they apply. So, if this is our first pass through and we've determined that we want to use commit a723ff127 for our first rebase point, we say:

    [you@yours repo] git tag shiny_rebase_target_01 a723ff127

    This gives us a list of tags representing the different points in master's history onto which we rebased shiny in our staged process. It therefore addresses item 4, keeping track as you go.

  4. You're now on a branch for the current stage, you have a tag representing the point from master onto which you want to rebase. So do it, but capture the output of everything. Remember: mistakes along the way may not be immediately apparent. You will be a happier person if you've preserved all the operational output so you can review to track down where things potentially went wrong.

    So, for example:

    [you@yours repo] git rebase shiny_rebase_target_01 >> ~/shiny_rebase_work/target_01.log 2>&1
    You would naturally update the tag and logfile per stage.

    Review the logfile in your pager of choice. Is there a merge conflict reported at the bottom? Well, capture that information before you dive in and resolve it:

    # Log the basic info about the current state
    [you@yours repo] git status >> ~/shiny_rebase_work/target_01.log 2>&1
    # Log specifically what the conflicts are
    [you@yours repo] git diff >> ~/shiny_rebase_work/target_01.log 2>&1

    Now go and resolve your conflicts per usual, but remember to preserve your output when you resume:

    [you@yours repo] git rebase --continue >> ~/shiny_rebase_work/target_01.log 2>&1

    This addresses point 4: keeping track of what happened as you go.

  5. Now you've finished that stage of the rebase: you resolved any conflicts along the way, and you've preserved the history of what happened, what was done, etc. So the final step is: test.

    Run the test suite. You did implement one, right?

    Test the app manually, as appropriate.

    Don't put it off until the end. Test as you go. Seriously. If something is broken, use git blame, git bisect, and your logs and knowledge of the system to figure out where the problem originates. Consider blowing away the branch you just made, going back to the previous stage's branch, selecting a new target, and moving forward with a smaller set of commits. Etc. But make sure it works as you go.

    This does not necessarily fit any specific point above, but rather ensures the integrity of the overall staged rebase process. The point of iterative work is that each iteration delivers a small bit of working stuff, rather than a big pile of broken stuff.

  6. Repeat this process until you've successfully finished a rebase stage for which the target is in fact the head of master. Done.

So, that's the process I've used in the past. It's been good for me, maybe it can be good for you. If anybody has criticisms or suggestions I'd love to hear about them in comments.
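Pulling the steps together, a complete (if tiny) staged rebase looks roughly like the following. The repository and commits are throwaway examples; only the branch/tag naming scheme comes from the process above:

```shell
# End-to-end sketch of the staged rebase process on a scratch repository.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email you@example.com
git config user.name "You"

# master gets m1; shiny branches off and adds s1; master moves on to m2, m3.
echo m1 > m1 && git add m1 && git commit -qm "m1"
git checkout -q -b shiny
echo s1 > s1 && git add s1 && git commit -qm "s1"
git checkout -q master
echo m2 > m2 && git add m2 && git commit -qm "m2"
echo m3 > m3 && git add m3 && git commit -qm "m3"

mkdir rebase_work

# Stage 1: pick a target partway up master, tag it, branch, rebase, log.
git tag shiny_rebase_target_01 master~1
git checkout -q -b shiny_rebase_01 shiny
git rebase shiny_rebase_target_01 > rebase_work/target_01.log 2>&1

# Stage 2: the target is now the head of master, so this is the last stage.
git tag shiny_rebase_target_02 master
git checkout -q -b shiny_rebase_02
git rebase shiny_rebase_target_02 > rebase_work/target_02.log 2>&1

# shiny's work now sits cleanly on top of master:
git log --oneline master..shiny_rebase_02
```

Each stage left behind a branch, a tag, and a logfile, so any mistake discovered later can be traced back to the stage that introduced it.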

Backcountry.com, CityPASS, and Zapp's in the news

I want to call attention to a few of our clients who have been in the news lately.

Today, Backcountry.com's CIO Kelly Phillipps spoke at the NRFtech 2010 conference in California on cultivating community and encouraging user-contributed content, and why Backcountry.com invests so much in building its community.

About a month ago, CityPASS was featured in a spot on the Today Show about hostess gifts. A lot of viewers visited the website as the spot aired in each of the U.S. time zones!

Finally, we were sad to learn that Zapp's Potato Chips founder Ron Zappe passed away in June. He had quite an adventure building his two companies, first in oilfield services in Texas, then in 1985 with his potato chips company focused on spicy Cajun chips. Their online store is simple -- their products are excellent!

Ruby on Rails Typo blog upgrade

I needed to migrate a Typo blog (built on Ruby on Rails) from one RHEL 5 x86_64 server to another. To date I've done Ruby on Rails deployments using Apache with FastCGI, mongrel, and Passenger, and I've been looking for an opportunity to try out an nginx + Unicorn deployment to see how it compares. This was that opportunity, and here are the changes that I made to the stack during the migration:

I used the following packages from End Point's Yum repository for RHEL 5 x86_64:

  • nginx-0.7.64-2.ep
  • ruby-enterprise-1.8.7-3.ep
  • ruby-enterprise-rubygems-1.3.6-3.ep

The rest were standard Red Hat Enterprise Linux packages, including the new RHEL 5.5 postgresql84 packages. The exceptions were the Ruby gems, which were installed locally with the `gem` command as root.

I had to install an older version of one gem dependency manually, sqlite3-ruby, because the current version requires a newer version of sqlite than comes with RHEL 5. The installation commands were roughly:

yum install sqlite-devel.x86_64
gem install sqlite3-ruby -v 1.2.5

gem install unicorn
gem install typo

yum install postgresql84-devel.x86_64
gem install postgres

Then I followed (mostly) the instructions in Upgrading to Typo 5.4, which are still pretty accurate even though outdated by one release. One difference was the need to specify PostgreSQL to override the default of MySQL (even though the default is documented as being sqlite):

typo install /path/to/typo database=postgresql

Then I ran pg_dump on the old Postgres database and imported the data into the new database, and put in place the database.yml configuration file.

The Typo upgrade went pretty smoothly this time. I had to delete the sidebars configuration from the database to stop getting a 500 error for that, and redo the sidebars manually -- which I've had to do with every past Typo upgrade as well. But otherwise it was easy.

I first tested the migrated blog by running unicorn_rails manually in development mode. Then to have Unicorn start at boot time, I wrote this little shell script and put it in ~/bin/start-unicorn.sh:

#!/bin/sh
cd /path/to/app || exit 1
unicorn_rails -E production -D -c config/unicorn.conf.rb

Then added a cron job to run it:

@reboot bin/start-unicorn.sh

That unicorn.conf.rb file contains only:

listen 8080
worker_processes 4

The listen port 8080 is the default, but I may need to change it. Unicorn defaults to only 1 worker process, so I increased it to 4.

I added the following nginx configuration inside the http { ... } block (actually in a separate include file):

upstream app_server {
    server 127.0.0.1:8080 fail_timeout=0;
}

server {
    listen       the.ip.add.ress:80;
    server_name  the.host.name;

    location / {
        root   /path/to/rails/typo/public/cache;

        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        #proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Host $http_host;
        proxy_redirect off;

        rewrite ^/blog/xml/atom/feed\.xml$ /articles.atom permanent;
        rewrite ^/blog/xml/rss20/feed\.xml$ /articles.rss permanent;

        if (-f $request_filename) {
            break;
        }

        set $possible_request_filename $request_filename/index.html;
        if (-f $possible_request_filename) {
            rewrite (.*) $1/index.html;
            break;
        }

        set $possible_request_filename $request_filename.html;
        if (-f $possible_request_filename) {
            rewrite (.*) $1.html;
            break;
        }

        if (!-f $request_filename) {
            proxy_pass http://app_server;
            break;
        }
    }

    # Rails error pages
    error_page 500 502 503 504 /500.html;
    location = /500.html {
        root   /path/to/rails/typo/public;
    }
}
The configuration was a little complicated to get nginx serving static content directly, including cache files that Typo writes out. I had to add special handling for / which gets cached as /index.html, but can't be called that when passing URIs to Typo, as it doesn't know about any /index.html. And all HTML cache files end in .html, though the URIs don't, so those need special handling too.

But when all is said and done, the blog is now running on the latest version of Typo, on the latest Unicorn, Rails, Ruby Enterprise Edition, PostgreSQL, and nginx, with all static content and fully-cached pages served directly by nginx, and for the most part only dynamic requests being served by Unicorn. I need to tweak the nginx rewrite rules a bit more to get 100% of static content served directly by nginx.

As far as blogging platforms go, I can recommend Typo mainly for Rails enthusiasts who want to write their own plugins, tweak the source, etc. WordPress or Movable Type are so much more widely used that non-programmers are going to have a lot easier time deploying and supporting them. They've had a lot more security vulnerabilities requiring updates, though that may also be a function of popularity and payoff for those exploiting them.

Rails deployment seems to take a lot of memory no matter how you do it. I don't think nginx + Unicorn uses much less RAM than Apache + Passenger; the difference is mostly between nginx and Apache themselves. But using Unicorn does allow for running the application processes on another server or several servers without needing nginx or Apache on those other servers. It provides clean separation between the web server and the application(s), including possibly different SELinux contexts rather than always httpd_sys_script_t as we see with Passenger. Passenger at least switches the child process UID to run with different permissions from the main web server, which is good. Both Passenger and Unicorn are much nicer than FastCGI, which I've always found to be a little buggy, and mongrel, which required specifying a range of ports and load-balancing across all of them in the proxy -- managing multiple port ranges is a pain with multiple apps on the same machine, especially when some need more than others.

I think if you have plenty of RAM, going with Apache + Passenger may still be the easiest Rails web deployment method overall, when mixed with other static content, server-side includes, PHP, and CGIs. But for high-traffic and custom setups, nginx + Unicorn is a nice option.

Creativity with fuzzy string search

PostgreSQL provides a useful set of contrib modules for "fuzzy" string searching; that is, searching for something that sounds like or looks like the original search key, but that might not exactly match. One place this type of searching shows up frequently is when looking for peoples' names. For instance, a receptionist at the dentist's office doesn't want to have to ask for the exact spelling of your name every time you call asking for an appointment, so the scheduling application allows "fuzzy" searches, and the receptionist doesn't have to get it exactly right to find out who you really are. The PostgreSQL documentation provides an excellent introduction to the topic in terms of the available modules; this blog post also demonstrates some of the things they can do.

The TriSano application was originally written to use soundex search alone to find patient names, but that proved insufficient, particularly because common-sounding last names with unusual spellings would be ranked very poorly in the search results. Our solution, which has worked quite well in practice, involved creative use of PostgreSQL's full-text search combined with the pg_trgm contrib module.

A trigram is a sequence of three characters. In the case of pg_trgm, it's three adjacent characters taken from a given input text. The pg_trgm module provides easy ways to extract all possible trigrams from an input and compare them with similar sets taken from other inputs. Two strings that generate similar trigram lists are, in theory, similar strings. There's no particular reason you couldn't use two, four, or some other number of characters instead of trigrams, but you'd be trading off sensitivity against selectivity. And as the name implies, pg_trgm only supports trigrams.
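For instance, pg_trgm's stock functions make the idea easy to see from psql (the input strings here are made up):

```sql
-- List the trigrams pg_trgm extracts from a string (it pads the ends
-- with blanks, so short words still yield several trigrams):
SELECT show_trgm('john');

-- Compare two strings by the overlap of their trigram sets;
-- similarity() returns a number between 0 (nothing shared) and 1 (identical):
SELECT similarity('John Doe', 'Jon Doe');
```

The `%` operator, also from pg_trgm, is a boolean shorthand for "similarity exceeds the configured threshold", which is what makes trigram matching usable in a WHERE clause.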

Straight trigram search didn't buy us much on top of soundex, so we got a bit more creative. A trigram is just a sequence of three characters, which looks pretty much like a short word, so we thought we'd try using PostgreSQL's full text search on trigram data. Typically full text search has a list of "stop words": un-indexed words judged too common and too short to contribute meaningfully to an index. Our words would all be three characters long, so we had to create a new text search configuration using a dictionary with an empty stop word list. With that text search configuration, we could index trigrams effectively.
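A configuration like that can be sketched as follows. The object names are hypothetical; the real definitions live in the project's name_search.sql. The key point is that the built-in `simple` dictionary template takes an optional stopword file, so omitting it yields an empty stop word list:

```sql
-- A dictionary with no stop words, based on the built-in "simple" template:
CREATE TEXT SEARCH DICTIONARY trgm_simple_dict (
    TEMPLATE = pg_catalog.simple
);

-- A configuration that maps ordinary words through that dictionary:
CREATE TEXT SEARCH CONFIGURATION trgm_fts (COPY = pg_catalog.simple);
ALTER TEXT SEARCH CONFIGURATION trgm_fts
    ALTER MAPPING FOR asciiword, word WITH trgm_simple_dict;

-- With that in place, a name's trigram list can be indexed as ordinary
-- full-text data, e.g.:
--   to_tsvector('trgm_fts', array_to_string(show_trgm(last_name), ' '))
```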

This search helped, but wasn't quite good enough. We finally borrowed a simplified version of a data mining technique called "boosting", which involves using multiple "weak" classifiers or searchers to create one relatively good result set. We combined straightforward trigram, soundex, and metaphone searches with a normal full text search of the unmodified name data and a full text search over the trigrams generated from the names. The data sizes in question aren't particularly large, so this amount of searching hasn't proven unsustainably taxing on processor power, and it provides excellent results. The code is on github; feel free to try it out.
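In spirit, the "boosted" query looks something like the following schematic sketch. This is not the actual TriSano code (see the repository on github for that); the ranks, thresholds, and the two searches shown are illustrative, and the real version also folds in metaphone and the two full text searches:

```sql
-- Schematic: union several "weak" searches, then aggregate their ranks.
SELECT id, last_name, first_name,
       array_agg(source) AS sources,
       sum(rank) AS rank
FROM (
    -- a weak soundex match, contributing a fixed modest rank
    SELECT p.id, p.last_name, p.first_name,
           'soundex'::text AS source, 0.05::double precision AS rank
      FROM people p
     WHERE soundex(p.last_name) = soundex('Doe')
    UNION ALL
    -- a weak trigram match, ranked by trigram similarity
    SELECT p.id, p.last_name, p.first_name,
           'name_trgm', similarity(p.last_name, 'Doe')
      FROM people p
     WHERE p.last_name % 'Doe'     -- pg_trgm's similarity operator
) AS matches
GROUP BY id, last_name, first_name
ORDER BY rank DESC;
```

Records found by several of the weak searches accumulate rank from each of them, which is what pushes genuinely good matches to the top even though no single search method is reliable on its own.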

Update: One of the comments suggested a demonstration of the results, which of course makes perfect sense. So I resurrected some of the scripts I used when developing the technique. In addition to the scripts used to install the fuzzystrmatch and pg_trgm modules and the name_search.sql script linked above, I had a script that populated the people table with a bunch of fake names. Then, it's easy to test the search mechanism like this:

select * from search_for_name('John Doe')
as a(id integer, last_name text, first_name text, sources text[], rank double precision);

 id  |  last_name  | first_name |                     sources                     |        rank        
-----+-------------+------------+-------------------------------------------------+--------------------
 167 | Krohn       | Javier     | {trigram_fts,name_trgm,trigram_fts,trigram_fts} |  0.281305521726608
 228 | Jordahl     | Javier     | {trigram_fts,name_trgm,trigram_fts}             |  0.237995445728302
  59 | Pesce       | Dona       | {trigram_fts}                                   |  0.174265757203102
 185 | Finchum     | Dona       | {trigram_fts}                                   |  0.174265757203102
 104 | Rumore      | Dona       | {trigram_fts}                                   |  0.174265757203102
 250 | Dumond      | Julio      | {name_trgm,trigram_fts,trigram_fts}             |   0.16849160194397
 200 | Dedmon      | Javier     | {name_trgm,trigram_fts,trigram_fts}             |  0.163729697465897
 230 | Dossey      | Malinda    | {name_trgm,trigram_fts}                         |  0.158055320382118
  50 | Dress       | Darren     | {name_trgm,trigram_fts}                         |  0.153293430805206
 136 | Doshier     | Neil       | {name_trgm,trigram_fts}                         |  0.148531511425972
 165 | Donatelli   | Lance      | {name_trgm,trigram_fts}                         |  0.132845237851143
 280 | Dollinger   | Clinton    | {name_trgm,trigram_fts}                         |  0.132845237851143
 273 | Dimeo       | Milagros   | {name_trgm,trigram_fts}                         | 0.0866267532110214
  49 | Dawdy       | Christian  | {name_trgm,trigram_fts}                         | 0.0866267532110214
 298 | Elswick     | Jami       | {trigram_fts}                                   | 0.0845221653580666

This isn't all the results it returned, but it gives an idea of what the results look like. The rank column combines the rankings given by each of the underlying search methods, and the sources column shows which of the search methods found this particular entry. A search method may show up more than once when it found multiple matches between the input text and the result record. These results don't look particularly good, because there isn't really a good match for "John Doe" in the data set. But if I horribly misspell "Jamie Elswick", the search does a good job:

select * from search_for_name('Jomy Elswik')
as a(id integer, last_name text, first_name text, sources text[], rank double precision);

 id  |  last_name  | first_name |                     sources                     |        rank        
-----+-------------+------------+-------------------------------------------------+--------------------
 298 | Elswick     | Jami       | {trigram_fts,name_trgm,trigram_fts,trigram_fts} |  0.480943143367767
 312 | Elswick     | Kurt       | {name_trgm,trigram_fts}                         |  0.381967514753342
 228 | Jordahl     | Javier     | {trigram_fts,name_trgm,trigram_fts}             |  0.197063013911247
 403 | Walberg     | Erik       | {trigram_fts}                                   |  0.145491883158684
 309 | Hammaker    | Erik       | {trigram_fts}                                   |  0.145491883158684

End Point turns 15 years old

End Point was founded on August 8, 1995, so yesterday was its 15th anniversary! (Or is it a birthday? I'm never sure which it is for a company.) End Point's founders Rick Peltzman and Ben Goldstein have been friends since grade school, and this wasn't the first business they started together -- back in college, they painted houses.

Nearly 20 years later, sensing the huge potential of the newly commercialized Internet as a medium for mass communications and commerce, they joined forces again. They founded End Point to offer Internet and ecommerce consulting and hosting services. That was back before the Internet became heavily hyped, before the tech bubble of the late '90s, and like most new companies, it took a lot of hard work, hope, patience, skill, and a little luck.

End Point has grown from those humble beginnings to our present staff of 21 employees located in 13 states. We work with wonderful clients, old and new, who in many cases have become very close partners and friends as we've worked together to grow their businesses over the years.

If we worked together in one office, we'd certainly celebrate with a large birthday cake, but since we're a distributed company, we're celebrating by each of us going out to a movie with family or friends, and having pizza or ice cream afterwards. It's not quite the same as being in the same room, but I don't think anyone will miss singing the "happy birthday" song and I suspect we'll all somehow manage to enjoy ourselves anyway.

Thanks to everyone who has made End Point what it is -- our employees, our clients, our families, and our friends in the industry and everywhere. We're looking forward to many more enjoyable years.

Tail_n_mail and the log_line_prefix curse

One of the problems I had when writing tail_n_mail (a program that parses log files and mails interesting lines to you) was getting the program to understand the format of the Postgres log files. There are quite a few options inside of postgresql.conf that control where the logging goes, and what it looks like. The three basic options are to send it to a rotating logfile with a custom prefix at the start of each line, to use syslog, or to write it in CSV format. I'll save a discussion of all the logging parameters for another time, but the important one for this story is log_line_prefix. This is what gets prepended to each log line when using 'stderr' mode (i.e. regular log files, rather than syslog or csvlog). By default, log_line_prefix is an empty string. This is a very useless default.

What you can put in the log_line_prefix parameter is a string of sprintf style escapes, which Postgres will expand for you as it writes the log. There are a large number of escapes, but only a few are commonly used or useful. Here's a log_line_prefix I commonly use:

log_line_prefix = '%t [%p] %u@%d '

This tells Postgres to print out the timestamp, the PID aka process id (inside of square brackets), the current username and database name, and finally a single space to help separate the prefix visually from the rest of the line. The above will generate lines that look like this:

2010-08-06 09:24:57.714 EDT [7229] joy@joymail LOG: execute dbdpg_p7228_5: SELECT count(id) FROM joymail WHERE folder = $1
2010-08-06 09:24:57.714 EDT [7229] joy@joymail DETAIL:  parameters: $1 = '4'

As you might imagine, the customizability of log_line_prefix makes parsing the log files all but impossible without some prior knowledge. I didn't want to go the pgfouine route and make people change their log_line_prefix to a specific setting. I think it's kind of rude to force your database to change its logging to accommodate your tools :). The original quick solution I came up with was to have a set of predefined regular expressions, and the user would pick the one that most closely matched their logs. For tail_n_mail to work properly, it needs to pick up at least the PID so it can tell when one statement ends and a new one begins. For example, if you chose "regex #1", the log parsing regex would look like this:

(\d\d\d\d\-\d\d\-\d\d \d\d:\d\d:\d\d).+?(\d+)

This works fine on the example above, and gets us the timestamp and the PID from each line. The stock regexes worked for many of the different log_line_prefix settings our clients were using, but I was never very happy with this solution. Not only was it susceptible to failing completely when a client used a log_line_prefix that didn't fit any of the current regexes, but there was no way to know exactly where the prefix ended and the statement began, which is important for formatting the output and for the canonicalization of similar queries.

Enter the current solution: building a regex on the fly. Since we don't have a connection to the database at all, merely to the log files, this requires that the user enter their current log_line_prefix. This is a simple entry in the tailnmailrc file that looks just like the entry in postgresql.conf, e.g.:

log_line_prefix = '%t [%p] %u@%d '

The tail_n_mail script uses that variable to build a custom regex specifically tailored to that log_line_prefix and thus to the Postgres logs being used. Not only can we grab whatever bits we want (currently we only care about the timestamp (%t and %m) and the PID (%p)), but we can now cleanly break apart each line in the log into the prefix and the actual statement. This means the canonicalization/flattening of the queries is more effective, and allows us to only output the prefix information once. The output of tail_n_mail looks something like this:
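The idea can be sketched in a few lines of Python (tail_n_mail itself is Perl, and handles many more escapes; the names and patterns here are illustrative): walk the log_line_prefix string, substitute a capturing regex for each % escape, and escape everything else as a literal.

```python
import re

# Sketch of building a log-parsing regex from log_line_prefix.
# Only a few of Postgres's escapes are handled here.
ESCAPES = {
    "%t": r"(?P<time>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d \S+)",
    "%m": r"(?P<time_ms>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+ \S+)",
    "%p": r"(?P<pid>\d+)",
    "%u": r"(?P<user>\S*)",
    "%d": r"(?P<db>\S*)",
}

def prefix_to_regex(log_line_prefix):
    pattern, i = "", 0
    while i < len(log_line_prefix):
        if log_line_prefix[i:i + 2] in ESCAPES:
            pattern += ESCAPES[log_line_prefix[i:i + 2]]
            i += 2
        else:
            pattern += re.escape(log_line_prefix[i])  # literal character
            i += 1
    return re.compile(pattern)

rx = prefix_to_regex("%t [%p] %u@%d ")
m = rx.match("2010-08-06 09:24:57 EDT [7229] joy@joymail LOG: duration: 0.1 ms")
print(m.group("pid"), m.group("user"), m.group("db"))  # 7229 joy joymail
# Everything after m.end() is the statement itself, cleanly separated.
```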

Date: Fri Aug  6 11:01:03 2010 UTC                                                        
Host: whale.example.com
Unique items: 7
Total matches: 85
Matches from [A] /var/log/pg_log/postgresql-2010-08-05.log: 61
Matches from [B] /var/log/pg_log/postgresql-2010-08-06.log: 24

[1] From files A to B (between lines 14,205 of A and 527 of B, occurs 64 times)
First: [A] 2010-08-05 16:52:11 UTC [1602]  postgres@mydb
Last:  [B] 2010-08-06 01:18:14 UTC [20981] postgres@mydb
ERROR: syntax error at or near ")" 
STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (?,?,?))
ERROR: syntax error at or near ")"
STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (123,'chocolate','donut'));

[2] From file A (line 12,172)                                                                                                
2010-08-05 12:27:48 UTC [2906] bob@otherdb
ERROR: invalid input syntax for type date: "May" 
STATEMENT: UPDATE personnel SET birthdate='May' WHERE id = 1234;

(plus five other entries)

For the first entry in the above example, we are able to show the complete prefix of the log lines where the error first occurred and where it most recently occurred. The next two lines show the "flattened" version of the query that tail_n_mail uses to group together similar errors. We then show a non-flattened example of an actual query from that group. In this case, someone added an extra closing paren in their application somewhere, which gives the same error each time, although the exact output changes depending on the values used. In the second example, because there is only one match, we don't bother to show the flattened version at all.

So in theory tail_n_mail should now be able to handle any Postgres log you care to throw at it (yes, it can read syslog and csvlog format as well). As my coworker pointed out, parsing log files in this way is something that should probably be abstracted into a common module so other tools like pgsi can take advantage of it as well.

A WordPress Migration Quick Tip

This morning Chris asked me about a WordPress migration issue he was experiencing. Chris dumped the database from the current live server and imported it to another server with a temporary domain assigned and then tried to access the blog. Whenever he would attempt to visit the login (wp-admin) page, he would be redirected to the live domain admin login URL instead of the temporary domain. Luckily, there's a quick fix for this.

The simplified explanation for this is that throughout the WordPress code, there are various places where the base URL is retrieved from the database. There is a table (wp_options by default) that stores option settings, and a function (get_option) to retrieve data from it. You'll see something similar to the following two lines scattered throughout the WordPress source to retrieve the base URL to be used in redirects or link creation. In Chris' case, my guess is that the wp-admin page was attempting to redirect him to the secure login page, which uses one of the examples below to get the base URL.

$home = parse_url( get_option('siteurl') );
$home_path = parse_url(get_option('home'));

If we take a look at the wp_options table for the option_name values home and siteurl immediately after the import, you might see something like the following, where http://www.myawesomesite.com/blog is the live blog.

mysql> SELECT option_name, option_value FROM wp_options WHERE option_name IN ('siteurl', 'home');
+-------------+-----------------------------------+
| option_name | option_value                      |
+-------------+-----------------------------------+
| home        | http://www.myawesomesite.com/blog |
| siteurl     | http://www.myawesomesite.com/blog |
+-------------+-----------------------------------+

To correct any redirect and link issues, you'll just want to update the wp_options values with the following, where http://myawesomesite.temphost.net/blog is the temporary blog location:

UPDATE wp_options SET option_value = 'http://myawesomesite.temphost.net/blog' WHERE option_name IN ('siteurl', 'home');


mysql> SELECT option_name, option_value FROM wp_options WHERE option_name IN ('siteurl', 'home');
+-------------+----------------------------------------+
| option_name | option_value                           |
+-------------+----------------------------------------+
| home        | http://myawesomesite.temphost.net/blog |
| siteurl     | http://myawesomesite.temphost.net/blog |
+-------------+----------------------------------------+

This is just a quick tip that I've learned after working through several WordPress migrations. The developer needs direct access to the database to make this change, either via SSH or a database control panel -- I have not found an easy way to apply this change from the WordPress admin. Also, when the new server goes live, these values must be updated again so redirects point at the regular (non-temphost) domain.