End Point

Welcome to End Point's blog

Ongoing observations by End Point people.

Google Base and Spree

Several clients I've worked with have integrated Google Base, but without access to the data I've always wondered how much Google Base actually impacts traffic.

After the recent development of a Google Base Spree extension, we've been able to measure the significant impact Google Base has on search engine traffic. The extension was installed for one of our Spree clients. See the data below:


Search engine traffic (y-axis is hits where the referring site was a search engine). Google Base was installed on March 24th.


Other referral traffic during the same time period (y-axis is hits where the referring site was not a search engine and traffic was not direct).

Although the extension is relatively simple and has room for improvement (missing product brands, product quantity), you can see how much traffic was impacted. Hopefully with future extension improvements, traffic will increase even more.

Learn more about End Point's SEO and analytics services.

Rails and SEO advantages

In today's climate, search engine optimization is a must to be competitive. Rails routing gives you an edge here, and much more.

Descriptive, content-packed URLs afford your website better search rankings because they provide clear context about what the page contains. Using keywords in the filename goes even further. Under normal circumstances, without advanced configuration, a web page filename is rigid and fixed. This isn't a problem in itself, except that it doesn't help with SEO one bit.

Having multiple URLs linking to the same page opens more doors to search engine crawlers. Generally, once indexed correctly, this means more access paths into your site, which in turn results in a greater variety and volume of traffic.

In most other environments, you would need an Apache rewrite rule to accomplish this. The rule below detects digits in a filename and passes them along as a parameter to another dynamically generated page:

RewriteRule ^/.*([0-9]+).*$ /index.php?i=$1 [R=301,L]

This rule is probably too greedy a match, but it serves to illustrate the point. With that rule in place, any request containing at least one number will be forwarded along to the index.php handler. Then, by using a sitemap or just by modifying the existing link structure, you can spell out multiple descriptive, relevant URLs and increase the number of ways into your site. Not only will the quantity of links improve, but more importantly, so will the quality.

Rails does it a little differently. Apache generally deals with files and for the most part isn't aware of application dynamics. This is where Rails routing comes in. Rails is MVC-oriented; each controller is composed of one or more methods, or actions in Rails terminology. The URI is typically broken down into its constituent parts in the following fashion:

map.connect ":controller/:action/:id"

With Rails routing, you can specify how the elements of the URL are passed along to the correct controller and action, along with variables that can be used in your code:

map.connect "music/:category/:year/:month",
            :controller   => "events",
            :action       => "show",
            :requirements => { 
                :year  => /(19|20)\d\d/,
                :month => /[01]\d/,
            },

As these examples show, Rails is a powerful tool for building websites, well beyond its SEO advantages. The point for SEO, though, is that you can specify as many alternate, keyword-rich pathways into your site as you like. You can accomplish a lot with Apache, but with Rails you can accomplish much more. If your application has a dynamic category set that you want accessible via the URI, Rails is ideal: not only can categories be represented, but products and product descriptions can easily be translated into the URI and then propagated out to the search engine indexes.

Learn more about End Point's Ruby on Rails development and technical SEO services.

Inside PostgreSQL -- Multi-Batch Hash Join Improvements

A few days ago a patch was committed to improve PostgreSQL's performance when hash joining tables too large to fit into memory. I found this particularly interesting, as I was a minor participant in the patch review.

A hash join is a way of joining two tables where the database partitions each table, starting with the smaller one, using a hash algorithm on the values in the join columns. It then goes through each partition in turn, joining the rows from the first table with those from the second that fell in the same partition.

Things get more interesting when the set of partitions from the first table is too big to fit into memory. As the database partitions a table, if it runs out of memory it has to flush one or more partitions to disk. Then when it's done partitioning everything, it reads each partition back from disk and joins the rows inside it. That's where the "Multi-Batch" in the title of this post comes in -- each partition is a batch. The database chooses the smaller of the two tables to partition first to help guard against having to flush to disk, but it still needs to use the disk for sufficiently large tables.

In practice, there's one important optimization: after partitioning the first table, even if some partitions are flushed to disk, the database can keep some of the partitions in memory. It then partitions the second table, and if a row in that second table falls into a partition that's already in memory, the database can join it and then forget about it. It doesn't need to read in anything else from disk, or hang on to the row for later use. But if it can't immediately join the row with a partition already in memory, the database has to write that row to disk with the rest of the partition it belongs to. It will read that partition back later and join the rows inside. So when the partitions of the first table get too big to fit into memory, there are performance gains to be had if it intelligently chooses which partitions go to disk. Specifically, it should keep in memory those partitions that are more likely to join with something in the second table.

How, you ask, can the database know which partitions those are? Because it has statistics describing the distribution of data in every column of every table: the histogram. Assume it wants to join tables A and B, as in "SELECT * FROM A JOIN B USING (id)". If B.id is significantly skewed -- that is, if some values show up noticeably more frequently than others -- PostgreSQL can tell by looking at its statistics for that column, assuming we have an adequately large statistics target on the column and have analyzed the table appropriately. Using the statistics, PostgreSQL can determine approximately what percentage of the rows in B have a particular value in the "id" column. So when deciding which partition to flush to disk while partitioning table A, PostgreSQL now knows enough to hang on to those partitions containing values that show up most often in B.id, resulting in a noticeable speed improvement in common cases.
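To see roughly where these pieces live, here is a small sketch (the table names a and b, the database name mydb, and the work_mem setting are all assumptions) of how you might inspect column skew and watch the multi-batch case happen from a shell:

# Peek at the planner's idea of skew in b.id: the most common values and their
# estimated frequencies come straight from the column's statistics.
psql -d mydb -c "SELECT most_common_vals, most_common_freqs
                   FROM pg_stats
                  WHERE tablename = 'b' AND attname = 'id';"

# If the sample looks too coarse, raise the per-column statistics target and
# re-analyze so the histogram is rebuilt with more detail.
psql -d mydb -c "ALTER TABLE b ALTER COLUMN id SET STATISTICS 100;"
psql -d mydb -c "ANALYZE b;"

# Shrinking work_mem forces the hash join into multiple batches; the Hash node
# in EXPLAIN ANALYZE output reports the bucket and batch counts it used.
psql -d mydb -c "SET work_mem = '1MB';
                 EXPLAIN ANALYZE SELECT * FROM a JOIN b USING (id);"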

End Point: Search Engine Bot Parsing

I've talked to several coworkers before about bot parsing, but I've never gone into much detail about why I support search engine bot parsing. When I say bot parsing, I mean applying regular expressions to access log files to record distinct visits by search engine bots. Data such as the URL visited, exact date and time, HTTP response, bot, and IP address is collected. Here is a visual representation of bot visits (y-axis is hits).
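To make "applying regular expressions to access log files" concrete, a first pass can be as simple as the following sketch (the log path, combined log format, and bot names are assumptions; a real parser would also capture the response code and timestamp):

# Count Googlebot hits per URL in an Apache combined-format access log.
grep -i 'Googlebot' /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -nr | head -20

# The same idea works per bot: tally visits by user agent substring.
egrep -io 'Googlebot|Slurp|msnbot|Teoma' /var/log/apache2/access.log | sort | uniq -c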

And here are the top ten reasons why search engine bot parsing should be included in search engine optimization efforts:

#10: It gives you the ability to study search engine bot behavior. What is bot behavior after 500 error responses to a URL? What IP addresses are the bots coming from? Do bots visit certain pages on certain days? Do bots really visit JS and CSS files?

#9: It can be used as a teaching tool. Already, I have discussed certain issues from data generated by this tool and am happy to teach others about some search engine behavior. After reading this post, you will be much more educated in bot crawling!!

#8: It gives you the ability to compare search engine bot behavior across different search engines. From some of the sites I've examined, the Yahoo bot has been visiting much more frequently than Googlebot, msnbot, and Teoma (Ask.com). I will follow up this observation by investigating which URLs are getting crawled by Yahoo so much more frequently.

#7: It can help identify where 302s (temporary redirects) are served when 301s (permanent redirects) should be served. For example, Spree has a couple of old domains that are 302-ing occasionally. We now have the visibility to identify these issues and remediate them.

#6: It gives you the ability to study bot crawling behavior across different domains. Today, I was asked if there was a metric for a "good crawl rate". I'm not aware of one, but comparing data across different domains can certainly give you enough context to determine where to focus search engine optimization efforts if you are divided between several domains.

Website #1

Website #2

#5: It gives you the ability to determine how often your entire site is crawled. I previously developed a bot parsing tool that compared crawled URLs against the URLs included in the sitemap. It provided metrics of how often 100% of a site was crawled, or even how often 50% of a site was crawled. Perhaps only 95% of your site has ever been crawled - this tool can help identify which pages are not getting crawled. This data is also relevant because as the bots deem your content more "fresh", they will visit more often. "Freshness" is an important factor in search engine performance.

#4: It gives you the ability to correlate search engine optimization efforts with changes in bot crawling. Again, the goal is to increase your bot visits. If you begin working on a search engine marketing campaign, the bot crawl rate over time will be one KPI (key performance indicator) to measure the success of the campaign.

#3: It gives you the immediate ability to identify crawlability issues such as 500 or 404 responses, or identify the frequency of duplicate content being crawled. Many other tools can provide this information as well, but it can be important to distinguish bot behavior from user behavior.

#2: It provides a benchmark for bot crawling. Whether you are implementing a search engine marketing campaign or simply making other website changes, this data can serve as a good benchmark. If a website change immediately causes bot crawling problems, you can identify the problem before finding out a month later as search engine results start to suffer. Or, if a website change causes an immediate increase in bot visibility, keep it up!

And the #1 reason to implement bot parsing is...
"cuz robots are cool", Aleks. No explanation necessary.

Learn more about End Point's technical SEO services.

Generating sample text automatically

It's a classic problem: you have a template to test, or a field constraint to test, and you need a large block of text. Designers and developers have come up with many ways to generate this sample data. My favorite is the classic 'Lorem Ipsum' Latin text used by typesetters for hundreds of years. For a long time I've just copy-and-pasted it, but I just happened to find a really cool Lorem Ipsum generator, complete with the ability to specify character length, paragraph number, word count, or even make a bulleted list. Simple and stylish, and easy for less technical collaborators to use. Hit the link for some fascinating history.

I'm sure many of you have your own methods. Share yours in the comments! Extra points for creative shell one-liners.
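To get the ball rolling, here's one rough take in the shell (the seed word list, the 60-word length, and the 72-column wrap are arbitrary, and shuf assumes GNU coreutils):

words="lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"
# Pick 60 words at random (with repetition) and wrap them into a paragraph.
for i in $(seq 1 60); do
  echo "$words" | tr ' ' '\n' | shuf -n 1
done | paste -s -d ' ' - | fold -s -w 72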

Ack, grep for Developers

A relatively new tool in my kit that I've come to use very frequently over the last 6 months or so is Ack. Notwithstanding that it is written in my preferred development language, and is maintained by a developer active in the Perl community working on some important projects, like TAP, it really does just save typing while producing Real Purdy output. I won't go so far as to say it completely replaces grep, at least not without a learning curve and especially for people doing more "adminesque" tasks, but as a plain old developer I find its default set of configuration and output settings incredibly efficient for my common tasks. I'd go into the benefits myself, but I think the "Top 10 reasons to use ack instead of grep" from the ack site pretty much covers it. To highlight a few here:
  1. It's blazingly fast because it only searches the stuff you want searched.
  2. Searches recursively through directories by default, while ignoring .svn, CVS and other VCS directories. Which would you rather type?
    • $ grep pattern $(find . -type f | grep -v '\.svn')
    • $ ack pattern
  3. ack ignores most of the crap you don't want to search
    • VCS directories
    • blib, the Perl build directory
    • backup files like foo~ and #foo#
    • binary files, core dumps, etc
  4. Ignoring .svn directories means that ack is faster than grep for searching through trees.
  5. Lets you specify file types to search, as in --perl or --nohtml. Which would you rather type? (Note that ack's --perl also checks the shebang lines of files without suffixes, which the find command will not.)
    • $ grep pattern $(find . -name '*.pl' -or -name '*.pm' -or -name '*.pod' | grep -v .svn)
    • $ ack --perl pattern
  6. File-filtering capabilities usable without searching with ack -f. This lets you create lists of files of a given type.
  7. Color highlighting of search results.
Note that there are actually 13 on their list, but I eliminated the one about that one OS, the last is basically just for humor, and a couple are mainly relevant only to Perl users/developers. The next time you need to search a development tree for a particular subroutine, library name, etc., give Ack a try.
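For instance, a couple of typical invocations (the subroutine name and directory here are made up):

# List every Perl file under lib/ without actually searching anything.
ack -f --perl lib/

# Find a subroutine across the tree, matching it as a whole word.
ack -w 'sub process_order' lib/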

Git commits per contributor one-liner

Just for fun, in the Spree Git repository:

git log | grep ^Author: | sed 's/ <.*//; s/^Author: //' | sort | uniq -c | sort -nr
    813 Sean Schofield
     97 Brian Quinn
     81 Steph (Powell) Skardal
     42 Jorge Calás Lozano
     37 paulcc
     27 Edmundo Valle Neto
     16 Dale Hofkens
     13 Gregg Pollack
     12 Sonny Cook
     11 Bobby Santiago
      8 Paul Saieg
      7 Robert Kuhr
      6 pierre
      6 mjwall
      6 Eric Budd
      5 Fabio Akita
      5 Ben Marini
      4 Tor Hovland
      4 Jason Seifer
      2 Wynn Netherland
      2 Will Emerson
      2 spariev
      2 ron
      2 Ricardo Shiota Yasuda
      1 Yves Dufour
      1 yitzhakbg
      1 unknown
      1 Tomasz Mazur
      1 tom
      1 Peter Berkenbosch
      1 Nate Murray
      1 mwestover
      1 Manuel Stuefer
      1 Joshua Nussbaum
      1 Jon Jensen
      1 Chris Gaskett
      1 Caius Durling
      1 Bernd Ahlers

She sells C shells by the seashore

In contrast to my previous post on tabs in vim, here's a different way of managing multiple files, multiple SQL console sessions, multiple nearly anything. This works with any program that behaves well with regard to Unix job control, and really allows Unix to shine as an IDE in its own right. The emphasis here will be on using whatever tools are suitable to do the job, rather than on one specific editor. Note that the details given here will not work very well for network programs that assume a constant connection, like an IRC client. However, at least the Postgres and MySQL consoles both support this feature, and they're the only "networked" applications I can imagine using in this way. This post will focus more on a way of thinking than on technical know-how, though there is a bit of how-to mixed in.

Most readers are familiar with backgrounding a task at a Unix terminal with ^Z and then bg. Something that is less common, at least in my experience (in favor of GNU screen and the like), is using shell job control for anything more than detaching a running program from one's terminal. When applied liberally, this tactic allows one to harness the power of Unix all via one login.

To better picture this workflow, consider a scenario that many MVC developers have experienced: extending all three parts of an existing application. This setup comprises at least five different tasks:
  • modifying the model, either directly through an RDBMS console or indirectly via an ORM mapper;
  • modifying the view, which typically involves editing template files of some flavor;
  • modifying the controller, which typically involves editing code files;
  • restarting the application;
  • testing the application, which may involve watching log files;
  • and perhaps modifying code files that comprise either the model or the view of the MVC implementation.
Many users would either log in multiple times and start editors / RDBMS consoles in multiple shells, or use a terminal multiplexer like GNU Screen to achieve the same goal. These are both respectable and venerable means of achieving the same task, each with pros and cons. I trust that people reading this are familiar enough with both methods to grasp what these pros and cons are. However, in my experience, I have found that using shell job control is the fastest and most efficient way to manage this sort of workflow. The general idea works something like this:
  1. Start every program one anticipates needing in this session. This includes text editors, RDBMS consoles, logfile watchers, etc.
  2. Suspend (^Z) each program as it starts and then start the next. This sets everything up for bringing programs to the foreground when needed.
  3. As one works, foreground each task when needed, suspending it again when not yet finished with it.
  4. Remember to hit ^Z rather than quitting the programs.
  5. Repeat until the desired goal is achieved.
Now, this is quite handy, but typing fg 2 and the like is rather annoying and requires one to pay attention to job numbers. I thought IDEs were supposed to make one's job easier? There's a handy sequence for bringing jobs to the foreground that bash, zsh, and tcsh all support: to foreground a job matching a specific string, one can use %?string, where the string is a unique bit of the job's command line -- file names work well here. Given this, it's not too much of a stretch to start each job, ^Z it, and then bring it to the foreground when needed. In this way, one can use an editor when needed, restart the application when needed, view log files when needed, etc., all from one shell.

This is not the sort of thing where a demo is useful. Workflows vary among developers, and this is something that everyone must adapt to their own way of doing things. When a colleague shared the idea with me, its simple but powerful elegance surprised me -- the same sort of simple, powerful elegance I see everywhere in Unix.
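Still, purely to show the mechanics rather than the workflow, here is roughly what one such session might look like (the file names, database name, and log path are all made up):

$ vim app/controllers/events_controller.rb    # edit the controller, then press ^Z
$ psql mydb                                    # poke at the model, then ^Z again
$ tail -f log/development.log                  # watch the logs, then ^Z once more
$ jobs                                         # list everything currently parked
$ fg %?psql                                    # jump back to the SQL console
$ fg %?events                                  # or back to the editor, by file name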

End Point SEO with Linkscape

Linkscape, released in October of 2008, is SEOmoz's collection of index data from the web; it currently contains 36 billion URLs over 225 million domains. You must have a pro membership to access advanced reporting, but without one you can access basic data such as mozRank (SEOmoz's own logarithmic metric for page popularity) for the URL, the number of links to the URL, the number of domains linking to the URL, and mozRank for the domain.

For example, I ran a basic report on www.google.com and found:
- The mozRank of http://www.google.com/ is 9.36 out of 10
- There are ~96.8 million links to http://www.google.com/
- There are ~1.6 million domains linking to http://www.google.com/

More interesting data on www.facebook.com:
- The mozRank of http://www.facebook.com/ is 7.40 out of 10
- There are 0.9 million links to http://www.facebook.com/
- There are 60,000 domains linking to http://www.facebook.com/

Because I haven't done justice to describing Linkscape here, please read more about Linkscape, or see the Linkscape Comic for a visual introduction.

Case Study
In an effort to examine and improve End Point's search engine performance, I pulled together some snippets of data from Linkscape for End Point and End Point's blog after getting a pro membership.

www.endpoint.com:
- The mozRank of http://www.endpoint.com/ is 5.24 out of 10
- There are 24,084 links to http://www.endpoint.com/
- There are 189 domains linking to http://www.endpoint.com/

Top 5 most common anchor text phrases to www.endpoint.com:
- "End Point Corporation" from 86 links over 36 domains
- BLANK from 16 links over 4 domains
- "DESIGNED BY END POINT CORPORATION" from 10 links over 1 domain
- "End Point Corp." FROM 18 links over 5 domains
- "Endpoint" from 19 links over 2 domains

blog.endpoint.com:
- The mozRank of http://blog.endpoint.com/ is 4.88 out of 10
- There are 220 links to http://blog.endpoint.com/
- There are 8 domains linking to http://blog.endpoint.com

Top 5 most common anchor text phrases to blog.endpoint.com:
- "Blog" from 10 links over 1 domain
- "End Point blog" from 31 links over 4 domains
- "Home" from 10 links over 1 domain
- "End Point blog" - http blog.endpoint.com from 1 link over 1 domain
- "Jon Jensen" from 7 links over 1 domain

Here are a few points I'd like to mention as an initial reaction to the data:

#1:
The End Point site was registered in October of 1995 and the blog in July of 2008, yet the mozRank (~popularity) of the blog has built up considerably in less than a year. Why? Linkscape reports that the most important links to blog.endpoint.com come from www.endpoint.com, so in the short time that the blog has been in existence, much of its value has been passed along from www.endpoint.com. The remainder of the links to blog.endpoint.com come from sites like osnews.com, cryptography.mesogunus.com, blogenius.com, and perl.coding-school.com.

On the other hand, www.endpoint.com has many external links passed from client websites spread over 189 domains.

This emphasizes the fact that having high popularity external links can significantly influence page popularity (www.endpoint.com passing to blog.endpoint.com in this case).

When explaining this to a fellow End Pointer (Jon), he also commented, "based on [the data], if nothing external changes, it's pretty much impossible for the blog to overtake www.endpoint.com in ranking" - another good point to realize. We can continue to work on improving www.endpoint.com's popularity and it will continue to pass along value to blog.endpoint.com.

#2:
Google Analytics shows that traffic to www.endpoint.com is lacking in terms related to the services End Point provides, such as "ecommerce", "postgres", and "interchange". The Linkscape reports can help explain why. Linkscape provides a list of the 50 most common anchor text phrases for a URL, derived from the 3,000 most important links to that URL. Of those 50 phrases, more than half contain some variant of "End Point Corporation" and fewer than 5 contain terms related to the services that we offer. Although this is not surprising, it highlights an opportunity for End Point to target service-related keywords. We may not have direct control over all of our external links, but we can request enhancements to the existing anchor text or the alt text of images.

#3:
Linkscape also points out where anchor text is blank or lacking in relevant keywords. Some examples include images without alt text and anchor text like 'work', 'at work', 'my employer', or 'open this site in another window'. Again, we may not have complete control over external link anchor text, but we can try to address these missed opportunities for passing link value through relevant anchor text.

Conclusion:
Ultimately, End Point does not have control over all external links or the anchor text in them. However, we will try to address some of the issues mentioned and revisit the data in a couple of months (Linkscape data is refreshed monthly) in hopes of improving our search engine performance. SEOmoz and Linkscape provide tons (!) of other data related to a wide variety of search engine optimization topics. I'm very excited to have access to this tool and hope to provide more snippets of data in the future.

Learn more about End Point's technical SEO services.

VIM Tip of the Day: tabbed editing

How many tabs does your browser have open? I have 17 tabs open in Firefox presently (and opened / closed about 12 while writing this post). Most users will agree that tabs have changed the way they use the Web. Even IE, which has spawned a collection of shells for tabbed browsing, now supports it natively. Tabs allow for a great saving in screen real-estate, and in many cases, better interaction among the various open documents. Considering how much time programmers spend in their text editors, it therefore seems logical that the editor should provide the same functionality.

And the VIM developers agree. Although VIM calls them "tab-pages", the functionality is there, waiting to be used. Before reading any further, ensure that your VIM supports tabs. You can do this on anything resembling Unix by running vim --version | fgrep '+windows', which checks for the required windows feature. If you don't see any output, check your vendor's packaging system for something like 'vim-full'. If you don't have a VIM available with the windows feature, go get one and come back.

Now that you can use tabs, let's get started. One way to open tabs is via the command line. vim uses the -p option to determine how many tab-pages to open. This functions like the -[oO] options for windows in that it accepts a numeric argument, but defaults to one tab for each specified file. Let's propose a hypothetical situation: I want to compare the implementation of has() in both Moose and Mouse, so it would be convenient to have the two files open in two tab-pages in VIM. Presuming I'm in the same directory as both files, I would open vim with a tab-page for each file like so: vim -p Moose.pm Mouse.pm.

Open a couple of files this way and notice the bar at the top of the terminal: that's the tab bar. It has every open tab, plus an X at the far right. That's great and all, but how does one use this? There are a number of ways, depending upon your configuration. First, there are the basic tab-page commands:
  • gt Go to the next tab (to the right), cycling back to the first.
  • gT Go to the previous tab (to the left), cycling back to the last.
  • {count}gt Go to tab number {count}.
These commands work in both command-line VIM (for those of you working on remote machines) and in GVIM (for those of you writing new code). Additionally, GVIM and VIMs that recognize mouse input on a terminal (:help mouse-using) recognize clicks on the tabs or the X in the upper right. This tip has touched on the very basics of tab-pages, but the information covered here is enough to be useful (and is all I use on a regular basis). However, the curious should definitely read the VIM reference manual for more info, with :help tabpage.
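One handy combination with -p, for what it's worth (the pattern and paths here are just examples): let grep pick the files and open each match in its own tab.

vim -p $(grep -rl 'sub has' Moose/ Mouse/)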

VIM Tip of the Day: running external commands in a shell

A common sequence of events when editing files is to make a change and then need to test by executing the file you edited in a shell. If you're using vim, you could suspend your session (ctrl-Z), and then run the command in your shell.

That's a lot of keystrokes, though.

So, instead, you could use vim's built-in "run a shell command"!

:!{cmd} Runs a shell command, shows you the output, and prompts you before returning to your current buffer.

Even sweeter is to use the vim special character for the current filename: %

Here's ':! %' in action!

A few more helpful shortcuts related to executing things in the shell:

  • :! By itself, runs the last external command (from your shell history)
  • :!! Repeats the last command
  • :silent !{cmd} Eliminates the need to hit enter after the command is done
  • :r !{cmd} Puts the output of {cmd} into the current buffer

Interchange jobs caveat

I'd used Interchange's jobs feature to handle sending out email expirations and re-invites for a client. However, I found out the hard way that scratch variables persist between the individual sub-jobs in a job set. I'd tested each of the two sub-jobs in isolation and had had no issues.

This bit me because I'd assumed each job component was run in isolation, with variables initialized to sensible (i.e. empty) content. In my case it fortunately only affected the reporting of each piece of the job system, but it definitely could have affected larger pieces of the system.

The lessons? 1) Always explicitly initialize your variables; you don't know the ultimate context they'll be run in. 2) Individual component testing is no substitute for testing a system as a whole; the latter can reveal bugs that would otherwise slip through.

Emacs Tip of the Day: ediff-revision

I recently discovered a cool feature of emacs: M-x ediff-revision. This launches the excellent ediff-mode with the configured version control system's notion of revision spelling. In my case, I wanted to compare all changes between two git branches introduced several commits ago relative to each branch's head.

M-x ediff-revision prompted for a filename (defaulting to the current buffer's file) and two revision arguments, which in vc-git's case ends up being anything recognized by git rev-parse. So I was able to provide the simple revisions master^ and otherbranch^{4} and have it Do What I Mean™.

I limited the diff hunks in question to those matching specific regexes (different for each buffer) and was able to quickly and easily verify that all of the needed changes had been made between each of the branches.

As usual, C-h f ediff-revision is a good jumping off point for finding more about this useful editor command, as is C-h f ediff-mode for finding more about ediff-mode in general.

pg_controldata

PostgreSQL ships with several utility applications to administer the server life cycle and clean up in the event of problems. I spent some time lately looking at what is probably one of the least well known of these, pg_controldata. This utility dumps out a number of useful tidbits about a database cluster, given the data directory it should look at. Here's an example from a little-used 8.3.6 instance:

josh@eddie:~$ pg_controldata
pg_control version number:            833
Catalog version number:               200711281
Database system identifier:           5291243377389434335
Database cluster state:               in production
pg_control last modified:             Mon 09 Mar 2009 04:05:23 PM MDT
Latest checkpoint location:           0/B70E5B9C
Prior checkpoint location:            0/B70E5B5C
Latest checkpoint's REDO location:    0/B70E5B9C
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/307060
Latest checkpoint's NextOID:          37410
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Fri 06 Mar 2009 02:27:02 PM MST
Minimum recovery ending location:     0/0
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        2000
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           en_US.UTF-8
LC_CTYPE:                             en_US.UTF-8

I can't claim to speak with authority on all these data, but I'll leave it as an exercise for the reader to determine the meaning of those that appear most captivating. One of pg_controldata's more interesting features is that it doesn't have to actually connect to anything; it reads everything from the disk. That means you can use it on databases in the middle of WAL recovery, even though you can't actually query the recovering database. The check_postgres.pl script uses this unique capability to make inferences about the health of a WAL replica, specifically by making sure checkpoints happen fairly regularly. pg_controldata requires only one argument, the data directory of the PostgreSQL instance you're interested in, and even that only if you haven't already set the PGDATA environment variable.
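For example, a minimal sketch of that kind of replica check from the shell (the data directory path is an assumption, and the comparison against the current time is left as a manual step or a small cron job):

export PGDATA=/var/lib/postgresql/8.3/main
# Pull just the checkpoint timestamp out of the control data; on a healthy
# WAL replica this should keep advancing as restored WAL triggers checkpoints.
pg_controldata | grep 'Time of latest checkpoint'
# Compare that timestamp with "date"; if the gap grows well past your
# checkpoint_timeout plus some slack, the replica is probably stuck.
date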

Scout barcode artistry

Once upon a time, UPC barcodes had to be pretty large for the barcode readers to work. That made the barcode roughly square. Some years ago newer standards came out and the barcodes were still the same width to maintain compatibility, but they could now be shorter, presumably because scanning technology had improved.

The combination of packaging that had space allocated for a tall barcode with the new reality that barcodes didn't have to be tall was an invitation to creativity, as evidenced by the barcode on the box of this Scout Pinewood Derby kit I noticed today:

I love it -- and the barcode is the only place on the box that the Scout emblem appears at all! Here it is closer up:

Has anyone seen this kind of artistically subverted but still functional barcode anywhere else?

Apache RewriteRule to a destination URL containing a space

Today I needed to do a 301 redirect for an old category page on a client's site to a new category which contained spaces in the filename. The solution to this issue seemed like it would be easy and straightforward, and maybe it is to some, but I found it to be tricky, as I had never escaped a space in the destination of an Apache RewriteRule.

The rewrite rule needed to rewrite:

/scan/mp=cat/se=Video Games

to:

/scan/mp=cat/se=DS Video Games

I was able to get the first part of the rewrite rule quickly:

^/scan/mp=cat/se=Video\sGames\.html$

The issue was figuring out how to properly escape the space in the destination. A literal space, %20, and \s all failed to work properly. Jon Jensen took a look and suggested a standard Unix escape of '\ ', and that worked. Sometimes a solution is right under your nose and is obvious once you step back or ask for help from another engineer. Googling for the issue did not turn up such a simple solution, thus the reason for this blog post.

The final rule:

RewriteRule ^/scan/mp=cat/se=Video\sGames\.html$ http://www.site.com/scan/mp=cat/se=DS\ Video\ Games.html [L,R=301]
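A quick way to confirm the redirect behaves as intended (www.site.com is the placeholder domain from the rule above):

# Request the old URL (space encoded as %20, as a browser would send it) and
# check for the 301 plus a Location header pointing at the new category.
curl -sI 'http://www.site.com/scan/mp=cat/se=Video%20Games.html' | head -n 5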

Passenger and SELinux

We recently ran into an issue when launching a client's site using Phusion Passenger: it would not function with SELinux enabled. It turned out to be a problem with Apache not being allowed to read and write the Passenger sockets. In researching the issue we found another engineer had reported the problem, and there was discussion about adding the ability to configure where the sockets are placed. That would allow someone to place the sockets in a directory other than /tmp, set the context on that directory so that sockets created within it inherit the same context, and then grant httpd the ability to read and write sockets with that specific context. This is a win over granting httpd the ability to read/write all sockets in /tmp, since many other services place their sockets there and you may not want httpd to be able to touch those.

End Point had planned to take on the task of patching Passenger and submitting the patch. While collecting information about the issue this morning to pass to Max, I found this in the issue tracker for Passenger:

Comment 4 by honglilai, Feb 21, 2009 Implemented.

Status: Fixed
Labels: Milestone-2.1.0

Excellent! We'll be testing this internally soon and will post a new blog entry with our solution for Passenger + SELinux. Thanks to the Passenger engineers for taking the request seriously and working on an update with the PassengerTempDir configuration directive included.
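Once a release with PassengerTempDir is in hand, the setup we have in mind looks roughly like this sketch (the directory and the SELinux context type are assumptions; check your distribution's policy before copying it):

# Keep Passenger's sockets out of /tmp, in a directory with a context httpd
# is already allowed to use, and make the labeling persistent.
mkdir -p /var/run/passenger
semanage fcontext -a -t httpd_var_run_t '/var/run/passenger(/.*)?'
restorecon -Rv /var/run/passenger

# Then point Passenger at it in the Apache configuration:
#   PassengerTempDir /var/run/passenger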