JPEG compression: quality or quantity?

There are many aspects of JPEG files that are interesting to web site developers, such as:

  • The optimal trade-off between quality and file size for a given encoder and uncompressed source image.
  • Reducing the size of an existing JPEG image when the uncompressed source is unavailable, while still finding that same optimal trade-off.
  • Comparison of different encoders and/or settings for quality at a given file size.

Two essential factors are file size and image quality. Bytes are objectively measurable, but image quality is much more nebulous. What to one person is a perfectly acceptable image is to another a grotesque abomination of artifacts. So the quality factor is subjective. For example, Steph sent me some images to compare compression artifacts. Here is the first one with three different settings in ImageMagick: 95, 50, and 8:

Compare the subtle (or otherwise) differences in the following images (mouseover shows the filesize and compression setting):

I think many would find the setting of 8 to have too many artifacts, even though it is 10 times smaller than the image compressed at a setting of 95. Some would find the setting of 50 to be an acceptable trade-off between quality and size, since it sends 3.4 times fewer bytes. Additional comparisons are below; each image can be opened in a separate browser tab for easy A/B comparison.

Here is the code I wrote to make the comparison (shell script is great for this stuff):

#!/bin/bash
HTML_OUTFILE=comparison.html
echo '' > $HTML_OUTFILE

write_img_html () {
    size=`du -h --apparent-size $1 | cut -f 1`
    if [ -n "$2" ]; then
       qual="setting: $2"
    fi
    # The heredoc emits an img tag; the title attribute provides the mouseover
    # text showing the file size and quality setting.
    cat <<EOF >>$HTML_OUTFILE
<img src="$1" title="$size $qual">
EOF
}

for name in image1 image2; do
    orig=$name-original.jpg
    resized=$name-300.png
    
    echo Resizing $orig to 300 on longest side: $resized...
    convert $orig -resize 300x300 $resized 
    write_img_html $resized "lossless"
    
    for quality in 100 95 85 50 20 8 1; do
        echo Creating JPEG quality $quality...
        jpeg=$name-300-q-$quality.jpg
        convert $resized -strip -quality $quality $jpeg
        write_img_html $jpeg $quality
    done
done

Another factor that often comes into play is how artifacts in the image (e.g. aliasing, ringing, noise) combine with JPEG compression artifacts to exacerbate quality problems. So one way to get smaller file sizes is to reduce the other types of artifacts in the image, thereby allowing higher JPEG compression.

The most common source of artifacts is image resizing. If you are resizing images, I strongly recommend using a program with a high-quality resampling filter. IrfanView and ImageMagick are two good choices.

The ideal situation is this:

  • Uncompressed source image
  • Full-resolution if you will be handling the resize
  • Absent artifacts such as aliasing
  • Resize performed with good software like ImageMagick
  • JPEG compression chosen based on subjective quality assessment.

Choosing the trade-off between quality and file size is difficult in part because it varies by image content. Images with lots of small color details (e.g. bright fabric threads; AKA high spatial frequency chroma) tolerate less compression than images that have only medium-sized details without important, minute color information.

One of the settings that is important for small web images is removal of the color space profile (e.g. sRGB). It is only needed when there is a good reason to use a non-sRGB JPEG, such as when you are certain your users have color-managed browsers. Removing it can shave off 5 KB or so; software will assume images without profiles are sRGB. It can be removed with ImageMagick's -strip parameter.
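
As a quick illustration (the file names are placeholders), a resize-and-recompress one-liner that also drops the profile and other metadata might look like this:

# Resize to fit 300x300, drop embedded profiles/metadata, and recompress
convert input.jpg -resize 300x300 -strip -quality 85 output.jpg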

As for choosing the specific compression settings, keep in mind that there are over 30 different types of options/techniques that can be used in compressing the image. Most image programs simplify that to a sliding scale from 0 to 100, 1 to 12, or something else. Keep in mind that even when programs use the same scale (e.g. 0 to 100), they probably have different ideas of what the numbers mean. 95 in one program may be very different than 95 in another.

If bandwidth is not an issue, then I use a setting of 95 in ImageMagick, because in normal images I can't tell the difference between 95 and 100. But when file size is an important concern, I consider 85 to be the optimal setting. In this image, the difference should be clear, but I generally find that cutting the file size in half is worth it. Below 85, the artifacts are too onerous for my taste.

You don't often hear about web site visitors' dissatisfaction with compression artifacts, so you might be tempted to reduce file sizes even beyond the point where it gets noticeable. But I think there is a subliminal effect from the reduced image quality. Visitors may not stop visiting the site immediately, but my gut feeling is it leaves them with a certain impression in their mind or taste in their mouth. I would guess that user testing might result in comments such as "the X web site is not the same high-grade quality as the Y web site", even if they don't put it into words as specific as "the compression artifacts make X look uglier than Y". Even if that pet theory is true, it still has to be balanced against the benefit of faster page loading times.

Ideally, the tradeoff between quality and page loading time would be a choice left to the user. Those who prefer fewer artifacts could set their browser to download larger, less-compressed image files than the default, while users with low bandwidth could set it for more compressed images to get a faster page load at the expense of quality. I could imagine an Apache module and corresponding Firefox add-on some day.

Regarding the situation where you want to reduce the file size of existing JPEGs, my advice is to first try (hard) to get the original source files. You can do better (for any given quality/size trade-off) from those than you can by just manipulating the existing files. If that's not possible, then suboptimal workflows like jpegtran, jpegoptim, or a full decompress/recompress are the only alternatives.
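
For what it's worth, that fallback route can be sketched roughly like this (a hedged example; check your versions' man pages, and treat the file names as placeholders):

# Lossless recompression: optimize Huffman tables and drop metadata
jpegtran -optimize -progressive -copy none existing.jpg > smaller.jpg

# Lossy recompression down to an upper quality bound, modifying the file in place
jpegoptim --strip-all --max=85 existing.jpg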

As far as comparing different encoders, I haven't really looked into that except to compare ImageMagick and Photoshop, where I (subjectively) determined they both had about similar quality for file size (and vice-versa).

Steph also made a video to show the range of compression from 1 to 100:

Here are all the comparison images. The file size and ImageMagick quality setting are in the rollover. I suggest opening images in browser tabs for easy A/B comparison.

MySQL and Postgres command equivalents (mysql vs psql)

Users toggling between MySQL and Postgres are often confused by the equivalent commands to accomplish basic tasks. Here's a chart listing some of the differences between the command line client for MySQL (simply called mysql), and the command line client for Postgres (called psql).

MySQL (using mysql) | Postgres (using psql) | Notes
\c Clears the buffer | \r (same)
\d string Changes the delimiter | No equivalent
\e Edit the buffer with external editor | \e (same) | Postgres also allows \e filename, which will become the new buffer
\g Send current query to the server | \g (same)
\h Gives help - general or specific | \h (same)
\n Turns the pager off | \pset pager off (same) | The pager is only used when needed based on number of rows; to force it on, use \pset pager always
\p Print the current buffer | \p (same)
\q Quit the client | \q (same)
\r [dbname] [dbhost] Reconnect to server | \c [dbname] [dbuser] (same)
\s Status of server | No equivalent | Some of the same info is available from the pg_settings table
\t Stop teeing output to file | No equivalent | However, \o (without any argument) will stop writing to a previously opened outfile
\u dbname Use a different database | \c dbname (same)
\w Do not show warnings | No equivalent | Postgres always shows warnings by default
\C charset Change the charset | \encoding encoding Change the encoding | Run \encoding with no argument to view the current one
\G Display results vertically (one column per line) | \x (same) | Note that \G is a one-time effect, while \x is a toggle from one mode to another. To get the exact same effect as \G in Postgres, use \x\g\x
\P pagername Change the current pager program | Environment variable PAGER or PSQL_PAGER
\R string Change the prompt | \set PROMPT1 string (same) | Note that the Postgres prompt cannot be reset by omitting an argument. A good prompt to use is: \set PROMPT1 '%n@%`hostname`:%>%R%#%x%x%x '
\T filename Sets the tee output file | No direct equivalent | Postgres can output to a pipe, so you can do: \o |tee filename
\W Show warnings | No equivalent | Postgres always shows warnings by default
\? Help for internal commands | \? (same)
\# Rebuild tab-completion hash | No equivalent | Not needed, as tab-completion in Postgres is always done dynamically
\! command Execute a shell command | \! command (same) | If no command is given with Postgres, the user is dropped to a new shell (exit to return to psql)
\. filename Include a file as if it were typed in | \i filename (same)
Timing is always on | \timing Toggles timing on and off
No equivalent | \t Toggles 'tuple only' mode | This shows the data from select queries, with no headers or footers
show tables; List all tables | \dt (same) | Many also use just \d, which lists tables, views, and sequences
desc tablename; Display information about the given table | \d tablename (same)
show index from tablename; Display indexes on the given table | \d tablename (same) | The bottom of the \d tablename output always shows indexes, as well as triggers, rules, and constraints
show triggers from tablename; Display triggers on the given table | \d tablename (same) | See notes on show index above
show databases; List all databases | \l (same)
No equivalent | \dn List all schemas | MySQL does not have the concept of schemas, but uses databases as a similar concept
select version(); Show backend server version | select version(); (same)
select now(); Show current time | select now(); (same) | Postgres will give fractional seconds in the output
select current_user; Show the current user | select current_user; (same)
select database(); Show the current database | select current_database(); (same)
show create table tablename; Output a CREATE TABLE statement for the given table | No equivalent | The closest you can get with Postgres is to use pg_dump --schema-only -t tablename
show engines; List all server engines | No equivalent | Postgres does not use separate engines
CREATE object ... Create an object: database, table, etc. | CREATE object ... Mostly the same | Most CREATE commands are similar or identical. Look up specific help on commands (for example: \h CREATE TABLE)
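
As a quick illustration of a few of the equivalents above, here is a minimal psql session driven from the shell (the database name mydb is a placeholder):

# Toggle timing, and wrap a one-row query in \x to mimic MySQL's one-off \G
psql mydb <<'SQL'
\timing
\x
SELECT now(), current_user, current_database();
\x
SQL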

If there are any commands not listed you would like to see, or if there are errors in the above, please let me know. There are differences in how you invoke mysql and psql, and in the flags that they use, but that's a topic for another day.

Updates: Added PSQL_PAGER and \o |tee filename, thanks to the Davids in the comments section. Added \t back in, per Joe's comment.

jQuery UI Drag Drop Tips and an Ecommerce Example

This week, I implemented functionality for Paper Source to allow them to manage the upsell products, or product recommendations. They wanted a better way to visualize, organize, and select the three upsell products for every product. The backend requirements of this functionality were relatively simple. A new table was created to manage the product upsells.

The frontend requirements were more complex: They wanted to be able to drag and drop products into the desired upsell position (1, 2, or 3). I was allowed a bit of leeway on the interactivity level of the functionality, but the main requirement was to have drag and drop functionality working to provide a more efficient way to manage upsells. A mockup similar to the image shown below was provided at the onset of the project.


The mockup provided did not demonstrate the "interactiveness" of the drag and drop functionality. Items below the current upsells were ordered by cross sell revenue, or the revenue of each related item purchased with the current item.

Since I was familiar with jQuery, I knew that the jQuery UI included drag and drop functionality. I also had heard of several other jQuery drag and drop plugins, but since the jQuery UI is well supported, I was hopeful that the UI would have the functionality that I envisioned needing. Throughout the project, I learned a few valuable tips to consider with drag and drop implementation. To begin development, I downloaded the latest jQuery and UI Core in addition to the draggable and droppable UI components.

Visual Feedback = Helpful

The first thing I learned from working on the drag and drop functionality was that visual feedback is very helpful in interactive design, and that jQuery UI has functionality built in to provide it. The first bit of visual feedback I included was a "clone" helper with semi-opaque styling to show that the object was being dragged. This was accomplished using the following code:

// on page load
$('div.common_item').draggable({
  opacity: 0.7,
  helper: 'clone'
});
// end on page load

And is shown here as the Lake Peace 1.25" Circle Stickers product is dragged:

The second bit of visual feedback I included was adding a class to the droppable item when a draggable item hovered over it. I added the "hoveringover" class to the droppable item, which was defined in the stylesheet to have a different colored background. This was accomplished using the following code:

// on page load
$('tr.upsells td').droppable({
  hoverClass: 'hoveringover'
});
// end on page load

And is shown here as the Shimmer Silver A7 Envelope product hovers above the Quilt on Night with Curry A2 Stationers in upsell position #2:

jQuery UI Event Functionality = Useful

The second tip I learned from working on the drag and drop functionality was that the jQuery drag and drop UI includes valuable event functionality to manage events during the drag and drop process.

By adding the code shown below, at the initiation of dragging, I set a hidden input variable to track which element was being dragged. This value was later used to populate the product upsell form.

// on page load
$('div.common_item').draggable({
  start: function(event, ui) { $('input#is_dragging').val($(this).attr('id')); }
  });
// end on page load

By adding the code shown below, at the conclusion of dragging, I cleared the hidden input variable that indicated which item was being dragged.

// on page load
$('div.common_item').draggable({      
  stop: function(event, ui) { $('input#is_dragging').val(''); }
});
// end on page load

A final event response was added, called when an item is dropped on a droppable element. The do_drop function is called at drop time; it replaces the HTML of the current upsell if the dropped sku differs from the current upsell sku, updates the hidden form element, adds visual feedback by adding a class to show that the item has been replaced, and displays the "Save" and "Revert" options to save to the database or revert the upsell items.

// on page load
$('tr.upsells td').droppable({
  drop: function(event, ui) { do_drop(this); }
});
// end on page load

var do_drop = function(obj) {
  var current_sku = $('input#is_dragging').val();
  if(current_sku != $(obj).find('img').attr('class')) {
    //show "Save" and "Revert" options
    show_drag_form();

    //update hidden form element
    $('input#' + $(obj).attr('id').replace('td_', '')).val(current_sku);

    //replace html and add visual feedback by adding a class to show that the item was replaced
    $(obj).html($('div#' + current_sku).html()).addClass('replaced');     
  }
};

Shown below, the Curry Dots A9 Printable Party Invitations have been replaced with the Olive Natsuki Gel Roller and the background color change signifies the item has been modified.

jQuery UI Documentation and Examples = Awesome

I found the jQuery UI documentation and examples to be very helpful. Another jQuery UI draggable option I used forces draggable items to stay within a region on the page. I contained the elements to the parent table using the following code:

$('div.common_item').draggable({
  containment: 'table#drag_table'
});

The Envelope Liners product is shown below to be confined to the table that contained potential and current upsell products. I could not drag the Envelope Liners any further to the right.

Because the functionality was a backend admin tool, the client requested that the functionality not be over-engineered to work across browsers. I did, however, verify that the drag and drop functionality worked in Firefox, Internet Explorer 7 and 8, Chrome, and Safari with a small amount of styling tweaking.

The final drag-drop JavaScript initiation is similar to the following code:

$(function() {
  $('div.common_item').draggable({
    opacity: 0.7,
    helper: 'clone',
    start: function(event, ui) { $('input#is_dragging').val($(this).attr('id')); },
    stop: function(event, ui) { $('input#is_dragging').val(''); },
    containment: 'table#drag_table'
  });
  $('tr.upsells td').droppable({
    hoverClass: 'hoveringover',
    drop: function(event, ui) { do_drop(this); }
  });
});

Shown below is an example of the product upsell in action for the Chrysanthemum Letterpress Thank You Notes.

Verifying Postgres tarballs with PGP

If you are downloading the Postgres source code tarballs from a mirror, how can you tell if these are the same tarballs that were created by the packagers? You can't really - although they come with a MD5 checksum file, these files are packaged right alongside the tarballs themselves, so it would be easy enough for someone to create an evil tarball along with a new MD5 file. All you could do is perhaps check if the tarball that came from mirror A has a matching checksum file from mirror B, or even the main repository itself.

One way around this is to use PGP (which almost always means GnuPG in the open-source software world) to digitally sign the tarballs. Until the Postgres project gets an official key and starts doing this, one workaround is to at least know the checksums from one single point in time. To that end, I've been digitally signing messages containing the checksums for the tarballs for many years now and posting them to pgsql-announce. You'll need a copy of my public key (0x14964AC8, fingerprint 2529 DF6A B8F7 9407 E944 45B4 BC9B 9067 1496 4AC8) to verify the messages. A copy of the latest announcement message is below.

Note that I've also added a sha1sum for each tarball, as a precaution against relying on a single MD5 checksum (sha1sum does a SHA-1 checksum, naturally). Also note that rather than signing each tarball, I've simply signed a message containing the checksums for each one.
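
To sketch the verification steps (the saved-message filename below is a placeholder, and key retrieval depends on your keyserver configuration):

# Fetch the signing key, verify the signed announcement, then compare checksums by hand
gpg --recv-keys 0x14964AC8
gpg --verify announcement.txt
md5sum postgresql-8.4.2.tar.bz2
sha1sum postgresql-8.4.2.tar.bz2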

While this is far from a fool-proof system, it's much, much better than the status quo, and provides a way for changed tarballs to be detected. If anyone ever finds a mismatch, please let me know (or better yet, email pgsql-general@postgresql.org).

-----BEGIN PGP SIGNED MESSAGE-----                                   
Hash: RIPEMD160                                                      


Source code MD5 and SHA1 checksums for PostgreSQL 
versions 8.4.2, 8.3.9, 8.2.15, 8.1.19, 8.0.23, and 7.4.27

For instructions on how to use this file to verify Postgres 
tarballs, please see:                                       

http://www.gtsm.com/postgres_sigs.html

## Created with md5sum:
1bc9cdc76c6a2a13bd7fdc0f3f53667f  postgresql-8.4.2.tar.gz
d738227e2f1f742d2f2d4ab56496c5c6  postgresql-8.4.2.tar.bz2
4f176a4e7c0a9f8a7673bec99d1905a0  postgresql-8.3.9.tar.gz 
e120b001354851b5df26cbee8c2786d5  postgresql-8.3.9.tar.bz2
a9d97def309c93998f4ff3e360f3f226  postgresql-8.2.15.tar.gz
e6f2274613ad42fe82f4267183ff174a  postgresql-8.2.15.tar.bz2
335d8c42bd6e7522bb310d19d1f9a91b  postgresql-8.1.19.tar.gz 
ba84995e1e2d53b0d750b75adfaeede3  postgresql-8.1.19.tar.bz2
eb35f66d1c49d87c27f2ab79f0cebf8e  postgresql-8.0.23.tar.gz 
1c6fac4265e71b4f314a827ca5f58f6a  postgresql-8.0.23.tar.bz2
77d09f4806bd913820f82abc27aca70e  postgresql-7.4.27.tar.gz 
1fd1d2702303f9b29b5dba1ec4e1aade  postgresql-7.4.27.tar.bz2

## Created with sha1sum:
563caa3da16ca84608e5ff9c487753f3bd127883  postgresql-8.4.2.tar.gz
a617698ef3b41a74fe2c4af346172eb03e7f8a7f  postgresql-8.4.2.tar.bz2
6ee1e384bdd37150ce6fafa309a3516ec3bbef02  postgresql-8.3.9.tar.gz 
5403f13bb14fe568e2b46a3350d6e28808d93a2c  postgresql-8.3.9.tar.bz2
bd803d74bf9aeac756cb69ae6c1c261046d90772  postgresql-8.2.15.tar.gz
4de199b3223dba2164a9e56d998f6deb708f0f74  postgresql-8.2.15.tar.bz2
233a365985a5a636a97f9d1ab4e777418937caed  postgresql-8.1.19.tar.gz 
f1667a64e92a365ae3d46903382648bdc0daa1ba  postgresql-8.1.19.tar.bz2
7783dc54638e044cff3c339d9fd960a9b65a31df  postgresql-8.0.23.tar.gz 
a2c37eb802a4d67bc2508f72035dae6fb29494df  postgresql-8.0.23.tar.bz2
405909d755aa907fc176d22d1b51d6b5704eb3b4  postgresql-7.4.27.tar.gz 
bb35cc844157b8a0d0b2e9e1ab25b6597c82dd1c  postgresql-7.4.27.tar.bz2

- -- 
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200912151528     
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAksoDPgACgkQvJuQZxSWSsikVQCgiE34ycdexL9lwSfZ+TLTZh5m
G3AAnRkazEu/uHLJCNvDZe2cmqCrCkem                                
=HjAS                                                           
-----END PGP SIGNATURE-----

dstat: better system resource monitoring

I recently came across a useful tool I hadn't heard of before: dstat, by Dag Wieers (of DAG RPM-building fame). He describes it as "a versatile replacement for vmstat, iostat, netstat, nfsstat and ifstat."

The most immediate benefit I found is the collation of system resource monitoring output at each point in time, removing the need to look at output from multiple monitors. The coloring helps readability too:

% dstat                                                                         
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--         
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw          
  4   1  92   3   0   0|  56  84k|   0     0 |  94 188B|1264  1369          
  3   7  43  44   1   1| 368  11M| 151 222B|   0   260k|1453  1565          
  3   2  46  48   1   0| 4325784k|   0     0 |   0     0 |1421  1584          
  2   2  47  49   0   0| 592k    0 |   0     0 |   0     0 |1513  1763          
  6   2  44  49   1   0| 448 248k|   0     0 |   0     0 |1398  1640          
  8   4  41  45   3   0| 456k    0 | 135 222B|   0     0 |1530  2102          
 18   4  38  41   0   0| 408 128k|   0    47B|   0     0 |1261  1977          
 10   4  44  43   0   0| 728 208k|   0     0 |   0     0 |1445  2203          
  6   3  39  51   0   0| 648 256k|36074124B|   0     0 |1496  2180          
  7   7  34  53   0   0|1088k    0 |1234 582B|   0     0 |1465  2057          
 14   8  28  49   0   0|2856 104k|   0     0 |   0    52k|1610  2995          
  6   6  43  45   0   0|1992k    0 |59644836B|   0     0 |1493  2391          
  9  14  34  44   0   0|2432 112k|7854 726B|   0     0 |1527  2190          
  9  11  40  41   1   0|2680k    0 |1382 972B|   0     0 |1550  2298          
  5   4  68  22   0   0| 5761096k|  124628B|   0     0 |1522  1731 ^C       

(Textual screenshot captured with the script utility from util-linux and the Perl module HTML::FromANSI.)

Its default one-line-per-timeslice output makes it good for collecting data samples over time, as opposed to full-screen top-like utilities such as atop, which give much more detailed information at each snapshot, but don't show history.
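
For example (a hedged illustration; these are standard dstat arguments), you can set the sample interval and count, and log the samples to CSV for later graphing:

# Sample every 5 seconds, 120 times, writing a CSV copy alongside the screen output
dstat --output dstat.csv 5 120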

Since dstat is a standard package available in RHEL/CentOS and Debian/Ubuntu, it is a reasonably easy add-on to get on various systems.

dstat also allows plugins, and the most recent release, just last month, added new plugins "for showing NTP time, power usage, fan speed, remaining battery time, memcache hits and misses, process count, top process total and average latency, top process total and average CPU timeslice, and per disk utilization rates."
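
To see what your packaged version ships with, the plugin list is a one-liner (assuming the package includes plugin support):

# Enumerate the built-in and external plugins available to this dstat install
dstat --list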

It sounds like it'll grow even more useful over time and is worth keeping an eye on.

Content Syndication, SEO, and the rel canonical Tag

End Point Blog Content Syndication

The past couple of weeks, I've been discussing with Jon whether content syndication of our blog negatively affects our search traffic. Since the blog's inception, full articles have been syndicated by OSNews. So I've been keeping an eye on the effects of content syndication on search to determine what (if any) negative effects we experience.

By my observations, immediately after we publish an article, the article is indexed by Google and appears near the top of the results for a search with keywords similar to the article's title. The next day, the OSNews syndication of the article shows up in the same keyword search, and our article disappears from the results. Then, several days later, our article is ahead of OSNews, as if Google's algorithm has determined the original source of the content. I've provided a visual representation of this behavior:

With content syndication of our blog articles, there is a several-day lag where Google treats our blog article as the duplicate content and returns the OSNews article in results for a search similar to the blog article's title. After this lag, the OSNews article is treated as the duplicate content and our article is shown in the search results.

During the lag time, a search for "google pages indexed seo" (the title keywords of an article I published last Thursday) showed the OSNews copy at search position #5.

After the lag time, the same search returned the original End Point blog article at search position #2.

Several other factors have influenced the lag time, but typically I've seen very similar behavior.

End Point's content syndication has only been an issue with blog articles, since the majority of our new content comes in the form of blog articles. Examples of content syndication in the ecommerce space may include:

  • intra-company content syndication of products across sister sites. For example, our client Backcountry.com sells outdoor gear, while their site RealCyclist targets the road biking niche of the outdoor gear industry. Cycling products are sold on both sites and may compete directly for search engine traffic.
  • syndication of product information through affiliate programs like Commission Junction and AvantLink. Affiliates are paid a small portion of the sales and may target traffic by building supplementary content or communities around content provided by ecommerce sites through the affiliate program.

Cross-Domain rel=canonical Tag

I've been planning to write this article and with impeccable timing, Google announced support for the rel=canonical tag across different domains this week. I've referenced the use of the rel=canonical tag in two articles (PubCon 2009 Takeaways, Search Engine Thoughts), but I haven't gone too much into depth about its use. Support of the rel=canonical tag was introduced early this year as a method to help decrease duplicate content across a single domain. A non-canonical URL that includes this tag suggests its canonical URL to search engines. Search engines then use this suggestion in their algorithms and results to reduce the effects of duplicate content.

<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />

With the cross-domain rel=canonical support announcement, the rel=canonical tag presents another tool to battle duplicate content from content syndication across domains.

Back to Content Syndication

The point of my investigation was to identify whether or not content syndication to OSNews negatively affects our search traffic. The data above suggests that after the brief lag time, Google's algorithm sorts out the source of the original content. The value of exposure, referral traffic, and link juice from OSNews outweighs lost search traffic during this lag time.

In the example of similar product content across backcountry.com's sites, using the rel=canonical tag across domains would allow backcountry.com to suggest prioritization of same product URLs for search results. This may be a valuable tool for directing search traffic to the desired domain.

In the example of content syndication across sites that are not owned by the same company, the use of the rel=canonical tag is more complicated. If the goals of the site that grabs content are to compete directly for search traffic, they would likely not want to use the canonical tag. However, if the goal of the site that grabs content is to focus on search traffic from aggregate content or by building a community around the valuable content, they may be more willing to implement the cross-domain rel=canonical tag to point to the original source of the content. In the case of affiliate programs, I believe it will be difficult to negotiate the cross-domain rel=canonical tag use into existing or future contracts.

The takeaways:

  • Content syndication of our blog does not cause negative long term effects on search. This should be monitored for sites that may have much different behavior than the data I provided above.
  • The announcement of support of the cross-domain rel=canonical tag may be helpful for battling duplicate content across sites, especially to sites owned by the same company.
  • The use of the cross-domain rel=canonical tag in affiliate programs or through sites owned by different companies will be trickier to negotiate.

Editing large files in place

Running out of disk space seems to be an all too common problem lately, especially when dealing with large databases. One situation that came up recently was a client who needed to import a large Postgres dump file into a new database. Unfortunately, they were very low on disk space and the file needed to be modified. Without going into all the reasons, we needed the databases to use template1 as the template database, and not template0. This was a very large, multi-gigabyte file, and the amount of space left on the disk was measured in megabytes. It would have taken too long to copy the file somewhere else to edit it, so I did a low-level edit using the Unix utility dd. The rest of this post gives the details.

To demonstrate the problem and the solution, we'll need a disk partition that has little-to-no free space available. In Linux, it's easy enough to create such a thing by using a RAM disk. Most Linux distributions already have these ready to go. We'll check it out with:

$ ls -l /dev/ram*
brw-rw---- 1 root disk 1,  0 2009-12-14 13:04 /dev/ram0
brw-rw---- 1 root disk 1,  1 2009-12-14 22:27 /dev/ram1

From the above, we see that there are some RAM disks available (there are actually 16 of them on my box, but I only showed two). Here are the steps to create a usable partition from /dev/ram1 and then check its size:

$ mkdir /home/greg/ramtest

$ sudo mke2fs /dev/ram1
mke2fs 1.41.4 (27-Jan-2009)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
4096 inodes, 16384 blocks
819 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=16777216
2 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
        8193

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

$ sudo mount /dev/ram1 /home/greg/ramtest

$ sudo chown greg:greg /home/greg/ramtest

$ df -h /dev/ram1
Filesystem            Size  Used Avail Use% Mounted on
/dev/ram1              16M  140K   15M   1% /home/greg/ramtest

First we created a new directory to serve as the mount point, then we used the mke2fs utility to create a new filesystem (ext2) on the RAM disk at /dev/ram1. It's a fairly verbose program by default, but there is nothing in the output that's really important for this example. Then we mounted our new filesystem on the directory we just created. Finally, we reset the permissions on the directory so that an ordinary user (e.g. 'greg') can read and write to it. At this point, we've got a directory/filesystem that is just under 16 MB (we could have made it even closer to 16 MB by specifying -m 0 to mke2fs, but the exact size doesn't matter).

To simulate what happened, let's create a database dump and then bloat it until it takes up all available space:

$ cd /home/greg/ramtest

$ pg_dumpall > data.20091215.pg

$ ls -l data.20091215.pg
-rw-r--r-- 1 greg greg 3685 2009-12-15 10:42 data.20091215.pg

$ dd seek=3685 if=/dev/zero of=data.20091215.pg bs=1024 count=99999
dd: writing 'data.20091215.pg': No space left on device
13897+0 records in
13896+0 records out
14229504 bytes (14 MB) copied, 0.0814188 s, 175 MB/s

$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/ram1              16M   15M     0 100% /home/greg/ramtest

First we created the dump, then we found its size and told dd via the 'seek' argument to start adding data at the 3685 byte mark (in other words, we appended to the file). We used the special file /dev/zero as the 'if' (input file) and our existing dump as the 'of' (output file). Finally, we told it to write in 1024-byte chunks and to attempt to add 99,999 of those chunks. Since that is approximately 100 MB, we ran out of disk space quickly, as intended. The filesystem is now at 100% usage and will refuse any further writes.

To recap, we need to change the first three instances of template0 to template1. Let's use grep to view the lines:

$ grep --text --max-count=3 template data.20091215.pg
CREATE DATABASE greg WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';

We need the --text argument here because grep correctly surmises that we've changed the file from text-based to binary with the addition of all those zeroes on the end. We also used the handy --max-count argument to stop processing once we've found the lines we want. Very handy argument when the actual file is gigabytes in size!

There are two major problems with using a normal text editor to change the file. First, the file (in the real situation, not this example!) was very, very large. We only needed to edit something at the very top of the file, so loading the entire thing into an editor is very inefficient. Second, editors need to save their changes somewhere, and there just was not enough room to do so.

Attempting to edit with emacs gives us: emacs: IO error writing /home/greg/ramtest/data.20091215.pg: No space left on device

An attempt with vi gives us: vi: Write error in swap file on startup. "data.20091215.pg" E514: write error (file system full?)

Although emacs gives the better error message (why is vim making a guess and outputting some weird E514 error?), the advantage always goes to vi in cases like this as emacs has a major bug in that it cannot even open very large files.

What about something more low-level like sed? Unfortunately, while sed is more efficient than emacs or vim, it still needs to read the old file and write out a new one. We can't do that writing as we have no disk space! More importantly, there is no way in sed (that I could find, anyway) to tell it to stop processing after a certain number of matches.

What we need is something *really* low-level. The utility dd comes to the rescue again. We can use dd to truly edit the file in place. Basically, we're going to overwrite some of the bytes on disk, without needing to change anything else. First though, we have to figure out exactly which bytes to change. The grep program has a nice option called --byte-offset that can help us out:

$ grep --text --byte-offset --max-count=3 template data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';

This tells us the offset for each line, but we want to replace the number '0' in 'template0' with the number '1'. Rather than count it out manually, let's just use another Unix utility, hexdump, to help us find the number:

$ grep --text --byte-offset --max-count=3 template data.20091215.pg | hexdump -C
00000000  33 30 31 3a 43 52 45 41  54 45 20 44 41 54 41 42  |301:CREATE DATAB|
00000010  41 53 45 20 67 72 65 67  20 57 49 54 48 20 54 45  |ASE greg WITH TE|
00000020  4d 50 4c 41 54 45 20 3d  20 74 65 6d 70 6c 61 74  |MPLATE = templat|
00000030  65 30 20 4f 57 4e 45 52  20 3d 20 67 72 65 67 20  |e0 OWNER = greg |
00000040  45 4e 43 4f 44 49 4e 47  20 3d 20 27 55 54 46 38  |ENCODING = 'UTF8|
...

Each line is 16 characters, so the first three lines come to 48 characters; then we add two for the 'e0', subtract four for the '301:', and get 301+48+2-4=347. We subtract one more as we want to seek to the point just before that character, and we can now use our dd command:

$ echo 1 | dd of=data.20091215.pg seek=346 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.00012425 s, 8.0 kB/s

Instead of an input file (the 'if' argument), we simply pass the number '1' via stdin to the dd command. We use our calculated seek, tell it to copy a single byte (bs=1), one time (count=1), and (this is very important!) tell dd NOT to truncate the file when it is done (conv=notrunc). Technically, we are sending two characters to the dd program, the number one and a newline, but the bs=1 argument ensures only the first character is being copied. We can now verify that the change was made as we expected:
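
As an aside, a printf-based variant avoids sending the trailing newline at all (a minor alternative sketch, not what I ran at the time):

# printf emits exactly one byte here, so bs=1/count=1 has nothing extra to discard
printf '1' | dd of=data.20091215.pg seek=346 bs=1 count=1 conv=notrunc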

$ grep --text --byte-offset --max-count=3 TEMPLATE data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template0 OWNER = greg ENCODING = 'UTF8';

Now for the other two entries. From before, the magic number is 45, so we now add 380 to 45 to get 425. For the third line, the name of the database is 1 character longer so we add 459+45+1 = 505:

$ echo 1 | dd of=data.20091215.pg seek=425 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000109234 s, 9.2 kB/s

$ echo 1 | dd of=data.20091215.pg seek=505 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000109932 s, 9.1 kB/s

$ grep --text --byte-offset --max-count=3 TEMPLATE data.20091215.pg
301:CREATE DATABASE greg WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
380:CREATE DATABASE rand WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';
459:CREATE DATABASE sales WITH TEMPLATE = template1 OWNER = greg ENCODING = 'UTF8';

Success! On the real system, the database was loaded with no errors, and the large file was removed. If you've been following along and need to cleanup:

$ cd ~
$ sudo umount /home/greg/ramtest
$ rmdir ramtest

Keep in mind that dd is a very powerful and thus very dangerous utility, so treat it with care. It can be invaluable for times like this however!

Live by the sword, die by the sword

In an amazing display of chutzpah, Monty Widenius recently asked on his blog for people to write to the EC about the takeover of Sun by Oracle and its effect on MySQL, saying:

I, Michael "Monty" Widenius, the creator of MySQL, is asking you urgently to help save MySQL from Oracle's clutches. Without your immediate help Oracle might get to own MySQL any day now. By writing to the European Commission (EC) you can support this cause and help secure the future development of the product MySQL as an Open Source project.

"Help secure the future development"? Sorry, but that ship has sailed. Specifically, when MySQL was sold to Sun. There were many other missed opportunities over the years to keep MySQL as a good open source project. Some of the missteps:

  • Bringing in venture capitalists
  • Selling to Sun instead of making an IPO (Initial Public Offering)
  • Failing to check on the long-term health of Sun before selling to them
  • Choosing the proprietary dual-licensing route
  • Making the documentation have a restricted license
  • Failing to acquire InnoDB (which instead was bought by Oracle)
  • Failing to acquire SleepyCat (which was instead bought by Oracle)
  • Spreading FUD about the dual license and twisting the GPL in novel and dubious ways

Also interesting is some of the related blog posters and pundits, who seem to think that MySQL has some sort of special mystical quality that requires it be 'saved'. Sorry, but the business world and the open source world are both harsh ecosystems, where today's market leader can become tomorrow's has-been. For all those who are bemoaning MySQL's fate (especially those directly involved in selling this dual-licensed project for money), I offer a quote: "live by the sword, die by the sword". Not that MySQL is dead yet, but it's been dealt quite a number of near-fatal blows, and I'm not convinced that all the forks, spinoffs, well-wishers, and ex-developers can fix that. Should be interesting times ahead.

Multiple links to files in /etc

I came across an unfamiliar error in /var/log/messages on a RHEL 5 server the other day:

Dec  2 17:17:23 X restorecond: Will not restore a file with more than one hard link (/etc/resolv.conf) No such file or directory

Sure enough, ls showed the inode pointed to by /etc/resolv.conf having 2 links. What was the other link?

# find /etc -samefile resolv.conf
/etc/resolv.conf
/etc/sysconfig/networking/profiles/default/resolv.conf
# ls -lai /etc/resolv.conf /etc/sysconfig/networking/profiles/default/resolv.conf
1526575 -rw-r--r-- 2 root root 69 Nov 30  2008 /etc/resolv.conf
1526575 -rw-r--r-- 2 root root 69 Nov 30  2008 /etc/sysconfig/networking/profiles/default/resolv.conf

I've worked with a lot of RHEL/CentOS 5 servers and hadn't ever dealt with these network profiles. Kiel guessed it was probably a system configuration tool that we never use, and he was right: Running system-config-network (part of the system-config-network-tui RPM package) creates the hardlinks for the default profile.

/etc/hosts gets the same treatment as /etc/resolv.conf.

I suppose SELinux's restorecond doesn't want to apply any context changes because its rules are based on filesystem paths, and the paths of the multiple links are different and could result in conflicting context settings.

Since we don't use network profiles, we can just delete the extra links in /etc/sysconfig/networking/profiles/default/.
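
A hedged sketch of that cleanup (paths as shown above; verify the link count afterwards):

# Remove the profile copies; the inodes remain reachable via the /etc paths
rm /etc/sysconfig/networking/profiles/default/resolv.conf
rm /etc/sysconfig/networking/profiles/default/hosts
# Link count should now be back to 1
stat -c '%h %n' /etc/resolv.conf /etc/hosts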

List Google Pages Indexed for SEO: Two Step How To

Whenever I work on SEO reports, I often start by looking at pages indexed in Google. I just want a simple list of the URLs indexed by the *GOOG*. I usually use this list to get a general idea of navigation, look for duplicate content, and examine initial counts of different types of pages indexed.

Yesterday, I finally got around to figuring out a command line solution to generate this desired indexation list. Here's how to do it from the command line, using http://www.endpoint.com/ as an example:

Step 1

Grab the search results using the "site:" operator and make sure you run an advanced search that shows 100 results. The URL will look something like: http://www.google.com/search?num=100&as_sitesearch=www.endpoint.com

But it will likely have lots of other query parameters of lesser importance [to us]. Save the search results page as search.html.

Step 2

Run the following command:

sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' search.html | grep LINK | sed 's/<a href="\|" LINK//g' 

There you have it. Interestingly enough, the order of pages can be an indicator of which pages rank well. Typically, pages with higher PageRank will be near the top, although I have seen some strange exceptions. End Point's indexed pages:

http://www.endpoint.com/
http://www.endpoint.com/clients
http://www.endpoint.com/team
http://www.endpoint.com/services
http://www.endpoint.com/sitemap
http://www.endpoint.com/contact
http://www.endpoint.com/team/selena_deckelmann
http://www.endpoint.com/team/josh_tolley
http://www.endpoint.com/team/steph_powell
http://www.endpoint.com/team/ethan_rowe
http://www.endpoint.com/team/greg_sabino_mullane
http://www.endpoint.com/team/mark_johnson
http://www.endpoint.com/team/jeff_boes
http://www.endpoint.com/team/ron_phipps
http://www.endpoint.com/team/david_christensen
http://www.endpoint.com/team/carl_bailey
http://www.endpoint.com/services/spree
...

For the site I examined yesterday, I saved the pages as one.html, two.html, three.html and four.html because the site had about 350 results. I wrote a simple script to concatenate all the results:

#!/bin/bash

rm results.txt

for ARG in $*
do
        sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' $ARG | grep LINK | sed 's/<a href="\|" LINK//g' >> results.txt
done

And I called the script above with:

./list_google_index.sh one.html two.html three.html four.html

This solution isn't scalable nor is it particularly elegant. But it's good for a quick and dirty list of pages indexed by the *GOOG*. I've worked with the WWW::Google::PageRank module before and there are restrictions on API request limits and frequency, so I would highly advise against writing a script that makes requests to Google repeatedly. I'll likely use the script described above for sites with less than 1000 pages indexed. There may be other solutions out there to list pages indexed by Google, but as I said, I was going for a quick and dirty approach.


Remember not to get eaten by the Google Monster

Learn more about End Point's technical SEO services.

CakePHP Infinite Redirects from Auto Login and Force Secure

Lately, Ron, Ethan, and I have been blogging about several of our CakePHP learning experiences, such as incrementally migrating to CakePHP, using the CakePHP Security component, and creating CakePHP fixtures for HABTM relationships. This week, I came across another blog-worthy topic while troubleshooting for JackThreads that involved auto login, requests that were forced to be secure, and infinite redirects.


Ack! Users were experiencing infinite redirects!

The Problem

Some users were seeing infinite redirects. The following use cases identified the problem:

  • Auto login true, click on link to secure or non-secure homepage => Whammy: Infinite redirect!
  • Auto login false, click on link to secure or non-secure homepage => No Whammy!
  • Auto login true, type in secure or non-secure homepage in new tab => No Whammy!
  • Auto login false, type in secure or non-secure homepage in new tab => No Whammy!

So, the problem boiled down to an infinite redirect when auto login customers clicked to the site through a referer, such as a promotional email or a link to the site.

Identifying the Cause of the Problem

After I applied initial surface-level debugging without success, I decided to add excessive debugging to the code. I added debug statements throughout:

  • the CakePHP Auth object
  • the CakePHP Session object
  • the app's app_controller beforeFilter that completed the auto login
  • the app's component that forced a secure redirect on several pages (login, checkout, home)

I output the session id and request location with the following debug statement:

$this->log($this->Session->id().':'.$this->here.':'.'/*relevant message about whatsup*/', LOG_DEBUG);

With the debug statement shown above, I was able to compare the normal and infinite redirect output and identify a problem immediately:

normal output
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/:     User does not exist!
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/:     Success in auto login!
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/:     redirecting to /sale
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/sale: User exists!
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/sale: calling action!
infinite redirect output
2009-12-09 11:43:30 Debug: 65cb23e4ca358b7270513cca4a52e9b7:/:      User does not exist!
2009-12-09 11:43:30 Debug: 65cb23e4ca358b7270513cca4a52e9b7:/:      Success in auto login!
2009-12-09 11:43:30 Debug: 65cb23e4ca358b7270513cca4a52e9b7:/:      redirecting to /sale
2009-12-09 11:43:30 Debug: 397f099790347716e0bc58c73f23358d:/sale:  User does not exist!
2009-12-09 11:43:30 Debug: 397f099790347716e0bc58c73f23358d:/sale:  redirecting to /login
2009-12-09 11:43:30 Debug: 0dfee15a4295b26aad115ae37d470d30:/login: User does not exist!
2009-12-09 11:43:30 Debug: 0dfee15a4295b26aad115ae37d470d30:/login: Success in auto login!
2009-12-09 11:43:30 Debug: 0dfee15a4295b26aad115ae37d470d30 /login: redirecting to /sale
2009-12-09 11:43:31 Debug: 3f23b7f7bead5d23fd006b6d91b1d195:/sale:  User does not exist!
2009-12-09 11:43:31 Debug: 3f23b7f7bead5d23fd006b6d91b1d195:/sale:  redirecting to /login
...

What I immediately noticed was that sessions were dropped at every redirect on the infinite redirect path. So I researched a bit and found a few helpful resources.

As it turns out, the Security.level configuration affected the referer check for redirects. The CakePHP Session object set the referer_check to HTTP_HOST if Security.level was equal to 'high' or 'medium'. A couple of the resources recommended adjusting the Security.level to 'low', which sounded like a potential solution. But I wasn't certain that this was the cause of the redirect, so I tested several changes to verify the problem.

First, I tested Security.level at 'high', 'medium', and 'low'. With the Security.level set to 'low', the infinite redirect would not happen and the debug log would show a consistent session id. Next, I commented out the code in the CakePHP Session object that set the referer_check and set the Security.level to 'high'. This also seemed to fix the infinite redirect, although it wasn't ideal to make changes to the core CakePHP code. Finally, I changed $this->host to HTTPS_HOST instead of HTTP_HOST in the CakePHP Session object, so that the referer would be checked against the secure host rather than the non-secure host. This also fixed the infinite redirect, but again, it wasn't ideal to change the core CakePHP code.

I concluded that the secure redirect to the homepage or login page coupled with the auto login caused this infinite redirect. As pages were redirected between /login and /sale, the session (that stored the auto logged in user) was dropped since the referer check against HTTP_HOST failed.

The Solution

In an ideal world, I would like to see both HTTP_HOST and HTTPS_HOST included in the CakePHP referer check. But because we didn't want to edit the CakePHP core, I investigated the effect of changing the Security.level on the app:

Security.level == high
- session timeout is multiplied by a factor of 10
- cookie lifetime is set to 0
- config timeout is set
- inactiveMins is equal to 10

Security.level == medium
- session timeout is multiplied by a factor of 100
- cookie lifetime is set to 7 days
- inactiveMins is equal to 100

Security.level == low
- session timeout is multiplied by a factor of 300
- cookie lifetime is set to 788940000s
- inactiveMins is equal to 300

Security.level is not set
- session timeout is multiplied by a factor of 10
- cookie lifetime is set to 788940000s
- inactiveMins is equal to 300

I provided this information to the client and let them decide which scenario met their business needs. For this situation, I recommended commenting out the Security.level configuration so that the session timeout would stay the same, but the cookie lifetime and inactiveMins values would increase.

This was an interesting learning experience that helped me understand a bit more about how CakePHP handles sessions. It also gave me exposure to referer checks in PHP, which I haven't dealt with much in the past.

Cisco PIX mangled packets and iptables state tracking

Kiel and I had a fun time tracking down a client's networking problem the other day. Their scp transfers from their application servers behind a Cisco PIX firewall failed after a few seconds, consistently, with a connection reset.

The problem was easily reproducible with packet sizes of 993 bytes or more, not just with TCP but also ICMP (bloated ping packets, generated with ping -s 993 $host). That raised the question of how this problem could go undetected for their heavy web traffic. We determined that their HTTP load balancer avoided the problem as it rewrote the packets for HTTP traffic on each side.

Kiel narrowed the connection resets down to iptables' state tracking considering packets INVALID rather than ESTABLISHED or RELATED, as they should have been.

Then he found via tcpdump that the problem was easily visible in scp connections when TCP window scaling adjustments were made by either side of the connection. We tried disabling window scaling but that didn't help.

We tried having iptables allow packets in state INVALID when they were also ESTABLISHED or RELATED, and that reduced the frequency of terminated connections, but still didn't eliminate them entirely. (And it was a kludge we weren't eager to keep in place anyway.)

We wanted to avoid some unpleasant possibilities: (1) turn off stateful firewalling or (2) perform risky updates or configuration changes on the Cisco PIX, which may or may not fix the problem, in the middle of the busy holiday ecommerce season.

Finally, Kiel found this netfilter mailing list post which describes how to enable a Linux kernel workaround for the mangled packets the Cisco generates:

echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

Of course, we saved that in /etc/sysctl.conf so it persists after a reboot.
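
A sketch of persisting it (the sysctl key mirrors the /proc path above; on newer kernels the conntrack settings live under net.netfilter instead):

# Append the setting to sysctl.conf and apply it now (run as root)
echo 'net.ipv4.netfilter.ip_conntrack_tcp_be_liberal = 1' >> /etc/sysctl.conf
sysctl -p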

So we have reliable long-running scp connections with TCP window scaling working and iptables doing its job. I love it when a plan comes together.

Iterative Migration of Legacy Applications to CakePHP

As Steph noted, we recently embarked on an adventure with a client who had a legacy PHP app. The app was initially developed in rapid fashion, with changing business goals along the way. Some effort was made at the outset with this vanilla PHP app to put key business logic in classes, but as often happens over time the cleanliness of those classes degraded. While much of the business rules and state management (i.e. database manipulation, session wrangling, authentication/access-control, etc.) were kept separate from the "views" (the PHP entry pages), the classes themselves became tightly coupled, overburdened with myriad responsibilities, etc.

This was a far cry from the stereotypical spaghetti PHP app, but nevertheless it needed some reorganization; all but the smallest changes inevitably required touching a wide range of classes and pages, and the code would only grow more brittle unless some serious refactoring took place.

We determined at the outset that getting the application moved into an established MVC framework would be of great benefit, and further determined that CakePHP would be a good choice. (This is the point where anybody reading will inevitably ask in comments "Why CakePHP instead of My Preferred Awesome Framework?" Sigh.) The client agreed. The question became: how do we get there from here?

I spent some time investigating and inevitably came across the well-regarded three-part blog series:

  1. Converting Legacy Apps to CakePHP Part I
  2. Part II
  3. Part III

(The author of that series has a book out on the subject, as well.)

For somebody new to MVC application design, especially in the PHP space, the series (and presumably the related book) probably makes for pretty good reading. They present a decent approach to how the refactoring of legacy code can be accomplished. However, the series also appears to operate under the assumption that you're in a scrap-and-rebuild situation: the legacy app can essentially go nowhere for a few weeks while it gets gutted into CakePHP.

As noted in a review of the related book, the rebuild-it-all assumption doesn't apply to many real world situations. The more money your application makes, the more users it affects, the larger the feature set, the more likely it is that the business cannot afford to have an application sit in a code freeze while an entire rewrite takes place.

We ultimately opted for a different approach: iteratively migrate to CakePHP. The simplicity of the basic PHP paradigm makes this remarkably easy.

The basic steps:

  1. Rearrange the legacy application so it runs "within" CakePHP, with the CakePHP dispatcher handling the request but ultimately invoking the original legacy view
  2. Adjust the legacy code so that it gets its database handle(s) from CakePHP rather than creating them internally, uses CakePHP's session, and so on
  3. New development can proceed within CakePHP; legacy logic can be refactored into CakePHP over time as the opportunity presents itself (or the situation demands)

Getting the application to run within CakePHP in this manner does not require that much effort. Of course, this would depend on your situation, but in the traditional model of presentation-oriented code relying on some business objects and a database, it works out. For the initial step:

  1. Prepare a basic CakePHP application
  2. Pull the legacy code into the CakePHP webroot, with the legacy pages moved under a new legacy/ subdirectory
  3. Prepare a "legacy" action in the default PageController that maps the requested URI path to a path relative to the legacy/ directory, then invokes the file living at that path
  4. Set up a new catch-all route that invokes this legacy action (a sketch follows below)
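As a rough sketch, and assuming CakePHP 1.2/1.3-style routing with the "legacy" action living on the PageController mentioned above (adjust the names to match your app), the catch-all route in app/config/routes.php might look like:

    // app/config/routes.php -- hypothetical catch-all route; keep it LAST so
    // that any real CakePHP routes defined above it still match first.
    // 'page' corresponds to the PageController described in this post.
    Router::connect('/*', array('controller' => 'page', 'action' => 'legacy'));
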
After these steps, you have CakePHP fronting your legacy app, but otherwise not doing much else. A snippet of our code that deals with pulling in a legacy app page in this manner:
    function includeLegacyPage($path = null) { 
        // map the path passed in or from the request to the legacy/ subdirectory
        $cakeRequestPath = $path ? $path : $this->controller->params['url']['url'];
        $path = WWW_ROOT . 'legacy/' . $cakeRequestPath;

        // This just maps input arguments to globals
        $this->prepareGlobals(array('cakeRequestPath' => $cakeRequestPath));

        // Resolve directories to an index.php page as necessary
        if (is_dir($path)) {
            if(substr($path, -1) != '/') 
                $path .= '/'; 
            $path .= 'index.php';
        }

        if (!file_exists($path)) {
            // render an error page and bail out instead of trying to include a missing file
            $this->controller->render('error');
            return;
        }

        try {
            // buffer PHP output
            ob_start();

            // this "invokes" the legacy page and gathers its content
            include $path;

            // pull in the buffered content
            $this->controller->output = ob_get_contents();

            // stop output buffering
            ob_end_clean();
        } catch (JackExceptionRedirect $e) {
            // We adjusted the legacy app's redirect functions to throw a custom exception
            // class that we catch here, so we can use CakePHP's native redirection
            ob_end_clean();
            $this->controller->redirect($e->location, $e->getCode(), false);
        } catch (Exception $e) {
            // All other errors propagate up, but don't leave the output buffer open
            ob_end_clean();
            throw $e;
        }

        $this->controller->autoRender = false;
        $this->controller->autoLayout = false;
    }
  
Our PageController's "legacy" action uses the above routine to pull in the legacy page.
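
For illustration, that action can stay very thin. A minimal sketch, assuming the routine above lives in a hypothetical LegacyPage component that keeps a reference to its controller (our naming, not CakePHP's):

    // app/controllers/page_controller.php -- hypothetical excerpt
    class PageController extends AppController {
        var $components = array('LegacyPage');

        function legacy() {
            // Delegate to includeLegacyPage(), which maps the requested URL
            // to a file under webroot/legacy/ and captures its output
            $this->LegacyPage->includeLegacyPage();
        }
    }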

The second step, of getting CakePHP to control the session, the database handle, etc., involves some minor hacks. They don't feel elegant. They go outside the MVC pattern. But they provide the crucial glue necessary to put CakePHP in charge.

  • Make the controller's session available from a global; adjust legacy code to use it instead of direct use of the PHP session. This means that CakePHP controls the session configuration.
  • Make the CakePHP database handle available from a global as well; adjust your legacy database initialization code so it simply uses the global handle from CakePHP. Now CakePHP controls your database configuration, and CakePHP and the legacy code will use the same handle in a given request.
  • And so on and so forth.
For instance:
        // Expose CakePHP's default database connection and Session component
        // to the legacy code as plain globals via prepareGlobals()
        App::import('ConnectionManager');
        $standard_globals = array(
            'cakeDbh'       => ConnectionManager::getDataSource('default')->connection,
            'cakeSession'   => $this->Session
        );

        $this->prepareGlobals($standard_globals);
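
On the legacy side, the corresponding adjustment is mostly mechanical. A hypothetical sketch (the helper and variable names here are made up for illustration; the point is that the legacy code pulls the database handle and session from the globals instead of building its own):

    // hypothetical legacy helper, after adjustment
    function legacy_db_handle() {
        global $cakeDbh;        // populated by prepareGlobals() above
        return $cakeDbh;        // instead of calling mysql_connect() itself
    }

    function legacy_current_user_id() {
        global $cakeSession;    // CakePHP's SessionComponent
        return $cakeSession->read('user_id');   // instead of reading $_SESSION directly
    }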
  

Up until now, CakePHP's introduction into the mix hasn't added value. Having reached this point, however, you're ready to start taking advantage of CakePHP. From here, we refactored our special "legacy" action logic into a new "LegacyPage" component so any controller/action could use the mechanism. Then we were able to:

  • Refactor legacy user authentication logic to use CakePHP's Auth core component
  • Refactor various legacy pages to be fronted by CakePHP controller actions, moving the high-level flow control (input validation, user validation, and associated redirects) out of the legacy page and into the controller. This simplifies the legacy page (making it more strictly presentational) and puts flow control where it belongs (see the sketch after this list).
  • For a new feature involving new data structures, develop a new CakePHP component to implement the business operations, plus new controllers/actions for the new functionality, and adjust some legacy code to get its data from the new component rather than from direct database calls or legacy classes
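
As a concrete illustration of the second point above, fronting a legacy page might look roughly like this (a sketch only: the checkout page, the session key, and the LegacyPage component are assumptions based on the description above, not code lifted from the app):

    // Hypothetical controller action fronting an existing legacy page
    function checkout() {
        // High-level flow control now lives in the controller: e.g. make sure
        // the visitor has a cart before handing off to the legacy page
        if (!$this->Session->check('Cart.id')) {
            $this->redirect(array('controller' => 'carts', 'action' => 'view'));
            return;
        }

        // ...while the legacy page itself is reduced to presentation
        $this->LegacyPage->includeLegacyPage('checkout.php');
    }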

So, what are the advantages of this approach, versus a slash-and-burn rewrite-it-all approach?

  • We reach a point where we're tangibly benefiting from CakePHP with a minimal investment of time and money; contrast that with the expense of rewriting the entire application before the business sees any benefit
  • While we proceeded in this work, the client was actively developing their legacy system; there was no need for a code freeze, and reconciling their changes with our work was fairly trivial; one git rebase took care of it (though I admittedly missed a couple things during the rebase, which we caught and fixed with some spot-checking).
  • No repeating of oneself: by making the entire legacy application available within the context of the target framework, we don't need to spend cycles rewriting existing functionality; the do-it-from-scratch approach would, by contrast, require reimplementation of everything
  • We can refactor the legacy code in a prioritized, iterative fashion: refactor the most important stuff first, and the less important stuff later.
  • We can partially refactor specific pieces of legacy code, such as removing business/data logic from pages such that legacy pages become more like views in the MVC triad; we're not forced to redo an entire legacy subsystem to improve the code organization
  • The legacy work that is solid and doesn't need much refactoring stays put, and is usable from the rest of the CakePHP application

We may well get to the point (in late 2010, perhaps) when all legacy code has been refactored into CakePHP's MVC architecture. Or perhaps not: the business has to balance competing priorities, and it may ultimately be that some aspects of the legacy code just don't get refactored because they aren't especially broken and the business need simply doesn't come up. That's part of the beauty here: we don't have to make that decision right now; we can let the real-world priorities make that decision for us over time.

It's easy to imagine an engineer finding this less attractive than a redo-it-in-my-favorite-framework-du-jour approach. It reeks of compromise. Yet, from a business standpoint the advantages are hard to dispute. From a technical standpoint, they're hard to dispute as well: faster, shorter cycles of development bring a higher likelihood of success, particularly for small teams (or lone individuals); the management of change is much simpler with iterative design; the iterative approach is arguably less prone to second-system effect than is a rewrite; etc.

This asks more of the engineer than does a ground-up rewrite in Framework X. So many modern frameworks positively shine with possibility; the engineer lusts for the opportunity to Do It Right, and falls prey to the fallacy that the framework will solve all their problems given that Done Right investment. But whatever the features and community offerings may be, the most obvious thing a modern framework gives us is simply better organization of our code.

The iterative approach gets us there with far less risk and, in many cases, far more naturally than the rewrite-it-all approach, but it asks us to have the patience to move in small steps. It asks that we have the mental room and rigor to envision what the Done Right system might look like, as well as the long chain of interim steps taking us from here to there. In exchange, it delivers value much faster and at lower cost, reduces redundant work, and gives us the opportunity to change direction as we go. Consequently, for many -- even most -- business situations the iterative transformation is the system Done Right.