Setting up a login form in a controller other than the Users controller in CakePHP: don't forget the User model

I ran into an issue today while setting up a login form on the front page of a site that would post to the login action of the Users controller. When the form was posted, the App controller's beforeFilter was called and the Users controller's beforeFilter was called, but the login action of the Users controller was never reached; all we got was a blank template with the normal debugging output. No errors were output, so there wasn't much to go on. The problem ultimately turned out to be that the Home controller serving the form did not include the following line to load the User model:

var $uses = array('User');
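For context, here's a minimal sketch of what the controller serving the form might look like (the class and action names are hypothetical, not our client's actual code):

class HomeController extends AppController {
    var $name = 'Home';
    // Without this, the Auth component never sees the posted User data
    // and silently stops processing the login request.
    var $uses = array('User');

    function index() {
        // Renders the view containing the login form, which posts to /users/login
    }
}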

Surprisingly, within our view we were able to set up forms to work with the User model. But when the Auth component checked the post for User data, it found none and stopped processing the request. This was not a graceful way for the Auth component or CakePHP to handle the request; an error message would have helped track down the issue.

XZ compression

XZ is a new free compression file format that is starting to be more widely used. The LZMA2 compression method it uses first became popular in the 7-Zip archive program, with an analogous Unix command-line version called 7z.

We used XZ for the first time in the Interchange project in the Interchange 5.7.3 packages. Compared to gzip and bzip2, the file sizes were as follows:

interchange-5.7.3.tar.gz   2.4M
interchange-5.7.3.tar.bz2  2.1M
interchange-5.7.3.tar.xz   1.7M

That tighter compression comes at a cost: compressing takes about 4 times longer than bzip2. As a bonus, though, decompression is about 3 times faster than bzip2. The combination of significantly smaller file sizes and faster decompression made it a clear win for distributing software packages, and it is the format used for packages in Fedora 12.

It's also easy to use on Ubuntu 9.10, via the standard xz-utils package. When you install that with apt-get, aptitude, etc., you'll get a scary warning about it replacing lzma, a core package, but this is safe to do because xz-utils provides compatible replacement binaries /usr/bin/lzma and friends (lzcat, lzless, etc.). There is also built-in support in GNU tar with the new --xz aka -J options.
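Here are a few example invocations (assuming the xz-utils package and a reasonably recent GNU tar are installed; the file names are just placeholders):

# Compress and decompress a single file
xz somefile.tar          # produces somefile.tar.xz
xz -d somefile.tar.xz    # or: unxz somefile.tar.xz

# Create and extract a .tar.xz archive directly with GNU tar
tar -cJf project.tar.xz project/
tar -xJf project.tar.xz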

Dropped sessions when Ask.com Toolbar is installed

We've been dealing with an issue on a client's site where customers were reporting that they could not log in, and that when they added items to their cart the cart would come up empty. This pointed toward the customer's session being dropped, but we were unable to determine the common thread across these customers' environments and came up empty-handed. It was a case of being unable to reproduce a problem, which made it nearly impossible to fix.

This morning on the Interchange users list there was a post from Racke discussing a similar issue. His customer had the Ask.com toolbar installed, and Interchange's robot matching code was mistakenly identifying the Ask.com toolbar as a search spider. The user agent of a browser with the Ask.com toolbar installed looks like this:

"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; msn OptimizedIE8;ENUS; AskTB5.6)"

A quick look at the current robots.cfg that Steven Graham linked showed that 'AskTB' had been added to the NotRobotUA directive, which instructs Interchange not to treat AskTB as a search spider, thus allowing proper use of sessions on the site.
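The entry amounts to something like this (shown here in abbreviated form; check the robots.cfg shipped with a current Interchange release for the exact syntax and the full list of user agents):

NotRobotUA   AskTB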

Updating the robots.cfg on our client's site allowed users with the Ask.com toolbar to browse, log in, and check out as expected. Those with Interchange sites who see reports of similar issues should consider a false positive spider match a possibility and update their robots.cfg.

WordPress Plugin for Omniture SiteCatalyst

A couple of months ago, I integrated Omniture SiteCatalyst into an Interchange site for one of End Point's clients, CityPass. Shortly after, the client added a blog to their site, which is a standalone WordPress instance that runs separately from the Interchange ecommerce application. I was asked to add SiteCatalyst tracking to the blog.

I've had some experience with WordPress plugin development, and I thought this was a great opportunity to develop a plugin to abstract the SiteCatalyst code from the WordPress theme. I was surprised by how few Omniture WordPress plugins were available, so I'd like to share my experience with a brief tutorial on building a WordPress plugin to integrate Omniture SiteCatalyst.

First, I created the base plugin file to append the code near the footer of the WordPress theme. This file must live in the ~/wp-content/plugins/ directory. I named the file omniture.php.

  <?php /*
    Plugin Name: SiteCatalyst for WordPress
    Plugin URI: http://www.endpoint.com/
    Version: 1.0
    Author: Steph Powell
    */
    function omniture_tag() {
    }
    add_action('wp_footer', 'omniture_tag');
  ?>

In the code above, wp_footer is a WordPress hook that runs just before the closing </body> tag. Next, I added the base Omniture code inside the omniture_tag function:

...

function omniture_tag() {
?>
<script type="text/javascript">
<!-- var s_account = 'omniture_account_id'; -->
</script>
<script type="text/javascript" src="/path/to/s_code.js"></script>
<script type="text/javascript"><!--
s.pageName='' //page name
s.channel='' //channel
s.pageType='' //page type
s.prop1='' //traffic variable 1
s.prop2='' //traffic variable 2
s.prop3='' //traffic variable 3
s.prop4= '' //traffic variable 4
s.prop5= '' //traffic variable 5
s.campaign= '' //campaign variable
s.state= '' //user state
s.zip= '' //user zip
s.events= '' //user events
s.products= '' //user products
s.purchaseID= '' //purchase ID
s.eVar1= '' //conversion variable 1
s.eVar2= '' //conversion variable 2
s.eVar3= '' //conversion variable 3
s.eVar4= '' //conversion variable 4
s.eVar5= '' //conversion variable 5
/************* DO NOT ALTER ANYTHING BELOW THIS LINE ! **************/
var s_code=s.t();if(s_code)document.write(s_code)
--></script>
<?php
}

...

To test the footer hook, I activated the plugin in the WordPress admin. A blog refresh should yield the Omniture code (with no variables defined) near the </body> tag of the source code.

After verifying that the code was correctly appended near the footer in the source code, I determined how to track the WordPress traffic in SiteCatalyst. For our client, the traffic was to be divided into the home page, static pages, articles, tag pages, category pages, and archive pages. The Omniture variables pageName, channel, pageType, prop1, prop2, and prop3 were modified to track these pages, using the existing WordPress functions is_home, is_page, is_single, is_category, is_tag, is_month, the_title, get_the_category, single_cat_title, single_tag_title, and the_date.

...

<script type="text/javascript"><!--
<?php
// Initialize everything up front so branches that don't set a variable
// don't trigger PHP notices and still output an empty string
$pageName = $channel = $pageType = $prop1 = $prop2 = $prop3 = '';
if(is_home()) {    //WordPress functionality to check if page is home page
        $pageName = $channel = $pageType = $prop1 = 'Blog Home';
} elseif (is_page()) {    //WordPress functionality to check if page is static page
        $pageName = $channel = the_title('', '', false);
        $pageType = $prop1 = 'Static Page';
} elseif (is_single()) { //WordPress functionality to check if page is article
        $categories = get_the_category();
        $pageName = $prop2 = the_title('', '', false);
        $channel = $categories[0]->name;
        $pageType = $prop1 = 'Article';
} elseif (is_category()) {    //WordPress functionality to check if page is category page
        $pageName = $channel = single_cat_title('', false);
        $pageName = 'Category: ' . $pageName;
        $pageType = $prop1 = 'Category';
} elseif (is_tag()) {     //WordPress functionality to check if page is tag page
        $pageName = $channel = single_tag_title('', false);
        $pageType = $prop1 = 'Tag';
} elseif (is_month()) {     //WordPress functionality to check if page is month page
        list($month, $year) = split(' ', the_date('F Y', '', '', false));
        $pageName = 'Month Archive: ' . $month . ' ' . $year;
        $channel = $pageType = $prop1 = 'Month Archive';
        $prop2 = $year;
        $prop3 = $month;
}
echo "s.pageName = '$pageName' //page name\n";
echo "s.channel = '$channel' //channel\n";
echo "s.pageType = '$pageType'  //page type\n";
echo "s.prop1 = '$prop1' //traffic variable 1\n";
echo "s.prop2 = '$prop2' //traffic variable 2\n";
echo "s.prop3 = '$prop3' //traffic variable 3\n";
?>
s.prop4 = '' //traffic variable 4

...

The plugin lets you switch freely between WordPress themes without having to manage the SiteCatalyst code in each theme, and it tracks the basic WordPress page hierarchy. Here are example outputs of the SiteCatalyst variables broken down by page type:

Homepage

s.pageName = 'Blog Home' //page name
s.channel = 'Blog Home' //channel
s.pageType = 'Blog Home'  //page type
s.prop1 = 'Blog Home' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3

Tag Page

s.pageName = 'chocolate' //page name
s.channel = 'chocolate' //channel
s.pageType = 'Tag'  //page type
s.prop1 = 'Tag' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3

Category Page

s.pageName = 'Category: Food' //page name
s.channel = 'Food' //channel
s.pageType = 'Category'  //page type
s.prop1 = 'Category' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3

Static Page

s.pageName = 'About' //page name
s.channel = 'About' //channel
s.pageType = 'Static Page'  //page type
s.prop1 = 'Static Page' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3

Archive

s.pageName = 'Month Archive: November 2009' //page name
s.channel = 'Month Archive' //channel
s.pageType = 'Month Archive'  //page type
s.prop1 = 'Month Archive' //traffic variable 1
s.prop2 = '2009' //traffic variable 2
s.prop3 = 'November' //traffic variable 3

Article

s.pageName = 'Hello world!' //page name
s.channel = 'Test Category' //channel
s.pageType = 'Article'  //page type
s.prop1 = 'Article' //traffic variable 1
s.prop2 = 'Hello world!' //traffic variable 2
s.prop3 = '' //traffic variable 3

A followup step for this plugin would be to use the wp_options table in WordPress to manage the Omniture account id, which would allow an administrator to set the account id through the WordPress admin without editing the plugin code. I've uploaded the plugin to a github repository here.
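A minimal sketch of that approach might look like the following (the option name is hypothetical):

function omniture_tag() {
    // Pull the account id from the wp_options table instead of hard-coding it
    $account = get_option('omniture_account_id');
    ?>
    <script type="text/javascript">
    var s_account = '<?php echo $account; ?>';
    </script>
    <?php
    // ... rest of the tracking code as before ...
}

// The option could be seeded on plugin activation, or exposed to admins
// through a settings page:
// add_option('omniture_account_id', 'your_account_id');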

Learn more about End Point's analytics services.

Update: This plugin is included in the WordPress plugin registry and can be found at http://wordpress.org/extend/plugins/omniture-sitecatalyst-tracking/.

Test Fixtures for CakePHP Has-and-Belongs-to-Many Relationships

CakePHP, a popular MVC framework for PHP, offers a pretty easy-to-use object-relational mapper, as well as a fairly straightforward fixture class for test data. Consequently, it's fairly easy to get into test-driven development with CakePHP, though this can take some acclimation if you're coming from Rails or Django or some such; the need to go through a web interface to navigate to and execute your test cases feels, to me, a little unnatural. Nevertheless, you can get writing tests pretty quickly, and the openness of the testing framework means that it won't get in your way. Indeed, compared to the overwhelming plethora of testing options one gets in the Ruby space -- and the accompanying sense that the choice of testing framework is akin to one's choice of religion, political party, or top 10 desert island album list -- CakePHP's straightforward testing feels a little liberating.

Which is why it was a little surprising to me that getting a test fixture going for the join table on a has-and-belongs-to-many (HABTM) association is -- at least in my experience -- not the clearest thing in the world.

One can presumably configure the fixture to merely use the table option in the fixture's $import attribute. However, as I was following the table and model naming conventions, I felt that I must be doing something wrong in my attempts to get a fixture going for a HABTM relationship, and consequently I eschewed the (potentially) easy way out to try to find a solution that ought to work.

So, let's say my relations were:

  • Product model: some stuff to sell
  • Sale model: individual "sale" events when particular products are promoted
  • A products_sales join table establishes a many-to-many relationship (can we all acknowledge that "many-to-many" is much more convenient for meatspace communication than the horrendously awkward "has-and-belongs-to-many"?) between these two fabulous structures

You can go with the usual Cake-ish model definitions:

# in app/models/product.php
class Product extends AppModel {
    var $name = 'Product';
    var $hasAndBelongsToMany = array(
        'Sale' => array('className' => 'Sale')
    );
}

# in app/models/sale.php
class Sale extends AppModel {
    var $name = 'Sale';
    var $hasAndBelongsToMany = array(
        'Product' => array('className' => 'Product')
    );
}

Since we're following the naming conventions here (a singular model name fronts a pluralized table name, and the join table for the HABTM relationship uses the pluralized names of each joined relation, in alphabetical order), the above code should be all you need for the relationship to work.

Indeed, as explained in this helpful article on the HABTM-in-CakePHP subject, you should find that queries using these models will automatically include 'ProductsSale' model entries in their result sets, with that model being dynamically generated by the HABTM association.

So, that means you should be able to create a test fixture for the ProductsSale model, right?

# in app/tests/fixtures/products_sale.php
class ProductsSale extends CakeTestFixture {
    var $name = 'ProductsSale';
    var $import = 'ProductsSale';
    var $records = array(
        /* a buncha awesome stuff... */
    );
}

Unfortunately, at least in my experience with CakePHP 1.2.5, that doesn't work. When your test case attempts to load the fixture, you'll get SQL errors indicating that the test-prefixed version of your "products_sales" table doesn't exist.

I haven't done a sufficiently exhaustive analysis of the Cake innards to sort out why this is, and may yet do so. My guess, based on nothing other than observation and intuition, is that the auto-generated model is related only to the models involved in the HABTM relationship, through the bindModel method, and does not get generated in any global capacity such that it exists as a model in its own right. Consequently, while the testing code can guess the correct table name for the join table based on the naming conventions used for the fixture, since it doesn't relate to an extant model, it fails to go through the model-wrapping procedures that typically take place per test case (setting up the test-space table per model, populating it from the fixture, etc.).

Fortunately, as illustrated by the aforementioned helpful article, we can front the join table with a full-fledged model class, and use that model class within the association definitions. This solves the problem of the broken fixture, as the fixture will now refer to a standard model and successfully set up the test table, data, etc.

That means the code becomes:

# in app/models/products_sale.php
class ProductsSale extends AppModel {
    /* the naming convention assumes singularized model name
       based on the entire table name; it does not make inner
       names singular.  This feels a little unclean.  If it
       really bothers you, recall the language you're using
       and I suspect you'll get over it. */
    var $name = 'ProductsSale';
    /* The join table belongs to both relations */
    var $belongsTo = array('Product', 'Sale');
}

# in app/models/product.php
class Product extends AppModel {
    var $name = 'Product';
    /* Use the 'with' option to join through the new model class */
    var $hasAndBelongsToMany = array(
        'Sale' => array('with' => 'ProductsSale')
    );
}

# in app/models/sale.php
class Sale extends AppModel {
    var $name = 'Sale';
    /* And again, the 'with' option */
    var $hasAndBelongsToMany = array(
        'Product' => array('with' => 'ProductsSale')
    );
}

No changes are necessary to the fixture for ProductsSale; once that join model is in place, it'll be good.

It is not uncommon for ORMs to provide magical intelligence for establishing HABTM relationships, and as a matter of convenience it's pretty handy. It is similarly common to allow for HABTM association through an explicitly-defined model class. While this ups the ceremony for setting up your ORM, there are benefits that come with it; a reduced reliance on magic can be distinctly advantageous if you ever get into hairy situations with ORM query wrangling, and it is reasonably common for a HABTM association to have annotations on the relationship itself. In each case, you'll be happy to have your join table fronted by a model class.

Hopefully this will save somebody else some trouble.

PubCon Vegas: 7 Takeaway Nuggets

I'm back at work after last week's PubCon Vegas. I published several articles about specific sessions, but I wanted to provide some nuggets on recurring themes of the conference.

Google Caffeine Update

This year Google rolled out changes referred to as the Google Caffeine update. The change increases the speed and size of the index, moves Google search closer to real-time, and improves the relevancy and accuracy of search results. It was a popular topic at the conference; however, not much light was shed on how the algorithm changes would affect your search results, if at all. I'll have to keep an eye on this to see if there are any significant changes in End Point's search performance.

Bing

Bing is gaining traction. They want to get [at least] 51% of the search market share.

Social media

Social media was a hot topic at the conference. An entire track was allocated to Twitter topics on the first day. However, social media still pales in comparison to search: of all referrals on the web, search still accounts for 98%, while social media accounts for less than 1% (view referral data here). Dr. Pete from SEOmoz nicely summarized the elephant in the room at PubCon regarding social media: it's important to measure social media response to determine whether it provides business value.

Ecommerce Advice

I asked Rob Snell, author of Starting a Yahoo Business for Dummies, for the most important ecommerce SEO advice he could offer. He explained the importance of content development and link building that target keywords based on keyword conversion. Basically, SEO effort shouldn't be wasted on keywords that don't convert well. I typically don't have access to client keyword conversion data, but this is great advice.

Internal SEO Processes

Another recurring topic I observed at PubCon was that internal SEO processes are often a much bigger obstacle than the actual SEO work. It's important to get the entire team on your side. Alex Bennert of the Wall Street Journal discussed understanding your audience when presenting SEO. Here are some examples of appropriate topics for a given audience:

  • IT Folks: sitemaps, duplicate content (parameter issues, pagination, sorting, crawl allocation, dev servers), canonical link elements, 301 redirects, intuitive link structure
  • Biz Dev & Marketing Folks: syndication of content, evaluation of vendor products & integration, assessing SEO value and link equity of partner sites, microsites, leveraging multiple assets
  • Content Developers: on page elements best practices, linking, anchor text best practices, keyword research, keyword trends, analytics
  • Management: progress, timelines, roadmaps

On the topic of internal processes, I was entertained by the various comments characterizing the developer-marketer relationship, for example:

  • "Don't ever let a developer control your URL structure."
  • "Don't ever let a developer control your site architecture."
  • "This site looks like it was designed by a developer."

Apparently developers are the most obvious scapegoat. Back to the point, though: it often takes more effort to build SEO understanding and support than to explain what actually needs to be done.

Search Engine Spam

Search engine spam detection is cool. During a couple of sessions with Matt Cutts, I became interested in writing code to detect search spam. For example:

  • Crawling the web to detect links where the anchor text is '.'.
  • Crawling the web to identify sites where robots.txt blocks ia_archiver.
  • Crawling the web to detect pages with keyword stuffing.

I've typically been involved in the technical side of SEO (duplicate content, indexation, crawlability), and haven't been involved in link building or content development, but these discussions provoked me to start looking at search spam from an engineer's perspective.

Google Parameter Masking

Apparently I missed the announcement of parameter masking in Google Webmaster Tools. I've helped battle duplicate content for several clients, and at PubCon I heard about this functionality, announced in October of 2009, which lets you suggest to the crawler that it ignore specific query parameters.

Parameter masking is yet another solution for managing duplicate content, in addition to the rel="canonical" link element, creative uses of robots.txt, and the nofollow attribute. The ideal solution for SEO would be to build a site architecture that doesn't require any of these workarounds. However, as developers we have all experienced how legacy code persists, and sometimes a low-effort, high-return solution is the best short-term option.
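As a reminder, the canonical link element is just a single line in the page head (the URL here is a placeholder):

<link rel="canonical" href="http://www.example.com/products/blue-widget" />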

Learn more about End Point's technical SEO services.

Port knocking with knockd

One of the best ways to secure your box against SSH attacks is the use of port knocking. Basically, port knocking seals off your SSH port, usually with firewall rules, such that nobody can even tell if you are running SSH until the proper "knock" is given, at which time the SSH port appears again to a specific IP address. In most cases, a "knock" simply means accessing specific ports in a specific order within a given time frame.

Let's step back a moment and see why this solution is needed. Before SSH there was telnet, which was a great idea way back at the start of the Internet when hosts trusted each other. However, it was (and is) extremely insecure, as it entails sending usernames and passwords "in the clear" over the internet. SSH, or Secure Shell, is like telnet on steroids. With a mean bodyguard. There are two common ways to log in to a system using SSH. The first way is with a password. You enter the username, then the password. Nice and simple, and similar to telnet, except that the information is not sent in the clear. The second common way to connect with SSH is by using public key authentication. This is what I use 99% of the time. It's very secure, and very convenient. You put a copy of your public SSH key on the server, and then use your local private key to authenticate. Since you can cache your private key, this means only having to type in your key's passphrase once, and then you can ssh to many different systems with no password needed.

So, back to port knocking. It turns out that any system connected to the internet is basically going to come under attack. One common target is SSH - specifically, people connecting to the SSH port, then trying combinations of usernames and passwords in the hopes that one of them is right. The best prevention against these attacks is to have a good password. Because public key authentication is so easy, and makes typing in the actual account password such a rare event, you can make the password something very secure, such as:

gtsmef#3ZdbVdAebAS@9e[AS4fed';8fS14S0A8d!!9~d1aAQ5.81sa0'ed

However, this won't stop others from trying usernames and passwords anyway, which fills up your logs with their attempts and is generally annoying. Thus, the need to "hide" the SSH port, which by default is 22. One thing some people do is move SSH to a "non-standard" port, where non-standard means anything but 22. Typically, some random number that won't conflict with anything else. This will reduce and/or stop all the break-in attempts, but at a high cost: all clients connecting have to know to use that port. With the ssh client, it's adding a -p argument, or setting a "Port" line in the relevant section of your .ssh/config file.
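For example, an entry in ~/.ssh/config might look like this (the host and port are placeholders):

Host example.com
    Port 2222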

All of which brings us to port knocking. What if we could run SSH on port 22, but answer only to people who know the secret code? That's what port knocking allows us to do. There are many variants on port knocking and many programs that implement it. My favorite is "knockd", mostly because it's simple to learn and use, and is available in some distros' packaging systems. My port knocking discussion and examples will focus on knockd, unless stated otherwise.

knockd is a daemon that listens for incoming requests to your box, and reacts when a certain combination is reached. Once knockd is installed and running, you modify your firewall rules (e.g. iptables) to drop all incoming traffic to port 22. To the outside world, it's exactly as if you are not running SSH at all. No break-in attempts are possible, and your security logs stay nice and boring. When you want to connect to the box via SSH, you first send a series of knocks to the box. If the proper combination is received, knockd will open a hole in the firewall for your IP on port 22. From this point forward, you can SSH in as normal. The new firewall entry can get removed right away, cleared out at some time period later, or you can define another knock sequence to remove the firewall listing and close the hole again.

What exactly is the knock? It's a series of connections to TCP or UDP ports. I prefer choosing a few random TCP ports, so that I can simply use telnet calls to connect to the ports. Keep in mind that when you do connect, it will appear as if nothing happened - you cannot tell that knockd is logging your attempt, and possibly acting on it.

Here's a sample knockd configuration file:

[options]
  logfile = /var/log/knockd.log

[openSSH]
  sequence    = 32144,21312,21120
  seq_timeout = 15
  command     = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
  tcpflags    = syn

[closeSSH]
  sequence    = 32144,21312,21121
  seq_timeout = 15
  command     = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
  tcpflags    = syn

In the above file, we've stated that any host that sends a TCP SYN to ports 32144, 21312, and 21120, in that order, within 15 seconds, will cause the iptables command to be run. Note that the use of iptables is not hard-coded into knockd at all: any command can be run when the port sequence is triggered, which allows for all sorts of fancy tricks. To close the hole back up, we send the same sequence, except the final port is 21121.

Once knockd is installed, and the configuration file is put in place, start it up and begin testing. Leave a separate SSH connection open to the box while you are testing! If you are really paranoid, you might want to open a second SSH daemon on a second port as well. First, check that the port knocking works by triggering the port combinations. knockd comes with a command-line utility for doing so, but I usually just use telnet like so:

[greg@home ~] telnet example.com 32144
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused
[greg@home ~] telnet example.com 21312
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused
[greg@home ~] telnet example.com 21120
Trying 123.456.789.000...
telnet: connect to address 123.456.789.000: Connection refused

Note that we received a bunch of "Connection refused" messages - the same message as if we had tried any other random port, and the same message that people trying to connect to a port-knock-protected SSH port will see. If you look in the logs for knockd (set as /var/log/knockd.log in the example file above), you'll see some lines like this if all went well:

[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 1
[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 2
[2009-11-09 14:01] 100.200.300.400: openSSH: Stage 3
[2009-11-09 14:01] 100.200.300.400: openSSH: OPEN SESAME
[2009-11-09 14:01] openSSH: running command: /sbin/iptables -I INPUT -s 100.200.300.400 -p tcp --dport 22 -j ACCEPT

Voila! Your iptables should now contain a new line:

$ iptables -L -n | grep 100.200
ACCEPT     tcp  --  100.200.300.400  anywhere            tcp dpt:ssh

The next step is to lock everyone else out from the SSH port. Add a new rule to the firewall, but make sure it goes to the bottom:

$ iptables -A INPUT -p tcp --dport ssh -j DROP
$ iptables -L | grep DROP
DROP       tcp  --  anywhere             anywhere            tcp dpt:ssh

You'll note that we used "-A" to append the DROP rule to the bottom of the INPUT chain, and "-I" to insert the exceptions at the top of the INPUT chain. At this point, you should try a new SSH connection and make sure you can still connect. If all is working, the final step is to make sure the knockd daemon starts up on boot, and that the DROP rule is added on boot as well. You can also add some hard-coded exceptions for boxes you know are secure, if you don't want to have to port knock from them every time.
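If you get tired of typing the telnet commands by hand, the knock-then-connect dance is easy to script. Here's a rough sketch (the host and ports are the examples above, and it assumes a netcat binary on the client; adjust to your own setup):

#!/bin/sh
# Send the knock sequence; each attempt is refused, which is expected --
# knockd only needs to see the incoming SYN packets.
for port in 32144 21312 21120; do
    nc -z -w 1 example.com $port
done
# Give knockd a moment to add the firewall rule, then connect
sleep 1
ssh example.com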

One flaw in the above scheme that the sharp reader may have spotted is that although the SSH port cannot be reached without a knock, the sequence of knocks used can easily be intercepted and played back. While this doesn't gain the potential bad guy too much, there is a way to overcome it. The knockd program allows the port knocking combinations to be stored inside a file and read one line at a time. Each successful knock advances the required sequence to the next line, so that even knowing someone else's knock sequence will not help, as it changes each time. To implement this, just replace the 'sequence' line seen in the above configuration file with a line like this:

one_time_sequences = /etc/knockd.sequences.txt

In this case, the sequences will be read from the file named "/etc/knockd.sequences.txt". See the manpage for knockd for more details on one_time_sequences as well as other features not discussed here. For more on port knocking in general, visit portknocking.org.

While one_time_sequences is a great idea, I'd like to see something a little different implemented someday. Specifically, having to pre-populate a fixed list of sequences is a drag. Not only do you have to make sure they are random, and that you have enough, but you have to keep the list with you locally. Lose that list, and you cannot get in! A better way would be to have your port knocking program generate the new port sequences on the fly. It would also encrypt the new port sequences to one or more public keys, and then put the file somewhere web accessible. Thus, one could simply grab the file from the server, decrypt it, and perform the port knocking based on the list of ports inside of it. Is all of that overkill for SSH? Almost certainly. :) However, there are many other uses for port knocking than simple SSH blocking and unblocking. Remember that many pieces of information can be used against your server, including what services are running on which ports, and which versions are in use.

PubCon Vegas Day 3: User Generated Content

On day 3 of PubCon Vegas, a great session I attended was Optimizing Forums For Search & Dealing with User Generated Content with Dustin Woodard, Lawrence Coburn, and Roger Dooley. User generated content is content generated by users in the form of message boards, customizable profiles, forums, reviews, wikis, blogs, article submission, question and answer, video media, or social networks.

Some good statistics were presented about why to tap into user generated content. Recently released Nielsen research showed that 1 out of every 11 minutes spent online is on a social network, and that two-thirds of customer "touch points" are user-generated.

Dustin provided some interesting details about long tail traffic. He looked at HitWise's data on the top 10,000 search terms for a 3 month period. The top 100 terms accounted for 5.7% of all traffic, the top 1,000 terms accounted for 10.6%, and the entire 10,000-term data set accounted for just 18.5% of all traffic. With this data, the distribution would be analogous to a lizard with a one-inch head and a 221-mile-long tail, the tail representing the long tail traffic.

Dustin gave the following steps for developing a user generated content community:

  1. Seed it with a few editors and really good initial content.
  2. Give them a voice.
  3. Make it easy to contribute.
  4. Make it cool or trendy.
  5. Provide ownership.
  6. Create competition with contests, ranking or by highlighting expertise.
  7. Build a sense of community or a sense of exclusivity.
  8. Give the community a purpose.

All SEO best practices apply to user generated content, but throughout the session I picked up several tips specific to user generated content:

  • Predefining keyword-rich categories, topics and tags will go a long way with optimization. The better the topic structure created up front, the better the user generated content can perform in the long run. Users are not inherently good at content organization, so content can easily be buried by poor information architecture.
  • Developing automated cross-linking between pieces of user generated content helps improve authority, build clusters of content, and enrich the internal link structure. Dustin had experience building a widget that automatically links to 5 related pieces of user generated content, and another that lets the user select several pieces of user generated content from a set of related content.
  • Examples of battling duplicate content include disallowing duplicate page titles and meta descriptions. Content that is moved, renamed or deleted should be managed well.
  • Finally, building a badge or widget to display user involvement helps increase external linking to your site, but this should be carefully managed to avoid appearing spammy. Widget best practices: widgets should be highly accessible, simple, lightly branded, and always show fresh content.
  • Developing your own tiny URL service helps keep external links to your user generated content intact and passing link value. Lawrence suggested "gently tweeting" user generated content of the highest quality.

Several of End Point's clients are either in the middle of or considering building a community with user generated content. In ecommerce, blogs, forums, reviews, and Q&A are the most prevalent types of user generated content that I've encountered. Many of the things mentioned in this session were good tips to consider throughout the development of user generated content for ecommerce.

Learn more about End Point's technical SEO services.

PubCon Vegas Day 2: International and Mega Site SEO, and Tools for SEO

On the second day of PubCon Vegas, I attended several SEO track sessions including "SEO for Ecommerce", "International and European Site Optimization", "Mega Site SEO", and "SEO/SEM Tools". A mini-summary of several of the sessions is presented below.

Derrick Wheeler from Microsoft.com spoke in the Mega Site SEO session about "taming the beast". Microsoft has 1.2 billion URLs spread across thousands of web properties. For mega site SEO, Derrick highlighted:

  • Content is NOT king. Structure is! Content is like the princess-in-waiting after structure has been mastered.
  • Developing an overall SEO approach, and the organization needed to get structure, content, and authority work completed, is more valuable than the actual SEO work itself. This was a common theme among many of the presentations at PubCon.
  • Getting metrics set up at the beginning of SEO work is a very important step to measure and justify progress.
  • Don't be afraid to say no to low priority items.

Most developers deal with a large amount of legacy code. Derrick discussed primary issues when working with legacy problems:

  • Duplicate and undesirable pages. For Microsoft.com, managing and dealing with 1.2 billion pages results in a lot of duplicate and undesirable pages from the past.
  • Multiple redirects.
  • Improper error handling (error handling on 404s or 500s).
  • URL structure can be a problem for international sites. Having an appropriate TLD (top level domain) is the best solution, but if that's not possible, a process should be implemented to regulate the international URLs.
  • Low Quality Page Titles and Meta Tags. For large sites with hundreds of thousands of pages, it's really important to have unique page titles and meta descriptions or to have a template that forces uniqueness.

In summary, structure and internal processes are areas to focus on for Mega Site SEO. Legacy problems are something to be aware of when you have a site so large where changes won't be implemented as quickly as small site changes.

In International and European Search Management, Michael Bonfils, Nelson James, and Andy Atkins-Krueger discussed international SEO and SEM tactics. Takeaways include:

  • In terms of international search marketing, it's important to incorporate culture into search optimization and marketing. What works in one country may not work in another, so don't offend a culture by failing to understand it. Some examples of content differences for targeting different cultures include emphasizing price points, focusing on product quality, and asserting authority or trust on a site.
  • It's also important to understand how linguistics affects your keyword marketing. Automatic translation should not be used (all the speakers mentioned this). A good example of linguistics and search targeting is the search term "soccer cleats", or "football boots". In England, the term "football boot" has a very small portion of the traffic share, but the equivalent terms in other languages ("scarpe de calcio", "botas de futbal") have a much larger percentage of the search market share. Andy shared many other examples where a direct translation would not be the best keyword to target ("car insurance", "healthcare", "30% off", "cheap flights").
  • Local hosting is important for metrics, linking, and developing trust. Nelson James shared research showing that 80% of the top 10 results for the top 30 keywords in China had a '.cn' top level domain, and that the other top sites with '.com' domains were all hosted in China.
  • Other technical areas mentioned for international search are the meta language tag, pinyin, charset, and language set. Duplicate content will also become a problem across sites in the same language.
  • It's important to understand the search market share. In Russia, Google has 35% of the search market and Yandex has 54%. In China, Baidu has 76% and Google has 22%. There are reasons behind these differences: Yandex was written to handle the large Russian vocabulary, which Google does not handle as well, and Baidu handles media search better than Google, while search traffic in China is much more entertainment-driven than the business-driven traffic in the US.

In the last session of the day, about 100 tools were discussed in SEO/SEM Tools. I'm planning on writing another blog post with a summary of these tools, but here's a short list of the tools mentioned by multiple speakers:

  • SEMRush
  • Google: Keyword Ad Tool, Webmaster Tools, Adplanner, SocialGraph API, Google Trends, Analytics, Google Insights
  • SpyFu: Kombat, Domain Ad History, Smart Search, Keyword Ad History
  • SEOBook
  • SEOmoz: Linkscape, Mozbar, Top Pages, etc.
  • MajesticSEO
  • Raven SEO Tools: Website Analytics, Campaign Reports

Stay tuned for a day 3 and wrap up article!

Learn more about End Point's technical SEO services.

PubCon Vegas Day 1: Keyword Research Session

On the first day of PubCon Vegas, I was bombarded with information, sessions, and people. PubCon is an SEO/SEM conference with a variety of sessions categorized into SEO (Search Engine Optimization), SEM (Search Marketing), Social Media, and Affiliates. My primary interest is in SEO, which is why I attended the SEO track yesterday, which included sessions about in-house SEO, organic keyword research and selection, and hot topics in SEO.

Because my specific involvement in SEO has focused on technical SEO, I was surprised that my highlight of day one was "Smart Organic Keyword Research and Selection" which included speakers Wil Reynolds, Craig Paddock, Carolyn Shelby, and Mark Jackson.

With good organization and humor, Carolyn first presented the "ABCs of Organic Keyword Research and Selection": A is for analytics and knowing your audience. B is for brainstorm and bonus. And C is for cookie (!), crunching the numbers, culling the lists, and creating a final list of keywords.

On the analytics side, Carolyn mentioned good sources of analytics include web server logs (read my article on the value of log or bot parsing), Google Analytics "traffic generating" keyword list, and logs from internal site search.

In regards to knowing your audience, Carolyn shared her personal experience with focus group research: for a project that targeted teenage girls, she invited her daughter and several of her daughter's friends to join her around the table with laptops. She showed them a picture and asked them to search for that image. She recorded the search terms used and used that information to help understand her target audience's behavior.

On the brainstorm side, she likes to involve core web team members, product managers, marketing, developers, designers, promoters, marketers, and front liners (customer service representatives, tech support). B was also for bonus: getting input from the "suits" of a company on their list of ideal keywords and on how they measure keyword success.

Craig Paddock spoke on "Organic Keyword Research and Selection" next. He touched on some of the following SEO keyphrase concepts:

  • keyphrase research: Keyword research is based on keyword popularity, click through rate, quality (measured by conversion and engagement), keyword competitiveness, and current ranking
  • keyphrase expanders and variations: Broad keyword phrases should include variations of keywords that include words like 'best', 'online', 'buy', 'cheap', 'discount', 'wholesale', 'accessories', 'supplies', 'reviews', and abbreviations of words like states. For End Point's ecommerce clients, targeting keyphrases with customer reviews is a great way to generate traffic from user generated content
  • keyphrase discovery: It shouldn't be assumed that clients know the industry. Craig shared an example where his boxing retailer client made the mistake of targeting specific boxing terms that had low traffic; they expanded to include more popular terms like "lose weight" and "burn calories". Another tactic for keyphrase discovery is to ask what kinds of problems the website's services solve, and choose keywords that target those questions and answers.
  • keyphrase quality: Keyphrase quality is typically measured by conversion rate (revenue / visitor) or engagement. Engagement is measured by the time on site, pages/visit, and bounce rate, which are commonly included in analytics packages.
  • keyphrase selection: Using exact match and broad match on keywords is helpful and let the customers guide the keyword selection. Craig mentioned that data shows that there is a higher conversion rate on more specific keyphrases, which isn't surprising.
  • keyphrase targeting: Keyphrase targeting should match competitiveness with link popularity; for example, more competitive words should be higher up in the hierarchy of the site, such as on the home page. For End Point, this would mean targeting competitive terms like "ecommerce", "ruby on rails development", and "web application development" on our homepage and targeting less competitive phrases such as "interchange development" or "ruby on rails ecommerce" on pages lower in the hierarchy.
  • keyphrase analysis: One area of interest was how analytics tools attribute "credit" to keyphrases. In Google Analytics, if a customer searches "interchange consulting" and visits endpoint.com, then a week later searches "end point" and converts, the credit for the conversion is attributed to the "end point" keyword rather than "interchange consulting". This matters in ecommerce because such attribution doesn't accurately credit targeted keywords for revenue. Craig did mention that other tools (including Omniture) provide the ability to select last-click versus first-click attribution to fix this problem. Another solution mentioned was to set a user defined variable in Google Analytics equal to a cookie holding the first-click search term ("interchange consulting" in the example above), with the cookie set not to expire.

Wil Reynolds spoke next on "Keyword Analysis AFTER the rankings". He touched on an important concept: SEO (specifically keyphrase research and targeting) is never done, because keywords are constantly evolving: people change the way they search, blended search (video, image) is on the rise, and social and economic factors influence keyword popularity. Some good examples of keyphrase trends include:

  • "Shopping" was a good keyword in 1999 because ecommerce was growing on the web and users didn't know what to search for.
  • "Handheld device" transitioned to "Smartphone"
  • "Eco-Friendly" has grown while "Environmentally Friendly" has declined - view this trend here
  • "Netbooks" and "Ultraportables" are popular search terms on the rise that were non-existent two years ago - view netbook trends here
  • Brands in the gear industry evolve at a much faster pace than the plumbing or wood floor industry

Wil's examples and advice apply directly to our clients, who should be aware of social and economic influences that may require them to change their keyphrase targeting over time.

Finally, Mark Jackson spoke on focusing your keywords for better results. He discussed the importance of analyzing the keyword competitiveness to determine which keywords to target to get the most value out of keyword SEO work.

In summary, I still don't love keyword and keyphrase research and selection :), but I found that the speakers presented a great overview of the topic with a good mixture of personal experience, expertise, and examples. Some great concepts to keep in mind regarding keyword research are:

  • There are always missed opportunities in keyword targeting.
  • There are lotsa tools! Tools are good for measuring keyphrase competitiveness, user engagement, and identifying missed opportunities.
  • SEO keyphrase research and selection is an ongoing process.

Now, back to day 2 activities...

Learn more about End Point's technical SEO services.

DjangoCon 2009: Portland, Ponies, and Presentations

I attended DjangoCon this year for the first time, and found it very informative and enjoyable. I hoped to round out my knowledge of the state of the Django art, and the conference atmosphere made that easy to do.

Presentations

Avi Bryant's opening keynote was on the state of web application development, and what Django must do to remain relevant. In the past, web application frameworks did things in certain ways due to the constraints of CGI. Now they're structured around relational databases. In the future, they'll be arranged around Ajax and other asynchronous patterns to deliver just content to browsers, not presentation. To wit, "HTML templates should die", meaning we'll see more Gmail-style browser applications where the HTML and CSS is the same for each user, and JavaScript fetches all content and provides all functionality. During Q&A, he clarified that most of what he said applies to web applications, not content-driven sites which must be SEO friendly and so arranged much differently. Many of these themes were serendipitously also in Jacob Kaplan-Moss' "Snakes on the Web" talk, which he gave at PyCon Argentina the same week as DjangoCon.

Ian Bicking's keynote was on open-source as a philosophy; very abstract and philosophical but also interesting. It has been described as “a free software programmer’s midlife crisis”. Frank Wiles of Revolution Systems gave a barn-burner talk on how Django converted him from Perl to Python, followed by another on Postgres optimization. The latter reflected a theme that all web developers are now expected to do Operations as well, with several talks devoted to simple systems administration concepts.

Deployment

While working on Django projects we've been doing this year, I've been watching developments around deployment of Python web applications, particularly with Apache. The overwhelming consensus: Without active development, mod_python is going the way of the dodo. Although WSGI is architecturally similar to CGI, the performance difference can be striking. mod_wsgi's daemon mode running as a separate user is more secure and flexible than mod_python processes running as the Apache user. Given mod_wsgi's momentum, it makes sense to use it and avoid mod_python for new projects.
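For reference, a bare-bones mod_wsgi daemon-mode setup looks something like this (the process name, paths, and user/group here are placeholders to adapt):

WSGIDaemonProcess mysite user=mysite group=mysite processes=2 threads=15
WSGIProcessGroup mysite
WSGIScriptAlias / /srv/mysite/deploy/mysite.wsgi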

Several other tools kept re-appearing in presenters' demonstrations. Fabric is a remote webapp deployment tool similar to Ruby's Capistrano. Python's package index, formerly named "The Cheeseshop", has been renamed PyPI, the Python Package Index. Though easy_install is the standard tool to install PyPI packages, pip is gaining momentum as its successor. VirtualEnv is a tool to create isolated Python environments, discrete from the system environment. Since the conference I've been exploring how these tools may be leveraged for our own development, and may be integrated into our DevCamps multiple-environments system.
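As a quick illustration of how these tools fit together (the environment name and packages are placeholders):

# Create an isolated environment and install packages into it
virtualenv mysite-env
source mysite-env/bin/activate
pip install Django

# Capture the environment's packages, and recreate them elsewhere
pip freeze > requirements.txt
pip install -r requirements.txt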

Pinax

The Eldarion folks gave three talks on Pinax, and the project came up a lot in conversations. If anything could be said to have "buzz" at the conference, this is it. Pinax is a suite of re-usable Django applications, encompassing functionality often-desired, but not common enough to be included in django.contrib. It may be compared to Drupal and Plone. Its popularity also spurred discussions on what should or should not be included in the Django core, and how all Django developers should make their apps re-usable (some of James Bennett's favorite topics).

Community

Among others, I was fortunate to spend time with Kevin Fricovsky and the others who launched the new community site DjangoDose during the conference. DjangoDose is a spiritual successor to the now-defunct This Week in Django podcast, and was visible on many laptops in the conference room, aggregating #djangocon tweets.

That's all I have time to relate now. There was plenty more there and I look forward to following up with people & projects.

Learn more about End Point's Django and Python development.

Automatically building Pentaho metadata

Every so often I'll hear of someone asking for a way to allow their users to write queries against their database without having to teach everyone SQL. There are various applications for this: BusinessObjects and Cognos are two common commercial examples, among many others. Pentaho and JasperReports provide similar capabilities in the open-source world. These tools allow users to write reports by selecting fields from a user-friendly list, adding suitable constraints, and making other formatting and filtering choices, all without needing to understand SQL.

Those familiar with these packages know that in order to provide those nice, readable field names and simple, meaningful field groupings, the software generally needs some sort of metadata file. This file maps actual database fields to readable descriptions, specifies relationships between tables, and translates database field types to data types the reporting software understands. Typically to create such a file, an administrator spends a few hours in front of a vendor-supplied GUI application dragging graphical representations of their tables and columns around, defining joins and entering friendly descriptions.

For the TriSano™ project's data warehouse, we needed a way to make regular modifications to the metadata file we gave to our Pentaho instance, in order to allow users to write reports that included data from the custom-built forms TriSano allowed them to create. To this end, we dove into the Pentaho APIs and developed a system to modify the metadata file automatically, adding tables and relationships whenever users create a new custom form.

TriSano is a Ruby-on-Rails application, running on JRuby, and the ability to use Java objects natively within JRuby was critical to interfacing correctly with Pentaho, a Java application. Within JRuby, our script can create Pentaho objects at will. Interested parties are encouraged to browse the source code of the TriSano script for the many details required to make this work.

In short, the script makes a new Pentaho metadata file entirely from scratch, using only information from a small number of purpose-built database tables, and database structure information taken directly from the PostgreSQL catalogs. It creates a schema file, populates it with descriptions of each of the actual database tables our users are interested in, assigns friendly names to each of the database objects with which users will interact, and divides up the results into user-defined groupings meaningful to their business.

I'm not familiar with a commercial reporting package that allows for modification of the underlying metadata without user intervention; doing something like this without the benefit of open-source software would have been daunting indeed.

PL/LOLCODE and INLINE functions

PostgreSQL 8.5 recently learned how to handle "inline functions" through the DO statement. Further discussion is here, but the basic idea is that within certain limitations, you can write ad hoc code in any language that supports it, without having to create a full-fledged function. One of those limitations is that you can't actually return anything from your function. Another is that the language has to support an "inline handler".

PostgreSQL procedural languages all have a language handler function, which gets called whenever you execute a stored procedure in that language. An inline handler is a separate function, somewhat slimmed down from the standard language handler. PostgreSQL gives the inline handler an argument containing, among other things, the source text passed in the DO block, which the inline handler simply has to parse and execute.
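At the SQL level, a procedural language advertises its inline handler in its CREATE LANGUAGE definition, roughly like this (the function and library names here are illustrative, not necessarily PL/LOLCODE's actual ones):

CREATE FUNCTION pllolcode_call_handler() RETURNS language_handler
    AS '$libdir/pllolcode' LANGUAGE C;

CREATE FUNCTION pllolcode_inline_handler(internal) RETURNS void
    AS '$libdir/pllolcode' LANGUAGE C;

CREATE LANGUAGE pllolcode
    HANDLER pllolcode_call_handler
    INLINE pllolcode_inline_handler;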

As of when the change was committed in PostgreSQL, only PL/pgSQL supported inline functions. Other languages may now support them; today I spent the surprisingly short time needed to add the capability to PL/LOLCODE. Here's a particularly useless example:

DO $$
HAI
 VISIBLE "This is a test of INLINE stuff"
KTHXBYE
$$ language pllolcode;