Using ln -sf to replace a symlink to a directory

When you want to forcibly replace a symbolic link on some kind of Unix (here I'm using the version of ln from GNU coreutils), you can do it the manual way:

rm -f /path/to/symlink
ln -s /new/target /path/to/symlink

Or you can provide the -f argument to ln to have it replace the existing symlink automatically:

ln -sf /new/target /path/to/symlink

(I was hoping this would be an atomic action such that there's no brief period when /path/to/symlink doesn't exist, as when mv moves a file over top of an existing file. But it's not. Behind the scenes it tries to create the symlink, fails because a file already exists, then unlinks the existing file and finally creates the symlink.)

Anyway, that's convenient, but I ran into a gotcha which was confusing. If the existing symlink you're trying to replace points to a directory, the above command dereferences the old symlink and creates the new symlink inside the directory it points to, rather than replacing it. (Or it fails if the referent is invalid.)

To replace an existing directory symlink, use the -n argument to ln:

ln -sfn /new/target /path/to/symlink

That's always what I have wanted it to do, so I need to remember the -n.
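
Here is a quick illustration of the difference; the paths under /tmp are just throwaway examples:

mkdir -p /tmp/demo/old-target /tmp/demo/new-target
ln -s /tmp/demo/old-target /tmp/demo/symlink

# Without -n, ln dereferences the existing symlink and creates the new
# symlink *inside* the old target directory:
ln -sf /tmp/demo/new-target /tmp/demo/symlink
ls -l /tmp/demo/old-target    # now contains: new-target -> /tmp/demo/new-target
ls -l /tmp/demo/symlink       # still points to /tmp/demo/old-target

# With -n, the symlink itself is replaced:
ln -sfn /tmp/demo/new-target /tmp/demo/symlink
ls -l /tmp/demo/symlink       # now points to /tmp/demo/new-target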

GNU Screen: follow the leader

First of all, if you're not using GNU Screen, start now :).

Years ago, Jon and I spoke of submitting patches to implement some form of "follow the leader" (like the children's game, but with a work-specific purpose) in GNU Screen. This was around the time he was patching screen to raise the hard-coded limit of windows allowed within a given session, which might give an idea of how much screen gets used around here (a lot).

The basic idea was that sometimes we just want to "watch" a co-worker's process as they're working on something within a shared screen session. Of course, they're going to be switching between screen windows and if they forget to announce "I've switched to screen 13!" on the phone, then one might quickly become lost. What if the cooperative work session doesn't include a phone call at all?

To the rescue, Screen within Screen.

Accidentally arriving at one screen session within another screen session is a pretty common "problem" for new screen users. However, creative use of two (or more) levels of nested screen during a shared session allows for a "poor man's" follow the leader.

If the escape sequence of the outermost screen is changed to something other than the default, then the default escape sequence will pass through and take effect on the inner screen. In this way, anyone attached to the outermost screen will be following whoever is controlling the inner screen session as they flip between windows, grep logs, launch editors and save my vegan bacon! To "break away" from the co-working session, a user would simply use the chosen non-default escape sequence of the outermost screen to create a new window or disconnect entirely.

Sound confusing? Give some of the following commands a try. You can always just close out all the windows of a screen session and eventually you'll make it back to your original shell.

Steps:

  1. start the outermost screen session (called "followme") with a non-default escape sequence (pick one that suits you):
    screen -S followme -e ^ee
  2. from within the "followme" session, start the inner screen where actual work will be performed:
    screen -S work
  3. get friends and co-workers (logged-in as the same user) to connect to your "followme" screen:
    screen -x followme
  4. work as normal using the default <CTRL> <a> sequences (which ought to affect the inner "work" session).
  5. to "break away" from the "work" session, use <CTRL> <e> sequences (which ought to affect the outer "followme" session). For example, to disconnect from the shared session, one would type: <CTRL> <e> <d>

Note: If those sharing the screen session are already acclimated to screen-within-screen, you can skip the non-default escape sequences entirely and use <CTRL> <a> <a> as the escape sequence (another <a> for every level of screen-within-screen). This also happens to be your evasion route for accidental screen-within-screen moments.

Remember that, by default, everyone who wants to share the screen must already be logged-in as the same user (without the use of sudo or su). There are methods of allowing shared screen access between users, but those are outside the scope of this post.

Have fun!

edited on 09 OCT 2014 to update bacon link

Permission denied for postgresql.conf

I recently saw a problem in which Postgres would not start up when called via the standard 'service' script, /etc/init.d/postgresql. This was on a normal Linux box, Postgres was installed via yum, and the startup script had not been altered at all. However, running this as root:

 service postgresql start

...simply gave a "FAILED".

Looking into the script showed that output from the startup attempt should be going to /var/lib/pgsql/pgstartup.log. Tailing that file showed this message:

  postmaster cannot access the server configuration file
  "/var/lib/pgsql/data/postgresql.conf": Permission denied

However, the postgres user can read this file, as shown by switching to that account with su and viewing it. What's going on? Well, any time you see something odd on Linux, especially where permissions are involved, you should suspect SELinux. The first thing to check is whether SELinux is running, and in what mode:

# sestatus

SELinux status:                 enabled
SELinuxfs mount:                /selinux
Current mode:                   enforcing
Mode from config file:          enforcing
Policy version:                 21
Policy from config file:        targeted

Yes, it is running and most importantly, in 'enforcing' mode. SELinux logs to /var/log/audit/ by default on most distros, although some older ones may log directly to /var/log/messages. In this case, I quickly found the problem in the logs:

# grep postgres /var/log/audit/audit.log | grep denied | tail -1

type=AVC msg=audit(1234567890.334:432): avc:  denied  { read } for
pid=1234 comm="postmaster" name="pgsql" dev=newpgdisk ino=403123  
scontext=user_u:system_r:postgresql_t:s0
tcontext=system_u:object_r:var_lib_t:s0 tclass=lnk_file

Looks like SELinux did not like a symlink, and sure enough:

# ls -ld /var/lib/pgsql /var/lib/pgsql/data /var/lib/pgsql/data/postgresql.conf

lrwxrwxrwx. 1 postgres postgres 18 1999-12-31 23:55 /var/lib/pgsql -> /mnt/newpgdisk
drwx------. 2 postgres postgres  4096 1999-12-31 23:56 /var/lib/pgsql/data
-rw-------. 1 postgres postgres 16816 1999-12-31 23:57 /var/lib/pgsql/data/postgresql.conf

Here we see that although the postgres user owns the symlink, owns the data directory at /var/lib/pgsql/data, and owns the file in question, /var/lib/pgsql/data/postgresql.conf, the conf file is no longer really on /var/lib/pgsql, but is on /mnt/newpgdisk. SELinux did not like the fact that the postmaster process was trying to read across that symlink.

Now that we know SELinux is the problem, what can we do about it? There are four possible solutions at this point to get Postgres working again:

First, we can simply edit the PGDATA assignment within the /etc/init.d/postgresql file to point to the actual data dir, and bypass the symlink. In this case, we'd change the line as follows:

#PGDATA=/var/lib/pgsql/data
PGDATA=/mnt/newpgdisk/data

The second solution is to simply turn SELinux off. Unless you are specifically using it for something, this is the quickest and easiest solution.

The third solution is to change the SELinux mode. Switching from "enforcing" to "permissive" will keep SELinux on, but rather than denying access, it will log the attempt and still allow it to proceed. This mode is a good way to debug things while you attempt to put in new enforcement rules or change existing ones.
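
For reference, here is roughly how the second and third solutions look on a Red Hat-style system (the /etc/selinux/config location is the usual default, but check your distro):

# setenforce 0
# sestatus

Running setenforce 0 flips the running system into permissive mode until the next reboot. To make either change permanent, set SELINUX=permissive or SELINUX=disabled in /etc/selinux/config; disabling SELinux entirely takes full effect after a reboot.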

The fourth solution is the most correct one, but also the most difficult: carving out an SELinux exception for the new symlink. If you move things around again, you'll need to tweak the rules again, of course.
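
As a rough sketch of that fourth option, assuming the targeted policy's postgresql_db_t type and the semanage tool from policycoreutils are available, relabeling the new location might look like this:

# semanage fcontext -a -t postgresql_db_t "/mnt/newpgdisk(/.*)?"
# restorecon -Rv /mnt/newpgdisk
# ls -ldZ /mnt/newpgdisk/data/postgresql.conf

The exact type, paths, and any extra rules needed for the symlink itself depend on your policy, so re-check the audit log afterward to confirm the denials are gone.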

SEO: External Links and PageRank

I had a flash of inspiration to write an article about external links in the world of search engine optimization. I've created many SEO reports for End Point's clients with an emphasis on the technical aspects of search engine optimization. However, at the end of each SEO report, I always like to point out that search engine performance depends on having high quality, fresh, and relevant content as well as popularity (for example, PageRank). The number of external links to a site is a large factor in its popularity, and so can positively influence search engine performance.

After wrapping up a report yesterday, I wondered if the external link data that I provide to our clients is meaningful to them. What is the average response when I report, "You should get high quality external links from many diverse domains"?

So, I investigated some data of well known and less well known sites to display a spectrum of external link and PageRank data. Here is the origin of some of the less well known domains referenced in the data below:

And here is the data:

I retrieved the PageRank from a generic PageRank tool. SEOmoz was used to collect external link counts and external linking subdomains. Finally, Yahoo Site Explorer was used to retrieve external link counts to the domain in question. I chose to examine external link counts from both SEOmoz and Yahoo Site Explorer to get a better representation of the data. SEOmoz compiles its data about once a month and does not have as many URLs indexed as Yahoo, which explains why its numbers may lag behind the Yahoo Site Explorer external link counts.

Out of curiosity, I went on to plot the PageRank data vs. the log (base 10) of the other data.

PageRank vs Log of SEOmoz external link count

PageRank vs Log of SEOmoz external linking subdomain count

PageRank vs Log of Yahoo SiteExplorer external link count

PageRank is described as a theoretical probability value on a logarithmic scale and it's based on inbound links, PageRank of inbound links, and other factors such as Google visit data, search click-through rates, etc. The true popularity rank is a rank between 1 and X, where X is equal to the total number of webpages crawled by search engine A. After pages are individually ranked between 1 and X, they are scaled logarithmically between 0 and 10.

The takeaway from this data is that when an "SEO report" gives advice to "get more external links", it means:

  • If your site has a PageRank of < 4, getting external links on the scale of hundreds may impact your existing PageRank or popularity
  • If your site has a PageRank of >= 4 and < 6, getting external links on the scale of thousands may impact your existing PageRank or popularity
  • If your site has a PageRank of >= 6 and < 8, getting external links on the scale of tens to hundreds of thousands may impact your existing PageRank or popularity
  • If your site has a PageRank of >= 8, you probably are already doing something right...

Furthermore, even if a site improves external link counts, other factors will play into the PageRank algorithm. Additionally, keyword relevance and popularity play key roles in search engine results.

Learn more about End Point's technical SEO services.

Migrating Postgres with Bucardo 4

A new major version of Bucardo (version 4) has just been released. The latest version, 4.0.3, can be found at the Bucardo website. The complete list of changes is available on the new Bucardo wiki.

One of the neat tricks you can do with Bucardo is an in-place upgrade of Postgres. While it still requires application downtime, you can minimize your downtime to a very, very small window by using Bucardo. We'll work through an example below, but for the impatient, the basic process is this:

  1. Install Bucardo and add large tables to a pushdelta sync
  2. Copy the tables to the new server (e.g. with pg_dump)
  3. Start up Bucardo and catch things up (e.g. copy all row changes since step 2)
  4. Stop your application from writing to the original database
  5. Do a final Bucardo sync, and copy over non-replicated tables
  6. Point the application to the new server

With this, you can migrate very large databases from one server to another (or from Postgres 8.2 to 8.4, for example) with a downtime measured in minutes, not hours or days. This is possible because Bucardo supports replicating a "pre-warmed" database - one in which most of the data is already there.

Let's test out this process, using the handy pgbench utility to create a database. We'll go from PostgreSQL 8.2 (the original database, called "A") to PostgreSQL 8.4 (the new database, called "B"). The first step is to create and populate database A:

  initdb -D testA
  echo port=5555 >> testA/postgresql.conf
  pg_ctl -D testA -l a.log start
  createdb -p 5555 alpha
  pgbench -p 5555 -i alpha
  psql -p 5555 -c 'create user bucardo superuser'

At this point, we have four tables:

  $ psql -p 5555 -d alpha -c '\d+'
                          List of relations
   Schema |   Name   | Type  |  Owner   |    Size    | Description
  --------+----------+-------+----------+------------+-------------
   public | accounts | table | postgres | 13 MB      |
   public | branches | table | postgres | 8192 bytes |
   public | history  | table | postgres | 0 bytes    |
   public | tellers  | table | postgres | 8192 bytes |

For the purposes of this example, let's make believe that the accounts table is actually 13 TB. :) The next step is to prepare the 8.4 database:

  initdb -D testB
  echo port=5566 >> testB/postgresql.conf
  pg_ctl -D testB -l b.log start

We'll copy everything except the data itself to the new server:

  pg_dumpall --schema-only -p 5555 | psql -p 5566 -f -

Because the other tables are very small, we're only going to use Bucardo to copy over the large "accounts" table. So let's install Bucardo and add a sync to do just that:

  sudo yum install perl-DBIx-Safe
  tar xvf Bucardo-4.0.3.tar.gz
  cd Bucardo-4.0.3
  perl Makefile.PL
  sudo make install

(That's a very quick overview - see the Installation page for more information.)

Let's install bucardo on the new database:

  mkdir /tmp/bctest
  bucardo_ctl install --dbport=5566 --piddir=/tmp/bctest

Set the port so we don't have to keep typing it in:

  echo dbport=5566 > .bucardorc

Now teach Bucardo about both databases:

  bucardo_ctl add db alpha name=oldalpha port=5555
  bucardo_ctl add db alpha name=newalpha port=5566

Finally, create a sync to copy from old to new:

  bucardo_ctl add sync pepper type=pushdelta source=oldalpha targetdb=newalpha tables=accounts ping=false

This adds a new sync named "pepper", which is of type pushdelta (master-slave: copy changes from the source table to the target(s)). The source is our old server, named "oldalpha" by Bucardo. The target database is our new server, named "newalpha". The only table in this sync is "accounts", and we set ping to false, which means that we do NOT create a trigger on this table to signal Bucardo that a change has been made, as we will be kicking the sync manually.

At this point, the accounts table has a trigger on it that is capturing which rows have been changed. The next step is to copy the existing table from the old database to the new database. There are many ways to do this, such as a NetApp snapshot, using ZFS, etc., but we'll use the traditional way of a slow but effective pg_dump:

  pg_dump alpha -p 5555 --data-only -t accounts | psql -p 5566 -d alpha -f -

This can take as long as it needs to. Reads and writes can still happen against the old server, and changes can be made to the accounts tables. Once that is done, here's the situation:

  • The old server is still in production
  • The new server has a full but outdated copy of 'accounts'
  • The new server has empty tables for everything but 'accounts'
  • All changes to the accounts table on the old server are being logged.

Our next step is to start up Bucardo, and let it "catch up" the new server with all changes that have occurred since we created the sync:

  bucardo_ctl start

You can keep track of how far along the sync is by tailing the log file (syslog and ./log.bucardo by default) or by checking on the sync itself:

  bucardo_ctl status pepper

Once it has caught up (how long depends on how busy the accounts table is, of course), the only disparity should be any rows that have changed since the sync last ran. You can kick off the sync again if you want:

  bucardo_ctl kick pepper 0

The final 0 there will allow you to see when the sync has finished.

For the final step, we'll need to move the remainder of the data over. This begins our production downtime window. First, stop the app from writing to the database (reading is okay). Next, once you've confirmed nothing is making changes to the database, make a final kick:

  bucardo_ctl kick pepper 0

Next, copy over the other data that was not replicated by Bucardo. This should be small tables that will copy quickly. In our case, we can do it like this:

  pg_dump alpha -p 5555 --data-only -T accounts -N bucardo | psql -p 5566 -d alpha -f -

Note that we excluded the schema bucardo, and copied all tables *except* the 'accounts' one.

That's it! You can now point your application to the new server. There are no Bucardo triggers or other artifacts on the new server to clean up. At this point, you can shutdown Bucardo itself:

  bucardo_ctl stop

Then shutdown your old Postgres and start enjoying your new 8.4 server!

Client Side Twitter Integration

I was recently assigned a project that required an interesting solution: Crisis Consultation Services. The site is essentially composed of five static pages and two dynamic components.

The first integration point required PayPal payment processing. Crisis Consultation Services links out to PayPal, where payment processing is completed. Upon payment completion, the user is bounced back to a static receipt page. This integration was quite simple, as PayPal provides the exact form that must be included in the static HTML.

The second integration point required a unique solution. The service offered by the static brochure site depends on the availability and schedules of the company's employees, so service availability remains entirely dynamic. The obvious solution was to include dynamic functionality whereby the employees could update their availability. Some of the thoughts that crossed our minds about how to update the availability were:

  • Could we build an app for the employees to update the availability given the budget constraints?
  • Could the employees use ftp or ssh to upload a single file containing details on their availability?
  • Are there other dynamic tools that we could use to track the availability of the consultant such as SMS or Twitter?

Initially, we investigated using Google App Engine with a Python app that retrieved the availability information from an existing tool. To keep the budget down and try to stick with a purely static site on the server, we decided to investigate using Twitter for integration. I reviewed the Twitter API and found some code snippets for integrating Twitter via JavaScript. Below are snippets and explanations of the resulting code.

First, a script that retrieves the Twitter feed is appended to the document body. In this case, the endpoint Twitter account is pinged to get the most recent comment (count=1), and the resulting callback 'twitterAfter' is made after the JSON feed has been retrieved.

var url = 'http://twitter.com/statuses/user_timeline/endpoint.json?callback=twitterAfter&count=1';
var script = document.createElement('script');
script.setAttribute('src', url);
document.body.appendChild(script);

Next, the callback 'twitterAfter' function is called. The callback function includes logic to determine if the consultant is available based on the most recent twitter message. If the datetime is in the future, the consultant is not available and will be available at that future datetime. If the datetime is in the past, the consultant is available and has been available since that datetime.

var twitterAfter = function(obj) {
   var now = new Date();
   var available = new Date(obj[0].text.replace(/-/g, '/'));
   if (now >= available) {
       alert('Consultant is available!');
       // do other whizbang stuff here
   }
   return;
};

In another example of a more complex callback, the availability of the consultant is calculated.

var twitterAfter_advanced = function(obj) {
    var now = new Date();
    var available = new Date(obj[0].text.replace(/-/g, '/'));
    var mins_available = parseInt((available.getTime() - now.getTime()) / 60000, 10);
    if (mins_available < 1) {
        alert('Consultant is available!');
        // do other whizbang stuff here
    } else {
        alert('Consultant is not available. The consultant will be available in ' + mins_available + ' minute(s).');
        // do other whizbang stuff here
    }
    return;
};

Here is an example Twitter feed to be used with this client side code:

2009-09-13 9:00 - 6:00pm Sept 12th from web
2009-09-12 8:30 - 7:00pm Sept 11th from web
2009-09-10 22:00 - 5:00pm Sept 10th from web

The above example Twitter feed would yield the following availability:

Sept 10th 5pm - Sept 10th 10pm: Not Available
Sept 10th 10pm - Sept 11th 7pm: Available
Sept 11th 7pm - Sept 12th 8:30am: Not Available
Sept 12th 8:30am - Sept 12th 6pm: Available
Sept 12th 6pm - Sept 13th 9am: Not Available
Sept 13th 9am - now: Available

In both the basic and advanced callback methods above, content on the page is updated to inform users of service or consultant availability. In the application of the advanced callback method, the user is notified when the consultant will be available.

The client side Twitter integration solution fit our budget and server constraints - the functionality lives entirely on the client side, so we weren't concerned about server installation, setup, or requirements. Additionally, Twitter is such a popular app that there are many convenient ways to tweet availability from a mobile environment.

Tests are contracts, not blank checks

Recently, I wrote up a new class and some tests to go along with it, and I was lazy and sloppy. My class had a fairly simple implementation (mostly a set of accessors, plus a to_s method). It looked something like this:

class Soldier
  attr_accessor :name, :rank, :serial_number
  def initialize(name,rank,serial_number)
    @name = name
    @rank = rank
    @serial_number = serial_number
  end

  def to_s
    "#{name}, #{rank}, #{serial_number}"
  end
end

I had been trying to determine the essential attributes of the class (e.g., what are the minimal elements of this class? should I have a base class, then sub-class it for the various differences, or should I have only a single class that contains everything I need?)

As a result of the speculative nature of the development, my tests only included a few of the attributes.

What's wrong with that?

On the surface, there is nothing technically wrong with skipping accessor tests: after all, testing each accessor individually is really testing Ruby, not the code I wrote. Another excuse I made is that testing each individually is very non-DRY - the testing code itself has lots of duplication.

The problem is that the set of tests should be considered a contract between the class writer and the outside world. By not including the correct and complete list of accessors, I left out important information; it's a check, already signed by the class developer, but with the amount left blank.

I've seen some code solve the non-DRY-ness problem like the following:

class Soldier
  Attributes = [:name, :rank, :serial_number]
  Attributes.each {|attr| attr_accessor attr}
  ...

then testing code of:

  Attributes.each do |attr|
    it "should have an accessor for #{attr}" do
      ...

That lets the testing code be nice and compact: simply load in the class, then iterate over the Attributes to verify that the accessors are present.

From a tests are contracts standpoint, this approach is terrible, though, perhaps even worse than the original, incomplete set of tests I had written. All the reader of the tests learns is that there is an array of attributes; the reader has to go look at the implementation itself to see what those attributes are.

Better is to use an anonymous array in the test, duplicating the attribute list; i.e.,

  [:name,:rank,:serial_number].each do |attr|
    it "should have an accessor for #{attr}" do
      ...
    end
  end

That seems to strike a good balance between keeping tests as contracts and keeping them DRY.

Starting processes at boot under SELinux

There are a few common ways to start processes at boot time in Red Hat Enterprise Linux 5 (and thus also CentOS 5):

  1. Standard init scripts in /etc/init.d, which are used by all standard RPM-packaged software.
  2. Custom commands added to the /etc/rc.local script.
  3. @reboot cron jobs (for vixie-cron, see `man 5 crontab` -- it is not supported in some other cron implementations).

Custom standalone /etc/init.d init scripts become hard to differentiate from RPM-managed scripts (not having the separation of e.g. /usr/local vs. /usr), so in most of our hosting we've avoided those unless we're packaging software as RPMs.

rc.local and @reboot cron jobs seemed fairly equivalent, with crond starting at #90 in the boot order, and local at #99. Both of those come after other system services such as Postgres & MySQL have already started.

To start up processes as various users we've typically used su - $user -c "$command" in the desired order in /etc/rc.local. This was mostly for convenience in easily seeing in one place what all would be started at boot time. However, when running under SELinux this runs processes in the init_t context which usually prevents them from working properly.

The cron @reboot jobs don't have that SELinux context problem and work fine, just as if run from a login shell, so now we're using those. Of course they have the added advantage that regular users can edit the cron jobs without system administrator intervention.
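
For example, an @reboot entry in the application user's crontab (the command and log file paths here are just placeholders) looks like this:

  @reboot /usr/local/bin/start-myapp >> $HOME/start-myapp.log 2>&1

The job runs much as it would from a login shell, which is why it avoids the init_t problem described above.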

Increasing maildrop's hardcoded 5-minute timeout

One of the ways I like to retrieve email is to use fetchmail as a POP and IMAP client with maildrop as the local delivery agent. I prefer maildrop to Postfix, Exim, or sendmail for this because it doesn't add any headers to the messages.

The only annoyance I have had is that maildrop has a hardcoded hard timeout of 5 minutes for delivering a mail message. When downloading a very long message such as a Git commit notification of a few hundred megabytes, or a short message with an attached file of dozens of megabytes, especially over a slow network connection, this timeout prevents the complete message from being delivered.

Confusingly, a partial message will be delivered locally without warning -- with the attachment or other long message data truncated. When fetchmail receives the error status return from maildrop, it then tries again, and given similar circumstances it suffers a similar fate. In the worst case this leads to hours of clogged tubes and many partial copies of the same email message, and no other new mail.

This maildrop hard timeout is compiled in and there is no runtime option to override it. Thus it is helpful to compile a custom build from source, specifying a different timeout at configure time. In my case, I set the timeout to be 1 day:

./configure --enable-global-timeout=86400 --without-db --enable-syslog=1 \
    --enable-tempdir=tmp --enable-smallmsg=65536 
make

If you choose to configure with --without-db as I do, you need to manually remove two occurrences of makedatprog from Makefile, as makedatprog is a utility only needed by DBM and won't have been compiled. Then make install as root and edit your ~/.fetchmailrc lines, adding mda "/usr/local/bin/maildrop", and restart the fetchmail daemon.
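
For example, a ~/.fetchmailrc poll entry pointing at the custom build might look something like this (the server name and credentials are placeholders):

poll mail.example.com protocol imap
    user "yourname" password "yourpassword"
    mda "/usr/local/bin/maildrop"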

Long messages will still take a long time to deliver over a slow link, but they will at least be allowed to eventually finish this way.

Tests are not Specs

We're big fans of Test Driven Development (TDD). However, a co-worker and I encountered some obstacles because we focused too intently on writing tests and didn't spend enough up-front time on good, old-fashioned specifications.

We initially discussed the new system (a publish/subscribe interface used for event management in a reasonably large system, totaling around 70K lines of Ruby). My co-worker did most of the design and put together a high-level one-pager to outline how things should work, wrote unit tests and a skeleton set of classes and modules, then handed the project to me to implement.

So far, so good. All I had to do was make all of the tests pass, and we were finished.

We only had unit tests, no integration tests, so there was no guarantee that once I was done coding, that the integration work would actually solve the problem at hand. In Testing (i.e., the academic discipline that studies testing), this is referred to as a validation problem: we may have a repeatable, accurate measure, but it's measuring the wrong thing.

We knew that was a weakness, but we pressed ahead anyway, expecting to tackle that later. As an example, we identified 3 different uses of this publish/subscribe event management mechanism that had wildly different use cases. When we discussed these with the customer, he clarified that one of the use cases is needed in the immediate term, one is useful in the short term, and that the third is out of scope. Getting that information was helpful in keeping us on track, and not having the scope grow unmanageably.

Tests are code and no code (of sufficient size and complexity) is bug-free; thus, tests have bugs.

When tests are the only spec, what is the best way to proceed?

The developer can assume tests are correct and Make The Tests Pass; clearly that is not always the best approach. It's better for the developer to exercise judgement and fix obvious errors. However, the developer's judgement can be wrong, so the test writer needs to pay special attention to any changes to the tests (and the need to catch problems implies that the test designer and developer need to be in tight communication -- don't just hand your co-worker your tests and then go on a long vacation).

Sometimes tests aren't buggy, per se, but they may be ambiguous. Variable names may not communicate clearly. They may be too large and not clearly test one thing. They may be too broad, leaving the intent or design parameters unstated.

In our recent experience, one test had the following code (slightly altered):

 it 'should pass event to each callback in sequence' do
  listener = mock 'listener'
  callback_seq = sequence 'callbacks'

  listener.expects(:one).with(@event).in_sequence(callback_seq)
  listener.expects(:two).with(@event).in_sequence(callback_seq)
  listener.expects(:three).with(@event).in_sequence(callback_seq)
  ...

What's wrong with this? On the surface, nothing is wrong, until the bigger picture is viewed: there is no other mention of a listener anywhere else in the tests, high-level design document, or code. Is a listener a subscriber? Should there be a separate listener class somewhere? After all, mock 'foo' often means that there should be a foo object. Perhaps the test developer forgot to include a file (or the developer overlooked it).

What actually transpired is not so mundane, but it identified a very different approach in testing. My colleague made the observation that it doesn't matter if a listener is a subscriber or not for this particular test, as it's really only a syntactic placeholder: we could do a variable renaming for listener and that should not change the meaning of the code.

While his observation is true and correct, it ignores Abelson and Sussman's viewpoint that "programs must be written for other humans to read, and only incidentally for machines to execute." As the implementer behind the pseudo-Chinese-wall of his tests, I expected the tests to tell me how the universe of this system should be constructed, and the mention of a listener communicated something other than the intended message.

Sometimes, even unit tests require extensive setup, and it can be tempting to add in extra tests and checks which don't add a lot of value but instead make the intent unclear, make the tests themselves less DRY, and give yet another opportunity to introduce bugs. One example looked something like:

describe 'creating subscriber entry' do
  before do
    @subscriber = stub 'subscriber'
  end
  describe 'with a method name' do
    it 'should create a block that invokes the method name' do
      class << @subscriber
        attr_accessor :weakref, :last_received

        def callback(e)
          self.last_received = e
          self.weakref = self.respond_to?(:weakref_alive?)
        end
      end
      ...
      entry = @publisher.class.create_subscriber_entry(@subscriber, :callback)
      entry.size.should == 2
      event = stub 'event'
      entry[:block].call(event)
      @subscriber.last_received.should == event
      @subscriber.weakref.should be_true
    end

Note that the purpose of the test is Creating a subscriber entry with a method name should create a block that invokes the method name, yet the test checks the size of the subscriber entry, verifies that the event is received, and that the callback itself is stored via a weak reference so that it can be garbage collected. Each of those should be in separate tests. In fact, the stated goal of the test is only implicitly checked. Better is something like

  ...
  class << @subscriber
    attr_accessor :sentinel
    def callback(e)
      self.sentinel = true
    end
  end
  entry = @publisher.class.create_subscriber_entry(@subscriber, :callback)
  event = stub 'event'
  @subscriber.sentinel.should be_nil
  entry[:block].call(event)
  @subscriber.sentinel.should be_true
end

which only tests that the named callback is invoked, properly setting a sentinel value.

If tests are being used as a specification, then they will hide important details. A simple example is the type of storage to use for a particular set of values (for us, it was callbacks). Should they be in an array? A hash? A hash of arrays? Something else?

How to handle this is a little trickier -- implementation details like this arguably should not be part of a set of tests, as the behavior is the driver, not the implementation. However, without a specification or design document that outlines what kinds of performance characteristics we should aim for, the implementer has to make choices, and those choices are not necessarily what the test writer would have wanted.

There is no one right answer there: for those who want to only use tests, then the tests need to be complete and cover the implementation details. Of course, this means that if a future scaling problem requires a change in data structures, then the test will also need to be ported to the new architecture. If specifications or design documents are used, that can speed the implementer's work, but leaves open some questions of correctness (e.g., did the implementer use the right architecture in the right way).

We solved these problems (and more) in true Agile fashion: through good communication among the customer, the test designer, and the developer. But this experience reinforced to us that tests alone are insufficient, and that good communication needs to be maintained throughout the development process.

Real specs can help, too.

Rejecting SSLv2 politely or brusquely

Once upon a time there were still people using browsers that only supported SSLv2. It's been a long time since those browsers were current, but when running an ecommerce site you typically want to support as many users as you possibly can, so you support old stuff much longer than most people still need it.

At least 4 years ago, people began to discuss disabling SSLv2 entirely due to fundamental security flaws. See the Debian and GnuTLS discussions, and this blog post about PCI's stance on SSLv2, for example.

To politely alert people using those older browsers, while still refusing to transport confidential information over the insecure SSLv2 or with ciphers weaker than 128 bits, we used an Apache configuration such as this:

# Require SSLv3 or TLSv1 with at least 128-bit cipher
<Directory "/">
    SSLRequireSSL
    # Make an exception for the error document itself
    SSLRequire (%{SSL_PROTOCOL} != "SSLv2" and %{SSL_CIPHER_USEKEYSIZE} >= 128) or %{REQUEST_URI} =~ m:^/errors/:
    ErrorDocument 403 /errors/403-weak-ssl.html
</Directory>

That accepts their SSLv2 connection, but displays an error page explaining the problem and suggesting some links to free modern browsers they can upgrade to in order to use the secure part of the website in question.

Recently we've decided to drop that extra fuss and block SSLv2 entirely with Apache configuration such as this:

SSLProtocol all -SSLv2
SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:-LOW:-SSLv2:-EXP

The downside of that is that the SSL connection won't be allowed at all, and the browser doesn't give any indication of why or what the user should do. They would simply stare at a blank screen and presumably go away frustrated. Because of that we long considered the more polite handling shown above to be superior.

But recently, after having completely disabled SSLv2 on several sites we manage, we have gotten zero complaints from customers. Doing this also makes PCI and other security audits much simpler because SSLv2 and weak ciphers are simply not allowed at all and don't raise audit warnings.

So at long last I think we can consider SSLv2 dead, at least in our corner of the Internet!

JavaScript fun with IE 8

I ran into, and found solutions for, two major gotchas targeting IE 8 with a jQuery-based (and rather JavaScript-heavy) web application.

First is to specify the 'IE 8 Standard' rendering mode by adding the following meta tag: <meta http-equiv="X-UA-Compatible" content="IE=8">

The default rendering mode is rather glitchy and tends to produce all sorts of garbage from 'clean' HTML and JavaScript. The result renders at slightly different sizes, reports incorrect values from common jQuery calls, and so on.

The default rendering also caused various layout issues (CSS handling looked more like IE 6 than IE 7), and minor markup errors (a stray extra tag on one panel) caused the entire panel to not render.

Another issue is that the browser is overly lazy about invalidating the cache for AJAX-pulled content, especially (X)HTML. This means that though you think you're pulling current data, in reality it keeps feeding you the same old data. It also means that if you use the exact same URL for HTML and JSON data, you must add a parameter to avoid cache collisions. IE 8 only seemed to honor a 'Cache-Control: no-cache' header to make it behave properly.

On the other side, I've got a big thumbs up for jQuery. I was able to produce a skinned, fairly 'heavy' client-side application that works equally well (and looks almost the same) on Firefox, Chrome, Safari, and now IE 8.