WebP images experiment on End Point website

WebP is an image format for RGB images on the web that supports both lossless (like PNG) and lossy (like JPEG) compression. It was released by Google in September 2010 with open source reference software available under the BSD license, accompanied by a royalty-free public patent license, making it clear that they want it to be widely adopted by any and all without any encumbrances.

Its main attraction is smaller file size at a similar quality level. It also supports an alpha channel (transparency) and animation for both lossless and lossy images. Thus it is the first image format that offers PNG-style transparency in lossy images at much smaller file sizes, along with animation, which was previously available only in the archaic limited-color GIF format.

Comparing quality & size

While considering WebP for an experiment on our own website, we were very impressed by its file size to quality ratio. In our tests it was even better than generally claimed. Here are a few side-by-side examples from our site. You'll only see the WebP version if your browser supports it:

12,956 bytes JPEG

2186 bytes WebP

11,149 bytes JPEG

2530 bytes WebP

The original PNG images were converted by ImageMagick to JPEG, and by `cwebp -q 80` to WebP. I think we probably should increase the WebP quality a bit to keep a little of the facial detail that flattens out, but it's amazing how good these images look for file sizes that are only 17% and 23% of the JPEG equivalent.

One of our website's background patterns has transparency, making the PNG format a necessity, but it also has a gradient, which PNG compression is particularly inefficient with. WebP is a major improvement there, at 13% the size of the PNG. The image is large so I won't show it here, but you can follow the links if you'd like to see it:

337,186 bytes  container-pattern.png
 43,270 bytes  container-pattern.webp

Browser support

So, what is the downside? WebP is currently natively supported only in Chrome and Opera among the major browsers, though amazingly, support for other browsers can be added via WebPJS, a JavaScript WebP renderer.

Why don't the other browsers add support, given the liberal license? You would especially expect Firefox to support it. In fact a patch has been pending for years, and the debate about adding support still smolders. Why?

WebP does not yet support progressive rendering, Exif tagging, or non-RGB color spaces such as CMYK, and it is limited to 16,384 pixels per side. Some Firefox developers feel that it would do the Internet community a disservice to support an image format still under development, causing uncertain levels of support in various clients, so they will not accept WebP in its current state.

Many batch image-processing tools now support WebP, and there is a free Photoshop plug-in for it. Some websites are quietly using it just because of the cost savings due to reduced bandwidth.

For our first experiment serving WebP images from the End Point website, I decided to serve WebP images only to browsers that claim to be able to support it. They advertise that support in this HTTP request header:

Accept: image/webp,*/*;q=0.8

That says explicitly that the browser can render image/webp, so we just need to configure the server to send WebP images. One way to do that is in the application server, by having it send URLs pointing to WebP files.
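For instance, the application server could pick the file extension per request by inspecting that header. Here is a minimal Ruby sketch of the idea; the helper names and the .jpg fallback are hypothetical, not code from our site:

```ruby
# Hypothetical helpers, for illustration only. Returns true when the
# browser's Accept header explicitly lists image/webp.
def webp_supported?(accept_header)
  accept_header.to_s.split(',')
               .map { |entry| entry.split(';').first.strip }
               .include?('image/webp')
end

# Pick a URL extension based on the client's advertised support.
def image_url(base, accept_header)
  webp_supported?(accept_header) ? "#{base}.webp" : "#{base}.jpg"
end
```

In a Rails view, something like image_url("josh_williams", request.headers["Accept"]) would be one (again, hypothetical) way to wire it up.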

Let's plan to have both common format (JPEG or PNG) and WebP files side by side, and then try a way that is transparent to the application and can be enabled or disabled very easily.

Web server rewrites

It's possible to set up the web server to transparently serve WebP instead of JPEG or PNG if a matching file exists. Based on some examples other people posted, we used this nginx configuration:

    set $webp "";
    set $img "";
    if ($http_accept ~* "image/webp") { set $webp "can"; }
    if ($request_filename ~* "(.*)\.(jpe?g|png)$") { set $img $1.webp; }
    if (-f $img) { set $webp "$webp-have"; }
    if ($webp = "can-have") {
        add_header Vary Accept;
        rewrite "(.*)\.\w+$" $1.webp break;
    }

It's also good to add to /etc/nginx/mime.types:

image/webp webp;

so that .webp files are served with the correct MIME type instead of the default application/octet-stream, or worse, text/plain with perhaps a bogus character set encoding.

Then we just make sure identically-named .webp files match .png or .jpg files, such as those for our examples above:

-rw-rw-r-- 337186 Nov  6 14:10 container-pattern.png
-rw-rw-r--  43270 Jan 28 08:14 container-pattern.webp
-rw-rw-r--  14734 Nov  6 14:10 josh_williams.jpg
-rw-rw-r--   3386 Jan 28 08:14 josh_williams.webp
-rw-rw-r--  13420 Nov  6 14:10 marina_lohova.jpg
-rw-rw-r--   2776 Jan 28 08:14 marina_lohova.webp

A request for a given $file.png will work as normal in browsers that don't advertise WebP support, while those that do will instead receive the $file.webp image.

The image is still being requested with a name ending in .jpg or .png, but that's just a name as far as both browser and server are concerned, and the image type is determined by the MIME type in the HTTP response headers (and/or by looking at the file's magic numbers). So the browser will have a file called $something.jpg in the DOM and in its cache, but it will actually be a WebP file. That's ok, but could be confusing to users who save the file for whatever reason and find it isn't actually the JPEG they were expecting.

301/302 redirect option

One remedy for that is to serve the WebP file via a 301 or 302 redirect instead of transparently in the response, so that the browser knows it's dealing with a different file named $something.webp. To do that we changed the nginx configuration like this:

    rewrite "(.*)\.\w+$" $1.webp permanent;

That adds a little bit of overhead, around 100-200 bytes unless large cookies are sent in the request headers, and another network round-trip or two, though it's still a win with the reduced file sizes we saw. However, I found that it isn't even necessary right now due to an interesting behavior in Chrome that may even be intentional to cope with this very situation. (Or it may be a happy accident.)

Chrome image download behavior

Versions of Chrome I tested send the Accept: image/webp [etc.] request header only when fetching images from an HTML page, not when you manually request a single file or ask the browser to save an image from the page by right-clicking or similar. In those cases the Accept header is not sent, so the server doesn't know the browser supports WebP, and you get the JPEG or PNG you asked for. That was actually a little confusing to hunt down by sniffing the HTTP traffic on the wire, but it may be a nice behavior for users as long as WebP is still less well known.

Batch conversion

It's fun to experiment, but we needed to actually get all the images converted for our website. Surprisingly, even converting from JPEG isn't too bad, though you need a higher quality setting and the file size will be larger. Still, for best image quality at the smallest file size, we wanted to start with original PNG images, not recompress JPEGs.

To make that easy, we wrote a couple of shell scripts for Linux using bash and cwebp. We found a few exceptional images that came out larger in WebP than in PNG or JPEG, so the script deletes any WebP file that is not smaller; in that case our nginx configuration will not find a .webp file and will serve the original PNG or JPEG.
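The cleanup logic in those scripts boils down to something like this Ruby sketch (the real scripts are bash plus cwebp; the method name here is made up, and the actual conversion command is left as a comment):

```ruby
# Hypothetical sketch: for each PNG in a directory, a cwebp conversion
# would produce a sibling .webp; afterwards, delete any .webp file that
# did not come out smaller than its original.
def prune_oversized_webp(dir)
  Dir.glob(File.join(dir, '*.png')).each do |png|
    webp = png.sub(/\.png\z/, '.webp')
    # system('cwebp', '-q', '80', png, '-o', webp)  # actual conversion step
    next unless File.exist?(webp)
    # Keep the WebP only when it is a genuine size win
    File.delete(webp) if File.size(webp) >= File.size(png)
  end
end
```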

Full-page download sizes compared

Here are performance tests run using Chrome 32 on Windows 7 over a simulated cable Internet connection. The difference in total download size is most impressive, and on a slower mobile network or with higher latency (greater distance from the server) the difference in download time would be even larger.

Page      With WebP       Without WebP
Page 1    374 KB, 2.9s    850 KB, 3.4s
Page 2    613 KB, 3.6s    1308 KB, 4.1s


This article is not even close to a comprehensive shootout between WebP and other image types. There are other sites that consider the image format technical details more closely and have well-chosen sample images.

My purpose here was to convert a real website to WebP in bulk, without hand-tuning individual images or spending too much time on the project overall. I wanted to see whether the infrastructure is easy enough to set up, whether the download size and speed improve enough to make it worth the trouble, and to get real-world experience so we can decide whether, and in which situations, to recommend it to our clients.

So far it seems worth it, and we plan to continue using WebP on our website. With empty browser caches, visit using Chrome and then one of the browsers that doesn't support WebP, and see if you notice a speed difference on first load, or any visual difference.

I hope to see WebP further developed and more widely supported.

Mobile Emulation in Chrome DevTools

I have been doing some mobile development lately and wanted to share the new Mobile Emulation feature in Chrome Canary with y'all. Chrome Canary is a development build of Chrome which gets updated daily and gives you the chance to use the latest and greatest features in Chrome. I've been using it as my primary browser for the past year or so and it's been fairly stable. What's great is that you can run Chrome Canary side-by-side with the stable release version of Chrome. For the odd time I do have issues with stability etc., I can just use the latest stable Chrome and be on my way. If you need more convincing, Paul Irish's Chrome Canary for Developers post might be helpful.

I should mention that Chrome Canary is only available for OS X and Windows at this point. I tested Dev channel Chromium on Ubuntu 13.10 this afternoon and the new mobile emulation stuff is not ready there yet. It should not be long though.

Mobile Emulation in Chrome Dev Tools

Once enabled, the Emulation panel shows up in the DevTools console drawer. It gives you the option of emulating a variety of devices (many are listed in the drop-down) and you also have the ability to fine-tune the settings à la carte. If you choose to emulate the touchscreen interface, the mouse cursor will change and operate like a touch interface. Shift+drag allows you to pinch and zoom. There are some cool features for debugging and inspecting touch events as well.

Learning More

If you would like to learn more, be sure to check out the Mobile emulation documentation at the Chrome DevTools docs site.

Functional Handler - A Pattern in Ruby

First, a disclaimer. Naming things in the world of programming is always a challenge. Naming this blog post was also difficult. There are all sorts of implications that come up when you claim something is "functional" or that something is a "pattern". I don't claim to be an expert on either of these topics, but what I want to describe is a pattern that I've seen develop in my code lately and it involves functions, or anonymous functions to be more precise. So please forgive me if I don't hold to all the constraints of both of these loaded terms.

A pattern

The pattern that I've seen lately is that I need to accomplish a myriad of steps, all in sequence, and I should only proceed to the next step if the current step succeeds. This is common in the world of Rails controllers. For example:

def update
  @order = Order.find params[:id]

  if @order.update_attributes(params[:order])
    @order.calculate_tax!
    @order.calculate_shipping!
    @order.send_invoice! if @order.complete?
    flash[:notice] = "Order saved"
    redirect_to :index
  else
    render :edit
  end
end


What I'm really trying to accomplish here is that I want to perform the following steps:

  • Find my order
  • Update the attributes of my order
  • Calculate the tax
  • Calculate the shipping
  • Send the invoice, but only if the order is complete
  • Redirect back to the index page.

There are a number of ways to accomplish this set of steps. There's the option above, but now my controller is doing way more than it should and testing it is going to get ugly. In the past, I may have created callbacks in my order model, something like after_save :calculate_tax_and_shipping and after_save :send_invoice, if: :complete?. The trouble with this approach is that these steps now occur any time my order is updated. There may be many instances where I want to update my order and what I'm updating has nothing to do with calculating totals. This is particularly problematic when these calculations take a lot of processing and have a lot of dependencies on other models.

Another approach may be to move some of my steps into the controller before and after filters (now before_action and after_action in Rails 4). This approach is even worse because I've spread my order specific steps to a layer of my application that should only be responsible for routing user interaction to the business logic of my application. This makes maintaining this application more difficult and debugging a nightmare.

The approach I prefer is to hand off the processing of the order to a class that has the responsibility of processing the user’s interaction with the model, in this case, the order. Let's take a look at how my controller action may look with this approach.

def update
  handler = OrderControllerHandler.new(params)
  handler.execute!

  if handler.order_saved?
    redirect_to :index
  else
    @order = handler.order
    render :edit
  end
end

OK, now that I have my controller set up so that it’s only handling routing, as it should, how do I implement this OrderControllerHandler class? Let’s walk through it:

class OrderControllerHandler

  attr_reader :order

  def initialize(params)
    @params = params
    @order = nil # a null object would be better!
    @order_saved = false
  end

  def execute!
  end

  def order_saved?
    @order_saved
  end

end
We now have the skeleton of our class set up and all we need to do is proceed with the implementation. Here’s where we can bust out our TDD chops and get to work. In the interest of brevity, I’ll leave out the tests, but I want to make the point that this approach makes testing so much easier. We now have a specific object to test without messing with all the intricacies of the controller. We can test the controller to route correctly on the order_saved? condition which can be safely mocked. We can also test the processing of our order in a more safe and isolated context. Ok, enough about testing, let’s proceed with the implementation. First, the execute method:

def execute!
  lookup_order
  update_order
  calculate_tax
  calculate_shipping
  send_invoice!
end
Looks good right? Now we just need to create a method for each of these statements. Note, I’m not adding responsibility to my handler. For example, I’m not actually calculating the tax here. I’m just going to tell the order to calculate the tax, or even better, tell a TaxCalculator to calculate the tax for my order. The purpose of the handler class is to orchestrate the running of these different steps, not to actually perform the work. So, in the private section of my class, I may have some methods that look like this:

def lookup_order
  @order = Order.find(@params[:id])
end

def update_order
  @order_saved = @order.update_attributes(@params[:order])
end

def calculate_tax
  # tell the order, or better a TaxCalculator, to do the work
end
... etc, you get the idea

Getting function(al)

So far, so good. But we have a problem here. What do we do if the lookup up of the order fails? I wouldn’t want to proceed to update the order in that case. Here’s where a little bit of functional programming can help us out (previous disclaimers apply). Let’s take another shot at our execute! method again and this time, we’ll wrap each step in an anonymous function aka, stabby lambda:

def execute!
  steps = [
    ->{ lookup_order },
    ->{ update_order },
    ->{ calculate_tax },
    ->{ calculate_shipping },
    ->{ send_invoice! },
  ]

  steps.each { |step| break unless step.call }
end

What does this little refactor do for us? Well, it makes each step conditional on the return status of the previous step. Now we will only proceed through the steps when they complete successfully. But now each of our steps needs to return either true or false. To pretty this up and add some more meaning, we can do something like this:

def stop; false; end
def proceed; true; end

def lookup_order
  @order = Order.find(@params[:id])
  @order ? proceed : stop
end
Now each of my step methods has a nice clean way to show that I should either proceed or stop execution that reads well and is clear on its intent.
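To make the control flow concrete, here is a self-contained toy version of the same idea that you can run; the class and step names are invented for illustration and are not part of the order handler itself:

```ruby
# Toy handler: runs three named steps in sequence and halts as soon as
# one of them returns false, just like the execute! loop above.
class TinyHandler
  attr_reader :log

  def initialize(fail_at: nil)
    @fail_at = fail_at # which step (if any) should report failure
    @log = []
  end

  def proceed; true; end
  def stop; false; end

  def execute!
    steps = [
      ->{ run(:first) },
      ->{ run(:second) },
      ->{ run(:third) },
    ]
    steps.each { |step| break unless step.call }
  end

  private

  # Record that the step ran, then report success or failure
  def run(name)
    @log << name
    name == @fail_at ? stop : proceed
  end
end
```

With fail_at: :second, execute! runs :first and :second but never reaches :third.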

We can continue to improve this by catching some errors along the way so that we can report back what went wrong if there was a problem.

attr_reader :order, :errors

def initialize(params)
  @params = params
  @order = nil # a null object would be better!
  @order_saved = false
  @errors = []
end

def proceed; true; end

def stop(message="")
  @errors << message if message.present?
  false
end

def invalid(message)
  @errors << message
  proceed
end

def lookup_order
  @order = Order.find(@params[:id])
  @order ? proceed : stop("Order could not be found.")
end

I’ve added these helpers to provide us with three different options for capturing errors and controlling the flow of our steps. We use the proceed method to just continue processing, invalid to record an error but continue processing anyway, and stop to optionally take a message and halt the processing of our step.

In summary, we’ve taken a controller with a lot of mixed responsibilities and conditional statements that determine the flow of the application and implemented a functional handler. This handler orchestrates the running of several steps and provides a way to control how those steps are run and even captures some error output if need be. This results in much cleaner code that is more testable and maintainable over time.

Homework Assignment

  • How could this pattern be pulled out into a module that could be easily included every time I wanted to use it?
  • How could I decouple the OrderControllerHandler class from the controller and make it a more general class that can be easily reused throughout my application anytime I needed to perform this set of steps?
  • How could this pattern be implemented as a functional pipeline that acts on a payload? How is this similar to Rack middleware?


def steps
  [
    ->(payload){ step1(payload) },
    ->(payload){ step2(payload) },
    ->(payload){ step3(payload) },
  ]
end

def execute_pipeline!(payload)
  last_result = payload
  steps.each do |step|
    last_result = step.call(last_result)
  end
  last_result
end

Unbalanced HTML considered harmful for jQuery

This isn't earth-shattering news, but it's one of those things that someone new to jQuery might trip over so I thought I'd share.

I had a bad experience recently adding jQuery to an existing page that had less than stellar HTML construction, and I had neither the time nor the budget to clean up the HTML before starting work. Thus, I was working with something much more complex than, but equally broken as, what follows:

<table>
<form>
<tr><td><input type="text"></td></tr>
</form>
</table>

The jQuery I added did something like this:

$('form input').css('background-color', 'red');

and of course, I was quite puzzled when it didn't work. The pitfall here is that jQuery doesn't handle misconstructed or unbalanced HTML as gracefully as your average browser, which will shift things around internally until something makes sense to it. The minimal solution is to move the opening and closing "form" tags outside the table.

Using Google Maps and jQuery for Location Search

Example of Google maps showing Paper Source locations.

A few months ago, I built out functionality to display physical store locations within a search radius for Paper Source on an interactive map. There are a few map tools out there to help accomplish this goal, but I chose Google Maps because of my familiarity and past success using it. Here I'll go through some of the steps to implement this functionality.

Google Maps API Key

Before you start this work, you'll want to get a Google Maps API key. Learn more here.

Geocoder Object

At the core of our functionality is the use of the google.maps.Geocoder object. The Geocoder converts a search point or search string into geographic coordinates. The most basic use of the geocoder might look like this:

var geocoder = new google.maps.Geocoder();
//search is a string, input by user
geocoder.geocode({ 'address' : search }, function(results, status) {
  if(status == "ZERO_RESULTS") {
    //Indicate to user no location has been found
  } else {
    //Do something with resulting location(s)
  }
});

Rendering a Map from the Results

After a geocoder results set is acquired, a map and locations might be displayed. A simple and standard implementation of Google Maps can be executed, with the map center set to the geocoder results set center:

var mapOptions = {
  center: results[0].geometry.bounds.getCenter(),
  zoom: 10,
  mapTypeId: google.maps.MapTypeId.ROADMAP
};
var map = new google.maps.Map(document.getElementById("map"), mapOptions);

Searching within a Radius

Next up, you may want to figure out how to display a set of locations inside the map bounds. At the time I implemented the code, I found no functionality that automagically did this, so I based my solution off of a few references I found online. The following code excerpt steps through the process:

//search center is the center of the geocoded location
var search_center = results[0].geometry.bounds.getCenter();

//Earth's radius, used in distance calculation
var R = 6371;

//Step through each location
$.each(all_locations, function(i, loc) {
  //Calculate distance from map center with the haversine formula
  var loc_position = new google.maps.LatLng(loc.latitude, loc.longitude);
  var dLat  = locations.rad(loc.latitude - search_center.lat());
  var dLong = locations.rad(loc.longitude - search_center.lng());
  var a = Math.sin(dLat/2) * Math.sin(dLat/2) +
    Math.cos(locations.rad(search_center.lat())) *
    Math.cos(locations.rad(loc.latitude)) *
    Math.sin(dLong/2) * Math.sin(dLong/2);
  var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
  var d = R * c;
  loc.distance = d;

  //Add the marker to the map
  var marker = new google.maps.Marker({
    map: map,
    position: loc_position,
    title: loc.store_title
  });

  //Convert distance to miles (readable distance) for display purposes
  loc.readable_distance = (google.maps.geometry.spherical.
    computeDistanceBetween(search_center, loc_position) *
    0.000621371).toFixed(2);
});
The important thing about this code is that it renders markers for all the locations, though only a subset of them will be visible within the map bounds.

Figuring out which Locations are Visible

If you want to display additional information in the HTML related to currently visible locations (such as in the screenshot at the top of this post), you might consider using the map.getBounds().contains() method:

var render_locations = function(map) {
  var included_locations = [];
  //Loop through all locations to determine which are contained in the map boundary
  $.each(all_locations, function(i, loc) {
    if(map.getBounds().contains(new google.maps.LatLng(loc.latitude, loc.longitude))) {
      included_locations.push(loc);
    }
  });

  // sort locations by distance if desired
  // render included_locations
};
The above code determines which locations are visible, sorts those locations by readable distance, and then those locations are rendered in the HTML.

Adding Listeners

After you've got your map and location markers added, a few map listeners will add more functionality, described below:

var listener = google.maps.event.addListener(map, "idle", function() {
  render_locations(map);
  google.maps.event.addListener(map, 'center_changed', function() {
    render_locations(map);
  });
  google.maps.event.addListener(map, 'zoom_changed', function() {
    render_locations(map);
  });
  google.maps.event.removeListener(listener);
});
After the map has loaded (via the map "idle" event), render_locations is called to render the HTML for visible locations. This method is also triggered any time the map center or zoom level is changed, so the HTML to the left of the map is updated whenever a user modifies the map bounds.

Advanced Elements

Two advanced pieces implemented were the use of extending the map bounds and modifying listeners in a mobile environment. When it was desired that the map explicitly contain a set of locations within the map bounds, the following code was used:

var current_bounds = results[0].geometry.bounds;
$.each(locations_to_include, function(i, loc) {
  current_bounds.extend(new google.maps.LatLng(loc.latitude, loc.longitude));
});
map.fitBounds(current_bounds);

And in a mobile environment, it was desired to disable various map options such as draggability, zoomability, and scroll wheel use. This was done with the following conditional:

if($(window).width() < 656) {
  map.setOptions({
    draggable: false,
    zoomControl: false,
    scrollwheel: false,
    disableDoubleClickZoom: true,
    streetViewControl: false
  });
}


Of course, all the code shown above is just in snippet form. Many of the building blocks described above were combined to build a user-friendly map feature. There are a lot of additional map features – check out the documentation to learn more.

End Point’s New Tennessee Office — We’re Hiring!

End Point has opened a new office for our Liquid Galaxy business in Bluff City, TN. Bluff City is in the Tri-Cities region on the eastern border of Tennessee. Our new office is 3500+ square feet and has ample office and warehouse space for our growing Liquid Galaxy business.


From our start back in 1995, End Point has always been a “distributed company” with a modest headquarters in Manhattan. (The headquarters was especially modest in our early days.) The majority of End Point’s employees work from their home offices. Our work focusing on development with Open Source software and providing remote systems support requires relatively little space, so our distributed office arrangement has been a cost-effective and wonderful way to work. However, over the last four years our work with Liquid Galaxy systems has presented us with some old-fashioned physical-world challenges. Our space requirements have steadily increased as we’ve tested more and more components and built several generations of Liquid Galaxies, and as we’ve prepped and packed increasing numbers of systems going out for permanent installations and events. We’ve well outstripped the space of our Manhattan headquarters, my garage, and a storage unit; hence, our new office in Bluff City.


We moved into our Tennessee facility in November and have been whipping it into shape since then. Our Liquid Galaxy team remains distributed, with a good number of staffers in our NYC office, plus elsewhere in the US and internationally, but we now have a core of talented personnel in our new office too: Matt Vollrath, Will Plaut, and Neil Elliott. And, we will be adding to our staff in the new office as well! At last we have lots of space! If you are interested in joining us in our Bluff City office developing and supporting Liquid Galaxy systems and have excellent Linux or Geospatial Information Systems skills then let us know through our contact form.


News of FreeOTP, RHEL/CentOS, Ruby, Docker, HTTP

I've had interesting tech news items piling up lately and it's time to mention some of those that relate to our work at End Point. In no particular order:

  • FreeOTP is a relatively new open source 2-factor auth app for Android by Red Hat. It can be used instead of Google Authenticator which last year became proprietary as noted in this Reddit thread. The Google Authenticator open source project now states: "This open source project allows you to download the code that powered version 2.21 of the application. Subsequent versions contain Google-specific workflows that are not part of the project." Whatever the reason for that change was, it seems unnecessary to go along with it, and to sweeten the deal, FreeOTP is quite a bit smaller. It's been working well for me over the past month or more.
  • Perl 5.18.2 was released.
  • Ruby 2.1.0 was released.
  • Ruby 1.9.3 end of life set for February 2015, and the formerly end-of-life Ruby 1.8.7 & 1.9.2 have had their maintenance extended for security updates until June 2014 (thanks, Heroku!).
  • Red Hat and CentOS have joined forces, in an unexpected but exciting move. Several CentOS board members will be working for Red Hat, but not in the Red Hat Enterprise Linux part of the business. Red Hat's press release and the CentOS announcement give more details.
  • Red Hat Enterprise Linux 7 is available in beta release. The closer working with CentOS makes me even more eagerly anticipate the RHEL 7 final release.
  • Docker has been getting a lot of well-deserved attention. Back in September, Red Hat announced it would help modify Docker to work on RHEL. To date only very recent Ubuntu versions have been supported by Docker because it relies on AUFS, a kernel patch never expected to be accepted in the mainline Linux kernel, and deprecated by the Debian project. Now Linode announced their latest kernels support Docker, making it easier to experiment with various Linux distros.
  • New maintenance releases of PostgreSQL, PHP, Python 2.7, and Python 3.3 are also out. Not to take all the small steps for granted!
  • Finally, for anyone involved in web development or system/network administration I can recommend a nice reference project called Know Your HTTP * Well. I've looked most closely at the headers section. It helpfully groups headers, has summaries, and links to relevant RFC sections.

And we're already two weeks into January!

Copy Data Between MySQL Databases with Sequel Pro

Sequel Pro

I often use Sequel Pro when I'm getting up to speed on the data model for a project or when I just want to debug in a more visual way than with the mysql command-line client. It's a free OS X application that lets you inspect and manage MySQL databases. I also find it very useful for making small changes to the data while I develop and test web apps.

Quickly Copy Data Between Databases

I recently needed a way to copy a few dozen records from one database to another. I tried using the "SELECT...INTO OUTFILE" method but ran into a permissions issue with that approach. Using mysqldump was another option, but that seemed like overkill in this case: I only needed to copy a few records from a single table. At this point I found a really neat and helpful feature in Sequel Pro: Copy as SQL INSERT

Copy as sql insert

I simply selected the records I wanted to copy and used the "Copy as SQL INSERT" feature. The SQL insert statement I needed was now on the system clipboard and easily copied over to the other database and imported via the mysql command-line client.


The Sequel Pro website describes Bundles which extend the functionality in various ways — including copying data as JSON. Very handy stuff. Many thanks to the developers of this fine software. If you're on OS X, be sure to give it a try.

IPTables: All quotes are not created equal

We have been working on adding comments to our iptables rules to make it a lot easier to know what each rule is for when reviewing the output of /sbin/iptables -L. If you aren't familiar with the comment ability in iptables, it is pretty straightforward to use. You just add this to an existing or new rule:

-m comment --comment "testing 1 2 3"

While updating a system this weekend, I had the displeasure of learning that though this is a pretty easy addition to make, the quotes you use do make a difference. As you can see in the example above, it uses double quotes. The culprit of my displeasure was the dreaded single quote.

When the server rebooted I noticed that iptables didn't start as expected so I tried to start it using service iptables start and was greeted with this error:

iptables: Applying firewall rules: Bad argument `1'
Error occurred at line: 30

I loaded up the /etc/sysconfig/iptables file in vim and started to try to figure out what had changed on line 30. I reviewed the rule and it looked pretty straight forward.

-A INPUT  -s -p tcp -m multiport --dports 22,80 -j ACCEPT -m comment --comment 'testing 1 2 3'

Why was it freaking out over the 1 in the comment? I knew we had done comments with spaces in them before without any issues, so what was the deal? Knowing that in other instances the type of quote matters, after a few minutes of scratching my head I decided to try changing the single quotes to double quotes on a hunch. After adjusting the quotes, I ran service iptables start and much happiness was had as it started. I moved on with the other systems I needed to get done and called it a night.

Today I decided to circle back and get a better idea of why the single quotes were causing the problem, so I started testing different setups. My first test was to remove the quotes completely and see if that worked. It failed as expected, which was good to see. My second test was to switch back to single quotes and remove all the spaces. This did work, but didn't generate the results I was expecting. What I ended up with in iptables -L was output that looked like this:

ACCEPT     tcp  --           multiport dports 22,80 /* 'testing' */ 

I didn't notice it at first, but see how the single quotes actually made it into the comment itself? This means that iptables didn't parse the quotes at all; it took them as literal characters in my comment. After realizing this, it was clear why my comment 'testing 1 2 3' was causing iptables to throw an error: it was seen as 'testing 1 2 3', spaces and single quotes included, instead of as "testing 1 2 3", one complete string with spaces included and the double quotes stripped. Changing the single quotes to double quotes did the trick, and iptables finally saw the comment as a single complete string.
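For the record, here's the working form of the rule, double quotes and all (a minimal sketch; the source address from the original rule is omitted here):

```
# In /etc/sysconfig/iptables, double quotes are parsed by iptables-restore;
# single quotes are treated as literal characters in the comment.
-A INPUT -p tcp -m multiport --dports 22,80 -m comment --comment "testing 1 2 3" -j ACCEPT
```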

Spot On Cost Effective Performance Testing

AWS is, in my humble opinion, pricey. However, they provide a nice alternative to the on-demand EC2 instances most people are familiar with: spot instances. In essence, spot instances let you bid on otherwise-idle compute capacity. Recent changes to the web console seem to highlight spot instances a bit more than they used to, but I still don't see them mentioned often.

The advantage is that you get the same selection of instance sizes, and they perform the same as normal on-demand instances for (usually) a fraction of the price. The downside is that they may disappear at a moment's notice if there isn't enough spare capacity when someone else spins up a normal on-demand instance, or simply outbids you. That certainly happened to us on occasion, but not as frequently as I originally expected. They also take a couple of minutes to evaluate the bid price when you put in a request, which can be a bit of a surprise if you're used to the almost-instantaneous on-demand instance provision time.

We made extensive use of spot instances in some of the software cluster testing we recently did. For our purposes those caveats were no big deal. If we got outbid, the test could always be restarted in a different Availability Zone with a little more capacity, or we just waited until the demand went down.

At the height of our testing we were spinning up 300 m1.xlarge instances at once. Even when getting the best price for those spot instances, the cost of running a cluster that large adds up quickly! Automation was very important. Our test scripts took hold of the entire process, from spinning up the needed instances, kicking off the test procedure and keeping an eye on it, retrieving the results (and all the server metrics, too) once done, then destroying the instances at the end.

Here's how we did it:

First, among the high-level key components, the test driver script was home-grown and fairly specific to the task. Something like Chef could have been taught to spin up spot instances, but those kinds of configuration management tools are better at keeping systems up and running; we needed the ability to run a task and immediately shut down the instances when done. That script was written in Python, and leans on the boto library to control the instances.

Second, a persistent 'head node' was kept up and running as a normal instance. This ran a Postgres instance and provided a centralized place for the worker nodes to report back to. Why Postgres? I needed a simple way to number and count the nodes in a way immune to race conditions, and sequences were what came to mind. It also gave us a place to collect the test results and system metrics, and to compress them down before transferring them out of AWS.

Third, we used customized AMIs. Why script the installation of SSH keys, Java, YCSB or Cassandra or whatever, system configuration like our hyper 10-second-interval sysstat, application parameters, and so on onto each of those 300 stock instances? Do it once on a tiny micro instance, get it how you want it, and snapshot the thing into an AMI. Everything's ready to go from the start.

There, those are the puzzle pieces. Now how does it all fit together?

When kicking off a test we give the script a name, a test type, a data node count, and maybe a couple of other basic parameters if needed. The script performs the calculations for dataset size, number of client nodes needed to drive those data nodes, etc. Once it has all that figured out, the script creates two tables in Postgres, one for data nodes and one for client nodes, and then fires off two batch requests for spot instances. We give them a launch group name to make sure they're all started in the same AZ, our customized AMI, a bit of userdata, and a bid price:

max_price = '0.10'
instance_type = 'm1.xlarge'
# conn is a boto.ec2 EC2Connection
reservation = conn.request_spot_instances(
    max_price, ami, instance_type=instance_type, count=count,
    launch_group=test_name, availability_zone_group=test_name,
    user_data=userdata, block_device_map=bdmap)

Okay, at this point I'll admit the AMIs weren't quite that simple, as some configuration still needs to happen on instance start-up. Luckily AWS gives us a handy way to do that directly from the API. When making its request for a batch of spot instances, our script sets a block of userdata in the call. When the userdata is formulated as text that appears to be a script (starting with a shebang, like #!/bin/bash), that script is executed on first boot. (If you have cloud-init in your AMIs, to be specific, but that's a separate conversation.) We leaned on that to relay the test name and identifier, test parameters, and anything else our driver script needed to communicate to the instances at start. That became the glue tying the instances back to the script execution. It also let us run multiple tests in parallel.
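As a rough sketch of that glue, the driver side might render the userdata like this (make_userdata, the config file path, and the phone-home script name are all assumptions for illustration, not the actual code):

```python
def make_userdata(test_name, test_id, params):
    # Render the #!/bin/bash mini-script each instance runs on first boot
    # (via cloud-init), writing its test configuration to a file and then
    # kicking off the phone-home script. All names here are hypothetical.
    lines = ['#!/bin/bash',
             'cat > /etc/testrun.conf <<EOF',
             'TEST_NAME=%s' % test_name,
             'TEST_ID=%s' % test_id]
    lines += ['%s=%s' % (k, v) for k, v in sorted(params.items())]
    lines += ['EOF',
              '/usr/local/bin/phone_home.sh &']
    return '\n'.join(lines)
```

The resulting text is what gets passed as the user_data argument on the spot instance request.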

You may have also noticed that the call explicitly specifies the block device map. This overrides any default mapping that may (or may not) be built into the selected AMI. We typically spun up micro instances when making changes to the images, and since those don't have any instance storage available we couldn't preconfigure it in the AMI. Setting it manually looks something like:

from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType

# Map all four ephemeral (instance store) volumes, /dev/sdb through /dev/sde
bdmap = BlockDeviceMapping()
for i, device in enumerate(('/dev/sdb', '/dev/sdc', '/dev/sdd', '/dev/sde')):
    ephemeral = BlockDeviceType()
    ephemeral.ephemeral_name = 'ephemeral%d' % i
    bdmap[device] = ephemeral

Then, we wait. The process AWS goes through to evaluate, provision, and boot takes a number of minutes. The script actually goes through a couple of stages at this point. Initially we only watched the tables in the Postgres database, and once the right number of instances reported in, the test was allowed to continue. But we soon learned that not all EC2 instances start as they should. Now the script gets the expected instance IDs and tells us which ones haven't reported in. If a few minutes pass and one or two still aren't reporting in (more on that in a bit), we know exactly which instances are having problems, and can fire up replacements.

Here's some example output from the test script log. If i-a2c4bfd1 doesn't show up soon and we can't connect to it ourselves, we can be confident it's never going to check in:

2014-01-02 05:01:46 Requesting node allocation from AWS...
2014-01-02 05:02:50 Still waiting on start-up of 300 nodes...
2014-01-02 05:03:51 Still waiting on start-up of 5 nodes...
2014-01-02 05:04:52 Checking that all nodes have reported in...
2014-01-02 05:05:02 I see 294/300 data servers reporting...
2014-01-02 05:05:02 Missing Instances: i-e833499b,i-d63349a5,i-d43349a7,i-c63349b5,i-a2c4bfd1,i-d03349a3
2014-01-02 05:05:12 I see 294/300 data servers reporting...
2014-01-02 05:05:12 Missing Instances: i-e833499b,i-d63349a5,i-d43349a7,i-c63349b5,i-a2c4bfd1,i-d03349a3
2014-01-02 05:05:22 I see 296/300 data servers reporting...
2014-01-02 05:05:22 Missing Instances: i-e833499b,i-c63349b5,i-a2c4bfd1,i-d63349a5
2014-01-02 05:05:32 I see 298/300 data servers reporting...
2014-01-02 05:05:32 Missing Instances: i-a2c4bfd1,i-e833499b
2014-01-02 05:05:42 I see 298/300 data servers reporting...
2014-01-02 05:05:42 Missing Instances: i-a2c4bfd1,i-e833499b
2014-01-02 05:05:52 I see 299/300 data servers reporting...
2014-01-02 05:05:52 Missing Instances: i-a2c4bfd1
2014-01-02 05:06:02 I see 299/300 data servers reporting...
2014-01-02 05:06:02 Missing Instances: i-a2c4bfd1
2014-01-02 05:06:12 I see 299/300 data servers reporting...
2014-01-02 05:06:12 Missing Instances: i-a2c4bfd1
2014-01-02 05:06:22 I see 299/300 data servers reporting...
2014-01-02 05:06:22 Missing Instances: i-a2c4bfd1
2014-01-02 05:06:32 I see 299/300 data servers reporting...
2014-01-02 05:06:32 Missing Instances: i-a2c4bfd1
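The waiting stage above boils down to a polling loop along these lines (wait_for_nodes and the reported callback are illustrative stand-ins for the real script's Postgres queries, not the actual code):

```python
import time

def wait_for_nodes(expected_ids, reported, timeout=600, poll=10):
    # expected_ids: the instance IDs AWS says it launched. reported is a
    # callable returning the set of IDs that have phoned home so far (in
    # the real script, a query against the head node's Postgres tables).
    # Returns the set of instances still missing; empty on success.
    deadline = time.time() + timeout
    while True:
        missing = set(expected_ids) - set(reported())
        if not missing or time.time() >= deadline:
            return missing
        print('Missing Instances: %s' % ','.join(sorted(missing)))
        time.sleep(poll)
```

Anything still in the returned set after the timeout is a candidate for manual inspection and replacement.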

Meanwhile on the AWS side, as each instance starts up, that userdata mini-script writes out its configuration to various files. The instance then kicks off a phone-home script, which connects back to the Postgres instance on the head node, adds its own ID, IP address, and hostname, and receives back its node number. (Hurray INSERT ... RETURNING!) It also discovers any local instance storage it has, and configures that automatically. The node is then configured for its application role, which may depend on what it's discovered so far. For example, nodes 2 through n in the Cassandra cluster will look up the IP address for node 1 and use that as their gossip host, as well as use their node numbers for the ring position calculation. Voila, hands-free cluster creation for Cassandra, MongoDB, or whatever we need.

Back on the script side, once everything's reported in and running as expected, a sanity check is run on the nodes. For example with Cassandra it checks that the ring reports the correct number of data nodes, or similarly for MongoDB that the correct number of shard servers are present. If something's wrong, the human that kicked off the test (who hopefully hasn't run off to dinner expecting that all is well at this point) is given the opportunity to correct the problem. Otherwise, we continue with the tests, and the client nodes are all instructed to begin their work at the same time, beginning with the initial data load phase.

Coordinated parallel execution isn't easy. Spin off threads within the Python script and wait until each returns? Set up asynchronous connections to each node, and poll to see when each is done? Nah, just pipe the node IP address list, via the subprocess module, to:

xargs -P (node count) -n 1 -I {} ssh root@{} (command)

It almost feels like cheating. Each step is distributed to all the client nodes at once, and doesn't return until all of the nodes complete.
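A minimal sketch of that pattern, assuming run_across_nodes is a hypothetical name (the runner parameter exists only so the sketch can be exercised locally with echo instead of real ssh connections):

```python
import subprocess

def run_across_nodes(ips, remote_cmd, runner='ssh root@{}'):
    # Fan remote_cmd out to every node at once via xargs -P; the call
    # blocks until the slowest node finishes. xargs substitutes each IP
    # from stdin into the {} placeholder in the ssh target.
    argv = (['xargs', '-P', str(len(ips)), '-n', '1', '-I', '{}']
            + runner.split() + remote_cmd.split())
    result = subprocess.run(argv, input='\n'.join(ips) + '\n',
                            capture_output=True, text=True, check=True)
    return result.stdout
```

check=True means a non-zero exit from any node's command aborts the run, which is exactly what we want in a test harness.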

Between each step, we perform a little sanity check and push out a sysstat comment. Not strictly necessary, but if we're looking through a server's metrics it makes it easy to see which phase/test we're looking at, rather than trying to refer back to timestamps.

run_across_nodes(data_nodes+client_nodes, "/usr/lib/sysstat/sadc -C \\'Workload {0} finished.\\' -".format(workload))

When the tests are all done, it's just a matter of collecting the test results (the output from the load generators) and the metrics. The files are simply scp'd down from all the nodes. The script then issues terminate() commands to AWS for each of the instances it's used, and considers itself done.

Fun AWS facts we learned along the way:

Roughly 1% of the instances we spun up were duds. I didn't record any hard numbers, but we routinely had instances that never made it through the boot process to report in, or weren't accessible over the network at all. Occasionally it seemed that shortly after those were terminated, a subsequent run was more likely to get a dud instance; presumably I was just landing back on the same faulty hardware. I eventually learned to leave the dead ones running long enough to kick off the tests I wanted, then terminate them once everything else was running smoothly.

On rare occasions, instances were left running after the script completed. I never got around to figuring out whether it was a script bug or AWS not acting on a .terminate() command, but I soon learned to keep an eye on the running instance list to make sure everything was shut down when all the test runs were done for the day.