
Converting JSON to PostgreSQL values, simply

In the previous post I showed a simple PostgreSQL table for storing JSON data. Let's talk about making the JSON data easier to use.

One of the requirements was to store the JSON from the files unchanged. However, using the JSON operators to reach deeply nested attributes is a little unpleasant. In the example JSON there is an attribute country inside metadata. To access this field, we need to write:

SELECT data->'metadata'->>'country' FROM stats_data;

A native SQL version would look more like:

SELECT country FROM stats;

So let's do something to make it possible to write queries like this. We need to repack the data into nice SQL types and hide all the nested JSON operators.

I've made a simple view for this:

CREATE VIEW stats AS
SELECT
  id                                                          as id,
  created_at                                                  as created_at,
  to_timestamp((data->>'start_ts')::double precision)         as start_ts,
  to_timestamp((data->>'end_ts')::double precision)           as end_ts,
  tstzrange(
    to_timestamp((data->>'start_ts')::double precision),
    to_timestamp((data->>'end_ts')::double precision)
  )                                                           as ts_range,
  ( SELECT array_agg(x)::INTEGER[]
    FROM jsonb_array_elements_text(data->'resets') x)         as resets,
  (data->'sessions')                                          as sessions,
  (data->'metadata'->>'country')                              as country,
  (data->'metadata'->>'installation')                         as installation,
  (data->>'status')                                           as status
FROM stats_data;

This is a normal view, which means that it is only a query stored in the database. Each time the view is queried, the data must be taken from the stats_data table.
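
With the view in place, deep JSON attributes read like plain columns. A couple of simple queries as a sketch; the filter values are just examples:

SELECT country, status, start_ts FROM stats;

SELECT count(*) FROM stats WHERE country = 'USA' AND status = 'on';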

There is some code I can extract into separate functions. These will be useful in the future, and they make the view SQL cleaner.

Here are my new functions:

CREATE OR REPLACE FUNCTION to_array(j jsonb) RETURNS integer[] AS $$
  SELECT array_agg(x)::INTEGER[] FROM jsonb_array_elements_text(j) x;
$$
LANGUAGE sql
IMMUTABLE;

CREATE OR REPLACE FUNCTION to_timestamp(j jsonb) RETURNS timestamptz AS $$
  SELECT to_timestamp(j::text::double precision);
$$
LANGUAGE sql
IMMUTABLE;

CREATE OR REPLACE FUNCTION to_timestamp_range(start_ts jsonb, end_ts jsonb) RETURNS tstzrange AS $$
  SELECT tstzrange(
    to_timestamp(start_ts::text::double precision),
    to_timestamp(end_ts::text::double precision)
  );
$$
LANGUAGE sql
IMMUTABLE;
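
The functions can be checked on their own. Here is a quick sanity test using values from the example JSON; the displayed timezone depends on the server settings:

SELECT to_timestamp('1438629033'::jsonb);
-- 2015-08-03 21:10:33+02

SELECT to_array('[1438629058, 1438629123]'::jsonb);
-- {1438629058,1438629123}

SELECT to_timestamp_range('1438629033'::jsonb, '1438630833'::jsonb);
-- ["2015-08-03 21:10:33+02","2015-08-03 21:40:33+02")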

And now the view can be changed to this:

CREATE VIEW stats AS
SELECT
  id                                                   as id,
  created_at                                           as created_at,
  to_timestamp(data->'start_ts')                       as start_ts,
  to_timestamp(data->'end_ts'  )                       as end_ts,
  to_timestamp_range(data->'start_ts', data->'end_ts') as ts_range,
  to_array(data->'resets')                             as resets,
  (data->'sessions')                                   as sessions,
  (data->'metadata'->>'country')                       as country,
  (data->'metadata'->>'installation')                  as installation,
  (data->>'status')                                    as status
FROM stats_data;

So now we have normal SQL fields, except for the sessions part, which is kept as JSON on purpose.

The column types PostgreSQL derives for the view are:

                 View "public.stats"
    Column    │           Type           │ Modifiers
──────────────┼──────────────────────────┼───────────
 id           │ integer                  │
 created_at   │ timestamp with time zone │
 start_ts     │ timestamp with time zone │
 end_ts       │ timestamp with time zone │
 ts_range     │ tstzrange                │
 resets       │ integer[]                │
 sessions     │ jsonb                    │
 country      │ text                     │
 installation │ text                     │
 status       │ text                     │

The data from this view looks like this:

SELECT * FROM stats WHERE id = 1;

-[ RECORD 1 ]+----------------------------------------------------------------
id           | 1
created_at   | 2016-02-09 16:46:15.369802+01
start_ts     | 2015-08-03 21:10:33+02
end_ts       | 2015-08-03 21:40:33+02
ts_range     | ["2015-08-03 21:10:33+02","2015-08-03 21:40:33+02")
resets       | 
sessions     | [{"end_ts": 1438629089, "start_ts": 1438629058, "application": "first"},
                {"end_ts": 1438629143, "start_ts": 1438629123, "application": "second"},
                {"end_ts": 1438629476, "start_ts": 1438629236, "application": "third"}]
country      | USA
installation | FIRST
status       | on

The last part left is to extract information about the sessions. To make the reports simpler, I've extracted the sessions list into another view. However, because extracting the data is more expensive, I made it a MATERIALIZED VIEW. This means that the view not only stores the query, it also keeps all of the resulting data. It also means that the view is not updated automatically when stats_data changes, so I refresh the view data automatically in the script which loads the JSON files.
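
The refresh itself is a single statement, which the loading script runs after inserting new files:

REFRESH MATERIALIZED VIEW sessions;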

The sessions view looks like this:

CREATE MATERIALIZED VIEW sessions AS
SELECT
  id                                                                            as id,
  country                                                                       as country,
  installation                                                                  as installation,
  s->>'application'                                                             as appname,
  to_timestamp_range(s->'start_ts', s->'end_ts')                                as ts_range,
  COALESCE(bool(s->>'occupancy_triggered'), false)                              as occupancy_triggered,
  to_timestamp(s->'end_ts') - to_timestamp(s->'start_ts')                       as session_length
FROM stats, jsonb_array_elements(sessions) s
;

CREATE INDEX i_sessions_country  ON sessions (country);
CREATE INDEX i_sessions_retailer ON sessions (installation);
CREATE INDEX i_sessions_ts_range ON sessions USING GIST (ts_range);

I've also created indexes on the materialized view, as my report queries will contain a WHERE clause like:

WHERE country='' and installation='' and ts_range && tstzrange(fromdate, todate)
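
A full report query along these lines might look like this; the date range and the aggregates are just an illustration built on the columns defined above:

SELECT appname, count(*) AS session_count, sum(session_length) AS total_time
FROM sessions
WHERE country = 'USA'
  AND installation = 'FIRST'
  AND ts_range && tstzrange('2015-08-01', '2015-09-01')
GROUP BY appname
ORDER BY total_time DESC;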

Example data extracted from the JSON looks like this:

select * from sessions;

id | country | installation | appname |                      ts_range                       | occupancy_triggered | session_length
----+---------+--------------+---------+-----------------------------------------------------+---------------------+----------------
 1 | USA     | FIRST        | first   | ["2015-08-03 21:10:58+02","2015-08-03 21:11:29+02") | f                   | 00:00:31
 1 | USA     | FIRST        | second  | ["2015-08-03 21:12:03+02","2015-08-03 21:12:23+02") | f                   | 00:00:20
 1 | USA     | FIRST        | third   | ["2015-08-03 21:13:56+02","2015-08-03 21:17:56+02") | f                   | 00:04:00
(3 rows)

In the next part I will show how to load the same JSON files multiple times without any errors, using a very simple Python script.

Storing Statistics JSON Data in PostgreSQL

We have plenty of Liquid Galaxy systems where we write statistical information into JSON files. This is quite a nice solution; however, we end up with a bunch of files on a bunch of machines.

Inside we have a structure like:

{
    "end_ts": 1438630833,
    "resets": [],
    "metadata": {
        "country": "USA",
        "installation": "FIRST"
    },
    "sessions": [
        {
            "application": "first",
            "end_ts": 1438629089,
            "start_ts": 1438629058
        },
        {
            "application": "second",
            "end_ts": 1438629143,
            "start_ts": 1438629123
        },
        {
            "application": "third",
            "end_ts": 1438629476,
            "start_ts": 1438629236
        }
    ],
    "start_ts": 1438629033,
    "status": "on"
}

And the files are named like "{start_ts}.json". The number of files is different on each system. For January we had from 11k to 17k files.

The fields in the JSON are:

  • start_ts/end_ts - timestamps for the start and end of the period the file covers
  • resets - an array of timestamps marking when the system was reset
  • sessions - a list of sessions, each containing an application name and start/end timestamps

We keep these files in order to get statistics from them. So we can do one of two things: keep the files on disk and write a script for making reports, or load the files into a database and make the reports from the database.

The first solution looks quite simple. However, for a year of files and a hundred systems, there will be about 18M files.

The second solution has one huge advantage: it should be faster. A database can maintain indexes and store precomputed data for faster querying.

For the database we chose PostgreSQL. Version 9.5, released in January, has plenty of great features for managing JSON data.

The basic idea behind the database schema is:

  • the original JSON documents should be stored without any conversion
  • the report queries must be fast
  • there should be only one JSON entry per site for a given time period
  • the script loading the data should be able to load the same file many times without any error

I've started with the main table for storing jsons:

CREATE TABLE stats_data (
  id SERIAL PRIMARY KEY,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  data JSONB NOT NULL
);

This is not enough. We also want to avoid storing the same JSON multiple times. This can easily be done with an EXCLUDE constraint.

CREATE TABLE stats_data (
  id SERIAL PRIMARY KEY,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  data JSONB NOT NULL,

  CONSTRAINT no_overlapping_jsons
  EXCLUDE USING gist (
    tstzrange(
      to_timestamp((data->>'start_ts')::double precision),
      to_timestamp((data->>'end_ts'  )::double precision)
    ) WITH &&,
    ((data->>'metadata')::json->>'country')      WITH =,
    ((data->>'metadata')::json->>'installation') WITH =
  )
);

The above SQL requires a small extension to be installed:

CREATE EXTENSION IF NOT EXISTS btree_gist;
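
With the table and the extension in place, loading a file is a plain INSERT; a minimal sketch using a shortened version of the example JSON:

INSERT INTO stats_data (data)
VALUES ('{"start_ts": 1438629033, "end_ts": 1438630833,
          "metadata": {"country": "USA", "installation": "FIRST"},
          "resets": [], "sessions": [], "status": "on"}'::jsonb);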

And now inserting the same JSON results in an error:

$ insert into stats_data(data) select data from stats_data;
ERROR:  conflicting key value violates exclusion constraint "no_overlapping_jsons"

So for now we have a simple table with the original JSON, and a constraint that disallows inserting overlapping entries.

In the next part I will show how to make simple reports and load the JSON files.

Migrating to Devise in a Legacy Rails App

I've recently started working in a Rails 4 application that has accrued a lot of technical debt since it began life on Rails 1. To avoid spending the next few months tip-toeing around the code base, scared to bump something or step on a booby-trapped brick, I pitched the client for a slice of budget to pay down some of that debt and write some tests. I got the thumbs up.

I decided to tackle the app’s home-grown user authentication. The encryption it was using wasn’t up to industry standards any more, and there were a few bugs with the flash messages it set. I went with the obvious choice of using the Devise gem and backed the whole thing with some simple integration tests using the Cucumber gem. I didn’t want to annoy the users by making them reset their password when we deployed, so I came up with a way to migrate users’ passwords with no manual intervention or interruption to their workflow.

I ran the Devise generator and trimmed down the generated migration to set up only the database_authenticatable module.

[phunk@work ]$ rails generate devise:install

# db/migrate/20160121005233_add_devise_to_users.rb
class AddDeviseToUsers < ActiveRecord::Migration
  def self.up
    change_table(:users) do |t|
      ## Database authenticatable
      ## null: true because we are transitioning from legacy passwords to devise
      t.string :encrypted_password, null: true, default: ""
    end
  end
end

Next, I commented out all the legacy authentication code sprinkled throughout the application. (I deleted it afterwards, but I kept it around for reference during this work.) I powered through setting the Devise routes, tweaking the Devise views for sign in and password changes, and aliasing helper methods to call the new Devise versions. Lastly, I created Cucumber tests for the common scenarios. Each test expectation verified that the page contained the expected string as they're defined in config/locales/devise.en.yml.

Feature: User authentication

Background:
  Given User bob has an account with a password of pizza1

  Scenario: User signs in with correct credentials
    When he signs in login bob and password of pizza1
    Then he should see Signed in successfully

  Scenario: User signs in with incorrect credentials
    When he signs in login bob and password of wrongpass
    Then he should see Invalid login or password
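
A minimal sketch of matching step definitions using Capybara; the field labels, the login attribute, and the new_user_session_path route here are assumptions about this app rather than code from it:

# features/step_definitions/authentication_steps.rb (hypothetical sketch)
Given(/^User (\w+) has an account with a password of (\w+)$/) do |login, password|
  # Create the account directly; password is handled by Devise's database_authenticatable
  User.create!(login: login, password: password)
end

When(/^he signs in login (\w+) and password of (\w+)$/) do |login, password|
  visit new_user_session_path
  fill_in 'Login', with: login
  fill_in 'Password', with: password
  click_button 'Sign in'
end

Then(/^he should see (.+)$/) do |message|
  expect(page).to have_content(message)
end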

My first stab at keeping existing user passwords involved implementing a Devise custom encryptor. I ended up bailing on that approach. I didn’t like adding that amount of complexity in exchange for a solution that was slightly less cryptographically secure than Devise’s bcrypt default. I needed to find a better way.

This is where Devise’s valid_password? method comes in. I created my own version of that method and mixed it into the User model via a new ActiveSupport::Concern. Here are the relevant parts of that module:

# app/concerns/user_authentication.rb
require 'active_support/concern'

module UserAuthentication
  extend ActiveSupport::Concern

  included do
    def valid_password?(password)
      if !has_devise_password? && valid_transitional_password?(password)
        convert_password_to_devise(password)
        return true
      end

      super
    end

    def has_devise_password?
      encrypted_password.present?
    end

    def valid_transitional_password?(password)
      # secret legacy magic happens here
    end

    def convert_password_to_devise(password)
      update!(password: password)
    end
  end
end

The valid_transitional_password? method calls the legacy auth check. If it’s a match, valid_password? updates the user’s password attribute, which is supplied by Devise and saves the Devise-encrypted version of the password param. Note that it’s a one-way street: once they have a Devise-encrypted password in the database, that’s what they get authenticated against via that call to super.

Our User model mixes in the new module:

# app/models/user.rb
require 'user_authentication'

class User < ActiveRecord::Base
  include UserAuthentication

  devise :database_authenticatable

end

Once this has been running in production for a while and all the users have signed in and auto-migrated their passwords, clean-up will be easy: I’ll delete this mixin module and the two lines in the User class that reference it, and drop the legacy encrypted password column from the users table. All traces of this migration code will be gone, and this will look and run just like a vanilla Devise setup.

A side note about the arguably gratuitous one-liner methods in that mix-in module: Those help maintain a consistent level of abstraction in the main method and support the use case where a developer needs to quickly answer the question, "What’s special about how this app handles passwords?" (There are also some subtleties that these one-liners help hide, such as the legacy and Devise attribute names differing by only two characters.)

By adding integration tests to the sign in use cases, switching to Devise behind the scenes and coming up with a way to securely migrate passwords without interrupting users’ work, I was able to reduce technical debt, familiarize myself with some of the surrounding code, and make it easier and safer for us to work in this app. I also got to delete a ton of crufty code, which might be my favorite part.

P.S. If you’re working with Devise, you’ll want to spend your time in the extensive Devise wiki on GitHub.

Grammy Awards at Musica Russica

I was excited to learn this week about a Grammy Award recently awarded in connection with one of End Point’s valued customers, Musica Russica. Musica Russica is a specialized e-commerce website selling Russian choral sheet music as well as choral music recordings.

Last week’s Grammy-Award-winning (Best Choral Performance) recording by the Kansas City/Phoenix Chorales, conducted by Charles Bruffy, is featured at Musica Russica. Take a moment and listen to samples of Rachmaninoff’s All-Night Vigil.

This is the third Grammy award for best choral recordings won with music published by Musica Russica, adding to their awards from 2007 and 2015.

Musica Russica is the largest publisher of Russian choral music outside of Russia. Musica Russica’s business is built on a large collection of choral music which the founder, Dr. Vladimir Morosan, assembled in his home country. Vladimir was passionate about this historical music and was concerned that it would not be properly preserved. In the 1980s he collected sheet music from libraries and churches in the former Soviet Union, bringing the collection to the United States.

Not only did Vladimir collect music and develop a business to distribute it to music lovers around the world, he has also spent a career directing choirs to sing the music he loves. Vladimir is the founder and director of the Archangel Voices ensemble. On the Musica Russica website, you can purchase CDs or MP3s of recorded music. For much of the music, you can find the underlying sheet music on the website as well.

End Point has worked with Musica Russica since 2012, when we built their custom e-commerce website with the Ruby on Rails framework. We developed an easy-to-use shopping cart and one-page checkout for them as an open source project called Piggybak.

End Point congratulates everyone at Musica Russica and all those involved with the Rachmaninoff All-Night Vigil recording.

Building an e-commerce Website

If you are considering starting out your own e-commerce business, this article is dedicated to you.

After working with clients on many different projects, including e-commerce sites, I know that there are aspects new owners are not aware of before they dive in. Not much is written about the problems and failures web store owners ran into before they learned better. This means that one has to learn it all the hard way, very often after their development team has already put a lot of work into the e-commerce site.

With this article I’d like to save you some of that hassle and money, and give you a small introduction to what to expect.

Preparations

Apart from thinking about the budget, the kinds of products, warehouses, etc., you’ll most probably want to know how to put it all out there for the public to see. You’ve seen other web stores, and intuition may tell you that you just need a “website” with some backend functionality to manage products and see the orders. You may have heard about some “free” stacks you could use, or some hassle-free online platforms where you could just pay monthly and have your web store in almost no time.

All of this is true to some extent, but in the end often only partially. The free solutions aren’t really free, as I’ll point out later on. The “hassle-free” solutions aren’t always that hassle-free, and they definitely aren’t free or even cheap if you plan on being a great success.

Some of you may have heard or read that it’s often very advantageous to hire a team of developers (or a solo developer) who will create the whole application for you. They can take the “free” solutions and adapt them to your specific needs. Indeed, in my experience, many companies end up employing developers who steer them into using a well-known e-commerce platform they were excited about.

The popularity of such “platforms” can be seen in the rankings published from time to time.

What these popularity charts do not reveal is that many users of such frameworks become very frustrated when they try to add things the platform doesn’t handle out of the box.

A problem with “ready to use” e-commerce solutions is that they take a one-size-fits-all approach. Many of them have “plugin” systems allowing users to choose from a list of third-party extensions, but these are never tailored to your specific needs. The large list of built-in features actually impairs the ability to implement the ones you really need: it’s still possible, but much more costly and much more bug-prone.

The “online platform”, or SaaS, approach takes the development hassle out of the picture by giving you the set of features most owners would want. The downside is that you have even less room for customization, and you certainly won’t have the flexibility of adding anything you’d like, as you would with the “ready to use” or even “ground up” approaches. This is especially true if you’d like to maximize your profit by tailoring your web store to your specific type of customer. Web stores aren’t rare anymore, and you’ll need to differentiate yourself somehow in order to gain a competitive edge (apart from the one your products give you on their own). The story of Amazon and their “one click checkout” tells it all. The other disadvantage is that pricing can be based on, e.g., the number of visitors or orders. This means that the greater a success you are, the more you have to pay.

No two businesses are the same; it may be that a “ready to use” platform is exactly what you need, at least at this point in time. The SaaS solutions cover the needs of some users as well. It’s always a matter of weighing pros and cons: many things are beautiful in this world, but nothing is perfect. I’d like to present my humble set of advice to at least steer your thinking toward what my experience tells me is reasonable. From here on, I’ll focus solely on the approach of hiring a developer or a team, to maximize your ability to mold your web store and react to the needs of the market.

1. Don’t let yourself be too excited

There are many technologies that garner a lot of buzz these days. For non-technical people, it’s very easy to believe the excitement of a professional developer recommending a buzz-worthy technology. I’m here to tell you, though, from experience: no technology I was ever very excited about remained as great in my eyes after some time of using it. There’s a downside to everything. This doesn’t mean those technologies were useless or that I’d choose differently; it just means that excitement can cloud your thinking, making it much easier to make silly mistakes. It’s also easy to trap yourself with something that doesn’t deserve any consideration at all, even though it’s packaged so neatly.

If it seems that your developers are getting too excited about a particular new and hot technology, take their recommendations and endorsements with a grain of salt. There is no silver bullet, technology trends come and go and it's best to monitor innovations and adopt them when they have been tried, tested and proven.

2. Do not ever trust teams offering fixed project prices

Many business owners are tempted to expect a fixed price negotiated with a team of developers. It’s very natural: nobody wants to invest in a project without a clearly defined cost. This is how things work in other aspects of our lives (like buying a house). In the software development world this approach almost never works as expected. The reason is that creating new software isn’t like buying a house; it’s more like building one.

Software development is very complex. There are many unknowns at almost every stage of the project that make it difficult to create accurate estimates. A simple website may end up taking much longer than anticipated — not to mention large-scale e-commerce sites.

With fixed-price projects, as the completion deadline approaches, features may need to be cut or scaled back because of unforeseen challenges in the earlier stages.

At End Point we usually give estimates in the form of a range. Rather than committing to complete your project in 500 hours, we’ll break it down into smaller tasks, and the aggregated estimate might be a range of 480 to 520 hours, for example. This gives visibility into the cost of the completed project while offering flexibility for the challenges that arise as the project moves from inception to completion.

3. Let go of perfectionism

It’s so easy nowadays to read a lot about the “proper” ways of building anything. We’re bombarded with information: people write about SEO and mobile-friendly UIs; others discuss the quality of the design, the importance of typography, email campaigns, payment gateways, data security, etc. Don’t get me wrong: it’s all very important and mostly good advice. The problem is that it’s such a complex endeavor that it’s easy to get paralyzed, thinking you have to get everything perfect or else it won’t work.

In reality, it’s completely safe to put all of that aside and first decide what’s most important for reaching your initial goals. A great way to focus is to talk with experienced developers; they will help you understand the critically important aspects the application must handle properly, like data security, and contrast those with the rest of the e-commerce wants and needs.

4. Put first things first

Very often, online business owners come up with a large list of features they want, only to discover in the end that a small part of it was actually making them their money. It’s equally true that it’s very easy to get excited at each stage of development and ask developers to “add this-or-that while they’re at it”. What may seem like “simple” additions can, at certain stages of the project, take much longer than expected. In the end, owners pay much, much more for something that may not contribute to the business’s success at all.

We as developers have a number of interesting, battle-tested principles. One of them is known as YAGNI, “you aren’t gonna need it” (https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it). It means that you’re better off focusing only on the essentials that will do the job, rather than on a large number of nice-to-haves.

You have to think of the crucial, most important set of features that will make your web shop make money for you. Once you have that set, go through the list and try to prune it a bit more. Have you heard of the Pareto principle? It says that roughly 20% of the things you do will bring in 80% of the income. From the development point of view, the simpler the better. For you as an investor, it also means a much, much smaller invoice to pay.

5. Invest in a really good team from the start

These days the barrier to entry for web development has become very low; it’s quite easy to become a developer. This has led to a situation where it’s possible to hire a team very cheaply. I don’t want to scare you too much, but there’s a lot that can go badly in your web store, and most of it you won’t be able to see or control. What developers call the backend (everything that’s not related to the appearance of the site in your browser) has to be coded exceptionally well to minimize the threat of bad things happening to your business because of it.

The application logic can go terribly wrong. Badly coded software could fail to charge clients properly, losing you money. Promotion coupons could be applied when they shouldn’t be. Promotions themselves could start too soon or end too late. Orders could seem to complete successfully, with the customer being charged, while the order never makes it to the database. The pricing could be incorrect. One customer could see another customer’s login data or discounts. There could be problems with shipping or tax calculation when third-party services have outages. With little real-world experience, a weak development team can make the application very slow or make it fail completely under high traffic. And there are plenty of ill-intentioned people wanting to get into your system to steal private customer data or take down the whole web store.

It pays to pay for a really great development team.

There’s much more

There’s much, much more to the subject than an article like this can fit. I invite you to contact us if you’re considering starting an e-commerce business and would like to learn more. We’re also available to help if you already have an established business.

Creating Composite Docker Containers with Docker Compose

Composite Docker Containers

Docker is an application container system which allows logical isolation and automation of software components in isolated virtual server instances. This model is quite powerful for rapidly creating new instances of a given application and for building automated system stacks, from high-availability to high-performance clusters. Even though there is no technical limitation, the idea behind this model is that containers should be in a 1:1 relationship with each application component: if you deploy a Docker image for Apache Tomcat, the container will contain Tomcat, and only Tomcat and its core dependencies. If you need a Tomcat application server and a PostgreSQL database to go with it, in general you would create two separate containers, one with the Tomcat image and the other with the PostgreSQL image. This can lead to an undesirable amount of complexity, where you must manage both containers separately even though they are both part of the same stack. To solve this problem, the Docker team developed Docker Compose, which allows these complex applications to live inside one distinct container configuration.

Creating a composite stack with separated containers

Using the standard Dockerfile configurations and continuing the example above, you could create a Tomcat application server with a corresponding PostgreSQL database server using two separate containers. Here is an example of the Tomcat Dockerfile:

FROM java:8-jre

ENV CATALINA_HOME /usr/local/tomcat
ENV PATH $CATALINA_HOME/bin:$PATH
RUN mkdir -p "$CATALINA_HOME"
WORKDIR $CATALINA_HOME

# see https://www.apache.org/dist/tomcat/tomcat-8/KEYS
RUN gpg --keyserver pool.sks-keyservers.net --recv-keys \
 05AB33110949707C93A279E3D3EFE6B686867BA6 \
 07E48665A34DCAFAE522E5E6266191C37C037D42 \
 47309207D818FFD8DCD3F83F1931D684307A10A5 \
 541FBE7D8F78B25E055DDEE13C370389288584E7 \
 61B832AC2F1C5A90F0F9B00A1C506407564C17A3 \
 79F7026C690BAA50B92CD8B66A3AD3F4F22C4FED \
 9BA44C2621385CB966EBA586F72C284D731FABEE \
 A27677289986DB50844682F8ACB77FC2E86E29AC \
 A9C5DF4D22E99998D9875A5110C01C5A2F6059E7 \
 DCFD35E0BF8CA7344752DE8B6FB21E8933C60243 \
 F3A04C595DB5B6A5F1ECA43E3B7BBB100D811BBE \
 F7DA48BB64BCB84ECBA7EE6935CD23C10D498E23

ENV TOMCAT_MAJOR 8
ENV TOMCAT_VERSION 8.0.30
ENV TOMCAT_TGZ_URL https://www.apache.org/dist/tomcat/tomcat-$TOMCAT_MAJOR/v$TOMCAT_VERSION/bin/apache-tomcat-$TOMCAT_VERSION.tar.gz

RUN set -x \
 && curl -fSL "$TOMCAT_TGZ_URL" -o tomcat.tar.gz \
 && curl -fSL "$TOMCAT_TGZ_URL.asc" -o tomcat.tar.gz.asc \
 && gpg --verify tomcat.tar.gz.asc \
 && tar -xvf tomcat.tar.gz --strip-components=1 \
 && rm bin/*.bat \
 && rm tomcat.tar.gz*

EXPOSE 8080
CMD ["catalina.sh", "run"]

Here is an example of the PostgreSQL Dockerfile:

# vim:set ft=dockerfile:
FROM debian:jessie

# explicitly set user/group IDs
RUN groupadd -r postgres --gid=999 && useradd -r -g postgres --uid=999 postgres

# grab gosu for easy step-down from root
RUN gpg --keyserver pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates wget && rm -rf /var/lib/apt/lists/* \
 && wget -O /usr/local/bin/gosu "https://github.com/tianon/gosu/releases/download/1.2/gosu-$(dpkg --print-architecture)" \
 && wget -O /usr/local/bin/gosu.asc "https://github.com/tianon/gosu/releases/download/1.2/gosu-$(dpkg --print-architecture).asc" \
 && gpg --verify /usr/local/bin/gosu.asc \
 && rm /usr/local/bin/gosu.asc \
 && chmod +x /usr/local/bin/gosu \
 && apt-get purge -y --auto-remove ca-certificates wget

# make the "en_US.UTF-8" locale so postgres will be utf-8 enabled by default
RUN apt-get update && apt-get install -y locales && rm -rf /var/lib/apt/lists/* \
 && localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
ENV LANG en_US.utf8

RUN mkdir /docker-entrypoint-initdb.d

RUN apt-key adv --keyserver ha.pool.sks-keyservers.net --recv-keys B97B0AFCAA1A47F044F244A07FCC7D46ACCC4CF8

ENV PG_MAJOR 9.5
ENV PG_VERSION 9.5.0-1.pgdg80+2

RUN echo 'deb http://apt.postgresql.org/pub/repos/apt/ jessie-pgdg main' $PG_MAJOR > /etc/apt/sources.list.d/pgdg.list

RUN apt-get update \
 && apt-get install -y postgresql-common \
 && sed -ri 's/#(create_main_cluster) .*$/\1 = false/' /etc/postgresql-common/createcluster.conf \
 && apt-get install -y \
  postgresql-$PG_MAJOR=$PG_VERSION \
  postgresql-contrib-$PG_MAJOR=$PG_VERSION \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /var/run/postgresql && chown -R postgres /var/run/postgresql

ENV PATH /usr/lib/postgresql/$PG_MAJOR/bin:$PATH
ENV PGDATA /var/lib/postgresql/data
VOLUME /var/lib/postgresql/data

COPY docker-entrypoint.sh /

ENTRYPOINT ["/docker-entrypoint.sh"]

EXPOSE 5432
CMD ["postgres"]

Creating a composite container with Docker Compose

Using Docker Compose you can define multiple container images within a single configuration file, keeping them together as a single logical unit. To do so you give each container a name, and then provide within each container's definition the same Docker parameters you would use in a regular Dockerfile. Here is an example of the situation discussed earlier, where you have a Tomcat image and a PostgreSQL image container which go together:

db:
  image: postgres
web:
  build: .
  command: /usr/local/tomcat/bin/catalina.sh run
  volumes:
    - .:/code
  ports:
    - "8080:8080"
  links:
    - db
  log_driver: "syslog"
  log_opt:
    syslog-facility: "daemon"

Within this configuration, a number of things are being defined. First the database container is created and defined with the default postgres image. Then the web application container specifies that its Docker image will be built using the application code and Dockerfile present in the current working directory. After both containers are built, Docker performs some actions to get things ready, like starting Catalina and mounting the code from the current directory at /code in the new container. In addition there are some values to configure application logging and to link the database container to the web container. After the file is created, all that is required to start the containers is to run 'docker-compose up' from the directory where the docker-compose.yml file and the application code are located.
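
Starting, inspecting, and stopping the stack all go through standard docker-compose commands run from that same directory, for example:

docker-compose up -d      # build the web image if needed and start both containers in the background
docker-compose ps         # list the containers belonging to this compose file
docker-compose logs       # view the combined log output of both containers
docker-compose stop       # stop both containers without removing them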

After starting the containers you should see the new containers in the output of 'docker ps'. Here is the output from my test with Tomcat/PostgreSQL, from when I was doing some testing of the Struts 2 web framework for Java:

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
3e3656879a7b        strutsdocker_web    "/usr/local/tomcat/bi"   3 weeks ago         Up 3 weeks          0.0.0.0:8080->8080/tcp   strutsdocker_web_1
cb756e473ed8        postgres            "/docker-entrypoint.s"   8 weeks ago         Up 3 weeks          5432/tcp                 strutsdocker_db_1

Both containers follow a naming convention based on the name of the directory where the application source code and Docker configuration files are located, plus the container name from the compose file (db and web in this case) and an incrementing number for each instance you create. This can be really helpful in case you need to update any of the application source code and rebuild, as the db container will be retained and all of its volume data will still be intact. It's worth noting, though, that in the opposite situation, where the database needs to be rebuilt, you could do this without impacting the web container's data, but both containers would be shut down and started up together, since Docker sees them as two containers with a dependency which make up one logical application.

Conclusions

The concept of using individual images for applications, and distributing those images through the public software ecosystem, makes the initial deployment phase very easy, as much of the initial work is already done. However, the 1:1 relationship of one application per container does not really reflect the current state of web development. For complex applications that need a data layer in a database, a presentation layer in an application server, and components like search indexing, managing an individual container for each piece by hand would quickly become unwieldy. Using a composite container configuration lets you keep the benefits of the Docker image ecosystem while adding the ease of managing all the pieces of the application holistically as one unit.

Bonked By Basic_auth Because Bcrypt

tl;dr - don't use a high bcrypt cost with HTTP basic auth!

Recently we had a client approach us with reports of a slow wiki experience. This was for a MediaWiki we recently installed for them; there were no fancy extensions, and the hardware, the OS, and the Apache web server were solid, perfectly normal choices. I was tasked to dive in and solve this issue.

The first step in any troubleshooting is to verify and duplicate the problem. While the wiki did feel a bit sluggish, it was not as bad as the reports we were getting of pages taking over 15 seconds to load. A side-by-side comparison with a similar wiki seemed a good place to start. I called up the main wiki page on both the client wiki and End Point's internal wiki. Both were running the latest version of MediaWiki, had the same type of servers (located a similar distance from me), were using the same version of Apache, and had roughly the same server load. While both wikis' pages had roughly the same amount of content, the client one loaded noticeably slower: it took less than a second for the End Point wiki, and around ten seconds for the client one!

The first possible culprit was MediaWiki itself. Perhaps something was misconfigured there, or some extension was slowing everything down? MediaWiki has good debugging tools. Inside both wikis' LocalSettings.php files I turned on debugging temporarily with:

$wgDebugLogFile         = '/tmp/mediawiki.debug';
$wgDebugDBTransactions  = true;
$wgDebugDumpSql         = true;
$wgDebugTimestamps      = true;

I reloaded the page, then commented out the $wgDebugLogFile line to stop the log from growing too large (the debug output can be quite verbose!). Here are some snippets from the generated log file:

0.9151   4.2M  Start request GET /wiki/Main_Page
...
[caches] main: SqlBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
0.9266   9.2M  Implicit transaction open enabled.
0.9279   9.2M  Query wikidb (1) (slave): SET /* DatabasePostgres::open  */ client_encoding='UTF8'
0.9282   9.2M  Resource id #127: Transaction state changed from IDLE -> ACTIVE
0.9268   9.2M  Query wikidb (2) (slave): SET /* DatabasePostgres::open  */ datestyle = 'ISO, YMD'
...
0.9587   9.2M  Query wikidb (11) (slave): SELECT /* LCStoreDB::get  */  lc_value  FROM "l10n_cache"   WHERE lc_lang = 'en' AND lc_key = 'deps'  LIMIT 1
0.9573   9.5M  Query wikidb (12) (slave): SELECT /* LCStoreDB::get  */  lc_value  FROM "l10n_cache"   WHERE lc_lang = 'en' AND lc_key = 'list'  LIMIT 1
0.9567  10.8M  Query wikidb (13) (slave): SELECT /* LCStoreDB::get  */  lc_value  FROM "l10n_cache"   WHERE lc_lang = 'en' AND lc_key = 'preload'  LIMIT 1
0.9572  10.8M  Query wikidb (14) (slave): SELECT /* LCStoreDB::get  */  lc_value  FROM "l10n_cache"   WHERE lc_lang = 'en' AND lc_key = 'preload'  LIMIT 1
...
0.9875  21.2M  Query wikidb (195) (slave): SELECT /* LCStoreDB::get Greg */  lc_value  FROM "l10n_cache"   WHERE lc_lang = 'en' AND lc_key = 'messages:accesskey-pt-mycontris'  LIMIT 1
0.9873  21.2M  Query wikidb (196) (slave): SELECT /* LCStoreDB::get Greg */  lc_value  FROM "l10n_cache"   WHERE lc_lang = 'en' AND lc_key = 'messages:tooltip-pt-logout'  LIMIT 1
0.9868  21.2M  Query wikidb (197) (slave): SELECT /* LCStoreDB::get Greg */  lc_value  FROM "l10n_cache"   WHERE lc_lang = 'en' AND lc_key = 'messages:accesskey-pt-logout'  LIMIT 1
0.9883  21.2M  Query wikidb (198) (slave): SELECT /* LCStoreDB::get Greg */  lc_value  FROM "l10n_cache"   WHERE lc_lang = 'en' AND lc_key = 'messages:vector-more-actions'  LIMIT 1

Just to load a simple page, there were 194 SELECT statements! And 137 of those were looking in the l10n_cache table, one row at a time. Clearly, there is lots of room for improvement there. Someday, I may even jump in and tackle that. But for now, despite being very inefficient, it is also very fast. Because of the $wgDebugTimestamps, it was easy to compute how much time each wiki spent actually creating the page and sending it back to Apache. In this case, the difference was minimal, which meant MediaWiki was not the culprit.

I then turned my attention to Apache. Perhaps it was compiled differently? Perhaps there was some obscure SSL bug slowing things down for everyone? These were unlikely, but it was worth checking the Apache logs (which were in /var/log/httpd). There are two main logs Apache uses: access and error. The latter revealed nothing at all when I loaded the main wiki page. The access logs looked fairly normal:

85.236.207.120 - greg [19/Jan/2016:12:23:21 -0500] "GET /wiki/Main_Page HTTP/1.1" 200 23558 "-" "Mozilla/5.0 Firefox/43.0"
85.236.207.120 - greg [19/Jan/2016:12:23:22 -0500] "GET /mediawiki/extensions/balloons/js/balloon.config.js HTTP/1.1" 200 4128 "https://wiki.endpoint.com/wiki/Main_Page
" "Mozilla/5.0 Firefox/43.0"
...
85.236.207.120 - greg [19/Jan/2016:12:23:22 -0500] "GET /mediawiki/load.php?debug=false&lang=en&modules=mediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.sectionAnchor%7Cmediawiki.skinning.interface%7Cskins.vector.styles&only=styles&skin=vector HTTP/1.1" 200 58697 "https://wiki.endpoint.com/wiki/Main_Page" "Mozilla/5.0 Firefox/43.0"
85.236.207.120 - greg [19/Jan/2016:12:23:22 -0500] "GET /mediawiki/resources/assets/poweredby_mediawiki_88x31.png HTTP/1.1" 200 3525 "https://wiki.endpoint.com/wiki/Main_Page" "Mozilla/5.0 Firefox/43.0"

Still nothing out of the ordinary. What to do next? When all else fails, go to the system calls; it's about as close to bare metal as you can easily get on a Linux system. In this case, I decided to run strace on the Apache daemons to see exactly where the time was being spent. As expected, there was a large handful of httpd processes already spawned and waiting for a connection. While there was no way to know which one would field my requests, some shell-fu allowed me to strace them all at once:

## The -u prevents us from picking the parent httpd process, because it is owned by root!
$ strace -o greg.httpd.trace -tt -ff `pgrep -u apache httpd | xargs -n 1 echo -p | xargs`
Process 5148 attached
Process 4848 attached
Process 5656 attached
Process 4948 attached
Process 5149 attached
Process 5148 attached
Process 4858 attached
Process 5657 attached
Process 4852 attached
Process 4853 attached
^CProcess 5148 detached
Process 4848 detached
Process 5656 detached
Process 4948 detached
Process 5149 detached
Process 5148 detached
Process 4858 detached
Process 5657 detached
Process 4852 detached
Process 4853 detached

Looking at the output of one of these revealed some important clues:

$ head greg.httpd.trace.4948
13:00:28.799807 read(14, "\27\3\3\2\221\0\0\0\0\0\0\0\1\35-\332\3123(\200\302\"\251'g\256\363b5"..., 8000) = 666
13:00:28.799995 stat("/wiki/htdocs/mediawiki/load.php", {st_mode=S_IFREG|0644, st_size=1755, ...}) = 0
13:00:28.800126 open("/wiki/htpasswd.users", O_RDONLY|O_CLOEXEC) = 15
13:00:28.800176 fstat(15, {st_mode=S_IFREG|0640, st_size=947, ...}) = 0
13:00:28.800204 read(15, "alice:{SHA}jA0EAgMCMEpo4Wa3n/9gy"..., 4096) = 2802
13:00:28.800230 close(15)               = 0
13:00:29.496369 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={60, 0}}, NULL) = 0
13:00:29.496863 rt_sigaction(SIGPROF, {0x7fc962da7ab0, [PROF], SA_RESTORER|SA_RESTART, 0x7fc970605670}, {0x7fc962da7ab0, [PROF], SA_RESTORER|SA_RESTART, 0x7fc970605670

Aha! If you look closely at those timestamps, you will notice that the gap between the call to close() and the subsequent setitimer() is quite large: 0.69 seconds. That's a long time for Apache to be waiting around for something. The second clue is the file it had just opened: "htpasswd.users". Seeing the top of that file in the read() output, with the {SHA} entry, made me realize the problem: htpasswd files now support bcrypt as an authentication method, and bcrypt is designed to be secure, and slow. Sure enough, the htpasswd file had bcrypt entries with a high cost for the people who were having the most issues with the speed. This is what the file looked like (names and values changed):

alice:{SHA}jA0EAgMCMEpo4Wa3n/9gybBBsDPa
greg:$2y$13$+lE6+EwgtzP0m8K8VQDnYMRDRMf6rNMRZsCzko07QQpskKI9xbb/y9
mallory:$2y$15$ww8Q4HMI1Md51kul2Hiz4ctetPqJ95cmspH8T81JHfqRvmg===rVgn
carol:7RnEKJWc38uEO
bob:$apr1$uKX9Z63CqPOGX4lD1R4yVZsloJyZGf+
jon:$2y$08$SUe3Z8sgEpyDWbWhUUUU5wtVTwlpEdc7QyXOg3e5WBwM4Hu35/OSo1
eve:$apr1$I/hv09PcpU0VfXhyG7ZGaMz7Vhxi1Tm

I recognized the bcrypt format right away ($2y$13$). The people who were complaining the most (e.g. mallory in the example above) about the speed of the wiki had the highest costs, while those with low costs (e.g. jon), and those using something other than bcrypt (everyone else above), were not complaining at all!

The 'cost' is the number after the second dollar sign: as you can see, some of them had a cost of 15, which is much more expensive than a cost of 13, which is what my user ("greg") was using. This was a smoking gun, but one more step was needed for proof. I adjusted the cost of my password to something low using the htpasswd program:

$ htpasswd -B -C 6 /wiki/htpasswd.users greg
New password: 
Re-type new password: 
Updating password for user greg

Voila! The page loaded in a flash. I then changed the cost to 15, and suddenly the wiki was even slower than before, taking upwards of 15 seconds to load the main page. Mystery solved. All those high-cost bcrypt requests are also not good for the server: not only do they use a lot of CPU, they also keep the Apache daemon tied up waiting for the bcrypt to finish, rather than finishing quickly and going back to the main pool.

You may be asking a few questions at this point. Why would htpasswd offer a footgun like this? Why such a radical difference in effect for slightly different costs? Is bcrypt a good practice for an htpasswd file? Let's attempt to answer those. Before we do, we have to learn a little bit about bcrypt and passwords in general. Some of this is purposefully oversimplified, so be gentle in the comments. :)

Passwords themselves are never stored on a server (aka the machine doing the authentication). Instead, the server stores a hash of the password. This is created by what is known as a "one-way" function, which creates a unique fingerprint of your password. If this fingerprint (aka hash) is discovered, there is no direct way to see the password that created it. When you log in to a site, it creates a hash of the password you give it, then compares that hash to the one it has stored. Thus, it can verify that you have given it the correct password without actually having to store the password.

For a long time, very simple algorithms were used to create these hashes. However, as computers became more powerful, and as the field of cryptography advanced, it became easier to "crack" these hashes and determine the password that was used to create them. This was an important problem, and one of the solutions people came up with was the bcrypt algorithm, which makes the computation of the hash very expensive in terms of computer time. Furthermore, that expense is adjustable, determined by the "cost" given at creation time. You may have noticed the -C option I used in the htpasswd example above. That number does not set the number of rounds directly: the algorithm performs 2^cost rounds, which means the work grows exponentially. In other words, a cost of 13 means that bcrypt runs 2 to the 13th power rounds, or 8,192 rounds. A cost of 14 is 2 to the 14th power, or 16,384 rounds, twice as slow as a cost of 13! A cost of 15 is 32,768 rounds, and so on. Thus, one can see why even a cost of 15 would be much slower than a cost of 13.

A web page usually returns more than just the requested HTML. There are commonly images, CSS, and JavaScript that must also be loaded from the webserver to fully render the page. Each of these requests must go through basic auth, and thus gets slowed down by bcrypt. This is why, even though a single basic authentication with a bcrypt cost of 15 only takes a couple of seconds, the entire web page can take much longer.

What encryption options are available in the htpasswd program? The bcrypt option was introduced without much fanfare in version 2.4.4 of Apache, which was released on February 25, 2013, so it has been around a while. The output of --help shows us that bcrypt is the only secure option, though the legacy ones are still allowed. Also note that the cost for bcrypt can range from 4 to 31:

 -m  Force MD5 encryption of the password (default).
 -B  Force bcrypt encryption of the password (very secure).
 -C  Set the computing time used for the bcrypt algorithm
     (higher is more secure but slower, default: 5, valid: 4 to 31).
 -d  Force CRYPT encryption of the password (8 chars max, insecure).
 -s  Force SHA encryption of the password (insecure).
 -p  Do not encrypt the password (plaintext, insecure).

So should you use bcrypt for your htpasswd file? Absolutely yes. Even a lower-cost bcrypt is vastly more secure than MD5, CRYPT, or SHA. A cost of 10 is roughly the same speed as those, but a much, much better choice. You can measure the time it takes to create or update a password with the command-line htpasswd program to get a rough idea of how much impact it will have on your website; the time it takes to run htpasswd serves as a rough proxy for the total page load time. Here are some numbers I generated on my local box. Each represents the average of a few runs:
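
A quick way to generate numbers like these yourself; the -n flag prints the hash to stdout instead of updating a file, -b takes the password on the command line, and the username and password here are just placeholders:

$ time htpasswd -nbB -C 10 testuser testpass
$ time htpasswd -nbB -C 14 testuser testpass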

Bcrypt cost │ htpasswd creation time                      │ Web page load time
────────────┼─────────────────────────────────────────────┼───────────────────
         10 │ 0.079 seconds                               │ 5.68 seconds
         12 │ 0.268 seconds                               │ 6.77 seconds
         14 │ 0.979 seconds                               │ 10.78 seconds
         16 │ 3.684 seconds                               │ 25.72 seconds
         18 │ 14.683 seconds                              │ 88.85 seconds
         20 │ 58.680 seconds                              │ 358.80 seconds
         22 │ 236.369 seconds                             │ 1357.82 seconds
         31 │ 186,173 seconds (51 hours and 42 minutes!!) │ Ah...no

There are times when you really do want a higher bcrypt cost. The basic auth usage in this scenario is really the exception, not the norm. In most cases, a password is used to log in to something, and you will either create a persistent connection (e.g. SSH) or be issued a cookie with a temporary token (e.g. almost every website in the world). In those cases, a delay of a few seconds is quite acceptable, as it is a rare event.

So why do we even care about passwords so much, especially for something like basic auth and an htpasswd file? After all, if someone can view the contents of the htpasswd file, they can more than likely also view whatever material on the web server it was designed to protect. These days, however, it's important to view strong hashes such as bcrypt as protecting not just the data, but the password as well. Why? Password reuse. It's very common for people to use the same (or a very similar) password on all the sites they visit. The danger is thus not that an attacker can view the file contents protected by the htpasswd file, but that an attacker can use that password on the user's email account, or on other sites where the user may have used the same password.

What bcrypt cost should you use? The general answer is to use the highest cost you can get away with: take something with such a high cost that it causes discomfort to the users, then dial it back a tiny bit. Measure it out and see what your server can handle. For general bcrypt use, start with 13, but don't be afraid to keep going up until it takes a wall-clock second or two to run. For basic auth, use something very fast: perhaps 9 or less. Anything that takes over a second to create via htpasswd will slow a site down noticeably!

Zurb Foundation 5: Applying Styles Conditionally Based on Viewport Size

The Zurb Foundation 5 front-end framework provides many convenient features such as the ability to control the visibility of HTML elements for different browser window sizes using CSS classes. Foundation CSS classes like 'show-for-small-only' and 'hide-for-large-up' (full list at http://foundation.zurb.com/sites/docs/v/5.5.3/components/visibility.html) make it easy to add mobile-specific content to your page or prevent certain page elements from being displayed on mobile devices.

Having an easy way to show/hide elements based on viewport size is nice, but what if you want to style an element differently based on the size of the browser that's viewing the page? Foundation has you covered there, too, though the method is less obvious. It's possible to use Foundation's media query SCSS variables when writing your own custom styles in order to apply different styling rules for different viewport sizes.

For example, if you have an element that you want to offset with a margin in a large window but be flush with the left edge of a small window, you can use the Foundation media query variables to apply a styling override that's specific to small windows:

#element { 
    margin-left: 100px;
 
    @media #{$small-only} {
        margin-left: 0;
    }
}

This will apply a specific set of styling to the element for small viewports and a different set for medium and larger viewports (with the definitions for 'small', 'medium', etc. corresponding to the same values used by Foundation's visibility classes like 'show-for-small-only' which were mentioned at the start of this post).
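
The same pattern extends to the other breakpoints. Here is a small sketch using Foundation 5's $medium-only and $large-up variables; the .sidebar selector and the widths are just an illustration:

.sidebar {
    width: 100%;

    @media #{$medium-only} {
        width: 33%;
    }

    @media #{$large-up} {
        width: 25%;
    }
}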

It wasn't immediately obvious to me how to apply conditional styling using Foundation's own definitions of small, medium, etc. viewport sizes, but luckily the variable definitions provided by the SCSS framework make it easy to do so.