Welcome to End Point’s blog

Ongoing observations by End Point people

Convert Line Endings of Mac and Windows to Unix in Rails and Test with RSpec

Line endings, or newlines, are special characters that mark the end of a line, and they vary across operating systems. For example, suppose we are developing a Rails feedback application that will be used by a wide range of users. Users may submit feedback from different operating systems, each with its own line ending characters, so the content should be normalized to a standard line ending before it is stored in the backend.

Two special characters are used to define line endings in most operating systems:
1. Line Feed (LF) - \n
2. Carriage Return (CR) - \r

The usage of these two special characters for Unix, Mac, and Windows:

OS       Characters  Name
Unix     \n          LF
Mac      \r          CR
Windows  \r\n        CRLF

Note: \r is the newline character up to Mac OS version 9; after that, Mac uses Unix line endings.

It is the developer's job to convert all kinds of line endings to the Unix format to maintain a standard. We can achieve this with a regex replacement that converts \r (Mac) or \r\n (Windows) to \n (Unix):
standard_content = non_standard_content.gsub(/\r\n?/,"\n")
We can see the conversion of line endings to Unix from Mac and Windows using irb (Interactive Ruby) shell.

1. Mac
irb(main):001:0> mac ="Hello \r Mac"
=> "Hello \r Mac"
irb(main):002:0> mac.gsub(/\r\n?/,"\n")
=> "Hello \n Mac"
2. Windows
irb(main):001:0> windows="Hello \r\n Windows"
=> "Hello \r\n Windows"
irb(main):002:0> windows.gsub(/\r\n?/,"\n")
=> "Hello \n Windows"
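In a Rails app, this normalization could live in a small helper and run before the record is saved. Here is a minimal sketch (the LineEndings module name and the before_save wiring are assumptions for illustration, not from the original app):

```ruby
# Hypothetical helper that normalizes line endings to Unix style (\n).
# The single pattern /\r\n?/ matches a CR with an optional trailing LF,
# so it covers both Windows (\r\n) and old Mac (\r) endings.
module LineEndings
  def self.normalize(text)
    text.gsub(/\r\n?/, "\n")
  end
end

# In a model, this could be wired up with a callback, e.g.:
#   before_save { self.content = LineEndings.normalize(content) }
```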

RSpec Tests
After implementing the line endings conversion, it should be covered with test cases as a development best practice. Here is some RSpec code to test both the Mac and Windows line endings conversions.

  describe "POST feedback requests" do
    it "validates Mac line endings converted to Unix" do
      _params = { :content => "Hello \r Mac", :user => "myuser" }
      post '/feedback.json', _params
      response.status.should == 200
      result = JSON.parse(response.body)
      result['content'].should_not include("\r")
    end

    it "validates Windows line endings converted to Unix" do
      _params = { :content => "Hello \r\n Windows", :user => "myuser" }
      post '/feedback.json', _params
      response.status.should == 200
      result = JSON.parse(response.body)
      result['content'].should_not include("\r\n")
    end
  end
The line endings conversion plays a crucial role in standardising the content. It is recommended to convert line endings to Unix style when providing web service features.

Liquid Galaxy installation at Westfield State University

This past week End Point completed its newest Liquid Galaxy installation at Westfield State University. The unveiling on April 23rd featured a talk from local glass artist Josh Simpson (whom I have long admired) and a video conference with his wife Cady Coleman, a NASA astronaut. I volunteered to head up the installation effort for two reasons really: First, I was excited to have another Liquid Galaxy in an academic setting, and second, I am originally from western Massachusetts and I couldn’t resist the prospect of some home cooked meals courtesy of my mother.

I arrived at Westfield State to a small army of people who helped me with every aspect of the installation from creative electrical wiring to campus security kindly giving my car a jump. The team at Westfield made this installation a breeze and all the smiling faces made it easy to put in the extra effort to finish in time for the opening. As always I got great remote support from my End Point colleagues during the installation process.

Once the Liquid Galaxy started to take shape, the excitement of the students who happened to be passing by was certainly a motivating factor, but the greatest perk was getting to show my parents and their friends what it is exactly that I have been working on for over a year now. Explaining or describing a Liquid Galaxy, or even showing videos of it, gives you only a rough idea of what it is. Actually standing at the system’s podium and controlling the Galaxy is the only way to grasp how incredible a tool or toy it can be.

I know the GIS professors at Westfield would describe the system as a tool, but for me its greatest quality is that it is also a toy that engages and challenges you to think about your world in a different way. The Liquid Galaxy draws you in and encourages you to explore the world in a way that you haven’t done before. My work for End Point has focused on supporting the systems and extending the statistics we gather from them, and it is easy to forget what these systems do and the incredible interactions people have with them on a daily basis.

That being said, I am excited to help the professors and students at Westfield extend the functionality of their system by helping them load custom content and think of ways they can use the system in an educational setting. One of my friends pointed out that Liquid Galaxy combines fun and learning (a scary thought, I know), and as someone who struggled in traditional learning settings, this is something I used to daydream about during my lecture/note-taking classes.

The unveiling went well and everyone present found something different to call their favorite, be it Google Art Project or Grand Canyon Street View or simply finding their favorite vacation spot in Google Earth. A couple of local news sources covered the opening: 22 News WWLP and ABC 40 WGGB both aired segments featuring the installation. I would like to thank all the students and staff at Westfield State who were so hospitable to me during my short time on their campus. I am also excited to see the new ways in which they use the Liquid Galaxy.

An example work by glass artist Josh Simpson.

RailsConf 2014: My Sketchnote Summary

My RailsConf Sketchnote Summary

In the heat of the moment, it's hard for me to do anything other than regurgitate RailsConf talks into blog articles (case in point:
my review of Leah Silber's talk on building an open source focused company, and my summary of the machine learning workshop at RailsConf). But after a bit of processing and brain-churning, I've pieced together the common themes that resonated for me throughout RailsConf, in Sketchnote form!

There was of course a bit of bantering regarding code as "writing" versus code as "engineering" and the arguments supporting both ends of the spectrum. This was my fifth RailsConf, and while I did walk away with some Rails topical knowledge, the overwhelming theme of many talks I attended was the human element behind coding: people as a part of building communities, sharing commonalities, building relationships and mentorships, being influenced by their own biases and perceptions. Many of my most memorable talks of the conference had great stories, great humor, or both.

Here's a list of memorable talks referenced in my sketchnote above if you are interested in checking them out. Not all of the slides have been posted, so I'll revisit and update as they are added:

Custom plans for prepared statements in PostgreSQL 9.2

Image by Flickr user Brett Neilson

Someone was having an issue on the #postgresql channel with a query running very fast in psql, but very slow when using DBD::Pg. The reason for this, of course, is that DBD::Pg (and most other clients) uses prepared statements in the background. Because Postgres cannot know in advance what parameters a statement will be called with, it needs to devise the most generic plan possible that will be usable with all potential parameters. This is the primary reason DBD::Pg has the variable pg_server_prepare. By setting that to 0, you can tell DBD::Pg to avoid using prepared statements and thus not incur the "generic plan" penalty. However, that trick will not be needed much longer for most people: version 9.2 of Postgres added a great feature. From the release notes:

Allow the planner to generate custom plans for specific parameter values even when using prepared statements.

Because the original IRC question involved a LIKE clause, let's use one in our example as well. The system table pg_class makes a nice sample table: it's available everywhere, and it has a text field that has a basic B-tree index. Before we jump into the prepared statements, let's see the three cases of LIKE queries we want to try out: no wildcards, a trailing wildcard, and a leading wildcard. The relname column of the pg_class table is used in the index pg_class_relname_nsp_index. Yes, there is a touch of Hungarian notation in those system catalogs! (Technically relname is type "name", not type "text", but they are identical as far as this example goes).

The first case is a LIKE with no wildcards. When Postgres sees this, it converts it to a simple equality clause, as if the LIKE was an equal sign. Thus, it is able to quite easily use the B-tree index:

test# EXPLAIN SELECT 1 FROM pg_class WHERE relname LIKE 'foobar'
                          QUERY PLAN                                           
 Index Only Scan using pg_class_relname_nsp_index on pg_class
   Index Cond: (relname = 'foobar'::name)
   Filter: (relname ~~ 'foobar'::text)

Now consider the case in which we only know the first part of the word, so we put a wildcard on the end:

test# EXPLAIN SELECT 1 FROM pg_class WHERE relname LIKE 'foo%'
                             QUERY PLAN                                           
 Index Only Scan using pg_class_relname_nsp_index on pg_class
   Index Cond: ((relname >= 'foo'::name) AND (relname < 'fop'::name))
   Filter: (relname ~~ 'foo%'::text)

As we know how the string starts, there is no problem in using the index. Notice how Postgres is smart enough to change the foo% into a range check for anything between foo and fop!

Finally, the most interesting one: the case where we only know the end of the relname, so the wildcard goes in the front:

test# EXPLAIN SELECT 1 FROM pg_class WHERE relname LIKE '%bar'
             QUERY PLAN                        
 Seq Scan on pg_class
   Filter: (relname ~~ '%bar'::text)

In this case, Postgres falls back to a sequential scan of the main table and does not use the index at all, for it offers no gain. The B-tree is useless, and the entire table must be walked through (this can be worked around by clever use of a reverse() expression).
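For completeness, the reverse() workaround mentioned above might look like this sketch, on a hypothetical table items(name text) rather than the system catalog (which users cannot index):

```sql
-- Index the reversed column, so a leading wildcard becomes a trailing one.
-- (text_pattern_ops makes the index usable for LIKE under non-C collations.)
CREATE INDEX items_name_rev ON items (reverse(name) text_pattern_ops);

-- WHERE name LIKE '%bar' can then be rewritten as:
SELECT 1 FROM items WHERE reverse(name) LIKE reverse('bar') || '%';
```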

So those are the three potential variations of LIKE. When a prepared statement is created, the argument is unknown and left as a placeholder. In other words, Postgres does not know in advance if we are going to search for 'foobar', 'foo%', '%bar', or something else. Watch what happens when we create a basic prepared statement based on the queries above:

test# PREPARE zz(TEXT) AS SELECT 1 FROM pg_class WHERE relname LIKE $1

The $1 is the parameter that will be passed to this statement when it is executed. Because Postgres has no way of knowing what will be passed in, it must create a plan that can work with all possible inputs. This means using a sequential scan, for as we've seen above, a wildcard at the start of the input requires one. All the examples using indexes can safely fall back to a sequential scan as well. We can use EXPLAIN EXECUTE to see the plan in action:

test# EXPLAIN EXECUTE zz('%bar');
         QUERY PLAN                        
 Seq Scan on pg_class
   Filter: (relname ~~ $1)

As expected, this plan is the only one available for the query given, as the index cannot be used with a leading wildcard. Now for the fun part. Let's put the wildcard on the end, and see what happens on Postgres version 9.1:

test# SELECT substring(version() from '(.+?) on');
PostgreSQL 9.1.13
# EXPLAIN EXECUTE zz('foo%');
         QUERY PLAN                        
 Seq Scan on pg_class
   Filter: (relname ~~ $1)

That's really not a good plan! It gets worse:

# EXPLAIN EXECUTE zz('foobar');
         QUERY PLAN                        
 Seq Scan on pg_class
   Filter: (relname ~~ $1)

Before version 9.2, the prepared statement's plan was locked in place. This was the cause of many woes, and the reason why programs and functions were "slow" but the same queries were fast on the command line. Enter Tom Lane's commit from September 2011:

Redesign the plancache mechanism for more flexibility and efficiency. Rewrite plancache.c so that a "cached plan" (which is rather a misnomer at this point) can support generation of custom, parameter-value-dependent plans, and can make an intelligent choice between using custom plans and the traditional generic-plan approach. The specific choice algorithm implemented here can probably be improved in future, but this commit is all about getting the mechanism in place, not the policy.

Yes, you read that correctly - new plans can be generated to match the parameters! (In case you were wondering, things have been improved since this commit, as hoped for in the last sentence.) Let's see what happens when we run the exact same prepared statements above, but on Postgres version 9.3:

# SELECT substring(version() from '(.+?) on');
PostgreSQL 9.3.4
test# EXPLAIN EXECUTE zz('%bar');
             QUERY PLAN                        
 Seq Scan on pg_class
   Filter: (relname ~~ '%bar'::text)

test# EXPLAIN EXECUTE zz('foo%');
                              QUERY PLAN                        
 Index Only Scan using pg_class_relname_nsp_index on pg_class
   Index Cond: ((relname >= 'foo'::name) AND (relname < 'fop'::name))
   Filter: (relname ~~ 'foo%'::text)

test# EXPLAIN EXECUTE zz('foobar');
                       QUERY PLAN                        
 Index Only Scan using pg_class_relname_nsp_index on pg_class
   Index Cond: (relname = 'foobar'::name)
   Filter: (relname ~~ 'foobar'::text)

Tada! We have three different plans for the same prepared statement. If you look closely, you will see that even the first plan is now a "custom" one, as it has the exact parameter string rather than just $1 as before. The moral of the story: don't settle for anything less than version 9.2 of Postgres!

Building an Open Source Software-Centric Company at RailsConf 2014

It's RailsConf 2014 in Chicago and Day 4 (the last day)! Today I attended many good talks. One of them was Building an OSS-Centric Company (and Why You Want To) by Leah Silber. Leah is super passionate and knowledgeable and has good perspective from being involved in many successful open source software companies and projects through the years, including her current work at Tilde on Skylight and Ember.js.

Why Open Source?

First, Leah covered the justification for building a company with a focus on open source (i.e. how to convince the people that pay you to focus on open source). Here are the bullet points for "why open source?":

  • Expertise: When you have a company with a focus on open source, you are building the expertise on your own open source tools and you have the available resources as opposed to outsourcing to other companies for expertise. Of course, in my mind, this shouldn't be a justification for creating an unneeded open source tool, but it's more of a long-term benefit after your open source projects gain traction.
  • Influence and Access: Building an OSS-centric company allows you to sit at the table as a stakeholder when it comes to making big decisions about focus, priorities, and direction on those open source projects that will directly influence your revenue stream.
  • Recruiting: When you have an OSS-centric company, you have access to top talent and the ability to grow a fantastic team with that access.
  • Good Will: This can demonstrate that you are a company about adding to the knowledge-base, and contributing to the greater good.

Funding Tactics

Next, Leah covered some of the funding/revenue tactics of building an open source company. Again, I'm regurgitating what she discussed in list form:

    What Works

  • Symbiosis: This wasn't one of Leah's bullet points, but she touched a few times on the importance of having a symbiotic arrangement between funding and your open source tools. You want the funding and revenue to drive the open source development, and your open source goals to help drive revenue. This seems like an obvious point, but it's really important that the goals align.
  • Consulting: Consulting is a great revenue stream for an OSS-centric company. The one piece of advice Leah offered
    is to focus on consulting for those using your open source tools, as it will allow you to see real use cases of your open source project to gain perspective. The disadvantage to consulting could be that your attention is diverted from development or maintenance of the project, but Leah touched on how to mitigate this later in the talk.
  • Training: Training can provide expected, scheduled bursts of income, and can be on the spectrum of very focused content, small, high cost per attendee, to more general, large, low cost per attendee. Leah has a preference for being on the "very focused content, small, high cost per attendee" approach because it may be more amenable to growing a passionate, involved community.
  • Events: Events are a great way to build a community around an open source tool, generate leads for the other revenue streams, and perpetuate the addiction to the good will & exposure that comes with open source. In terms of events, sponsorship, affiliate training opportunities, and admissions are all potential sources of revenue here.

What Doesn't Work

  • Venture Capital funding: VC funding may result in influencing your priorities and deliverables to focus on the sprint rather than the marathon, which ultimately means that your product and/or team may not be around in 3 or 5 years.
  • Donations: While donations won't hurt, simple donation options on your website may not return much money (and is unreliable), and the cost-benefit analysis of things like donation drives may prove to be more costly than they are worth.


Building an OSS-centric company does not come without challenges. A few common challenges one might experience in building an open source company that Leah covered were:

  • Emotional ups & downs: Leah suggests to set expectations, delegate work, track successes, don't feed the trolls, and take a break from your work to get a perspective and recharge.
  • Understand the Distinction between Frugal and Cheap: There are small opportunities to build good will and contribute to growing the community, such as small sponsorships, knowing how to funnel resources into "swag" effectively, and sponsoring user groups.
  • Avoid Being a Corporate Overlord: What I took away from Leah's discussion here is to let sincerity guide your decisions.
  • Delegate: It's really important to build an open source community that will survive when you are gone. Delegating means not only building a community, but also leaders as well. Leah later added that important roles for a successful open source project might include coders, website maintainers, infrastructure folks, speakers/evangelists, and logistics/organization folks.
  • Don't Put Too Much You Into It: Make sure there isn't too tight a coupling between your open source product and your company, so that your open source product is general purpose enough to be adopted and isn't misguided by your company's product decisions.
  • Symbiosis: Again, make sure your revenue stream is not at odds with your product goals.

It's all about Sustainability

Leah focused a lot on sustainability. From a revenue and personnel perspective, your approach needs to be sustainable. It's not a sprint, it's a marathon. Many of the challenges Leah covered seem pretty obvious, but when organized collectively, it's a great resource for understanding how your own company and/or projects might benefit from these ideas.

RailsConf 2014 on Machine Learning

This year at RailsConf 2014, there are workshop tracks: focused sessions double or triple the length of a normal talk. Today I attended Machine Learning for Fun and Profit by John Paul Ashenfelter. Some analytics tools are good at providing averages on data (e.g. Google Analytics), but averages don't tell you a specific story or context of your users, which can be valuable and actionable. In his story-telling approach, John covered several stories for generating data via machine learning techniques in Ruby.

Make a Plan

First, one must formulate a plan or a goal for which to collect actionable data. More likely than not, the goal is to make money, and the hope is that machine learning can help you find actionable data to make more money! John walked through several use cases and example code for machine learning, and I'll add a bit of ecommerce context to each story below.

Act 1: Describe your Users

First, John talked about a few tools used for describing your users. In the context of his story, he wanted to figure out what gender ratio of shirts to order for the company. He used the sexmachine gem, which is based on census data, to predict the sex of a person based on a first name. The first names of all your users would be passed into this gem to segment by gender, and from there you may be able to take action (e.g. order shirts with an estimated gender ratio).

Next, John covered geolocation. John wanted to know how to scale support hours for customers using the product, likely a very common reason for geolocation in any SaaS or customer-centric tool. His solution uses Python, Go, and free MaxMind data. The example code is available here.

With these tools, gender assignment and geolocation reveal basic but valuable information about your users. In the ecommerce space, determining gender ratios and geolocation may help determine the target of marketing and/or product development efforts, for example targeting a specific marketing message to a female East Coast demographic.

Act 2: Clustering

In the next step, John talked about using machine learning to cluster users. The context John provided was to cluster users into three groups: casual users, super users, and professional users, to potentially learn more about the super users and how to get more users into that group. An ecommerce story might be to cluster users into amount-spent buckets that have rewards at higher levels, to incentivize users to spend more money and climb the hierarchy for more rewards. Here John used the ai4r gem, which uses k-means clustering to group users. In as few words as possible, k-means clustering randomly creates X clusters (step 1), computes the center of each cluster (step 2), moves nodes if they are closer to a different cluster's center (step 3), and repeats steps 2 & 3 until no nodes have been moved. The actual code is quite simple with the gem. Alternative clustering tools are hierarchical clusterers or divisive hierarchical clustering, which will yield slightly different results. John also mentioned that there are much better numerical tools like Python, R, Octave/Matlab, and Mathematica.
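To make the k-means steps above concrete, here is a toy sketch in plain Ruby (not the ai4r API) on one-dimensional "amount spent" values; the data and k=2 are invented for illustration:

```ruby
# Toy k-means on 1-D values: pick initial centers, assign each point to
# its nearest center, recompute centers, and repeat.
def kmeans(points, k, iterations = 20)
  centers = points.first(k)                      # step 1: initial cluster centers
  iterations.times do
    clusters = Array.new(k) { [] }
    points.each do |p|                           # step 3: assign each point to
      idx = (0...k).min_by { |i| (p - centers[i]).abs }  # its nearest center
      clusters[idx] << p
    end
    centers = clusters.each_with_index.map do |c, i|     # step 2: recompute centers
      c.empty? ? centers[i] : c.sum.to_f / c.size
    end
  end
  centers
end

spend = [5, 8, 6, 250, 300, 280]
p kmeans(spend, 2).map(&:round)   # two cluster centers: casual vs. big spenders
```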

Act 3: Similarity

The third and final topic John covered was determining similarity between users, or perhaps finding other users similar to user X. The context of this was to understand how people collaborate and spread knowledge. In the ecommerce space, the obvious use-case here is building a product recommendation engine, e.g. recommending products to a user based on what they have bought, are looking at, or what is in their cart. John didn't dive into the specific linear algebra math here (linear algebra is hard!), but he provided example code using the linalg gem that does much of the hard work for you.


The conclusion of this workshop was again to share Ruby tools that can help solve problems about your user and business. It's very important to have a plan and/or goal to strive for and to determine actionable data analysis and metrics to help reach those goals.

ActsAsTaggable acts quirky

I've recently been working on a project making extensive use of the ActsAsTaggable gem (version 2.4.1). I would like to share a couple of things that were not immediately evident and caused some confusion.

Issue #1
You should be consistent with how you assign tags. ActsAsTaggable is very versatile, providing features such as tagging contexts and tagging ownership. Contexts allow you to create named subsets of tags for an item. Ownerships allow an item to have different tags for different owners. It is very easy to assign a tag in the default "tags" context and later wonder why the tag is not showing up in the custom context, or vice versa. The following method always assigns the tag within the "tags" context:
@photo.tag_list = "New York, USA"
So if you update the tag list in the custom context later:
@photo.set_tag_list_on("album_3", "New York, USA")
you will basically be creating a second tagging for the "New York" tag with the "album_3" context.
A third tagging is generated if the same photo is tagged by the "owner" @user:
@user.tag(@photo, :with => "New York", :on => "album_3")

Issue #2
Tag count methods include the common "tags" context no matter which tagging key you specify:
tags = Photo.tag_counts_on("album_3").detect { |t| t.name == "New York" }.count
=> 28470
The above counts tags from both the "album_3" and "tags" contexts.

Issue #3
The tagged_with method may return duplicates. This happens because the query includes the "tags" context by default. If multiple contexts are present, you will have to manually exclude the duplicates from the list to get the correct counts.
f = Photo.tagged_with("New York", :on => "album_3").size
=> 28470 
f = Photo.tagged_with("New York", :on => "album_3").uniq.size
=> 27351 

Rails Tips & Tricks at RailsConf 2014

Rails: They start so young!

One of the talks I attended on Day two of RailsConf 2014 was Tricks that Rails didn't tell you about by Carlos Antonio. As the title suggests, Carlos covered a number of Rails items that are either not widely used or not well documented. I've taken the lazy approach and listed out all the topics he covered, but provided documentation or relevant links to all of the tricks in case you'd like to learn more.



  • Relation.merge, e.g. Product.joins(:reviews).merge(Review.approved), allows you to utilize scope from another model rather than passing in exact SQL conditional, i.e. limiting knowledge of Review to Product.
  • Utilize group counting to group results by a column and get the count. This also accepts multiple fields to group.
  • relation.first! and relation.last! are similar to first & last but raise exception RecordNotFound if there are no records.
  • where.not is a cool finder trick, e.g. scope :some_scope, -> { where.not status: 'draft' }
  • You can control eager or lazy loading in Rails via eager_load and preload. These are a little tricky to sum up in a small example, so hopefully those links provide more.
  • Instead of passing in SQL in finder methods or scopes, you can call with :desc, e.g. scope :some_scope, -> { order created_at: :desc }
  • pluck is a wonderful tool. e.g. Product.pluck(:name), but also Product.pluck(:name, :price) will retrieve product name and pricing information.
  • to_param allows you to override the default of using an object's id to a custom finder key.
  • The difference between exists?/any? and present? is that the former two issue a query against the database each time they are called, while present? loads the records once and then works with them in memory, so no second query is needed.
  • ActiveRecord's "baked-in" benchmark allows you to benchmark blocks for runtime analysis

Active Model

  • include ActiveModel::Model in a class to allow Rails form helpers to be used on instances of that class, as well as leverage ActiveModel validation methods.

Action Mailer

  • Rails mailers come with some built in mapping between i18n defaults and the mailer language, class, method, value. This is a really hard thing to Google for, so I suggest to read all the things.

Action View

  • content_tag_for builds HTML for a loop of objects.
  • render(collection) returns nil if a collection is empty, so rather than clouding your view with an if / else block, you can do something like: render(collection) || content_tag(:p, 'No Article Found')
  • Learn more about local_assigns. I didn't quite understand this tip in the talk, but I found a Stack Overflow Q&A on the subject.
  • truncate does what you'd expect, truncate text after a number of characters.
  • Rails locales have a *_html method automatically which allows you to pass HTML and does not require you specify raw(that content) when you use the content. See above about reading all the things in i18n.
  • benchmark in AV allows you to benchmark view rendering

Action Controller

  • You can redirect with params, e.g. redirect('/articles/%{id}').
  • config.exceptions_app allows you to define your own custom routes for application exceptions.


  • In the Rails console, app is an available variable which can make requests.
  • In the Rails console, the helper variable can be used to access Rails helpers available in your views.
  • Running rails console --sandbox will rollback data changes after the console is exited


  • rake notes will list comments marked TODO, FIXME, OPTIMIZE, and has the ability to generate custom annotations

Updating Rails

  • When upgrading Rails, rake rails:update reports on diffs in all configuration files and helps you work through conflicts

Spree Security Update 2.x.x Error, undefined method `assume_from_symbol' for Money:Class (ActionView::Template::Error)

On March 25, 2014 Spree Commerce posted an announcement that there was a security vulnerability for all Spree 2.x.x versions. As reported on Spree's website,

"The exploit would require the attacker to randomly guess valid order numbers, but once achieved, the technique would reveal private customer information associated with the order. Credit card details are never stored in Spree and were never at risk by this exploit."

So, this update is pretty typical and as usual, you should make sure nothing has broken with any changes that were made. I wanted to note a couple of "gotchas" with this most recent update however.

The first of which is probably obvious and really more of a reminder as sometimes people forget. Further, the instructions for this update in the Spree docs don't mention it, so, as a reminder - after updating be sure to run

$ bundle exec rake railties:install:migrations
$ bundle exec rake db:migrate

Now, here is the tricky one. After updating and installing/running your migrations you may find that you are getting an error:

undefined method `assume_from_symbol' for Money:Class (ActionView::Template::Error)

I have been unable to find anything about this in the Spree documentation, but the problem is that in the update, money gem version 6.1.1 breaks the product display. The workaround is to lock the money gem at version 6.0.1. So, specify version 6.0.1 for the money gem in your Gemfile:

gem 'money', '=6.0.1'

Updating Spree versions can sometimes be a little tense if things start breaking and you don't know why. I hope that if this error brought you here this article has saved you some time.

Dictionary Comprehensions in Python

Python has many features that remain unknown to many programmers.

List Comprehensions

List comprehensions are a much simpler way of creating lists. This is one feature that is rather widely used; I have seen it in many examples and in the source of many libraries.

Imagine you have a function which returns a list of data. A good example of this is the xrange(start, end) function, which returns all numbers within the range [start, end), so it excludes end. It is evaluated lazily: it doesn't produce all the numbers at once, but yields the next number on each iteration.

Getting all numbers from the range [1, 10] using this function can be done like this:

numbers = []
for i in xrange(1, 11):
    numbers.append(i)

If you want to get only the even numbers, then you can write:

numbers = []
for i in xrange(1, 11):
    if i % 2 == 0:
        numbers.append(i)

List comprehensions can make the code much simpler.

The whole expression evaluates to a list, and the main syntax is:

[ expression for item in list if conditional ]

The first example can be then written as:

numbers = [i for i in xrange(1, 11)]

and the second:

numbers = [i for i in xrange(1, 11) if i % 2 == 0]

Of course this syntax can look a little strange at first, but you get used to it quickly, and it can make the code much easier to read.

Removing Duplicates

Another common usage of collections is to remove duplicates. And again there are plenty of ways to do it.

Consider a collection like this:

numbers = [i for i in xrange(1,11)] + [i for i in xrange(1,6)]

The most complicated way of removing duplicates I've ever seen was:

unique_numbers = []
for n in numbers:
    if n not in unique_numbers:
        unique_numbers.append(n)

Of course it works, but there is a much easier way: use the standard set type. A set cannot contain duplicates, so when converting a list to a set, all duplicates are removed. However, the result is a set, not a list; if you want a list, you should convert it back:

unique_numbers = list(set(numbers))
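One caveat worth noting (my observation, not from the original text): a set does not preserve the list's order, so sort the result if order matters:

```python
numbers = [i for i in range(1, 11)] + [i for i in range(1, 6)]
# set iteration order is arbitrary, so sort after deduplicating
unique_numbers = sorted(set(numbers))
print(unique_numbers)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```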

Removing Object Duplicates

With objects, or dictionaries, the situation is a little different. You can have a list of dictionaries where just one field is used for identity; this can look like:

data = [
  {'id': 10, 'data': '...'},
  {'id': 11, 'data': '...'},
  {'id': 12, 'data': '...'},
  {'id': 10, 'data': '...'},
  {'id': 11, 'data': '...'},
]

Removing duplicates, again, can be done using more or less code. Less is better, of course. With more code it can be:

unique_data = []
for d in data:
    data_exists = False
    for ud in unique_data:
        if ud['id'] == d['id']:
            data_exists = True
    if not data_exists:
        unique_data.append(d)

And this can be done using a thing I discovered a couple of days ago: the dictionary comprehension. It has a similar syntax to a list comprehension; however, it evaluates to a dictionary:

{ key:value for item in list if conditional }

This can be used to make a list without duplicates, keyed on a custom field:

{ d['id']:d for d in data }.values()

The above code creates a dictionary whose key is the field I want to use for uniqueness and whose value is the whole dictionary. The resulting dictionary then contains only one entry for each key. The values() function is used to get only the values, as I don't need the key:value mappings any more.
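A small self-contained demonstration of this technique (the 'data' values here are made up for illustration; note that for a duplicated id, the entry seen last wins):

```python
data = [
    {'id': 10, 'data': 'first'},
    {'id': 11, 'data': 'second'},
    {'id': 10, 'data': 'third'},   # duplicate id; this later entry wins
]

# one dictionary entry per id, then take just the values
unique_data = list({d['id']: d for d in data}.values())
print([d['id'] for d in unique_data])  # [10, 11]
```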

RailsConf 2014: Highlights from Day One

The Best Conference Sidekick

I'm here in Chicago for RailsConf 2014, my fifth RailsConf! This time I'm attending the conference with my sidekick (and of course my sidekick's babysitter), so I'm having a bit of a different experience than in previous years.

The Bees Knees

One of the first talks I attended today was Saving the World (literally) with Ruby on Rails by Sean Marcia.

Sean works with a Professor at George Mason University who researches bees and dying colonies. Armed with Raspberry Pis powered by solar panels, Sean uses the following technologies to monitor beehive metrics:

  • gpio, pi_piper, wiringpi-ruby
  • dashing
  • Ruby & Sinatra. Sean had previously used Rails but found it a bit too heavyweight and went with a lighter Sinatra app.
  • Three cron jobs to record data
  • passenger, isc-dhcp-server, hostapd, iw for server configuration

Hive temperature, hive weight, outside temperature, and humidity are monitored in hopes of collecting metrics for identifying collapsing hives. The project is quite young, but the hope is to collect more metrics (e.g. gas permeability) and more actionable data as time goes on.

Views as Objects

Another interesting and relevant talk I attended today was Where did the OO go? Views should be objects too! by Andrew Warner. Andrew believes the Rails Holy Grail is an app that provides the best user experience and the best development experience. What is the best user experience? One with fast page loads or rather, one that does not require full page loads. What is the best dev experience? One that is developer friendly, DRY, allows for a client with logic (what Andrew called a thick client), mostly one language, and one that has good SEO.

In the node.js space, rendrjs is a popular tool that addresses all the main points, except it's not in Rails. Andrew then went on to discuss the current approaches in Rails:

  • duplicate code that lives on client & server (e.g. a mustache template and an erb template). This approach doesn't follow DRY and is not written utilizing one language.
  • turbolinks: covers everything except thick client
  • the ember / angular / backbone approach, where the entire app lives on client. This approach is not SEO friendly and not in one language.

Each of these above options has a key tradeoff when measured against the main criteria. The lowest common denominator between the client and server (i.e. the dumbest possible templates) would be mustache, which happens to be available in Ruby and JavaScript. Mustache has tags, booleans, and loops, so it has the most simple "logic" that one might need in a view.

Andrew created a gem called perspectives: a Ruby class that, based on the type of request, returns JSON or renders HTML from a single mustache template. The Ruby class accepts objects and arguments, rendering HTML for initial requests and JSON for subsequent requests. Perspectives also offers a nice separation of concerns: because the view is quite dumb and can't contain logic beyond basic conditionals and loops, that logic lives in the perspective object. The output of perspectives is easy to test, and it leverages Rails' Russian doll caching strategy.

Andrew didn't focus too much on his gem, but the story that he told regarding the problem and various approaches was accurate and relevant to experiences I've had. I'd love to see how I can leverage the tool that he's written.


Advanced aRel

Another very technical talk I attended today was Advanced aRel: When ActiveRecord Just Isn't Enough by Cameron Dutro. This was one of those talks that ended up being different from what I expected, but the technical points were interesting and in-depth. aRel is not the same thing as ActiveRecord. Here are several points that Cameron touched on during the talk:

  • aRel is relational algebra that allows you to build SQL queries, applies query optimizations, and enables chaining.
  • aRel knows nothing about your models and database and does not retrieve or store data (which is what ActiveRecord does)
  • aRel generates ASTs, which are trees that represent queries
  • aRel has many methods for selects, supports methods like .minimum, .maximum, and .sum, but also supports other arbitrary methods supported by the database such as length.
  • aRel supports subqueries.
  • aRel supports where methods like eq, not_eq, gt, lt, gteq, lteq
  • aRel supports outer joins, has and belongs to many relationships.
  • aRel supports fuzzy matching (e.g. matches('%test%'))
  • The use of aRel is chainable and does not use strings.
  • aRel has a query builder, essentially a module of encapsulated finder methods, which provides cleaner and more reusable queries

Cameron created scuttle-rb, which is a Ruby library for converting raw SQL into ActiveRecord/aRel queries, and he runs this library on a server that is publicly accessible. There were many great code examples but it was hard to catch a lot of them, so I'll provide a link to the talk when it's available.

Stay tuned for more blog posts on the remaining three days of the conference!

Chrome, onmousemove, and MediaWiki JavaScript

Image by Flickr user Dennis Jarvis

tl;dr: avoid using onmousemove events with Google Chrome.

I recently fielded a complaint about not being able to select text with the mouse on a wiki running the MediaWiki software. After some troubleshooting and research, I narrowed the problem down to a bug in the Chrome browser regarding the onmousemove event. The solution in this case was to tweak JavaScript to use onmouseover instead of onmousemove.

The first step in troubleshooting is to duplicate the problem. In this case, the page worked fine for me in Firefox, so I tried using the same browser as the reporter: Chrome. Sure enough, I could no longer hold down the mouse button and select text on the page. Now that the browser was implicated, it was time to see what it was about this page that caused the problem.

It seemed fairly unlikely that something like this would go unfixed if it was happening on the flagship MediaWiki site, Wikipedia. Sure enough, that site worked fine, I could select the text with no problem. Testing some other random sites showed no problems either. Some googling indicated others had similar problems with Chrome, and gave a bunch of workarounds for selecting the text. However, I wanted a fix, not a workaround.

There were hints that JavaScript was involved, so I disabled JavaScript in Chrome, reloaded the page, and suddenly everything started working again. Call that big clue number two. The next step was to see what was different between the local MediaWiki installation and Wikipedia. The local site was a few versions behind, but I was fortuitously testing an upgrade on a test server. This showed the problem still existed on the newer version, which meant that the problem was something specific to the wiki itself.

The most likely culprit was one of the many installed MediaWiki extensions, which are small pieces of code that perform certain actions on a wiki. These often have their own JavaScript that they run, which was still the most likely problem.

Then it was some basic troubleshooting. After turning JavaScript back on, I edited the LocalSettings.php file and commented out all the user-supplied extensions. This made the problem disappear again. Then I commented out half the extensions, then half again, etc., until I was able to narrow the problem down to a single extension.

The extension in question, known simply as "balloons", has actually been removed from the MediaWiki extensions site, for "prolonged security issues with the code." The extension allows creation of very nice looking pop up CSS "balloons" full of text. I'm guessing the concern is because the arguments for the balloons were not sanitized properly. In a public wiki, this would be a problem, but this was for a private intranet, so we were not worried about continuing to use the extension. As a side note, such changes would be caught anyway as this wiki sends an email to a few people on any change, including a full text diff of all the changes.

Looking inside the JavaScript used by the extension, I was able to narrow the problem down to a single line inside balloons/js/balloons.js:

  // Track the cursor every time the mouse moves
  document.onmousemove = this.setActiveCoordinates;

Sure enough, duck-duck-going through the Internet quickly found a fairly incriminating Chromium bug, indicating that onmousemove did not work very well at all. Looking over the balloon extension code, it appeared that onmouseover would probably be good enough to gather the same information and allow the extension to work while not blocking the ability for people to select text. One small replacement of "move" to "over", and everything was back to working as it should!

So in summary, if you cannot select text with the mouse in Chrome (or you see any other odd mouse-related behaviors), suspect an onmousemove issue.

Piggybak: Upgrade to Rails 4.1.0

Piggybak and gems available in the demo (piggybak_variants, piggybak_giftcerts, piggybak_coupons, piggybak_bundle_discounts, piggybak_taxonomy) have been updated to Rails 4.1.0, Ruby 2.1.1 via Piggybak version gem 0.7.1. Interested in the technical details of the upgrade? Here are some fine points:

  • Dependencies were refactored so that the parent Rails app controls the Rails dependency only. There was a bit of redundancy in the various plugin gemspec dependencies. This has been cleaned up so the parent Rails app shall be the canonical reference to the Rails version used in the application.
  • Modified use of assets: "//= require piggybak/piggybak-application" must now be added to the parent app's assets. There have been several observed issues with precompiling and asset behavior, so I simplified this by requiring this line to be added to the main Rails application.js for now. The engine file is supposed to have a way around this, but it has not behaved as expected, specifically on unique deployment architectures (e.g. Heroku). Patches welcome to address this.
  • Tables migrated to namespaced tables, e.g. "orders" migrated to "piggybak_orders". This is how namespaced engine tables are supposed to look, and this upgrade fixes the table names with a migration and related code.
  • Handled strong parameters. This was one of the most significant jumps from Rails 3 to Rails 4. The main element of Piggybak that needed updating here was the orders controller, which receives the order parameters and must determine which parameters to handle. Any references to attr_accessible in the code were removed.
  • ActiveRecord "find" method replaced with where & chaining, where applicable. The jump to Rails 4.0 deprecated find methods, but did not remove support, but the jump to Rails 4.1 removed support of these finder methods. These were removed.
  • Scope syntax update. Rails 4 handles scopes with new syntax, and all default scope and named scopes were updated to reflect this new syntax.
  • Validates syntax updated. Rails 4 has new validates syntax which accepts arguments, e.g. presence: true, uniqueness: true. Piggybak was upgraded to use the new syntax, although the old syntax is still supported.
  • Significant routes update. Rails 4 introduced a significant change in routing, and Piggybak was updated to reflect these changes.

The full commits of Piggybak are available for browsing here and here.


There are a few things that I'd love to see adopted in Piggybak, with the help of the community. These include:

  • Consider a move to CoffeeScript. I'm still on the fence about this, but I'm seeing more projects with Node and CoffeeScript lately, so I wonder if it would be worth the overhead to move to CoffeeScript.
  • Add test coverage. Perhaps Travis CI integration would make sense since it hooks into github nicely?
  • Build out more features. Things like reviews & ratings, saved cart, wishlist support, and saved address support have been on the feature list for a while. It'd be nice to see movement here.

Firefox, Input (type=button), and Line-Height

I recently discovered a discrepancy in the way Firefox treats inputs with a line-height style defined and how other browsers handle the same styling rule. Specifically, Firefox completely ignores it.

This behavior seemed odd enough to me that I did some Googling to determine if this was recently introduced, a long standing issue, or something I was just doing wrong. I found some interesting discussions on the issue. Several of the search results used the word “bug” in the title though it appears to be more of a deliberate (though possibly outdated) “feature” instead. Along with the discussions, I also came across a couple of suggestions for a solution.

First of all, I was able to locate a simple enough explanation of what’s causing the behavior. As Rob Glazebrook explains:

Basically, Firefox is setting the line-height to "normal" on buttons and is enforcing this decision with an !important declaration. [...] Browser-defined !important rules cannot be over-ruled by author-defined !important rules. This rule cannot be overruled by a CSS file, an inline style, anything.

The good news is I can stop experimenting and hoping for different results.

I also located a Bugzilla ticket opened in 2011 which contains some discussion on the pros and cons of allowing designers to control the line-height of input elements. The last few comments suggest that Firefox 30 may remove the !important declaration which would open up access to the styling property. At the time of this writing, Firefox version 30 appears to be in alpha.

Due to this long-standing stance by Mozilla, Twitter Bootstrap makes the recommendation to avoid using inputs with type set to button, submit, or reset. Instead, they recommend using button tags paired with the types already mentioned. Button tags are much more flexible in what styling rules are allowed to be applied and are therefore easier to get similar rendering results from the widest range of browsers.

If switching to button tags is not an option for whatever reason, another possible solution is to adjust the padding values for your input buttons. By shrinking the top padding, you can more easily fit text that needs to wrap due to a limited available width. Adjusting the top padding can better center wrapped text on the button, since otherwise the first line of the text renders vertically dead center of the button.

SPF, DKIM and DMARC brief explanation and best practices

Spam mail messages have been a plague since the Internet became popular, and they have kept growing as the number of connected devices and people grew. Despite numerous attempts to create anti-spam tools, there is still a fairly high number of unwanted messages sent every day.

Luckily it seems that lately something is changing with the adoption of three (relatively) new tools which are starting to be widely used: SPF, DKIM and DMARC. Let's have a quick look at each of these tools and what they achieve.

What are SPF, DKIM and DMARC

SPF (Sender Policy Framework) is a DNS text entry which lists the servers that should be considered allowed to send mail for a specific domain. Incidentally, the fact that SPF is a DNS entry can also be considered a way to enforce that the list is authoritative for the domain, since only the domain's owners/administrators are allowed to change the main domain zone.

DKIM (DomainKeys Identified Mail) should instead be considered a method to verify that a message's content is trustworthy, meaning that it wasn't changed from the moment the message left the initial mail server. This additional layer of trustworthiness is achieved by an implementation of the standard public/private key signing process. Once again the owners of the domain add a DNS entry with the public DKIM key, which receivers use to verify that a message's DKIM signature is correct, while on the sender side the server signs the entitled mail messages with the corresponding private key.

DMARC (Domain-based Message Authentication, Reporting and Conformance) empowers SPF and DKIM by stating a clear policy about how both the aforementioned tools should be used, and allows setting an address to which receivers can send reports about the mail message statistics they gather for the specific domain [1].

How do they work?

All these tools rely heavily on DNS, and luckily, once the setup phase is finished, their operation is simple enough to be (roughly) explained below:


  • upon receipt, the HELO message and the sender address are fetched by the receiving mail server
  • the receiving mail server runs a TXT DNS query against the claimed domain's SPF entry
  • the SPF entry data is then used to verify the sender server
  • in case the check fails, a rejection message is given to the sender server

Source [*]
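To make the SPF step concrete, here is a deliberately tiny Python sketch of what a receiver does with the TXT data. All the names are mine, and it only handles exact ip4: mechanisms, none of the real-world mechanisms (include, a, mx, CIDR ranges, macros and so on):

```python
def find_spf(txt_records):
    """Pick the SPF record out of a domain's TXT records, if present."""
    for record in txt_records:
        if record.startswith("v=spf1"):
            return record
    return None

def sender_allowed(spf_record, sender_ip):
    """Toy check: only exact 'ip4:' mechanisms, no CIDR, includes, or macros."""
    for mechanism in spf_record.split():
        if mechanism.startswith("ip4:") and mechanism[4:] == sender_ip:
            return True
    return False

records = ["google-site-verification=abc", "v=spf1 ip4:192.0.2.10 -all"]
spf = find_spf(records)
print(sender_allowed(spf, "192.0.2.10"))   # True
print(sender_allowed(spf, "203.0.113.5"))  # False
```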


  • when sending an outgoing message, the last server within the domain infrastructure checks against its internal settings whether the domain used in the "From:" header is included in its "signing table". If not, the process stops here
  • a new header, called "DKIM-Signature", is added to the mail message by using the private part of the key on the message content
  • from here on the message *main* content cannot be modified otherwise the DKIM header won't match anymore
  • upon reception the receiving server will make a TXT DNS query to retrieve the key used in the DKIM-Signature field
  • the DKIM header check result can be then used when deciding if a message is fraudulent or trustworthy

Source [*]
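The signing idea behind DKIM can be illustrated in a few lines of Python. This is only an analogy: real DKIM signs a canonicalized form of selected headers plus the body with an RSA private key, while this sketch uses a shared-secret HMAC just to show how any change to the content invalidates the signature:

```python
import base64
import hashlib
import hmac

def sign_body(body, key):
    """Compute a signature over the message body (HMAC stands in for RSA here)."""
    digest = hmac.new(key, body, hashlib.sha256).digest()
    return base64.b64encode(digest)

def verify_body(body, signature, key):
    """Recompute the signature and compare; any tampering makes it fail."""
    return hmac.compare_digest(sign_body(body, key), signature)

key = b"shared-secret"
sig = sign_body(b"Hello, this is the message.", key)
print(verify_body(b"Hello, this is the message.", sig, key))  # True
print(verify_body(b"Hello, this is the MESSAGE.", sig, key))  # False
```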


  • upon reception the receiving mail server checks if there is any existing DMARC policy published in the domain used by the SPF and/or DKIM checks
  • if *one or both* the SPF and DKIM checks succeed while still being *aligned* with the policy set by DMARC, then the check is considered successful, otherwise it's set as failed
  • if the check fails, based on the action published by the DMARC policy, different actions are taken

Source [*]
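The DMARC decision above can be sketched as a small Python function. The parameter names and the 'deliver'/'quarantine'/'reject' strings are my own simplification of what a receiver does with the published p= policy:

```python
def dmarc_disposition(spf_pass, spf_aligned, dkim_pass, dkim_aligned, policy):
    """Return what to do with a message given SPF/DKIM results and the p= policy.

    policy is one of 'none', 'quarantine', 'reject'. DMARC passes if at least
    one of SPF or DKIM both passes and is aligned with the From: domain.
    """
    if (spf_pass and spf_aligned) or (dkim_pass and dkim_aligned):
        return "deliver"
    # check failed: apply whatever action the domain's DMARC policy publishes
    return {"none": "deliver", "quarantine": "quarantine", "reject": "reject"}[policy]

print(dmarc_disposition(True, True, False, False, "reject"))   # deliver
print(dmarc_disposition(False, False, True, False, "reject"))  # reject
```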

The bad news: limits and best practices

Unfortunately, even with a perfectly functional mail system with all the above tools enforced, you won't be 100% safe from the bad guys out there. Not all servers use all three of the tools shown above. It's enough to take a look at the table on Wikipedia [2] to see how that's possible.

Furthermore there are some limits that you should always consider when dealing with SPF, DKIM and DMARC:

  • as already said above, DKIM alone doesn't guarantee in any way that the sender server is allowed to send outgoing mail for the specific domain
  • SPF is powerless against messages forged in a shared hosting scenario, as all the mail will appear to come from the same IP
  • DMARC is still in its early days and unfortunately not yet used widely enough to make a huge difference
  • DMARC can (and will) break your mail flow if you don't set up both SPF and DKIM before changing the DMARC policy to anything above "none".

Please work through the proper process carefully; otherwise your precious messages won't be delivered to your users, as they may be seen as fraudulent because of a wrong SPF, DKIM or DMARC setup.

What's the message behind all this? Should I use these tools or not?

The short answer is: "Yes". The longer answer is that everybody should, and eventually will, but we're just not there yet. So even though all these tools already have a lot of power, they're still not shining as bright as they should because of poor adoption.

Hopefully things will change soon, and that starts with every one of us adopting these tools as soon as possible.

[1] The lack of such a monitoring tool is considered one of the reasons why other tools (such as ADSP) have failed during the adoption phase in the past.
[2] Comparison of mail servers on Wikipedia

jQuery Content Replacement with AJAX

This is not a huge breakthrough, but it colored in some gaps in my knowledge so I thought I would share. Let's say you have a product flypage for a widget that comes in several colors. Other than some of the descriptive text, and maybe a hidden field for use in ordering one color instead of another, all the pages look the same. So your page looks like this (extremely simplified):
... a lot of boilerplate ...
... a lot more boilerplate ...
Probably the page is generated into a template based on a parameter or path segment:
What we're going to add is a quick-and-dirty way of having your page rewrite itself on the fly with just the bits that change when you select a different version (or variant) of the same product. E.g.,

The old-school approach was something like:
   document.location.href = my_url + $(this).val();
I.e., we'll just send the browser to re-display the page, but with the selected SKU in the URL instead of where we are now. Slow, clunky, and boring! Instead, let's take advantage of the ability to grab the page from the server and only freshen the parts that change for the desired SKU (warning: this is a bit hand-wavy, as your specifics will change up the code below quite a bit):
// This is subtly wrong:
$('select[name=sku]').change(function(){
  $.ajax({
    async: false,
    url: my_url + $(this).val(),
    complete: function(data){
      $('form#order_item').html( $(data.responseText).find('form#order_item').html() );
    }
  });
});
Why wrong? Well, any event handlers you may have installed (such as the .change() on our selector!) will fail to fire after the content is replaced, because the contents of the form don't have those handlers. You could set them up all over again, but there's a better way:
// This is better:
$('form#order_item').on('change', 'select[name=sku]', function(){
  $.ajax({
    async: false,
    url: my_url + $(this).val(),
    complete: function(data){
      var doc = $(data.responseText);
      var $form = $('form#order_item');
      var $clone = $form.clone( true );
      $clone.html( doc.find('form#order_item').html() );
      $form.replaceWith( $clone );
    }
  });
});
Using an "on" handler for the whole form, with a filter of just the select element we care about, works better – because when we clone the form, we copy its handler(s), too. There's room for improvement in this solution, because we're still fetching the entire product display page, even the bits that we're going to ignore, so we should look at changing the .ajax() call to reference something else – maybe a custom version of the page that only generates the form and leaves out all the boilerplate. This solution also leaves the browser's address showing the original product, not the one we selected, so a page refresh will be confusing. There are fixes for both of these, but that's for another day.

Speeding Up Saving Millions of ORM Objects in PostgreSQL

The Problem

Sometimes you need to generate sample data, like random data for tests. Sometimes you need to generate it with the huge amount of code you have in your ORM mappings, just because an architect decided that all the logic must live in the ORM and the database should be just a dummy data container. The real reason is not important; the problem is: let's generate lots of rows, millions of them, for a sample table from ORM mappings.

Sometimes the data is read from a file, but because the business logic is kept in the ORM, you need to load the data from the file into ORM objects and then save those millions of ORM objects to the database.

This can be done in many different ways, but here I will concentrate on making that as fast as possible.

I will use PostgreSQL and SQLAlchemy (with psycopg2) for ORM, so all the code will be implemented in Python. I will create a couple of functions, each implementing another solution for saving the data to the database, and I will test them using 10k and 100k of generated ORM objects.

Sample Table

The table I used is quite simple, just a simplified blog post:

CREATE TABLE posts (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  body TEXT NOT NULL,
  payload TEXT NOT NULL
);

SQLAlchemy Mapping

I'm using SQLAlchemy for ORM, so I need a mapping, I will use this simple one:
class BlogPost(Base):
    __tablename__ = "posts"

    id = Column(Integer, primary_key=True)
    title = Column(Text)
    body = Column(Text)
    payload = Column(Text)

The payload field is just to make the object bigger, to simulate real life where objects can be much more complicated, and thus slower to save to the database.

Generating Random Object

The main idea of this test is to use a randomly generated object. However, what I really measure is the database speed; all the randomness happens on the client side, so having a truly random object doesn't matter here. The overhead of a fully random generator would be the same regardless of the method of saving the data to the database. So instead of randomly generating the object, I will use a static one, with static data, created by the function below:

TITLE   = "title"      * 1764
BODY    = "body"       * 1764
PAYLOAD = "dummy data" * 1764

def generate_random_post():
    "Generates a kind of random blog post"
    return BlogPost(title=TITLE, body=BODY, payload=PAYLOAD)

Solution Ideas

Generally there are two main ideas for such a bulk inserting of multiple ORM objects:

  • Insert them one-by-one with autocommit
  • Insert them one-by-one in one transaction

Save One By One

This is the simplest way. Usually we don't save just one object; we save many different objects in one transaction, and spreading a couple of related changes across multiple transactions is a great way to end up with a database full of bad data.

For generating millions of unrelated objects this shouldn't cause data inconsistency, but it is highly inefficient. I've seen this multiple times in code: create an object, save it to the database, commit, create another object and so on. It works, but it is quite slow. Sometimes it is fast enough, but for the cost of a very simple change in this algorithm we can make it 10 times faster.

I’ve implemented this algorithm in the function below:

def save_objects_one_by_one(count=MAX_COUNT):
    for i in xrange(1, count+1):
        post = generate_random_post()
        session.add(post)
        session.commit()  # commit after every single object

Save All in One Transaction

This solution is as simple as: create objects, save them to the database, commit the transaction at the end, so do everything in one huge transaction.

The implementation differs only by four spaces from the previous one, just run commit() once, after adding all objects:

def save_objects_one_transaction(count=MAX_COUNT):
    for i in xrange(1, count+1):
        post = generate_random_post()
        session.add(post)
    session.commit()  # one commit for the whole batch

Time difference

I ran the tests multiple times, truncating the table each time. The average results of saving 10k objects were quite predictable:

  • Multiple transactions - 268 seconds
  • One transaction - 25 seconds

The difference is not surprising; the whole table size is only 4.8 MB, but after each transaction the database needs to flush the changes to disk, which slows the procedure down a lot.


So far, I've described the most common methods of generating and storing many ORM objects. I was wondering about another one, which may seem a little surprising at first.

PostgreSQL has a great COPY command which can copy data between a table and a file. The file format is simple: one table row per one file row, fields delimited with a defined delimiter etc. It can be a normal csv or tsv file.

My crazy idea was: how about using the COPY for loading all the generated ORM objects? To do that, I need to serialize them to a text representation, to create a text file with all of them. So I created a simple function, which does that. This function is made outside the BlogPost class, so I don't need to change the data model.

def serialize_post_to_out_stream(post, out):
    import csv
    writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    writer.writerow([post.title, post.body, post.payload])

The function above gets two parameters:

  • post - the object to be serialized
  • out - the output stream where the row with the post object will be saved, in Python it is a file-like object, so an object with all the functions a file object has

Here I use a standard csv module, which supports reading and writing csv files. I really don’t want to write my own function for escaping all the possible forms of data I could have - this usually leads to many tricky bugs.

The only thing left is to use the COPY command. I don’t want to create a file with data and load that later; the generated data can be really huge, and creating temporary files can just slow things down. I want to keep the whole procedure in Python, and use pipes for data loading.

I will use the psql program for accessing the PostgreSQL database. Psql has a different command called \COPY, which can read the csv file from psql's standard input. This can be done using e.g.: cat file.csv | psql database.

To use it in Python, I’m going to use the subprocess module, and create a psql process with stdin=subprocess.PIPE which will give me write access to the pipe psql reads from. The function I’ve implemented is:

def save_objects_using_copy(count=MAX_COUNT):
    import subprocess
    p = subprocess.Popen([
        'psql', 'pgtest', '-U', 'pgtest',
        '-c', '\COPY posts(title, body, payload) FROM STDIN',
        ], stdin=subprocess.PIPE)
    for i in xrange(1, count+1):
        post = generate_random_post()
        serialize_post_to_out_stream(post, p.stdin)
    p.stdin.close()
    p.wait()


I’ve also tested that on the same database table, truncating the table before running it. After that I’ve also checked this function, and the previous one (with one transaction) on a bigger sample - 100k of BlogPost objects.

The results are:

 Sample size | Multiple Transactions | One Transaction | COPY
 10k         | 268 s                 | 25 s            | 5 s
 100k        | (not tested)          | 262 s           | 51 s

I haven’t tested the multiple transactions version on the 100k sample, as I didn’t want to wait many hours for it to finish (I ran each of the tests multiple times to get more reliable results).

As you can see, the COPY version is the fastest, five times faster than the full ORM version even with one huge transaction. This version is also memory friendly: no matter how many objects you want to generate, it only ever needs to keep one ORM object in memory, which can be destroyed after saving.

The Drawbacks

Of course using psql poses a couple of problems:

  • you need to have psql available; sometimes that’s not an option
  • calling psql creates another connection to the database; sometimes that could be a problem
  • you need to set up the password in the ~/.pgpass file; you cannot provide it on the command line

You could also get the psycopg2 cursor directly from the SQLAlchemy connection and then use its copy_from() function, but that method needs all the data prepared in memory up front, as it reads from a file-like object such as StringIO. This is not a good solution for inserting millions of objects, as they can be quite huge - streaming is much better in this case.

Another solution is to write a generator wrapped in a file-like object, which copy_from() can read from directly. copy_from() calls the file's read() method, trying to read 8192 bytes per call. This can be a good idea when you don't have access to psql; however, due to the overhead of generating those 8192-byte strings, it will likely be slower than the psql version.
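A minimal sketch of such a wrapper, assuming copy_from() only needs a read() method; GeneratorFile and rows() are names I made up, and the copy_from() call itself is only indicated in a comment since it needs a live psycopg2 cursor:

```python
class GeneratorFile(object):
    """Read-only file-like wrapper around an iterator of text rows."""

    def __init__(self, row_iter):
        self._iter = iter(row_iter)
        self._buf = ""

    def read(self, size=8192):
        # Pull rows until we can satisfy the request, buffering leftovers.
        while len(self._buf) < size:
            try:
                self._buf += next(self._iter)
            except StopIteration:
                break
        chunk, self._buf = self._buf[:size], self._buf[size:]
        return chunk

def rows():
    # In the real script each row would come from serialize_post_to_out_stream.
    for i in range(3):
        yield "title %d\tbody %d\tpayload %d\n" % (i, i, i)

f = GeneratorFile(rows())
# With psycopg2 this could be passed straight to:
#   cursor.copy_from(f, 'posts', columns=('title', 'body', 'payload'))
data = f.read()
```

This keeps memory use constant regardless of how many rows the generator produces.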

Sanity, thy name is not MySQL

Probably old news, but I hit this MySQL oddity today after a long day dealing with unrelated crazy stuff and it just made me go cross-eyed:
CREATE TABLE foo (id integer, val enum('','1'));
INSERT INTO foo VALUES (1, ''), (2, '1');
SELECT * FROM foo WHERE val = 1;
What row do you get? I'll wait while you second- and third-guess yourself. It turns out that the "enum" datatype in MySQL just maps each member to a unique integer index, assigned in declaration order starting from 1. In our case, that means:
  • '' == 1
  • '1' == 2
So you get the row with (1,''). Now, if that doesn't confuse readers of your code, I don't know what will.
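The index assignment behind this can be sketched in a few lines of Python (a simulation of MySQL's numbering, not anything MySQL-specific):

```python
# MySQL numbers enum members in declaration order, starting at 1.
enum_members = ['', '1']
index_of = {value: i for i, value in enumerate(enum_members, start=1)}

# Comparing an enum column against a number compares the index, not the
# string, which is why WHERE val = 1 matches the empty-string member.
print(index_of[''])   # 1
print(index_of['1'])  # 2
```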

Filling Gaps in Cumulative Sum in Postgres

I ran into an interesting problem. There was a table with, among other columns, a date and an integer value. The task was to get a cumulative sum across all dates, including dates for which we have no entries; for such dates we should reuse the last calculated sum.

Example Data

I will use an example table:

# CREATE TABLE test (d DATE, v INTEGER);
with sample data:

# INSERT INTO test(d,v)
  VALUES('2014-02-01', 10),
        ('2014-02-02', 30),
        ('2014-02-05', 10),
        ('2014-02-10', 3);

Then the data in the table looks like:

# SELECT * FROM test;
     d      |  v
 2014-02-01 | 10
 2014-02-02 | 30
 2014-02-05 | 10
 2014-02-10 |  3
(4 rows)

What I want is to have a cumulative sum for each day. The cumulative sum for a day is the sum of that day's value and all the earlier values, so for the above data I want to get:

     d      |  v
 2014-02-01 | 10
 2014-02-02 | 40
 2014-02-05 | 50
 2014-02-10 | 53
(4 rows)

The simple query for getting the data set shown above is:

# SELECT d, SUM(v) OVER (ORDER BY d) v
  FROM test;

Filling The Gaps

The query calculates the cumulative sum for each row. Unfortunately this way there are gaps between dates, and the request was to fill those in using the values from previous days.

What I want to get is:

     d      |  v
 2014-02-01 | 10
 2014-02-02 | 40
 2014-02-03 | 40
 2014-02-04 | 40
 2014-02-05 | 50
 2014-02-06 | 50
 2014-02-07 | 50
 2014-02-08 | 50
 2014-02-09 | 50
 2014-02-10 | 53

My first idea was to use the generate_series() function, which can generate a series of data. What I need is a series of all dates between min and max dates. This can be done using:

# SELECT generate_series(
    '2014-02-01'::date,
    '2014-02-10'::date,
    '1 day')::date;

The generate_series() function arguments are (begin, end, interval). The function returns all timestamps from beginning to end with given interval. The return value is timestamp, so I had to cast it to date with '::date', which is a nice PostgreSQL shortcut for the standard syntax, CAST(generate_series(...) AS DATE).

I also want to reuse the cumulative sum I calculated in the first query. This can be simply achieved with the great WITH clause, which creates something like a temporary table that can then be queried:

# WITH temp AS (
    SELECT generate_series(1, 1000) d
  )
  SELECT d
  FROM temp
  WHERE d < 4;

Combining all the above queries resulted in the below one:

# WITH y AS (
    SELECT d, SUM(v) OVER (ORDER BY d) v
    FROM test
  )
  SELECT
    g.d,
    (SELECT v
     FROM y
     WHERE y.d <= g.d
     ORDER BY y.d DESC
     LIMIT 1) v
  FROM
    (SELECT generate_series(min(d), max(d), '1 day')::date d
     FROM y) g
  ORDER BY g.d;

After the earlier explanations understanding this one should be easy.

  • I placed the original query calculating the cumulative sum in the WITH block.
  • The outer SELECT creates a row set with two columns:
    • The first column comes from the subselect in the FROM clause, which returns all dates between the min and max dates in the original data.
    • The second column is a subquery that fetches the calculated cumulative sum: the sum for the current date (from the first column) or, if there is none, the most recent earlier one.
  • And of course we need ordering at the end. The database is free to reorder rows while executing a query, so we should always declare the ordering explicitly. Otherwise strange things can happen (like having the same row ordering for years and then suddenly a totally different one, just because someone added a row, deleted another, or simply restarted the application).
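To sanity-check the expected result, the same gap-filling logic can be sketched in plain Python (the data below is copied from the sample table; filled_cumulative_sums is my name for the helper):

```python
from datetime import date, timedelta

# Sample data from the test table.
data = {
    date(2014, 2, 1): 10,
    date(2014, 2, 2): 30,
    date(2014, 2, 5): 10,
    date(2014, 2, 10): 3,
}

def filled_cumulative_sums(data):
    """Running sum for every day between the min and max dates;
    days without an entry repeat the previous day's sum."""
    day, last = min(data), max(data)
    total, result = 0, []
    while day <= last:
        total += data.get(day, 0)
        result.append((day, total))
        day += timedelta(days=1)
    return result

for d, v in filled_cumulative_sums(data):
    print(d, v)
```

The printed rows match the ten-row result shown earlier.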

Open Source: Challenges of Modularity and Extensibility

While I was at the I Annotate 2014 conference last week, I spoke with a couple of developers about the challenges of working in open source. Specifically, the Annotator JavaScript library that we are using for H2O is getting a bit of a push from the community to decouple (or make more modular) some components, as well as to improve its extensibility. Similarly, Spree, an open source Ruby on Rails platform that End Point has sponsored in the past and continues to work with, made a shift from a monolithic platform to a modular (via Ruby gems) approach, and Piggybak started out as a modular and extensible ecommerce solution. I like doodling, so here's a diagram that represents the ecosystem of building out an open source tool, with a supplemental explanation below:

Here are some questions I consider on these topics:

  • What is the cost of extensibility?
    • code complexity
    • code maintenance (indirectly, as code complexity increases)
    • harder learning curve for new developers (indirectly, as code complexity increases)
    • performance implications (possibly, indirectly, as code complexity increases)
    • difficulty in testing code (possibly)
  • What is the cost of modularity?
    • same as cost of extensibility
    • challenge of determining what features to include in core (or exclude from core)
    • can be both performance implications and performance mitigation
  • What are the values of extensibility?
    • robustness of tool
    • increased community involvement (indirectly, as robustness increases)
    • further reach, increased use cases (indirectly, as robustness increases)
  • What are the values of modularity?
    • same as values of extensibility

From a cost-benefit perspective, the goal should be for the values of extensibility and modularity to outweigh their costs, allowing a flourishing, growing community of developers and users of the tool. Extensibility and modularity are not always easy to get right, especially in some frameworks, but I think they are very important factors in the success of an open source project. I also don't think many tools "get it right" the first time around, so there's always a chance to improve and refactor as the user base grows.

I Annotate 2014 Day 2: If it were easy, we would have already solved it

H2O & Annotator

Yesterday I gave my talk on Integrating Annotator with H2O, which covered the specific implementation details of integrating the open source JavaScript-based tool Annotator into H2O, including a history of annotation and highlights of some of the challenges. I'll update the link here to my slides when they are available on the I Annotate conference website.


Version Control

One of the interesting recurring topics of the conference was version control: version control of text and other multimedia content, and how to present a user interface for it that makes sense to non-developers (or those not familiar with code-based version control). A simple example of what I mean by the problem of version control on the web is the following Facebook scenario:

  • User A updates status on Facebook
  • User B comments on User A's status
  • User C comments on User A's status, with reference or comment to User B's status
  • User B edits original comment
  • User C's comment is no longer applicable given the context, and doesn't make sense to users who have not seen User B's original comment

Facebook doesn't do anything about this scenario now, other than allow the ability to delete or edit comments. They've only recently introduced the ability to edit comments, so while they are aware of this problem, I don't expect them to build out a complex solution to address this challenge. But if they were to address it, can you imagine both the technical implementation and [intuitive] user interface implementation that would be easily adopted by the masses? If it were easy, it would have already been solved and we wouldn't be talking about it now!

Apply this Facebook use case to content both off and on the web. In the context of this conference, this is:

  • ebooks, PDFs, file types exportable to offline use
  • images, video, audio: all mentioned during this conference
  • all of the text on the internet

While the above types of content change at different frequencies (e.g. text on the web tends to be more easily and frequently changed than video and audio productions), recording and presenting annotations tied to one piece of content in one state (or version) is very challenging. In text, Annotator ties annotations to a specific Range of content, so if any of the markup changes, the annotation range may no longer be accurate. One approach that mitigates this problem (I'm hesitant to describe it as a "solution") is fuzzy matching, and work is being done to bring that into Annotator. I'm excited to see where this goes, because for this concept of annotation and social discussion around dynamic content [on the web] to work, version control has to be handled elegantly and be intuitive to use.

I Annotate 2014 Conference: Day 1

I'm here in San Francisco for the second annual I Annotate conference. Today I'm presenting my work on the H2O project, but in this post I'll share a couple of focus points for the conference thus far, described below.

What do we mean by Annotations?

Annotation work is off the path of End Point's ecommerce focus, and annotations mean different things to different users, so let me give a bit of context: To me, an annotation is markup tied to a single piece of target content (image, text, video). There are other interpretations, such as highlighted text with no markup (i.e. flagging some target content), and cases where annotations are tied to multiple pieces of target content.

Annotations in General

One of the focuses of yesterday's talks was how to allow the powerful concept of annotations to succeed on the web. Ivan Herman of the W3C touched on why the web has succeeded and what we can learn from that success: the web is interoperable, decentralized, and open, and we hope those same qualities can carry over to web annotations and help them succeed. Another interesting point, made by Tom Lehman of RapGenius, was that the actual implementation of annotation doesn't matter as much as having a community in place that encourages many high quality annotations. For RapGenius, that means a user hierarchy that grants users privileges such as moderator, editor, and contributor roles, a point-based ranking system layered on top, and encouragement to post RapGenius-annotated content on other sites. This talk struck a chord with me, because I know how hard it is to get high quality content for a website.

Specific Use Cases of Annotator

Yesterday's talks also covered several interesting use cases of Annotator, an open source JavaScript-based tool that aims to be the reference platform for web annotations and has been commonly adopted in this space; it is what we are using in H2O. Many of the people attending the conference are using Annotator and are interested in its future and capabilities. Some implementation highlights were:

  • RapGenius: Aforementioned, contains active community of annotating lyrics.
  • SocialBook: A classroom and/or "book club" application for interactive discussions and annotations of books.
  • FinancialTimes: From my understanding, annotations are the guide to how content is aggregated and presented in various facets of the site.
  • Collaborative web-based annotation tools under development at MIT, with similarities to H2O.
  • AustESE: Work being done in Australia for scholarly editing, includes a Drupal plugin implemented using Annotator with several plugins layered on top, including image annotation, categorization, threaded discussions.
  • A tool built on top of Annotator, featuring several advanced capabilities such as image annotation, a bookmarklet-based annotation implementation, and real-time stream updates with search.

After the morning talks, we broke into two longer small-group sessions, and I joined sessions delving into deeper issues and implementation details of Annotator, as well as the challenges and needs associated with annotating the law. I'll share my presentation and more notes from today's talks. Stay tuned!