Welcome to End Point's blog

Ongoing observations by End Point people.

DevCamps: Creating new camps from a non-default Git branch

I recently set up part of a DevCamps installation for a new Rails project with a unique Git repo setup, and discovered a trick for creating camps from a Git branch other than master. Admittedly, the circumstances that led me to discover this trick are a bit specific to this project, but the trick itself can be useful in other situations as well.

The Git repo specified in local-config had a master branch with nothing in it but the standard "initial commit." This relatively new project uses a simplified git-flow workflow, and as such all of its code was still in the "develop" branch.

In my case, this empty-ish master branch meant there were no tracked files in the __CAMP_PATH__/public directory, which meant Git did not create that directory when the repo was cloned by `mkcamp`, which meant apache2 would refuse to start. Camping without a web server makes my back hurt, so I snooped around a little bit...

I discovered two things:

  1. You can tell `git clone` which branch to check out initially by passing it a '--branch $your_non_default_branch' switch.
  2. The `mkcamp` command will happily pass that switch (as well as any other spicy options you include) along to the `git clone` system command it executes. To do that, just add it to your camp type's local-config file as part of the 'repo_path_git' config variable. For example:

    repo_path_git:git@github.com:somegituser/somegitrepo.git --branch develop

Note that this option means your fresh new camp won't have a 'master' branch checked out. This might confuse some users, but we all know the 'master' branch is nothing but a tracking branch with some convention mixed in. A simple `git checkout master` will create that expected master branch easily enough. It's probably worth giving your devs a heads up about this, lest they think something wonky is afoot with mkcamp.

Now, there are people out there who may try to find fault with my solution. These detractors, these misanthropes, these malingering sluggards might cry, "Why don't you just commit an empty __CAMP_PATH__/public/.gitkeep to your master branch?" Well, I like a clean Git history. So, to those people I would say, "David, that's messy and silly and wouldn't make a very good blog article at all. I'm embarrassed for you for even bringing it up, David."



Automatically kill process using too much memory on Linux

Sometimes on Linux (and other Unix variants) a process will consume way too much memory. This is more likely if you have a fair amount of swap space configured -- still within the normal range, for example, as much swap as you have RAM.

There are various methods to try to limit trouble from such situations. You can use the shell's ulimit setting to put a hard cap on the amount of RAM allowed to the process. You can adjust settings in /etc/security/limits.conf on both Red Hat- and Debian-based distros. You can wait for the OOM (out of memory) killer to notice the process and kill it.

But none of those remedies helps in situations where you want a process to be able to use a lot of RAM sometimes, when there's a point to it, and it's not just stuck in an infinite loop that will eventually use all memory.

Sometimes such a bad process will bog the machine down horribly before the OOM killer notices it.

We put together the following script about a year ago to handle such cases:

It uses the Proc::ProcessTable module from Perl's CPAN to do the heavy lifting. We invoke it once per minute in cron. If you have processes eating up memory so quickly that they bring down the machine in less than a minute, you could run it in a loop every few seconds instead.

It's easy to customize based on various attributes of a process. In our example here we have it ignore root processes, which are assumed to be better vetted. We have commented out a restriction to watch only for Ruby on Rails processes in Passenger. And we kill only processes using 1 GiB or more of RAM.

If a process makes it past these tests and is considered bad, we print out a report that crond emails to us, so we can investigate and ideally fix the problem. Then we try to kill the process gracefully, and after 5 seconds forcibly terminate it.

It's simple, easily customizable, and has come in handy for us.
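
For illustration, here's a rough sketch of that logic, in Ruby rather than the original Perl with Proc::ProcessTable. It's only a sketch: it assumes a Linux /proc filesystem, keys off resident memory, and runs as root from cron.

#!/usr/bin/env ruby
# Sketch only -- not the original Perl/Proc::ProcessTable script.

LIMIT_KB = 1024 * 1024  # 1 GiB of resident memory

Dir.glob('/proc/[0-9]*').each do |dir|
  pid = File.basename(dir).to_i
  next if pid == Process.pid

  begin
    status = File.read("#{dir}/status")
  rescue Errno::ENOENT, Errno::EACCES
    next  # the process exited, or we can't inspect it
  end

  uid    = status[/^Uid:\s+(\d+)/, 1].to_i
  rss_kb = status[/^VmRSS:\s+(\d+)/, 1].to_i
  name   = status[/^Name:\s+(\S+)/, 1]

  next if uid.zero?  # ignore root processes, assumed to be better vetted
  # next unless File.read("#{dir}/cmdline") =~ /Rails/  # optionally watch only Passenger Rails apps
  next if rss_kb < LIMIT_KB

  # Print a report for cron to email to us, then kill gracefully, then forcibly.
  puts "Killing PID #{pid} (#{name}), UID #{uid}, using #{rss_kb} kB of resident memory"
  begin
    Process.kill('TERM', pid)
    sleep 5
    Process.kill('KILL', pid)
  rescue Errno::ESRCH
    # the process already exited
  end
end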

Git: Delete your files and keep them, too

I was charged with cleaning up a particularly large, sprawling set of files comprising a Git repository. One whole "wing" of that structure consisted of files that needed to stay around in production: various PDFs, PowerPoint presentations, and Windows EXEs that were only ever needed by the customer's partners and were downloaded from the live site. Our developer camps never wanted local copies of these files, which amounted to over 280 MB (and since we have dozens of camps shadowing this repository, all on the same server, this will save a few GB at least).

I should point out that our preferred deployment is to have production, QA, and development all be working clones of a central repository. Yes, we even push from production, especially when clients are the ones making changes there. (Gasp!)

So: the aim here is to make the stuff vanish from all the other clones (when they are updated), but to preserve the stuff in one particular clone (production). Also, we want to ensure that no future updates in that "wing" are tracked.

# From the "production" clone:
 $ cd stuff
 $ git rm -r --cached .
 $ cd ..
 $ echo "stuff" >>.gitignore
 $ git commit ...
 $ git push ...
Now, everything that was in the "stuff" tree remains for "production", but every other clone will remove these files when it updates from the central repository:
$ git pull origin master
 ...
 delete mode 100644 stuff/aaa
 delete mode 100644 stuff/aab
 ... 

Company Update August 2012

Everyone here at End Point has been busy lately, so we haven’t had as much time as we’d like to blog. Here are some of the projects we’ve been knee deep in:

  • The Liquid Galaxy Team (Ben, Adam, Kiel, Gerard, Josh, Matt) has been working on several Liquid Galaxy installations, including one at the Monterey Bay National Marine Sanctuary Exploration Center in Santa Cruz, and one for the Illicit Networks conference in Los Angeles. Adam has also been preparing Ladybug panoramic camera kits for clients to take their own panoramic photos and videos. The Liquid Galaxy team welcomed new employees Aaron Samuel in July, and Bryan Berry just this week.
  • Brian B. has been improving a PowerCLI script to manage automated cloning of VMware vSphere virtual machines.
  • Greg Sabino Mullane has been working on various strange PostgreSQL database issues, and gave a riveting presentation on password encryption methods.
  • Josh Tolley has been improving panoramic photo support for Liquid Galaxy and expanding a public health data warehouse.
  • David has been at work on a web-scalability project to support customized content for a Groupon promotion, while continuing to benefit from nginx caching. He has also been working on the design of highly available PostgreSQL clusters with multiple independent synchronization requirements, and on SEO maintenance work for an Interchange-driven site.
  • Ron has been adding shop functionality and DevCamps to several client websites.
  • Phin has been working on a Django-based inventory app to help End Point keep track of servers, virtual machines, updates, backup verification, and other details about our hosting, monitoring, and other infrastructure concerns.
  • Steph has been busy on a Piggybak Ruby on Rails 3 project that is scheduled to launch soon, and she continues to work on other Rails and Interchange projects.
  • Jeff has been integrating support for promotion codes on two clients’ websites, and adding third-party database support to another client’s website.
  • Carl set up automated emails for expiring subscriptions, added a separate website that runs from a client’s main admin, and set up a way to show product pages in a third-party website within an iframe.
  • Josh Williams recreated an existing environment in Amazon Web Services EC2 instances and helped multiple clients with streaming replication on PostgreSQL.
  • Phunk has been busy wrapping up a large Rails 3 project and is transitioning to another large Rails 3 project with a nice API and elegant metadata management using ElasticSearch, CouchDB, OAuth2, and jQuery.
  • Marina has been working on several Rails projects, including building out a new Rails 3.2 site with RailsAdmin and integrating several rich third-party community features.
  • Greg D. has been building out functionality to help users create data visualizations for another client using Django and Weave. He also recently gave a company presentation on popular JavaScript libraries.
  • Jon has been interviewing candidates for our Ruby on Rails developer position. There have been lots of strong applicants. He’s also been looking into Git’s post-checkout hook for some added automation, troubleshooting some RPM build problems, and adding Rails (for RailsAdmin) to an existing Sinatra site via some neat nginx configuration. And he continues to be surprised by how bad Amazon’s EC2 I/O and CPU performance can be.
  • Mark set up PayPal for two Interchange clients and set up saved credit cards via Authorize.Net CIM integration for another.
  • Mike has been busy at work on a large-scale Spree project, building out support for custom CSS and account management integration.
  • And finally, Rick has been holding down the fort which includes integrating new clients, keeping our many current clients happy, managing company finances, and keeping the machine running and well-oiled overall!

As you can see, we’ve got a lot of variety here, so let us know if you’d like to hear more on any of these topics, or if some features we’ve developed for others may be useful to you.

Paginating API calls with Radian6

I wrote about Radian6 in my earlier blog post. Today I will review one more aspect of the Radian6 API: call pagination.

Most Radian6 requests return paginated data. This introduces the extra complexity of making the request several times in a loop in order to get all the results. Here is one simple way to retrieve paginated data from Radian6 using Ruby blocks.

I will use the following URL to fetch data:
/data/comparisondata/1338958800000/1341550800000/2777/8/9/6/

Let's decipher this.

  • 1338958800000 is the start_date and 1341550800000 is the end_date for the document search. They are June 6, 2012 and July 6, 2012, formatted with date.to_time.to_i * 1000 (see the snippet after this list).
  • 2777 is topic_id, a Radian6 term, denoting a set of search data for every customer.
  • 8 stands for the Twitter media type. There are various media types in Radian6; they reflect where the data came from. The media_types parameter can include a comma-separated list of values for different media types.
  • 9 and 6 are page and page_size respectively.
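
For reference, the millisecond timestamps above can be produced roughly like this (a small illustration; the exact values depend on the local time zone):

require 'date'

start_ms = Date.new(2012, 6, 6).to_time.to_i * 1000  # => 1338958800000 with a UTC-5 local offset
end_ms   = Date.new(2012, 7, 6).to_time.to_i * 1000  # => 1341550800000 with a UTC-5 local offset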

First comes the method to fetch a single page.
In the Radian6 wrapper class:

def page(index, &block)
  data = block.call(index) 
  articles, count = data['article'], data['article_count'].to_i  
  [articles, count]
end

A data record in Radian6 is called an article. A call returns the 'article' field for a page of articles, along with other useful fields such as 'article_count'.

Now we will retrieve all pages of data from Radian6:

def paginated_call(&block)
  articles, index, count = [], 0, 0
  begin
    index += 1
    batch, count = page(index, &block)
    articles += batch 
  end while count > 0
  articles
end

Time to enjoy the method! I'm using the httparty gem to make requests to the API.

paginated_call do |page|
  get("/data/comparisondata/1338958800000/1341550800000/2777/8/#{page}/1000/")
end
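
Putting these pieces together, the wrapper class might look roughly like this. This is only a sketch: the class name and base_uri are placeholders, and Radian6 authentication is omitted.

require 'httparty'

class Radian6Client
  include HTTParty
  base_uri 'https://radian6.example.com'  # placeholder, not the real API host
  # authentication headers omitted in this sketch

  def page(index, &block)
    data = block.call(index)
    [data['article'], data['article_count'].to_i]
  end

  def paginated_call(&block)
    articles, index, count = [], 0, 0
    begin
      index += 1
      batch, count = page(index, &block)
      articles += batch
    end while count > 0
    articles
  end

  # Fetch all Twitter articles (media type 8) for a topic in the given time range.
  def twitter_articles(topic_id, start_ms, end_ms)
    paginated_call do |page|
      self.class.get("/data/comparisondata/#{start_ms}/#{end_ms}/#{topic_id}/8/#{page}/1000/")
    end
  end
end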

Thanks for flying!

Merging Two Google Accounts: My Experience

Before I got married, I used a Gmail account associated with my maiden name (let's call this account A). After I got married, I switched to a new Gmail address (let's call this account B). This caused daily annoyances, as my use of various Google services was split between the two accounts.

Luckily, some Google services allow you to easily toggle between two accounts, but there is no easy way to define which account to use as the default for which service, so I found myself toggling back and forth frequently. Unfortunately, Google doesn't provide functionality to merge multiple Google accounts. You would think they might, especially given my particular situation, but I can see how it's a bit tricky to determine logically how merged data should be handled. So instead I set off on migrating all my data to account B, as described in this post.

Consider Your Google Services

First things first, I took a look at the Google services I used. Here's how things broke down for me:

  • Gmail: Account A forwards to account B. I always use account B.
  • Google+: Use through account A.
  • Google Analytics: Various accounts divided between account A and account B.
  • Blogger: Use through account A.
  • Google Calendar: Various calendars split between account A and account B.
  • Google Documents: Various documents split between account A and account B.
  • Google Reader: Don't use.
  • Google Voice: Use through account B.
  • YouTube: Don't use (other than consumption).

After reviewing this list, I determined I would have to migrate Google+, several Google Analytics accounts, Blogger, Google Calendar, and Google Documents. I set off to look for various directions for merging or migrating data, broken down below.

Google+

Google Plus was easy to migrate. I followed the directions described here, which essentially involve sharing circles from account A with account B and then importing the circles into account B. It was quick and easy.

Google Analytics

Google Analytics was a little more time consuming. For each Analytics account assigned to account A, while logged in as account A, I added account B as an admin user. Then I logged in as account B, downgraded account A from admin to a regular user, and deleted account A. Note that, in my experience, you must downgrade a user from admin before you are allowed to delete it.

Blogger

To migrate Blogger account settings, I invited account B to a blog while logged in as account A in browser #1. In browser #2, I logged in as account B and accepted the invitation. Back in browser #1, still as account A, I gave account B admin access. After verifying account B's admin access in browser #2, I returned to browser #1 and removed account A from the blog. I repeated these steps until I had transitioned all the blogs.

Google Documents

The Google Documents migration was by far the most time consuming data migration for all the services. This article from Lifehacker says, "If you're migrating to a regular Google account, transferring your Google Docs is easy. Just select all the documents you want to migrate, go to the More Actions drop down menu, and choose Change Owner. Type in Account 2's address in the box that comes up. You'll see all your documents in Account 2." For about 50% of the documents owned by account A, I was able to change the owner under the shared options from account A to account B.

But for an unexplained reason, I was not allowed to reassign the owner of the remaining documents. I couldn't find any explanation for why this was the case. So I migrated the data by brute force: I downloaded the remaining data owned by account A and uploaded it as account B. This was irksome and time consuming, but it was the last step in finishing the migration!

Associated Email Accounts

One quick change I had to make here was to remove my End Point email association with account A and add it to account B, so that any documents shared with my End Point email address would be visible by account B. This was done under Google Account Settings.

Conclusion

The time spent on the account migration was worth it, in retrospect! There are many available resources for merging other Google Services if you find yourself in a similar position. Google it ;)





Using Different PostgreSQL Versions at the Same Time

When I work for multiple clients on multiple projects, I usually need a bunch of different software on my machine. One of the things I need is multiple PostgreSQL versions installed.

I use Ubuntu 12.04, where installing PostgreSQL is quite easy. Currently two versions are available out of the box: 8.4 and 9.1. To install them I used the following command:

~$ sudo apt-get install postgresql-9.1 postgresql-8.4 postgresql-client-common

Now I have the above two versions installed.

Starting the database is also very easy:

~$ sudo service postgresql restart
 * Restarting PostgreSQL 8.4 database server   [ OK ] 
 * Restarting PostgreSQL 9.1 database server   [ OK ] 

The problem I had for a very long time was using the proper psql version. Both databases installed their own programs, such as pg_dump and psql. Normally you can use the pg_dump from the higher PostgreSQL version, but using different psql versions can be dangerous, because psql issues a lot of queries that dig deep into PostgreSQL's internal tables to get information about the database. Those internals sometimes change from one version to another, so the best solution is to use the psql from the PostgreSQL installation you want to connect to.

The solution to this problem turned out to be quite simple. There is a pg_wrapper program which can take care of the different versions: tell it which PostgreSQL cluster you want to connect to, and it will automatically choose the correct psql version.

Below you can see the results of the psql --version command, which prints the psql version. As you can see, different psql versions are chosen according to the --cluster parameter.

~$ psql --cluster 8.4/main --version
psql (PostgreSQL) 8.4.11
contains support for command-line editing
~$ psql --cluster 9.1/main --version
psql (PostgreSQL) 9.1.4
contains support for command-line editing

You can find more information in the program manual using man pg_wrapper, or in the online pg_wrapper manual.

Hidden inefficiencies in Interchange searching

A very common, somewhat primitive approach to Interchange searching looks like this:

The search profile contains something along the lines of:

  mv_search_type=db
  mv_search_file=products
  mv_column_op=rm
  mv_numeric=0
  mv_search_field=category

[search-region]
  [item-list]
    [item-field description]
  [/item-list]
[/search-region]

In other words, we search the products table for rows whose column "category" matches an expression (with a single query), and we list all the matches (description only). However, this can be inefficient depending on your database implementation: the item-field tag issues a query every time it's encountered, which you can see if you "tail" your database log. If your item-list contains many different columns from the search result, you'll end up issuing many such queries:

[item-list]
    [item-field description], [item-field weight], [item-field color],
    [item-field size], [item-field ...]
  ...

resulting in:

SELECT description FROM products WHERE sku='ABC123'
SELECT weight FROM products WHERE sku='ABC123'
SELECT color FROM products WHERE sku='ABC123'
SELECT size FROM products WHERE sku='ABC123'
...

(Now, some databases are smart enough to cache query results, but some aren't, so avoiding this extra work is probably worth your trouble even on a "smart" database, in case your Interchange application gets moved to a "dumb" database sometime in the future.)

Fortunately, it's easy to correct:

mv_return_fields=*

and then

...
    [item-param description]
...

in place of "item-field".

Rails 3 ActiveRecord caching bug ahoy!

Sometimes bugs in other people's code make me think I might be crazy. I’m not talking Walter Sobchak gun-in-the-air-and-a-Pomeranian-in-a-cat-carrier crazy, but “I must be doing something incredibly wrong here” crazy. I recently ran into a Rails 3 ActiveRecord caching bug that made me feel this kind of crazy. Check out this pretty simple caching setup and the bug I encountered, and tell me: Am I wrong?

I have two models with a simple parent/child relationship defined with has_many and belongs_to ActiveRecord associations, respectively. Here are the pertinent bits of each:

class MimeTypeCategory < ActiveRecord::Base
  # parent class
  has_many :mime_types

  def self.all
    Rails.cache.fetch("mime_type_categories") do
      MimeTypeCategory.find(:all, :include => :mime_types)
    end
  end
end

class MimeType < ActiveRecord::Base
  # child class
  belongs_to :mime_type_category
end

Notice how in MimeTypeCategory.all, we are eager loading each MimeTypeCategory’s child MimeTypes, because our app tends to use those MimeTypes any time we need a MimeTypeCategory. Then, we cache that entire data structure because it’s a good candidate for caching and we like our app to be fast.

Now, to reproduce this Rails caching bug, I clear my app’s cache using 'Rails.cache.clear' in the rails console, then load any page in my app that calls MimeTypeCategory.all. The page loads successfully and shows no errors. Doesn’t sound like a bug so far, right? If I load that same page a second time, I will get the standard Rails error page with:

undefined class/module MimeType
...
(app/models/mime_type_category.rb:17:in 'all')

Crazy, right? Why does it *appear* that one cannot cache model instances in Rails, and why did it work for exactly one page request after the Rails cache was cleared? Well, the former obviously cannot be true, and the latter is due to how Rails.cache.fetch handles cache misses and cache hits. For a cache miss, Rails.cache.fetch executes its block, serializes the return value, saves it to your cache store, then returns the block’s return value directly. For a cache hit, it reads the cached value from your cache store, deserializes it back into the objects it represents, and returns that.

This is all well and good until you’re going along, innocently working on your app in the development Rails environment with config.cache_classes = false (which forces your app to lazy-load requested classes for each page request). In that situation, Rails will try to deserialize the cached data structure that holds references to the MimeType class. But Rails may not have loaded the MimeType class at that point, so deserialization fails and produces the error we see above. If other code paths in your app happen to load the child class before this type of cached parent/child data structure is read, you might not hit the bug. Now you’ve entered a world of debugging pain.

I’m not about to give up on automatic class reloading in my development environment, and I don’t want to remove the cached eager loading of my child MimeTypes because it’s sweet. So, after some digging, I discovered a solution: require_association. Adding “require_association ‘mime_type’” to my parent MimeTypeCategory class forces Rails to load the MimeType model when it loads the MimeTypeCategory model, so that it can always deserialize the cached data structure successfully. I’ve used require_association in the same way for other instances of the same caching bug in our app as well.
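
In context, the fix looks something like this (a sketch of the parent model from above, with the require_association line as the only addition):

# app/models/mime_type_category.rb
require_association 'mime_type'  # ensure MimeType is loaded before cached records are deserialized

class MimeTypeCategory < ActiveRecord::Base
  has_many :mime_types

  def self.all
    Rails.cache.fetch("mime_type_categories") do
      MimeTypeCategory.find(:all, :include => :mime_types)
    end
  end
end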

Hopefully this explanation helps people avoid some of the pain I experienced while trying to determine if it was a Rails bug/feature or if I had finally gone insane. I should point out that some of the reading I’ve done suggests “require_dependency” is the more appropriate solution for this problem. I’ve verified that require_association works in all my cases, but to avoid “programming by coincidence,” I am going to snoop around the Rails core to understand the difference between the two.

Lastly, please remember: You can’t board a Pomeranian - they get upset and their hair falls out.