Welcome to End Point’s blog

Ongoing observations by End Point people

KISS: Slurping up File Attachments

I've been heavily involved in an ecommerce project running on Rails 3, using Piggybak, RailsAdmin, Paperclip for file attachment management, nginx, and unicorn. One thing we've struggled with is handling large file uploads, both in the RailsAdmin import process and from the standard RailsAdmin edit page. Nginx is configured to limit request size and duration, which is a problem for the large purchasable, downloadable files that we upload.

To allow these uploads, I brainstormed how to decouple the file upload from the import and update process. Phunk recently worked on integrating Resque, a popular Rails queueing tool, and it worked nicely. However, I ultimately decided to go down a simpler route. The implementation is described below.

Upload Status

First, I created an UploadStatus model, to track the status of any file uploads. With RailsAdmin, there's an automagic CRUD interface connected to this model. Here's what the migration looked like:

class CreateUploadStatuses < ActiveRecord::Migration
  def change
    create_table :upload_statuses do |t|
      # The column option is :null (not :nil); :null => false adds a NOT NULL constraint.
      t.string :filename, :null => false
      t.boolean :success, :null => false, :default => false
      t.string :message
    end
  end
end

RailsAdmin also leverages CanCan, so I updated my ability class to allow only list, read, and delete on the UploadStatus table, since there is no need to create or edit these records:

      cannot [:create, :export, :edit], UploadStatus
      can [:delete, :read], UploadStatus

KISS Script

Here's the simplified rake task that I used for the process:

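namespace :upload_files do
  task :run => :environment do
    files = Dir.glob("#{Rails.root}/to_upload/*.*")
    files.each do |full_filename|
      begin
        ext = File.extname(full_filename)
        name = File.basename(full_filename, ext)

        (klass_name, field, id) = name.split(':')
        klass = klass_name.classify.constantize
        item = klass.find_by_id(id)  # find would raise instead of returning nil

        if item.nil?
          UploadStatus.create(:filename => "#{name}#{ext}", :message => "Could not find item from #{id}.")
          next
        end

        item.send("#{field}=", File.open(full_filename))  # assumed: assign the file to the Paperclip attachment
        item.save!  # assumed: raise on failure so the rescue below records it

        File.delete(full_filename)  # assumed: remove the original only after a successful save

        UploadStatus.create(:filename => "#{name}#{ext}", :success => true)
      rescue StandardError => e  # avoid rescuing Exception, which also traps interrupts
        UploadStatus.create(:filename => "#{name}#{ext}", :message => "#{e.inspect}")
      end
    end
  end
end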

And here's how the process breaks down:

  1. The script iterates through files in the #{Rails.root}/to_upload directory (lines 3-4).
  2. Based on the filename, in the format "class_name:field:id.extension", the item to be updated is retrieved (line 11).
  3. If the item does not exist, an upload_status record is created with a message that notes the item could not be found (lines 13-16).
  4. If the item exists and the update succeeds, the original file is deleted, and a successful upload status is recorded (lines 18-23).
  5. If the process fails anywhere, the exception is logged in a new upload status record (lines 24-26).
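The filename convention from step 2 can be tried on its own, without Rails. What follows is a minimal sketch, not the code used in the project: parse_upload_filename is a hypothetical helper, and the sample path is made up for illustration. Unlike the rake task, it rejects names that do not have exactly three colon-separated parts.

```ruby
# Hypothetical helper: splits "class_name:field:id.extension" into its parts.
# Returns nil when the name does not have exactly three non-empty parts.
def parse_upload_filename(full_filename)
  ext  = File.extname(full_filename)
  name = File.basename(full_filename, ext)

  parts = name.split(':')
  return nil unless parts.length == 3 && parts.none?(&:empty?)

  klass_name, field, id = parts
  # The id stays a string here; ActiveRecord finders accept string ids.
  { :klass_name => klass_name, :field => field, :id => id, :ext => ext }
end

# Made-up example path, following the to_upload convention.
parsed = parse_upload_filename("/app/to_upload/product:download:42.zip")
puts parsed.inspect
```

Validating the name up front like this would let a malformed filename be reported with a specific message, instead of surfacing as a generic exception in the upload status.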

This rake task is then called via a nightly cron job to slurp up the files. The simple script eliminates the requirement to upload large files via the admin interface, and decouples the upload from Paperclip/database management. It also has the added benefit of reporting the status to the administrators by leveraging RailsAdmin. Many features can be added to it, but it does the job that we need without much development overhead.
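One feature that could be added is a guard against overlapping runs, in case a nightly run is still busy when the next one starts. The sketch below is an assumption about how that might look, not something the project does: with_exclusive_lock and the lock file path are hypothetical names, and the approach relies on the standard File#flock call with a non-blocking exclusive lock.

```ruby
require 'tmpdir'

# Hypothetical helper: runs the block only if no other run holds the lock file.
# Returns true if the block ran, false if another run is still in progress.
def with_exclusive_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0644) do |lock|
    # With LOCK_NB, flock returns false instead of blocking when the lock is taken.
    return false unless lock.flock(File::LOCK_EX | File::LOCK_NB)
    yield
    true
  end
end

# Made-up lock location; the lock is released when the file is closed.
lock_path = File.join(Dir.tmpdir, "upload_files.lock")
ran = with_exclusive_lock(lock_path) { puts "processing to_upload/ ..." }
puts "skipped: a previous run is still active" unless ran
```

In the rake task, the body of the files.each loop would go inside the block, so a second cron invocation exits quietly rather than picking up the same files.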


Brian Gadoury said...

Honestly, for all the abusive things I've done with Resque in our DevCamps setup, it's been completely solid and carefree. It has a couple advantages over the "cron job running a rake task" approach that I found really nice. But, they're probably not terribly compelling unless they're handling something that's customer-facing (or at least used in higher volume by a bunch of internal users.)

Not as a defense of Resque, but rather just for the sake of argument, here are the main things I like about Resque that one doesn't get with a sweet and simple cron job:

* An optional, simple admin UI mountable at any url with a simple config/routes.rb change. This provides good visibility into what jobs are running in what queues along with error logs and the ability to re-try failed jobs.

* Multiple job queues that can be assigned multiple worker instances to manage job turn-around time

* Workers stay resident and only fire when there's something to do.

* Jobs are managed in (surprise!) a queue, which avoids the possibility of cron jobs backing up (because the DB had nodded off, etc) all trying to process the same files at the same time.

* Naturally leads the developer to put their worker code where it's more easily testable. :P

Again, these advantages are only advantages in some use cases, and I can see the validity of the simplicity argument as well. As you personally know, we initially started using Resque because we needed to decouple file uploads from post-processing (thumbnails!) while still getting the post-processing done as quickly as possible. That's clearly not exactly what you needed here.


Steph Skardal said...

Thanks for your input Phunk. It looks like you should really be writing some blog articles based on this comment, eh?