Spree and Software Development: Git and Ruby techniques

Having tackled a few interesting Spree projects lately, I thought I'd share some software development tips I've picked up along the way.

Gem or Source?

The first decision you may need to make is whether to run Spree from a gem or from source. Directions for both are included in the Spree Quickstart Guide, but the guide doesn't touch on the motivation for running from a gem versus source. The Spree documentation does address the question, but I wanted to comment based on recent experience. I've preferred to build an application running from the gem for most client projects. The only times I've decided to work against the Spree source code were when the Spree edge code had a major change that wasn't available in a released gem, or when I wanted to troubleshoot the internals of Spree, such as the extension loader or localization functionality.

If you follow good code organization practices and develop modular, abstracted functionality, switching between gem and source should be quite easy. However, the switch may not be cleanly managed from a version control perspective.

git rebase

Git rebase is lovely. Ethan describes some examples of using git rebase here. Whether I'm working with several other developers or am the sole developer, I include rebasing in my pull and push workflow.
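
For example, a minimal version of that workflow (assuming an origin remote and a master branch) looks like this:

git pull --rebase origin master
# resolve any conflicts, then continue with: git rebase --continue
git push origin master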

.gitmodules

Git submodules are lovely, also. An overview on git submodules with contributions from Brian Miller and David Christensen can be read here. Below is an example of a .gitmodules from a recent project that includes several extensions written by folks in the Spree community:

[submodule "vendor/extensions/faq"]
        path = vendor/extensions/faq
        url = git://github.com/joshnuss/spree-faq.git
[submodule "vendor/extensions/multi_domain"]
        path = vendor/extensions/multi_domain
        url = git://github.com/railsdog/spree-multi-domain.git
[submodule "vendor/extensions/paypal_express"]
        path = vendor/extensions/paypal_express
        url = git://github.com/railsdog/spree-paypal-express.git

.gitignore

This should apply to software development for other applications as well, but it's important to set up .gitignore correctly at the beginning of the project. I typically ignore database, log, and tmp files. Occasionally, I ignore some public asset files (stylesheets, javascripts, images) if they are copied over from an extension upon server restart, which is standard in Spree.
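
As a starting point, a .gitignore along these lines covers the cases mentioned above (the exact entries depend on your application and extensions):

log/*.log
tmp/**/*
db/*.sqlite3
config/database.yml
# only if your extensions copy these over on restart:
public/stylesheets/*
public/javascripts/*
public/images/*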

Overriding modules, controllers, and views

Now, the good stuff! Let's assume the Spree core is missing functionality that you need. Your options include overriding or extending existing models, controllers, or views, or writing and including new ones. Spree's Extension Tutorial covers adding new controllers, models, and views, so I'll discuss extending and overriding existing models, views, and controllers below.

Extend an Existing Controller

To extend an existing controller, I've typically included a module with the extended behavior in the *_extension.rb file. For all examples, let's assume that my extension is named "Site", another standard in Spree. The code below shows the module include in site_extension.rb:

...
def activate
  ProductsController.send(:include, Spree::Site::ProductsController)
end
...

My ProductsController module, inside the Spree::Site namespace, includes the following to define a before filter in the Spree core products controller:

module Spree::Site::ProductsController
  def self.included(controller)
    controller.class_eval do
      append_before_filter :do_stuff
    end
  end

  def do_stuff
    # doing stuff
  end
end

Override an Existing Controller Method

Next, to override a method in an existing controller, I've started the same way as before, by including a module in site_extension.rb:

...
def activate
  CheckoutsController.send(:include, Spree::Site::CheckoutsController)
end
...

The Spree::Site::CheckoutsController module will contain:

module Spree::Site::CheckoutsController
  def self.included(target)
    target.class_eval do
      alias :spree_rate_hash :rate_hash
      def rate_hash; site_rate_hash; end
    end
  end

  def site_rate_hash
    # compute new rate_hash
  end
end

In this example, the core rate_hash method is aliased for later use, and rate_hash is redefined inside the class_eval block. This demonstrates how to override the core shipping rate computation during checkout.

Extend an Existing Model

Next, I'll provide an example of extending an existing Spree model. site_extension.rb will include the following:

...
def activate
  Product.send(:include, Spree::Site::Product)
end
...

And Spree::Site::Product module contains:

module Spree::Site::Product
  def new_product_method
    # new product instance method
  end
end

If you want to create a class method rather than an instance method, you may include the following in Spree::Site::Product:

module Spree::Site::Product
  def self.included(target)
    def target.do_something_special
      'Something Special!'
    end
  end
end

The above example adds a method to the Product class itself that can be called from a view; for example, <%= Product.do_something_special %> will return 'Something Special!'.

Override an Existing Model Method

To override a method from an existing model, I start with a module include in site_extension.rb:

...
def activate
  Product.send(:include, Spree::Site::Product)
end
...

And Spree::Site::Product contains the following:

module Spree::Site::Product
  def self.included(model)
    model.class_eval do
      alias :spree_master_price :master_price
      def master_price; site_master_price; end
    end
  end
  def site_master_price
    '1 billion dollars'
  end
end 

And from the view, the two methods can be called within the following block:

<% @products.each do |product| -%>
<%= product.master_price %> vs <%= product.spree_master_price.to_s %>
<% end -%>

Extend an Existing View

I previously discussed the introduction of hooks in depth here and here. To extend an existing view that has a hook wrapped around the content you intend to modify, you may add something similar to the following to *_hooks.rb, where * is the extension name:

insert_after :homepage_products, 'shared/promo'

The above code renders the 'shared/promo' partial immediately after the homepage_products hook in the Spree gem or Spree source app/views/products/index.html.erb view. Other hook actions include insert_before, replace, and remove.
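
The other actions follow the same pattern in *_hooks.rb (the hook and partial names here are just for illustration):

insert_before :homepage_products, 'shared/banner'
replace :product_description, 'shared/custom_description'
remove :product_taxons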

Override an Existing View

Before the introduction of hooks, the standard method of overriding or extending core views was to copy the core view into your extension view directory and apply changes. Hooks are not always in the desired location, though; for example, to override the footer view, which has no hook, I copy the Spree gem footer view to the extension view directory. The diff below compares the Spree gem view and my extension footer view:

- <div id="footer">
-  <div class="left">
-    <p>
-      <%= t("powered_by") %> <a href="http://spreecommerce.com/">Spree</a>
-    </p>
-  </div>
-  <div class="right">
-    <%= render 'shared/language_bar' if Spree::Config[:allow_locale_switching] %>
-  </div>
- </div>
<%= render 'shared/google_analytics' %>
+<p><a href="http://www.endpoint.com/">End Point</a></p>

Sample data

A final tip that I've found helpful when developing with Spree is to create sample data files in the extension db directory to maintain data consistency between developers. In a recent project, I created the following stores.yml data to set up several stores for the multi domain extension:

store_1:
  name: Store1
  code: store1
  domains: store1.mysite.com
  default: false
store_2:
  name: Store2
  code: store2
  domains: store2.mysite.com
  default: true
store_3:
  name: Store3
  code: store3
  domains: store3.mysite.com
  default: false

Many of these tips apply to software development in general. The tips specific to development in Spree (and possibly other Rails platforms) include the sample data syntax and the Ruby techniques described above for extending and overriding model, controller, and view functionality.


Xen MAC mismatch VNC mouse escape HOWTO

This is a story that probably wouldn't need to be told if everything were documented somewhere. I'm not using any fancy virtualization management tools and didn't have an easy time piecing everything together, so I thought it'd be worth writing down the steps of the manual approach I took.

Dramatis personæ:

  • Server: Red Hat Enterprise Linux 5.4 with Xen kernel
  • Guest virtual server: CentOS 5.4 running paravirtualized under Xen
  • Workstation: Ubuntu 9.10

The situation: I updated the CentOS 5 Xen virtual guest via yum and rebooted it to load the new Linux kernel and other libraries such as glibc. According to Xen as reported by xm list, the guest had started back up fine, but it wasn't reachable over the network via ping, http, or ssh, including from the host network.

The guest wasn't using much CPU (as shown by xm top), so I figured it wasn't just a slow-running fsck during startup. And I was familiar with the iptables firewall rules on this guest, so I was fairly sure I wasn't being blocked there. I needed to get to the console to see what was wrong.

The way I've done this before is using VNC to access the virtual console remotely. The Xen host was configured to accept VNC connections on localhost, which I could see by looking in /etc/xen/xend-config.sxp:

(vnc-listen '127.0.0.1')

There were 11 Xen guests, with consoles listening on TCP ports 5900-5910. Which one belonged to my guest? I don't know of any simple way to get a list that maps ports to Xen guests, so I did it this way:

ps auxww | grep qemu-dm

I noted the PID of the process that was running for my guest as revealed in its command line. Then I looked for the listener running under that PID:

netstat -nlp

I looked for $pid/qemu-dm in the PID/Program Name column and could then see the TCP port in the Local Address column. In my case it was 127.0.0.1:5903.
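
Rolled into a couple of commands (assuming the guest's name appears in its qemu-dm command line, and substituting your own guest name for myguest), that lookup is:

pid=$(pgrep -f 'qemu-dm.*myguest')
netstat -nlp | grep "$pid/qemu-dm"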

So I set up an ssh tunnel to the server for my VNC traffic:

ssh -f -N -L 5903:localhost:5903 root@$remote_host &

Then I opened the default Ubuntu/GNOME VNC viewer, labeled "Remote Desktop Viewer" under the Internet menu. This program is actually called Vinagre, and is basic but works. I connected to localhost:5903, since I'd forwarded my own local TCP port 5903 to the remote port 5903.

The remote console came up, and I was presented with the login banner and prompt. If I hadn't had the root password, I would have needed to reboot the guest into single-user mode to get a root shell without a password and change the password. But I did have the password, so that wasn't necessary.

Then when trying to change to another window on my desktop, I ran into the biggest snag of the whole exercise: getting control of the mouse out of the VNC remote desktop window and back to my own desktop! I couldn't find anything accurate on this in any documentation, forums, etc. Finally I stumbled across the trick: press F10, which pulls down the Machine menu in Vinagre and, as a side effect, takes control of the mouse away from the remote desktop. It was nice not to have to ssh in from another machine to kill Vinagre. But it makes me wonder how I'd send an F10 through to the remote console ...

Armed with the root password I was able to log into the guest and discover that only the lo (loopback) interface started on boot. The eth0 and eth1 interfaces failed because there was no virtual NIC available with the MAC addresses specified in /etc/sysconfig/network-scripts/ifcfg-eth0 and eth1.

That was because the virtual machine image had been cloned from another one and hadn't been given new unique MAC addresses. The problem was easily fixed by updating the ifcfg-eth0 and ifcfg-eth1 files with the MAC addresses actually given to the interfaces, as seen by ifconfig, which were ultimately assigned by the Xen host in /etc/xen/$host in the vif parameter. (You can also specify no MAC addresses in the guest at all and it will use whatever it gets.)

Then, after running service network restart, the networking was back, and I rebooted to make sure it started correctly on its own.

PostgreSQL Conference East 2010 review

I just returned from the PostgreSQL Conference East 2010. This is one of the US "regional" Postgres conferences, which usually occur once a year on both the East and West coast. This is the second year the East conference has taken place in my home town of Philadelphia.

Overall, it was a great conference. In addition to the talks, of course, there are many other important benefits to such a conference, such as the "hallway tracks", seeing old friends and clients, meeting new ones, and getting to argue about default postgresql.conf settings over lunch. I gave a 90 minute talk on "Postgres for non-Postgres people" and a lightning talk on the indispensable tail_n_mail.pl program.

This year saw the conference take place at a hotel for the first time, and this was a big improvement over the previous school campus-based conferences. Everything was in one building, there was plenty of space to hang out and chat between the talks, and everything just felt a little bit easier. The one drawback was that the rooms were not really designed to lecture to large numbers of people (e.g. no stadium seating), but this was not too much of an issue for most of the talks.

A few of the talks I attended included:

  • Mine! Luckily, my talk was in the very first slot, so I was able to give it and then be done talking for the rest of the conference (with the exception of the lightning talk). My talk was "PostgreSQL for MySQL (and other database people)". A quick show of hands revealed that in addition to a good number of MySQL people, we had people coming from Oracle, Microsoft SQL Server, and even Informix. I walked through the steps to take when upgrading your application from using some other database to using Postgres, pointing out some of the pain points and particular Postgres gotchas, focusing on the SQL syntax. The second half of the talk focused on the Postgres project itself, explaining how it all works, what the "community" and "core" consist of, how companies are involved, how development is done, and the philosophy of the project.
  • "PostgreSQL at myYearbook.com" by Gavin M. Roy. I've heard earlier versions of this talk before, but it was neat to see how much myyearbook.com had grown in just one year and some of the new challenges they faced. Of course, Gavin is still upset about the primary key situation and they are still doing unique indexes instead of PKs so they can do in-place reindexing for bloat removal.
  • Baron Schwartz spoke about "Query Analysis with mk-query-digest". The "mk" is short for maatkit, a nice suite of tools for doing all sorts of database-related things. Granted, it's very MySQL focused at the moment, but Baron has started to port things over to Postgres, and the demo he gave was pretty impressive. I'll definitely be downloading that code and taking a look.
  • Magnus Hagander gave a talk on "Secure PostgreSQL Deployment" which was a lot more interesting than I thought it would be (I knew it had Windows slides). My take-home lessons: never use the ssl mode of "prefer", and always check your Debian systems as they like to switch SSL on everything for no good reason. It's also quite fascinating to see the number of ways you can authenticate to a Postgres database.
  • I attended a talk on "Inside the PostgreSQL Infrastructure" by Dave Page. A lot of it I already knew, as I'm a little involved in said infrastructure, but it was good to hear some of the future plans, including standardizing on Debian instead of FreeBSD in the future.
  • Spencer Christensen's talk on "PostgreSQL Administration for System Administrators" was very well done but mostly review for me :). It was nice to see a shout out in his talk (and some others) for check_postgres.pl.
  • Robert Haas gave a good talk on "The PostgreSQL Query Planner" that seemed to be very well received. The bit about the join removal tech was particularly interesting: the Postgres planner does some really, really clever things when trying to build the best possible plan for your query.

At the lunch on Saturday, Josh Drake asked if anyone else wanted to do a lightning talk, so I made a quick outline on the back of a nearby piece of paper and gave a no-slides, no-notes five minute talk on tail_n_mail.pl. It went pretty well, and I even had 30 seconds left over at the end for questions. To clarify my answer to one of those further now: tail_n_mail.pl can parse CSV logs (indeed, any text file), but it cannot consolidate similar entries yet or any of the other neat things it does until we can teach TNM about how to parse the CSV logs properly.

An excellent conference overall, but I'd be remiss if I didn't offer a little constructive criticism for the next time (and other conferences):

  • Scheduling. The rooms were sometimes hard to find, and the schedule did not list the room next to the talk. That color-coded thing just does not work. In addition, it seemed like similar talks were sometimes stacked up against each other rather than staggered. Thus, you could learn about londiste OR rubyrep, but not both. Similarly, there were two Python talks up against each other.
  • Lightning talks. Always, always put the lightning talks at the *start* of the conference, not the end. Lightning talks are a great way to learn about what other people are doing. By having them at the start of the conference, you have the entire rest of the time to follow up with people about their talks and foster more real-life discussions.
  • Lightning talks. Okay, not done talking about these yet. Lightning talks are somewhat notorious for spending lots of time getting the video to work right, as people switch computers, fiddle with plugs, etc. If you can't get it set up in 30 seconds, start the clock! You should be able to give your lightning talk without slides, if need be.

LibrePlanet 2010: Eben Moglen and the future of Oracle in free software

I just got back from Libre Planet 2010, a conference for free software activists put on by the Free Software Foundation. I imagine most readers of this blog are familiar with the language debate over free software vs. open source. Much of the business and software community has settled into using open source as the term of choice, but Libre Planet is certainly a place where saying "free software" is the norm.

I presented two talks - one on how to give good talks by connecting with your audience, and a second about non-coding roles in free software communities. The first talk was built on my work with user groups and on giving presentations, primarily at free software conferences, over the last five years. The second was built on the great work of Josh Berkus, for a talk that he first gave at a Postgres mini-conference I arranged the day before OSCON 2007.

One talk I attended surprised me with an important discussion of the future of the open source database market.

Eben Moglen spoke about the future of the Free Software Foundation and the new challenges that software freedom faces in a world increasingly dominated by network services - social networking, collaboration tools, and other software where ownership of data is largely shared, and where no single person or entity can legitimately claim to be the sole owner of the data or structure that emerges.

Eben Moglen said, "We are at a point of inflection in our long campaign." He talked at length about the work the Software Freedom Law Center has done, collaborating with organizations whose goals were not necessarily software freedom, nor directly aligned with the FSF. He specifically brought up patent pools, and work that the SFLC has done to bring non-free software companies into the fight against abusive patents.

Eben then turned his attention to the issue of the Oracle/Sun acquisition. He commented that we haven't really looked to Oracle for pro-software-freedom activity in the past, and then that "every technically competent 15-year-old in the world uses MySQL." While this isn't music to the ears of Postgres users and developers, with applications like WordPress out there, I'd say that Eben isn't too far off.

What was interesting to me was Eben's conjecture that MySQL is now essentially a tool being sharpened to stab deeply into the heart of Microsoft's SQL Server market. He pointed out that Oracle has about 375,000 customers, and claimed that there's nowhere you can learn Oracle for free (to which several people have pointed out -- you can download crippled versions of Oracle for free to learn the basics... but I claim that's not the same thing as being able to download and install full server versions of something like MySQL or PostgreSQL).

Regardless of the details, this play by Oracle would be an interesting use of open source software to disrupt a market.

I suggest to the Postgres community that SQL Server to Postgres migrations are a real business opportunity for consultants, and an area in which we as a community should document and assist with transitions as much as possible.

Using psql \o to append to a file

I had a slow query I was working on recently, and wanted to capture the output of EXPLAIN ANALYZE to a file. This is easy, with psql's \o command:

5432 josh@josh# \o explain-results

Once EXPLAIN ANALYZE had finished running, I wanted the psql output back in my psql console window. This, too, is easy, using the \o command without a filename:

5432 josh@josh# \o

But later, after adding an index or two and changing some settings, I wanted to run a new EXPLAIN ANALYZE, and I wanted its output appended to the explain-results file I built earlier. At least on my system, \o will normally overwrite the target file, which would mean I'd lose my original results. I realize it's simple to, say, pipe output to a new file ("explain-results-2"), but I wasn't interested. Instead, because \o can also accept a pipe character and a shell command to pipe its output to, I did this:

5432 josh@josh# \o | cat - >> explain-results

Life is good.

Update: A helpful commenter pointed out I hadn't actually used the same files in the original post. Oops. Fixed.

MountainWest RubyConf 2010 - Steph's Notes

Last Thursday and Friday, I attended MountainWest RubyConf in Salt Lake City. As usual with any conference, my notebook is full of various tools and tips jotted down for later investigation. I thought I'd summarize some of the topics that grabbed my interest and go more in depth on a couple of topics in the next couple of weeks.

  • Lambda: In a talk early in the conference, lambda in Ruby came up. I had a hard time coming up with use cases for lambdas in Rails or in a web app written in another server-side language with equivalent functionality (Python, Perl), but I'd like to look into it. An example was presented using Ruby's lambda to calculate Google's PageRank value, which is particularly appealing given my interest in SEO (see the sketch after this list).
  • Chef: I've heard of Chef since I started working with Rails, but have yet to use it. After my recent adventures with Spree on Heroku, I see the value in becoming more familiar with Chef or another configuration management software tool such as Puppet. I'm particularly interested in creating some Chef recipes for Spree.
  • RVM: RVM, or Ruby Version Manager, is a nice tool to work with multiple Ruby environments from interpreters to sets of gems. For a couple of I/O-intensive Rails apps that I work on, I'm interested in performance benchmarking across different Ruby environments to investigate the business case for updating Ruby. RVM also provides functionality for gem bundle management, which might be of particular value when testing code and applications running from different gem versions.
  • Rails 3: I'm pretty excited for Rails 3. Yehuda Katz talked about Rails topics such as method aliasing and method lookup. He talked a bit about how the lack of modularity hurts development, and modularity in code may be defined as reducing assumptions to increase reuse. He also struck a chord with me when he talked about premature optimization: making decisions about modularity or functionality before it's needed and how this can be a mistake. I read some documentation on Rails 3 over the weekend and am looking forward to its release.
  • Rack and Sinatra: I haven't spent much time playing around with Rack or Sinatra, but have certainly heard a lot about these tools. There was a nice lightning talk given on how to create a simple ecommerce site in a very short time using the active merchant gem (also used by Spree), Sinatra, Rack, and the Rack/Payment gem. I'd like to expand on this in a blog post later.
  • NoSQL: While Jon and I attended this conference, Ethan Rowe attended the NoSQL Live conference in Boston that he blogged about here and here. There was a decent talk at MWRC on MongoDB with some examples on data interactions. The speaker discussed how NoSQL data "would be great" for CMS systems because of the diversity and amount of unknown attributes. I'm not quite sure I agree with that statement, but I'm interested in learning more about the business cases for NoSQL.
  • A couple of random book recommendations:
  • Random tools:
    • git hooks with gitty
    • git instaweb to browse git file structure and commit history without an internet connection
    • memprof - profiler to watch for object allocations in Rails
    • yardoc - documentation tool for ruby
    • ruby-processing - a data visualization tool
  • Productivity and Happiness: A common principle that comes up in Ruby/Rails conferences: A happy developer yields good productivity which leads to a happy developer. And Ruby/Rails makes developers happy, right? Well, I can't speak for anyone else, but I like Ruby.
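
On the lambda example mentioned above: I don't have the presenter's code, but a toy sketch of using a Ruby lambda to iterate PageRank over a small, made-up link graph might look something like this:

# Toy PageRank iteration using a lambda (illustration only, not the talk's code)
links = { 'a' => ['b', 'c'], 'b' => ['c'], 'c' => ['a'] }
damping = 0.85
ranks = {}
links.keys.each { |page| ranks[page] = 1.0 / links.size }

# The lambda computes one page's new rank from the current rank values.
rank_of = lambda do |page, current|
  incoming = links.select { |_, outbound| outbound.include?(page) }
  (1 - damping) / links.size +
    damping * incoming.inject(0.0) { |sum, (src, outbound)| sum + current[src] / outbound.size }
end

20.times do
  ranks = links.keys.inject({}) { |new_ranks, page| new_ranks.update(page => rank_of.call(page, ranks)) }
end

ranks.each { |page, rank| puts "#{page}: #{rank}" }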

As I said before, I hope to dig more into a couple topics and blog about them later.

NoSQL Live: The Dynamo Derivatives (Cassandra, Voldemort, Riak)

For me, one of the big parts of attending the NoSQL Live conference was to hear more about the differences between the various Dynamo-inspired open software projects. The Cassandra, Voldemort, and Riak projects were all represented, and while they differ in various ways at the logical layer (how one models data) and various features, they all share a similar conceptual foundation as outlined in Amazon's seminal Dynamo paper. So what differentiates these projects? What production deployments exist and what kind of stories do people have from running these systems in production?

Some of these questions can be at least partially answered by combing the interwebs, looking over the respective project sites, etc. Yet that's not quite the same thing as having community players in the same room, talking about stuff.

Of the three projects mentioned, Cassandra clearly has the "momentum" (a highly accurate indicator of future dominance). To me, this felt like the case even before Twitter started getting involved with it, but the Twitter effect was pretty evident based on the number of people sticking around for the Cassandra break-out session with Jonathan Ellis, compared to the break-out session given by Alex Feinberg for Voldemort (both of whom were very kind and thoughtful in answering my stream of irritating questions over lunch).

Regrettably, the break-out sessions were scheduled such that one had to choose between the Riak session and the Voldemort session; having already gone through the effort of setting up a small Riak cluster, manipulating data therein, etc., I felt there was more to be gained by attending the Voldemort session. Consequently, it's possible some of my take-aways from the conference are not entirely fair. Additionally, it seems strange to me that in the big room, Riak's representation was purely on the panel to discuss schema design in document-oriented databases; Riak had no representation on the panels related to scaling, operations, etc., despite that being a major focus of the project.

Most of what I learned had to do with nit-picky technical details, changes in upcoming versions, etc. Probably all of it was already documented. But, anyway, here are my takeaways on this topic, which may have been learned at the conference, or simply confirmed or reinforced by it. Random thoughts are mixed in.

  • The simplicity of the pure key/value store (Voldemort and Riak are more like this) brings flexibility in what you represent; having a somewhat more structured data model with which to work (as in Cassandra) can add some complexity to how you design your data, but brings improved flexibility in how you can navigate that data.
  • By digging around the web, one might get the impression that Cassandra has the broadest range of interesting deployments, Voldemort has fewer but is still interesting (LinkedIn is certainly no slouch), and Riak has nothing to point to outside Basho Technologies' non-free Enterprise variant. By attending a conference in which each project was represented, one might get exactly the same impression. Bryan Fink (for Riak) spoke of usage scenarios and was obviously informed by production experience with Riak, yet no actual use case, company, site, etc. was ever mentioned (again, the break-out session may contradict this).
  • The Voldemort and Cassandra project teams are clearly paying attention to each other's work, at least to some degree. There was even some informal discussion of the merkle tree design in Cassandra potentially making its way into Voldemort. Both Alex and Jonathan had intelligent things to say about Riak, as well, when I pestered them about it.
  • Having Ryan King from Twitter present on the "scaling with NoSQL" panel representing Cassandra was cool, and it offered confirmation that Cassandra in particular, but probably the Dynamo model as a whole, achieves its basic purpose: machines can fail but service is maintained and state is preserved; your structured storage system can scale horizontally, can scale writes, etc. Now, all that said, I wish there had been more detail available. Furthermore, Ryan King (understandably) did not seem particularly well-versed in other production deployments (like Digg's, for instance), so the "scaling with NoSQL" Cassandra representation disproportionately focused on exactly one use case.
  • A lot of good stuff is coming in Cassandra in particular. Eliminating the need for a particular row to fit in memory will make the data model more flexible, particularly in how one designs secondary indexes (in which one needs millions or potentially billions of columns, which are auto-sorted at write time by Cassandra, to effectively form an index using the column names as the indexed value and the related key as the value). The (relatively recent) support added for Hadoop map/reduce expands the use case scenarios for the database. Jonathan Ellis spoke of potentially adding native secondary index support, which would certainly be helpful.
  • We're only at the beginning, here. The share-nothing design of the Dynamo model is a great foundation on which to work. The production experience of early adopters brings valuable knowledge that is rapidly improving the various solutions (as one would expect). As patterns like the secondary index emerge, those patterns can be integrated into the main projects over time.
  • With that in mind, as higher-level abstractions build up over time, it wouldn't surprise me if the space comes to a place in which people write fairly flexible queries that describe the sets they want. In which case, the risk and uncertainty one may feel in contemplating the use of these solutions will probably go down. Additionally, the "NoSQL" name will seem even sillier than it already does.

Quick Thoughts on NoSQL Live Boston Conference

I'm back home now from the Boston "NoSQL Live" conference organized by 10gen.com (the MongoDB folks). It was a good event. A lot of stuff covered, a broad range of topics. I have a fair amount to say, but need to digest, review notes, etc. In any case, thanks to 10gen and the various sponsors that made it happen.

Some quick, random thoughts:

  • Picking a good table at lunch is key: we ended up sitting with four different presenters, including Jonathan Ellis for Cassandra and Alex Feinberg for Voldemort, which happen to be two of the systems I'm personally most interested in using at the moment.
  • There is an undeniable latest-thing-fan(boy|girl)ism aura surrounding the "NoSQL" brand/meme/whatever, but the various presenters and leading lights in various projects appear to be reasonable and fact-based; don't let the breathless silliness of fans fool you.
  • I went in feeling convinced of the desirability of non-relational datastores for specific modeling situations (graphs) and for scalability/availability/volume concerns (Dynamo and BigTable derivatives), while feeling relatively skeptical of "document datastores". I left feeling basically the same way, though decidedly less skeptical of CouchDB than I previously was.
  • There is a lot of good thinking and discussion going on in the space, it's moving very fast, and the future looks bright.

More later. Try to contain your anticipation.

PostgreSQL UTF-8 Conversion

It's becoming increasingly common for me to be involved in conversion of an old version of PostgreSQL to a new one, and at the same time, from an old "SQL_ASCII" encoding (that is, undeclared, unvalidated byte soup) to UTF-8.

Common ways to do this are to run pg_dumpall and then pipe the output through iconv or recode. When your source encoding is all pure ASCII, you don't even need to do that. When it's really all Windows-1252 (a superset of Latin-1 aka ISO-8859-1), it's easy.
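
For the all-Windows-1252 case, the pipeline is just something like:

pg_dumpall | iconv -f WINDOWS-1252 -t UTF-8 > dump-utf8.sql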

But often, the data is stored in various unknown encodings from several sources over the course of years, including some that's already in UTF-8. When you convert with iconv, it dies with an error at the first problem, whereas recode will let you ignore encoding problems, but that leaves you with junk in your output.

The case I'm often encountering is fairly easy, but not perfect: lots of ASCII, some Windows-1252, and some UTF-8. Since both pure ASCII and valid UTF-8 can be detected mechanically, I put together a short Perl script to do the detection, using the nice IsUTF8 module for its character encoding checks.
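
A minimal sketch of that approach (using Perl's core Encode module rather than IsUTF8, so it approximates rather than reproduces the original script) looks like this:

#!/usr/bin/env perl
# Detect per-line encoding: pass ASCII and UTF-8 through, convert the rest from Windows-1252.
use strict;
use warnings;
use Encode qw(decode encode);

my $test_mode = @ARGV > 0;   # any argument (such as --test) enables test mode

while (my $line = <STDIN>) {
    if ($line =~ /\A[\x00-\x7F]*\z/) {
        # Pure ASCII: pass through unchanged (swallowed in test mode)
        print $line unless $test_mode;
    }
    elsif (eval { my $copy = $line; decode('UTF-8', $copy, Encode::FB_CROAK); 1 }) {
        # Already valid UTF-8: pass through (to stderr in test mode)
        if ($test_mode) { print STDERR $line } else { print $line }
    }
    else {
        # Presumed Windows-1252: convert to UTF-8
        print encode('UTF-8', decode('Windows-1252', $line));
    }
}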

Pipe input to the script. It handles one line at a time. When run with any arguments (such as --test) it will swallow pure ASCII lines, write lines it thinks are valid UTF-8 to stderr, and will convert the remaining presumed Windows-1252 lines to stdout, for manual examination.

If its guesses look correct, run it again with no arguments, and it will write all 3 types of encoding to stdout, ready for input to psql in your new UTF-8 encoded database.

(Don't forget to munge your pg_dump file to remove any hardcoded declarations of "SQL_ASCII" encoding from CREATE DATABASE statements, or otherwise make sure your database actually is created with UTF-8 encoding!)

Spree on Heroku for Development

Yesterday, I worked through some issues to set up and run Spree on Heroku. One of End Point's clients is using Spree for a multi-store solution. We are using the recently released Spree 0.10.0.beta gem, which includes some significant Spree template and hook changes discussed here and here, in addition to other substantial updates and fixes. Our client will be using Heroku for their production server, but our first goal was to work through deployment issues to use Heroku for development.

Since Heroku includes a free offering to be used for development, it's a great option for a quick and dirty setup to run Spree non-locally. I experienced several problems and summarized them below.

Application Changes

1. After a failed attempt to set up the basic Heroku installation described here because of a RubyGems 1.3.6 requirement, I discovered the need for Heroku's bamboo deployment stack, which requires you to declare the gems required for your application. I also found the Spree Heroku extension and reviewed the code, but I wanted to take a simpler approach initially since the extension offers several features that I didn't need. After some testing, I created a .gems file in the main application directory with the contents below to specify the gems required on the Badious Bamboo Heroku stack.

rails -v 2.3.5
highline -v '1.5.1'
authlogic -v '>=2.1.2'
authlogic-oid -v '1.0.4'
activemerchant -v '1.5.1'
activerecord-tableless -v '0.1.0'
less -v '1.2.20'
stringex -v '1.0.3'
chronic -v '0.2.3'
whenever -v '0.3.7'
searchlogic -v '2.3.5'
will_paginate -v '2.3.11'
faker -v '0.3.1'
paperclip -v '>=2.3.1.1'
state_machine -v '0.8.0'

2. The next blocker I hit was that git submodules are not supported by Heroku, as mentioned here. I replaced the git submodules in our application with the Spree extension source code, as sketched below.
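
One rough way to do that (using a hypothetical vendor/extensions/faq submodule as an illustration) is to drop the gitlink and commit the extension source directly:

git rm --cached vendor/extensions/faq
git config -f .gitmodules --remove-section submodule.vendor/extensions/faq
rm -rf vendor/extensions/faq/.git
git add .gitmodules vendor/extensions/faq
git commit -m "Replace spree-faq submodule with vendored extension source"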

3. I also worked through addressing Heroku's read-only filesystem limitation. The setting perform_caching is set to true for a production environment by default in an application running from the Spree gem. In order to run the application for development purposes, perform_caching was set to false in RAILS_APP/config/environments/production.rb:

config.action_controller.perform_caching             = false

Another issue that came up due to the read-only filesystem constraint was that the Spree extensions were attempting to copy files over to the rails application public directory during the application restart, causing the application to die. To address this issue, I removed the public images and stylesheets from the extension directories and verified that these assets were included in the main application public directory.

I also removed the frozen Spree gem extension public files (javascripts, stylesheets and images) to prevent these files from being copied over during application restart. These files were moved to the main application public directory.

4. Finally, I disabled the allow_ssl_in_production preference to turn SSL off in my development application. This change was made in the extension's *_extension.rb file in the extension directory.

AppConfiguration.class_eval do
  preference :allow_ssl_in_production, :boolean, :default => false
end

Obviously, this isn't the preference setting that will be used for the production application, but it works for a quick and dirty Heroku development app. Heroku's SSL options are described here.

Deployment Tips

1. To create a Heroku application running on the Bamboo stack, I ran:

heroku create --stack bamboo-ree-1.8.7 --remote bamboo

2. Since my git repository is hosted on github, I ran the following to push the existing repository to my heroku app:

git push bamboo master

3. To run the Spree database bootstrap (or database reload), I ran the following:

heroku rake db:bootstrap AUTO_ACCEPT=1

As a side note, I ran the command heroku logs several times to review the latest application logs throughout troubleshooting.

Despite the issues noted above, the troubleshooting yielded an application that can be used for development. I also learned more about Heroku configurations that will need to be addressed when moving the project to production, such as SSL setup and multi domain configuration. We'll also need to determine the best option for serving static content, such as using Amazon's S3, which is supported by the Spree Heroku extension mentioned above.


PostgreSQL tip: arbitrary serialized rows

Sometimes when using PostgreSQL, you want to deal with a record in its serialized form. If you're dealing with a specific table, you can accomplish this using the table name itself:

psql # CREATE TABLE foo (bar text, baz int);
CREATE TABLE

psql # INSERT INTO foo VALUES ('test 1', 1), ('test 2', 2);
INSERT 0 2

psql # SELECT foo FROM foo;
     foo      
--------------
 ("test 1",1)
 ("test 2",2)
(2 rows)

This works fine for defined tables, but how do you go about this for arbitrary SELECTs? The answer is simple: wrap the query in a subselect and alias it, like so:

psql # SELECT q FROM (SELECT 1, 2) q;
   q   
-------
 (1,2)
(1 row)

Riak Install on Debian Lenny

I'm doing some comparative analysis of various distributed non-relational databases and consequently wrestled with the installation of Riak on a server running Debian Lenny.

I relied upon the standard "erlang" debian package, which installs cleanly on a basically bare system without a hitch (as one would expect). However, the latest Riak's "make" tasks fail to run; this is because the rebar script on which the make tasks rely chokes on various bad characters:

riak@nosql-01:~/riak$ make all rel
./rebar compile
./rebar:2: syntax error before: PK
./rebar:11: illegal atom
./rebar:30: illegal atom
./rebar:72: illegal atom
./rebar:76: syntax error before: ��n16
./rebar:79: syntax error before: ','
./rebar:91: illegal integer
./rebar:149: illegal atom
./rebar:160: syntax error before: Za��ze
./rebar:172: illegal atom
./rebar:176: illegal atom
escript: There were compilation errors.
make: *** [compile] Error 127

Delicious.

Eventually, I came across this article describing issues getting Riak to install on Ubuntu 9.04, and determined that the Erlang version issue mentioned there applied here as well. Following the article's instructions for building Erlang from source worked out fine, and so far I've been able to start, ping, and stop the local Riak server without incident.

Since a true investigation requires running these kinds of tools in a cluster, and that means automation of the installation/configuration is desirable, I've been scripting out the configuration steps (putting things into a configuration management tool like Puppet will come later when we're farther along and closer to picking the right solution for the problem in question). So, here's the script I've been running to build these things from my local machine (relying upon SSH); these are rough, a work in progress, and are not intended as examples of excellence, elegance, or beauty -- they simply get the job done (so far) for me and may help somebody else.

#!/bin/sh

hostname=$1
erlang_release=otp_src_R13B04
riak_release=riak-0.8.1

ssh root@$hostname "
# necessary for Erlang build
apt-get install build-essential libncurses5-dev m4
apt-get install openssl libssl-dev
# standard from-source build
mkdir erlang-build
cd erlang-build
wget http://ftp.sunet.se/pub/lang/erlang/download/$erlang_release.tar.gz
tar xzf $erlang_release.tar.gz
cd $erlang_release
./configure
make
make install
# put all of riak in a riak user
useradd -m riak
su -c 'wget http://bitbucket.org/basho/riak/downloads/$riak_release.tar.gz' - riak
su -c 'tar xzf $riak_release.tar.gz' - riak
su -c 'cd $riak_release && make all rel' - riak
su -c 'mv $riak_release/rel riak' - riak
"

(I have other scripts for preparing the box post-OS-install, but I don't think they impact this particular part of the process.)