End Point


Welcome to End Point's blog

Ongoing observations by End Point people.

Character encoding in perl: decode_utf8() vs decode('utf8')

When doing some recent encoding-based work in Perl, I found myself in a situation which seemed fairly unexplainable. I had a function which used some data which was encoded as UTF-8, ran Encode::decode_utf8() on said data to convert to Perl's internal character format, then converted the "wide" characters to the numeric entity using HTML::Entities::encode_entities_numeric(). Logging/printing of the data on input confirmed that the data was properly formatted UTF-8, as did running `iconv -f utf8 -t utf8 output.log >/dev/null` for the purposes of review.

However when I ended up processing the data, it was as if I had not run the decode function at all. In this case, the character in question was € (unicode code point U+20AC). The expected behavior from encode_entities_numeric() would be to turn any of the hi-bit characters in the perl string (i.e. all Unicode code points > 0x80) into the corresponding numeric entity (€ - € in this case). However instead of that specific character's numeric entity appearing in the output, the entities which appeared were: € i.e., the raw UTF-8 encoded value for €, with each octet being treated as an independent character instead of part of the whole encoded value.

What was particularly confusing was that extracting the relevant parts from the script in question resulted in the expected answer, so it was clearly not an issue of HTML::Entities not being able to deal with Unicode characters, as this code snippet demonstrates:

$ perl -MHTML::Entities+encode_entities_numeric -MEncode -e '$c=qq{\xE2\x82\xAC}; print encode_entities_numeric(decode_utf8($c))'
--> €

In the actual non-extracted version of the code, I was scratching my head. This was exhibiting the signs of doubly-encoded data, however I couldn't see how that could be the case. There were no PerlIO layers (e.g., :utf8 or :encoding) at play, the data I was outputting to a log file for verification purposes was being written via a brand new filehandle from a bare open(); I verified in multiple ways that the raw octets being passed in to the function were not doubly-encoded (printing the raw character points, counting lengths of the runs of octets and verifying that these matched the length of the UTF-8 encoded value for the represented characters, etc). The more things I tried the more puzzled I got. Finally, I changed the Encode::decode_utf8() call to a Encode::decode('utf8') one, providing the encoding explicitly. At this point, the processing pipeline started working as expected, and hi-bit characters were being output as their full numeric entities.

Since the documentation for decode_utf8 indicated that it should be identical to decode('utf8'), I resorted to the code to find out why it worked with the version that specified the encoding explicitly. I found that decode_utf8() does one additional thing that the regular decode('utf8') does not, and that is that before processing via the regular decode() function, decode_utf8 first checks the UTF-8 flag of the data that is being passed in, and if it is set it returns the data without further decoding*. My best guess is that this is to prevent errors if someone attempts to decode UTF-8 data in a string which is already in Perl's internal format, so in most cases this will provide a caller-friendly interface that will DWYM in many expected cases.

Armed with this knowledge, I verified that for some reason, the data that was being passed into the function had the UTF-8 flag set, so using the explicit decode('utf8') in lieu of decode_utf8() fixed the issue for me. (Tracing down the reason for the UTF-8 flag being set on this data was out of scope for this exercise, but is the true fix.) And just to verify that this was in fact the cause of the issue at hand, here's our example, modified slightly (we use the utf8::upgrade function to turn the UTF-8 flag on in the data and treat as actual encoded characters instead of raw octets):

$ perl -l -MHTML::Entities+encode_entities_numeric -MEncode -Mutf8 -e '$c=qq{\xE2\x82\xAC}; utf8::upgrade($c); print encode_entities_numeric(decode_utf8($c))'
--> €

* The UTF-8 flag is more-or-less an implementation detail of how Perl is able to deal with legacy 8-bit binary data in no particular encoding (i.e., raw octets, which it treats as latin-1) as well as the full range of Unicode data, and deal with both efficiently and in a backwards-compatible manner.

SearchToolbar and dropped Interchange sessions

A new update to Interchange's robots.cfg can be found here. This update adds "SearchToolbar" to the NotRobotUA directive which is used to exclude certain user agent strings when determining whether an incoming request is from a search engine robot or not. The SearchToolbar addon for IE and FireFox is being used more widely and we have received reports that users of this addon are unable to add items to their cart, checkout, etc. You may remember a similiar issue with the Ask.com toolbar that we discussed in this post. If you are using Interchange you should download the latest robots.cfg and restart Interchange.

Using "diff" and "git" to locate original revision/source of externally modified files

I recently ran into an issue where I had a source file of unknown version which had been substantially modified from its original form, and I wanted to find the version of the originating software that it had originally come from to compare the changes. This file could have come from any number of the 100 tagged releases in the repository, so obviously a hand-review approach was out of the question. While there were certainly clues in the source file (i.e., copyright dates to narrow down the range of commits to review) I thought up and used this technique:

Here are our considerations:

  • We know that the number of changes to the original file is likely small compared to the size of the file overall.
  • Since we're trying to uncover a likely match for the purposes of reviewing, exactness is not required; i.e., if there are lines in common with future releases, we're interested in the changes, so a revision with the fewest number of changes is preferred over finding the *exact* version of the file that this was originally based on.

The basic thought, then, is that we want to take the content of the unversioned file (i.e., the file that was changed) and find the revision of the corresponding file in the repository with the least number of changes, which we'll measure as the count of the lines in the source code diff. This struck me as similar to the copy detection that git does, insofar as it can detect content that is similar to some source content with a certain amount of tolerance for changes from the base. The difference in this case is that we're comparing content across a number of refs rather than across all of the blobs in a single ref. This recipe distilled down to the following bash command:

for ref in $(git tag);
    echo -n $ref;
    diff -w <(git show $ref:/path/to/versioned/file 2>/dev/null) modified_file | wc -l;
done | sort -k2 -n

The results of running this command is a list of the tags in the repository ordered by how similar they are to the target content (most similar first). A few comments:

  • We iterate through all tags in the project; while there could indeed be changes to the relevant file in intermediate versions, due to the way the release worked it's likely the original file was based on a released (aka tagged) version.
  • We're using diff's -w option, as the content may have changed spaces to tabs or vice versa, depending on the editor/editing habits of the original user. This helps us ensure that the changes that we're focusing on are the ones that change something substantial.
  • We're doing a numeric sort so the lines with the least number of changes show up at the top.
  • For the specific case I used this technique with, there were a number of revisions that had the least number of changed lines. Upon reviewing this smaller set of revisions (using the git diff rev1 rev2 -- path/to/content syntax), it turns out that the file in question had remained unchanged in each of these revisions, so any one of them was useful for my purposes.
  • The flexibility in the version detection works in this case because this was an isolated part of the system that did not have any changes or dependencies. If there had been important changes to the system as a whole independent of the changes to this file (but which had an affect on the operation of this specific part), we would need to have a more exact method of identifying the file.

Dissecting a Rails 3 Spree Extension Upgrade

A while back, I wrote about the release of Spree 0.30.* here and here. I didn't describe extension development in depth because I hadn't developed any Rails 3 extensions of substance for End Point's clients. Last month, I worked on an advanced reporting extension for a client running on Spree 0.10.2. I spent some time upgrading this extension to be compatible with Rails 3 because I expect the client to move in the direction of Rails 3 and because I wanted the extension to be available to the community since Spree's reporting is fairly lightweight.

Just a quick rundown on what the extension does: It provides incremental reports such as revenue, units sold, profit (calculated by sales minus cost) in daily, weekly, monthly, quarterly, and yearly increments. It reports Geodata to show revenue, units sold, and profit by [US] states and countries. There are also two special reports that show top products and customers. The extension allows administrators to limit results by order date, "store" (for Spree's multi-site architecture), product, and taxon. Finally, the extension provides the ability to export data in PDF or CSV format using the Ruport gem. One thing to note is that this extensions does not include new models – this is significant only because Rails 3 introduced significant changes to ActiveRecord, which are not described in this article.

Screenshots of the Spree Advanced Reporting extension.

To deconstruct the upgrade, I examined a git diff of the master and rails3 branch. I've divided the topics into Rails 3 and Spree specific categories.

Rails 3 Specific


In my report extension, I utilize Ruport's to_html method, which returns a ruport table object to an HTML table. With the upgrade to Rails 3, ruports to_html was spitting out escaped HTML with the addition of default XSS protection. The change is described here. and required the addition of a new helper method (raw) to yield unescaped HTML:

diff --git a/app/views/admin/reports/top_base.html.erb b/app/views/admin/reports/top_base.html.erb
index 6cc6b70..92f2118 100644
--- a/app/views/admin/reports/top_base.html.erb
+++ b/app/views/admin/reports/top_base.html.erb
@@ -1,4 +1,4 @@
-<%= @report.ruportdata.to_html %>
+<%= raw @report.ruportdata.to_html %>

Common Deprecation Messages

While troubleshooting the upgrade, I came across the following warning:

DEPRECATION WARNING: Using #request_uri is deprecated. Use fullpath instead. (called from ...)

I made several changes to address the deprecation warnings, did a full round of testing, and moved on.

diff --git a/app/views/admin/reports/_advanced_report_criteria.html.erb b/app/views/admin/reports/_advanced_report_criteria.html.erb
index ba69a2e..6d9c3f9 100644
--- a/app/views/admin/reports/_advanced_report_criteria.html.erb
+++ b/app/views/admin/reports/_advanced_report_criteria.html.erb
@@ -1,11 +1,11 @@
 <% @reports.each do |key, value| %>
-  <option <%= request.request_uri == "/admin/reports/#{key}" ? 'selected="selected" ' : '' %>
  value="<%= send("#{key}_admin_reports_url".to_sym) %>">
+  <option <%= request.fullpath == "/admin/reports/#{key}" ? 'selected="selected" ' : '' %>
  value="<%= send("admin_reports_#{key}_url".to_sym) %>">
     <%= t(value[:name].downcase.gsub(" ","_")) %>

Integrating Gems

An exciting change in Rails 3 is the advancement of Rails::Engines to allow easier inclusion of mini-applications inside the main application. In an ecommerce platform, it makes sense to break up the system components into Rails Engines. Extensions become gems in Spree and gems can be released through rubygems.org. A gemspec is required in order for my extension to be treated as a gem by my main application, shown below. Componentizing elements of a larger platform into gems may become popular with the advancement of Rails::Engines / Railties.

diff --git a/advanced_reporting.gemspec b/advanced_reporting.gemspec
new file mode 100644
index 0000000..71f00a8
--- /dev/null
+++ b/advanced_reporting.gemspec
@@ -0,0 +1,22 @@
+Gem::Specification.new do |s|
+  s.platform    = Gem::Platform::RUBY
+  s.name        = 'advanced_reporting'
+  s.version     = '2.0.0'
+  s.summary     = 'Advanced Reporting for Spree'
+  s.homepage    = 'http://www.endpoint.com'
+  s.author = "Steph Skardal"
+  s.email = "steph@endpoint.com"
+  s.required_ruby_version = '>= 1.8.7'
+  s.files        = Dir['CHANGELOG', 'README.md', 'LICENSE', 'lib/**/*', 'app/**/*']
+  s.require_path = 'lib'
+  s.requirements << 'none'
+  s.has_rdoc = true
+  s.add_dependency('spree_core', '>= 0.30.1')
+  s.add_dependency('ruport')
+  s.add_dependency('ruport-util') #, :lib => 'ruport/util')

Routing Updates

With the release of Rails 3, there was a major rewrite of the router and integration of rack-mount. The Rails 3 release notes on Action Dispatch provide a good starting point of resources. In the case of my extension, I rewrote the contents of config/routes.rb:

map.namespace :admin do |admin|
  admin.resources :reports, :collection => {
    :sales_total => :get,
    :revenue   => :get,
    :units   => :get,
    :profit   => :get,
    :count   => :get,
    :top_products  => :get,
    :top_customers  => :get,
    :geo_revenue  => :get,
    :geo_units  => :get,
    :geo_profit  => :get,
  map.admin "/admin",
    :controller => 'admin/advanced_report_overview',
    :action => 'index'
Rails.application.routes.draw do
  #namespace :admin do
  #  resources :reports, :only => [:index, :show] do
  #    collection do
  #      get :sales_total
  #    end
  #  end
  match '/admin/reports/revenue' => 'admin/reports#revenue', :via => [:get, :post]
  match '/admin/reports/count' => 'admin/reports#count', :via => [:get, :post]
  match '/admin/reports/units' => 'admin/reports#units', :via => [:get, :post]
  match '/admin/reports/profit' => 'admin/reports#profit', :via => [:get, :post]
  match '/admin/reports/top_customers' => 'admin/reports#top_customers', :via => [:get, :post]
  match '/admin/reports/top_products' => 'admin/reports#top_products', :via => [:get, :post]
  match '/admin/reports/geo_revenue' => 'admin/reports#geo_revenue', :via => [:get, :post]
  match '/admin/reports/geo_units' => 'admin/reports#geo_units', :via => [:get, :post]
  match '/admin/reports/geo_profit' => 'admin/reports#geo_profit', :via => [:get, :post]
  match "/admin" => "admin/advanced_report_overview#index", :as => :admin 

Spree Specific

Transition extension to Engine

The biggest transition to Rails 3 based Spree requires extensions to transition to Rails Engines. In Spree 0.11.*, the extension class inherits from Spree::Extension and the path of activation for extensions in Spree 0.11.* starts in initializer.rb where the ExtensionLoader is called to load and activate all extensions. In Spree 0.30.*, extensions inherit from Rails:Engine which is a subclass of Rails::Railtie. Making an extension a Rails::Engine allows it to hook into all parts of the Rails initialization process and interact with the application object. A Rails engine allows you run a mini application inside the main application, which is at the core of what a Spree extension is – a self-contained Rails application that is included in the main ecommerce application to introduce new features or override core behavior.

See the diffs between versions here:

diff --git a/advanced_reporting_extension.rb b/advanced_reporting_extension.rb
deleted file mode 100644
index f75d967..0000000
--- a/advanced_reporting_extension.rb
+++ /dev/null
@@ -1,46 +0,0 @@
-# Uncomment this if you reference any of your controllers in activate
-# require_dependency 'application'
-class AdvancedReportingExtension < Spree::Extension
-  version "1.0"
-  description "Advanced Reporting"
-  url "http://www.endpoint.com/"
-  def self.require_gems(config)
-    config.gem "ruport"
-    config.gem "ruport-util", :lib => 'ruport/util'
-  end
-  def activate
-    Admin::ReportsController.send(:include, AdvancedReporting::ReportsController)
-    Admin::ReportsController::AVAILABLE_REPORTS.merge(AdvancedReporting::ReportsController::ADVANCED_REPORTS)
-    Ruport::Formatter::HTML.class_eval do
-      # Override some Ruport functionality
-    end
-  end
diff --git a/lib/advanced_reporting.rb b/lib/advanced_reporting.rb
new file mode 100644
index 0000000..4e6fee6
--- /dev/null
+++ b/lib/advanced_reporting.rb
@@ -0,0 +1,50 @@
+require 'spree_core'
+require 'advanced_reporting_hooks'
+require "ruport"
+require "ruport/util"
+module AdvancedReporting
+  class Engine < Rails::Engine
+    config.autoload_paths += %W(#{config.root}/lib)
+    def self.activate
+      #Dir.glob(File.join(File.dirname(__FILE__), "../app/**/*_decorator*.rb")) do |c|
+      #  Rails.env.production? ? require(c) : load(c)
+      #end
+      Admin::ReportsController.send(:include, Admin::ReportsControllerDecorator)
+      Admin::ReportsController::AVAILABLE_REPORTS.merge(Admin::ReportsControllerDecorator::ADVANCED_REPORTS)
+      Ruport::Formatter::HTML.class_eval do
+        # Override some Ruport functionality
+      end
+    end
+    config.to_prepare &method(:activate).to_proc
+  end

Required Rake Tasks

Rails Engines in Rails 3.1 will allow migrations and public assets to be accessed from engine subdirectories, but a work-around is required to access migrations and assets in the main application directory in the meantime. There are a few options for accessing Engine migrations and assets; Spree recommends a couple rake tasks to copy assets to the application root, shown here:

diff --git a/lib/tasks/install.rake b/lib/tasks/install.rake
new file mode 100644
index 0000000..c878a04
--- /dev/null
+++ b/lib/tasks/install.rake
@@ -0,0 +1,26 @@
+namespace :advanced_reporting do
+  desc "Copies all migrations and assets (NOTE: This will be obsolete with Rails 3.1)"
+  task :install do
+    Rake::Task['advanced_reporting:install:migrations'].invoke
+    Rake::Task['advanced_reporting:install:assets'].invoke
+  end
+  namespace :install do
+    desc "Copies all migrations (NOTE: This will be obsolete with Rails 3.1)"
+    task :migrations do
+      source = File.join(File.dirname(__FILE__), '..', '..', 'db')
+      destination = File.join(Rails.root, 'db')
+      puts "INFO: Mirroring assets from #{source} to #{destination}"
+      Spree::FileUtilz.mirror_files(source, destination)
+    end
+    desc "Copies all assets (NOTE: This will be obsolete with Rails 3.1)"
+    task :assets do
+      source = File.join(File.dirname(__FILE__), '..', '..', 'public')
+      destination = File.join(Rails.root, 'public')
+      puts "INFO: Mirroring assets from #{source} to #{destination}"
+      Spree::FileUtilz.mirror_files(source, destination)
+    end
+  end

Relocation of hooks file

A minor change with the extension upgrade is a relocation of the hooks file. Spree hooks allow you to interact with core Spree views, described more in depth here and here.

diff --git a/advanced_reporting_hooks.rb b/advanced_reporting_hooks.rb
deleted file mode 100644
index fcb5ab5..0000000
--- a/advanced_reporting_hooks.rb
+++ /dev/null
@@ -1,43 +0,0 @@
-class AdvancedReportingHooks < Spree::ThemeSupport::HookListener
-  # custom hooks go here

diff --git a/lib/advanced_reporting_hooks.rb b/lib/advanced_reporting_hooks.rb
new file mode 100644
index 0000000..cca155e
--- /dev/null
+++ b/lib/advanced_reporting_hooks.rb
@@ -0,0 +1,3 @@
+class AdvancedReportingHooks < Spree::ThemeSupport::HookListener
+  # custom hooks go here

Adoption of "Decorator" naming convention

A common behavior in Spree extensions is to override or extend core controllers and models. With the upgrade, Spree adopts the "decorator" naming convention:

Dir.glob(File.join(File.dirname(__FILE__), "../app/**/*_decorator*.rb")) do |c|
  Rails.env.production? ? require(c) : load(c)

I prefer to extend the controllers and models with module includes, but the decorator convention also works nicely.

Gem Dependency Updates

With the transition to Rails 3, I found that there were changes related to dependency upgrades. Spree 0.11.* uses searchlogic 2.3.5, and Spree 0.30.1 uses searchlogic 3.0.0.*. Searchlogic is the gem that performs the search for orders in my report to pull orders between a certain time frame or tied to a specific store. I didn't go digging around in the searchlogic upgrade changes, but I referenced Spree's core implementation of searchlogic to determine the required updates:

diff --git a/lib/advanced_report.rb b/lib/advanced_report.rb
@@ -13,11 +15,26 @@ class AdvancedReport
     self.params = params
     self.data = {}
     self.ruportdata = {}
+    params[:search] ||= {}
+    if params[:search][:created_at_greater_than].blank?
+      params[:search][:created_at_greater_than] =
+        Order.first(:order => :completed_at).completed_at.to_date.beginning_of_day
+    else
+      params[:search][:created_at_greater_than] =
+        Time.zone.parse(params[:search][:created_at_greater_than]).beginning_of_day rescue ""
+    end
+    if params[:search][:created_at_less_than].blank?
+      params[:search][:created_at_less_than] =
+        Order.last(:order => :completed_at).completed_at.to_date.end_of_day
+    else
+      params[:search][:created_at_less_than] =
+        Time.zone.parse(params[:search][:created_at_less_than]).end_of_day rescue ""
+    end
+    params[:search][:completed_at_not_null] ||= "1"
+    if params[:search].delete(:completed_at_not_null) == "1"
+      params[:search][:completed_at_not_null] = true
+    end
     search = Order.searchlogic(params[:search])
-    search.checkout_complete = true
-    self.orders = search.find(:all)
+    self.orders = search.do_search 
     self.product_in_taxon = true
     if params[:advanced_reporting]

Rakefile diffs

Finally, there are substantial changes to the Rakefile, which are related to rake tasks and testing framework. These didn't impact my development directly. Perhaps when I get into more significant testing on another extension, I'll dig deeper into the code changes here.

diff --git a/Rakefile b/Rakefile
index f279cc8..f9e6a0e 100644
# lots of stuff

For those interested in learning more about the upgrade process, I recommend the reviewing the Rails 3 Release Notes in addition to reading up on Rails Engines as they are an important part of Spree's core and extension architecture. The advanced reporting extension described in this article is available here.