LinuxFest Northwest: PostgreSQL 9.0 upcoming features

Once again, LinuxFest Northwest provided a full track of PostgreSQL talks during their two-day conference in Bellingham, WA.

Gabrielle Roth and I presented our favorite features in 9.0, including a live demo of Hot Standby with streaming replication, and walked through a number of the other new features.

The full feature list is available on the developer site right now!

Viewing Postgres function progress from the outside

Getting visibility into what your PostgreSQL function is doing can be a difficult task. While you can sprinkle notices inside your code, for example with the RAISE feature of plpgsql, that only shows the notices to the session that is currently running the function. Let's look at a solution to peek inside a long-running function from any session.

While there are a few ways to do this, one of the most elegant is to use Postgres sequences, which have the unique property of living "outside" the normal MVCC visibility rules. We'll abuse this feature to allow the function to update its status as it goes along.
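As a quick illustration of that property (a minimal sketch; demo_seq is just a throwaway name), a sequence advance is visible to other sessions immediately, even if the transaction that caused it is rolled back:

-- Session one: advance a sequence inside a transaction, then roll back
CREATE SEQUENCE demo_seq;
BEGIN;
SELECT nextval('demo_seq');   -- returns 1
SELECT nextval('demo_seq');   -- returns 2
ROLLBACK;

-- Session two: the advances survived the rollback
SELECT last_value FROM demo_seq;   -- shows 2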

First, let's create a simple example function that simulates doing a lot of work, and taking a long time to do so. The function doesn't really do anything, of course, so we'll throw some random sleeps in to emulate the effects of running on a busy production machine. Here's what the first version looks like:

DROP FUNCTION IF EXISTS slowfunc();

CREATE FUNCTION slowfunc()
RETURNS TEXT
VOLATILE
SECURITY DEFINER
LANGUAGE plpgsql
AS $BC$
DECLARE
  x INT = 1;
  mynumber INT;
BEGIN
  RAISE NOTICE 'Start of function';

  WHILE x <= 5 LOOP
    -- Random number from 1 to 10
    SELECT 1+(random()*9)::int INTO mynumber;
    RAISE NOTICE 'Start expensive step %: time to run=%', x, mynumber;
    PERFORM pg_sleep(mynumber);
    x = x + 1;
  END LOOP;

  RETURN 'End of function';
END
$BC$;

Pretty straightforward function: we simply emulate doing five expensive steps, and output a small notice as we go along. Running it gives this output (with pauses from 1-10 seconds of course):

$ psql -f slowfunc.sql
DROP FUNCTION
CREATE FUNCTION
psql:slowfunc.sql:30: NOTICE:  Start of function
psql:slowfunc.sql:30: NOTICE:  Start expensive step 1: time to run=2
psql:slowfunc.sql:30: NOTICE:  Start expensive step 2: time to run=7
psql:slowfunc.sql:30: NOTICE:  Start expensive step 3: time to run=3
psql:slowfunc.sql:30: NOTICE:  Start expensive step 4: time to run=8
psql:slowfunc.sql:30: NOTICE:  Start expensive step 5: time to run=5
    slowfunc     
-----------------
 End of function

To grant some visibility to other processes about where we are, we're going to change a sequence from within the function itself. First we need to decide on what sequence to use. While we could pick a common name, this won't allow us to run the function in more than one process at a time. Therefore, we'll create unique sequences based on the PID of the process running the function. Doing so is fairly trivial for an application: just create that sequence before the expensive function is called. For this example, we'll use some psql tricks to achieve the same effect like so:

\t
\o tmp.drop.sql
SELECT 'DROP SEQUENCE IF EXISTS slowfuncseq_' || pg_backend_pid() || ';';
\o tmp.create.sql
SELECT 'CREATE SEQUENCE slowfuncseq_' || pg_backend_pid() || ';';
\o
\t
\i tmp.drop.sql
\i tmp.create.sql

From the top, this script turns off everything but tuples (so we have a clean output), then arranges for all output to go to the file named "tmp.drop.sql". Then we build a sequence name by concatenating the string 'slowfuncseq_' with the current PID. We put that into a DROP SEQUENCE statement. Then we redirect the output to a new file named "tmp.create.sql" (this closes the old one as well). We do the same thing for CREATE SEQUENCE. Finally, we stop sending things to the file, turn off "tuples only" mode, and import the two files we just created, first to drop the sequence if it exists, and then to create it. The files will look something like this:

$ more tmp.*.sql
::::::::::::::
tmp.drop.sql
::::::::::::::
DROP SEQUENCE IF EXISTS slowfuncseq_8762;

::::::::::::::
tmp.create.sql
::::::::::::::
CREATE SEQUENCE slowfuncseq_8762;

The only thing left is to add the calls to the sequence from within the function itself. Remember that the sequence called must exist, or the function will throw an exception, so make sure you create the sequence before the function is called! (Alternatively, you could use the same named sequence every time, but as explained before, you lose the ability to track more than one iteration of the function at a time.)

DROP FUNCTION IF EXISTS slowfunc();

CREATE FUNCTION slowfunc()
RETURNS TEXT
VOLATILE
SECURITY DEFINER
LANGUAGE plpgsql
AS $BC$
DECLARE
  x INT = 1;
  mynumber INT;
  seqname TEXT;
BEGIN
  SELECT INTO seqname 'slowfuncseq_' || pg_backend_pid();
  PERFORM nextval(seqname);

  RAISE NOTICE 'Start of function';

  WHILE x <= 5 LOOP
    -- Random number from 1 to 10
    SELECT 1+(random()*9)::int INTO mynumber;
    RAISE NOTICE 'Start expensive step %: time to run=%', x, mynumber;
    PERFORM pg_sleep(mynumber);
    PERFORM nextval(seqname);
    x = x + 1;
  END LOOP;

  RETURN 'End of function';
END
$BC$;

Again, it's important that the steps are: create the sequence, run the function, and then drop the sequence. While access to a sequence's value lives outside of MVCC, the creation of the sequence itself does not. Here's what the whole thing looks like in psql:

\t
\o tmp.drop.sql
SELECT 'DROP SEQUENCE IF EXISTS slowfuncseq_' || pg_backend_pid() || ';';
\o tmp.create.sql
SELECT 'CREATE SEQUENCE slowfuncseq_' || pg_backend_pid() || ';';
\o
\t
\i tmp.drop.sql
\i tmp.create.sql
SELECT slowfunc();
\i tmp.drop.sql

Now you can see how far along the function is from any other process. For example, if we kick off the script above, then go into psql from another window, we can use the process id from the pg_stat_activity view to see how far along our function is:

$ select procpid, current_query from pg_stat_activity;
 procpid |                    current_query                     
---------+------------------------------------------------------
   10206 | SELECT slowfunc();
   10313 | select procpid, current_query from pg_stat_activity;

$ select last_value from slowfuncseq_10206;
 last_value 
------------
          3

You can assign your own values and meanings to the numbers, of course: this one simply tells us that the script is on the third iteration of our sleep loop. You could use multiple sequences to convey even more information.
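For example, here is a small sketch of that idea: since setval() changes are likewise immediate and never rolled back, the function could record an explicit step number rather than simply counting nextval() calls:

  -- inside the loop, in place of PERFORM nextval(seqname):
  PERFORM setval(seqname, x);   -- any session reading last_value now sees the step number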

There are other ways besides sequences to achieve this trick: one that I've used before is to have a plperlu function open a new connection to the existing database and update a text column in a simple tracking table. Another idea is to update a small semaphore table within the function and then check the modification time of the table's underlying file in your data directory.
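For the curious, here is a rough, untested sketch of that first alternative. It assumes plperlu is available, that the DBI and DBD::Pg Perl modules are installed, and that a simple tracking table such as func_progress(pid int, status text) already exists; the table, database, and function names are all made up for illustration:

CREATE OR REPLACE FUNCTION update_progress(TEXT)
RETURNS VOID
LANGUAGE plperlu
AS $BC$
  use DBI;
  my $status = shift;
  ## A brand-new connection commits on its own, outside the calling
  ## transaction, so other sessions can see the status immediately
  my $dbh = DBI->connect('dbi:Pg:dbname=mydatabase', '', '', { RaiseError => 1 });
  $dbh->do('DELETE FROM func_progress WHERE pid = ?', undef, $$);
  $dbh->do('INSERT INTO func_progress (pid, status) VALUES (?,?)', undef, $$, $status);
  $dbh->disconnect;
$BC$;

The slow function would then simply PERFORM update_progress('step 3 of 5'); at each stage, and any other session could select from func_progress to see how far along it is.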

Spree and Authorize.Net: Authorization and Capture Quick Tip

Last week I did a bit of reverse engineering on payment configuration in Spree. After I successfully set up Spree to use Authorize.Net for a client, the client was unsure how to change the Authorize.Net settings to perform an authorize and capture of the credit card instead of an authorize only.


The requested settings for an Authorize.Net payment gateway on the Spree backend.

I researched the Spree documentation for a bit and then sent out an email to the End Point team. Mark Johnson responded to my question on authorize versus authorize and capture, suggesting that the Authorize.Net request type be changed from "AUTH_ONLY" to "AUTH_CAPTURE". So, my first stop was a grep of the activemerchant gem, which is responsible for handling the payment transactions in Spree. I found the following code in the gem source:

# Performs an authorization, which reserves the funds on the customer's credit card, but does not
# charge the card.
def authorize(money, creditcard, options = {})
  post = {}
  add_invoice(post, options)
  add_creditcard(post, creditcard)
  add_address(post, options)
  add_customer_data(post, options)
  add_duplicate_window(post)

  commit('AUTH_ONLY', money, post)
end

# Perform a purchase, which is essentially an authorization and capture in a single operation.
def purchase(money, creditcard, options = {})
  post = {}
  add_invoice(post, options)
  add_creditcard(post, creditcard)
  add_address(post, options)
  add_customer_data(post, options)
  add_duplicate_window(post)

  commit('AUTH_CAPTURE', money, post)
end

My next stop was the Spree payment_gateway core extension. This extension is included as part of the Spree core. It acts as a layer between Spree and the payment gateway gem, and can be swapped out for a different payment gateway gem without requiring changes to the transaction logic in the Spree core. I searched for purchase and authorize in this extension and found the following:

def purchase(amount, payment)
  #combined Authorize and Capture that gets processed by the ActiveMerchant gateway as one single transaction.
  response = payment_gateway.purchase((amount * 100).round, self, gateway_options(payment))
  ...
end
def authorize(amount, payment)
  # ActiveMerchant is configured to use cents so we need to multiply order total by 100
  response = payment_gateway.authorize((amount * 100).round, self, gateway_options(payment))
  ...
end

My last stop was where I found the configuration setting I was looking for, Spree::Config[:auto_capture], by searching for authorize and purchase in the Spree application code. I found the following logic in the Spree credit card model:

def process!(payment)
  begin
    if Spree::Config[:auto_capture]
      purchase(payment.amount.to_f, payment)
      payment.finalize!
    else
      authorize(payment.amount.to_f, payment)
    end
  end
end

The auto_capture setting defaults to false, not surprisingly, so it can be updated with one of the following changes.

# *_extension.rb:
def activate
  AppConfiguration.class_eval do
    preference :auto_capture, :boolean, :default => true
  end
end

# EXTENSION_DIR/config/initializers/*.rb:
if Preference.table_exists?
  Spree::Config.set(:auto_capture => true)
end

After I found what I was looking for, I googled "Spree auto_capture", found a few references to it, and saw that it was briefly mentioned in the payment section of the Spree documentation. Perhaps more documentation could be added on how the Spree auto_capture preference setting trickles down through the payment gateway processing logic, or perhaps this article provides a nice overview of the payment processing layers in Spree.


Make git grep recurse into submodules

If you've done any major work with projects that use submodules, you may have been surprised to find that `git grep` fails to return matches that live in a submodule. If you go into the specific submodule directory and run the same `git grep` command, you will see the results, so what can be done about that?

Fortunately, `git submodule` has a subcommand which lets us execute arbitrary commands in all submodule repos, intuitively named `git submodule foreach`.

My first attempt at a command to search in all submodules was:

$ git submodule foreach git grep {pattern}

This worked fine, except when {pattern} was multiple words or otherwise needed shell escaping. My next attempt was:

$ git submodule foreach git grep "{pattern}"

This properly passed the escaping through to the shell (ending up with "'multi word phrase'" in my case); however, an additional problem surfaced: `git grep` returns a non-zero exit status when a submodule contains no matches, which aborts the foreach loop. This was solved via:

$ git submodule foreach "git grep {pattern}; true"

A more refined version could be created as a git alias, automatically escape its arguments, and union with the results of `git grep`, thus providing the submodule-aware `git grep` I'd been hoping existed already. I leave this as an exercise to the reader... :-)

It's also worth noting that the file paths reported are relative to the containing submodule, so you would need to incorporate the `git submodule foreach`-supplied $path variable to pinpoint the full paths of the files in question.
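As a rough sketch of both of those ideas (not a polished alias, and the quoting is the fiddly part), something like this searches the superproject and then each submodule, prefixing submodule hits with the $path variable:

$ git grep "multi word phrase"
$ git submodule --quiet foreach 'git grep "multi word phrase" | sed "s|^|$path/|"; true'

The trailing true again keeps a submodule with no matches from aborting the loop.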

jQuery UI Sortable Tips

I was recently tasked with developing a sorting tool to allow Paper Source to manage the sort order in which their categories are displayed. They had been updating a sort column in the database directly, but wanted a more visual way to do so. Because of the well-received upsell feature developed by Steph, they decided to adapt that interface to manage the categories. See here for the post on using jQuery UI Drag Drop.

The only backend requirement was that the same sort column continue to drive the order. The front end needed to support dragging and dropping positions within the same container. The upsell feature provided a great starting point for the development. After a quick review, I determined that the jQuery UI Sortable function was the better fit for this application.

Visual feedback was used to display the sorting in action with:

// on page load
$('tr.the_items td').sortable({
    opacity: 0.7,
    helper: 'clone'
});
// end on page load

Secondly, I reiterate: "jQuery UI Event Functionality = Cool"

I only needed one event handler in this application to rearrange the sorting values once the thumbnail had been dropped. This code calls a do_drop function that loops through all of the hidden inputs on the page and updates the sorting order.

// on page load
$('tr.the_items td').sortable({
    stop: function(event, ui) { do_drop(this); }
});
// end on page load
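The do_drop helper itself isn't shown in this post; a minimal sketch of what it might look like (the input.sort_order selector is hypothetical) simply renumbers the hidden inputs in their new DOM order:

function do_drop(container) {
    // walk the cells in their new order and renumber the hidden sort fields
    $(container).find('input.sort_order').each(function(index) {
        $(this).val(index + 1);
    });
}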

Validating the sorting fields was a little different from the previously developed feature in that the number of available items could change depending on the category. The number of items could easily be 3 or 30, so I needed a quick way to check for duplicates among an ever-changing number of fields. I decided to use a nested loop built with jQuery's each function.

$('input.new_sku').each(
    function( intIndex, obj ) {
        $('input.new_sku').each(
            function( secIndex, secObj ) {
                if( (intIndex != secIndex) && ($(obj).val() == $(secObj).val()) ) {
                    error = true;
                }
            });
    }
);

The rest of the feature uses some of the same logic previously documented here.

All in all I learned that the jQuery UI is very versatile and a pleasure to work with. I hope to be using more of its features in the near future.

PostgreSQL at LinuxFest Northwest

This is my third year driving up to Bellingham for LinuxFest Northwest, and I'm excited to be presenting two talks about PostgreSQL there. Adrian Klaver is one of the organizers of the conference, and has always been a huge supporter of PostgreSQL. He has gone out of his way to have a track of content about our favorite database.

I'll be presenting an introduction to Bucardo and co-hosting a talk about new features in version 9.0 of PostgreSQL with Gabrielle Roth.

Talking about Bucardo and replication is always a blast. The last time I gave this talk it was to a packed house in Seattle, so I'm hoping for another lively discussion about the state of replication in PostgreSQL.

Using charge tag in Interchange's profiles, and trickiness with logic and tag interpolation order

One of the standard ways of charging credit cards in older versions of the Interchange demo was to do the charging from a profile using the &charge command. New versions of the demo store do the charging from log_transaction once the order profiles have finished, so it is not an issue there. I've come across quite a few catalogs where the &charge command is replaced with the [charge] tag wrapped in if-then-else blocks in an order profile. It had been so long since I had used &charge that I needed to look up how options are passed to it, which may be why people tend to use the tag version instead of the &charge command. The problem here is that Interchange tags interpolate before any of the profile specifications execute, so if you have a [charge] tag in an order profile, it executes before any of the other checks, such as validation of fields.

Here's a stripped down example of where a profile will have tags executed before the other profile checks:

lname=required Last name required
fname=required First name required
&fatal=yes
&credit_card=standard keep

[charge route="[var MV_PAYMENT_MODE]" amount="[scratch some_total_calculation]"]

&final=yes

In this situation, even if lname, fname, or the credit card number is invalid, the charge will execute before any of those checks occur, calling your payment gateway with invalid parameters. This could even cause a weird state where the credit card was charged, but the order was not placed because, for example, the last name check fails after the charge succeeds.

The way around this is either to move the credit card charging out of the order profile into log_transaction or use the &charge command like so:

&charge=[var MV_PAYMENT_MODE] amount=[scratch some_total_calculation]

Another situation where you should be careful is with if-then-else blocks. If you need to perform profile checks that depend upon the results of other calls in the profile, you will need to create a custom order check to do that processing; otherwise, sections of your if-then-else may execute when they are not intended to.

Restoring individual table data from a Postgres dump

Recently, one of our clients needed to restore the data in a specific table from the previous night's PostgreSQL dump file. Basically, there was an UPDATE query that did not do what it was supposed to, and some of the columns in the table were irreversibly changed. So, the challenge was to quickly restore the contents of that table.

The SQL dump file was generated by the pg_dumpall command, and thus there was no easy way to extract individual tables. If you are using the pg_dump command, you can specify a "custom" dump format by adding the -Fc option. Then, pulling out the data from a single table becomes as simple as adding a few flags to the pg_restore command like so:

$ pg_restore --data-only --table=alpha large.custom.dumpfile.pg > alpha.data.pg

One of the drawbacks of using the custom format is that it is only available on a per-database basis; you cannot use it with pg_dumpall. That was the case here, so we needed to extract the data of that one table from within the large dump file. If you know me well, you might suspect at this point that I've written yet another handy perl script to tackle the problem. As tempting as that may have been, time was of the essence, and the wonderful array of Unix command line tools already provided me with everything I needed.

Our goal at this point was to pull the data from a single table ("alpha") from a very large dump file ("large.dumpfile.pg") into a separate and smaller file that we could use to import directly into the database.

The first step was to find exactly where in the file the data was. We knew the name of the table, and we also knew that a dump file inserts data by using the COPY command, so there should be a line like this in the dump file:

COPY alpha (a,b,c,d) FROM stdin;

Because all the COPYs are done together, we can be pretty sure that the command after "COPY alpha" is another COPY. So the first thing to try is:

$ grep -n COPY large.dumpfile.pg | grep -A1 'COPY alpha '

This uses grep's handy -n option (aka --line-number) to output the line number that each match appears on. Then we pipe that back to grep, search for our table name, and print the line after it with the -A option (aka --after-context). The output looked like this:

$ grep -n COPY large.dumpfile.pg | grep -A1 'COPY alpha '
1233889:COPY alpha (cdate, who, state, add, remove) FROM stdin;
12182851:COPY alpha_sequence (sname, value) FROM stdin;

Note that many of the options here are GNU specific. If you are using an operating system that doesn't support the common GNU tools, you are going to have a much harder time doing this (and many other shell tasks)!

We now have a pretty good guess at the starting and ending lines for our data: 1233889 to 12182850 (we subtract one from the second number, as we don't want the next COPY). We can now use head and tail to extract the lines we want, once we figure out how many lines our data spans:

$ echo 12182851 - 1233889 | bc
10948962
$ head -12182850 large.dumpfile.pg | tail -10948962 > alpha.data.pg

However, what if the next command was not a COPY? We'll have to scan forward for the end of the COPY section, which is always a backslash and a single dot at the start of a new line. The new command becomes (all one line, but broken down for readability):

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | grep -n -m1 '^\\\.'

That's a lot, but in the spirit of Unix tools doing one thing and one thing well, it's easy to break down. First, we grab the line numbers where COPY occurs in our file, then we find the first occurrence of our table (using the -m aka --max-count option). We cut out the first field from that output, using a colon as the delimiter. This gives us the line number where the COPY begins. We pass this to xargs, and tail the file with a --lines=+x argument, which outputs all lines from that file *starting* at the given line number. Finally, we pipe that output to grep and look for the end-of-copy indicator, stopping at the first one, and also outputting the line number. Here's what we get:

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | grep -n -m1 '^\\\.'

148956:\.
xargs: tail: terminated by signal 13

This tells us that the "\." terminator appears on line 148956 of that output, counting the COPY line itself as line one. (The complaint from xargs can be ignored.) Now we can create our data file:

$ grep -n COPY large.dumpfile.pg \
    | grep -m1 'COPY alpha' \
    | cut -d: -f1 \
    | xargs -Ix tail --lines=+x large.dumpfile.pg \
    | head -148956 > alpha.data.pg

Now that the file is there, we should do a quick sanity check on it. If the file is small enough, we can simply open it in our favorite editor or run it through less or more. We can also take advantage of the fact that a Postgres dump file separates columns with a tab character when using the COPY command. So we can view all lines that don't contain a tab, and make sure there is nothing there except comments and the COPY and \. lines:

$ grep -v -P '\t' alpha.data.pg

The grep option -P (aka --perl-regexp) instructs grep to interpret the argument ("backslash t" in this case) as a Perl regular expression. You could also simply input a literal tab there: on most systems this can be done with the <ctrl-v><TAB> key combination.

It's time to replace that bad data. We'll need to truncate the existing table, then COPY our data back in. To do this, we'll create a file that we'll feed to psql -X -f. Here's the top of the file:

$ cat > alpha.restore.pg

\set ON_ERROR_STOP on
\timing

\c mydatabase someuser

BEGIN;

CREATE SCHEMA backup;

CREATE TABLE backup.alpha AS SELECT * FROM public.alpha;

TRUNCATE TABLE alpha;

From the top: we tell psql to stop right away if it encounters any problems, and then turn on the timing of all queries. We explicitly connect to the correct database as the correct user. Putting it here in the script is a safety feature. Then we start a new transaction, create a backup schema, and make a copy of the existing data into a backup table before truncating the original table. The next step is to add in the data, then wrap things up:

$ cat alpha.data.pg >> alpha.restore.pg

Now we run it and check for any errors. We use the -X argument to ensure control of exactly which psql options are in effect, bypassing any psqlrc files that may be in use.

$ psql -X -f alpha.restore.pg

If everything looks good, the final step is to add a COMMIT and run the file again:

$ echo "COMMIT;" >> alpha.restore.pg
$ psql -X -f alpha.restore.pg

And we are done! All of this is a little simplified, as in real life there was actually more than one table to be restored, and each had some foreign key dependencies that had to be worked around, but the basic idea remains the same. (And yes, I know you can do the extraction in a Perl one-liner.)

Authlogic and RESTful Authentication Encryption

I recently did a bit of digging around for the migration of user data from RESTful Authentication to Authlogic in Rails. My task was to implement the changes required to move the application and its data from RESTful Authentication to Authlogic user authentication.

I was given a subset of the database dump for new and old users in addition to sample user login data for testing. I didn't necessarily want to use the application to test login functionality, so I examined the repositories here and here and came up with the two blocks of code shown below to replicate and verify encryption methods and data for both plugins.

RESTful Authentication

user = User.find_by_email('test@endpoint.com')

key = REST_AUTH_SITE_KEY 
actual_password = "password"
digest = key

REST_AUTH_DIGEST_STRETCHES.times { digest = Digest::SHA1.hexdigest([digest, user.salt, actual_password, key].join('--')) }

# compare digest and user.crypted_password here to verify password, REST_AUTH_SITE_KEY, and REST_AUTH_DIGEST_STRETCHES

Note that the stretches value for RESTful authentication defaults to 10, but it can be adjusted. If no REST_AUTH_SITE_KEY is provided, the value defaults to an empty string. Also note that RESTful authentication uses the SHA-1 hash function by default.

Authlogic

user = User.find_by_email('test2@endpoint.com')

actual_password = "password"
digest = "#{actual_password}#{user.salt}"

20.times { digest = Digest::SHA512.hexdigest(digest) }

# compare digest and user.crypted_password here to verify password

Note that the stretches value for Authlogic defaults to 20, but it can be adjusted. Also note that Authlogic uses the SHA-512 hash function by default.

After I verified the encryption of both old user passwords encrypted with RESTful Authentication and new user passwords encrypted with Authlogic, I added the verified REST_AUTH_SITE_KEY and REST_AUTH_DIGEST_STRETCHES values to RAILS_ROOT/config/initializers/site_keys.rb and confirmed that the changes described in the tutorial here were in place. The Spree User model already contains the changes shown below, which are discussed in the tutorial. As users log in to the application, user authentication is performed against the RESTful Authentication crypted password. After a successful login, the password is re-encrypted by Authlogic.

# app/models/user.rb
class User < ActiveRecord::Base
  acts_as_authentic do |c|
    c.act_like_restful_authentication = true
  end
end

Prior to this task, I hadn't poked around the user authentication code in Rails or Spree. Hopefully, this experience will prepare me for the next time I encounter user migrations with encrypted passwords.


A decade of change in our work

Lately I've been looking back a bit at how things have changed in our work during the past decade. Maybe I'm a little behind the times since this is a little like a new year's reflection.

Since 2000, in our world of open source/free software ecommerce and other Internet application development, many things have stayed the same or become more standard.

The Internet is a completely normal part of people's lives now, in the way TV and phones long have been.

Open source and free software is a widely-used and accepted part of the software ecosystem, used on both servers and desktops, at home and in companies of all sizes. Software licensing differences are still somewhat arcane, but the more popular options are now fairly widely-known.

Many of today's major open source software systems key to Internet infrastructure and application development were already familiar in 2000:

  • the GNU toolset, Linux (including Red Hat and Debian), FreeBSD, OpenBSD
  • Apache, mod_ssl (the RSA patent expired in September 2000, making it freely usable)
  • PostgreSQL, MySQL, *DBM
  • Perl and CPAN, Python, PHP, Ruby (though little known then)
  • Interchange (just changed from MiniVend)
  • JavaScript
  • Sendmail, Postfix
  • OpenSSH (brand new!), GnuPG, Nmap, Nagios (originally NetSaint), BIND, rsync
  • zip, gzip, bzip2
  • screen, Vim, Emacs, IRC, Pine (and now Alpine), mutt
  • X.org (from XFree86), KDE, GNOME
  • proprietary Netscape became open-source Mozilla became very popular Firefox
  • OpenOffice.org, the GIMP, Ghostscript, etc.
  • and many others

Also worth mentioning: Java was (eventually) released as open source.

All of these projects have improved dramatically over the past decade and they've become a core part of many people's lives, even though invisibly in many cases. Time invested in learning, developing software based on, and contributing to these projects has been well-rewarded.

However, my hand-picked list of open source "survivor" projects reveals just as much by what's not listed there.

Take version control systems: A decade ago, Subversion was in its infancy, its developers aiming to improve on the de facto standard CVS. Now Subversion is a legacy system, still used but overshadowed by a new generation of version control systems that have distributed functionality at their core. Git, Mercurial, and Bazaar are the norm in free software and much business development now.

And countless smaller open source libraries, frameworks, and applications have come and gone.

Now, in contrast to the things that have stayed much the same, consider some of the upheavals that we've adapted to, or are still adapting to:

  • migration from 32-bit to 64-bit: hardware, operating systems, libraries, applications, and binary data formats
  • from a huge number of character sets to Unicode
  • and from a variety of mostly fixed 8- and 16-bit character set encodings to Internet standard variable-length UTF-8
  • reignited browser competition and finally somewhat standard and usable CSS and JavaScript
  • a proliferation of new frameworks in every language, including JavaScript's Prototype.js, script.aculo.us, MooTools, Rico, Dojo, YUI, Ext JS, Google Web Toolkit, jQuery
  • widespread adoption of object/relational mappers to interact with databases instead of hand-written SQL queries
  • necessity of email spam filtering for survival, and coping with spam blacklists interfering with legitimate business
  • search engine optimization (SEO) and adapting to Google's search dominance
  • leased dedicated server hosting price wars: where all servers once cost at least $500/month, price competition brought us < $100/month options
  • server virtualization and cloud computing
  • configuration management: from obscure cfengine to widely-used Puppet and Chef
  • Microsoft's introduction of XMLHttpRequest and the transformation of dynamic HTML into Ajax
  • multi-core CPUs and languages to take advantage (Erlang, Scala, etc.)
  • and so on.

The first two of these changes, moving to 64-bit platforms and UTF-8 encoding, were mostly quiet behind-the-scenes migration work on the server. I think they're two of the more important changes, though of course the transition is still underway.

Thanks to UTF-8, new projects typically give little thought to character set encoding concerns, which once were a major problem as soon as any non-ASCII or non-Latin-1 text came into play. Just mixing, say, Japanese, Chinese, and Hungarian in a single data set was a real challenge. But anyone asked to display all three in a single browser screen knows how valuable UTF-8 is as a standard.

Moving to 64-bit architecture removed the ~3 GB memory barrier from applications and operating systems. It gave us lots of room to grow, and take advantage of cheaper memory, with the next limit not really in sight.

Other changes in the larger Internet ecosystem are more visible and have been more widely discussed, but still bear mentioning:

  • online advertising as a major industry
  • affiliate marketing
  • social networks
  • wikis as a standard fixture in society, from the overwhelmingly popular Wikipedia, to private wikis used in business and community projects
  • prepackaged free software-style licenses have spread to other areas, for example via the Creative Commons
  • free-of-charge email and other services
  • mobile computing, from laptops to netbooks to phones
  • rebirth of the Walkman as a little computer, and MP3 and other digital audio purchases now eclipse CDs
  • DVRs, Hulu, HD TV
  • wifi Internet widely available
  • broadband Internet access in many homes

I don't mean to say that the preceding decades were any less noteworthy. Those working with the Internet should expect a lot of change, as it's still a young industry.

It's interesting to look back and consider the journey, and a little daunting to realize how much work has gone into adapting to each wave of change, and how much work remains to upgrade, migrate, and adapt the large amount of legacy code and infrastructure we've created. That, in addition to working on the next improvements we see a need for. Lots to learn, and lots to do!

Tip: Find all non-UTF-8 files

Here's an easy way to find all non-UTF-8 files for later perusal:

find . -type f | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || echo {}" > utf8_fail

I've needed this before when converting projects over to UTF-8; obviously certain files are going to be binary and will show up in this list, so some manual vetting will be needed before converting all of your files over to UTF-8.
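One possible refinement (an untested sketch) is to let grep weed out the obviously binary files before they ever reach iconv:

find . -type f -exec grep -Il . {} \; | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || echo {}" > utf8_fail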

Modifying Models in Rails Migrations

One problem that has haunted me in the past was making modifications to the model in migrations, specifically things like removing or changing associations. By the time the migration runs, the file for the model class will already have been updated, so it is hard to use it in the migration itself, even though that would be useful.

In this case I found myself with an even slightly trickier example. I have a model that contains some address info. Part of that is an association to an external table that lists the states. So part of the class definition was like so:

class Contact < ActiveRecord::Base
  belongs_to :state
  # ...
end

What I needed to do in the migration was to remove the association and introduce another field called "state" which would just be a varchar field representing the state part of the address. The two problems the migration would encounter are:

  1. the state association would not exist at the time it ran
  2. and even if it did, there would be a name conflict between it and the new column I wanted

To get around these restrictions, I did this in my migration:

Contact.class_eval { belongs_to :orig_state, :class_name => "State", :foreign_key => "state_id" }

This loads the association code into the context of the Contact class and gives me a new handle to work with, so I am free to create a "state" column without worrying about the names colliding.

Another problem I had was that the table had about 300 rows of data that failed a validation called "validate_names". I didn't feel like sorting it out, so I just added the following code to the above class_eval block:

define_method(:validate_names) { true }

This overrides the method for the validation in the class while the migration is running. I can sort out the invalid data later.

Git Submodules: What is the Ideal Workflow?

Last week, I asked some coworkers at End Point about the normal workflow for using git submodules. Brian responded, and the discussion turned into an overview of git submodules. I reorganized the content into a FAQ format:

How do you get started with git submodules?

You should use git submodule add to add a new submodule. So for example you would issue the commands:

git submodule add git://github.com/stephskardal/extension1.git extension
git submodule init

Then you would git add extension (the path of the submodule installation) and git commit.
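Concretely (the commit message here is only illustrative):

git add extension
git commit -m "Add extension1 submodule"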

What does the initial setup of a submodule look like?

The super project repo stores a .gitmodules file. A sample:

[submodule "extension1"]
        path = extension
        url = git://github.com/stephskardal/extension1.git
[submodule "extension2"]
        path = extension_two
        url = git://github.com/stephskardal/extension2.git

When you have submodules in a project, do you have to separately clone them from the master project, or does the initial checkout take care of that recursively for you?

Generally, you will issue the commands below when you clone a super project repository. These commands will "install" the submodule under the main repository.

git submodule init
git submodule update

How do you update a git submodule repository?

Given an existing git project in the "project" directory, and a git submodule extension1 in the extension directory:

First, a status check on the main project:

~/project> git status
# On branch master
nothing to commit (working directory clean)

Next, a status check on the git submodule:

~/project> cd extension/
~/project/extension> git status
# Not currently on any branch.
nothing to commit (working directory clean)

Next, an update of the extension:

~/project/extension> git fetch
remote: Counting objects: 30, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 19 (delta 9), reused 0 (delta 0)
Unpacking objects: 100% (19/19), done.
From git://github.com/stephskardal/extension1
   0f0b76b..9cbb6bd  master     -> origin/master

~/project/extension> git checkout master
Previous HEAD position was 0f0b76b... Added before_filter to base controller.
Switched to branch "master"
Your branch is behind 'origin/master' by 5 commits, and can be fast-forwarded.

~/project/extension> git merge origin/master
Updating f95a2d5..9cbb6bd
Fast forward
 extension.rb |   10 +
 README       |   36 +
 TODO         |   11 +-
...

~/project/extension> git status
# On branch master
nothing to commit (working directory clean)

Next, back to the main project:

~/project/extension> cd ..
~/project> git status
# On branch master
# Changed but not updated:
#   (use "git add ..." to update what will be committed)
#   (use "git checkout -- ..." to discard changes in working directory)
#
#       modified:   extension
#
no changes added to commit (use "git add" and/or "git commit -a")

Now, a commit to include the submodule repository change. Brian has made it a convention to manually include SUBMODULE UPDATE: extension_name in the commit message to inform other developers that a submodule update is required.

~/project> git add extension
~/project> git commit
[master eba52d5] SUBMODULE UPDATE: extension
 1 files changed, 1 insertions(+), 1 deletions(-)

What does git store internally to track the submodule? The HEAD position? That would seem to be the minimal information needed to tie the specific submodule-tracked version with the version used in the superproject.

It stores a specific commit SHA1, so even if the submodule's HEAD moves, the super project's "reference" doesn't. That is why updating to the upstream version must be followed by a commit, so that the super project is "pinned" to the same commit across repos. You'll see in the example above that the submodule project was in a detached HEAD state (not on a branch), so tracking HEAD wouldn't really make sense.

It is critical that the super project repo store an exact position for the submodule; otherwise, you would not be able to associate your own code with a particular version of a submodule or ensure that a given submodule is at the same position across repos. For instance, if you updated to an upgraded version of a submodule and committed it without realizing that it broke your own code, you could check out a previous spot in the repository where your code worked with the submodule.
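One way to see exactly what is stored (a hedged illustration; the SHA1 shown is abbreviated and will of course differ) is to ask git what the superproject tree records for the submodule path. Note the special 160000 mode and "commit" object type:

~/project> git ls-tree --abbrev HEAD extension
160000 commit 9cbb6bd    extension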

Hopefully, this discussion on git submodules begins to show how powerful git and submodules can be for making it easy for non-core developers to start sharing their code on an open source project.

Thanks to Brian Miller and David Christensen for contributing the content for this post! I reference this article in my article on Software Development with Spree - I've found it very useful to use git submodules to install several Spree extensions on recent projects. The Spree extension community has a few valuable extensions that introduce features such as product reviews, FAQ, blog organization, static pages, and multi-domain setup.