End Point

Welcome to End Point's blog

Ongoing observations by End Point people.

Python Imports

For a Python project I'm working on, I wrote a parent class with multiple child classes, each of which made use of various modules that were imported in the parent class. A quick solution to making these modules available in the child classes would be to use wildcard imports in the child classes:

from package.parent import *

However, PEP 8 warns against wildcard imports because "they make it unclear which names are present in the namespace, confusing both readers and many automated tools."

For example, suppose we have three files:

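# a.py
import module1
class A(object):
    def __init__(self):
        pass
# b.py
from a import *  # wildcard import, as discussed above: brings in A and, silently, module1
import module2
class B(A):
    def __init__(self):
        super(B, self).__init__()
# c.py
from b import *  # brings in B, plus module1 and module2
class C(B):
    def __init__(self):
        super(C, self).__init__()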

To someone reading just b.py or c.py, it is unknown that module1 is present in the namespace of B and that both module1 and module2 are present in the namespace of C. So, following PEP8, I just explicitly imported any module needed in each child class. Because in my case there were many imports and because it seemed repetitive to have all those imports duplicated in each of the many child classes, I wanted to find out if there was a better solution. While I still don't know if there is, I did go down the road of how imports work in Python, at least for 3.4.1, and will share my notes with you.
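To make the namespace problem concrete, here is a minimal sketch that builds a throwaway parent module in memory and then runs a wildcard import against it. The module name parent_demo is hypothetical and exists only for this illustration.

```python
import sys
import types

# Build a throwaway "parent" module in memory (hypothetical name: parent_demo).
parent = types.ModuleType("parent_demo")
exec("import re\n\nclass A(object):\n    pass\n", parent.__dict__)
sys.modules["parent_demo"] = parent

# Simulate a child file whose only import is a wildcard import.
child_namespace = {}
exec("from parent_demo import *", child_namespace)

# Public names that silently arrived in the child's namespace.
leaked = sorted(name for name in child_namespace if not name.startswith("_"))
print(leaked)  # ['A', 're']
```

Nothing in the child's one line of source mentions re, yet it is there, which is exactly what PEP 8 is warning about.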

Python allows you to import modules using the import statement, the built-in function __import__(), and the function importlib.import_module(). The differences between these are:

The import statement first "searches for the named module, then it binds the results of that search to a name in the local scope" (Python documentation). Example:

Python 3.4.1 (default, Jul 15 2014, 13:05:56) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re
<module 're' from '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py'>
>>> re.sub('s', '', 'bananas')
'banana'

Here the import statement searches for a module named re then binds the result to the variable named re. You can then call re module functions with re.function_name().

A call to function __import__() performs the module search but not the binding; that is left to you. Example:

>>> muh_regex = __import__('re')
>>> muh_regex
<module 're' from '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py'>
>>> muh_regex.sub('s', '', 'bananas')
'banana'

Your third option is to use importlib.import_module() which, like __import__(), only performs the search:

>>> import importlib
>>> muh_regex = importlib.import_module('re')
>>> muh_regex
<module 're' from '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py'>
>>> muh_regex.sub('s', '', 'bananas')
'banana'
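One practical difference between the two functions shows up with dotted names. The sketch below uses json.decoder from the standard library purely as a convenient example of a submodule.

```python
import importlib

# __import__ with a dotted name returns the top-level package...
top = __import__("json.decoder")
print(top.__name__)  # json

# ...while importlib.import_module returns the submodule that was named.
sub = importlib.import_module("json.decoder")
print(sub.__name__)  # json.decoder
```

This is one reason importlib.import_module() is usually the more convenient choice when importing by name.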

Let's now talk about how Python searches for modules. The first place it looks is in sys.modules, which is a dictionary that caches previously imported modules:

>>> import sys
>>> 're' in sys.modules
False
>>> import re
>>> 're' in sys.modules
True
>>> sys.modules['re']
<module 're' from '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py'>

If the module is not found in sys.modules Python searches sys.meta_path, which is a list that contains finder objects. Finders, along with loaders, are objects in Python's import protocol. The job of a finder is to return a module spec, using method find_spec(), containing the module's import-related information which the loader then uses to load the actual module. Let's see what I have in my sys.meta_path:

>>> sys.meta_path
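[<class '_frozen_importlib.BuiltinImporter'>, <class '_frozen_importlib.FrozenImporter'>, <class '_frozen_importlib.PathFinder'>]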

Python will use each finder object in sys.meta_path until the module is found and will raise an ImportError if it is not found. Let's call find_spec() with parameter 're' on each of these finder objects:

>>> sys.meta_path[0].find_spec('re')
>>> sys.meta_path[1].find_spec('re')
>>> sys.meta_path[2].find_spec('re')
ModuleSpec(name='re', loader=<_frozen_importlib.SourceFileLoader object at 0x7ff7eb314438>, origin='/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py')

The first finder knows how to find built-in modules and since re is not a built-in module, it returns None.

>>> 're' in sys.builtin_module_names
False

The second finder knows how to find frozen modules, which re is not. The third knows how to find modules from a list of path entries called an import path. For re the import path is sys.path but for subpackages the import path can be the parent's __path__ attribute.

>>> sys.path
['', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/site-packages/distribute-0.6.49-py3.4.egg', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python34.zip', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/plat-linux', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/lib-dynload', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/site-packages', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/site-packages/setuptools-0.6c11-py3.4.egg-info']
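The search described above can be approximated by hand. This is only a rough sketch of the meta path walk for top-level module names, not the real import machinery: it skips the sys.modules cache check, and find_spec_by_hand is a hypothetical helper name.

```python
import sys

def find_spec_by_hand(name):
    """Return (finder, spec) from the first meta path finder that knows `name`."""
    for finder in sys.meta_path:
        find_spec = getattr(finder, "find_spec", None)
        if find_spec is None:
            continue  # finder does not implement find_spec; skip it
        # path=None means "this is a top-level import", so sys.path is used.
        spec = find_spec(name, None)
        if spec is not None:
            return finder, spec
    raise ImportError("No module named {!r}".format(name))

finder, spec = find_spec_by_hand("json")
print(finder, spec.origin)
```

On a stock interpreter the finder that answers for json should be the path-based one, while a built-in module such as sys is answered by the first finder.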

Once the module spec is found, the loading machinery takes over. That's as far as I dug but you can read more about the loading process by reading the documentation.

Python Subprocess Wrapping with sh

When working with shell scripts written in bash, csh, and the like, one of the primary tools you rely on is piping output and input between the subprocesses the script calls, which lets you build complex logic to accomplish the script's goal. The same approach of calling subprocesses and redirecting their input and output is available in Python, but the overhead of using it is cumbersome enough to make Python a less desirable scripting language. In effect you end up implementing large parts of the I/O facilities yourself, and potentially even writing replacements for existing shell utilities that would do the same work. Recently, Python developers attempted to solve this problem by updating an existing Python subprocess wrapper library called pbs into an easier-to-use library called sh.

Sh can be installed using pip, and the author has posted some documentation for the library here: http://amoffat.github.io/sh/

Using the sh library

After installing the library into your version of Python, there are two ways to call any shell command available on the system. First, you can import the command as though it were itself a Python library:

from sh import hostname
print(hostname())

Alternatively, you can call the command directly by referencing the sh namespace before the command name:

import sh
print(sh.hostname())

Running this on my Linux workstation (hostname atlas) returns the expected result:

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sh
>>> print(sh.hostname())
atlas

However, at this point we are merely replacing a single shell command that prints output to the screen. The real benefit of shell scripts is that you can chain commands together to create complex logic that helps you do work.

Advanced Gymnastics

A common use of shell scripts is to let administrators quickly filter log file output, search those logs for specific conditions, and raise an alert when an application starts throwing errors. With piping in sh we can create a simple log watcher in Python that calls anything we like when the log file contains one of the conditions we are looking for.

To pipe together commands using the sh library, you would encapsulate each command in series to create a similar syntax to bash piping:

>>> print(sh.wc(sh.ls("-l", "/etc"), "-l"))
199

This command is equivalent to the bash pipe "ls -l /etc | wc -l", and shows that the long listing of /etc on my workstation contained 199 lines of output. Each command in the pipe is nested inside the parentheses of the command that consumes its output.
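For comparison, here is roughly what the same two-command pipe looks like using only the standard library's subprocess module. This is a sketch of the usual Popen chaining pattern, and run_pipeline is a hypothetical helper name, not part of sh.

```python
import subprocess

def run_pipeline(first, second):
    """Run `first | second` and return the second command's stdout as text."""
    producer = subprocess.Popen(first, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(second, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    # Close our copy so the producer sees SIGPIPE if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.decode()
```

Calling run_pipeline(["ls", "-l", "/etc"], ["wc", "-l"]) mirrors the sh one-liner above, and the amount of plumbing needed for a single pipe shows why a wrapper library is attractive.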

For our log listener we will use the tail command along with a Python iterator to watch for a potential error condition, which I will represent with the string "ERROR":

>>> for line in sh.tail("-f", "/tmp/test_log", _iter=True):
...     if "ERROR" in line:
...         print(line)

In this example, once executed, Python will call the tail command to follow a particular log file. It will iterate over each line of output produced by tail, and if any line contains the string we are watching for, Python will print that line to standard output. At this point, this is similar to using the tail command and piping the output to a string search command like grep. However, you could replace the third line of the Python code with a more complex action, such as emailing the error condition to a developer or administrator for review, or perhaps initiating a procedure to recover from the error automatically.
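One way to make the action pluggable is to separate the matching loop from whatever should happen on a match. The sketch below is an illustration only: watch is a hypothetical helper, and a plain list stands in for the live tail output so the example runs without a log file.

```python
def watch(lines, needle, on_match):
    """Call on_match(line) for every line that contains needle."""
    for line in lines:
        if needle in line:
            on_match(line)

# Stand-in for the live tail output; with sh this could be
# sh.tail("-f", "/tmp/test_log", _iter=True) instead.
sample = ["INFO start\n", "ERROR disk full\n", "INFO done\n"]
alerts = []
watch(sample, "ERROR", alerts.append)
print(alerts)  # ['ERROR disk full\n']
```

Swapping alerts.append for a function that sends an email or kicks off a recovery procedure changes the behavior without touching the loop.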

Conclusions


In this manner, with just a few lines of Python, much like with bash, one can create a relatively complex process without recreating the shell commands that already perform this work, and without building a convoluted wrapper to pass output from command to command. Combining the existing shell commands with the power of Python gives you all the functions available to any Python environment, along with the ease of using shell commands to do some of the work. In the future I will definitely be using this library for my own shell scripting needs. I have generally preferred the syntax and ease of use of Python over that of bash, and now I will be able to enjoy both at the same time.

PostgreSQL conflict handling with Bucardo and multiple data sources


Image by Flickr user Rebecca Siegel (cropped)

Bucardo's much publicized ability to handle multiple data sources often raises questions about conflict resolution. People wonder, for example, what happens when a row in one source database gets updated one way, and the same row in another source database gets updated a different way? This article will explain some of the solutions Bucardo uses to solve conflicts. The recently released Bucardo 5.1.1 has some new features for conflict handling, so make sure you use at least that version.

Bucardo does multi-source replication, meaning that users can write to more than one source at the same time. (This is also called multi-master replication, but "source" is a much more accurate description than "master".) Bucardo uses primary keys to identify rows. If the same row has changed on more than one source since the last Bucardo run, a conflict has arisen and Bucardo must be told how to handle it. In other words, Bucardo must decide which row is the "winner" and thus gets replicated to all the other databases.

For this demo, we will again use an Amazon AWS instance. See the earlier post about Bucardo 5 for directions on installing Bucardo itself. Once it is installed (after the './bucardo install' step), we can create some test databases for our conflict testing. Recall that we have a handy database named "shake1". As this name can get a bit long for some of the examples below, let's make a few database copies with shorter names. We will also teach Bucardo about the databases, and create a sync named "ctest" to replicate between them all:

createdb aa -T shake1
createdb bb -T shake1
createdb cc -T shake1
bucardo add db A,B,C dbname=aa,bb,cc
## autokick=0 means new data won't replicate right away; useful for conflict testing!
bucardo add sync ctest dbs=A:source,B:source,C:source tables=all autokick=0
bucardo start

Bucardo has three general ways to handle conflicts: built-in strategies, a list of databases, or custom conflict handlers. The primary strategy, and also the default one for all syncs, is known as bucardo_latest. When this strategy is invoked, Bucardo scans all copies of the conflicted table across all source databases, and then orders the databases according to when they were last changed. This generates a list of databases, for example "B C A". For each conflicting row, the database most recently updated, of all the ones involved in the conflict for that row, is the winner. The other built-in strategy is called "bucardo_latest_all_tables", which scans all the tables in the sync across all source databases to find a winner.

There may be other built in strategies added as experience/demand dictates, but it is hard to develop generic solutions to the complex problem of conflicts, so non built-in strategies are preferred. Before getting into those other solutions, let's see the default strategy (bucardo_latest) in action:

## This is the default, but it never hurts to be explicit:
bucardo update sync ctest conflict=bucardo_latest
Set conflict strategy to 'bucardo_latest'
psql aa -c "update work set totalwords=11 where title~'Juliet'"; \
psql bb -c "update work set totalwords=21 where title~'Juliet'"; \
psql cc -c "update work set totalwords=31 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 0
Kick ctest: [1 s] DONE!
## Because cc was the last to be changed, it wins:
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   31
bb   |   31
cc   |   31

Under the hood, Bucardo actually applies the list of winning databases to each conflicting row, such that the example above of "B C A" means that database B wins in a conflict in which a row was updated by B and C, or B and A, or B and C and A. However, if B did not change the row, and the conflict is only between C and A, then C will win.
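The per-row rule just described can be modeled in a few lines. This is a Python illustration of the described behavior only, not Bucardo's actual code (Bucardo is written in Perl), and pick_winner is a hypothetical name.

```python
def pick_winner(priority, involved):
    """Return the first database in `priority` that changed the row."""
    for db in priority:
        if db in involved:
            return db
    return None  # no listed database took part in this conflict

order = ["B", "C", "A"]
print(pick_winner(order, {"A", "B", "C"}))  # B
print(pick_winner(order, {"A", "C"}))       # C
```

The second call shows the case from the text: B did not touch the row, so the next database in the list that did, C, wins.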

As an alternative to the built-ins, you can set conflict_strategy to a list of the databases in the sync, ordered from highest priority to lowest, for example "C B A". The list does not have to include all the databases, but it is a good idea to do so. Let's see it in action. We will change the conflict_strategy for our test sync and then reload the sync to have it take effect:


bucardo update sync ctest conflict='B A C'
Set conflict strategy to 'B A C'
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=12 where title~'Juliet'"; \
psql bb -c "update work set totalwords=22 where title~'Juliet'"; \
psql cc -c "update work set totalwords=32 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 10
Kick ctest: [1 s] DONE!
## This time bb wins, because B comes before A and C
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   22
bb   |   22
cc   |   22

The final strategy for handling conflicts is to write your own code. Many will argue this is the best approach. It is certainly the only one that will allow you to embed your business logic into the conflict handling.

Bucardo allows loading of snippets of Perl code known as "customcodes". These codes take effect at specified times, such as after triggers are disabled, or when a sync has failed because of an exception. The specific time we want is called "conflict", and it is an argument to the "whenrun" attribute of the customcode. A customcode needs a name, the whenrun argument, and a file to read in for its content. They can also be associated with one or more syncs or tables.

Once a conflict customcode is in place and a conflict is encountered, the code will be invoked, and it will in turn pass information back to Bucardo telling it how to handle the conflict.

The code should expect a single argument, a hashref containing information about the current sync. This hashref tells the current table, and gives a list of all conflicted rows. The code can tell Bucardo which database to consider the winner for each conflicted row, or it can simply declare a winning database for all rows, or even for all tables. It can even modify the data in any of the tables itself. What it cannot do (thanks to the magic of DBIx::Safe) is commit, rollback, or do other dangerous actions since we are in the middle of an important transaction.

It's probably best to show by example at this point. Here is a file called ctest1.pl that asks Bucardo to skip to the next applicable customcode if the conflict is in the table "chapter". Otherwise, it tells Bucardo to have database "C" win all conflicts for this table, falling back to database "B" and then "A".

## ctest1.pl - a sample conflict handler for Bucardo
use strict;
use warnings;

my $info = shift;
## If this table is named 'chapter', do nothing
if ($info->{tablename} eq 'chapter') {
    $info->{skip} = 1;
}
else {
    ## Winning databases, in order
    $info->{tablewinner} = 'C B A';
}
return;

Let's add in this customcode, and associate it with our sync. Then we will reload the sync and cause a conflict.

bucardo add customcode ctest \
  whenrun=conflict src_code=ctest1.pl sync=ctest
Added customcode "ctest"
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=13 where title~'Juliet'"; \
psql bb -c "update work set totalwords=23 where title~'Juliet'"; \
psql cc -c "update work set totalwords=33 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 0
Kick ctest: [1 s] DONE!
## This time cc wins, because we set all rows to 'C B A'
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   33
bb   |   33
cc   |   33

We used the 'skip' hash value to tell Bucardo to not do anything if the table is named "chapter". In real life, we would have another customcode that will handle the skipped table, else any conflict in it will cause the sync to stop. Any number of customcodes can be attached to syncs or tables.

The database preference will last for the remainder of this sync's run, so any other conflicts in other tables will not even bother to invoke the code. You can use the hash key "tablewinneralways" to make this decision sticky, in that it will apply for all future runs by this sync (its KID technically) - which effectively means the decision stays until Bucardo restarts.

One of the important structures sent to the code is a hash named "conflicts", which contains all the changed primary keys, and, for each one, a list of which databases were involved in the sync. A Data::Dumper peek at it would look like this:

$VAR1 = {
  'romeojuliet' => {
    'C' => 1,
    'A' => 1,
    'B' => 1,
  }
};

The job of the conflict handling code (unless using one of the "winner" hash keys) is to change each of those conflicted rows from a hash of involved databases into a string describing the preferred order of databases. The Data::Dumper output would thus look like this:

$VAR1 = {
  'romeojuliet' => 'B'
};

The code snippet would look like this:

## ctest2.pl - a simple conflict handler for Bucardo.
use strict;
use warnings;

my $info = shift;
for my $row (keys %{ $info->{conflicts} }) {
  ## Equivalent to 'A C B'
  $info->{conflicts}{$row} = exists $info->{conflicts}{$row}{A} ? 'A' : 'C';
}

## We don't want any other customcodes to fire: we have handled this!
$info->{lastcode} = 1;
return;

Let's see that code in action. Assuming the above "bucardo add customcode" command was run, we will need to load an updated version, and then reload the sync. We create some conflicts, and check on the results:


bucardo update customcode ctest src_code=ctest2.pl
Changed customcode "ctest" src_code with content of file "ctest2.pl"
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=14 where title~'Juliet'"; \
psql bb -c "update work set totalwords=24 where title~'Juliet'"; \
psql cc -c "update work set totalwords=34 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 10
Kick ctest: [2 s] DONE!
## This time aa wins, because we set all rows to 'A C B'
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   14
bb   |   14
cc   |   14

That was an obviously oversimplified example, as we picked 'A' for no discernible reason! These conflict handlers can be quite complex, and are only limited by your imagination - and your business logic. As a final example, let's have the code examine some other things in the database, as well as jump out of the database itself(!) to determine the resolution to the conflict:

## ctest3.pl - a somewhat silly conflict handler for Bucardo.
use strict;
use warnings;
use LWP;

my $info = shift;

## What is the weather in Walla Walla, Washington?
## If it's really hot, we cannot trust server A
my $max_temp = 100;
my $weather_url = 'http://wxdata.weather.com/wxdata/weather/rss/local/USWA0476?cm_ven=LWO&cm_cat=rss';
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => $weather_url);
my $response = $ua->request($req)->content();
my $temp = ($response =~ /(\d+) \°/) ? $1 : 75;
## Store in our shared hash so we don't have to look it up every run
## Ideally we'd add something so we only call it if the temp has not been checked in last hour
$info->{shared}{wallawallatemp} = $temp;

## We want to count the number of sessions on each source database
my $SQL = 'SELECT count(*) FROM pg_stat_activity';
for my $db (sort keys %{ $info->{dbinfo} }) {
    ## Only source databases can have conflicting rows
    next if ! $info->{dbinfo}{$db}{issource};
    ## The safe database handles are stored in $info->{dbh}
    my $dbh = $info->{dbh}{$db};
    my $sth = $dbh->prepare($SQL);
    $sth->execute();
    $info->{shared}{dbcount}{$db} = $sth->fetchall_arrayref()->[0][0];
}

for my $row (keys %{ $info->{conflicts} }) {
    ## If the temp is too high, remove server A from consideration!
    if ($info->{shared}{wallawallatemp} > $max_temp) {
        delete $info->{conflicts}{$row}{A}; ## May not exist, but we delete anyway
    }

    ## Now we can sort by number of connections and let the least busy db win
    (my $winner) = sort {
        $info->{shared}{dbcount}{$a} <=> $info->{shared}{dbcount}{$b}
        or
        ## Fallback to reverse alphabetical if the session counts are the same
        $b cmp $a
    } keys %{ $info->{conflicts}{$row} };

    $info->{conflicts}{$row} = $winner;
}

## We don't want any other customcodes to fire: we have handled this!
$info->{lastcode} = 1;
return;

We'll forgo the demo: suffice it to say that B always won in my tests, as Walla Walla never got over 97, and all my test databases had the same number of connections. Note some of the other items in the $info hash: "shared" allows arbitrary data to be stored across invocations of the code. The "lastcode" key tells Bucardo not to fire any more customcodes. While this example is very impractical, it does demonstrate the power available to you when solving conflicts.

Hopefully this article answers many of the questions about conflict handling with Bucardo. Suggestions for new default handlers and examples of real-world conflict handlers are particularly welcome, as well as any other questions or comments. You can find the mailing list at bucardo-general@bucardo.org, and subscribe by visiting the bucardo-general Info Page.

Creating a Symbol Web Font

Creating a custom font that includes only a few characters can be very useful. I was looking for a good way to display left and right arrows for navigating between clients and between team members on our site. After some research, a custom font seemed like a good approach: it would be small and would support all kinds of screens and browsers. So, here I'll show how to create a web font with a few custom characters in it that you can use on your website.

You'll need to get the free, open source vector graphics editor Inkscape and familiarize yourself with its drawing tools.

To start, open Inkscape and open the SVG font editor by clicking Text -> SVG Font Editor. Under the font column, click "New" and then name your new font.

Now you can start adding characters. Begin by adding as many glyphs as you need and choosing a letter to represent each one. Only use characters that you can find on a standard QWERTY keyboard, as FontSquirrel (which we'll use to convert this to a web font) won't work with, for instance, Unicode special characters.

Now, for each symbol, draw it using Inkscape's tools and make sure that its dimensions are roughly 750 pixels high (which will be about the height of an uppercase letter) and that it's flush with the bottom of the canvas.

When your symbol looks like you want it to, make sure that all of the shapes you used to form it are selected and merge them together with Path -> Union. When you're done, you should have a single object, your glyph. Now, select your glyph and do Path -> Object to Path.

To add this symbol to your new font, select your object and the corresponding glyph and click "Get curves from selection."

To test, enter the character you're using for your symbol in the "Preview Text" area. If it shows your symbol, you're set. Otherwise, you need to make sure that you merged and converted your object to a path correctly.

After you've repeated these steps with every symbol you need, save with Inkscape as an SVG. We need to convert this to a TrueType font, so go to www.freefontconverter.com/ (or any other font converter) and convert to .ttf.

The last thing you need to do before using your font in your webpage is convert it to a webfont. Fortunately, FontSquirrel makes this easy. Go to FontSquirrel's webfont generator and upload your TrueType font. After the conversion has finished, you'll get a zipfile with the font in several different webfont formats, and even an HTML page telling you how to use it in a webpage.

Have fun creating custom webfonts!

Runaway Rewrite Rule

I am not an expert in Apache configuration. When I have to delve into a *.conf file for more than five minutes, I come out needing an aspirin, or at least a nerve-soothing cupcake. But necessity is the mother of contention, or something like that.

My application had recently added some new URLs, which were being parsed by your typical MVC route handler (although in Perl, because that's how I roll, and not in Dancer, because … well, I don't think it had been invented yet when this application first drew breath). 99.9% of the URLs worked just fine:

/browse/:brand/:category (the pattern)
/browse/acme/widget
/browse/ben-n-jerry/ice-cream

and so on. Suddenly a report reached me that one particular brand was failing:

/browse/unseen-images/stuff

("unseen-images" has been changed to protect the innocent. The key here is the word "images"; put a pin in that and hang on.)

/browse/unseen-images

worked just fine. What's worse, instrumenting the route handler code proved that it wasn't even being called for /browse/unseen-images/foo or any of its siblings, whether :category was valid or not.

Making sure my bottle of aspirin was at hand, I dove into the Apache configuration. I added –

RewriteLog /path-to-logs/logs/rewrite_log
RewriteLogLevel 9

and while its output was fascinating, it wasn't very enlightening. However, I did stumble upon this gem:

RewriteRule  ^/.*images/.*   -       [NE,PT,L]

Aha! Oho! A runaway regular expression is our culprit. I'm pretty sure this was added innocently, hoping to catch things like

/css/images/foo.jpg
/images/foo.png

and so on, but it misfired and gathered up my application URL. I replaced this temporarily with:

RewriteRule  ^/(.+/)*images/.*   -       [NE,PT,L]

"Temporarily" because I'm still trying to find someone who knows why that particular kind of rewrite was deemed necessary, so I don't know whether my replacement rule will have the same effect in the cases where it is supposed to be doing a job.

Is there a moral to this story? I don't know just yet, but it's probably something like "Regular expressions are powerful, use them with care", or maybe "When rewrite rules are good, they are very, very good, but when they are bad they are horrid."
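The two patterns from this story are easy to compare side by side. The sketch below uses Python's re module rather than Apache itself, on the assumption that these particular patterns use only syntax that behaves the same way in both regex engines.

```python
import re

old_rule = re.compile(r"^/.*images/.*")
new_rule = re.compile(r"^/(.+/)*images/.*")

urls = [
    "/css/images/foo.jpg",
    "/images/foo.png",
    "/browse/unseen-images/stuff",
    "/browse/unseen-images",
]

# Map each URL to (matched by old rule, matched by new rule).
results = {}
for url in urls:
    results[url] = (bool(old_rule.match(url)), bool(new_rule.match(url)))
    print(url, results[url])
```

The old pattern accepts any path segment that merely ends in "images", which is how it swallowed the brand URL. The replacement requires "images" to be a whole path segment.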

Interactive Highlighting and Annotations with Annotator

Over a year ago, I wrote about JavaScript-driven interactive highlighting that emulates the behavior of physical highlighting and annotating text. It's interesting how much technology can change in a short time. This spring, I've been busy at work on a major upgrade of both Rails (2.3 to 4.1) and of the annotation capabilities for H2O. As I explained in the original post, this highlighting functionality is one of the most interesting and core features of the platform. Here I'll go through a brief history and the latest round of changes.


Example digital highlighting of sample content.

History

My original post explains the challenges associated with highlighting content on a per-word basis, as well as determining color combinations for those highlights. In the past implementations, each word was wrapped in a single DOM element and that element would have its own background color based on the highlighting (or a white background for no highlighting). In the first iteration of the project, we didn't allow for color combinations at all – instead we tracked history of highlights per word and always highlighted with the most recent color if it applied. In the second iteration of the annotation work, opaque layers of color were added under the words to simulate color combinations using absolute positioning. Cross browser support for absolute positioning is not always consistent, so this iteration had challenges.

In the third iteration that lasted for over a year, I found a great plugin (xColor) to calculate color combinations, eliminating the need for complex layering of highlights. The most recent iteration was acceptable in terms of functionality, but the major limitation we found was in performance. When every word of a piece of content has a DOM element, there are significant performance issues when content has more than 20,000 words, especially noticeable in the slower browsers.

I've had my eye out for a better way to accomplish this desired functionality, but without wrapping every word in its own DOM element, I didn't know if there was a way to accomplish annotations without the performance challenges.

Annotator Tool

Along came Annotator, an open source JavaScript plugin offering annotation functionality. A coworker first brought this plugin to my attention and I spent time evaluating it. At the time, I concluded that while the plugin looked promising, there was too much customization required to support the already existing features in H2O. And of course, the tool did not support IE8, which was a huge limitation at the time, although it becomes less of a limitation as time passes and users move away from IE8.

Time passed, and the H2O project manager also came across the same tool and brought it to my attention. I spent a bit of time developing a proof of concept to see how I might accomplish some of the desired behavior in a custom encapsulated plugin. With the success of the proof of concept, I also spent time working through the IE8 issues. Although I was able to work through many of them, I was not able to find a solution to fully support the tool in IE8. At that time, a decision was made to use Annotator and disable annotation capabilities for IE8. I moved forward on development.

How does it work?

Rather than highlighting content on a word level, Annotator determines the XPath of a section of highlighted characters. The XPath for the annotation starting point and ending point is retrieved, and one or more DOM elements wrap this content. If the annotated characters span multiple DOM elements (e.g. the annotation spans multiple paragraphs), multiple DOM elements are created for each parent element to wrap the annotated characters. Annotator handles all the management of the wrapped DOM elements for an annotation, and it provides triggers or hooks to be called tied to specific annotation events (e.g. after annotation created, before annotation deleted).
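The idea of one wrapper per spanned parent element can be modeled with a simplified sketch. This is not Annotator's code: paragraph indexes stand in for XPaths, plain strings stand in for DOM elements, and wrapper_segments is a hypothetical name.

```python
def wrapper_segments(paragraphs, start, end):
    """Yield (paragraph_index, begin, stop) for each wrapper an annotation needs.

    start and end are (paragraph_index, character_offset) pairs.
    """
    (first, first_offset), (last, last_offset) = start, end
    for index in range(first, last + 1):
        begin = first_offset if index == first else 0
        stop = last_offset if index == last else len(paragraphs[index])
        if begin < stop:  # skip paragraphs where nothing is selected
            yield index, begin, stop

paragraphs = ["First paragraph.", "Second paragraph.", "Third."]
segments = list(wrapper_segments(paragraphs, (0, 6), (1, 6)))
print(segments)  # [(0, 6, 16), (1, 0, 6)]
```

An annotation spanning two paragraphs needs only two wrappers however many words it covers, which is where the performance gain over per-word elements comes from.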

This solution has much better performance than the aforementioned techniques, and there's a growing community of open source developers involved in it, who have helped improve functionality and contribute additional features.

Annotator Customizations

Annotator includes a nice plugin architecture designed to allow custom functionality to be built on top of it. Below are customized features I've added to the application:

Colored Highlighting

In my custom plugin, I've added tagged colored highlighting. An editor can select a specific color for a tag assigned to an annotation from preselected colors. All users can highlight and unhighlight annotations with that specific tag. The plugin uses jQuery XColor, a JavaScript plugin that handles color calculation of overlapping highlights. Users can also turn on and off highlighting on a per tag basis.


Tagged colored highlighting (referred to as layers here) is selected from a predefined set of colors.

Linked Resources

Another customization I created was the ability to link annotations to other resources in the application, which allows for users to build relationships between multiple pieces of content. This is merely an extra data point saved on the annotation itself.


A linked collage from this annotation.

Toggle Display of layered and unlayered content

One of the most difficult customization points was building out the functionality that allows users to toggle display of unlayered and layered content, meaning that after a user annotates a certain amount of text, they can hide all the unannotated text (replaced with an ellipsis). The state of the content (e.g. with unlayered text hidden) is saved and presented to other users this way, which essentially allows the author to control what text is visible to users.

Learn More

Make sure to check out the Annotator website if you are interested in learning more about this plugin. The active community has interesting support for annotating images, video, and audio, and is always focused on improving plugin capabilities. One group is currently focused on cross browser support, including support of IE8.

Liquid Galaxy installation at Sparkassen-Finanzportal Forum 2014


In May, End Point and Google organized a Liquid Galaxy installation at Sparkassen-Finanzportal Forum 2014 in Düsseldorf, Germany.

For this event End Point installed the Liquid Galaxy and also prepared custom tours for the Liquid Galaxy showing different Sparkasse locations across Germany.

We arrived at Düsseldorf a day before the event, and assembled the whole Liquid Galaxy without any problems. The system consists of 7 displays and 6 computers, so the potential for issues is pretty great, but we've done this a number of times, and have worked out a good stable build. After assembly, the system worked pretty well. Our US-based team then finalized the custom tours and uploaded the most recent software and content.

The next morning, when we arrived at the venue, everything was working great, and the system was ready for people to explore and discover. As usual, people were interested not only in the prepared tours but also in looking for places they know (usually, their house!). The overall user experience was great, especially when people were able to see places they hadn't seen for quite a long time, like the places where they grew up.

This is another successful conference deployment of the Liquid Galaxy platform. Thanks to all our partners in Europe and the US who helped make this happen.