
PyOhio 2014: Python FTW!

Just got back from PyOhio a couple of days ago. Columbus is my old stomping ground, so it's always nice to get back there. PyOhio had been on my TODO list for a number of years now, but every time it seemed like something else got in the way. This year I figured it was finally time, and I'm quite glad it worked out.

While of course everything centered around Python, many of the talks covered other tools and projects. I returned with a much better view of good technologies like Redis, Ansible, Docker, ØMQ, Kafka, Celery, asyncio in Python 3.4, Graphite, and much more that isn't coming to mind at the moment. I have lots to dig into now.

It also pleased me to see so much Postgres love! I mean, clearly, once you start to use it you'll fall in love, that's without question. But the hall track was full of conversations about how various people were using Postgres, what it tied in to in their applications, and various tips and tricks they'd discovered in its functionality. Just goes to prove that Postgres == ♥.

Naturally PostgreSQL is what I spoke on; PL/Python, specifically. It actually directly followed a talk on PostgreSQL's LISTEN/NOTIFY feature. I was a touch worried about overlap considering some of the things I'd planned, but it turns out the two talks more or less dovetailed from one to the other. It was unintentional, but it worked out very well.

Anyway, the slides are available, but the talk wasn't quite structured in the normal format of having those slides displayed on a projector. Instead, in a bit of an experiment, attendees could hit a web page and bring up the slides on their own laptops and devices. That slide deck opened a long-polling socket back to the server, and the web app could control the slide movement on those remote screens. That left the projector free for a console session used to illustrate PL/Python and PostgreSQL usage. As you might expect, the demo included a run through the PL/Python and related code that drove that socket. Hopefully the video, when it's available, caught some of it.
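To give a flavor of the idea, here is a minimal sketch (not the actual talk code) of how PL/Python can broadcast a slide change over PostgreSQL's NOTIFY mechanism; a long-polling web endpoint listening on the "slides" channel would relay each payload to the connected decks:

-- A minimal sketch, assuming the plpythonu language is installed
CREATE OR REPLACE FUNCTION change_slide(slide integer)
RETURNS void AS $$
    plan = plpy.prepare("SELECT pg_notify('slides', $1)", ["text"])
    plpy.execute(plan, [str(slide)])
$$ LANGUAGE plpythonu;

-- SELECT change_slide(4);  -- every listening deck advances to slide 4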

The sessions were recorded on video, but one thing I hadn't expected was how that influenced which talks I tried to attend. Knowing that the more software-oriented presentations would be available for viewing later, I opted where possible for hardware-oriented topics, or other talks where being present seemed like it would have much more impact. I also didn't feel rushed between sessions on the occasions when I got caught in a hall track conversation or checked out something in the open spaces area (in one sense, a dedicated hall track room).

Overall, it was a fantastic conference, and a great thank you goes out to everyone who helped make it happen!

Point Clouds on the Liquid Galaxy

Image by Stoermerjp, unmodified (CC BY-SA 3.0)

The Liquid Galaxy began as a system to display geographic data through Google Earth, but it has expanded quickly as a display platform for other types of information. We've used Liquid Galaxies for panoramic images and video, three dimensional models of all sorts, as well as time-lapse renderings of weather, infrastructure, and economic data. Now we've added support for a new type of data, the point cloud.

"Point cloud" is simply the common term for a data set consisting of individual points, often in three-dimensional space, and frequently very large, containing thousands or millions of entries. Points in a cloud can include not just coordinate data, but other information as well, and because this sort of data can come from many different fields, the possible variations are endless. For instance, the terrain features visible in Google Earth began life as point clouds, the output of an aerial scanning device such as a LIDAR scanner. These scanners sweep their field of view rapidly, scanning millions of points to determine their location and any other interesting measurements -- color, for instance, or temperature -- and with that data create a point cloud. Smaller scale hardware scanners have made their way into modern life, scanning rooms and buildings, or complex objects. A few years ago, the band Radiohead collaborated with Google to use the 3-D scanning techniques to film a music video, and published the resulting point cloud on Google Code.

Image by Xorx, unmodified (CC BY-SA 3.0)

For the Liquid Galaxy platform, we modified an open source point cloud viewer called Potree to allow one master instance to control several others. Potree is a WebGL application: it runs in a browser and depends on three.js, a widely used WebGL library. Generally speaking, to convert an application to run on a Liquid Galaxy, the developer must give the software two separate modes: one which allows normal user control and transmits information to a central data bus, and another which receives information from the bus and uses it to create its own modified display. In this case, where the application simply loads a model and lets the user move it around, the "master" mode tracks all camera movements and sends them to the other Liquid Galaxy nodes, which draw the same point cloud in the same orientation as the master, but with the camera offset to the left or right by a fixed amount. We've dubbed our version lg-potree.

This marks the debut of a simple three.js extension we've been working on, called lg-three.js, designed to make it easy to adapt three.js applications for the Liquid Galaxy. lg-three.js gives the programmer an interface for capturing and serializing things like camera movements or other scene changes on the master node, and for de-serializing and using that data on the other nodes, hopefully without having to modify the original application much. Some applications' structure doesn't lend itself well to lg-three.js, but Potree proved fairly straightforward.
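As a rough illustration of that pattern, here is a minimal sketch (the real lg-three.js interface may differ) of the two halves: the master serializes its camera and broadcasts it, and each slave node restores that camera and applies its own fixed yaw offset:

// Master node: serialize the camera state and send it to the data bus
function broadcastCamera(socket, camera) {
  socket.send(JSON.stringify({
    position: camera.position.toArray(),
    quaternion: camera.quaternion.toArray()
  }));
}

// Slave node: restore the master's camera, then rotate by this screen's
// fixed offset so it renders the view to the left or right of the master
function applyCamera(camera, msg, yawOffset) {
  var data = JSON.parse(msg);
  camera.position.fromArray(data.position);
  camera.quaternion.fromArray(data.quaternion);
  camera.rotateY(yawOffset);
}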

With that introduction, please enjoy this demonstration.

Customizing the Nestable jQuery Plugin

A while back, we started using the Nestable jQuery Plugin for H2O. It provides interactive hierarchical list functionality – or the ability to sort and nest items.


Diagram from Nestable jQuery Plugin representing interactive hierarchical sort and list functionality.

I touched on H2O's data model in this post, but it essentially mimics the diagram above: a user can build sortable and nestable lists. Nesting is visible up to 4 levels deep. Each list is accessible and editable as its own resource, owned by a single user. The plugin is ideal for working with this data model; however, I needed a bit of customization, which I'll describe in this post.

Limiting Nestability to Specific List Items

The feature I was asked to develop was to limit nesting to items owned by the current authorized (logged in) user. Users can add items owned by other users to their list, but they cannot modify the list elements for those items. In visual form, it might look something like the diagram below, where green represents items owned by the user, which allow modified nesting, and red represents items not owned by the user, which cannot be modified. In the diagram below, I would not be able to add to or reorder the contents of Item 5 (including Items 6 - 8), and I would not be able to add any nested elements to Item 10. I can, however, reorder elements inside Item 2, which means, e.g., I can move Item 5 above Item 3.


Example of nesting where nesting is prohibited among some items (red), but allowed under others (green).

There are a couple of tricky bits to note in developing this functionality:

  • The plugin doesn't support this type of functionality, nor is it currently maintained, so there are absolutely no expectations of this being an included feature.
  • These pages are fully cached for performance optimization, so there is no per-user logic that can be run to generate modified HTML. The solution here was implemented using JavaScript and CSS.

Background Notes on the Plugin

There are a couple of background notes on the plugin before I go into the implemented solution:

  • The plugin uses <ol> tags to represent lists. Only items in <ol> elements are nestable and sortable.
  • The plugin recognizes .dd-handle as the draggable handle on a list item. If an item doesn't have a .dd-handle element, no part of it can be dragged.
  • The plugin creates a <div> with a class of dd-placeholder to represent the placeholder where an item is about to be dropped. The default appearance for this is a white box with dashed outline.
  • The plugin has an on change event which is triggered whenever any item is dropped in the list or any part of the list is reordered.

Step 1: JavaScript Update for Limiting Nestability

After the content loads, as well as after additional list items are added, a method called set_nestability is run to modify the HTML of the content, shown in simplified form below:

set_nestability: function() {
  // Anonymous users never get nestability; superadmins can edit all
  if (!$.cookie('user_id')) { return; }
  if ($.cookie('superadmin')) { return; }  // flag name is illustrative
  $('li.dd-item').each(function() {
    var item = $(this);
    if (String(item.data('user_id')) !== $.cookie('user_id')) {
      // Remove draggable handles so this item's children can't be moved
      item.find('.dd-handle').removeClass('dd-handle');
      // Swap <ol> for <ul> so the CSS below can target non-editable lists
      item.find('ol').replaceWith(function() {
        return $('<ul/>').append($(this).contents());
      });
    }
  });
}

This simple method does two things: it removes the .dd-handle class from elements that can't be reordered, and it replaces <ol> tags with <ul> tags so that CSS can target non-editable lists. The only thing to take note of here is that, in theory, a "hacker" could change their cookie to enable nestability of certain items, but additional server-side logic would prohibit any such update.

Step 2: CSS Updates

ul .dd-placeholder {
  display: none;
}

Next up, the simple CSS change above was made. This results in the placeholder div being hidden in any non-editable list. I made several small CSS modifications so that the ul and ol elements would look the same otherwise.
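Those modifications amounted to a handful of rules along these lines (a sketch assuming the plugin's default class names; the exact values depend on your stylesheet):

/* Make non-editable <ul> lists look like the plugin's sortable <ol> lists */
.dd ul, .dd ol {
  list-style: none;
  margin: 0;
  padding-left: 30px;
}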

Step 3: JavaScript On Change Updates

Finally, I modified the on change event:

$('div#list').on('change', function(el) {
  // if an item was dropped
    // if the dropped item is not inside an <ol> tag, return (do nothing)
    // else continue with the dropped logic
  // else if a position changed
    // trigger the position change logic
});

The on change event does nothing when the dropped item is not inside an editable list. Otherwise, the dropped logic continues as the user has permissions to edit the list.

Conclusion

The functionality described here has worked well. What may become trickier is supporting advanced rights and roles that allow non-owners to edit specific content, which I haven't gotten to yet. I haven't found other resources that offer sortable and nestable functionality in jQuery, but it'd be great to see a new, well-supported plugin in the future.

Python Imports

For a Python project I'm working on, I wrote a parent class with multiple child classes, each of which made use of various modules that were imported in the parent class. A quick solution to making these modules available in the child classes would be to use wildcard imports in the child classes:

from package.parent import *

however, PEP8 warns against this, stating "they make it unclear which names are present in the namespace, confusing both readers and many automated tools."

For example, suppose we have three files:

# a.py
import module1

class A(object):
    def __init__(self):
        pass

# b.py
from a import *   # wildcard import pulls in A and module1
import module2

class B(A):
    def __init__(self):
        super(B, self).__init__()

# c.py
from b import *   # wildcard import pulls in B, module1, and module2

class C(B):
    def __init__(self):
        super(C, self).__init__()

To someone reading just b.py or c.py, it is not apparent that module1 is present in the namespace of b, and that both module1 and module2 are present in the namespace of c. So, following PEP8, I just explicitly imported any module needed in each child class. Because in my case there were many imports, and it seemed repetitive to duplicate them in each of the many child classes, I wanted to find out if there was a better solution. While I still don't know if there is, I did go down the road of learning how imports work in Python, at least for 3.4.1, and will share my notes with you.

Python allows you to import modules using the import statement, the built-in function __import__(), and the function importlib.import_module(). The differences between these are:

The import statement first "searches for the named module, then it binds the results of that search to a name in the local scope" (Python Documentation). Example:

Python 3.4.1 (default, Jul 15 2014, 13:05:56) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re
<module 're' from '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py'>
>>> re.sub('s', '', 'bananas')
'banana'

Here the import statement searches for a module named re then binds the result to the variable named re. You can then call re module functions with re.function_name().

A call to the built-in function __import__() performs the search and load but not the binding; that is left to you. Example:

>>> muh_regex = __import__('re')
>>> muh_regex
<module 're' from '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py'>
>>> muh_regex.sub('s', '', 'bananas')
'banana'

Your third option is to use importlib.import_module(), which, like __import__(), leaves the binding to you:

>>> import importlib
>>> muh_regex = importlib.import_module('re')
>>> muh_regex
<module 're' from '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py'>
>>> muh_regex.sub('s', '', 'bananas')
'banana'

Let's now talk about how Python searches for modules. The first place it looks is in sys.modules, which is a dictionary that caches previously imported modules:

>>> import sys
>>> 're' in sys.modules
False
>>> import re
>>> 're' in sys.modules
True
>>> sys.modules['re']
<module 're' from '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py'>

If the module is not found in sys.modules Python searches sys.meta_path, which is a list that contains finder objects. Finders, along with loaders, are objects in Python's import protocol. The job of a finder is to return a module spec, using method find_spec(), containing the module's import-related information which the loader then uses to load the actual module. Let's see what I have in my sys.meta_path:

>>> sys.meta_path
[<class '_frozen_importlib.BuiltinImporter'>, <class '_frozen_importlib.FrozenImporter'>, <class '_frozen_importlib.PathFinder'>]

Python will use each finder object in sys.meta_path until the module is found and will raise an ImportError if it is not found. Let's call find_spec() with parameter 're' on each of these finder objects:

>>> sys.meta_path[0].find_spec('re')
>>> sys.meta_path[1].find_spec('re')
>>> sys.meta_path[2].find_spec('re')
ModuleSpec(name='re', loader=<_frozen_importlib.SourceFileLoader object at 0x7ff7eb314438>, origin='/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/re.py')

The first finder knows how to find built-in modules and since re is not a built-in module, it returns None.

>>> 're' in sys.builtin_module_names
False

The second finder knows how to find frozen modules, which re is not. The third knows how to find modules from a list of path entries called an import path. For re the import path is sys.path but for subpackages the import path can be the parent's __path__ attribute.
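The finder protocol is easy to experiment with yourself. Here is a minimal sketch of a do-nothing finder prepended to sys.meta_path; it reports each module the import machinery asks about, then defers to the remaining finders by returning None:

import sys
import importlib.abc

class NoisyFinder(importlib.abc.MetaPathFinder):
    """Report what the import machinery asks for, but find nothing."""
    def find_spec(self, fullname, path, target=None):
        print('asked for {!r} (path={!r})'.format(fullname, path))
        return None  # defer to the other finders in sys.meta_path

sys.meta_path.insert(0, NoisyFinder())
import json  # prints: asked for 'json' (path=None), plus json's own imports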

>>> sys.path
['', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/site-packages/distribute-0.6.49-py3.4.egg', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python34.zip', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/plat-linux', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/lib-dynload', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/site-packages', '/home/miguel/.pythonbrew/pythons/Python-3.4.1/lib/python3.4/site-packages/setuptools-0.6c11-py3.4.egg-info']

Once the module spec is found, the loading machinery takes over. That's as far as I dug, but you can read more about the loading process in the documentation.

Python Subprocess Wrapping with sh

When working with shell scripts written in bash/csh/etc., one of the primary tools you rely on is a simple method of piping output and input between subprocesses to build up the complex logic that accomplishes the goal of the script. In Python, the same method of calling subprocesses and redirecting their input/output is available, but the overhead of using it is cumbersome enough to make Python a less desirable scripting language: in effect you end up reimplementing large parts of the I/O facilities, and potentially even writing replacements for existing shell utilities that already perform the same work. Recently, Python developers addressed this problem by reworking an existing subprocess wrapper library called pbs into an easier-to-use library called sh.

Sh can be installed using pip, and the author has posted some documentation for the library here: http://amoffat.github.io/sh/

Using the sh library

After installing the library into your version of Python, there will be two ways to call any existing shell command available to the system. First, you can import the command as though it were itself a Python library:

from sh import hostname
print(hostname())

Alternatively, you can call the command directly by referencing the sh namespace before the command name:

import sh
print(sh.hostname())

When running this command on my linux workstation (hostname atlas) it will return the expected results:

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sh
>>> print(sh.hostname())
atlas

However, at this point we are merely replacing a single shell command that prints output to the screen; the real benefit of shell scripts is chaining commands together to create complex logic that does real work.

Advanced Gymnastics

A common use of shell scripts is to provide administrators the ability to quickly filter log file output, to search for specific conditions within those logs, and to alert in the event that an application starts throwing errors. With piping in sh we can create a simple log watcher, capable of calling anything we desire in Python when the log file contains any of the conditions we are looking for.

To pipe together commands using the sh library, you would encapsulate each command in series to create a similar syntax to bash piping:

>>> print(sh.wc(sh.ls("-l", "/etc"), "-l"))
199

This command is equivalent to the bash pipe "ls -l /etc | wc -l", indicating that the long listing of /etc on my workstation contained 199 lines of output. Each piped command is encapsulated inside the parentheses of the command that follows it in the pipeline.

For our log listener we will use the tail command along with a python iterator to watch for a potential error condition, which I will represent with the string "ERROR":

>>> for line in sh.tail("-f", "/tmp/test_log", _iter=True):
...     if "ERROR" in line:
...         print line

In this example, once executed, Python will call the tail command to follow a particular log file. It will iterate over each line of output produced by tail, and if any line contains the string we are watching for, Python will print that line to standard output. At this point, this is similar to using the tail command and piping the output to a string search command like grep. However, you could replace the print in the third line with a more complex action: emailing the error condition to a developer or administrator for review, or perhaps initiating a procedure to recover from the error automatically.
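For instance, here is a minimal sketch of the emailing variant (the addresses, and the assumption of an SMTP relay on localhost, are illustrative):

import smtplib
from email.mime.text import MIMEText
import sh

def notify_admin(line):
    # Build and send a small plain-text alert (addresses are hypothetical)
    msg = MIMEText(line)
    msg['Subject'] = 'ERROR seen in /tmp/test_log'
    msg['From'] = 'logwatcher@example.com'
    msg['To'] = 'admin@example.com'
    server = smtplib.SMTP('localhost')
    server.sendmail(msg['From'], [msg['To']], msg.as_string())
    server.quit()

for line in sh.tail("-f", "/tmp/test_log", _iter=True):
    if "ERROR" in line:
        notify_admin(line)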

Conclusions

In this manner, with just a few lines of Python, much like with bash, one can create a relatively complex process without recreating all the shell commands that already perform this work, or creating a convoluted process of passing output from command to command. This combination gives you the best of both: all the functions available to any Python environment, with the ease of using shell commands to do some of the work. In the future I will definitely be using this library for my own shell scripting needs, as I have generally preferred the syntax and ease of use of Python over that of bash, but now I can enjoy both at the same time.

PostgreSQL conflict handling with Bucardo and multiple data sources


Image by Flickr user Rebecca Siegel (cropped)

Bucardo's much publicized ability to handle multiple data sources often raises questions about conflict resolution. People wonder, for example, what happens when a row in one source database gets updated one way, and the same row in another source database gets updated a different way? This article will explain some of the solutions Bucardo uses to solve conflicts. The recently released Bucardo 5.1.1 has some new features for conflict handling, so make sure you use at least that version.

Bucardo does multi-source replication, meaning that users can write to more than one source at the same time. (This is also called multi-master replication, but "source" is a much more accurate description than "master"). Bucardo deals in primary keys as a way to identify rows. If the same row has changed on one or more sources since the last Bucardo run, a conflict has arisen and Bucardo must be told how to handle it. In other words, Bucardo must decide which row is the "winner" and thus gets replicated to all the other databases.

For this demo, we will again use an Amazon AWS instance. See the earlier post about Bucardo 5 for directions on installing Bucardo itself. Once it is installed (after the './bucardo install' step), we can create some test databases for our conflict testing. Recall that we have a handy database named "shake1". As this name can get a bit long for some of the examples below, let's make a few database copies with shorter names. We will also teach Bucardo about the databases, and create a sync named "ctest" to replicate between them all:

createdb aa -T shake1
createdb bb -T shake1
createdb cc -T shake1
bucardo add db A,B,C dbname=aa,bb,cc
## autokick=0 means new data won't replicate right away; useful for conflict testing!
bucardo add sync ctest dbs=A:source,B:source,C:source tables=all autokick=0
bucardo start

Bucardo has three general ways to handle conflicts: built in strategies, a list of databases, or using custom conflict handlers. The primary strategy, and also the default one for all syncs, is known as bucardo_latest. When this strategy is invoked, Bucardo scans all copies of the conflicted table across all source databases, and then orders the databases according to when they were last changed. This generates a list of databases, for example "B C A". For each conflicting row, the database most recently updated - of all the ones involved in the conflict for that row - is the winner. The other built in strategy is called "bucardo_latest_all_tables", which scans all the tables in the sync across all source databases to find a winner.

There may be other built in strategies added as experience/demand dictates, but it is hard to develop generic solutions to the complex problem of conflicts, so non built-in strategies are preferred. Before getting into those other solutions, let's see the default strategy (bucardo_latest) in action:

## This is the default, but it never hurts to be explicit:
bucardo update sync ctest conflict=bucardo_latest
Set conflict strategy to 'bucardo_latest'
psql aa -c "update work set totalwords=11 where title~'Juliet'"; \
psql bb -c "update work set totalwords=21 where title~'Juliet'"; \
psql cc -c "update work set totalwords=31 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 0
Kick ctest: [1 s] DONE!
## Because cc was the last to be changed, it wins:
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   31
bb   |   31
cc   |   31

Under the hood, Bucardo actually applies the list of winning databases to each conflicting row, such that the example above of "B C A" means that database B wins in a conflict in which a row was updated by B and C, or B and A, or B and C and A. However, if B did not change the row, and the conflict is only between C and A, then C will win.

As an alternative to the built-ins, you can set conflict_strategy to a list of the databases in the sync, ordered from highest priority to lowest, for example "C B A". The list does not have to include all the databases, but it is a good idea to do so. Let's see it in action. We will change the conflict_strategy for our test sync and then reload the sync to have it take effect:


bucardo update sync ctest conflict='B A C'
Set conflict strategy to 'B A C'
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=12 where title~'Juliet'"; \
psql bb -c "update work set totalwords=22 where title~'Juliet'"; \
psql cc -c "update work set totalwords=32 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 10
Kick ctest: [1 s] DONE!
## This time bb wins, because B comes before A and C
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   22
bb   |   22
cc   |   22

The final strategy for handling conflicts is to write your own code. Many will argue this is the best approach. It is certainly the only one that allows you to embed your business logic into the conflict handling.

Bucardo allows loading of snippets of Perl code known as "customcodes". These codes take effect at specified times, such as after triggers are disabled, or when a sync has failed because of an exception. The specific time we want is called "conflict", and it is an argument to the "whenrun" attribute of the customcode. A customcode needs a name, the whenrun argument, and a file to read in for its content. They can also be associated with one or more syncs or tables.

Once a conflict customcode is in place and a conflict is encountered, the code will be invoked, and it will in turn pass information back to Bucardo telling it how to handle the conflict.

The code should expect a single argument, a hashref containing information about the current sync. This hashref tells the current table, and gives a list of all conflicted rows. The code can tell Bucardo which database to consider the winner for each conflicted row, or it can simply declare a winning database for all rows, or even for all tables. It can even modify the data in any of the tables itself. What it cannot do (thanks to the magic of DBIx::Safe) is commit, rollback, or do other dangerous actions since we are in the middle of an important transaction.

It's probably best to show by example at this point. Here is a file called ctest1.pl that asks Bucardo to skip to the next applicable customcode if the conflict is in the table "chapter". Otherwise, it tells Bucardo that database "C" should win all conflicts in this table, falling back to database "B" and then "A".

## ctest1.pl - a sample conflict handler for Bucardo
use strict;
use warnings;

my $info = shift;
## If this table is named 'chapter', do nothing
if ($info->{tablename} eq 'chapter') {
    $info->{skip} = 1;
}
else {
    ## Winning databases, in order
    $info->{tablewinner} = 'C B A';
}
return;

Let's add in this customcode, and associate it with our sync. Then we will reload the sync and cause a conflict.

bucardo add customcode ctest \
  whenrun=conflict src_code=ctest1.pl sync=ctest
Added customcode "ctest"
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=13 where title~'Juliet'"; \
psql bb -c "update work set totalwords=23 where title~'Juliet'"; \
psql cc -c "update work set totalwords=33 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 0
Kick ctest: [1 s] DONE!
## This time cc wins, because we set all rows to 'C B A'
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   33
bb   |   33
cc   |   33

We used the 'skip' hash value to tell Bucardo to not do anything if the table is named "chapter". In real life, we would have another customcode to handle the skipped table, or else any conflict in it would cause the sync to stop. Any number of customcodes can be attached to syncs or tables.

The database preference will last for the remainder of this sync's run, so any other conflicts in other tables will not even bother to invoke the code. You can use the hash key "tablewinneralways" to make this decision sticky, in that it will apply for all future runs by this sync (its KID technically) - which effectively means the decision stays until Bucardo restarts.

One of the important structures sent to the code is a hash named "conflicts", which contains all the changed primary keys, and, for each one, a list of which databases were involved in the sync. A Data::Dumper peek at it would look like this:

$VAR1 = {
  'romeojuliet' => {
    'C' => 1,
    'A' => 1,
    'B' => 1,
  }
};

The job of the conflict handling code (unless using one of the "winner" hash keys) is to change each of those conflicted rows from a hash of involved databases into a string describing the preferred order of databases. The Data::Dumper output would thus look like this:

$VAR1 = {
  'romeojuliet' => 'B'
};

The code snippet would look like this:

## ctest2.pl - a simple conflict handler for Bucardo.
use strict;
use warnings;

my $info = shift;
for my $row (keys %{ $info->{conflicts} }) {
  ## Equivalent to 'A C B'
  $info->{conflicts}{$row} = exists $info->{conflicts}{$row}{A} ? 'A' : 'C';
}

## We don't want any other customcodes to fire: we have handled this!
$info->{lastcode} = 1;
return;

Let's see that code in action. Assuming the above "bucardo add customcode" command was run, we will need to load an updated version, and then reload the sync. We create some conflicts, and check on the results:


bucardo update customcode ctest src_code=ctest2.pl
Changed customcode "ctest" src_code with content of file "ctest2.pl"
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=14 where title~'Juliet'"; \
psql bb -c "update work set totalwords=24 where title~'Juliet'"; \
psql cc -c "update work set totalwords=34 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 10
Kick ctest: [2 s] DONE!
## This time aa wins, because we set all rows to 'A C B'
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   14
bb   |   14
cc   |   14

That was an obviously oversimplified example, as we picked 'A' for no discernible reason! These conflict handlers can be quite complex, limited only by your imagination - and your business logic. As a final example, let's have the code examine some other things in the database, and even jump outside the database itself(!), to determine the resolution to the conflict:

## ctest3.pl - a somewhat silly conflict handler for Bucardo.
use strict;
use warnings;
use LWP;

my $info = shift;

## What is the weather in Walla Walla, Washington?
## If it's really hot, we cannot trust server A
my $max_temp = 100;
my $weather_url = 'http://wxdata.weather.com/wxdata/weather/rss/local/USWA0476?cm_ven=LWO&cm_cat=rss';
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => $weather_url);
my $response = $ua->request($req)->content();
my $temp = ($response =~ /(\d+) \°/) ? $1 : 75;
## Store in our shared hash so we don't have to look it up every run
## Ideally we'd add something so we only call it if the temp has not been checked in last hour
$info->{shared}{wallawallatemp} = $temp;

## We want to count the number of sessions on each source database
my $SQL = 'SELECT count(*) FROM pg_stat_activity';
for my $db (sort keys %{ $info->{dbinfo} }) {
    ## Only source databases can have conflicting rows
    next if ! $info->{dbinfo}{$db}{issource};
    ## The safe database handles are stored in $info->{dbh}
    my $dbh = $info->{dbh}{$db};
    my $sth = $dbh->prepare($SQL);
    $sth->execute();
    $info->{shared}{dbcount}{$db} = $sth->fetchall_arrayref()->[0][0];
}

for my $row (keys %{ $info->{conflicts} }) {
    ## If the temp is too high, remove server A from consideration!
    if ($info->{shared}{wallawallatemp} > $max_temp) {
        delete $info->{conflicts}{$row}{A}; ## May not exist, but we delete anyway
    }

    ## Now we can sort by number of connections and let the least busy db win
    (my $winner) = sort {
        $info->{shared}{dbcount}{$a} <=> $info->{shared}{dbcount}{$b}
        or
        ## Fallback to reverse alphabetical if the session counts are the same
        $b cmp $a
    } keys %{ $info->{conflicts}{$row} };

    $info->{conflicts}{$row} = $winner;
}

## We don't want any other customcodes to fire: we have handled this!
$info->{lastcode} = 1;
return;

We'll forego the demo: suffice to say that B always won in my tests, as Walla Walla never got over 97, and all my test databases had the same number of connections. Note some of the other items in the $info hash: "shared" allows arbitrary data to be stored across invocations of the code. The "lastcode" key tells Bucardo not to fire any more customcodes. While this example is very impractical, it does demonstrate the power available to you when solving conflicts.

Hopefully this article answers many of the questions about conflict handling with Bucardo. Suggestions for new default handlers and examples of real-world conflict handlers are particularly welcome, as well as any other questions or comments. You can find the mailing list at bucardo-general@bucardo.org, and subscribe by visiting the bucardo-general Info Page.

Creating a Symbol Web Font

Creating a custom font that only includes a few characters can be very useful. I was looking for a good way to display left and right arrows for navigating between clients and between team members on our site, and after doing some research, a custom font seemed like a good approach: it would be small, and it would support all kinds of screens and browsers. So, here I'll show how to create a web font with a few custom characters in it that you can use on your website.

You'll need to get the free, open source vector graphics editor Inkscape and familiarize yourself with its drawing tools.

To start, open Inkscape and open the SVG font editor by clicking Text -> SVG Font Editor. Under the font column, click "New" and then name your new font.

Now you can start adding characters. Begin by adding as many glyphs as you need and choosing letters for your character to be represented by. Only use characters that you can find on a standard QWERTY keyboard, as FontSquirrel (which we'll use to convert this to a web font) won't work with, for instance, Unicode special characters.

Now, for each symbol, draw it using Inkscape's tools and make sure that its dimensions are roughly 750 pixels high (which will be about the height of an uppercase letter) and that it's flush with the bottom of the canvas.

When your symbol looks like you want it to, make sure that all of the shapes you used to form it are selected and merge them together with Path -> Union. When you're done, you should have a single object, your glyph. Now, select your glyph and do Path -> Object to Path.

To add this symbol to your new font, select your object and the corresponding glyph and click "Get curves from selection."

To test, enter the character you're using for your symbol in the "Preview Text" area. If it shows your symbol, you're set. Otherwise, you need to make sure that you merged and converted your object to a path correctly.

After you've repeated these steps with every symbol you need, save with Inkscape as an SVG. We need to convert this to a TrueType font, so go to www.freefontconverter.com/ (or any other font converter) and convert to .ttf.

The last thing you need to do before using your font in your webpage is convert it to a webfont. Fortunately, FontSquirrel makes this easy. Go to FontSquirrel's webfont generator and upload your TrueType font. After the conversion has finished, you'll get a zipfile with the font in several different webfont formats, and even an HTML page telling you how to use it in a webpage.
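Using the resulting webfont then takes only a few lines of CSS, roughly like this (a sketch; the file names and font-family name depend on what you generated, and FontSquirrel's kit includes additional formats for broader browser coverage):

@font-face {
  font-family: 'MySymbols';
  src: url('mysymbols.woff') format('woff'),
       url('mysymbols.ttf') format('truetype');
}

/* Text in this class renders in the symbol font, so
   <span class="symbol">a</span> shows the glyph you assigned to "a" */
.symbol {
  font-family: 'MySymbols';
}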

Have fun creating custom webfonts!

Runaway Rewrite Rule

I am not an expert in Apache configuration. When I have to delve into a *.conf file for more than five minutes, I come out needing an aspirin, or at least a nerve-soothing cupcake. But necessity is the mother of contention, or something like that.

My application had recently added some new URLs, which were being parsed by your typical MVC route handler (although in Perl, because that's how I roll, and not in Dancer, because … well, I don't think it had been invented yet when this application first drew breath). 99.9% of the URLs worked just fine:

/browse/:brand/:category (the pattern)
/browse/acme/widget
/browse/ben-n-jerry/ice-cream

and so on. Suddenly a report reached me that one particular brand was failing:

/browse/unseen-images/stuff

("unseen-images" has been changed to protect the innocent. The key here is the word "images"; put a pin in that and hang on.)

/browse/unseen-images

worked just fine. What's worse, instrumenting the route handler code proved that it wasn't even being called for /browse/unseen-images/foo or any of its siblings, whether :category was valid or not.

Making sure my bottle of aspirin was at hand, I dove into the Apache configuration. I added –

RewriteLog /path-to-logs/logs/rewrite_log
RewriteLogLevel 9

and while its output was fascinating, it wasn't very enlightening. However, I did stumble upon this gem:

RewriteRule  ^/.*images/.*   -       [NE,PT,L]

Aha! Oho! A runaway regular expression is our culprit: the unanchored .* lets "images" match the tail of "unseen-images", so /browse/unseen-images/stuff was swallowed by this rule and never reached my route handler. I'm pretty sure the rule was added innocently, hoping to catch things like

/css/images/foo.jpg
/images/foo.png

and so on, but it misfired and gathered up my application URL. I replaced this temporarily with:

RewriteRule  ^/(.+/)*images/.*   -       [NE,PT,L]

"Temporarily" because I'm still trying to find someone who knows why that particular kind of rewrite was deemed necessary, so I don't know whether my replacement rule will have the same effect in the cases where it is supposed to be doing a job.

Is there a moral to this story? I don't know just yet, but it's probably something like "Regular expressions are powerful, use them with care", or maybe "When rewrite rules are good, they are very, very good, but when they are bad they are horrid."

Interactive Highlighting and Annotations with Annotator

Over a year ago, I wrote about JavaScript-driven interactive highlighting that emulates the behavior of physically highlighting and annotating text. It's interesting how much technology can change in a short time. This spring, I've been busy at work on a major upgrade of both Rails (2.3 to 4.1) and the annotation capabilities for H2O. As I explained in the original post, this highlighting functionality is one of the most interesting and core features of the platform. Here I'll go through a brief history and the latest round of changes.


Example digital highlighting of sample content.

History

My original post explains the challenges associated with highlighting content on a per-word basis, as well as determining color combinations for those highlights. In past implementations, each word was wrapped in a single DOM element, and that element would have its own background color based on the highlighting (or a white background for no highlighting). In the first iteration of the project, we didn't allow for color combinations at all – instead we tracked the history of highlights per word and always highlighted with the most recent color that applied. In the second iteration of the annotation work, opaque layers of color were added under the words to simulate color combinations using absolute positioning. Cross-browser support for absolute positioning is not always consistent, so this iteration had challenges.

In the third iteration, which lasted for over a year, I found a great plugin (xColor) to calculate color combinations, eliminating the need for complex layering of highlights. This iteration was acceptable in terms of functionality, but its major limitation was performance. When every word of a piece of content has its own DOM element, there are significant performance issues once content exceeds 20,000 words or so, especially noticeable in slower browsers.

I kept my eye out for a better way to accomplish the desired functionality, but I didn't know whether annotations could be done without per-word DOM markup and its performance challenges.

Annotator Tool

Along came Annotator, an open source JavaScript plugin offering annotation functionality. A coworker first brought this plugin to my attention and I spent time evaluating it. At the time, I concluded that while the plugin looked promising, too much customization would be required to support H2O's existing features. And of course, the tool did not support IE8, which was a huge limitation at the time, although it becomes less of one as time passes and users move away from IE8.

Time passed, and the H2O project manager also came across the same tool and brought it to my attention. I spent a bit of time developing a proof of concept to see how I might accomplish some of the desired behavior in a custom encapsulated plugin. With the success of the proof of concept, I also spent time working through the IE8 issues. Although I was able to work through many of them, I was not able to find a solution to fully support the tool in IE8. At that time, a decision was made to use Annotator and disable annotation capabilities for IE8. I moved forward on development.

How does it work?

Rather than highlighting content on a word level, Annotator determines the XPath of a section of highlighted characters. The XPath for the annotation's starting point and ending point is retrieved, and one or more DOM elements wrap this content. If the annotated characters span multiple DOM elements (e.g. the annotation spans multiple paragraphs), multiple wrapping DOM elements are created inside each parent element. Annotator handles all the management of the wrapped DOM elements for an annotation, and it provides hooks that fire on specific annotation events (e.g. after annotation created, before annotation deleted).
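Those hooks are straightforward to use. Here is a minimal sketch (based on Annotator 1.2's jQuery interface; the coloring logic is illustrative, not H2O's actual code):

// Attach Annotator to the content element and grab the instance
var annotator = $('#content').annotator().data('annotator');

// annotation.highlights holds the DOM elements Annotator wrapped
// around the selected text; color them when an annotation is created
annotator.subscribe('annotationCreated', function (annotation) {
  $(annotation.highlights).css('background-color', '#ffff99');
});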

This solution has much better performance than the aforementioned techniques, and there's a growing community of open source developers involved in it, who have helped improve functionality and contribute additional features.

Annotator Customizations

Annotator includes a nice plugin architecture designed to allow custom functionality to be built on top of it. Below are customized features I've added to the application:

Colored Highlighting

In my custom plugin, I've added tagged colored highlighting. An editor can select a specific color for a tag assigned to an annotation from preselected colors. All users can highlight and unhighlight annotations with that specific tag. The plugin uses jQuery XColor, a JavaScript plugin that handles color calculation of overlapping highlights. Users can also turn on and off highlighting on a per tag basis.


Tagged colored highlighting (referred to as layers here) is selected from a predefined set of colors.

Linked Resources

Another customization I created was the ability to link annotations to other resources in the application, which allows users to build relationships between multiple pieces of content. This is merely an extra data point saved on the annotation itself.


A linked collage from this annotation.

Toggle Display of layered and unlayered content

One of the most difficult customization points was building out the functionality that allows users to toggle display of unlayered and layered content, meaning that after a user annotates a certain amount of text, they can hide all the unannotated text (replaced with an ellipsis). The state of the content (e.g. with unlayered text hidden) is saved and presented to other users this way, which essentially allows the author to control what text is visible to users.

Learn More

Make sure to check out the Annotator website if you are interested in learning more about this plugin. The active community has interesting support for annotating images, video, and audio, and is always focused on improving plugin capabilities. One group is currently focused on cross browser support, including support of IE8.

Liquid Galaxy installation at Sparkassen-Finanzportal Forum 2014


In May, End Point and Google organized a Liquid Galaxy installation at Sparkassen-Finanzportal Forum 2014 in Düsseldorf, Germany.

For this event End Point installed the Liquid Galaxy and also prepared custom tours showing different Sparkasse locations across Germany.

We arrived at Düsseldorf a day before the event, and assembled the whole Liquid Galaxy without any problems. The system consists of 7 displays and 6 computers, so the potential for issues is pretty great, but we've done this a number of times, and have worked out a good stable build. After assembly, the system worked pretty well. Our US-based team then finalized the custom tours and uploaded the most recent software and content.

The next morning when we arrived at the venue, everything was working great, and the system was ready for people to explore and discover. As usual, people were interested not only in the prepared tours, but also in looking up places they know (usually their own house!). The overall user experience was great, especially when people found places they hadn't seen in a long time, like where they grew up.

This is another successful conference deployment of the Liquid Galaxy platform. Thanks to all our partners in Europe and the US who helped make this happen.

Facebook, Twitter, Google Plus sharing with the URL

This blog post is intended for the folks who spent way more time displaying social sharing buttons on their websites than originally planned. My buttons were supposed to bring up the Share Dialog for Facebook, Twitter and Google Plus platforms. I had the requirement to display a custom logo and a custom description. It seemed easy... until it turned out to be rather difficult.


The appropriate format for Twitter is:

https://twitter.com/share?url=YOUR_URL&text=YOUR_DESCRIPTION

Note that Twitter dialog does not include the picture, only the description.

Trouble started with Facebook when I learned that custom parameters in Facebook's sharer.php URL are officially not supported anymore: https://developers.facebook.com/bugs/357750474364812. I tried the format widely suggested on forums, and while the custom image was successfully displayed, neither title nor description showed:

https://www.facebook.com/sharer/sharer.php?s=100&p[url]=YOUR_URL&p[images][0]=YOUR_IMAGE&p[title]=YOUR_TITLE&p[summary]=YOUR_DESCRIPTION

Facebook documentation hinted that adding the OpenGraph protocol standard tags to the <head> of the page may help:

<meta property="og:url" content="YOUR_URL" />
<meta property="og:title" content="YOUR_TITLE" />
<meta property="og:description" content="YOUR_DESCRIPTION" />
<meta property="og:image" content="YOUR_IMAGE" />

I wasn't able to get it working with sharer.php. After spending a considerable amount of time on this, I had to give up and acknowledge that the only way to fully customize the dialog is to register an app and utilize its APP_ID: https://developers.facebook.com/docs/sharing/reference/share-dialog.

I anticipated the same kind of trouble with the last button, for Google Plus. And I wasn't mistaken. The only allowed format for G+ is:

https://plus.google.com/share?url=YOUR_URL

Despite a few mentions on the web, variants of that URL with extra parameters for a custom title or image do not work anymore.

Eventually it turned out that it isn't possible to pass such parameters to the Google Plus link unless you sign up for an API key and use one of the API methods. I wasn't planning to obtain a key at that time, so I had to simply drop the custom logo and text for G+.

It looks like both Facebook and Google Plus have taken steps to restrict the free usage of their share URLs so that more people will register their apps with them.

Version differences via Github from the command line

I work with a lot of open source projects, and I use the command-line for almost everything. It often happens that I need to examine a file from a project, and thanks to bash, Github, and curl, I can do so easily, without even needing to have the repo handy. One of the things I do sometimes is compare a file across versions to see what has changed. For example, I needed to see what changes were made between versions 1.22 and 1.23 to the file includes/UserMailer.php which is part of the MediaWiki project. For this trick to work, the project must be on Github, and must label their versions in a consistent manner, either via git branches or git tags.

MediaWiki exists on Github as wikimedia/mediawiki-core. The MediaWiki project tags all of their releases in the format X.Y.Z, so in this example we can use the git tags 1.22.0 and 1.23.0. Github is very nice because you can view a specific file at a certain commit (aka a tag), and even grab it over the web as a plain text file. The format is:

https://raw.githubusercontent.com/PROJECTNAME/BRANCH-OR-TAG/FILE

Note that you can use a tag OR a branch! So to compare these two files, we can use one of these pairs:

https://raw.githubusercontent.com/wikimedia/mediawiki-core/REL1_22/includes/UserMailer.php
https://raw.githubusercontent.com/wikimedia/mediawiki-core/REL1_23/includes/UserMailer.php

https://raw.githubusercontent.com/wikimedia/mediawiki-core/1.22.0/includes/UserMailer.php
https://raw.githubusercontent.com/wikimedia/mediawiki-core/1.23.0/includes/UserMailer.php

All that is left is to treat git as a web service and compare the two files at the command line ourselves. The program curl is a great tool for downloading the files, as it dumps to stdout by default. We will add a -s flag (for "silent") to prevent it from showing the progress meter as it usually does. The last bit of the puzzle is to use <(), bash's process substitution feature, to trick diff into comparing the curl outputs as if they were files. So our final command is:

diff <(curl -s https://raw.githubusercontent.com/wikimedia/mediawiki-core/1.22.0/includes/UserMailer.php) \
<(curl -s https://raw.githubusercontent.com/wikimedia/mediawiki-core/1.23.0/includes/UserMailer.php) \
| more

Voila! A quick and simple glance at what changed between those two tags. This should work for any project on Github. You can also replace the branch or tag with the word "master" to see the current version. For example, the PostgreSQL project lives on github as postgres/postgres. They use the format RELX_Y_Z in their tags. To see what has changed since release 9.3.4 in the psql help file (as a context diff), run:

diff -c <(curl -s https://raw.githubusercontent.com/postgres/postgres/REL9_3_4/src/bin/psql/help.c) \
<(curl -s https://raw.githubusercontent.com/postgres/postgres/master/src/bin/psql/help.c)

You are not limited to diff, of course. For a final example, let's see how many times Tom Lane is mentioned in the version 9 release notes:

for i in {0,1,2,3,4}
do grep -Fc 'Tom Lane' \
<(curl -s https://raw.githubusercontent.com/postgres/postgres/master/doc/src/sgml/release-9.$i.sgml)
done
272
206
174
115
16

The last number is so low relative to the rest because 9.4 is still under development. Rest assured Tom's contributions have not slowed down! :) Thanks to Github for providing such a useful service for so many open source projects, and for providing the raw text to allow useful hacks like this.

Interchange 5.8.2 Release with Bcrypt Encryption Support

The most recent release of Interchange contains support for adding encryption to user passwords with bcrypt. This provides for a significant improvement over the previously supported encryption options. Whether you are starting with a new user table, or have a long-established user table with any of the other supported encryption types, you can take advantage of this enhancement to Vend::UserDB.

In order to utilize bcrypt, you will need to have successfully installed the Digest::Bcrypt and Crypt::Random modules. Once installed, it is recommended (though not required, ironically) you add these as Required modules to interchange.cfg to ensure Interchange can find them and will report back immediately if not:

Require module Digest::Bcrypt
Require module Crypt::Random

To use bcrypt, configure your UserDB directive with the following options:

  • bcrypt - set to 1
  • cost - desired cost between 1 and 31 (optional; current default is 13)
  • bcrypt_pepper - random string used to apply a unique padding pattern (also optional, but strongly recommended)

To encourage adoption of bcrypt, there are two mechanisms available that allow a user table already using another encryption type to convert to using bcrypt. The first option is to add the "promote" directive - set to 1 - to your UserDB configuration. This will cause passwords stored by any other encryption type to update to bcrypt upon the next successful access of that user account. This causes no disruption to site operations or your users, but it does mean you have to allow full adoption of bcrypt to advance organically as your users return to your site over time.

The second option builds on "promote" by allowing the developer to construct a background process to slurp in the existing passwords stored by another encryption type and generate a bcrypt storage value of the encrypted password by utilizing the construct_bcrypt() subroutine in Vend::UserDB. construct_bcrypt() takes a single hashref argument with a required key "password" and optional keys "type" and "profile". In this use case, password holds the encrypted password extracted from the existing user table and type contains one of "sha1", "md5", "md5_salted", or "crypt", corresponding to the encryption algorithm used to produce the password originally. The profile key maps to the profile name assigned to control this user table in the UserDB directive. Normally, this will be either "default", which controls the userdb table, or "ui", which controls the access table. Note that the profile used must already be configured to use bcrypt, or construct_bcrypt() will die. If profile is omitted, it defaults to "default". It is anticipated the developer would construct an Interchange job to query all the users and encrypted passwords still stored by the previous algorithm, pass each into construct_bcrypt(), and update the source record's password field with the return value from construct_bcrypt().
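Such a background job might look roughly like the following sketch (the table, field names, and DSN are illustrative, and the example assumes the existing passwords were stored as md5):

## A rough sketch of the migration job idea; not production code.
use strict;
use warnings;
use DBI;
use Vend::UserDB;

my $dbh = DBI->connect('dbi:Pg:dbname=mystore');  # hypothetical DSN
my $sel = $dbh->prepare('SELECT username, password FROM userdb');
my $upd = $dbh->prepare('UPDATE userdb SET password = ? WHERE username = ?');

$sel->execute;
while (my ($user, $pw) = $sel->fetchrow_array) {
    ## Build a bcrypt storage value from the existing md5-stored password
    my $new = Vend::UserDB::construct_bcrypt({
        password => $pw,
        type     => 'md5',
        profile  => 'default',
    });
    $upd->execute($new, $user);
}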

For any configuration of UserDB utilizing bcrypt, it is recommended to enable "promote". Not only is it a painless way to begin the transition of your encrypted user table to use bcrypt, it also ensures continuing access for your users even if you advance with the second option. For a large user table, it may take weeks for your background process to complete the transition. In the mean time, any users who access the site before their passwords are updated will still have access to the site using promote. Lastly, over time you will likely want to increase the cost parameter of bcrypt to keep pace with hardware advances. By using promote, users returning to your site will have their passwords updated to use the current cost in addition to new users being stored at the current cost. In short, if you bcrypt, use promote.

Example configuration changes for UserDB using bcrypt for the access table. Replace:

UserDB  ui  crypt 1

with:

UserDB  ui  bcrypt         1
UserDB  ui  promote        1
UserDB  ui  cost           13
UserDB  ui  bcrypt_pepper  mrPIQ4$.qgJrbvD(5CStB.p8b8ABbpl+:0.`I`*J`)j{=7KY*b(M@QXXg+9B*-b

Of course, your bcrypt_pepper will be some suitable random string you generate and keep private. The above is just an example.

Also be aware that the password storage will be larger than any previous stored passwords, up to 63 characters in length. So be sure your user table's password field is sized (or resized, if necessary) to hold at least 63 characters.

Peppering your Padding

To provide additional protection for weaker passwords against brute-force attacks, the process pads the password out to at least 72 characters. This padding is made considerably more effective by adding a secret pepper, which ensures the padding pattern cannot be reproduced without knowing the pepper value. This is, however, no substitute for enforcing good minimum standards on your passwords to ensure they are not weak in the first place, and it does absolutely nothing to protect against a hacker guessing passwords through the front end.

Once a password hash is created with a particular pepper (which includes the possibility of no pepper) it is permanently dependent on that value. That is to say, the pepper should be settled on prior to introducing the use of bcrypt. If the pepper is changed, any hashes created and stored with the previous pepper (or no pepper) will become inaccessible. Even "promote" won't help here because in order for promote to function, both encryption types (the original and the target) must be able to work against the supplied password. Once a new value of "bcrypt_pepper" is introduced, Vend::UserDB loses access to the old pepper and, thus, cannot use it to test the validity of the password proffered by the user against the existing stored data structure.

If it becomes unavoidable to change a pepper, a developer would have to create a new UserDB profile with the old pepper, write custom code to use the new profile to validate the password obtained from the user and, upon success, update the password field in the user table in the custom code with the new structure built with the profile holding the new pepper. This could be done using construct_bcrypt() but is in no way natively supported as a promotable option. Moreover, there is no mechanism to allow such an update to occur as a background process, as is available to migrate the other encryption types to bcrypt.