
A Beginner’s Guide to PCI DSS Compliance and TLS Versions

I recently did some research for one of End Point’s ecommerce clients on their PCI compliance and wanted to share some basic information for those of you who are new to this topic.

TLS

TLS (Transport Layer Security) is a standard for secure communication between applications. TLS is the current version of what used to be called SSL, the Secure Sockets Layer. In the case of a financial transaction, this is the communication between the website selling a product and the end user. TLS works by encrypting data between two endpoints so that any sensitive data (such as financial details and private customer information) is exchanged securely. As new weaknesses are found and security requirements increase, new versions of TLS are released. To date, TLS 1.2 is the newest version, TLS 1.1 is still considered safe, and TLS 1.0 is being phased out. For details about OS versions supporting the latest TLS standards, please see Jon Jensen’s write-up here.

Compliance with PCI DSS

As all online retailers know, becoming and staying compliant with PCI DSS (Payment Card Industry Data Security Standard) is a big job. PCI is THE ecommerce security standard and in order to accept payment with Visa, MasterCard, American Express, and Discover, you must comply with their security standards.

As the Internet security landscape changes, PCI DSS standards are updated to reflect new risks and adjustments in security protections. As of today, PCI requires vendors to upgrade to TLS 1.1 or above by June of 2016, with an optional extension until June 2018.

Compliance Assessors

Here’s where things get tricky. PCI does not actually perform compliance assessments itself; instead, each merchant must have a neutral third party help them fulfill their PCI requirements. These third parties are called ‘assessors’, and a large number of companies offer this service along with help for other security-related tasks.

In preparation for the new requirements, many of the assessor companies are including the new TLS standards in their current compliance protocols.

What does that mean? Well, it means that even though PCI might not require you to have TLS 1.1 until June of 2018, your compliance assessor might require you to do it right now.

Bite the Bullet

So, given that you know this change is coming AND you need it to get your compliance done, you might as well get your site updated. So where’s the catch?

Unsupported browsers

The big catch is that some browsers do not support TLS 1.1 or 1.2. In those cases, some of your users will not be able to complete a payment transaction; they will instead hit an error screen from which they cannot continue. The affected browsers are:

  • Internet Explorer on Windows XP
  • Internet Explorer older than version 11 on any version of Windows
  • the stock Android browser on versions of Android before 5.0
  • Safari 6 or older on Mac OS X 10.8 (Mountain Lion) or older
  • Safari on iOS 4 or older
  • very, very old versions of Firefox or Chrome that have been set not to auto-update

Okay, so how many people still use those old browsers? We’ll take a look at some of the breakdowns here:

http://www.w3schools.com/browsers/browsers_explorer.asp

You might be thinking, ‘That doesn’t seem like very many people’. And that’s true. However, every site has a different customer base, and browser use varies widely by demographics. So where can you go to find out what kinds of browsers your customers use?

Google Analytics, your old friend

If you have Google Analytics set up, you can go through the Audience/Technology/Browser & OS screens to find out what kind of impact this might have.

Plan for the Worst

Now, armed with your information, you will probably want to go ahead and get your website on the newest TLS version. The change is coming anyway, but help your staff and web team plan for the worst by making sure everyone knows about the browser limitations and can help your customers through the process.

Server Compatibility Notes

For many ecommerce sites, enabling TLS 1.1 and 1.2 is easy: just change a configuration setting and restart the web server. But on older operating systems, such as the still-supported and very popular Red Hat Enterprise Linux 5 and CentOS Linux 5, TLS 1.0 is the newest supported version. Various workarounds might be possible, but the only real solution is to migrate to a newer version of the operating system. There can be cost and time factors to consider, so it’s best to plan ahead. Ask us or your in-house developers whether a migration will be necessary!
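If you’d like a quick way to see which TLS versions a given server currently accepts, a small Ruby sketch along these lines can help (the hostname is a placeholder, and the exact version symbols available depend on your Ruby and OpenSSL builds):

require 'socket'
require 'openssl'

host = 'www.example.com' # placeholder: use your own store's hostname

[:TLSv1, :TLSv1_1, :TLSv1_2].each do |version|
  context = OpenSSL::SSL::SSLContext.new
  context.ssl_version = version
  begin
    tcp = TCPSocket.new(host, 443)
    ssl = OpenSSL::SSL::SSLSocket.new(tcp, context)
    ssl.connect
    puts "#{version}: accepted"
    ssl.close
  rescue OpenSSL::SSL::SSLError, SystemCallError => e
    puts "#{version}: rejected (#{e.class})"
  ensure
    tcp.close if tcp && !tcp.closed?
  end
end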

Need Help?

As End Point’s client liaison, I’m happy to chat with anyone who needs answers or advice about PCI DSS and your ecommerce site.

Learning from data basics - the Naive Bayes model

Have you ever wondered what the machinery is behind some of the algorithms for doing seemingly very intelligent tasks? How is it possible that a computer program can recognize faces in photos, turn an image into text, or even classify some emails as legitimate and others as spam?

Today, I’d like to present one of the simplest models for performing classification tasks. The model enables extremely fast execution, making it very practical in many use cases. The example I’ll choose will also let us extend the discussion about the optimal approach in another blog post.

The problem

Imagine that you’re working on an e-commerce store for your client. One of the requirements is to present the currently logged in user with a “promotion box” somewhere on the page. The goal is to maximize our chances of having the user put the product from the box into the basket. There’s one promotional box and a couple of different categories of products to choose the actual product from.

Thinking about the solution — using probability theory

One of the obvious directions we may want to turn towards is probability theory. If we could collect data about users’ previous choices and their characteristics, we could use probability to select the product category best suited to the current user. We would then choose a product from this category that currently has an active promotion.

Quick theory refresher for programmers

As we'll be exploring the probability approaches using Ruby code, I'd like to very quickly walk you through some of the basic concepts we will be using from now on.

Random variables

The simplest probability scenario many of us are already familiar with is the distribution of coin toss results. Here we're tossing a coin, noting whether we get heads or tails. In this experiment, we call "got heads" and "got tails" probability events. We can also shift the terminology a bit and call them two values of the "toss result" random variable.

So in this case we'd have a random variable, let's call it T (for "toss"), that can take the values "heads" or "tails". We then define the probability distribution P(T) as a function from the random variable's values to real numbers between 0 and 1 inclusive. In the real world, the probability values after e.g. 10,000 tosses might look like the following:

+-------+---------------------+
| toss  | value               |
+-------+---------------------+
| heads | 0.49929999999999947 |
| tails |   0.500699999999998 |
+-------+---------------------+

These values get closer and closer to 0.5 as the number of tosses grows.
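A tiny Ruby sketch shows how numbers like these come about; we simply count the outcomes and divide by the number of tosses:

counts = Hash.new(0)
tosses = 10_000

# Simulate the tosses and count how often each outcome shows up
tosses.times { counts[[:heads, :tails].sample] += 1 }

# Turn the counts into empirical probabilities
distribution = counts.map { |outcome, count| [outcome, count.to_f / tosses] }.to_h
# => e.g. {:heads=>0.5007, :tails=>0.4993}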

Factors and probability distributions

We've shown a simple probability distribution. To ease comprehension of the Ruby code we'll be working with, let me introduce the notion of a factor. We called the "table" from the last example a probability distribution. The table represented a function from a random variable's values to real numbers in the range [0, 1]. A factor is a generalization of that notion: it's a function from the same domain, but returning any real number. We'll explore the usefulness of this notion in some of our next articles.

The probability distribution is a factor that adds two constraints:

  • its values are always in the range [0, 1] inclusive
  • the sum of all its values is exactly 1

Simple Ruby modeling of random variables and factors

We need to have some ways of computing probability distributions. Let's define some simple tools we'll be using in this blog series:

# Let's define a simple version of the random variable
# - one that will hold discrete values
class RandomVariable
  attr_accessor :values, :name

  def initialize(name, values)
    @name = name
    @values = values
  end
end

# The following class really represents here a probability
# distribution. We'll adjust it in the next posts to make
# it match the definition of a "factor". We're naming it this
# way right now as every probability distribution is a factor
# too.
class Factor
  attr_accessor :_table, :_count, :variables

  def initialize(variables)
    @_table = {}
    @_count = 0.0
    @variables = variables
    initialize_table
  end

  # We're choosing to represent the factor / distribution
  # here as a table with value combinations in one column
  # and probability values in another. Technically, we're using
  # Ruby's Hash. The following method builds the initial hash
  # with all the possible keys and values assigned to 0:
  def initialize_table
    variables_values = @variables.map do |var|
      var.values.map do |val|
        { var.name.to_sym => val }
      end.flatten
    end # [ [ { name: value } ] ]   
    @_table = variables_values[1..(variables_values.count)].inject(variables_values.first) do |all_array, var_arrays|
      all_array = all_array.map do |ob|
        var_arrays.map do |var_val|
          ob.merge var_val
        end
      end.flatten
      all_array
    end.inject({}) { |m, item| m[item] = 0; m }
  end

  # The following method adjusts the factor by adding information
  # about an observed combination of values. This in turn adjusts probability
  # values for all the entries:
  def observe!(observation)
    if !@_table.has_key? observation
      raise ArgumentError, "Doesn't fit the factor - #{@variables} for observation: #{observation}"
    end

    @_count += 1

    @_table.keys.each do |key|
      observed = key == observation
      @_table[key] = (@_table[key] * (@_count == 0 ? 0 : (@_count - 1)) + 
       (observed ? 1 : 0)) / 
         (@_count == 0 ? 1 : @_count)
    end

    self
  end

  # Helper method for getting all the possible combinations
  # of random variable assignments
  def entries
    @_table.each
  end

  # Helper method for testing purposes. Sums the values for the whole
  # distribution - it should return 1 (close to 1 due to how computers
  # handle floating point operations)
  def sum
    @_table.values.inject(:+)
  end

  # Returns a probability of a given combination happening
  # in the experiment
  def value_for(key)
    if @_table[key].nil?
      raise ArgumentError, "Doesn't fit the factor - #{@variables} for: #{key}"
    end
    @_table[key]
  end

  # Helper method for testing purposes. Returns a table object
  # ready to be printed to stdout. It shows the whole distribution
  # as a table with some columns being random variable values and
  # the last one being the probability value
  def table
    rows = @_table.keys.map do |key|
      key.values << @_table[key]
    end
    table = Terminal::Table.new rows: rows, headings: ( @variables.map(&:name) << "value" )
    table.align_column(@variables.count, :right)
    table
  end

  protected

  def entries=(_entries)
    _entries.each do |entry|
      @_table[entry.keys.first] = entry.values.first
    end
  end

  def count
    @_count
  end

  def count=(_count)
    @_count = _count
  end
end

Notice that we're using the terminal-table gem here as a helper for printing out the factors in an easy-to-grasp fashion. You'll need the following requires:

require 'rubygems'
require 'terminal-table'
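As a quick sanity check of the classes above, we can reproduce the coin toss distribution from the beginning of the article. Both values should drift towards 0.5 and the sum should stay at (approximately) 1:

toss      = RandomVariable.new :toss, [ :heads, :tails ]
toss_dist = Factor.new [ toss ]

# Observe 10,000 simulated tosses
10_000.times do
  toss_dist.observe! toss: [ :heads, :tails ].sample
end

puts toss_dist.table
puts toss_dist.sum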

The scenario setup

Let’s imagine that we have the following categories to choose from:

category = RandomVariable.new :category, [ :veggies, :snacks, :meat, :drinks, :beauty, :magazines ]

And the following user features on each request:

age      = RandomVariable.new :age,      [ :teens, :young_adults, :adults, :elders ]
sex      = RandomVariable.new :sex,      [ :male, :female ]
relation = RandomVariable.new :relation, [ :single, :in_relationship ]
location = RandomVariable.new :location, [ :us, :canada, :europe, :asia ]

Let's define a data model that logically resembles the one we could have in our real e-commerce application:

class LineItem
  attr_accessor :category

  def initialize(category)
    self.category = category
  end
end

class Basket
  attr_accessor :line_items

  def initialize(line_items)
    self.line_items = line_items
  end
end

class User
  attr_accessor :age, :sex, :relationship, :location, :baskets

  def initialize(age, sex, relationship, location, baskets)
    self.age = age
    self.sex = sex
    self.relationship = relationship
    self.location = location
    self.baskets = baskets
  end
end

We want to utilize a user’s baskets in order to infer the most probable value for a category, given a set of user features. In our example, we can imagine that we’re offering authentication via Facebook. We can grab info about a user’s sex, location, age and whether she/he is in a relationship or not. We want to find the category that’s chosen the most by users with a given set of features.

As we don't have any real data to play with, we'll need a generator to create fake data with certain characteristics. Let's first define a helper class with a method that will allow us to choose a value out of a given list of options along with their weights:

class Generator
  def self.pick(options)
    items = options.inject([]) do |memo, keyval|
      key, val = keyval
      memo << Array.new(val, key)
      memo
    end.flatten
    items.sample
  end
end
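A quick, purely illustrative way to see how the weights work (the exact counts will vary between runs):

picks = Hash.new(0)
10_000.times { picks[Generator.pick(male: 4, female: 6)] += 1 }
picks
# => e.g. {:female=>6023, :male=>3977} - roughly the 40/60 split the weights ask for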

With all the above we can define a random data generation model:

class Model

  # Let's generate `num` users (1000 by default)
  def self.generate(num = 1000)
    num.times.to_a.map do |user_index|
      gen_user
    end
  end

  # Returns a user with randomly selected traits and baskets
  def self.gen_user
    age = gen_age
    sex = gen_sex
    rel = gen_rel(age)
    loc = gen_loc
    baskets = gen_baskets(age, sex)

    User.new age, sex, rel, loc, baskets
  end

  # Randomly select a sex, with a 40% chance of getting a male
  def self.gen_sex
    Generator.pick male: 4, female: 6
  end

  # Randomly select an age, with a 50% chance of getting a teen
  # (among other options and weights)
  def self.gen_age
    Generator.pick teens: 5, young_adults: 2, adults: 2, elders: 1
  end

  # Randomly select a relationship status.
  # The chance of getting a given option depends on the user's age
  def self.gen_rel(age)
    case age
      when :teens        then Generator.pick single: 7, in_relationship: 3
      when :young_adults then Generator.pick single: 4, in_relationship: 6
      else                    Generator.pick single: 8, in_relationship: 2
    end
  end

  # Randomly select a location, with a 40% chance of getting the United States
  # (among other options and weights)
  def self.gen_loc
    Generator.pick us: 4, canada: 3, europe: 1, asia: 2
  end

  # Randomly select 20 basket line items.
  # The chance of getting a given option depends on the user's age and sex
  def self.gen_items(age, sex)
    num = 20

    num.times.to_a.map do |i|
      if (age == :teens || age == :young_adults) && sex == :female
        Generator.pick veggies: 1, snacks: 3, meat: 1, drinks: 1, beauty: 9, magazines: 6
      elsif age == :teens  && sex == :male
        Generator.pick veggies: 1, snacks: 6, meat: 4, drinks: 1, beauty: 1, magazines: 4
      elsif (age == :young_adults || age == :adults) && sex == :male
        Generator.pick veggies: 1, snacks: 4, meat: 6, drinks: 6, beauty: 1, magazines: 1
      elsif (age == :young_adults || age == :adults) && sex == :female
        Generator.pick veggies: 4, snacks: 4, meat: 2, drinks: 1, beauty: 6, magazines: 3
      elsif age == :elders && sex == :male
        Generator.pick veggies: 6, snacks: 2, meat: 2, drinks: 2, beauty: 1, magazines: 1
      elsif age == :elders && sex == :female
        Generator.pick veggies: 8, snacks: 1, meat: 2, drinks: 1, beauty: 4, magazines: 1
      else
        Generator.pick veggies: 1, snacks: 1, meat: 1, drinks: 1, beauty: 1, magazines: 1
      end
    end.map do |cat|
      LineItem.new cat
    end
  end

  # Randomly select 5 baskets, with their contents depending on the user's
  # age and sex
  def self.gen_baskets(age, sex)
    num = 5

    num.times.to_a.map do |i|
      Basket.new gen_items(age, sex)
    end
  end
end

Where is the complexity?

The approach described above doesn’t seem that exciting or complex. Usually reading about probability theory applied in the field of machine learning requires going through quite a dense set of mathematical notions. The field is also being actively worked on by researchers. This implies a huge complexity — certainly not the simple definition of probability that we got used to in high school.

The problem becomes a bit more complex if you consider the efficiency of computing the probabilities. In our example, the joint probability distribution — to fully describe the scenario — needs to specify probability values for 383 cases:

p(:veggies, :teens, :male, :single, :us) # one of 384 combinations

Given that a probability distribution has to sum up to 1, the last case can be fully inferred from the sum of all the others. This means that we need 6 * 4 * 2 * 2 * 4 - 1 = 383 parameters in the model: 6 categories, 4 age classes, 2 sexes, 2 relationship kinds and 4 locations. Imagine adding one additional 4-valued feature (a season). This would grow the number of parameters to 1535. And this is a very simple training example. We could have a model with close to 100 different features. The number of parameters would clearly be unmanageable even on the biggest servers we could put them on. This approach would also make it very painful to add additional features to the model.

Very simple but powerful optimization: The Naive Bayes model

In this section I’m going to present you with an equation we’ll be working with when optimizing our example. I’m not going to explain the mathematics behind it, as you can easily read about them on e.g. Wikipedia.

The approach is called the Naive Bayes model. It is used e.g. in spam filters, and it has also been used in the medical diagnosis field.

It allows us to present the full probability distribution as a product of factors:

p(cat, age, sex, rel, loc) == p(cat) * p(age | cat) * p(sex | cat) * p(rel | cat) * p(loc | cat)

Where e.g. p(age | cat) represents the probability of a user being a certain age given that this user selects cat products most frequently; this is a conditional probability of a feature given the class (the likelihood of the feature). The above equation states that we can simplify the distribution to be a product of a number of much more easily manageable factors.

The category in our example is often called a class, and the rest of the random variables in the distribution are often called features.

In our example, the number of parameters we’ll need to manage when presenting the distribution in this form drops to:

(6 - 1) + (6 * 4 - 1) + (6 * 2 - 1) + (6 * 2 - 1) + (6 * 4 - 1) == 73

That’s just around 19% of the original amount! Also, adding another variable (season) would only add 23 new parameters (compared to 1152 in the full distribution case).
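We can double-check that arithmetic with a few lines of Ruby (the cardinalities come from the random variables defined earlier):

cards = { category: 6, age: 4, sex: 2, relation: 2, location: 4 }

full  = cards.values.inject(:*) - 1
# => 383

naive = (cards[:category] - 1) +
        [:age, :sex, :relation, :location].map { |f| cards[:category] * cards[f] - 1 }.inject(:+)
# => 73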

The Naive Bayes model limits the number of parameters we have to manage but it comes with very strong assumptions about the variables involved: in our example, that the user features are conditionally independent given the resulting category. Later on I’ll show why this isn’t true in this case even though the results will still be quite okay.

Implementing the Naive Bayes model

As we now have all the tools we need, let's get back to probability theory to figure out how best to model Naive Bayes in terms of the Ruby building blocks defined above.

The approach says that, under the assumptions we discussed, we can approximate the original distribution as the product of factors:

p(cat, age, sex, rel, loc) = p(cat) * p(age | cat) * p(sex | cat) * p(rel | cat) * p(loc | cat)

Given the definition of conditional probability, we have:

p(a | b) = p(a, b) / p(b)

Thus, we can express the approximation as:

p(cat, age, sex, rel, loc) = p(cat) * ( p(age, cat) / p(cat) ) * ( p(sex, cat) / p(cat) ) * ( p(rel, cat) / p(cat) ) * ( p(loc, cat) / p(cat) )

And then simplify it even further as:

p(cat, age, sex, rel, loc) = p(age, cat) * ( p(sex, cat) / p(cat) ) * ( p(rel, cat) / p(cat) ) * ( p(loc, cat) / p(cat) )

Let's define all the factors we will need:

cat_dist     = Factor.new [ category ]
age_cat_dist = Factor.new [ age, category ]
sex_cat_dist = Factor.new [ sex, category ]
rel_cat_dist = Factor.new [ relation, category ]
loc_cat_dist = Factor.new [ location, category ]

Also, we want the full distribution so that we can compare the results:

full_dist = Factor.new [ category, age, sex, relation, location ]

Let's generate 1000 random users and, looping through them and their baskets, adjust the probability distributions for the combinations of product categories and user traits:

Model.generate(1000).each do |user|
  user.baskets.each do |basket|
    basket.line_items.each do |item|
      cat_dist.observe! category: item.category
      age_cat_dist.observe! age: user.age, category: item.category
      sex_cat_dist.observe! sex: user.sex, category: item.category
      rel_cat_dist.observe! relation: user.relationship, category: item.category
      loc_cat_dist.observe! location: user.location, category: item.category
      full_dist.observe! category: item.category, age: user.age, sex: user.sex,
        relation: user.relationship, location: user.location
    end
  end
end

We can now print the distributions as tables to get some insight into the data:

[ cat_dist, age_cat_dist, sex_cat_dist, rel_cat_dist, 
  loc_cat_dist, full_dist ].each do |dist|
    puts dist.table
    # Let's print out the sum of all entries to ensure the
    # algorithm works well:
    puts dist.sum
    puts "\n\n"
end

Which yields the following to the console (the full distribution is truncated due to its size):

+-----------+---------------------+
| category  | value               |
+-----------+---------------------+
| veggies   |             0.10866 |
| snacks    | 0.19830999999999863 |
| meat      |             0.14769 |
| drinks    | 0.10115999999999989 |
| beauty    |             0.24632 |
| magazines | 0.19785999999999926 |
+-----------+---------------------+
0.9999999999999978

+--------------+-----------+----------------------+
| age          | category  | value                |
+--------------+-----------+----------------------+
| teens        | veggies   |  0.02608000000000002 |
| teens        | snacks    |  0.11347999999999969 |
| teens        | meat      |  0.06282999999999944 |
| teens        | drinks    |   0.0263200000000002 |
| teens        | beauty    |   0.1390699999999995 |
| teens        | magazines |  0.13322000000000103 |
| young_adults | veggies   | 0.010250000000000023 |
| young_adults | snacks    |  0.03676000000000003 |
| young_adults | meat      |  0.03678000000000005 |
| young_adults | drinks    |  0.03670000000000045 |
| young_adults | beauty    |  0.05172999999999976 |
| young_adults | magazines | 0.035779999999999916 |
| adults       | veggies   | 0.026749999999999927 |
| adults       | snacks    |  0.03827999999999962 |
| adults       | meat      | 0.034600000000000505 |
| adults       | drinks    | 0.028190000000000038 |
| adults       | beauty    |  0.03892000000000036 |
| adults       | magazines |  0.02225999999999998 |
| elders       | veggies   |  0.04558000000000066 |
| elders       | snacks    | 0.009790000000000047 |
| elders       | meat      | 0.013480000000000027 |
| elders       | drinks    | 0.009949999999999931 |
| elders       | beauty    | 0.016600000000000226 |
| elders       | magazines | 0.006600000000000025 |
+--------------+-----------+----------------------+
1.0000000000000013

+--------+-----------+----------------------+
| sex    | category  | value                |
+--------+-----------+----------------------+
| male   | veggies   |  0.03954000000000044 |
| male   | snacks    |   0.1132499999999996 |
| male   | meat      |  0.10851000000000031 |
| male   | drinks    |                0.073 |
| male   | beauty    | 0.023679999999999857 |
| male   | magazines |  0.05901999999999993 |
| female | veggies   |  0.06911999999999997 |
| female | snacks    |  0.08506000000000069 |
| female | meat      |  0.03918000000000006 |
| female | drinks    |  0.02816000000000005 |
| female | beauty    |  0.22264000000000062 |
| female | magazines |  0.13884000000000046 |
+--------+-----------+----------------------+
1.000000000000002

+-----------------+-----------+----------------------+
| relation        | category  | value                |
+-----------------+-----------+----------------------+
| single          | veggies   |  0.07722000000000082 |
| single          | snacks    |  0.13090999999999794 |
| single          | meat      |  0.09317000000000061 |
| single          | drinks    | 0.059979999999999915 |
| single          | beauty    |  0.16317999999999971 |
| single          | magazines |  0.13054000000000135 |
| in_relationship | veggies   | 0.031440000000000336 |
| in_relationship | snacks    |  0.06740000000000032 |
| in_relationship | meat      | 0.054520000000000006 |
| in_relationship | drinks    |  0.04118000000000009 |
| in_relationship | beauty    |  0.08314000000000002 |
| in_relationship | magazines |  0.06732000000000182 |
+-----------------+-----------+----------------------+
1.000000000000003

+----------+-----------+----------------------+
| location | category  | value                |
+----------+-----------+----------------------+
| us       | veggies   |  0.04209000000000062 |
| us       | snacks    |  0.07534000000000109 |
| us       | meat      | 0.055059999999999984 |
| us       | drinks    |  0.03704000000000108 |
| us       | beauty    |  0.09879000000000099 |
| us       | magazines |  0.07867999999999964 |
| canada   | veggies   | 0.027930000000000062 |
| canada   | snacks    |  0.05745999999999996 |
| canada   | meat      |  0.04288000000000003 |
| canada   | drinks    |  0.03078999999999948 |
| canada   | beauty    |  0.06397999999999997 |
| canada   | magazines | 0.053959999999999675 |
| europe   | veggies   | 0.013110000000000132 |
| europe   | snacks    |   0.0223200000000001 |
| europe   | meat      |  0.01730000000000005 |
| europe   | drinks    | 0.011859999999999964 |
| europe   | beauty    | 0.025490000000000183 |
| europe   | magazines | 0.020920000000000164 |
| asia     | veggies   |  0.02552999999999989 |
| asia     | snacks    |  0.04319000000000044 |
| asia     | meat      |  0.03244999999999966 |
| asia     | drinks    |  0.02147000000000005 |
| asia     | beauty    |  0.05805999999999953 |
| asia     | magazines |   0.0442999999999999 |
+----------+-----------+----------------------+
1.0000000000000029

+-----------+--------------+--------+-----------------+----------+------------------------+
| category  | age          | sex    | relation        | location | value                  |
+-----------+--------------+--------+-----------------+----------+------------------------+
| veggies   | teens        | male   | single          | us       |  0.0035299999999999936 |
| veggies   | teens        | male   | single          | canada   |  0.0024500000000000073 |
| veggies   | teens        | male   | single          | europe   |  0.0006999999999999944 |
| veggies   | teens        | male   | single          | asia     |  0.0016699999999999899 |
| veggies   | teens        | male   | in_relationship | us       |   0.001340000000000006 |
| veggies   | teens        | male   | in_relationship | canada   |  0.0010099999999999775 |
| veggies   | teens        | male   | in_relationship | europe   |  0.0006499999999999989 |
| veggies   | teens        | male   | in_relationship | asia     |   0.000819999999999994 |

(... many rows ...)

| magazines | elders       | male   | in_relationship | asia     | 0.00012000000000000163 |
| magazines | elders       | female | single          | us       |  0.0007399999999999966 |
| magazines | elders       | female | single          | canada   |  0.0007000000000000037 |
| magazines | elders       | female | single          | europe   |  0.0003199999999999965 |
| magazines | elders       | female | single          | asia     |  0.0005899999999999999 |
| magazines | elders       | female | in_relationship | us       |  0.0004899999999999885 |
| magazines | elders       | female | in_relationship | canada   | 0.00027000000000000114 |
| magazines | elders       | female | in_relationship | europe   | 0.00012000000000000014 |
| magazines | elders       | female | in_relationship | asia     | 0.00012000000000000014 |
+-----------+--------------+--------+-----------------+----------+------------------------+
1.0000000000000004

Let's define a Proc for inferring categories based on user traits as evidence:

infer = -> (age, sex, rel, loc) do

  # Let's map through the possible categories and the probability
  # values the distributions assign to them:
  all = category.values.map do |cat|
    pc  = cat_dist.value_for category: cat
    pac = age_cat_dist.value_for age: age, category: cat
    psc = sex_cat_dist.value_for sex: sex, category: cat
    prc = rel_cat_dist.value_for relation: rel, category: cat
    plc = loc_cat_dist.value_for location: loc, category: cat

    { category: cat, value: (pac * psc/pc * prc/pc * plc/pc) }
  end

  # Let's do the same with the full distribution to be able to compare
  # the results:
  all_full = category.values.map do |cat|
    val = full_dist.value_for category: cat, age: age, sex: sex,
            relation: rel, location: loc

    { category: cat, value: val }
  end

  # Here we're getting the most probable categories based on the
  # Naive Bayes distribution approximation model and based on the full
  # distribution:
  win      = all.max      { |a, b| a[:value] <=> b[:value] }
  win_full = all_full.max { |a, b| a[:value] <=> b[:value] }

  puts "Best match for #{[ age, sex, rel, loc ]}:"
  puts "   #{win[:category]} => #{win[:value]}"
  puts "Full pointed at:"
  puts "   #{win_full[:category]} => #{win_full[:value]}\n\n"
end

The results

We're now ready to use the model and see how well Naive Bayes performs in this particular scenario:

infer.call :teens, :male, :single, :us
infer.call :young_adults, :male, :single, :asia
infer.call :adults, :female, :in_relationship, :europe
infer.call :elders, :female, :in_relationship, :canada

This gave the following results on the console:

Best match for [:teens, :male, :single, :us]:
   snacks => 0.016252573282200262
Full pointed at:
   snacks => 0.01898999999999971

Best match for [:young_adults, :male, :single, :asia]:
   meat => 0.0037455794492659757
Full pointed at:
   meat => 0.0017000000000000016

Best match for [:adults, :female, :in_relationship, :europe]:
   beauty => 0.0012287311061725868
Full pointed at:
   beauty => 0.0003000000000000026

Best match for [:elders, :female, :in_relationship, :canada]:
   veggies => 0.002156365730474441
Full pointed at:
   veggies => 0.0013500000000000022

That's quite impressive! Even though we're using a simplified model to approximate the original distribution, the algorithm managed to infer the correct values in all cases. You can also notice that the results differ only by a couple of cases in 1000.

An approximation like that would certainly be very useful in a more complex e-commerce scenario, where the number of evidence variables would be big enough to be unmanageable with the full distribution. There are use cases, though, where a couple of errors in 1000 cases would be too many — the traditional example is medical diagnosis. There are also cases where the number of errors would be much greater, simply because the Naive Bayes assumption of conditional independence of variables is not always a fair assumption. Is there a way to improve?

The Naive Bayes assumption says that the distribution factorizes the way we did it only if the features are conditionally independent given the category. The notion of conditional independence (apart from the formal mathematical definition) suggests that if some variables a and b are conditionally independent given c, then once we know the value of c, no additional information about b can alter our knowledge about a. In our example, knowing the category, let's say :beauty, doesn't mean that e.g. sex is independent of age. In real-world examples, it's often very hard to find a use case for Naive Bayes that follows the assumption in all cases.
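For reference, the formal statement is short. Variables a and b are conditionally independent given c when, for all of their values:

p(a, b | c) = p(a | c) * p(b | c)

or, equivalently, once c is known, learning b tells us nothing new about a:

p(a | b, c) = p(a | c)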

There are alternative approaches that allow us to apply assumptions that follow the chosen data set more closely. We will explore these in the next articles, building on top of what we saw here.

Creating a video player with time markers - step by step

Introduction

Today we will show you how to create a video player with time markers using only JavaScript and HTML5. The libraries that we will use have proven to be stable enough for production projects. What do we want to achieve? The final result is visible below:

To simplify this tutorial (or to make it harder for some of you :)) we won't use any package management tools. The demo is available on GitHub here: https://github.com/peter-hank/video-with-markers

Requirements

We will need some libraries (all of these are free to use in commercial projects):

  • Video.js - the HTML5 video player,
  • videojs-markers - a Video.js plugin that displays markers on the player's progress bar,
  • jQuery - required by the videojs-markers plugin.

Step 1 - creating a project skeleton

Let's create a new folder for our project and call it video-with-markers. Inside, let's create a new file called "index.html" and three folders: "css", "js" and "var".

We also need to copy the library files into the proper directories: the CSS files into "css" and the JavaScript files into "js", matching the paths referenced from index.html below.

Let's open the index.html file and fill it with some basic structure that includes the library files too.

<!doctype html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta http-equiv="x-ua-compatible" content="ie=edge"/>
  <title>Video with markers</title>
  <meta name="description" content=""/>
  <meta name="viewport" content="width=device-width, initial-scale=1"/>
  <link rel="stylesheet" href="css/video-js.min.css"/>
  <link rel="stylesheet" href="css/videojs.markers.min.css"/>
  <script src="js/jquery-2.0.3.min.js"></script>
  <script src="js/video.min.js"></script>
  <script src="js/videojs-markers.min.js"></script>
 </head>
 <body>
 </body>
</html>

Step 2 - activating player and creating markers

TIP: all of the following code should be put inside the body tag.

First, we need to create a video tag. A minimal version, using the id that the JavaScript further below expects, might look like this:

<video id="example_video_1" class="video-js vjs-default-skin"
       controls preload="auto" width="640" height="264"
       data-setup='{}'>
  <source src="video.mp4" type="video/mp4"/>
</video>

We need to set the element's "id" attribute and both classes "video-js vjs-default-skin" so the player CSS applies. The other attributes are not required and are here just for demo purposes. The "data-setup" attribute configures multiple player options. More details can be found in the documentation here: http://docs.videojs.com/docs/guides/options.html.

At this point, after launching index.html in a browser, we should see a video player like this:

The only thing left is initializing the markers. We can do this by adding some JavaScript code under the video player element:

// load the video object
var video = videojs('example_video_1');

// load markers
video.markers({
    markers: [
        {time: 9.5,  text: "this"},
        {time: 16,   text: "is"},
        {time: 23.6, text: "so"},
        {time: 28,   text: "cool"}
    ]
});

In this part we load the Video.js player object into the video variable and then load a list of markers. Each marker object includes the time in seconds at which it should be shown and the marker text.

Now the markers should be visible, like in this image, and the marker text will show up when hovering over them:

It needs some additional CSS styling to achieve a result like in the first image, but it's not much work.

The end

Almost the end. There are many more Video.js plugins ready to use, for example:

  • videojs-HDtoggle - button which toggles between HD and non-HD source,
  • videojs-playlist - plays videos continuously or by selecting them,
  • videojs-watermark - displays a watermark on top of the video.

With not much effort, most video player functionality can be put in place. There is always some tweaking, but at least you have a base for creating your project. And remember, if you find bugs, report them! And if you have implemented something cool, share it!

Thanks for reading.

Spree Admin pages unreachable (500 errors)

I was notified a few minutes ago by one of our Spree clients that their admin interface was unreachable due to errors.

Digging into the logs, I discovered SocketErrors (DNS lookup failures) were behind the 500 errors. Digging deeper, I discovered the SocketErrors were coming from a Spree file attempting to access "alerts.spreecommerce.com". I confirmed in my browser that alerts.spreecommerce.com fails to resolve.

This git commit discusses the removal of the class, but if you haven't stayed current and you've left the "Check for alerts" box checked, you may need to do some manual editing of your stored preferences to get the UI to load again.

Spree::Preference.where(key: "spree/app_configuration/check_for_spree_alerts").first.update_attributes(value: false)

It does appear that your app will need to restart to pull in this change.

I'm not sure what the chances are that your particular config key might vary, so please use the above with caution.

QuickCheck - property based testing in Haskell and JavaScript

In my last article, I presented a functional programming pattern. The goal was to reach out to developers who weren’t familiar with advanced type systems like the one found in Haskell and make them a bit curious. This time I’d like to take a step further and present a testing approach coming from the same world, one that can be used with mainstream languages with great success.

Many ways to test the code

The importance of testing is almost a cliché nowadays. Because testing is so important, a large number of testing frameworks and paradigms have been created. On the paradigm level we have notions like TDD and BDD. On the level of implementations we have hundreds of projects for each language, like RSpec in Ruby and Jasmine or Mocha in JavaScript.

The ideas behind the libraries don’t differ that much. All of them are based on the idea of providing code examples with assertions on how the code should behave in these particular cases.

A bit more revolutionary in its approach was the Cucumber project. In its essence, it allows business people to express the system logic by stating it in specially formed, plain English. An example taken from the Cucumber’s website reads:

Feature: Refund item

  Scenario: Jeff returns a faulty microwave
    Given Jeff has bought a microwave for $100
    And he has a receipt
    When he returns the microwave
    Then Jeff should be refunded $100

In this article, I’d like to present an approach that was conceived in the realm of Haskell — the purely functional, statically typed language. Though it started just as a Haskell library, today it could be broadly named an “approach”. We now have implementations for almost every major language. This approach is what is known as QuickCheck.

Code testing limitations

Having good test coverage is a sign of potentially stable code. However, even well-tested software needs to be improved occasionally as new bugs are discovered. This happens even in the projects with the largest test suites. Moreover, the tests are code too — they are also prone to being invalid. But is our solution really valid when all our tests are valid and the code passes all assertions? Realistically, we can only provide a few examples per use case at most. The best we can do is to choose the most probable use cases and the ones we have a hunch might go wrong.

This implies that for the test suite to guard us against bugs, we need to have an insight as to where the potential bugs may be before even testing. Isn’t the definition of a bug telling a different story, though? If we knew where to find them, we would fix them in the first place. In reality, systems grow so complex that we can only have a feeling of what might go wrong.

In 1969, Edsger W. Dijkstra said at the “NATO Software Engineering Conference”:

Testing shows the presence, not the absence of bugs.

I’ve got lots of ammo, give me a machine gun for killing bugs!

What if we could ask the computer to come up with many different use cases for us? Instead of having 2 or 3 cases per code aspect, we’d have 100 of them, including ones we’d never consider ourselves. We could describe properties of the code that we’d like to hold for all of the randomly chosen cases. That is exactly the idea behind QuickCheck. In its original Haskell form, it takes advantage of Haskell’s powerful type system and generates a random set of function arguments based on their types. It then runs the property check for all of them and stops if it’s able to find a counterexample that falsifies the property.

If we compare coming up with traditional test cases to shooting with a pistol, the QuickCheck approach would have to be called firing with a machine gun. The reason is that we’re not focusing on a specific use case, but on certain properties of the code that have to hold for any argument from the accepted domain.

One of the most basic examples we could show is ensuring that the reverse function applied twice returns the original array (pseudo-code):

arr1 == reverse(reverse(arr1))

The idea here is to make sure this property holds for a large number of randomly selected arguments from the domain. In this example the checker would randomly generate e.g. 100 arrays and test whether the assertion evaluates to true for every one of them.

Working example — the Haskell way

Let’s take a look at how the approach is used in its original environment. Later on we'll see how the pattern can be used when coding in JavaScript. For this, let’s imagine that we’re developing a graph data structure to be used in some e-commerce project we’re working on. Here’s a basic, very incomplete draft:

module BlogGraph where

import Data.Map as Map
import Data.List as List

-- Graph as an adjacency list as described here:
-- https://en.wikipedia.org/wiki/Adjacency_list

data Graph a = Graph (Map a [a]) deriving (Show)

empty :: Graph a
empty = Graph Map.empty

insertNode :: (Ord a) => a -> Graph a -> Graph a
insertNode node (Graph m) = Graph $ Map.insert node [] m

removeNode :: (Ord a) => a -> Graph a -> Graph a
removeNode node (Graph m) = Graph $ Map.delete node m

insertEdge :: (Ord a) => a -> a -> Graph a -> Graph a
insertEdge parent child (Graph m) =
  Graph $ Map.insertWithKey update parent [child] m
  where
    update _ _ old = List.nub $ child:old

nodes :: Graph a -> [a]
nodes (Graph m) = Map.keys m

If you’re not proficient yet in reading Haskell code: we’re just using a Map where keys are integers and values are lists of integers to implement the graph as an adjacency list. So each node has its representation in the map as one of its keys. Also, each edge has its representation as a child stored in the parent’s list in the map.

You might be able to find a silly bug in the removeNode function. It doesn’t remove the node from the edge definitions of other nodes. We’ll use QuickCheck to show how this could be found automatically.

Before doing that, let’s warm up by adding two simple properties:

prop_insert_empty :: Int -> Bool
prop_insert_empty i =
  nodes (insertNode i BlogGraph.empty) == [i]

prop_insert_existing :: Int -> Bool
prop_insert_existing i =
  nodes (insertNode i $ insertNode i BlogGraph.empty) == [i]

Properties are just simple functions returning true or false. They take arguments which are randomly provided later on by the QuickCheck library.

The first property says that adding a node to an empty graph will always produce a one-node graph. The second one, that adding a node to a graph that already has this node will always return the same unmodified graph.

We can successfully run these cases:

quickCheck prop_insert_empty
quickCheck prop_insert_existing

Now we should add a property stating that whenever a node is removed from the graph, all references to this node in the edge definitions of other nodes are also removed:

prop_remove_removes_edges :: Graph Int -> Bool
prop_remove_removes_edges (Graph m) =
  List.null (nodes graph) || List.notElem node elemsAfter
  where 
    graph = Graph m
    node = List.head $ BlogGraph.nodes graph
    elemsAfter = List.concat $ Map.elems mapAfter
    mapAfter =
      case removeNode node graph of
        (Graph m) -> m

As I wrote before, these property-testing functions are run by the QuickCheck framework repeatedly with randomly generated values as arguments. Out of the box we’re able to generate random examples for many simple types — including e.g. Int. That’s the reason we were able to just specify properties depending on random Int variables — without any additional code. But with the last example, we’re asking QuickCheck to generate a set of random graphs. We need to tell it how to construct a random graph first:

arbitrarySizedIntGraph :: Int -> Gen (Graph Int)
arbitrarySizedIntGraph s = do
  nodes <- vectorOf s $ choose (0, 32000)
  edges <- edges nodes
  let withNodes = List.foldr insertNode BlogGraph.empty nodes
  return $ List.foldr addEdge withNodes edges
  where
    addEdge (parent, child) = insertEdge parent child
    edges nodes = do
      parents <- sublistOf nodes
      let children = nodes List.\\ parents
      return [ (parent, child) | parent <- parents, child <- children ]

instance Arbitrary (Graph Int) where
  arbitrary = sized arbitrarySizedIntGraph

The above generator will be good enough for our case. It generates graphs of varying size. A sublist of the nodes is chosen as parents, and every parent is connected to all of the remaining non-parent nodes.

When we try to run the test we get:

Failed! Falsifiable (after 3 tests): 
Graph (fromList [(10089,[]),(25695,[10089])])

QuickCheck shows that the property doesn’t hold for the whole domain — it failed after 3 examples. It also prints the example for which our property did not hold.

We can now reexamine the code for the removeNode function and fix it as per the property’s specification:

removeNode :: (Ord a) => a -> Graph a -> Graph a
removeNode node (Graph m) =
  Graph $ Map.map remNode $ Map.delete node m
  where
    remNode = List.delete node

Now running the test again we can see that it works.

Another working example — the JavaScript land

As I stated before, this pattern has been implemented for many different mainstream languages — including JavaScript. I’d like to show you the version of the above process for this language now. This might end up being helpful if you’d like to use it in your project but don't know much Haskell yet.

As a start, let’s make sure we have the following packages:

npm install jsverify
npm install lodash

We can now create a JS file with what might resemble the Haskell draft implementation:

var jsc = require("jsverify");
var _   = require("lodash");

var Graph = function() {
    var self = this;
      
    self._map = {};

    self.insertNode = function(node) {
      if(self._map[node] === undefined) {
        self._map[node] = [];
      }
      return self;
    };

    self.removeNode = function(node) {
      self._map.delete(node);
      return self;
    };

    self.insertEdge = function(a, b) {
      if(self._map[a] === undefined) {
        self.insertNode(a);
      }
      self._map[a].push(b);
      return self;
    };

    self.nodes = function() {
      return _.keys(self._map);
    };
}

Graph.empty = function() {
    return new Graph();
}

To reproduce the first property — for all integers, inserting one as a node to an empty graph results in a graph with one node:

var propInsertEmpty =
  jsc.forall("nat", function(i) {
    return _.isEqual(Graph.empty().insertNode(i).nodes(), [i]);
  });

jsc.assert(propInsertEmpty);

The jsVerify DSL takes some time to get used to. It cannot take advantage of the type system as in the Haskell example, so aspects like generating random data based on types require some documentation reading.

Running jsc.assert, we might have expected success, but this time we’re getting:

Error: Failed after 1 tests and 5 shrinks. rngState: 001d40a68297fbce35; Counterexample: 0;

We can see that jsVerify has found 0 as a counterexample. Let’s see what’s happening by running the code by hand, passing 0 as a parameter:

console.log(Graph.empty().insertNode(0).nodes());

Result:

[ '0' ]

Aha! It’s quite easy to shoot yourself in the foot in JavaScript. We can fix it really fast with the following:

self.nodes = function() {
   return _.map(_.keys(self._map), function(i){ return parseInt(i, 10); });
};

Running the code again doesn’t show any errors, which means that all assertions held. What about the bug we saw in the Haskell version? Let’s provide a property for that too:

var propRemoveRemovesEdges =
  jsc.forall(graphG, function(g) {
    if(g.nodes().length === 0){
      return true;
    }
    else {
      var numNodes = g.nodes().length;
      var index = _.random(0, numNodes - 1);
      var node = g.nodes()[index];
      return !_.includes(_.flattenDeep(_.values(g.removeNode(node)._map)), node);
    }
  });

jsc.assert(propRemoveRemovesEdges);

We still need to specify how to generate a random graph. We can use the notion of a Functor that comes from the functional programming world and turn a random array into a random graph:

var graphG = jsc.array(jsc.nat).smap(
  function(arr) {
    var ins = function(g, i) {
      return g.insertNode(i);
    };
    var graph = _.reduce(arr, ins, Graph.empty());
    var numParents = Math.floor(arr.length / 2);
    var parents = _.take(arr, numParents);
    var children = _.difference(arr, parents);
    var insEd = function(g, parent) {
      var insF = function(r, c) {
        return r.insertEdge(parent, c);
      };
      return _.reduce(children, insF, g);
    };
    return _.reduce(parents, insEd, graph);
  },
  function(graph) {
    return graph.nodes();
  }
);

When running the assert for that property we’re getting an error:

Error: Failed after 1 tests and 1 shrinks. rngState: 085f6c82ea10439d7b; Counterexample: {"_map":{"21":[44]}}; Exception: self._map.delete is not a function

This isn’t the issue we were expecting, though. Still, it’s great to find a problem before showing the code to the client. We can iron it out with:

self.removeNode = function(node) {
  delete self._map[node];
  return self;
};

When running it again, we’re getting the error we were expecting:

Error: Failed after 8 tests and 2 shrinks. rngState: 8c97e25bc36f41da08; Counterexample: {"_map":{"2":[7]}};

jsVerify has found a counterexample falsifying our property. It’s also worth noting that it took 8 tests to find this issue. We can see that when we remove a node that appears as a child in another node's edge list (and has no children itself), the reference is not removed, so the property doesn't hold. Let’s reexamine our removeNode function:

self.removeNode = function(node) {
  delete self._map[node];
  self._map = _.mapValues(self._map, function(children) {
    return _.without(children, node);
  });
  return self;
};

And now it works!

Not only Haskell and JavaScript

QuickCheck-style testing is available for many different languages. Wikipedia says:

Re-implementations of QuickCheck exist for C, C++, Chicken Scheme, Clojure, Common Lisp, D, Elm, Erlang, F# (and C#, VB.NET), Factor, Io, Java, JavaScript, Node.js, Objective-C, OCaml, Perl, Prolog, PHP, Python, R, Ruby, Rust, Scala, Scheme, Smalltalk, Standard ML and Swift.

You can find many useful links about the approach on Wikipedia. If you’re into Haskell, a good place to start reading about the library is the Haskell Wiki, as well as the documentation found on Hackage.

The JavaScript counterpart can be found on GitHub. It’s important to note that jsVerify isn’t the only JavaScript library implementing the QuickCheck approach.

MediaWiki extension EmailDiff: notification emails improved

One of the nice things about MediaWiki is the ability to use extensions to extend the core functionality in many ways. I've just released a new version of an extension I wrote called EmailDiff that helps provide a much-needed function. When you are using a MediaWiki site and a page is on your watchlist - or your username is inside the 'UsersNotifiedOnAllChanges' array - you will receive an email whenever the page is changed. However, this email simply gives you the editor's summary and states that the page has been changed, with some links if you want to see exactly what changed. With the EmailDiff extension enabled, a full diff of exactly what has changed is sent in the email itself. This is extremely valuable because you can quickly see exactly what changed, without leaving your email client to open a browser (and potentially having to log in), and without breaking your flow.

Normally, a MediaWiki notification email for a page change will look something like this:

Subject: MediaWiki page Project:Sandbox requirements has been changed by Zimmerman

Dear Turnstep,

The MediaWiki page Project:Sandbox requirements has been changed on
16 November 2015 by Zimmerman, see
https://www.mediawiki.org/wiki/Project:Sandbox for the
current revision. 

See
https://www.mediawiki.org/w/index.php?title=Project:Sandbox&diff=next&oldid=7076877
to view this change.

See
https://www.mediawiki.org/w/index.php?title=Project:Sandbox&diff=0&oldid=8657769
for all changes since your last visit.

Editor's summary: important thoughts

Contact the editor:
mail: https://www.mediawiki.org/wiki/Special:EmailUser/Zimmerman
wiki: https://www.mediawiki.org/wiki/User:Zimmerman

There will be no other notifications in case of further activity unless
you visit this page while logged in. You could also reset the
notification flags for all your watched pages on your watchlist.

Your friendly MediaWiki notification system

--
To change your email notification settings, visit
https://www.mediawiki.org/wiki/Special:Preferences

To change your watchlist settings, visit
https://www.mediawiki.org/wiki/Special:EditWatchlist

To delete the page from your watchlist, visit
https://www.mediawiki.org/w/index.php?title=Project:Sandbox&action=unwatch

Feedback and further assistance:
https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents

The above is the default email message for page changes on mediawiki.org. As you can see, it is very wordy, but conveys little actual information. In contrast, here is the EmailDiff extension, along with the suggested changes in the notification email template mentioned below:

Subject: MediaWiki page Project:Sandbox requirements has been changed by Zimmerman (diff)

Page: Project:Sandbox
Summary: important thoughts
User: Zimmerman  Time: 11 November 2015

Version differences:
@@ -846,5 +887,3 @@
 In cattle, temperament can affect production traits such as carcass and meat 
 quality or milk yield as well as affecting the animal's overall health and 
-reproduction. Cattle temperament is defined as "the consistent behavioral and physiological 
-difference observed between individuals in response to a stressor or environmental 
+reproduction. If you succeed in tipping a cow only partway, such that only one 
+of its feet is still on the ground, you have created lean beef. Such a feat is 
+well done. Naturally, being outside, the cow is unstable. When it falls over, 
+it becomes ground beef. Cattle temperament is defined as "the consistent behavioral 
+and physiological difference observed between individuals in response to a stressor or environmental 
 challenge and is used to describe the relatively stable difference in the behavioral 
 predisposition of an animal, which can be related to psychobiological mechanisms.

That is so much better: short, sweet, and showing exactly the information you need. The lack of a diff has long been a pet peeve of mine, so much so that I wrote this functionality a long time ago as some hacks to the core code. Now, however, everything is bottled up into one neat extension.

This extension works by making use of MediaWiki's great hook system. In particular, it uses the SendPersonalizedNotificationEmail hook. This hook is not yet included in MediaWiki core, but I am hoping it will get included in version 1.27. The hook fires just before the normal notification email is about to be sent. The extension generates the diff and sticks it inside the email body. It will also append the string '(diff)' to the subject line, but that is configurable (see below).

The extension has changed a lot over the years, moving forward along with MediaWiki, whose support of extensions gets better all the time. The current version of the EmailDiff extension, 1.7, requires a MediaWiki version of 1.25 or better, as it uses the new extension.json format.

Installation is pretty straightforward with four steps. First, visit the official extension page at mediawiki.org, download the tarball, and untar it into your extensions directory. Second, add this line to your LocalSettings.php file:

wfLoadExtension( 'EmailDiff' );

If you need to change any of the configuration settings, add them to LocalSettings.php right below the wfLoadExtension line. Currently, the only two configuration items are:

  • $wgEmailDiffSubjectSuffix This is a string that gets added to the end of any notification emails that contain a diff. Defaults to (diff).
  • $wgEmailDiffCommand This is the command used to execute the diff. It should not need to be changed for most systems. Defaults to "/usr/bin/diff -u OLDFILE NEWFILE | /usr/bin/tail --lines=+3 > DIFFFILE"
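
Putting those together, here is a sketch of a LocalSettings.php fragment that loads the extension and overrides both settings (the values below are only illustrations, not recommendations):

wfLoadExtension( 'EmailDiff' );
# Optional overrides; the defaults are usually fine
$wgEmailDiffSubjectSuffix = ' [diff included]';
$wgEmailDiffCommand = '/usr/bin/diff -u OLDFILE NEWFILE | /usr/bin/tail --lines=+3 > DIFFFILE';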

As mentioned above, this extension requires the SendPersonalizedNotificationEmail hook to exist. For the third step, if the hook does not already exist, add it by editing the includes/mail/EmailNotification.php file. Insert this line at the bottom of the sendPersonalized function, just before the final return:

Hooks::run( 'SendPersonalizedNotificationEmail',
    [ $watchingUser, $this->oldid, $this->title, &$headers, &$this->subject, &$body ] );

The final step is to modify the template used to send the notification emails. You can find it by editing this page on your wiki: MediaWiki:Enotif_body. Add the string $PAGEDIFF where you want the diff to appear. I recommend cleaning up the template while you are in there. Here is my preferred template:

Page: $PAGETITLE
Summary: $PAGESUMMARY $PAGEMINOREDIT
User: $PAGEEDITOR  Time: $PAGEEDITDATE
$PAGEDIFF
 
$NEWPAGE

Once installed, you will need to activate the email diffs for one or more users. A new user preference that allows emailing of diffs has been added. It is off by default; to turn it on, a user should visit their "Preferences" link, go to the "User profile" section, and look inside the "Email options" for a new checkbox labeled "Send a diff of changes" (or the same in a different language, if localization has been set up).

Just check the box, click the "Save" button, and your notification emails become much more awesome. To enable email diffs for everyone on your wiki, add this line to your LocalSettings.php file:

$wgDefaultUserOptions['enotifshowdiff'] = true;

There are some limitations to this extension that should be mentioned. As each page edit will potentially cause three files to be created on the operating system as well as invoking an external diff command, large and extremely busy wikis may see a performance impact. However, file creations are cheap and the diff command is fast, so unless you are Wikipedia, it's probably worth at least testing out to see if the impact is meaningful.

I also like these emails as a kind of audit trail for the wiki. On that note, email notifications do NOT get sent for changes you have made yourself! Well, they do for me, but that has required some hacking of the core MediaWiki code. Maybe someday I will attempt to turn that into a user preference and/or extension as well. :)

Strict typing fun example - Free Monads in Haskell

From time to time I get a chance to discuss different programming paradigms with colleagues. I often like steering the discussion toward the programming languages realm, as it's something that interests me a lot.

Looking at the most popular languages list on GitHub, published last August, we can see that among the top five, only one is “statically typed”: https://github.com/blog/2047-language-trends-on-github

The most popular languages on GitHub as of August 2015:

  • JavaScript
  • Java
  • Ruby
  • PHP
  • Python

The dynamic typing approach gives great flexibility and very often makes teams more productive. There are use cases for static type systems, though, that I feel many people are not aware of. I view this post as an experiment: I'd like to present a pattern that is used in the Haskell and Scala worlds (among others). The pattern is especially helpful in these contexts because both Haskell and Scala have extremely advanced type systems (compared to, say, Java or C++, not to mention Ruby or Python).

My goal is not to explain in detail all the subtleties of the code I'm going to present; the learning curve for both languages can be pretty steep. The goal is to make you a bit curious about alternative development styles and how powerful they can be.

Short intro to the idea behind the pattern

The pattern I'm going to present is called the “Free Monad + Interpreter”. The idea behind it is that we can build DSLs (domain specific languages) by making our functions not execute code immediately, but instead build an AST (abstract syntax tree), which can then be interpreted in different ways depending on the context.

A fun example I came up with is a DSL for system provisioning scripts that, among many other use cases one could come up with, allows us to:

  • render the AST as bash or zsh code, or as any other language such as Python, Ruby, or Perl
  • present the AST as a graph to visualize the execution
  • execute it directly, natively in Haskell
  • have an easy-to-comprehend set of provisioning instructions, while lower-level aspects (file handles, etc.) are handled by the common Haskell code used to execute the ASTs

There are potentially many more use cases but I just wanted to show you a couple — enough to hopefully make you a bit curious. In this post we’ll focus on interpreting the AST as a bash script.

The coding part

The first step is to define the set of instructions our interpreted Domain Specific Language will support:

data Provision next =
  Begin next |
  Install String next |
  IfNotExists String (Free Provision ()) next |
  Touch String next |
  Cd String next |
  MkDir String Bool next |
  Echo String next |
  Continue |
  Done
  deriving(Functor)

This odd looking definition is what’s called an Algebraic Data Type. For now it should suffice that the commands can take arguments of different types and almost all of them take a continuation command as the last parameter.

The continuation parameter is meant to store the next “provisioning command”, so that we can have, e.g.:

Begin (Install "postgresql-server" (Echo "installed!" (Done)))

Our ASTs will be built out of these blocks, so we need some way of composing them into trees. I'm not going to explain here why the following code works (this is just a teaser post). Let's just say that these functions let us build the tree instead of calling any system-affecting code. In other words, the calls look as if they're doing something, when in fact they are just constructing a data structure in memory:

begin = liftF $ Begin id

install what = liftF $ Install what id

ifNotExists path what = liftF $ IfNotExists path what id

touch path = liftF $ Touch path id

cd path = liftF $ Cd path id

mkDir path wholeTree = liftF $ MkDir path wholeTree id

echo message = liftF $ Echo message id

continue = liftF $ Continue

done = liftF Done

Now that we have these building functions defined, we can create a function that uses them to construct a useful AST:

app :: Free Provision a
app = do
  begin
  install "postgresql-server"
  mkDir "/var/run/the-app" True
  cd "/var/run/the-app"
  ifNotExists "the-app.log" $ touch "the-app.log" >> continue
  done

Running this function does nothing except return the AST wrapped inside the “free monad”, which you can think of as a special, useful kind of container. The above function looks like any other Haskell function. It's also “type safe”, which weeds out a class of errors that, in languages like JavaScript or Python, we would only notice after running the code.

Later on we’ll see that to get different results out of the “provisioning workflow” we defined above, no change in this function will be needed.

Now, having an AST wrapped in some “useful” container almost screams for some kind of interpreter, so that it becomes useful too. That is in fact part of the description of the pattern I gave at the beginning of this post.

Let's define a type class for the function that we'll use as an interpreter:

class InterpretingContext a where
  run :: Free Provision () -> a

The above just says that if we want to use the function run to turn the AST wrapped in the monad into some concrete value (by executing it), we need to implement this function for the type of the concrete value we'd like to get out of it.

For example, let's say that for portability's sake we want to turn the AST into a bash script. The natural (though naive) way to do this would be to implement this “class”, along with its run function, for the String type:

instance InterpretingContext String where
  run (Free (Begin next)) = 
    "#!/usr/bin/env bash\n\n" ++ (run next)
    
  run (Free (Install what next)) =
    "apt-get install " ++ what ++ "\n" ++ nextStr
    where
      nextStr = run next
      
  run (Free (IfNotExists path what next)) =
    "if [ ! -f " ++ path ++ " ]; then\n\t" ++ whatStr 
      ++ "\nfi\n" ++ nextStr
    where
      whatStr = run what
      nextStr = run next
      
  run (Free (Touch path next)) =
    "touch " ++ path ++ "\n" ++ (run next)
    
  run (Free (Cd path next)) = 
    "cd " ++ path ++ "\n" ++ (run next)
    
  run (Free (MkDir path tree next)) = 
    "mkdir " ++ treeOption ++ " " ++ path ++ "\n" ++ (run next)
    where
      treeOption =
        if tree then "-p" else ""
        
  run (Free (Echo message next)) = 
    "echo " ++ message ++ "\n" ++ (run next)
              
  run (Free Continue) = ""
  
  run (Free Done) = "exit 0"

Each node kind is interpreted into the data type we chose as an instance of this class; in our example, a String.

This allows us to use the run function, specifying that we want a String as the return value; the instance we've just created will be used automatically:

run app :: String

This will return:

"#!/usr/bin/env bash\n\napt-get install postgresql-server\nmkdir -p /var/run/the-app\ncd /var/run/the-app\nif [ ! -f the-app.log ]; then\n\ttouch the-app.log\n\nfi\nexit 0"

Pretty printed:

#!/usr/bin/env bash

apt-get install postgresql-server
mkdir -p /var/run/the-app
cd /var/run/the-app
if [ ! -f the-app.log ]; then
    touch the-app.log
fi
exit 0

If we'd now like to execute the AST in the context of an action that prints the script to stdout, we could do so like this:

instance InterpretingContext (IO ()) where
  -- pin the inner run to the String interpreter, then print the resulting script
  run = putStrLn . (run :: Free Provision () -> String)

From now on it is perfectly valid to run the function with the AST in both contexts:

run app :: String
run app :: IO ()

We could also very easily add a context that runs the code against the system and returns an ExitStatus:

data ExitStatus = ExitSuccess | ExitFailure Int

instance InterpretingContext (IO ExitStatus) where
  run = (…)

What this gives us is the ability to have provisioning code that can run in production, while a different interpreter in the test suite verifies the structure of execution without making any changes to the system itself.
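
As a rough illustration of that idea (this instance is my own sketch, not part of the original code), an interpreter that merely collects the names of the visited instructions lets a test assert on the structure of the workflow without touching the system:

instance InterpretingContext [String] where
  -- record each instruction, in order, instead of producing a script
  run (Free (Begin next))                 = "begin" : run next
  run (Free (Install what next))          = ("install " ++ what) : run next
  run (Free (IfNotExists path what next)) = ("ifNotExists " ++ path) : run what ++ run next
  run (Free (Touch path next))            = ("touch " ++ path) : run next
  run (Free (Cd path next))               = ("cd " ++ path) : run next
  run (Free (MkDir path _ next))          = ("mkdir " ++ path) : run next
  run (Free (Echo message next))          = ("echo " ++ message) : run next
  run (Free Continue)                     = []
  run (Free Done)                         = ["done"]

With that in scope, run app :: [String] yields the ordered list of instructions, which a test can compare against the expected sequence.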

If you'd like to play with the code yourself, you'll need a couple more lines for this to work:

{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE TypeSynonymInstances #-}
{-# LANGUAGE OverloadedStrings #-}

And also:

import Control.Monad.Free

Bear in mind, though, that the code I presented here is by no means optimal, especially memory-wise. I chose to present it this way to make it clearer what the code is doing for those of you not familiar with the language.

What are other use cases for this pattern?

The pattern presented here has a huge number of uses. It could be used to provide a DSL for building SVG, combined with an interpreter that draws it visually for ease of work. It could also be used to define RPC data types describing structures and to interpret them differently depending on the underlying RPC (remote procedure call) mechanics (Thrift, SOAP, etc.).

I doubt that the abilities the very helpful Haskell type system gives us here could easily be reproduced in languages like Ruby or Python. It is possible, of course, but the amount of boilerplate code and complexity would require lots of testing code too. Here, on the other hand, the code comes with many guarantees simply because we're coding in a language with an advanced, strict type system.

Also, the similarity to the Interpreter pattern known from object-oriented languages is only superficial. In that case there is no way to use ordinary functions (or methods) to build the AST as if it were regular imperative code; it always comes down to assembling data structures by hand.

Curious?

If I managed to make you a bit curious about the ideas presented here, there are plenty of resources on free monads and interpreters that are worth a look.

Story telling with Cesium

Let me tell you about my own town

I was born in Yekaterinburg. It's a middle-sized town in Russia.
Most likely you don't know where it is. So let me show you:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Hello World!</title>
    <script src="/cesium/Build/Cesium/Cesium.js"></script>
    <link rel="stylesheet" href="layout.css">
  </head>
<body>
  <div id="cesiumContainer"></div>
  <script>
    var viewer = new Cesium.Viewer('cesiumContainer');
    (function(){
        var ekb = viewer.entities.add({
          name : 'Yekaterinburg',
          // Lon, Lat coordinates
          position : Cesium.Cartesian3.fromDegrees(60.6054, 56.8389),
          // Styled geometry
          point : {
            pixelSize : 5,
            color : Cesium.Color.RED
          },
          // Labeling
          label : {
            text : 'Yekaterinburg',
            font : '16pt monospace',
            style: Cesium.LabelStyle.FILL_AND_OUTLINE,
            outlineWidth : 2,
            verticalOrigin : Cesium.VerticalOrigin.BOTTOM,
            pixelOffset : new Cesium.Cartesian2(0, -9)
          }
        });
        
        // How to place camera around point
        var heading = Cesium.Math.toRadians(0);
        var pitch = Cesium.Math.toRadians(-30);
        viewer.zoomTo(ekb, new Cesium.HeadingPitchRange(heading, pitch, 10000));
    })();
  </script>
</body>
</html>

Now, I would like to tell you about the history of my town. In the beginning it was a fortified metallurgical plant with a few residential blocks and some public buildings. It was relatively small.

I could say that its size was about the size of a modern city-center, but that description is too vague. I think the best way to tell you something about the city is with a map.

I'll show you two maps.

  1. A map from 1750, early in the town's history when it was just a factory: http://www.retromap.ru/m.php#r=1417506
  2. A map from 1924 at about the start of Soviet industrialization, just before Yekaterinburg became the industrial center of the Urals: http://www.retromap.ru/m.php#r=1419241
Thanks to http://www.retromap.ru for these beautiful maps.

The map from 1750 is a small image, so I've added it via SingleTileImageryProvider, just specifying the url and the bounding rectangle:

function setupImagery() {
    var layers = viewer.scene.imageryLayers;
        
    var s = 56.8321929;
    var n = 56.8442609;
    var w = 60.5878970;
    var e = 60.6187892;
    var l1750 = layers.addImageryProvider(new Cesium.SingleTileImageryProvider({
        url : 'assets/1750.png',
        rectangle : Cesium.Rectangle.fromDegrees(w,s,e,n)
    }));
    l1750.alpha = 0.75;
}

Now you can see how small my town was initially.

The second image is larger, so I've split it up into tiles. If you use gdal2tiles.py for generating tiles, it creates all the metadata necessary for TMS, and you can connect the imagery set via TileMapServiceImageryProvider. Otherwise, you can use UrlTemplateImageryProvider.

var l1924 = layers.addImageryProvider(new Cesium.TileMapServiceImageryProvider({
    url: 'assets/1924t',
    minimumLevel: 8,
    maximumLevel: 16,
    credit: 'retromap.ru'
}));

l1924.alpha = 0.75;
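
If your tiles don't follow the TMS layout, a roughly equivalent setup with UrlTemplateImageryProvider might look like this (the URL template and zoom levels below are placeholders for whatever your tiler produced):

var l1924url = layers.addImageryProvider(new Cesium.UrlTemplateImageryProvider({
    // {z}/{x}/{reverseY} mimics a TMS-style directory layout; adjust to your tiles
    url: 'assets/1924t/{z}/{x}/{reverseY}.png',
    minimumLevel: 8,
    maximumLevel: 16,
    credit: 'retromap.ru'
}));
l1924url.alpha = 0.75;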

I've used QGIS for geo referencing. Here is a good tutorial.

User interface and Angular.js

Below you can see how I've added some controls for the visibility and opacity of the overlays. Later we will bind them to an Angular-driven GUI:

// APP is global

// the same viewer.scene.imageryLayers collection used above
var layers = viewer.scene.imageryLayers;

var layersHash = {
    'l1750': l1750,
    'l1924': l1924
};

APP.setAlpha = function(layer, alpha) {
    if(layersHash[layer] && layersHash[layer].alpha !== undefined) {
        layersHash[layer].alpha = alpha;
    }            
};

APP.show = function(layer) {
    if(layersHash[layer] && layers.indexOf(layersHash[layer]) < 0) {
        layers.add(layersHash[layer]);
    }            
};
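
The $watch handler shown further down also calls APP.hide; a symmetric counterpart (my own sketch, not code from the post) could look like this:

APP.hide = function(layer) {
    if(layersHash[layer] && layers.indexOf(layersHash[layer]) >= 0) {
        // remove the layer from the collection but keep it around for re-adding
        layers.remove(layersHash[layer], false);
    }
};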

Why not keep our views in a namespace and access them directly from an Angular controller? Using this approach will give us a lot of flexibility:

  • You can split up the GUI and Cesium modules and use something else instead of Cesium or Angular.
  • You are able to make a proxy for `APP` and add business logic to the calls made to it.
  • It just feels right not to meld all parts of the application into one unmanageable mish-mash of code.

For the GUI I've added a big slider for the timeline, small sliders for layer opacity, and checkboxes for visibility, and I call the APP methods via Angular's $watch:

$scope.$watch('slider.value', function() {
    var v = $scope.slider.options.stepsArray[$scope.slider.value];
    if (v >= 1723 && v <= 1830) {
        $scope.menu.layers[0].active = true;
        $scope.menu.layers[1].active = false;
    }
    if (v > 1830 && v < 1980) {
        $scope.menu.layers[0].active = false;
        $scope.menu.layers[1].active = true;
    }
    if(v >= 1980) {
        $scope.menu.layers[0].active = false;
        $scope.menu.layers[1].active = false;
    }
    
    $scope.updateLayers();
});

$scope.updateLayers = function() {
    for (var i = 0; i < $scope.menu.layers.length; i++) {
        if ($scope.menu.layers[i].active ) {
            APP.show && APP.show($scope.menu.layers[i].name);
        }
        else {
            APP.hide && APP.hide($scope.menu.layers[i].name);
        }
    }
};

Back to the history:

Yekaterinburg was founded on November 7, 1723. This is the date of the first test-run of the forging hammers in the new factory. The original factory design by Tatishew had 40 forging hammers and 4 blast furnaces. That may well have made it the best equipped and most productive factory of its time.

Now I want to add text to the application. I also have some cool pictures of the hammers and furnaces. Adding an overlay for the text and binding its content is a matter of CSS and ng-include/ng-bind knowledge, and it's a bit out of scope for this post, so let's push on and add some pictures and link them to the timeline. Cesium has KmlDataSource for KML loading and parsing. This is how my application loads the KML and accesses its attributes:

var entityByName = {};
var promise = Cesium.KmlDataSource.load('assets/foto.kml');
promise.then(function(dataSource) {
    viewer.dataSources.add(dataSource);
    
    //KML entities
    var entities = dataSource.entities.values;
    for (var i = 0; i < entities.length; i++) {
        var entity = entities[i];
        // <Data> attributes, parsed into js object
        var ed = entity.kml.extendedData;
        
        entityByName[entity.id] = {
            'entity': entity,
            since: parseInt(ed.since.value),
            to: parseInt(ed.to.value)
        };
    }
});

Add bindings for Angular:

APP.filterEtityByY = function(y) {
    for (var k in entityByName) {
        if(entityByName.hasOwnProperty(k)) {
            var s = entityByName[k].since;
            var t = entityByName[k].to;
            entityByName[k].entity.show = (y >= s && y <= t);
        }
    }
};

var heading = Cesium.Math.toRadians(0);
var pitch = Cesium.Math.toRadians(-30);
var distanceMeters = 500;
var enityHeading = new Cesium.HeadingPitchRange(heading, pitch, distanceMeters);

APP.zoomToEntity = function(name) {
    if(name && entityByName[name]) {
        viewer.zoomTo(entityByName[name].entity, enityHeading);
    }
};

I've added object timespans via extended data. If you want to use the Cesium/GE default timeline, you should do it via a TimeSpan section in the KML entries:

<TimeSpan>
  <begin>2000-01-00T00:00:00Z</begin>
  <end>2000-02-00T00:00:00Z</end>
</TimeSpan>

Another interesting fact about my town is that between 1924 and 1991 it had a different name: Sverdlovsk (Свердловск). So I've added town name changing via APP and connected it to the timeline.
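
That part is not Cesium-specific; a hypothetical sketch (APP.setTownName is my own illustration, assuming the ekb entity from the first snippet is kept in an accessible scope) could be as simple as:

APP.setTownName = function(year) {
    // The city was called Sverdlovsk from 1924 to 1991
    var name = (year >= 1924 && year <= 1991) ? 'Sverdlovsk' : 'Yekaterinburg';
    ekb.name = name;
    ekb.label.text = name;
};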

Using APP.filterEtityByY and APP.zoomToEntity, it's relatively easy to connect a page hash like example.com/#!/feature/smth with features from the KML. One point to note is that I use my own parser for the hash's path part instead of ngRoute's approach.
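
As a hypothetical sketch of that idea (handleHash below is my own illustration, not the application's actual parser), a hash like #!/feature/smth can simply be split on slashes and routed to the APP calls above:

function handleHash() {
    // "#!/feature/smth" -> ["feature", "smth"]
    var parts = window.location.hash.replace(/^#!\//, '').split('/');
    if (parts[0] === 'feature' && parts[1]) {
        APP.zoomToEntity(parts[1]);
    }
}
window.addEventListener('hashchange', handleHash);
handleHash();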

You can see how all these elements work together at https://dmitry.endpoint.com/cesium/ekb, and the sources are on GitHub at https://github.com/kiselev-dv/EkbHistory/tree/master.

Loading JSON Files Into PostgreSQL 9.5

In the previous posts I described a simple database table for storing JSON values, and a way to unpack nested JSON attributes into simple database views. This time I will show how to write a very simple query (thanks to PostgreSQL 9.5) to load the JSON files.

Here's a simple Python script to load the database.

This script was written for PostgreSQL 9.4 (it should work with 9.5 too, but it doesn't use the nice new 9.5 feature described below).

#!/usr/bin/env python

import os
import sys
import logging

try:
    import psycopg2 as pg
    import psycopg2.extras
except:
    print "Install psycopg2"
    exit(123)

try:
    import progressbar
except:
    print "Install progressbar2"
    exit(123)

import json

logger = logging.getLogger()

PG_CONN_STRING = "dbname='blogpost' port='5433'"

data_dir = "data"
dbconn = pg.connect(PG_CONN_STRING)

logger.info("Loading data from '{}'".format(data_dir))

cursor = dbconn.cursor()

counter = 0
empty_files = []

class ProgressInfo:

    def __init__(self, dir):
        files_no = 0
        for root, dirs, files in os.walk(dir):
            for file in files:
                if file.endswith(".json"):
                    files_no += 1
        self.files_no = files_no
        print "Found {} files to process".format(self.files_no)
        self.bar = progressbar.ProgressBar(maxval=self.files_no,
                                           widgets=[' [', progressbar.Timer(), '] [', progressbar.ETA(), '] ', progressbar.Bar(),])

    def update(self, counter):
        self.bar.update(counter)

pi = ProgressInfo(os.path.expanduser(data_dir))

for root, dirs, files in os.walk(os.path.expanduser(data_dir)):
    for f in files:
        fname = os.path.join(root, f)

        if not fname.endswith(".json"):
            continue
        with open(fname) as js:
            data = js.read()
            if not data:
                empty_files.append(fname)
                continue
            dd = json.loads(data)
            counter += 1
            pi.update(counter)
            cursor.execute("""
                            INSERT INTO stats_data(data)
                            SELECT %s
                            WHERE NOT EXISTS (SELECT 42
                                              FROM stats_data
                                              WHERE
                                                    ((data->>'metadata')::json->>'country')  = %s
                                                AND ((data->>'metadata')::json->>'installation') = %s
                                                AND tstzrange(
                                                        to_timestamp((data->>'start_ts')::double precision),
                                                        to_timestamp((data->>'end_ts'  )::double precision)
                                                    ) &&
                                                    tstzrange(
                                                        to_timestamp(%s::text::double precision),
                                                        to_timestamp(%s::text::double precision)
                                                    )
                                             )
                        """, (data, str(dd['metadata']['country']), str(dd['metadata']['installation']), str(dd['start_ts']), str(dd['end_ts'])))

print ""

logger.debug("Refreshing materialized views")
cursor.execute("""REFRESH MATERIALIZED VIEW sessions""");
cursor.execute("""ANALYZE""");

dbconn.commit()

logger.info("Loaded {} files".format(counter))
logger.info("Found {} empty files".format(len(empty_files)))
if empty_files:
    logger.info("Empty files:")
    for f in empty_files:
        logger.info(" >>> {}".format(f))

I have created two example files in the 'data' directory; the output of this script is:

Found 2 files to process
 [Elapsed Time: 0:00:00] [ETA:  0:00:00] |#####################################|

Yay, so it works. What's more, I can run the script again on the same files, and it will try loading the same data without any errors. Do you remember that there was an EXCLUDE constraint which doesn't allow us to load any JSON for the same country and installation with an overlapping time range? That's why the query is so long: I also need to check that such a JSON is not already in the database before loading it.

This is twice as slow as the next solution. The problem is that it needs to unpack the JSON to run the subquery, and then insert the data while checking the same condition again (in fact the insert and the subquery use the same index created by the EXCLUDE constraint).

And then PostgreSQL 9.5 was released, with one great feature: INSERT ... ON CONFLICT. A conflict here means violating a unique index or a constraint, and the EXCLUDE clause on the stats_data table created exactly such a constraint.
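
As a reminder of what that constraint looks like, here is a sketch of a stats_data definition with an exclusion constraint named no_overlapping_jsons, matching the expressions used in the query above (the real table from the earlier post may differ in details; btree_gist is needed to mix equality and range operators in one EXCLUDE clause):

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE stats_data (
    id   serial PRIMARY KEY,
    data jsonb NOT NULL,
    CONSTRAINT no_overlapping_jsons EXCLUDE USING gist (
        ((data->>'metadata')::json->>'country')      WITH =,
        ((data->>'metadata')::json->>'installation') WITH =,
        tstzrange(
            to_timestamp((data->>'start_ts')::double precision),
            to_timestamp((data->>'end_ts'  )::double precision)
        ) WITH &&
    )
);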

There can also be ON CONFLICT DO NOTHING, and that's what I have used. I changed only one query in the script, and instead of this:


            cursor.execute("""
                            INSERT INTO stats_data(data)
                            SELECT %s
                            WHERE NOT EXISTS (SELECT 42
                                              FROM stats_data
                                              WHERE
                                                    ((data->>'metadata')::json->>'country')  = %s
                                                AND ((data->>'metadata')::json->>'installation') = %s
                                                AND tstzrange(
                                                        to_timestamp((data->>'start_ts')::double precision),
                                                        to_timestamp((data->>'end_ts'  )::double precision)
                                                    ) &&
                                                    tstzrange(
                                                        to_timestamp(%s::text::double precision),
                                                        to_timestamp(%s::text::double precision)
                                                    )
                                             )
                        """, (data, str(dd['metadata']['country']), str(dd['metadata']['installation']), str(dd['start_ts']), str(dd['end_ts'])))

It looks like this:


            cursor.execute("""
                            INSERT INTO stats_data(data)
                            VALUES (%s)
                            ON CONFLICT ON CONSTRAINT no_overlapping_jsons DO NOTHING
                        """, (data, ))

This version requires PostgreSQL 9.5 and will not work on the previous versions.

It is twice as fast as the original, and works as expected: I can run it on already-loaded files, and it will not load them again. This way, when I use rsync to download new files, I can just run the script and it will load only the new files into the database.

Loading 88k JSON files using the production version of the script with the first query takes over a minute; loading the same files using the second version takes less than 30 seconds.