
Welcome to End Point’s blog

Ongoing observations by End Point people

Recognizing handwritten digits - a quick peek into the basics of machine learning

In the previous two posts on machine learning, I presented a very basic introduction to an approach called "probabilistic graphical models". In this post I'd like to take a tour of some different techniques while creating code that will recognize handwritten digits.

Handwritten digit recognition is an interesting topic that has been explored for many years. It is now considered one of the best ways to start the journey into the world of machine learning.

Taking the Kaggle challenge

We'll take the "digits recognition" challenge as presented on Kaggle, an online platform with challenges for data scientists. Most of the challenges offer real-money prizes. Some of them are there to help us out in our journey of learning data science techniques, and the "digits recognition" contest is one of those.

The challenge

As explained on Kaggle:

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision.

The "digits recognition" challenge is one of the best ways to get acquainted with machine learning and computer vision. The so-called "MNIST" dataset consists of 70k images of handwritten digits, each one grayscale and 28x28 pixels in size. The Kaggle challenge is about taking a subset of 42k of them along with labels (the actual number each image shows) and "training" the computer on that set. The next step is to take the remaining 28k images without the labels and "predict" which number each of them presents.

Here's a short overview of what the digits in the set really look like (along with the numbers they represent):


I have to admit that for some of them I have a really hard time recognizing the actual numbers on my own :)

The general approach to supervised learning

Learning from labelled data is what is called "supervised learning". It's supervised because we're taking the computer by the hand through the whole training data set and "teaching" it what the data linked with each label "looks" like.

In all such scenarios we can express the data and labels as:
Y ~ X1, X2, X3, X4, ..., Xn
Y is called the dependent variable while each Xn is an independent variable. This formula holds for both classification and regression problems.

Classification is when the dependent variable Y is so-called categorical: taking values from a concrete set without a meaningful order. Regression is when Y is not categorical, most often continuous.

In the digits recognition challenge we're faced with the classification task. The dependent variable takes values from the set:
Y = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
I'm sure the question you might be asking yourself now is: what are the independent variables Xn? It turns out to be the crux of the whole problem to solve :)

The plan of attack

A good introduction to computer vision techniques is the book by J. R. Parker, "Algorithms for Image Processing and Computer Vision". I encourage the reader to buy that book. I took some ideas from it while having fun with my own solution to the challenge.

The book outlines the ideas revolving around computing image profiles — for each side. For each row of pixels, a number representing the distance of the first pixel from the edge is computed. This way we're getting our first independent variables. To capture even more information about digit shapes, we'll also capture the differences between consecutive row values as well as their global maxima and minima. We'll also compute the width of the shape for each row.
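
To make the profile idea a bit more concrete, here's a minimal Python sketch. It is not the code used for the submission: the function names `profile` and `diffs` and the tiny 5x5 "digit" are made up for illustration. For each row it records the 1-based position of the first lit pixel counted from the chosen edge, and then derives the row-to-row differences and the widths:

```python
def profile(image, side="left"):
    """For each row: 1-based distance from the given edge to the first
    non-zero pixel, or 0 when the row is empty."""
    result = []
    for row in image:
        cells = row if side == "left" else row[::-1]
        distance = 0
        for position, value in enumerate(cells, start=1):
            if value > 0:
                distance = position
                break
        result.append(distance)
    return result


def diffs(values):
    """Differences between consecutive rows; the first row keeps its own value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


# A hypothetical, already thinned "1" on a 5x5 grid
digit = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
]

left = profile(digit, "left")
right = profile(digit, "right")
# Width of the shape in each row (only meaningful for non-empty rows)
widths = [len(digit[0]) - (l - 1) - (r - 1) for l, r in zip(left, right)]

print(left, right, diffs(left), widths)
```

The minima and maxima mentioned above are then just `min` and `max` over those per-row lists.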

Because the handwritten digits vary greatly in their thickness, we will first preprocess the images to detect the so-called skeleton of each digit. The skeleton is an image representation where the thickness of the shape has been reduced to just one pixel.

Having the image thinned will also allow us to capture some more info about the shapes. We will write an algorithm that walks the skeleton and records the direction change frequencies.
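
As a rough illustration of what "direction frequencies" means, here's a small Python sketch. It assumes we already have an ordered list of (row, column) points along the skeleton (finding that order is the job of the walking algorithm), and it numbers the eight possible moves clockwise starting from "up". The names `DIRECTIONS` and `direction_frequencies` are my own for this example:

```python
# Moves expressed as (row_step, col_step) -> direction code, laid out as:
#   8 1 2
#   7 - 3
#   6 5 4
DIRECTIONS = {
    (-1, 0): 1, (-1, 1): 2, (0, 1): 3, (1, 1): 4,
    (1, 0): 5, (1, -1): 6, (0, -1): 7, (-1, -1): 8,
}


def direction_frequencies(path):
    """Share of each of the 8 directions among the steps of the path."""
    counts = [0] * 8
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        counts[DIRECTIONS[(r1 - r0, c1 - c0)] - 1] += 1
    total = sum(counts)
    return [count / total for count in counts] if total else [0.0] * 8


# A hypothetical path: two steps up, one step up-right, one step right
path = [(4, 0), (3, 0), (2, 0), (1, 1), (1, 2)]
freqs = direction_frequencies(path)
print(freqs)
```

Because the counts are divided by the number of steps, shapes of different sizes end up with comparable values.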

Once we have our set of independent variables Xn, we'll use a classification algorithm to first learn in a supervised way (using the provided labels) and then to predict the values of the test data set. Lastly we'll submit our predictions to Kaggle and see how well we did.

Having fun with languages

In the data science world, the lingua franca still remains the R programming language. In recent years Python has also come close in popularity, and nowadays we can say it's the duo of R and Python that rules the data science world (not counting high performance code written e.g. in C++ in production systems).

Lately a new language designed with data scientists in mind has emerged: Julia. It's a language with characteristics of both dynamically typed scripting languages and strictly typed compiled ones. It compiles its code into efficient native code via LLVM, but it does so in a JIT fashion, inferring the types on the go when needed.

While having fun with the Kaggle challenge I'll use Julia and Python for the so-called feature extraction phase (the one in which we're computing information about our Xn variables). I'll then turn to R for doing the classification itself. Note that I could use any of those languages at each step and get very similar results. The purpose of this series of articles is to be a fun bird's-eye overview, so I decided that this way would be much more interesting.

Feature Extraction

The end result of this phase is the data frame saved as a CSV file so that we'll be able to load it in R and do the classification.

First let's define the general function in Julia that takes the name of the input CSV file and returns a data frame with features of given images extracted into columns:
using DataFrames
using ProgressMeter  # provides the @showprogress macro used in extract below

function get_data(name :: String, include_label = true)
  println("Loading CSV file into a data frame...")
  table = readtable(string(name, ".csv"))
  extract(table, include_label)
end
Now the extract function looks like the following:
"""
Extracts the features from the dataframe. Puts them into
separate columns and removes all other columns except the
labels.

The features:

* Left and right profiles (after fitting into the same sized rect):
  * Min
  * Max
  * Width[y]
  * Diff[y]
* Paths:
  * Frequencies of movement directions
  * Simplified directions:
    * Frequencies of 2 element simplified paths
"""
function extract(frame :: DataFrame, include_label = true)
  println("Reshaping data...")
  
  function to_image(flat :: Array{Float64}) :: Array{Float64}
    dim      = Base.isqrt(length(flat))
    reshape(flat, (dim, dim))'
  end
  
  from = include_label ? 2 : 1
  frame[:pixels] = map((i) -> convert(Array{Float64}, frame[i, from:end]) |> to_image, 1:size(frame, 1))
  images = frame[:, :pixels] ./ 255
  data = Array{Array{Float64}}(length(images))
  
  @showprogress 1 "Computing features..." for i in 1:length(images)
    features = pixels_to_features(images[i])
    data[i] = features_to_row(features)
  end
  start_column = include_label ? [:label] : []
  columns = vcat(start_column, features_columns(images[1]))
  
  result = DataFrame()
  for c in columns
    result[c] = []
  end

  for i in 1:length(data)
    if include_label
      push!(result, vcat(frame[i, :label], data[i]))
    else
      push!(result, vcat([],               data[i]))
    end
  end

  result
end
A few nice things to notice here about Julia itself are:
  • The function documentation is written in Markdown
  • We can nest functions inside other functions
  • The language is dynamically and strongly typed, with optional type annotations
  • Types can be inferred from the context
  • It is often desirable to provide the concrete types to improve performance (but that's an advanced Julia-related topic)
  • Arrays are indexed from 1
  • There's the nice |> operator, found e.g. in Elixir (which I absolutely love)
The above code converts the images to arrays of Float64 and scales the values to be between 0 and 1 (instead of the original 0..255).

A thing to notice is that in Julia we can vectorize operations easily, and we're using this fact to tersely convert our numbers:
images = frame[:, :pixels] ./ 255
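
For readers more at home in Python, the same two steps (reshape the flat row of pixels into a square image, then scale the values to 0..1) could look roughly like this. It's a plain-Python sketch with a made-up helper name `to_image`, assuming the pixels are stored row by row:

```python
import math


def to_image(flat, scale=255.0):
    """Turn a flat list of pixel values into a square list of rows,
    scaling each value into the 0..1 range."""
    dim = math.isqrt(len(flat))
    if dim * dim != len(flat):
        raise ValueError("expected a square number of pixels")
    return [[flat[r * dim + c] / scale for c in range(dim)] for r in range(dim)]


# A hypothetical 2x2 "image"; a real MNIST row would have 784 values
image = to_image([0, 255, 51, 102])
print(image)
```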
We are referencing the pixels_to_features function which we define as:
"""
Returns the ImageStats struct for the image pixels
given as an argument
"""
function pixels_to_features(image :: Array{Float64})
  dim      = Base.isqrt(length(image))
  skeleton = compute_skeleton(image)
  bounds   = compute_bounds(skeleton)
  resized  = compute_resized(skeleton, bounds, (dim, dim))
  left     = compute_profile(resized, :left)
  right    = compute_profile(resized, :right)
  width_min, width_max, width_at = compute_widths(left, right, image)
  frequencies, simples = compute_transitions(skeleton)

  ImageStats(dim, left, right, width_min, width_max, width_at, frequencies, simples)
end
This in turn uses the ImageStats structure:
# ProfileStats is defined first because ImageStats refers to it
immutable ProfileStats
  min :: Int64
  max :: Int64
  at  :: Array{Int64}
  diff :: Array{Int64}
end

immutable ImageStats
  image_dim             :: Int64
  left                  :: ProfileStats
  right                 :: ProfileStats
  width_min             :: Int64
  width_max             :: Int64
  width_at              :: Array{Int64}
  direction_frequencies :: Array{Float64}

  # The following adds information about transitions
  # in 2 element simplified paths:
  simple_direction_frequencies :: Array{Float64}
end
The pixels_to_features function first gets the skeleton of the digit shape as an image and then uses other functions passing that skeleton to them. The function returning the skeleton utilizes the fact that in Julia it's trivially easy to use Python libraries. Here's its definition:
using PyCall

@pyimport skimage.morphology as cv

"""
Thin the number in the image by computing the skeleton
"""
function compute_skeleton(number_image :: Array{Float64}) :: Array{Float64}
  convert(Array{Float64}, cv.skeletonize_3d(number_image))
end
It uses the scikit-image library's skeletonize_3d function by importing the module with the @pyimport macro and then calling the function as if it were regular Julia code.

Next the code crops the digit itself from the 28x28 image and resizes it back to 28x28 so that the edges of the shape always "touch" the edges of the image. For this we need the function that returns the bounds of the shape so that it's easy to do the cropping:
function compute_bounds(number_image :: Array{Float64}) :: Bounds
  rows = size(number_image, 1)
  cols = size(number_image, 2)

  saw_top = false
  saw_bottom = false

  top = 1
  bottom = rows
  left = cols
  right = 1

  for y = 1:rows
    saw_left = false
    row_sum = 0

    for x = 1:cols
      row_sum += number_image[y, x]

      if !saw_top && number_image[y, x] > 0
        saw_top = true
        top = y
      end

      if !saw_left && number_image[y, x] > 0 && x < left
        saw_left = true
        left = x
      end

      if saw_top && !saw_bottom && x == cols && row_sum == 0
        saw_bottom = true
        bottom = y - 1
      end

      if number_image[y, x] > 0 && x > right
        right = x
      end
    end
  end
  Bounds(top, right, bottom, left)
end
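
The bounding-box idea above can also be written quite compactly. Here's a hedged Python sketch of the same concept (not a translation of the Julia function): `bounding_box` and `crop` are names made up for this example, and the indices are 0-based, unlike in Julia:

```python
def bounding_box(image):
    """Return (top, right, bottom, left) of the non-zero pixels,
    or None when the image is empty. Indices are 0-based."""
    rows = [r for r, row in enumerate(image) if any(v > 0 for v in row)]
    if not rows:
        return None
    cols = [c for c in range(len(image[0])) if any(row[c] > 0 for row in image)]
    return rows[0], cols[-1], rows[-1], cols[0]


def crop(image, box):
    """Cut the image down to the given bounding box (inclusive on all sides)."""
    top, right, bottom, left = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# A hypothetical 4x5 image with a small shape in the middle
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
box = bounding_box(image)
cropped = crop(image, box)
print(box, cropped)
```

Note that rows are sliced with top..bottom and columns with left..right, which is the detail that is easiest to mix up when cropping.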
Resizing the image is pretty straight-forward:
using Images

function compute_resized(image :: Array{Float64}, bounds :: Bounds, dims :: Tuple{Int64, Int64}) :: Array{Float64}
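  # Rows span top..bottom and columns span left..right, matching the
  # [y, x] indexing used in compute_bounds
  cropped = image[bounds.top:bounds.bottom, bounds.left:bounds.right]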
  imresize(cropped, dims)
end
Next, we need to compute the profile stats as described in our plan of attack:
function compute_profile(image :: Array{Float64}, side :: Symbol) :: ProfileStats
  @assert side == :left || side == :right

  rows = size(image, 1)
  cols = size(image, 2)

  columns = side == :left ? collect(1:cols) : (collect(1:cols) |> reverse)
  at = zeros(Int64, rows)
  diff = zeros(Int64, rows)
  min = rows
  max = 0

  min_val = cols
  max_val = 0

  for y = 1:rows
    for x = columns
      if image[y, x] > 0
        at[y] = side == :left ? x : cols - x + 1

        if at[y] < min_val
          min_val = at[y]
          min = y
        end

        if at[y] > max_val
          max_val = at[y]
          max = y
        end
        break
      end
    end
    if y == 1
      diff[y] = at[y]
    else
      diff[y] = at[y] - at[y - 1]
    end
  end

  ProfileStats(min, max, at, diff)
end
The widths of shapes can be computed with the following:
function compute_widths(left :: ProfileStats, right :: ProfileStats, image :: Array{Float64}) :: Tuple{Int64, Int64, Array{Int64}}
  image_width = size(image, 2)
  min_width = image_width
  max_width = 0
  width_ats = zeros(Int64, length(left.at))  # integer widths, matching the declared return type

  for row in 1:length(left.at)
    width_ats[row] = image_width - (left.at[row] - 1) - (right.at[row] - 1)

    if width_ats[row] < min_width
      min_width = width_ats[row]
    end

    if width_ats[row] > max_width
      max_width = width_ats[row]
    end
  end

  (min_width, max_width, width_ats)
end
And lastly, the transitions:
function compute_transitions(image :: Array{Float64}) :: Tuple{Array{Float64}, Array{Float64}}
  history = zeros((size(image,1), size(image,2)))

  function next_point() :: Nullable{Point}
    point = Nullable()

    for row in 1:size(image, 1) |> reverse
      for col in 1:size(image, 2) |> reverse
        if image[row, col] > 0.0 && history[row, col] == 0.0
          point = Nullable((row, col))
          history[row, col] = 1.0

          return point
        end
      end
    end
  end

  function next_point(point :: Nullable{Point}) :: Tuple{Nullable{Point}, Int64}
    result = Nullable()
    trans = 0

    function direction_to_moves(direction :: Int64) :: Tuple{Int64, Int64}
      # for frequencies:
      # 8 1 2
      # 7 - 3
      # 6 5 4
      [
       ( -1,  0 ),
       ( -1,  1 ),
       (  0,  1 ),
       (  1,  1 ),
       (  1,  0 ),
       (  1, -1 ),
       (  0, -1 ),
       ( -1, -1 ),
      ][direction]
    end

    function peek_point(direction :: Int64) :: Nullable{Point}
      actual_current = get(point)

      row_move, col_move = direction_to_moves(direction)

      new_row = actual_current[1] + row_move
      new_col = actual_current[2] + col_move

      if new_row <= size(image, 1) && new_col <= size(image, 2) &&
         new_row >= 1 && new_col >= 1
        return Nullable((new_row, new_col))
      else
        return Nullable()
      end
    end

    for direction in 1:8
      peeked = peek_point(direction)

      if !isnull(peeked)
        actual = get(peeked)
        if image[actual[1], actual[2]] > 0.0 && history[actual[1], actual[2]] == 0.0
          result = peeked
          history[actual[1], actual[2]] = 1
          trans = direction
          break
        end
      end
    end

    ( result, trans )
  end

  function trans_to_simples(transition :: Int64) :: Array{Int64}
    # for frequencies:
    # 8 1 2
    # 7 - 3
    # 6 5 4

    # for simples:
    # - 1 -
    # 4 - 2
    # - 3 -
    [
      [ 1 ],
      [ 1, 2 ],
      [ 2 ],
      [ 2, 3 ],
      [ 3 ],
      [ 3, 4 ],
      [ 4 ],
      [ 1, 4 ]
    ][transition]
  end

  transitions     = zeros(8)
  simples         = zeros(16)
  last_simples    = [ ]
  point           = next_point()
  num_transitions = .0
  ind(r, c) = (c - 1)*4 + r

  while !isnull(point)
    point, trans = next_point(point)

    if isnull(point)
      point = next_point()
    else
      current_simples = trans_to_simples(trans)
      transitions[trans] += 1
      for simple in current_simples
        for last_simple in last_simples
          simples[ind(last_simple, simple)] +=1
        end
      end
      last_simples = current_simples
      num_transitions += 1.0
    end
  end

  (transitions ./ num_transitions, simples ./ num_transitions)
end
All those gathered features can be turned into rows with:
function features_to_row(features :: ImageStats)
  lefts       = [ features.left.min,  features.left.max  ]
  rights      = [ features.right.min, features.right.max ]

  left_ats    = [ features.left.at[i]  for i in 1:features.image_dim ]
  left_diffs  = [ features.left.diff[i]  for i in 1:features.image_dim ]
  right_ats   = [ features.right.at[i] for i in 1:features.image_dim ]
  right_diffs = [ features.right.diff[i]  for i in 1:features.image_dim ]
  frequencies = features.direction_frequencies
  simples     = features.simple_direction_frequencies

  vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
end
Similarly we can construct the column names with:
function features_columns(image :: Array{Float64})
  image_dim   = Base.isqrt(length(image))

  lefts       = [ :left_min,  :left_max  ]
  rights      = [ :right_min, :right_max ]

  left_ats    = [ Symbol("left_at_",  i) for i in 1:image_dim ]
  left_diffs  = [ Symbol("left_diff_",  i) for i in 1:image_dim ]
  right_ats   = [ Symbol("right_at_", i) for i in 1:image_dim ]
  right_diffs = [ Symbol("right_diff_", i) for i in 1:image_dim ]
  frequencies = [ Symbol("direction_freq_", i)   for i in 1:8 ]
  simples     = [ Symbol("simple_trans_", i)   for i in 1:4^2 ]

  vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
end
The data frame constructed with the get_data function can be easily dumped into a CSV file with the writetable function from the DataFrames package.

As you can see, gathering and extracting features is a lot of work. All of it was needed because in this article we're focusing on the somewhat "classical" way of doing machine learning. You might have heard about algorithms that mimic how the human brain learns. We're not focusing on them here; we will explore them in a future article.

We use the mentioned writetable on data frames computed for both training and test datasets to store two files: processed_train.csv and processed_test.csv.

Choosing the model

For the task of classifying I decided to use the XGBoost library, which is something of a hot new technology in the world of machine learning. It implements gradient boosting over decision trees and often performs better than the so-called Random Forest algorithm. The reader can read more about XGBoost on its website: http://xgboost.readthedocs.io/.

Both random forest and xgboost revolve around the idea called ensemble learning. In this approach we're not getting just one learning model — the algorithm actually creates many variations of models and uses them to collectively come up with better results. This is as much as can be written as a short description as this article is already quite lengthy.
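
Just to give a taste of the ensemble idea, here's a toy Python sketch in which three very weak "models" vote on a digit. Everything in it is hypothetical: the thresholds are invented and the feature names are only borrowed from our column names. It shows plain majority voting, which is closer in spirit to a random forest; boosting libraries like XGBoost combine their trees differently, by adding up the trees' outputs.

```python
from collections import Counter


def majority_vote(models, sample):
    """Ask every model for a prediction and return the most common answer."""
    votes = [model(sample) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three deliberately simplistic rules with made-up thresholds
models = [
    lambda x: 1 if x["right_max"] < 5 else 8,
    lambda x: 1 if x["left_min"] > 10 else 0,
    lambda x: 1 if x["direction_freq_1"] > 0.6 else 7,
]

sample = {"right_max": 3, "left_min": 12, "direction_freq_1": 0.2}
prediction = majority_vote(models, sample)
print(prediction)
```

Here the third rule gets it "wrong" and says 7, but it is outvoted by the other two.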

Training the model

The training and classification code in R is very simple. We first need to load the libraries that will allow us to load data as well as to build the classification model:
library(xgboost)
library(readr)
Loading the data into data frames is equally straight-forward:
processed_train <- read_csv("processed_train.csv")
processed_test <- read_csv("processed_test.csv")
We then move on to preparing the vector of labels for each row as well as the matrix of features:
labels = processed_train$label
features = processed_train[, 2:141]
features = scale(features)
features = as.matrix(features)

The train-test split

When working with models, one of the ways of evaluating their performance is to split the data into so-called train and test sets. We train the model on one set and then we predict the values from the test set. We then calculate the accuracy of predicted values as the ratio between the number of correct predictions and the number of all observations.
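
The accuracy described above is simple enough to compute by hand. A minimal Python sketch, with made-up predictions and labels:

```python
def accuracy(predicted, actual):
    """Ratio of correct predictions to the number of all observations."""
    if len(predicted) != len(actual):
        raise ValueError("both lists must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical values: four out of five predictions are right
score = accuracy([3, 1, 4, 1, 5], [3, 1, 4, 7, 5])
print(score)
```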

Because Kaggle provides the test set without labels, for the sake of evaluating the model's performance without the need to submit the results, we'll split our Kaggle-training set into local train and test ones. We'll use the amazing caret library which provides a wealth of tools for doing machine learning:
library(caret)

index <- createDataPartition(processed_train$label, p = .8, 
                             list = FALSE, 
                             times = 1)

train_labels <- labels[index]
train_features <- features[index,]

test_labels <- labels[-index]
test_features <- features[-index,]
The above code splits the set using sampling that is balanced with respect to the labels, so that the train set is approximately 80% of the whole data set.
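
To show what a label-aware split means, here's a small Python sketch that splits the row indices separately within each label. It only illustrates the idea and is not how caret implements createDataPartition; the function name `stratified_split` and the toy labels are assumptions made for this example:

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split row indices into train and test parts so that every label
    contributes roughly train_fraction of its rows to the train part."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical, unbalanced labels: ten zeros and twenty ones
labels = [0] * 10 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With a purely random split, a rare label could end up almost entirely in one of the two parts, which would make the local accuracy less trustworthy.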

Using XGBoost as the classification model

We can now make our data digestible by the XGBoost library:
train <- xgb.DMatrix(as.matrix(train_features), label = train_labels)
test  <- xgb.DMatrix(as.matrix(test_features),  label = test_labels)
The next step is to make XGBoost learn from our data. The actual parameters and their explanations are beyond the scope of this overview article, but the reader can look them up on the XGBoost pages:
model <- xgboost(train,
                 max_depth = 16,
                 nrounds = 600,
                 eta = 0.2,
                 objective = "multi:softmax",
                 num_class = 10)
It's critically important to pass the objective as "multi:softmax" and num_class as 10.
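
Roughly speaking, with this objective the model produces one score per class and reports the class with the highest softmax probability. The Python sketch below only illustrates that last step with made-up scores; it says nothing about how XGBoost computes the scores internally, and the function names are my own:

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Index of the class with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for the ten digits 0..9
scores = [0.1, -1.2, 0.3, 0.0, 0.5, -0.7, 0.2, 2.4, 0.9, 1.1]
probs = softmax(scores)
digit = predict_class(scores)
print(digit)
```

This is also why num_class matters: the model has to know how many scores to produce for every image.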

Simple performance evaluation with confusion matrix

After waiting a while (a couple of minutes) for the last batch of code to finish computing, we now have the classification model ready to be used. Let's use it to predict the labels from our test set:
predicted = predict(model, test)
This returns the vector of predicted values. We'd now like to check how well our model predicts the values. One of the easiest ways is to use the so-called confusion matrix.

As per Wikipedia, a confusion matrix is simply:

(...) also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

The caret library provides a very easy to use function for examining the confusion matrix and statistics derived from it:
confusionMatrix(data=predicted, reference=test_labels)
The function returns an R list that gets pretty printed to the R console. In our case it looks like the following:
Confusion Matrix and Statistics

          Reference
Prediction   0   1   2   3   4   5   6   7   8   9
         0 819   0   3   3   1   1   2   1  10   5
         1   0 923   0   4   5   1   5   3   4   5
         2   4   2 766  26   2   6   8  12   5   0
         3   2   0  15 799   0  22   2   8   0   8
         4   5   2   1   0 761   1   0  15   4  19
         5   1   3   0  13   2 719   3   0   9   6
         6   5   3   4   1   6   5 790   0  16   2
         7   1   7  12   9   2   3   1 813   4  16
         8   6   2   4   7   8  11   8   5 767  10
         9   5   2   1  13  22   6   1  14  14 746

Overall Statistics
                                         
               Accuracy : 0.9411         
                 95% CI : (0.9358, 0.946)
    No Information Rate : 0.1124         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9345         
 Mcnemar's Test P-Value : NA             

(...)
Each column in the matrix represents the actual labels, while the rows represent what our algorithm predicted the value to be. There's also the accuracy rate printed for us and in this case it equals 0.9411. This means that our code was able to predict correct values of handwritten digits for 94.11% of observations.
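
If you'd like to see how such a table comes about, here's a minimal Python sketch that builds a confusion matrix with the same orientation (rows are predictions, columns are actual labels) and derives the accuracy from its diagonal. The data is made up and the function names are my own; caret of course reports many more statistics:

```python
def confusion_matrix(predicted, actual, num_classes):
    """matrix[p][a] counts observations predicted as p whose actual label is a."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, a in zip(predicted, actual):
        matrix[p][a] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Correct predictions sit on the diagonal of the matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical predictions for a three-class problem
predicted = [0, 1, 1, 2, 2, 2]
actual = [0, 1, 2, 2, 2, 0]
matrix = confusion_matrix(predicted, actual, 3)
score = accuracy_from_matrix(matrix)
print(matrix, score)
```

Everything off the diagonal is a mistake, and the position tells us which two classes were mixed up.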

Submitting the results

We got an accuracy rate of 0.9411 for our local test set, and it turned out to be very close to the one we got against the test set coming from Kaggle. After predicting the competition values and submitting them, the accuracy rate computed by Kaggle was 0.94357. That's quite okay given that we're not using any of the new and fancy techniques here.

Also, we haven't done any parameter tuning, which could surely improve the overall accuracy. We could also revisit the code from the feature extraction phase. One improvement I can think of would be to first crop and resize back, and only then compute the skeleton, which might preserve more information about the shape. We could also use the confusion matrix: taking the number that was confused the most, we could look at the real images that we failed to recognize. This could lead us to conclusions about improvements to our feature extraction code. There's always a way to extract more information.

Nowadays, Kagglers from around the world are successfully using advanced techniques like Convolutional Neural Networks, getting accuracy scores close to 0.999. Those live in a somewhat different branch of the machine learning world though. Using this type of neural network we don't need to do the feature extraction on our own. The algorithm includes a step that automatically gathers features, which it later feeds into the network itself. We will take a look at them in a future article.

See also

infoShare 2017 - JavaScript, JavaScript everywhere

Last week was really interesting for me. I attended infoShare 2017, the biggest tech conference in central-eastern Europe. The agenda was impressive, but that’s not everything. There was a startup competition going on and really, I’m totally impressed.

infoShare in numbers:

  • 5500 attendees
  • 133 speakers
  • 250 startups
  • 122 hours of speeches
  • 12 side events
Let’s go through each speech I attended.

Day 1

Why Fast Matters by Harry Roberts from csswizardry.com

Harry tried to convince us that performance is important.


Great speech, showing that it’s an interesting problem not only from a financial point of view. You must see it. Here's a link to his presentation: https://speakerdeck.com/csswizardry/why-fast-matters

Dirty Little Tricks From The Dark Corners of Front-End by Vitaly Friedman from smashingmagazine.com

It was magic! I work a lot with CSS, but this speech showed me some new ideas and reminded me that the simplest solution is often not the best one and that we should reuse CSS between components as much as possible.

Keep it DRY!

One of these tricks is a quantity query CSS selector. It’s a pretty complex selector that can apply your styles to elements based on the number of siblings. (http://quantityqueries.com/)

The Art of Debugging (browsers) by Remy Sharp

It was great to watch another developer's workflow during debugging. I usually work from home, so I rarely get the chance to do that.

Remy is a very experienced JavaScript developer and showed us his skills and tricks; the Chrome developer console integration was especially interesting.

I always thought that using the developer console for programming was not the best idea, but maybe I was wrong? It looked pretty neat.

Desktop Apps with JavaScript by Felix Rieseberg from Slack

Felix from Slack presented and showed the power of hybrid desktop apps. He used a framework called Electron. Using Electron you can build native, cross-platform desktop apps using HTML, JavaScript and CSS. I don’t think that it’s the best approach for more complex applications, and it probably takes more system memory than truly native applications, but for simpler apps it can be the way to go!

GitHub uses it to build their desktop app, so maybe it’s not so slow? :)

RxJava in existing projects by Tomasz Nurkiewicz from Allegro

Tomasz Nurkiewicz from Allegro showed us his high programming skills and provided some practical RxJava examples. RxJava is a library for composing asynchronous and event-based programs using observable sequences for the Java VM.

Definitely something to read about.

Day 2

What does a production ready Kubernetes application look like? by Carter Morgan from Google

Carter Morgan from Google showed us practical uses of Kubernetes.

Kubernetes is an open-source system for automating deployment, scaling and management of containerized applications. It was originally designed by Google developers and I think that they really want to popularize it. It looked like Kubernetes has a low learning curve, but the devops people I spoke with after the presentation were sceptical, saying that if you know how to use Docker Swarm then you don’t really need Kubernetes.

Vue.js and Service Workers become Realtime by Blake Newman from Sainsbury's

Blake Newman is a JavaScript developer, member of the core Vue.js (trending, hot JavaScript framework) team. He explained how to use Vue.js with service workers.

Service workers are scripts that your browser runs in the background. It was nice to see how it all fits together, even though they're not yet supported by every popular browser.

Listen to your application and sleep by Gianluca Arbezzano from InfluxData

Gianluca showed us his modern and flexible monitoring stack. He gave great tips, mostly discussing and recommending InfluxDB and Telegraf, which we use a lot at End Point.

He was right that it’s easy to configure, open-source and really useful. Great speech!

Summary

Amazing two days. All the presentations will be available on YouTube soon: https://www.youtube.com/user/infoSharePL/videos.

I can fully recommend this conference, see you next time!

Drupal - rapid development

Here at End Point, we have had the pleasure of being a part of multiple Drupal 6, 7 and 8 projects. Most of our clients wanted to use the latest Drupal version, to have a stable platform with long-term support.

A few years ago, I already had a lot of experience with PHP itself and with various other PHP frameworks like WordPress, Joomla! or TYPO3. I was happy to use all of them, but then one of our clients asked us for a simple Drupal 6 task. That’s how I started my Drupal journey, which continues until now.

To be honest, I had a difficult start. It was different, new and pretty inscrutable for me. After a few days of reading documentation and playing with the system I was ready to do some simple work. Here, I want to share my thoughts about Drupal and tell you why I LOVE it!

Low learning curve

It took, of course, a few months until I was ready to build something more complex, but it really takes only a few days to be ready for simple development. It’s not only about Drupal but also about PHP: it’s much cheaper to maintain and extend a project. Maybe that’s not so important with smaller projects, but it’s definitely important for massive code bases. Programmers can jump in and start being productive really quickly.

Great documentation

Drupal documentation is well structured and constantly developed; usually you can find what you need within a few minutes. That’s critical, a must-have for any framework, and unfortunately not so common.

Big community

The Drupal community is one of the biggest IT communities I have ever encountered. They extend, fix and document the Drupal core regularly. Most of them have their other jobs and work on this project just for fun and with passion.

It’s free

It’s an open source project, that’s one of the biggest pros here. You can get it for free, you can get support for free, you can join the community for free too (:)).

Modules

On the official Drupal website you can find tons of free plugins/modules. It’s a time and money saver: you don’t need to reinvent the wheel for every new widget on your website, so you can focus on the fireworks.

Usually you can just go there and find a proper component. E-commerce shop? Slideshow? Online classifieds website? No problem! It’s all there.

PHP7 support

I can often hear from other developers that PHP is slow, well, it’s not the Road Runner, but come on, unless you are Facebook (and I think that they, correct me if I’m wrong, still use PHP :)) it’s just OK to use PHP.

Drupal fully supports PHP7.

With PHP7 it’s much faster, better and safer. To learn more: https://pages.zend.com/rs/zendtechnologies/images/PHP7-Performance%20Infographic.pdf.

In the infographic you can see that PHP7 is much faster than Ruby, Perl and Python when rendering a Mandelbrot fractal. In general, you definitely can’t say that PHP is slow, and the same goes for Drupal.

REST API support

Drupal has a built-in, ready-to-use API system. In a few moments you can spawn a new API endpoint for your application. You don’t need to implement a whole API by yourself. I have done that a few times in multiple languages and, believe me, it’s problematic.

Perfect for a backend system

Drupal is a perfect candidate for a backend system. Let’s imagine that you want to build a beautiful mobile application. You want to let editors and other people edit content. You want to grab this content through the API. It’s easy as pie with Drupal.

Drupal’s web interface is stable and easy to use.

Power of taxonomies

Taxonomies are, basically, just dictionaries. The best thing about taxonomies is that you don’t need to touch code to play with them.

Let’s say that on your website you want to create a list of states in the USA. With most frameworks you need to ask your developer/technical person to do so. With taxonomies you just need a few clicks and that’s it, you can put it on your website. That’s sweet, not only for non-technical people, but for us developers as well. Again, you can focus on actually making the website attractive, rather than spending time on things that can be automated.

Summary

Of course, Drupal is not perfect, but it’s undeniably a great tool. Mobile application, single page application, corporate website: there are no limits for this content management system. It is, in my opinion, the best tool to manage your content, and that does not mean that you need to use Drupal to present it. You can create a mobile, ReactJS, AngularJS, or VueJS application and combine it with Drupal easily.

I hope that you’ve enjoyed reading this, and I look forward to hearing back from you! Thanks.

Malaysia Open Source Conference (MOSC) 2017

The three-day Malaysia Open Source Conference (MOSC) ended last week. MOSC is an open source conference held annually, and this year it reached its 10th anniversary. I attended the conference with a selective focus on presentations related to system administration, computer security and web application development.

The First Day

The first day's talks were occupied with keynotes from the conference sponsors and major IT brands. After the opening speech and a lightning talk from the community, Mr Julian Gordon delivered his speech on the Hyperledger project, a ledger based on blockchain technology. Later Mr Sanjay delivered his speech on open source implementation in the financial sector in Malaysia. Before the lunch break we listened to Mr Jay Swaminathan from Microsoft, who presented his talk on Azure-based services for blockchain technology.




For the afternoon part of the first day I attended a talk by Mr Shak Hassan on Electron-based application development. You can read his slides here. I personally use an Electron-based application for Zulip, so even as a non web developer I already had a mental picture of what Electron is prior to the talk, but the speaker's session enlightened me more on what happens in the background of the application. Finally, before I went back on the first day, I attended a slot delivered by Intel Corp on the Yocto Project, with which we can automate the process of creating a bootable Linux image for any platform, whether it is an Intel x86/x86_64 platform or an ARM-based platform.



The Second Day

The second day of the conference started with a talk from Malaysia Digital Hub. The speaker, Diana, presented the state of Malaysian-based startups which are currently being shaped and assisted by Malaysia Digital Hub, and also the ones which have already matured and are able to stand by themselves. Later, a presenter from Google, Mr Dambo Ren, presented a talk on Google cloud projects.



He also pointed out several major services which are available on the cloud, for example TensorFlow. After that I chose to enter the Scilab software slot. Dr Khatim, who is an academician, shared his experience of using Scilab, an open source software package similar to Matlab, in his research and with his students. Later I entered a speaking slot titled "Electronic Document Management System with Open Source Tools".


Here two speakers from Cyber Security Malaysia (an agency within Malaysia's Ministry of Science and Technology) presented their studies on two open source document management systems, OpenDocMan and LogicalDoc. The evaluation metrics were based on the following elements: ease of access, costs, a centralized repository, disaster recovery and security features. From their observation, LogicalDoc scored higher than OpenDocMan.

After that I attended a talk by Mr Kamarul on his experience using the R language and RStudio at his university for medical research. After the lunch break it was my turn to deliver a workshop. My talk was targeted at entry-level system administration: I shared my experiences using tmux/screen, git, AIDE to monitor file changes on our machines, and Ansible to automate common tasks as much as possible within the system administration context. I demonstrated Ansible with multiple Linux distros (CentOS and Debian/Ubuntu) in order to show how Ansible handles heterogeneous Linux distributions when commands are executed. Most of the material was presented "live" during the workshop, but I also created slides to help the audience and the public get the basic ideas of the tools I presented. You can read them here [PDF].


The Third Day (Finale)

On the third day I joined a workshop slot delivered by a speaker under the pseudonym Wak Arianto (not his real name). He explained Suricata, a tool whose pattern-matching syntax is almost the same as that of the well-known Snort IDS. Mr Wak explained OS fingerprinting concepts, flowbits, and later how to create rules with Suricata. It was an interesting talk, as I could see how to quarantine suspicious files captured from the network (say, possible malware) in a sandbox for further analysis. As far as I understood from the demo and from my extra reading, flowbits is a syntax that Suricata uses to grab the state of a session for detection purposes, and it works primarily with TCP. You can read an article about flowbits here. It is called flowbits because it does its parsing on TCP flows. Based on the writings here, I can see that we can check the state of the TCP connection (for example, whether it is established).

I had a chance to listen to the FreeBSD developers' slot too. We were lucky to have Mr Martin Wilke, who lives in Malaysia and actively advocates FreeBSD to the local community. Together with Mr Muhammad Moinur Rahman, another FreeBSD developer, he presented the FreeBSD development ecosystem and the current state of the operating system.



Possibly we saved the best for last: I attended a Wi-Fi security workshop presented by Mr Matnet and Mr Jep (both are pseudonyms). The workshop began with the theoretical foundations of wireless technology and then covered the development of encryption around it.



The outline of the talk is available here. The speakers introduced the frame types of the 802.11 protocols, which include Control Frames, Data Frames and Management Frames. Management Frames are unencrypted, so the attacking tools were developed to concentrate on this part.



Management Frames are susceptible to the following attacks:
  • Deauthentication Attacks
  • Beacon Injection Attacks
  • Karma/MANA Wifi Attacks
  • EvilTwin AP Attacks

    Matnet and Jep also showed a social engineering tool called "WiFi Phisher", which can be used as (according to the developer's page on GitHub) a "security tool that mounts automated victim-customized phishing attacks against WiFi clients in order to obtain credentials or infect the victims with malwares". It works together with EvilTwin AP attacks: after achieving a man-in-the-middle position, Wifiphisher redirects all HTTP requests to an attacker-controlled phishing page. Matnet told us the safest way to work within a WiFi environment is to use a device that supports 802.11w (which is yet to be widely found, at least in Malaysia). I found some information on 802.11w here that could help in understanding this protocol a bit.

    Conclusion

    For me this is the most anticipated annual event, where I can meet professionals from different backgrounds and keep my knowledge up to date with the latest developments of open source tools in the industry. The organizers surely did a good job with this event, and I hope to attend again next year! Thank you for giving me the opportunity to speak at this conference (and for the nice swag too!)

    Apart from MOSC, I also plan to attend the annual Python Conference (PyCon), which will be special this year as it will be organized at the Asia Pacific (APAC) level. You can read more about PyCon APAC 2017 here (in case you would like to attend this event).

    End Point Liquid Galaxy at GEOINT Symposium

    End Point Liquid Galaxy will be coming to San Antonio to participate in the GEOINT 2017 Symposium. We are excited to demonstrate our geospatial capabilities on an immersive and panoramic 7-screen Liquid Galaxy system. We will be exhibiting at booth #1012 from June 4-7.

    On the Liquid Galaxy, complex data sets can be explored and analyzed in a 3D immersive fly-through environment. Presentations can highlight specific data layers combined with video, 3D models, and browsers for maximum communications efficiency. The end result is a rich, highly immersive, and engaging way to experience your data.

    Liquid Galaxy’s extensive capabilities include ArcGIS, Cesium, Google Maps, Google Earth, LIDAR point clouds, realtime data integration, 360 panoramic video, and more. The system always draws huge crowds at conferences; people line up to try out the system for themselves.

    End Point has deployed Liquid Galaxy systems around the world. This includes many high profile clients, such as Google, NOAA, CBRE, National Air & Space Museum, Hyundai, and Barclays. Our clients utilize our content management system to create immersive and interactive presentations that tell engaging stories to their users.

    GEOINT is hosted and produced by the United States Geospatial Intelligence Foundation (USGIF). It is the nation’s largest gathering of industry, academia, and government to include Defense, Intelligence and Homeland Security communities as well as commercial, Fed/Civil, State and Local geospatial intelligence stakeholders.

    We look forward to meeting you at booth #1012 at GEOINT. In the meantime, if you have any questions please visit our website or email ask@endpoint.com.

    Age comparison in Bash for files and processes

    Do you want your script to run a command only if the elapsed time for a given process is greater than X?

    Well, bash does not inherently understand a time comparison like:

    if [ 01:23:45 -gt 00:05:00 ]; then
        foo
    fi
    

    However, bash can compare timestamps of files using -ot and -nt for "older than" and "newer than", respectively. If the launch of our process includes creation of a PID file, then we are in luck! At the beginning of our loop, we can create a file with a specific age and use that for quick and simple comparison.

    For example, if we only want to take action when the process we care about was launched longer than 24 hours ago, try:

    touch -t "$(date --date=yesterday +%Y%m%d%H%M.%S)" "$STAMPFILE"
    

    Then, within your script loop, compare the PID file with the $STAMPFILE, like this:

    if [ "$PIDFILE" -ot "$STAMPFILE" ]; then
        foo
    fi
    

    And of course if you want to be sure you're working with the PID file of a process which is actually responding, you can try to send it signal 0 to check:

    if kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        foo
    fi
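    Putting the pieces together, here is a minimal sketch of how the age check and the liveness check might be combined. The function name pid_older_than_a_day and the use of mktemp for the stamp file are my own assumptions for illustration, and the date invocation relies on GNU date.

```shell
#!/bin/sh
# Sketch only: succeed (return 0) when the PID file is more than a day old
# AND the process it names still responds to signal 0.
# pid_older_than_a_day is a hypothetical helper name.
pid_older_than_a_day() {
    pidfile=$1
    stampfile=$(mktemp) || return 2

    # Give the stamp file a modification time of 24 hours ago (GNU date).
    touch -t "$(date --date=yesterday +%Y%m%d%H%M.%S)" "$stampfile"

    result=1
    # -ot: the PID file was last modified before the stamp file.
    # kill -0: sends no signal, only checks that the process exists.
    if [ "$pidfile" -ot "$stampfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        result=0
    fi

    rm -f "$stampfile"
    return "$result"
}
```

    A script loop could then call something like: if pid_older_than_a_day "$PIDFILE"; then foo; fi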
    

    Postal code pain and fun

    We do a lot of ecommerce development at End Point. You know the usual flow as a customer: Select products, add to the shopping cart, then check out. Checkout asks questions about the buyer, payment, and delivery, at least. Some online sales are for “soft goods”, downloadable items that don’t require a delivery address. Many online sales are still for physical goods to be delivered to an address. For that, a postal code or zip code is usually required.

    No postal code?

    I say usually because there are some countries that do not use postal codes at all. An ecommerce site that expects to ship products to buyers in one of those countries needs to allow for an empty postal code at checkout time. Otherwise, customers may leave thinking they aren’t welcome there. The more creative among them will make up something to put in there, such as “00000” or “99999” or “NONE”.

    Someone has helpfully assembled and maintains a machine-readable (in Ruby, easily convertible to JSON or other formats) list of the countries that don’t require a postal code. You may be surprised to see on the list such countries as Hong Kong, Ireland, Panama, Saudi Arabia, and South Africa. Some countries on the list actually do have postal codes but do not require them or commonly use them.
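    As a rough sketch of how a checkout might use such a list, the check below accepts an empty postal code only for countries where one is not required. The function names are hypothetical, and the hard-coded ISO country codes are just the five examples named above, not the full maintained list.

```shell
#!/bin/sh
# Sketch only: a tiny, illustrative subset of the "no postal code required" list
# (Hong Kong, Ireland, Panama, Saudi Arabia, South Africa).

# Returns 0 (true) when a postal code must be supplied for the given
# ISO 3166-1 alpha-2 country code.
postal_code_required() {
    case "$1" in
        HK|IE|PA|SA|ZA) return 1 ;;  # not required for these countries
        *) return 0 ;;
    esac
}

# Returns 0 (true) when the submitted postal code may be accepted as-is:
# an empty value is fine only where a postal code is not required.
postal_code_acceptable() {
    country=$1
    code=$2
    if [ -z "$code" ] && postal_code_required "$country"; then
        return 1
    fi
    return 0
}
```

    With this in place, postal_code_acceptable HK "" succeeds while postal_code_acceptable US "" fails, so customers in the listed countries are not forced to invent a code.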

    Do you really need the customer’s address?

    When selling both downloadable and shipped products, it would be nice to not bother asking the customer for an address at all. Unfortunately even when there is no shipping address because there’s nothing to ship, the billing address is still needed if payment is made by credit card through a normal credit card payment gateway — as opposed to PayPal, Amazon Pay, Venmo, Bitcoin, or other alternative payment methods.

    The credit card Address Verification System (AVS) allows merchants to ask a credit card issuing bank whether the mailing address provided matches the address on file for that credit card. Normally only two parts are checked: (1) the street address numeric part, for example, “123” if “123 Main St.” was provided; (2) the zip or postal code, normally only the first 5 digits for US zip codes. AVS often doesn’t work at all for non-US postal codes and non-US banks.

    Before sending the address to AVS, validating the format of postal codes is simple for many countries: 5 digits in the US (allowing an optional -nnnn for ZIP+4), and 4 or 5 digits in most other countries (see the Wikipedia List of postal codes in various countries for a high-level view). Canada is slightly more complicated: 6 characters total, alternating letters and digits, formally with a space in the middle, like K1A 0B1, as explained in Wikipedia’s components of a Canadian postal code.
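    To make the two checked parts concrete, here is a small sketch that pulls them out of what a customer typed. The function names are hypothetical, and treating the leading run of digits as the “numeric part” is my simplifying assumption; a real payment gateway may extract these values differently.

```shell
#!/bin/sh
# Sketch only: extract the two pieces of an address that AVS normally compares.

# Print the leading run of digits in a street address,
# e.g. "123" from "123 Main St."; prints nothing if there is none.
avs_street_number() {
    printf '%s\n' "$1" | sed -n 's/^[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the first 5 characters of a US zip code, e.g. "84101" from "84101-1234".
avs_zip5() {
    printf '%s\n' "$1" | cut -c1-5
}
```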

    So most countries’ postal codes can be validated in software with simple regular expressions, to catch typos such as transpositions and missing or extra characters.
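    For illustration, the US and Canadian formats described above could be checked with something like the following. The function names are hypothetical, and the Canadian pattern only checks the letter/digit shape (with an optional space), not which letters Canada Post actually uses.

```shell
#!/bin/sh
# Sketch only: format checks for US and Canadian postal codes.

# US: 5 digits, optionally followed by -nnnn (ZIP+4).
valid_us_zip() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[0-9]{5}(-[0-9]{4})?$'
}

# Canada: letter digit letter, optional space, digit letter digit.
valid_ca_postal_code() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$'
}
```

    A typo such as a missing digit (8410) or a truncated Canadian code (K1A 0B) fails these checks, while 84101, 84101-1234, and K1A 0B1 pass.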

    UK postcodes

    The most complicated postal codes I have worked with are the United Kingdom’s, because they can be from 5 to 7 characters, with an unpredictable mix of letters and numbers, normally formatted with a space in the middle. The benefit they bring is that they encode a lot of detail about the address, and it’s possible to catch transposed character errors that would be missed in a purely numeric postal code. The Wikipedia article Postcodes in the United Kingdom has the gory details.

    It is common to use a regular expression to validate UK postcodes in software, and many of these regexes are to some degree wrong. Most let through many invalid postcodes, and some disallow valid codes.

    We recently had a client get a customer report of a valid UK postcode being rejected during checkout on their ecommerce site. The validation code was using a regex that is widely copied in software in the wild:

    [A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?[0-9][ABD-HJLN-UW-Z]{2}

    (This example removes support for the odd exception GIR 0AA for simplicity’s sake.)

    The customer’s valid postcode that doesn’t pass that test was W1F 0DP, in London, which the Royal Mail website confirms is valid. The problem is that the regex above doesn’t allow for F in the third position, as that was not valid at the time the regex was written.
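    The failure is easy to reproduce with a small sketch. Here I am assuming the postcode is normalized first (spaces removed, letters uppercased) and that the regex is anchored at both ends; the original validation code may have handled those details differently, and the names below are hypothetical.

```shell
#!/bin/sh
# Sketch only: the widely copied regex from above, anchored with ^ and $.
OLD_UK_REGEX='^[A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?[0-9][ABD-HJLN-UW-Z]{2}$'

# Returns 0 (true) when the postcode passes the old regex.
old_uk_postcode_ok() {
    printf '%s\n' "$1" \
        | tr -d ' ' \
        | tr '[:lower:]' '[:upper:]' \
        | LC_ALL=C grep -Eq "$OLD_UK_REGEX"
}

# Try a postcode the regex accepts and the customer's one that it rejects.
for postcode in "SW1A 1AA" "W1F 0DP"; do
    if old_uk_postcode_ok "$postcode"; then
        echo "$postcode: accepted"
    else
        echo "$postcode: rejected"
    fi
done
```

    The anchoring matters: without ^ and $, the substring 1F0DP would satisfy the pattern on its own and the problem would be hidden.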

    This is one problem with being too strict in validations of this sort: The rules change over time, usually to allow things that once were not allowed. Reusable, maintained software libraries that specialize in UK postal codes can keep up, but there is always lag time between when updates are released and when they’re incorporated into production software. And copied or customized regexes will likely stay the way they are until someone runs into a problem.

    The ecommerce site in question is running on the Interchange ecommerce platform, which is based on Perl, so the most natural place to look for an updated validation routine is on CPAN, the Perl network of open source library code. There we find the nice module Geo::UK::Postcode which has a more current validation routine and a nice interface. It also has a function to format a UK postcode in the canonical way, capitalized (easy) and with the space in the correct place (less easy).
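    To show what canonical formatting involves, here is a rough sketch that is not the Geo::UK::Postcode implementation. It relies on the fact that the final part of a UK postcode (the “inward code”) is always three characters, so the space belongs just before the last three. The function name is hypothetical, and the input is assumed to have already passed validation.

```shell
#!/bin/sh
# Sketch only: format a (presumed valid) UK postcode canonically.
# Removes spaces, uppercases, then inserts one space before the last 3 characters.
format_uk_postcode() {
    printf '%s\n' "$1" \
        | tr -d ' ' \
        | tr '[:lower:]' '[:upper:]' \
        | sed 's/\(.\{3\}\)$/ \1/'
}
```

    For example, format_uk_postcode "w1f0dp" prints W1F 0DP.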

    It also presents us with a new decision: Should we use the basic “valid” test, or the “strict” one? This is where it gets a little trickier. The “valid” check uses a regex validation approach that will still let through some invalid postcodes, because it doesn’t know what all the current valid delivery destinations are. The module also has a “strict” check that uses a comprehensive list of all the “outcode” data, which, as you can see if you look at that source code, is extensive.

    The bulkiness of that list and its short shelf life (the likelihood that it will become outdated and reject a future valid postcode) make strict validation checks like this of questionable value for basic ecommerce needs. Often it is better to let a few invalid postcodes through now so that future valid ones will also be allowed.

    The ecommerce site I mentioned also does in-browser validation via JavaScript before ever submitting the order to the server. Loading a huge list of valid outcodes would waste a lot of bandwidth and slow down checkout loading, especially on mobile devices. So a more lax regex check there is a good choice.

    When Christmas comes

    There’s no Christmas gift of a single UK postal code validation solution for all needs, but there are some fun trivia notes in the Wikipedia page covering Non-geographic postal codes:

    A fictional address is used by UK Royal Mail for letters to Santa Claus:

    Santa’s Grotto
    Reindeerland XM4 5HQ

    Previously, the postcode SAN TA1 was used.

    In Finland the special postal code 99999 is for Korvatunturi, the place where Santa Claus (Joulupukki in Finnish) is said to live, although mail is delivered to the Santa Claus Village in Rovaniemi.

    In Canada the amount of mail sent to Santa Claus increased every Christmas, up to the point that Canada Post decided to start an official Santa Claus letter-response program in 1983. Approximately one million letters come in to Santa Claus each Christmas, including from outside of Canada, and they are answered in the same languages in which they are written. Canada Post introduced a special address for mail to Santa Claus, complete with its own postal code:

    SANTA CLAUS
    NORTH POLE H0H 0H0

    In Belgium bpost sends a small present to children who have written a letter to Sinterklaas. They can use the non-geographic postal code 0612, which refers to the date Sinterklaas is celebrated (6 December), although a fictional town, street and house number are also used. In Dutch, the address is:

    Sinterklaas
    Spanjestraat 1
    0612 Hemel

    This translates as “1 Spain Street, 0612 Heaven”. In French, the street is called “Paradise Street”:

    Saint-Nicolas
    Rue du Paradis 1
    0612 Ciel

    That UK postcode for Santa doesn’t validate in some of the regexes, but the simpler Finnish, Canadian, and Belgian ones do, so if you want to order something online for Santa, you may want to choose one of those countries for delivery. :)

    Designing a Computer Science Program for Free (or Cheap)

    This blog post is for people like me who are interested in improving their knowledge about computers, software and technology in general but are inundated with an abundance of resources and no clear path to follow. Many of the courses online tend to not have any real structure. While it's great that this knowledge is available to anyone with access to the internet, it often feels overwhelming and confusing. I always enjoy a little more structure to study, much like in a traditional college setting. So, to that end I began to look at MIT's OpenCourseWare and compare it to their actual curriculum.

    I'd like to begin by acknowledging that some time ago Scott Young completed the MIT Challenge where he "attempted to learn MIT’s 4-year computer science curriculum without taking classes". My friend Najmi here at End Point also shared a great website with me to "Teach Yourself Computer Science". So, this is not the first post to try to make sense of all the free resources available to you; it's just one which tries to help organize a coherent plan of study.

    Methodology

    I wanted to mimic MIT's real CS curriculum. I also wanted to limit my studies to Computer Science only, while stripping out anything not strictly related. It's not that I am not interested in things like speech classes or more advanced mathematics and physics, but I wanted to be pragmatic about the amount of time I have each week to put in to study outside of my normal (very busy) work week. I imagine anyone reading this would understand and very likely agree.

    I examined MIT's course catalog. They have 4 undergraduate programs in the Department of Electrical Engineering and Computer Science:

    • 6-1 program: Leads to the Bachelor of Science in Electrical Science and Engineering. (Electrical Science and Engineering)
    • 6-2 program: Leads to the Bachelor of Science in Electrical Engineering and Computer Science and is for those whose interests cross this traditional boundary.
    • 6-3 program: Leads to the Bachelor of Science in Computer Science and Engineering.(Computer Science and Engineering)
    • 6-7 program: Is for students specializing in computer science and molecular biology.

    Because I wanted to stick with what I believed would be most practical for my work at End Point, I selected the 6-3 program. With my intended program selected, I also decided that the full course load for a bachelor's degree was not really what I was interested in. Instead, I just wanted to focus on the computer science related courses (with maybe some math and physics only if needed to understand any of the computer courses).

    So, looking at the requirements, I began to determine which classes I'd require. Once I had this, I could then begin to search the MIT OpenCourseWare site to ensure the classes are offered, or find suitable alternatives on Coursera or Udemy. As is typical, there are General Requirements and Departmental Requirements. So, beginning with the General Institute Requirements, let's start designing a computer science program with all the fat (non-computer science) cut out.


    General Requirements:



    I removed that which was not computer science related. As I mentioned, I was aware I may need to add some math/science. So, for the time being this left me with:


    Notice that it says

    one subject can be satisfied by 6.004 and 6.042[J] (if taken under joint number 18.062[J]) in the Department Program

    It was unclear to me what "if taken under joint number 18.062[J]" meant (nor could I find clarification) but as will be shown later, 6.004 and 6.042[J] are in the departmental requirements, so let's commit to taking those two which would leave the requirement of one more REST course. After some Googling I found the list of REST courses here. So, if you're reading this to design your own program, please remember that later we will commit to 6.004 and 6.042[J] and go here to select a course.

    So, now on to the General Institute Requirements Laboratory Requirement. We only need to choose one of three:

    • 6.01: Introduction to EECS via Robot Sensing, Software and Control
    • 6.02: Introduction to EECS via Communications Networks
    • 6.03: Introduction to EECS via Medical Technology


    So, to summarize the general requirements we will take 4 courses:

    Major (Computer Science) Requirements:


    In keeping with the idea that we want to remove non-essential, and non-CS courses, let's remove the speech class. So here we have a nice summary of what we discovered above in the General Requirements, along with details of the computer science major requirements:


    As stated, let's look at the list of Advanced Undergraduate Subjects and Independent Inquiry Subjects so that we may select one from each of them:



    Lastly, it's stated that we must

    Select one subject from the departmental list of EECS subjects

    A link is provided to do so; however, it brings you here and I cannot find a list of courses. I believe that this link no longer takes you to the intended location. A Google search brought up a similar page, but with a list of courses, as can be seen here. So, I will pick one from that page.

    The next step was to find the associated courses on MIT OpenCourseWare.

    Sample List of Classes

    So, now you will be able to follow the links I provided above to select your classes. I was not always able to find courses that matched by exact name and/or course number. Sometimes I had to read the description and look through several courses which seemed similar. I will provide my own list in case you'd just like to use mine:

    Conclusion

    So there you have it, please feel free to comment with any of your favorite resources.