
A caching, resizing, reverse proxying image server with Nginx

While working on a complex project, we had to set up a caching, reverse proxying image server with the ability to automatically resize any cached image on the fly.

Looking around on the Internet, I discovered that Nginx has a neat Image Filter module capable of resizing, cropping and rotating images. I decided to try and combine this with Nginx's well-known caching capabilities, to create an Nginx-only solution.

Here I'll describe a sample setup to achieve such a configuration.

Prerequisites


We obviously need to install Nginx.

Note that the Image Filter module is not installed by default on many Linux distributions, so we may have to install it as a separate package. If we're using Nginx's official repositories, it should just be a matter of installing the nginx-module-image-filter package, loading the module, and restarting the service.
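For example, on a Debian or Ubuntu system using the nginx.org repository, the steps look roughly like this (the package name and module path are assumptions that may vary by distribution):

# Install the dynamic Image Filter module
sudo apt-get install nginx-module-image-filter

# Load the module by adding this line near the top of /etc/nginx/nginx.conf:
#   load_module modules/ngx_http_image_filter_module.so;

# Verify the configuration and restart the service
sudo nginx -t
sudo systemctl restart nginx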

What we want to achieve in this example configuration is to have a URL like:

http://www.example.com/image/<width>x<height>/<URL>

...that will retrieve the image at:

https://upload.wikimedia.org/<URL>

...then resize it on the fly, cache it and serve it.

Cache Storage configuration


First of all, we need to set up the cache in our main http section:

proxy_cache_path /tmp/nginx_cache levels=1:2 keys_zone=nginx_cache:10M max_size=100M inactive=40d;

This will give us 10MB of storage space for cache keys and 100MB for the actual images, which will be removed after not being accessed for 40 days. These values can be tuned as needed.

Caching Proxy configuration


Next, we'll configure our front-facing virtual host.

In our case, we needed the reverse proxy to live within an already existing site, and that's why we chose the /image/ path prefix.

  server {
      listen       80;
      server_name  www.example.com;
  
      location /image/ {
          proxy_pass http://127.0.0.1:20000;
          proxy_cache nginx_cache;
          proxy_cache_key "$proxy_host$uri$is_args$args";
          proxy_cache_valid 30d;
          proxy_cache_valid any 10s;
          proxy_cache_lock on;
          proxy_cache_use_stale error invalid_header timeout updating;
          proxy_http_version 1.1;
          expires 30d;
      }
  
      location / {
          # other locations we may need for the site.
          root /var/www/whatever;
      }
  
  }

Every URL starting with /image/ will be served from the cache if present; otherwise it will be proxied to our Resizing Server and cached for 30 days.

Resizing Server configuration


Finally, we'll configure the resizing server. Here, we use a regexp to extract the width, height and URL of the image we desire.

The server will proxy the request to https://upload.wikimedia.org/ looking for the image, resize it and then serve it back to the Caching Proxy.

  server {
      listen 127.0.0.1:20000;
      server_name localhost;
  
      resolver 8.8.8.8;
  
      location ~ ^/image/([0-9]+)x([0-9]+)/(.+) {
          image_filter_buffer 20M; # Will return 415 if image is bigger than this
          image_filter_jpeg_quality 75; # Desired JPG quality
          image_filter_interlace on; # For progressive JPG
  
          image_filter resize $1 $2;
  
          proxy_pass https://upload.wikimedia.org/$3;
      }
  
  }

We may want to tune the buffer size and jpeg quality here.

Note that image_filter also offers crop and rotate options (e.g. image_filter crop $1 $2;), should we need something different from plain resizing.

Testing the final result


You should now be able to fire up your browser and access a URL like:

http://www.example.com/image/150x150/wikipedia/commons/0/01/Tiger.25.jpg

...and enjoy your caching, resizing, reverse proxying image server.

Optionally securing access to your image server


As a (simple) security measure to prevent abuse from unauthorized access, you can use the Secure Link module.

All we need to do is update the Resizing Server configuration, adding some lines to the location section:

  server {
      listen 127.0.0.1:20000;
      server_name localhost;
  
      resolver 8.8.8.8;
  
      location ~ ^/image/([0-9]+)x([0-9]+)/(.+) {
          secure_link $arg_auth;
          secure_link_md5 "$uri your_secret";
          if ($secure_link = "") {
              return 403;
          }
          if ($secure_link = "0") {
              return 410;
          }
  
          image_filter_buffer 20M; # Return 415 if image is bigger than this
          image_filter_jpeg_quality 75; # Desired JPG quality
          image_filter_interlace on; # For progressive JPG
  
          image_filter resize $1 $2;
  
          proxy_pass https://upload.wikimedia.org/$3;
      }
  
  }

To access your server you will now need to add an auth parameter to the request, with a secure token that can be easily calculated as an MD5 hash.

For example, to access the previous URL you can use the following bash command:

echo -n '/image/150x150/wikipedia/commons/0/01/Tiger.25.jpg your_secret' | openssl md5 -binary | openssl base64 | tr +/ -_ | tr -d =

...and the resulting URL will be:

http://www.example.com/image/150x150/wikipedia/commons/0/01/Tiger.25.jpg?auth=TwcXg954Rhkjt1RK8IO4jA
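If you need to generate these signed URLs from scripts, the same computation can be wrapped in a small shell helper (just a sketch based on the command above; the host and secret are the example values used in this post):

#!/bin/bash
# Usage: ./sign_image_url.sh 150x150 wikipedia/commons/0/01/Tiger.25.jpg
SECRET='your_secret'
BASE='http://www.example.com'

URI="/image/$1/$2"
# Same pipeline as above: binary MD5, base64, URL-safe alphabet, no padding
TOKEN=$(echo -n "$URI $SECRET" | openssl md5 -binary | openssl base64 | tr +/ -_ | tr -d =)

echo "$BASE$URI?auth=$TOKEN"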

How to Add Labels to a Dimple JS Line Chart

I was recently working on a project that was using DimpleJS, which the docs describe as "An object-oriented API for business analytics powered by d3". I was using it to create a variety of graphs, some of which were line graphs. The client had requested that the line graph display the y-value of each point on the graph. This is easily accomplished with bar graphs in Dimple; however, it's not so easily done with line graphs.

I spent some time Googling to find what others had done to add this functionality, but could not find a solution anywhere. So, I read the documentation where they add labels to a bar graph, and "tweaked" it like so:

var s = myChart.addSeries(null, dimple.plot.line);

// ... (rest of the chart setup) ...

/* Add prices to the line chart */
s.afterDraw = function (shape, data) {
  // Get the shape as a d3 selection
  var shapeSelection = d3.select(shape);
  var i = 0;
  _.forEach(data.points, function(point) {
    var rect = {
      x: parseFloat(point.x),
      y: parseFloat(point.y)
    };
    // Add a text label for the value just above each point.
    // "svg" is the d3 selection the chart was drawn into, e.g. the one
    // returned by dimple.newSvg() earlier in the setup.
    if (data.markerData[i] != undefined) {
      svg.append("text")
        .attr("x", rect.x)
        .attr("y", rect.y - 10)
        // Centre align
        .style("text-anchor", "middle")
        .style("font-size", "10px")
        .style("font-family", "sans-serif")
        // Format the number
        .text(data.markerData[i].y);
    }
    i++;
  });
};

Some styling still needs to be done, but you can see that the y-values are now placed on the line graph. We are using lodash on this project; if you do not want to use lodash, just replace the _.forEach(data.points, ...) call with the native data.points.forEach(...) and this technique should plug right in for you.

If you're reading this it's likely you've run into the same or similar issue and I hope this helps you!

Randomized Queries in Ruby on Rails

I was recently asked about options for displaying a random set of items from a table using Ruby on Rails. The request was complicated by the fact that the technology stack hadn't been completely decided on, and one of the items still up in the air was the database. On a previous project I worked on, the decision was made to switch from MySQL to PostgreSQL, and during the switch a sizable number of hand-constructed queries stopped functioning and had to be manually translated before they would work again. Learning from that experience, I favor avoiding handwritten SQL in my Rails queries when possible. This precludes the option of using built-in database functions like rand() or random().

With that goal in mind, I looked around to find out what other people were doing to solve similar requests. While perusing various suggested implementations, I noticed a lot of comments along the lines of "Don't use this approach if you have a large data set" or "This handles large data sets, but won't always give a truly random result."

These comments and the variety of solutions got me thinking about evaluating based not only on what database is in use, but what the dataset is expected to look like. I really enjoyed the mental gymnastics and thought others might as well.

Let’s pretend we’re working on an average project. The table we’ll be pulling from has several thousand entries and we want to pull back something small like 3-5 random records. The most common solution offered based on the research I performed works perfectly for this situation.

records_desired = 3
count = [OurObject.count, 1].max
offsets = records_desired.times.map { rand(count) }.uniq
# Keep drawing offsets until we have enough distinct ones
# (or we have exhausted the table).
while offsets.size < records_desired && offsets.size < count
  offsets << rand(count)
  offsets.uniq!
end
offsets.collect { |offset| OurObject.offset(offset).first }

Analyzing this approach, we're looking at minimal processing time and a total of four queries: one to determine the total count and three more to fetch each of our objects individually. Seems perfectly reasonable.

What happens if our client needs 100 random records at a time? The processing is still probably within tolerances, but 101 queries? I say no unless our table is Dalmatians! Let's see if we can tweak things to be more large-set friendly.

records_desired = 100
count = [OurObject.count - records_desired, 1].max
offset = rand(count)
OurObject.limit(records_desired).offset(offset)

How’s this look? Very minimal processing and only 2 queries. Fantastic! But is this result going to appear random to an observer? I think it’s highly possible that you could end up with runs of related looking objects (created at similar times or all updated recently). When people say they want random, they often really mean they want unrelated. Is this solution close enough for most clients? I would say it probably is. But I can imagine the possibility that for some it might not be. Is there something else we can tweak to get a more desirable sampling without blowing processing time sky-high? After a little thought, this is what I came up with.

records_desired = 100
count = records_desired * 3
offset = rand([OurObject.count - count, 1].max)
# Pull a contiguous block three times larger than we need,
# then sample our records randomly from within it.
ids = OurObject.limit(count).offset(offset).pluck(:id)
OurObject.find(ids.sample(records_desired))

While this approach may not provide truly more random results from a mathematical perspective, assembling a larger subset and pulling randomly from inside it gets closer to the feel people expect from randomness, especially if the previous method returned too many similar records for your needs.

Sketchfab on the Liquid Galaxy

For the last few weeks, our developers have been working on syncing our Liquid Galaxy with Sketchfab. Our integration makes use of the Sketchfab API to synchronize multiple instances of Sketchfab in the immersive and panoramic environment of the Liquid Galaxy. The Liquid Galaxy already has so many amazing capabilities, and to be able to add Sketchfab to our portfolio is very exciting for us! Sketchfab, known as the “YouTube for 3D files,” is the leading platform to publish and find 3D and VR content. Sketchfab integrates with all major 3D creation tools and publishing platforms, and is the 3D publishing partner of Adobe Photoshop, Facebook, Microsoft HoloLens and Intel RealSense. Given that Sketchfab can sync with almost any 3D format, we are excited about the new capabilities our integration provides.

Sketchfab content can be deployed onto the system in minutes! Users from many industries use Sketchfab, including architecture, hospitals, museums, gaming, design, and education. There is a natural overlap between the Liquid Galaxy and Sketchfab, as members of all of these industries utilize the Liquid Galaxy for its visually stunning and immersive atmosphere.

We recently had Alban Denoyel, cofounder of Sketchfab, visit our office to demo Sketchfab on the Liquid Galaxy. We're happy to report that Alban loved it! He told us about new features that will be coming out on Sketchfab soon. These features will automatically roll out to Sketchfab content on the Liquid Galaxy system, and will make the Liquid Galaxy's appeal for 3D modeling even greater.

We’re thrilled with how well Sketchfab works on our Liquid Galaxy as is, but we’re in the process of making it even more impressive. Some Sketchfab models take a bit of time to load (on their website and on our system), so our developers are working on having models load in the background so they can be activated instantaneously on the system. We will also be extending our Sketchfab implementation to make use of some of the features already present on Sketchfab's excellent API, including displaying model annotations and animating the models.

You can view a video of Sketchfab content on the Liquid Galaxy below. If you'd like to learn more, you can call us at 212-929-6923, or contact us here.

Gem Dependency Issues with Rails 5 Beta

The third-party gem ecosystem is one of the biggest selling points of Rails development, but the addition of a single line to your project's Gemfile can introduce literally dozens of new dependencies. A compatibility issue in any one of those gems can bring your development to a halt, and the transition to a new major version of Rails requires even more caution when managing your gem dependencies.

In this post I'll illustrate this issue by showing the steps required to get rails_admin (one of the two most popular admin interface gems for Rails) even partially up and running on a freshly generated Rails 5 project. I'll also identify some techniques for getting unreleased and forked versions of gems installed as stopgap measures to unblock your development while the gem ecosystem catches up to the new version of Rails.

After installing the current beta3 version of Rails 5 with gem install rails --pre and creating a Rails 5 project with rails new, I decided to address the first requirement of my application, an admin interface, by installing the popular Rails Admin gem. The rubygems page for rails_admin shows that its most recent release, 0.8.1 from mid-November 2015, lists Rails 4 as a requirement. And indeed, trying to install rails_admin 0.8.1 in a Rails 5 app via bundler fails with a dependency error:

Resolving dependencies...
Bundler could not find compatible versions for gem "rails":
  In snapshot (Gemfile.lock):
    rails (= 5.0.0.beta3)

  In Gemfile:
    rails (< 5.1, >= 5.0.0.beta3)

    rails_admin (~> 0.8.1) was resolved to 0.8.1, which depends on
      rails (~> 4.0)

I took a look at the GitHub page for rails_admin and noticed that recent commits make reference to Rails 5, which is an encouraging sign that its developers are working on adding Rails 5 compatibility. Looking at the gemspec in the master branch on GitHub shows that rails_admin's Rails dependency has been broadened to include both Rails 4 and 5, so I updated my app's Gemfile to install rails_admin directly from the master branch on GitHub:

gem 'rails_admin', github: 'sferik/rails_admin'

This solved the above dependency of rails_admin on Rails 4 but revealed some new issues with gems that rails_admin itself depends on:

Resolving dependencies...
Bundler could not find compatible versions for gem "rack":
  In snapshot (Gemfile.lock):
    rack (= 2.0.0.alpha)

  In Gemfile:
    rails (< 5.1, >= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
      actionmailer (= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
        actionpack (= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
          rack (~> 2.x)

    rails_admin was resolved to 0.8.1, which depends on
      rack-pjax (~> 0.7) was resolved to 0.7.0, which depends on
        rack (~> 1.3)

    rails (< 5.1, >= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
      actionmailer (= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
        actionpack (= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
          rack-test (~> 0.6.3) was resolved to 0.6.3, which depends on
            rack (>= 1.0)

    rails_admin was resolved to 0.8.1, which depends on
      sass-rails (< 6, >= 4.0) was resolved to 5.0.4, which depends on
        sprockets (< 4.0, >= 2.8) was resolved to 3.6.0, which depends on
          rack (< 3, > 1)

This bundler output shows a conflict where Rails 5 depends on rack 2.x while rails_admin's rack-pjax dependency depends on rack 1.x. I ended up resorting to a Google search which led me to the following issue in the rails_admin repo: https://github.com/sferik/rails_admin/issues/2532

Installing rack-pjax from GitHub:

gem 'rack-pjax', github: 'afcapel/rack-pjax', branch: 'master'

resolves the rack dependency conflict, and bundle install now completes without error. Things are looking up! At least until you try to run the rails g rails_admin:install generator and are presented with this mess:

/Users/patrick/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/actionpack-5.0.0.beta3/lib/action_dispatch/middleware/stack.rb:108:in `assert_index': No such middleware to insert after: ActionDispatch::ParamsParser (RuntimeError)
from /Users/patrick/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/actionpack-5.0.0.beta3/lib/action_dispatch/middleware/stack.rb:80:in `insert_after'

This error is more difficult to understand, especially given the fact that the culprit (the remotipart gem) is not actually mentioned anywhere in the error. Thankfully, commenters on the above-mentioned rails_admin issue #2532 were able to identify the remotipart gem as the source of this error and provide a link to a forked version of that gem which allows rails_admin:install to complete successfully (albeit with some functionality still not working).

In the end, my Gemfile looked something like this:

gem 'rails_admin', github: 'sferik/rails_admin'
# Use github rack-pjax to fix dependency versioning issue with Rails 5
# https://github.com/sferik/rails_admin/issues/2532
gem 'rack-pjax', github: 'afcapel/rack-pjax'
# Use forked remotipart until following issues are resolved
# https://github.com/JangoSteve/remotipart/issues/139
# https://github.com/sferik/rails_admin/issues/2532
gem 'remotipart', github: 'mshibuya/remotipart', ref: '3a6acb3'

A total of three unreleased versions of gems, including the forked remotipart gem that breaks some functionality, just to get rails_admin installed and up and running enough to start working with. And some technical debt in the form of comments about follow-up tasks to revisit the various gems as they have new versions released for Rails 5 compatibility.

This process has been a reminder that when working in a Rails 4 app it's easy to take for granted the ability to install gems and have them 'just work' in your application. When dealing with pre-release versions of Rails, don't be surprised when you have to do some investigative work to figure out why gems are failing to install or work as expected.

My experience has also underscored the importance of understanding all of your application's gem dependencies and having some awareness of their developers' intentions when it comes to keeping their gems current with new versions of Rails. As a developer it's in your best interest to minimize the amount of dependencies in your application, because adding just one gem (which turns out to have a dozen of its own dependencies) can greatly increase the potential for encountering incompatibilities.

Postgres concurrent indexes and the curse of IIT

Postgres has a wonderful feature called concurrent indexes. It allows you to create indexes on a table without blocking reads OR writes, which is quite a handy trick. There are a number of circumstances in which one might want to use concurrent indexes, the most common one being not blocking writes to production tables. There are a few other use cases as well, including:


[Image: Photograph by Nicholas A. Tonelli]

  • Replacing a corrupted index
  • Replacing a bloated index
  • Replacing an existing index (e.g. better column list)
  • Changing index parameters
  • Restoring a production dump as quickly as possible

In this article, I will focus on that last use case, restoring a database as quickly as possible. We recently upgraded a client from a very old version of Postgres to the current version (9.5 as of this writing). The fact that pg_upgrade was not an option should give you a clue as to just how old the "very old" version was!

Our strategy was to create a new 9.5 cluster, get it optimized for bulk loading, import the globals and schema, stop write connections to the old database, transfer the data from old to new, and bring the new one up for reading and writing.

The goal was to reduce the application downtime as much as reasonably possible. To that end, we did not want to wait until all the indexes were created before letting people back in, as testing showed that the index creations were the longest part of the process. We used the "--section" flags of pg_dump to create pre-data, data, and post-data sections. All of the index creation statements appeared in the post-data file.
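As a rough sketch (host and database names below are placeholders, not the client's actual setup), the dump and restore steps looked something like this:

# Dump cluster-wide objects (roles, tablespaces) from the old server
pg_dumpall -h oldhost --globals-only > globals.sql

# Dump the application database in three sections
pg_dump -h oldhost --section=pre-data  -f pre-data.sql  mydb
pg_dump -h oldhost --section=data      -f data.sql      mydb
pg_dump -h oldhost --section=post-data -f post-data.sql mydb

# Restore globals, schema, and data into the new 9.5 cluster;
# post-data.sql (indexes, constraints, triggers) is held back for later
psql -h newhost -f globals.sql postgres
psql -h newhost -f pre-data.sql mydb
psql -h newhost -f data.sql mydb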

Because the client determined that it was more important for the data to be available, and the tables writable, than it was for them to be fully indexed, we decided to try using CONCURRENT indexes. In this way, writes to the tables could happen at the same time that they were being indexed - and those writes could occur as soon as the table was populated. That was the theory anyway.

The migration went smoothly - the data was transferred over quickly, the database was restarted with a new postgresql.conf (e.g. turning fsync back on), and clients were able to connect, albeit with some queries running slower than normal. We parsed the post-data file and created a new file in which all the CREATE INDEX commands were changed to CREATE INDEX CONCURRENTLY. We kicked that off, but after a certain amount of time, it seemed to freeze up.
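For reference, that conversion of the post-data file can be as simple as a one-line substitution (a sketch, assuming the file names used in the dump example above):

sed -E 's/^CREATE (UNIQUE )?INDEX/CREATE \1INDEX CONCURRENTLY/' post-data.sql > post-data-concurrent.sql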


[Image: The frogurt is also cursed.]

A closer look showed that the CREATE INDEX CONCURRENTLY statement was waiting, and waiting, and never able to complete - because other transactions were not finishing. This is why concurrent indexing is both a blessing and a curse. The concurrent index creation is so polite that it never blocks writers, but this means processes can charge ahead and be none the wiser that the create index statement is waiting on them to finish their transactions. When you also have a misbehaving application that stays "idle in transaction", it's a recipe for confusion. (Idle in transaction is what happens when your application keeps a transaction open without issuing a COMMIT or ROLLBACK.) A concurrent index can only finish being created once every transaction that has referenced the table has completed. The problem was that because the create index did not block, the app kept chugging along, spawning new processes that all ended up idle in transaction.
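Those stuck sessions are easy to spot on the new cluster (a sketch using psql; the state column shown here exists in Postgres 9.2 and later):

psql -c "SELECT pid, usename, state, now() - xact_start AS xact_age, query
         FROM pg_stat_activity
         WHERE state = 'idle in transaction'
         ORDER BY xact_age DESC;"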

At that point, the only way to get the concurrent index creation to complete was to forcibly kill all the other idle-in-transaction processes, forcing them to roll back and causing a lot of distress for the application. In contrast, a regular index creation would have caused other processes to block on their first attempt to access the table, then carry on once the creation was complete, and nothing would have had to roll back.

Another business decision was made - the concurrent indexes were nice, but we needed the indexes, even if some had to be created as regular indexes. Many of the indexes completed (concurrently) very quickly - they were on not-very-busy tables - so we plowed through the index creation script, and simply canceled any concurrent index creations that were blocked for too long. This left only a handful of uncreated indexes, so we simply dropped the "invalid" indexes (these appear when a concurrent index creation is interrupted) and reran with regular CREATE INDEX statements.
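Finding those leftover invalid indexes is straightforward (a sketch; each one can then be dropped and rebuilt with a plain CREATE INDEX):

psql -c "SELECT indexrelid::regclass AS index_name
         FROM pg_index
         WHERE NOT indisvalid;"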

The lesson here is that nothing comes without a cost. The overly polite concurrent index creation is great at letting everyone else access the table, but it also means that large, complex transactions can chug along without being blocked - only to have all of their work rolled back if they must be killed. In this case, things worked out: we did 99% of the indexes as CONCURRENT and the remaining ones as regular. All in all, the use of concurrent indexes was a big win, and they are still an amazing feature of Postgres.

Cybergenetics Helps Free Innocent Man

We all love a good ending. I was happy to hear that one of End Point’s clients, Cybergenetics, was involved in a case this week to free a falsely imprisoned man, Darryl Pinkins.

Darryl was convicted of a crime in Indiana in 1991. In 1995 Pinkins sought the help of the Innocence Project. His attorney Frances Watson and her students turned to Cybergenetics and their DNA interpretation technology called TrueAllele® Casework. The TrueAllele DNA identification results exonerated Pinkins. The Indiana Court of Appeals dropped all charges against Pinkins earlier this week and he walked out of jail a free man after fighting for 24 years to clear his name.

TrueAllele can separate out the people who contributed their DNA to a mixed DNA evidence sample. It then compares the separated out DNA identification information to other reference or evidence samples to see if there is a DNA match.

End Point has worked with Cybergenetics since 2003 and consults with them on security, database infrastructure, and website hosting. We congratulate Cybergenetics on their success in being part of the happy ending for Darryl Pinkins and his family!

More of the story is available at Cybergenetics’ Newsroom or the Chicago Tribune.