Chrome, onmousemove, and MediaWiki JavaScript


Image by Flickr user Dennis Jarvis

tl;dr: avoid using onmousemove events with Google Chrome

I recently fielded a complaint about not being able to select text with the mouse on a wiki running the MediaWiki software. After some troubleshooting and research, I narrowed the problem down to a bug in the Chrome browser regarding the onmousemove event. The solution in this case was to tweak the JavaScript to use onmouseover instead of onmousemove.

The first step in troubleshooting is to duplicate the problem. In this case, the page worked fine for me in Firefox, so I tried using the same browser as the reporter: Chrome. Sure enough, I could no longer hold down the mouse button and select text on the page. Now that the browser was implicated, it was time to see what it was about this page that caused the problem.

It seemed fairly unlikely that something like this would go unfixed if it was happening on the flagship MediaWiki site, Wikipedia. Sure enough, that site worked fine: I could select the text with no problem. Testing some other random sites showed no problems either. Some googling indicated others had similar problems with Chrome and offered a bunch of workarounds for selecting the text. However, I wanted a fix, not a workaround.

There were hints that JavaScript was involved, so I disabled JavaScript in Chrome, reloaded the page, and suddenly everything started working again. Call that big clue number two. The next step was to see what was different between the local MediaWiki installation and Wikipedia. The local site was a few versions behind, but I was fortuitously testing an upgrade on a test server. This showed the problem still existed on the newer version, which meant that the problem was something specific to the wiki itself.

The most likely culprit was one of the many installed MediaWiki extensions, which are small pieces of code that perform certain actions on a wiki. These often run their own JavaScript, which made them the prime suspect.

Then it was time for some basic troubleshooting. After turning JavaScript back on, I edited the LocalSettings.php file and commented out all the user-supplied extensions. This made the problem disappear again. Then I commented out half the extensions, then half again, and so on, until I was able to narrow the problem down to a single extension.

The extension in question, known simply as "balloons", has actually been removed from the MediaWiki extensions site for "prolonged security issues with the code." The extension allows the creation of very nice looking pop-up CSS "balloons" full of text. I'm guessing the concern is that the arguments for the balloons were not sanitized properly. On a public wiki this would be a problem, but this was a private intranet, so we were not worried about continuing to use the extension. As a side note, such changes would be caught anyway, as this wiki sends an email to a few people on any change, including a full text diff of all the changes.

Looking inside the JavaScript used by the extension, I was able to narrow the problem down to a single line inside balloons/js/balloons.js:

  // Track the cursor every time the mouse moves
  document.onmousemove = this.setActiveCoordinates;

Sure enough, duck-duck-going through the Internet quickly found a fairly incriminating Chromium bug indicating that onmousemove did not work very well at all. Looking over the balloon extension code, it appeared that onmouseover would probably be good enough to gather the same information and allow the extension to work without blocking the ability to select text. One small replacement of "move" with "over", and everything was back to working as it should!
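
For reference, the changed line looks like this (the rest of the extension's JavaScript stays as it was):

  // Track the cursor when the mouse moves over an element
  document.onmouseover = this.setActiveCoordinates;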

So in summary, if you cannot select text with the mouse in Chrome (or you see any other odd mouse-related behaviors), suspect an onmousemove issue.

Piggybak: Upgrade to Rails 4.1.0

Piggybak and the gems available in the demo (piggybak_variants, piggybak_giftcerts, piggybak_coupons, piggybak_bundle_discounts, piggybak_taxonomy) have been updated to Rails 4.1.0 and Ruby 2.1.1 via Piggybak gem version 0.7.1. Interested in the technical details of the upgrade? Here are some fine points:

  • Dependencies were refactored so that the parent Rails app alone controls the Rails dependency. There was a bit of redundancy in the various plugin gemspec dependencies; this has been cleaned up so that the parent Rails app is now the canonical reference for the Rails version used in the application.
  • Asset usage was modified: "//= require piggybak/piggybak-application" must now be added to the main Rails application.js. There have been several observed issues with precompiling and asset behavior, so I simplified this by requiring that directive in the parent application for now. The engine file is supposed to offer a way around this, but it has not behaved as expected, specifically on unique deployment architectures (e.g. Heroku). Patches welcome to address this.
  • Tables migrated to namespaced tables, e.g. "orders" migrated to "piggybak_orders". This is how namespaced engine tables are supposed to look, and this upgrade fixes the table names with a migration and related code.
  • Handled strong parameters. This was one of the most significant jumps from Rails 3 to Rails 4. The main element of Piggybak that needed updating here was the orders controller, which receives the order parameters and must determine which parameters to handle. Any references to attr_accessible in the code were removed.
  • ActiveRecord "find" methods replaced with where and chaining, where applicable. Rails 4.0 deprecated the old finder methods but did not remove support for them; Rails 4.1 removed that support, so the remaining uses were replaced.
  • Scope syntax update. Rails 4 handles scopes with new syntax, and all default scope and named scopes were updated to reflect this new syntax.
  • Validates syntax updated. Rails 4 has new validates syntax which accepts arguments, e.g. presence: true, uniqueness: true. Piggybak was upgraded to use the new syntax, although the old syntax is still supported.
  • Significant routes update. Rails 4 introduced a significant change in routing, and Piggybak was updated to reflect these changes.

The full commits of Piggybak are available for browsing here and here.

Wishlist

There are a few things that I'd love to see adopted in Piggybak, with the help of the community. These include:

  • Consider a move to CoffeeScript. I'm still on the fence about this, but I'm seeing more projects with Node and CoffeeScript lately, so I wonder if it would be worth the overhead to move to CoffeeScript.
  • Add test coverage. Perhaps Travis CI integration would make sense, since it hooks into GitHub nicely?
  • Build out more features. Things like reviews & ratings, saved cart, wishlist support, and saved address support have been on the feature list for a while. It'd be nice to see movement here.

Firefox, Input (type=button), and Line-Height

I recently discovered a discrepancy between the way Firefox treats inputs with a line-height style defined and how other browsers handle the same styling rule. Specifically, Firefox completely ignores it.

This behavior seemed odd enough to me that I did some Googling to determine whether this was recently introduced, a long-standing issue, or something I was just doing wrong. I found some interesting discussions on the issue. Several of the search results used the word “bug” in the title, though it appears to be more of a deliberate (though possibly outdated) “feature”. Along with the discussions, I also came across a couple of suggestions for a solution.

First of all, I was able to locate a simple enough explanation of what’s causing the behavior. As Rob Glazebrook explains:

“Basically, Firefox is setting the line-height to ‘normal’ on buttons and is enforcing this decision with an !important declaration,” and “browser-defined !important rules cannot be over-ruled by author-defined !important rules. This rule cannot be overruled by a CSS file, an inline style - anything.”

The good news is that I can stop experimenting and hoping for different results.

I also located a Bugzilla ticket opened in 2011 which contains some discussion on the pros and cons of allowing designers to control the line-height of input elements. The last few comments suggest that Firefox 30 may remove the !important declaration which would open up access to the styling property. At the time of this writing, Firefox version 30 appears to be in alpha.

Due to this long-standing stance by Mozilla, Twitter Bootstrap recommends avoiding inputs with type set to button, submit, or reset, and using button tags with those types instead. Button tags are much more flexible in which styling rules can be applied to them, and therefore make it easier to get consistent rendering across the widest range of browsers.

If switching to button tags is not an option for whatever reason, another possible solution is to adjust the padding values of your input buttons. By shrinking the top padding, you can more easily fit text that needs to wrap because of a limited available width. Reducing the top padding also keeps the first line of text from sitting vertically dead center of the button, so the wrapped text as a whole ends up better centered.

SPF, DKIM and DMARC brief explanation and best practices

Spam has been a plague since the Internet became popular, and it has kept growing as the number of connected devices and people has grown. Despite the numerous attempts at creating anti-spam tools, there is still a fairly high number of unwanted messages sent every day.

Luckily, it seems that something has been changing lately with the adoption of three (relatively) new tools which are starting to be widely used: SPF, DKIM and DMARC. Let's have a quick look at each of these tools and what they achieve.

What are SPF, DKIM and DMARC

SPF (Sender Policy Framework) is a DNS TXT entry which lists the servers that should be considered allowed to send mail for a specific domain. Incidentally, the fact that SPF is a DNS entry can also be considered a way to enforce that the list is authoritative for the domain, since the owners/administrators are the only people allowed to add or change entries in that domain's zone.
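
For example, a minimal SPF record published as a TXT entry might look something like the following (the domain and addresses are placeholders):

  example.com.  IN  TXT  "v=spf1 a mx ip4:203.0.113.25 include:_spf.example.com -all"

where "-all" marks mail from any server not matching the listed mechanisms as a hard fail.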

DKIM (DomainKeys Identified Mail) should instead be considered a method to verify that a message's content is trustworthy, meaning that it wasn't changed from the moment it left the initial mail server. This additional layer of trust is achieved by an implementation of the standard public/private key signing process. Once again the owners of the domain add a DNS entry with the public DKIM key, which receivers will use to verify that the message's DKIM signature is correct, while on the sending side the server signs the entitled mail messages with the corresponding private key.
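
The published key typically looks something like the record below (the selector, domain and key material are placeholders, and real keys are much longer):

  selector1._domainkey.example.com.  IN  TXT  "v=DKIM1; k=rsa; p=MIGfMA0GCSqGSIb3..."

The sending server references the selector (here "selector1") in the DKIM-Signature header it adds, so receivers know which key to fetch.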

DMARC (Domain-based Message Authentication, Reporting and Conformance) builds on SPF and DKIM by stating a clear policy for how both of the aforementioned tools should be used, and it allows setting an address to which receivers can send reports about the mail message statistics they gather for the specific domain [1].
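
The DMARC policy is also published as a TXT record, on the _dmarc subdomain; a minimal example (again with placeholder values) could be:

  _dmarc.example.com.  IN  TXT  "v=DMARC1; p=none; rua=mailto:postmaster@example.com"

where "p" is the action receivers are asked to take on failure (none, quarantine or reject) and "rua" is the address that receives the aggregate reports mentioned above.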

How do they work?

All of these tools rely heavily on DNS and, luckily, once the setup phase is finished, the way they work is simple enough to be (roughly) explained below:

SPF:

  • upon receipt, the HELO message and the sender address are fetched by the receiving mail server
  • the receiving mail server runs a TXT DNS query against the claimed domain's SPF entry (a rough lookup sketch follows this list)
  • the SPF entry data is then used to verify the sending server
  • if the check fails, a rejection message is given to the sending server
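
As a rough illustration of the lookup step, here is a minimal Python sketch; it assumes the third-party dnspython package (older versions use dns.resolver.query() instead of resolve()) and a placeholder domain:

import dns.resolver

def get_spf_record(domain):
    "Return the SPF policy string published for a domain, or None."
    try:
        answers = dns.resolver.resolve(domain, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return None
    for rdata in answers:
        # TXT record data arrives as a tuple of byte strings
        txt = b"".join(rdata.strings).decode()
        if txt.startswith("v=spf1"):
            return txt
    return None

print(get_spf_record("example.com"))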



DKIM:

  • when sending an outgoing message, the last server within the domain infrastructure checks against its internal settings whether the domain used in the "From:" header is included in its "signing table"; if not, the process stops here
  • a new header, called "DKIM-Signature", is added to the mail message, generated by using the private part of the key on the message content (a small signing/verification sketch follows this list)
  • from here on the message *main* content cannot be modified, otherwise the DKIM header won't match anymore
  • upon reception, the receiving server makes a TXT DNS query to retrieve the public key referenced by the DKIM-Signature field
  • the result of the DKIM header check can then be used when deciding whether a message is fraudulent or trustworthy
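
To make the signing and verification steps more concrete, here is a minimal sketch using the third-party dkimpy package; the message, selector, domain and key file are placeholders, so this exact example won't verify, since no matching public key is published in DNS:

import dkim

message = (b"From: alice@example.com\r\n"
           b"To: bob@example.org\r\n"
           b"Subject: Hello\r\n"
           b"\r\n"
           b"Just testing DKIM.\r\n")

with open("private.key", "rb") as f:
    private_key = f.read()

# Sending side: generate the DKIM-Signature header and prepend it
signature = dkim.sign(message, b"selector1", b"example.com", private_key)
signed_message = signature + message

# Receiving side: fetch the public key from DNS and check the signature
print(dkim.verify(signed_message))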



DMARC:

  • upon reception, the receiving mail server checks whether there is an existing DMARC policy published for the domain used by the SPF and/or DKIM checks (see the small parsing sketch after this list)
  • if *one or both* of the SPF and DKIM checks succeed while still being *aligned* with the policy set by DMARC, then the check is considered successful; otherwise it is marked as failed
  • if the check fails, different actions are taken based on the action published in the DMARC policy
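
As a small sketch of the policy lookup, the record a receiver fetches from the _dmarc.<domain> TXT entry is just a list of tag=value pairs, which is easy to pull apart (the record below is a placeholder):

record = "v=DMARC1; p=quarantine; pct=100; rua=mailto:postmaster@example.com"

policy = dict(
    tag.strip().split("=", 1)
    for tag in record.split(";")
    if "=" in tag
)

print(policy["p"])    # the action requested when the aligned checks fail
print(policy["rua"])  # where aggregate reports should be sent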



The bad news: limits and best practices

Unfortunately, even with a perfectly functional mail system and all of the above tools enforced, you won't be 100% safe from the bad guys out there. Not all servers use all three of the tools shown above. It's enough to take a look at the comparison table on Wikipedia [2] to see how that's possible.

Furthermore there are some limits that you should always consider when dealing with SPF, DKIM and DMARC:

  • as already said above, DKIM alone doesn't in any way guarantee that the sending server is allowed to send outgoing mail for the specific domain
  • SPF is powerless against messages forged in a shared hosting scenario, as all the mail appears to come from the same IP address
  • DMARC is still in its early days and unfortunately not used widely enough to make a huge difference
  • DMARC can (and will) break your mail flow if you don't set up both SPF and DKIM before changing the DMARC policy to anything above "none"

Please work through the proper process carefully; otherwise your precious messages won't be delivered to your users, since a wrong SPF, DKIM or DMARC setup can make them look fraudulent.

What's the message behind all this? Should I use these tools or not?

The short answer is: "Yes". The longer answer is that everybody should, and eventually will, but we're just not there yet. So even though these tools already have a lot of power, they're still not shining as brightly as they should because of poor adoption.

Hopefully things will change soon, and that change starts with every one of us adopting these tools as soon as possible.

[1] The lack of such a reporting mechanism is considered one of the reasons why other tools (such as ADSP) failed during the adoption phase in the past.
[2] Comparison of mail servers on Wikipedia

jQuery Content Replacement with AJAX

This is not a huge breakthrough, but it colored in some gaps in my knowledge so I thought I would share. Let's say you have a product flypage for a widget that comes in several colors. Other than some of the descriptive text, and maybe a hidden field for use in ordering one color instead of another, all the pages look the same. So your page looks like this (extremely simplified):
... a lot of boilerplate ...
... a lot more boilerplate ...
Probably the page is generated into a template based on a parameter or path segment:
http://.../app/product/WDGT-001-RED
What we're going to add is a quick-and-dirty way of having your page rewrite itself on the fly with just the bits that change when you select a different version (or variant) of the same product.

The old-school approach was something like:
 $('select[name=sku]').change(function(){
   document.location.href = my_url + $(this).val();
 });
I.e., we'll just send the browser to re-display the page, but with the selected SKU in the URL instead of where we are now. Slow, clunky, and boring! Instead, let's take advantage of the ability to grab the page from the server and only freshen the parts that change for the desired SKU (warning: this is a bit hand-wavy, as your specifics will change up the code below quite a bit):
// This is subtly wrong:
$('select[name=sku]').change(function(){
  $.ajax({
    async: false,
    url: my_url + $(this).val(),
    complete: function(data){
      $('form#order_item').html( $(data.responseText).find('form#order_item').html() );
    }
  });
});
Why wrong? Well, any event handlers you may have installed (such as the .change() on our selector!) will fail to fire after the content is replaced, because the contents of the form don't have those handlers. You could set them up all over again, but there's a better way:
// This is better:
$('form#order_item').on('change', 'select[name=sku]', function(){
  $.ajax({
    async: false,
    url: my_url + $(this).val(),
    complete: function(data){
      var doc = $(data.responseText);
      var $form = $('form#order_item');
      var $clone = $form.clone( true );
      $clone.html(doc.find('form#order_item').html());
      $form.replaceWith($clone);
    }
  });
});
Using an "on" handler for the whole form, with a filter of just the select element we care about, works better – because when we clone the form, we copy its handler(s), too. There's room for improvement in this solution, because we're still fetching the entire product display page, even the bits that we're going to ignore, so we should look at changing the .ajax() call to reference something else – maybe a custom version of the page that only generates the form and leaves out all the boilerplate. This solution also leaves the browser's address showing the original product, not the one we selected, so a page refresh will be confusing. There are fixes for both of these, but that's for another day.

Speeding Up Saving Millions of ORM Objects in PostgreSQL

The Problem

Sometimes you need to generate sample data, like random data for tests. Sometimes you need to generate it using the huge amount of code you have in your ORM mappings, just because an architect decided that all the logic needs to live in the ORM and the database should be just a dummy data container. The real reason is not important; the problem is: let's generate lots of rows, millions of them, for a sample table from ORM mappings.

Sometimes the data is read from a file, but due to the business logic kept in the ORM, you need to load the data from the file into the ORM and then save the millions of ORM objects to the database.

This can be done in many different ways, but here I will concentrate on making that as fast as possible.

I will use PostgreSQL and SQLAlchemy (with psycopg2) for the ORM, so all the code will be implemented in Python. I will create a couple of functions, each implementing a different solution for saving the data to the database, and I will test them using 10k and 100k generated ORM objects.

Sample Table

The table I used is quite simple, just a simplified blog post:

CREATE TABLE posts (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  body TEXT NOT NULL,
  payload TEXT NOT NULL
);

SQLAlchemy Mapping

I'm using SQLAlchemy for the ORM, so I need a mapping. I will use this simple one:
from sqlalchemy import Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class BlogPost(Base):
    __tablename__ = "posts"

    id = Column(Integer, primary_key=True)
    title = Column(Text)
    body = Column(Text)
    payload = Column(Text)

The payload field is just to make the object bigger, to simulate real life where objects can be much more complicated, and thus slower to save to the database.

Generating Random Object

The main idea of this test is to save randomly generated objects; however, what I'm really checking is the database speed, and all the randomness happens on the client side, so having a truly random object doesn't really matter here. The overhead of a fully random generator would be the same regardless of the method used to save the data to the database. So instead of randomly generating the object, I will use a static one, with static data, and I will use the function below:

TITLE   = "title"      * 1764
BODY    = "body"       * 1764
PAYLOAD = "dummy data" * 1764

def generate_random_post():
    "Generates a kind of random blog post"
    return BlogPost(title=TITLE, body=BODY, payload=PAYLOAD)

Solution Ideas

Generally there are two main ideas for such a bulk inserting of multiple ORM objects:

  • Insert them one by one, committing after each one
  • Insert them one by one, all in a single transaction

Save One By One

This is the simplest way. Usually we don't save just one object, but rather save many different objects in one transaction; making a couple of related changes in multiple transactions is a great way to end up with a database full of bad data.

For generating millions of unrelated objects this shouldn't cause data inconsistency, but it is highly inefficient. I've seen this multiple times in code: create an object, save it to the database, commit, create another object, and so on. It works, but it is quite slow. Sometimes it is fast enough, but for the cost of a very simple change to this algorithm we can make it 10 times faster.

I’ve implemented this algorithm in the function below:

def save_objects_one_by_one(count=MAX_COUNT):
    for i in xrange(1, count + 1):
        post = generate_random_post()
        session.add(post)
        session.commit()

Save All in One Transaction

This solution is as simple as it sounds: create the objects, add them to the session, and commit the transaction once at the end, so everything happens in one huge transaction.

The implementation differs from the previous one only by four spaces: commit() is run just once, after adding all the objects:

def save_objects_one_transaction(count=MAX_COUNT):
    for i in xrange(1, count + 1):
        post = generate_random_post()
        session.add(post)
    session.commit()

Time difference

I ran the tests multiple times, truncating the table each time. The average results of saving 10k objects were quite predictable:

  • Multiple transactions - 268 seconds
  • One transaction - 25 seconds

The difference is not surprising: the whole table is only 4.8 MB, but after each transaction the database needs to write the changes to disk, which slows the procedure down a lot.

Copy

So far I've described the most common methods of generating and storing many ORM objects. I was wondering about another one, which may seem a little surprising at first.

PostgreSQL has a great COPY command which can copy data between a table and a file. The file format is simple: one table row per file line, fields delimited with a defined delimiter, and so on. It can be a normal CSV or TSV file.

My crazy idea was: how about using COPY to load all the generated ORM objects? To do that, I need to serialize them to a text representation and write them out as one text stream. So I created a simple function which does that. This function lives outside the BlogPost class, so I don't need to change the data model.

def serialize_post_to_out_stream(post, out):
    import csv
    writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    writer.writerow([post.title, post.body, post.payload])

The function above gets two parameters:

  • post - the object to be serialized
  • out - the output stream where the row for the post object will be written; in Python this is a file-like object, i.e. an object with all the methods a file object has

Here I use a standard csv module, which supports reading and writing csv files. I really don’t want to write my own function for escaping all the possible forms of data I could have - this usually leads to many tricky bugs.

The only thing left is to use the COPY command. I don’t want to create a file with data and load that later; the generated data can be really huge, and creating temporary files can just slow things down. I want to keep the whole procedure in Python, and use pipes for data loading.

I will use the psql program for accessing the PostgreSQL database. Psql has a different command called \COPY, which can read the csv file from psql's standard input. This can be done using e.g.: cat file.csv | psql database.

To use it in Python, I’m going to use the subprocess module, and create a psql process with stdin=subprocess.PIPE which will give me write access to the pipe psql reads from. The function I’ve implemented is:

def save_objects_using_copy(count=MAX_COUNT):
    import subprocess
    p = subprocess.Popen([
        'psql', 'pgtest', '-U', 'pgtest',
        '-c', '\COPY posts(title, body, payload) FROM STDIN',
        '--set=ON_ERROR_STOP=true'
        ], stdin=subprocess.PIPE
    )
    for i in xrange(1, count + 1):
        post = generate_random_post()
        serialize_post_to_out_stream(post, p.stdin)
    p.stdin.close()
    p.wait()   # wait for psql to finish loading the data

Results

I also tested this on the same database table, truncating it before each run. After that I checked this function, and the previous one (with one transaction), on a bigger sample of 100k BlogPost objects.

The results are:

Sample size    Multiple Transactions    One Transaction    COPY
10k            268 s                    25 s               5 s
100k           not tested               262 s              51 s

I haven't tested the multiple transactions version on the 100k sample, as I just didn't want to wait multiple hours for it to finish (I run each test multiple times to get more reliable results).

As you can see, the COPY version is the fastest, even 5 times faster than the full ORM version with one huge transaction. This version is also memory friendly: no matter how many objects you want to generate, it only ever needs to keep one ORM object in memory, which you can discard after serializing it.

The Drawbacks

Of course using psql poses a couple of problems:

  • you need to have psql available; sometimes that's not an option
  • calling psql creates another connection to the database; sometimes that could be a problem
  • you need to set up a password in the ~/.pgpass file, as you cannot provide it on the command line

You could also get the psycopg2 cursor directly from the SQLAlchemy connection and then use its copy_from() function, but that method needs to have all the data already prepared in memory, as it reads from a file-like object such as StringIO. This is not a good solution for inserting millions of objects, as they can be quite huge; streaming is much better in this case.
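
For completeness, here is a rough sketch of that in-memory variant; it reuses the session, generate_random_post and serialize_post_to_out_stream defined earlier, and keeps the whole serialized buffer in memory before loading it:

def save_objects_using_copy_from(count=MAX_COUNT):
    from cStringIO import StringIO
    buf = StringIO()
    for i in xrange(1, count + 1):
        serialize_post_to_out_stream(generate_random_post(), buf)
    buf.seek(0)

    # Get the raw psycopg2 connection hiding behind the SQLAlchemy session
    connection = session.connection().connection
    cursor = connection.cursor()
    cursor.copy_from(buf, 'posts', sep='\t',
                     columns=('title', 'body', 'payload'))
    connection.commit()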

Another solution is to write a generator that behaves like a file-like object, which the copy_from() method can read from directly. copy_from() calls the file's read() method, trying to read 8192 bytes per call. This can be a good idea when you don't have access to psql; however, due to the overhead of generating the 8192-byte strings, it should be slower than the psql version.
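
A rough sketch of such a wrapper, assuming (as described above) that read(size) is the only method copy_from() needs:

class RowStream(object):
    "Wraps a generator of serialized rows in a minimal file-like interface."
    def __init__(self, rows):
        self.rows = iter(rows)   # e.g. a generator yielding "title\tbody\tpayload\n"
        self.buffer = ""

    def read(self, size=8192):
        # Refill the buffer from the generator until we can satisfy the request
        while len(self.buffer) < size:
            try:
                self.buffer += next(self.rows)
            except StopIteration:
                break
        chunk, self.buffer = self.buffer[:size], self.buffer[size:]
        return chunk

# Hypothetical usage with the cursor from the previous sketch:
#   rows = ("\t".join([p.title, p.body, p.payload]) + "\n"
#           for p in (generate_random_post() for _ in xrange(MAX_COUNT)))
#   cursor.copy_from(RowStream(rows), 'posts', columns=('title', 'body', 'payload'))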

Sanity, thy name is not MySQL

Probably old news, but I hit this MySQL oddity today after a long day dealing with unrelated crazy stuff and it just made me go cross-eyed:
CREATE TABLE foo (id integer, val enum('','1'));
INSERT INTO foo VALUES (1, '');
INSERT INTO foo VALUES (2, '1');
SELECT * FROM foo WHERE val = 1;
Which row do you get? I'll wait while you second- and third-guess yourself. It turns out that the "enum" datatype in MySQL just translates each value to an integer index, starting at 1. In our case, that means:
  • '' == 1
  • '1' == 2
So you get the row with (1,''). Now, if that doesn't confuse readers of your code, I don't know what will.
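
If what you actually wanted was the row whose value is the string '1', quote it so MySQL compares against the enum's string value rather than its 1-based index:

SELECT * FROM foo WHERE val = '1';  -- matches the enum value: returns (2, '1')
SELECT * FROM foo WHERE val = 1;    -- matches the enum index: returns (1, '')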