The Real Cost of Data Roundtrip

Sometimes you need to perform a heavy operation on your data. I don't know why, but programmers are often afraid of using the database for that. They usually have some fancy ORM which performs all the operations, and the only way they change data is to SELECT * from a table, create a bunch of unneeded objects, change one field, convert the changed objects back into queries, and send those to the database.

Have you ever thought about the cost of that data roundtrip? The cost of getting all the data out of the database just to send the changed data back in? Why do that when there is a much faster way of achieving the same result?

Imagine that you have quite a heavy operation, something programmers usually assume a database cannot do. Many programmers just don't know that there is any other way than writing it in application code. Let's change all the HTML entities in a text column into real characters.

HTML entities are a way of writing many different characters in HTML. This way you can write, for instance, the Euro currency sign "€" even if you don't have it on your keyboard: you just write &euro; or &#8364; instead. I don't have to, as I use UTF-8 encoding and can write this character directly, and it is shown normally. What's more, I have this character on my keyboard.

I will convert the text stored in the database, changing all the HTML entities into real Unicode characters. I will do it using three different methods:

  • The first will be a simple query run inside PostgreSQL.
  • The second will be an external program which downloads the text column from the database, changes it externally, and loads it back into the database.
  • The third will be almost the same as the second, but it will download and resend whole rows.

Generate Data

So, for this test I need to have some data. Let's write a simple data generator.

First, a simple function for returning a random number within the given range.

CREATE FUNCTION random(INTEGER, INTEGER) RETURNS INTEGER AS $$
  SELECT floor ( $1 + ($2 - $1 + 1 ) * random())::INTEGER;
$$ LANGUAGE SQL;
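
A quick sanity check, since this overloads the built-in random() with an integer-range version (the output will of course vary between runs):

SELECT random(1, 6) FROM generate_series(1, 5);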

Now, the function for generating random texts of random length, filled with HTML entities.

CREATE FUNCTION generate_random_string() RETURNS TEXT AS $$
DECLARE
  items TEXT[] =
    ARRAY[
      'AAAA','BBBB','CCCC','DDDD','EEEE','FFFF','GGGG',
      'HHHH','IIII','JJJJ','KKKK','LLLL','MMMM','NNNN',
      'OOOO','PPPP','QQQQ','RRRR','SSSS','TTTT','UUUU',
      'VVVV','WWWW','XXXX','YYYY','ZZZZ',
      '&amp;', '&quot;', '&#39;', '&amp;', '&lt;', '&gt;',
      '&cent;','&pound;','&curren;','&yen;','&brvbar;','&sect;',
      '&uml;','&copy;','&ordf;','&laquo;','&not;','&shy;',
      '&reg;','&macr;','&deg;','&plusmn;','&sup2;','&sup3;',
      '&acute;','&micro;','&para;','&middot;','&cedil;','&sup1;',
      '&ordm;','&raquo;','&frac14;','&frac12;','&frac34;'
    ];
  length INTEGER := random(500, 1500);
  result TEXT := '';
  items_length INTEGER := array_length(items, 1);
BEGIN
  FOR x IN 1..length LOOP
    result := result || items[ random(1, items_length) ];
  END LOOP;

  RETURN result;
END;
$$ LANGUAGE PLPGSQL;
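
Let's take a quick look at what it produces. Every call returns a fresh string, so your output will differ:

SELECT length(generate_random_string()) AS len,
       substr(generate_random_string(), 1, 30) AS sample;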

The table for storing the data is created with the following query:

CREATE TABLE data (
    id SERIAL PRIMARY KEY, 
    payload TEXT, 
    t TEXT
);

Then I filled this table using a query generating 50k rows with random data:

INSERT INTO data(payload, t) 
SELECT 
    generate_random_string(), 
    generate_random_string() 
FROM 
    generate_series(1, 50*1000);

Let's check the table size:

SELECT pg_size_pretty(pg_relation_size('data'));
 pg_size_pretty 
----------------
 207 MB
(1 row)

As the table is filled with random data, I need three tables with exactly the same data, one for each method.

CREATE TABLE query (id SERIAL PRIMARY KEY, payload TEXT, t TEXT);
CREATE TABLE script (id SERIAL PRIMARY KEY, payload TEXT, t TEXT);
CREATE TABLE script_full (id SERIAL PRIMARY KEY, payload TEXT, t TEXT);

INSERT INTO query SELECT * FROM data;
INSERT INTO script SELECT * FROM data;
INSERT INTO script_full SELECT * FROM data;

The Tests

SQL

Many programmers think that such operations are not normally available inside a database. However, PostgreSQL has quite a nice feature: it can execute functions written in many different languages. For the purpose of this test I will use the pl/perlu language, which allows me to use external libraries. I will use the HTML::Entities package for the conversion.
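
One note if you want to replicate this: pl/perlu is an untrusted language and has to be installed in the database first. Assuming PostgreSQL was built with Perl support, something like this should do (the exact command depends on your PostgreSQL version):

CREATE EXTENSION plperlu;    -- PostgreSQL 9.1 and later
-- CREATE LANGUAGE plperlu;  -- older versions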

The function I wrote is quite simple:

CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
    use HTML::Entities;
    return decode_entities($_[0]);
$$ LANGUAGE plperlu;
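
A quick check that it behaves as expected (the test string is just an example I made up):

SELECT decode_html_entities('&lt;b&gt;Tom &amp; Jerry&lt;/b&gt; &euro;');
-- returns: <b>Tom & Jerry</b> €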

The update of the data can be done using the following query:

UPDATE query SET t = decode_html_entities(t);

Application

In order to make the tests comparable, I will write a simple Perl script using exactly the same package for converting HTML entities.

#!/usr/bin/env perl

use strict;
use warnings;
use DBI;
use HTML::Entities;
use Encode;

my $dbh = DBI->connect(
    "DBI:Pg:dbname=test;host=localhost",
    "szymon",
    "",
    {'RaiseError' => 1, 'pg_enable_utf8' => 1});

$dbh->do('BEGIN');

my $upd = $dbh->prepare("UPDATE script SET t = ? WHERE id = ?");

# Download only the columns we need: the key and the text to change.
my $sth = $dbh->prepare("SELECT id, t FROM script");
$sth->execute();

while(my $row = $sth->fetchrow_hashref()) {
    my $t = decode_entities( $row->{'t'} );
    $t = encode("UTF-8", $t);
    $upd->execute( $t, $row->{'id'} );
}

$dbh->do('COMMIT');
$dbh->disconnect();

The Worst Application

There is another terrible idea implemented by programmers far too often: why select only the column you want to change? Let's select whole rows and send them all back to the database.

This script looks almost the same; the important changes are in the SELECT statement, which now downloads the payload column too, and in the UPDATE statement, which sends it back unchanged:

#!/usr/bin/env perl

use strict;
use warnings;
use DBI;
use HTML::Entities;
use Encode;

my $dbh = DBI->connect(
    "DBI:Pg:dbname=test;host=localhost",
    "szymon",
    "",
    {'RaiseError' => 1, 'pg_enable_utf8' => 1});

$dbh->do('BEGIN');

# Resend the unchanged payload column along with the changed text.
my $upd = $dbh->prepare("UPDATE script_full SET t = ?, payload = ? WHERE id = ?");

# Download whole rows, not just the column we want to change.
my $sth = $dbh->prepare("SELECT id, payload, t FROM script_full");
$sth->execute();

while(my $row = $sth->fetchrow_hashref()) {
    my $t = decode_entities( $row->{'t'} );
    $t = encode("UTF-8", $t);
    $upd->execute( $t, $row->{'payload'}, $row->{'id'} );
}

$dbh->do('COMMIT');
$dbh->disconnect();

Results

The query using the pl/perlu function executed in 26 seconds.

The script changing the data externally executed in 2 minutes 10 seconds (5 times slower).

The worst script, getting and resending whole rows, finished in 4 minutes 35 seconds (10 times slower).
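
If you want to reproduce the timings: one simple way, though not necessarily exactly what I used, is psql's \timing for the query, and the shell's time utility for the scripts.

\timing
UPDATE query SET t = decode_html_entities(t);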

I used quite a small data set here: just 50k rows (about 200 MB). On production servers the numbers are much bigger.

Just imagine that the code you developed for changing data could run 10 times faster if you did this in the database.

6 comments:

vdp said...

Nice clear numbers and I've often lobbied for this myself.

One counter-argument I've often heard, and that is hard to counter (it may sometimes be valid, and it is hard to measure), is that you end up using the db server's CPU when it would be more efficient to let the app server do that work. Especially since app servers tend to scale horizontally but db servers vertically.

Not a definitive counter-argument, just a YMMV warning.

Szymon Guz said...

This is a very good argument. The best way is always to check the real cost of performing such database operations and calculate what's cheaper or more business friendly.

It is always good to know that there is another, possibly better, way of changing database data than the pure ORM solution.

Jeff Boes said...

There's another argument against "SELECT *" in embedded Perl code. (Admittedly, this one gets solved if you use an interface like Rose, but if you are scripting down at the DBI level, this can be a pitfall.)

Using "SELECT *" means your code is vulnerable to column name-changes with very little warning. For instance,

SELECT * FROM emp

and code that references $row->{fname} will break (sometimes silently) when column "fname" becomes "first_name". (Yes, this is a somewhat contrived example, but it happens.)

Much better to find out about it if your embedded SQL code is:

SELECT fname, lname FROM emp

then at least you get a run-time error that provides far more help.

Robert Young said...

As to why coders instinctively reach for java/etc.? Simple. Ever since programming became synonymous with "the web", coders have reveled in the reactionary paradigm of COBOL/VSAM of their granddaddies: dumb data, smart code. Few, if any, have taken a serious database course while in school. Lots of languages, of course, but little or nothing of databases. I don't count xml/NoSql as databases.

As to PG's support for many languages, not so much as one might think. Most (all?) that aren't C are themselves implemented in C, so the engine actually implements the underlying C. Not saying this is a bad thing, only that non-C languages aren't natively supported.

Should we ever get to the point that 5NF databases are the norm rather than the exception, then the argument that the DB is just a "file store" and data management *belongs* in the client (as it did/does in the COBOL/VSAM world) will fade.

Szymon Guz said...

@Jeff you're absolutely right, I just wrote that as a simple example. However, I think it is best to provide good examples, so I will fix that in a moment.

Joshua Tolley said...

@Robert Young, non-C languages are supported in much the same way SQL itself is -- it's all interpreted somewhere inside the server. For most of those languages there will be some extra overhead of converting the data from types PostgreSQL understands to types the embedded interpreter understands, and depending on the task at hand, that can be significant. But the same sort of conversion is necessary to get the data ready to send across a network, and then again in application code.