Speeding Up Saving Millions of ORM Objects in PostgreSQL

The Problem

Sometimes you need to generate sample data, like random data for tests. Sometimes you need to generate it using the huge amount of code you have in your ORM mappings, just because an architect decided that all the logic must live in the ORM and the database should be nothing more than a dumb data container. The real reason is not important - the problem is: let's generate lots of rows, millions of them, for a sample table using the ORM mappings.

Sometimes the data is read from a file, but because the business logic is kept in the ORM, you need to load the data from the file into ORM objects and then save those millions of objects to the database.

This can be done in many different ways, but here I will concentrate on making that as fast as possible.

I will use PostgreSQL and SQLAlchemy (with psycopg2) as the ORM, so all the code will be implemented in Python. I will create a couple of functions, each implementing a different way of saving the data to the database, and I will test them with 10k and 100k generated ORM objects.

Sample Table

The table I used is quite simple, just a simplified blog post:

CREATE TABLE posts (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  body TEXT NOT NULL,
  payload TEXT NOT NULL
);

SQLAlchemy Mapping

I'm using SQLAlchemy as the ORM, so I need a mapping. I will use this simple one:
class BlogPost(Base):
    __tablename__ = "posts"

    id = Column(Integer, primary_key=True)
    title = Column(Text)
    body = Column(Text)
    payload = Column(Text)
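
The post doesn't show the SQLAlchemy boilerplate around this mapping, so here is a minimal sketch of the setup I assume exists before the class definition (the connection string and the MAX_COUNT value are my assumptions; 'pgtest' matches the database and user passed to psql later):

from sqlalchemy import create_engine, Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

# Assumed connection details - adjust host/password as needed.
engine = create_engine("postgresql+psycopg2://pgtest@localhost/pgtest")
Session = sessionmaker(bind=engine)
session = Session()

# Assumed default sample size used by the functions below.
MAX_COUNT = 10000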

The payload field is just to make the object bigger, to simulate real life where objects can be much more complicated, and thus slower to save to the database.

Generating a Random Object

The main idea for this test is to have a randomly generated object. However, what I really measure is the database speed, and all the randomness happens on the client side, so whether the object is truly random doesn't matter here. The overhead of a fully random generator would be the same regardless of the method used to save the data to the database. So instead of generating random objects, I will use a static one, with static data, created by the function below:

TITLE   = "title"      * 1764
BODY    = "body"       * 1764
PAYLOAD = "dummy data" * 1764

def generate_random_post():
    "Generates a kind of random blog post"
    return BlogPost(title=TITLE, body=BODY, payload=PAYLOAD)

Solution Ideas

Generally there are two common approaches to bulk inserting multiple ORM objects:

  • Insert them one-by-one with autocommit
  • Insert them one-by-one in one transaction

Save One By One

This is the simplest way. Usually we don't save just one object per transaction; we save many related objects together, and spreading a set of related changes across multiple transactions is a great way to end up with bad data in the database.

For generating millions of unrelated objects this shouldn't cause data inconsistency, but it is highly inefficient. I've seen this pattern multiple times in code: create an object, save it to the database, commit, create another object and so on. It works, but it is quite slow. Sometimes it is fast enough, but at the cost of a very simple change to this algorithm we can make it 10 times faster.

I’ve implemented this algorithm in the function below:

def save_objects_one_by_one(count=MAX_COUNT):
    for i in xrange(1, count + 1):
        post = generate_random_post()
        session.add(post)
        session.commit()

Save All in One Transaction

This solution is as simple as it sounds: create the objects, add them all, and commit the transaction at the end, so everything is done in one huge transaction.

The implementation differs from the previous one only by four spaces: commit() is run once, after adding all the objects:

def save_objects_one_transaction(count=MAX_COUNT):
    for i in xrange(1, count + 1):
        post = generate_random_post()
        session.add(post)
    session.commit()

Time Difference

I ran the tests multiple times, truncating the table each time. The average results of saving 10k objects were quite predictable:

  • Multiple transactions - 268 seconds
  • One transaction - 25 seconds

The difference is not surprising; the whole table is only 4.8 MB, but after each transaction the database needs to write the changes to disk, which slows the procedure down a lot.
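
For reference, the timings were gathered with a small harness along these lines (a minimal sketch - the post doesn't show the exact code, and the TRUNCATE and timing calls are my assumptions):

import time

def benchmark(fn, count, runs=3):
    "Runs fn(count) a few times on an empty table and returns the average time."
    times = []
    for _ in xrange(runs):
        # Start each run from an empty table.
        session.execute("TRUNCATE TABLE posts")
        session.commit()
        start = time.time()
        fn(count)
        times.append(time.time() - start)
    return sum(times) / len(times)

print benchmark(save_objects_one_by_one, 10000)
print benchmark(save_objects_one_transaction, 10000)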

Copy

So far I’ve described the most common methods of generating and storing many ORM objects. I was wondering about another one, which may seem a little surprising at first.

PostgreSQL has a great COPY command which can copy data between a table and a file. The file format is simple: one table row per file row, with fields separated by a chosen delimiter. It can be a normal csv or tsv file.

My crazy idea was: how about using COPY to load all the generated ORM objects? To do that, I need to serialize them to a text representation and produce a stream containing all of them. So I created a simple function which does exactly that. It lives outside the BlogPost class, so I don't need to change the data model.

def serialize_post_to_out_stream(post, out):
    import csv
    writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    writer.writerow([post.title, post.body, post.payload])
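
For example, serializing a single post to standard output (which stands in here for the pipe used later) produces one tab-separated line:

import sys

serialize_post_to_out_stream(generate_random_post(), sys.stdout)
# prints one line: titletitle...<TAB>bodybody...<TAB>dummy datadummy data...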

The function above gets two parameters:

  • post - the object to be serialized
  • out - the output stream the serialized row will be written to; in Python this is a file-like object, i.e. an object with the same methods a file object has

Here I use the standard csv module, which supports reading and writing csv files. I really don’t want to write my own function for escaping all the possible forms of data I could have - that usually leads to many tricky bugs.

The only thing left is to use the COPY command. I don’t want to create a file with data and load that later; the generated data can be really huge, and creating temporary files can just slow things down. I want to keep the whole procedure in Python, and use pipes for data loading.

I will use the psql program to access the PostgreSQL database. Psql has its own \COPY command, which can read the csv data from psql's standard input. This can be done with e.g.: cat file.csv | psql database.

To use it in Python, I’m going to use the subprocess module, and create a psql process with stdin=subprocess.PIPE which will give me write access to the pipe psql reads from. The function I’ve implemented is:

def save_objects_using_copy(count=MAX_COUNT):
    import subprocess
    p = subprocess.Popen([
        'psql', 'pgtest', '-U', 'pgtest',
        '-c', '\COPY posts(title, body, payload) FROM STDIN',
        '--set=ON_ERROR_STOP=true'
        ], stdin=subprocess.PIPE
    )
    for i in xrange(1, count + 1):
        post = generate_random_post()
        serialize_post_to_out_stream(post, p.stdin)
    p.stdin.close()
    # Wait for psql to finish loading everything it has read from the pipe.
    p.wait()

Results

I tested this on the same database table, truncating it before each run. After that I also checked this function and the previous one (with one transaction) on a bigger sample - 100k BlogPost objects.

The results are:

Sample size   Multiple Transactions   One Transaction   COPY
10k           268 s                   25 s              5 s
100k          -                       262 s             51 s

I haven’t tested the multiple-transactions version on the 100k sample, as I just didn’t want to wait several hours for it to finish (I ran each of the tests multiple times to get more reliable results).

As you can see, the COPY version is the fastest - about 5 times faster than the full ORM version with one huge transaction. This version is also memory friendly: no matter how many objects you want to generate, it only ever needs to keep one ORM object in memory, which can be discarded as soon as it has been written to the pipe.

The Drawbacks

Of course using psql poses a couple of problems:

  • you need to have psql available; sometimes that’s not an option
  • calling psql creates another connection to the database; sometimes that could be a problem
  • you need to set up a password in ~/.psql file; you cannot provide it in the command line

You could also get the psycopg2 cursor directly from the SQLAlchemy connection, and then use its copy_from() function, but this approach needs all the data prepared in memory up front, as it reads from a file-like object such as StringIO. This is not a good solution for inserting millions of objects, as the data can be quite huge - streaming is much better in this case.
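
For completeness, a minimal sketch of that in-memory variant, reusing the serializer from above (the engine object is the one from my setup sketch, and the function name is mine):

from StringIO import StringIO

def save_objects_copy_from_stringio(count=10000):
    buf = StringIO()
    for i in xrange(1, count + 1):
        serialize_post_to_out_stream(generate_random_post(), buf)
    buf.seek(0)
    # Borrow a raw psycopg2 connection from the SQLAlchemy engine.
    conn = engine.raw_connection()
    cursor = conn.cursor()
    cursor.copy_from(buf, 'posts', sep='\t',
                     columns=('title', 'body', 'payload'))
    conn.commit()
    conn.close()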

Another solution is to write a file-like object that generates the data lazily, so that copy_from() can read from it directly. copy_from() calls the object's read() method, asking for 8192 bytes per call. This can be a good idea when you don't have access to psql; however, due to the overhead of assembling those 8192-byte strings in Python, it should be slower than the psql version.
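
A rough sketch of such a wrapper, under the same assumptions as above (the class name and buffering details are mine; psycopg2 also checks for a readline() method, so the sketch provides one):

from StringIO import StringIO

class PostStream(object):
    "File-like object that serializes posts lazily, on demand."

    def __init__(self, count):
        self.remaining = count
        self.buffer = ""

    def _generate_one(self):
        out = StringIO()
        serialize_post_to_out_stream(generate_random_post(), out)
        self.buffer += out.getvalue()
        self.remaining -= 1

    def read(self, size=8192):
        # Called by copy_from() in 8192-byte chunks.
        while len(self.buffer) < size and self.remaining > 0:
            self._generate_one()
        chunk, self.buffer = self.buffer[:size], self.buffer[size:]
        return chunk

    def readline(self):
        # Rarely used by copy_from(), but required to exist; returns one row.
        while "\n" not in self.buffer and self.remaining > 0:
            self._generate_one()
        line, sep, rest = self.buffer.partition("\n")
        self.buffer = rest
        return line + sep

def save_objects_streaming_copy_from(count=10000):
    conn = engine.raw_connection()
    cursor = conn.cursor()
    cursor.copy_from(PostStream(count), 'posts', sep='\t',
                     columns=('title', 'body', 'payload'))
    conn.commit()
    conn.close()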

9 comments:

Bartosz Radaczyński said...

Passwords can be provided by setting environment variables PGUSER and PGPASSWORD. Arguably that is better than the ~/.psql file (http://www.postgresql.org/docs/9.1/static/libpq-envars.html)

Andreas Kostyrka said...

Well, you've forgotten two things:

1.) psycopg2 has support for COPY.
2.) If you have access to the server file system, you can write out the data into a temporary file, and use a COPY command with a filename.

Jon Jensen said...

I think Szymon meant ~/.pgpass for the password file.

Arguably, there's no significant difference in security between environment variables and ~/.pgpass because Postgres requires that ~/.pgpass be readable only by the owner. Any user that can read that can also read (e.g. on Linux) /proc/$pid/environ and see your environment variables.

mike bayer said...

I can give you a trick that will probably get that 25s down to more like 10 or less: When you make your random objects, pre-assign the primary key integer value as well. The SQLAlchemy ORM will then insert the records using a single executemany() call to psycopg2, rather than calling execute() for individual INSERT statements, as it no longer needs to fetch newly generated SERIAL values. See http://stackoverflow.com/questions/11769366/why-is-sqlalchemy-insert-with-sqlite-25-times-slower-than-using-sqlite3-directly/11769768#11769768 for a detailed discussion/demo of this.

veidelis said...

A great post!

Szymon Guz said...

@Bartosz: Yep, however this provides additional complexity to the project, and I wanted to have it as simple as possible.

This way I will need to store the passwords in a couple of places after deploying the application, as the application will need to have its own config with passwords, and there will be another copy in ~/.pgpass.

Setting the env vars is not better: you need a script for starting the app which sets the proper variables first. I'd rather have all of them in ~/.pgpass, which allows setting different passwords for different users, databases and servers at the same time.

What's more, ~/.pgpass is not universal: e.g. the PostgreSQL JDBC driver doesn't use it, nor the env variables, and you need to provide a proper connection string.

Szymon Guz said...

@Andreas: I have written about psycopg2's support for COPY (see the last two paragraphs).

Saving the csv/tsv to a file is not an option when working with a really huge number of rows. The file can be quite big, and the overhead of writing the file, then reading it and writing the data again, is just too big (yes, I've checked that a couple of times).

Using pipes between processes is usually much faster, as it uses much less memory, and, ideally, no disk at all.

Szymon Guz said...

@Jon: I read through some documentation, and it looks like sometimes ~/.pgpass may not be accessible. Heroku doesn't give you access to it, and you need to use just the env variables. The PostgreSQL JDBC driver understands only JDBC connection strings; of course these can be built from env variables, but usually they are not.

Szymon Guz said...

@Mike: Thanks for your comment.

I'm not sure this will speed things up. Currently a query like this is generated:

` INSERT INTO posts (title, body, payload) VALUES ('a', 'b', 'c') RETURNING posts.id`

This lets the database assign the id for the inserted row, and then return it to SQLAlchemy.

What you propose is to generate the id in the application. What about another application running at the same time, which wants to insert data into the same table? How should the applications synchronize so they don't use the same ids? The serial column is very good in this case, as the nextval() call uses a lock internally, and only one process can call it at a time.

However, you could get a bunch of numbers from the sequence used by the serial column, and assign them directly in SQLAlchemy. Unfortunately the only way to get 100 numbers from the sequence is:

`SELECT nextval('posts_id_seq') FROM generate_series(1,100)`

so you need to call nextval() 100 times. And you need to get those 100 numbers from the database into SQLAlchemy, assign one to each object, and send it back to the database in the generated query. This way each number travels twice instead of just once, as in the first case.

The problem with the sequence is that it is not transactional (which is actually good: when one transaction asks it for the next value, no other transaction needs to wait for the first one to finish). However, due to this non-transactional behaviour, you cannot run queries like:

`SELECT setval('posts_id_seq', nextval('posts_id_seq')+1000);`

because there could be another nextval() call in between the nextval() and setval() in the query above.

If I could have just one user using the database, then you're right, this could be faster. However, I usually assume I'm working on a real system, and in that case I cannot set the primary key outside the database.