End Point


Welcome to End Point's blog

Ongoing observations by End Point people.

2014 hack.summit() wrap-up #hacksummit

Seeing the proposed line-up for the 2014 hack.summit() virtual conference was the grown-up equivalent of seeing the line-up for some of the first Lollapalooza events. It was definitely an "All those people I want to see, and all in one place? *head asplode*" moments.

So, what is this conference with the incredibly nerdy name? In short, it's a selection of industry leading speakers presenting all on-line and streamed live. The "registration fee" was actually a choice between mentioning the conference on a few social media platforms, or making a donation to one of a number of programming non-profits. Seeing as I don't tweet, I made a donation, then signed in (using OAuth) via my Google+ account. It was a delightfully frictionless process.

The hack.summit() conference ran December 1st through December 4th, but I was only able to "attend" the last two days. Luckily for me, all of the live-streamed presentations are also available afterwards on the hacksummit site. They feel a little hidden away in the small menu in the upper left corner, but they're all there, available as YouTube videos.

So, why is was hack.summit() worth your time? It's got an amazing collection of very accomplished developers, thought leaders and experienced big cheeses of some companies that do some pretty impressive work. During the live event, the Crowdcast platform provided a great delivery mechanism for the streaming videos, as well as admin-created polls, a light-weight chat feature, and audience-voted questions for the presenters. Hack.summit() founder, Ed Roman, did a great job MC-ing the entire event, too. (And to whoever figured out how to game the voting system at a conference named hack.summit(), well played you rogue.)

In closing, I strongly recommend you do a few things: Go sign up right now to gain access to the presentation videos. Commit some time (make a deal with yourself, get approval to do a group viewing at work, whatever) to watch as many presentations as you can. Lastly, set a calendar reminder to keep an eye out for the hack.summit() 2015 that will hopefully happen.

Ecommerce in the Django World

Mezzanine (http://mezzanine.jupo.org/) is a powerful piece of software written for the Django Framework in Python that functions like a structured content management system similar to Drupal, WordPress, and others. Built on top of this CMS structure is a module which adds features for Ecommerce which is known as Cartridge (http://cartridge.jupo.org/).

Installing Cartridge/Mezzanine

Installing the Mezzanine CMS system from scratch can be accomplished in two methods, either by using the pip python package manager, or you can clone the current source from the git repository from the software maintainers. If you were planning to modify either Mezzanine or Cartridge to customize the setup for your own needs the latter would likely be preferable as you could then easily begin creating custom branches off the original source to track the customization work. However for this example I will show the pip method of installation:

pip install -U cartridge
Once installed, there is a mezzanine-project command which will allow you to create a new blank Mezzanine environment within a new directory, and in this case we will also send an option to instruct the command to install the Cartridge module as well.
mezzanine-project -a cartridge new_cartridge_project
At this point you will have a blank Mezzanine environment with the cartridge module installed, now the Django database must be populated with the model information for the application, and then the Django application server will be started.
cd new_cartridge_project
python manage.py createdb
python manage.py runserver
If all went well, you should see the startup messages for the Django application server which will list the version numbers of the various libraries it will use, and then should be up and listening on the loopback interface. At this point to complete the setup you would just need to point your httpd at this loopback socket to send connections there.

Cartridge Product Models

Products within Cartridge are defined within the python models for the application, there are three primary models to be concerned with in a Product definition, the Product, ProductVariation and ProductImage models specifically.

  • Product - Defines the primary attributes for a product like name, price, SKU, and can be populated with optional fields for sale prices, etc.
  • ProductVariation - Defines a variant of a product SKU, these would be most commonly used for things like product sizes and colors.
  • ProductImage - Defines the image for the picture of each product.
Each of these three models exist within Django as their own data sets, but then they have foreign key references from each Product to all of its Variants and Images. While you can alter these model definitions using the django shell and other methods within python, you can also update and manage the products within the admin interface for the django site.

Adding Products / Tracking Orders

Adding products through the admin interface for Django was relatively easy and quick:

Within this interface you would provide the details for the products name, published status, date range for the product to be on the site, and product description in addition to any Variants or Images for the product. This is where the Mezzanine foundation of Cartridge start to show through, where this process mirrors the creating of a content page within Mezzanine, but has added these attributes of the product to the definition.

In the same way, Product Variants, Discount Codes, and Sales can be created in much the same way within the admin interface. In this way, once you had a basic Cartridge setup in place on your server, within most use cases for setting up a simple web store, Cartridge would remove the need to do any further hacking of python code, and would allow any user familiar with a CMS workflow for creating and managing the objects within the environment to manage the store.

In addition, Cartridge provides order management within the admin interface as well:

From this interface you can update orders, create new orders, print the PDF invoices for the orders and track whether orders have been fulfilled and shipped to the customer.


Like other Ecommerce applications built on top of a Content Management System, Cartridge and Mezzanine work well together to create a stack that allows for both ease of getting started with the system, and a massive degree of customization possible. Each component of Cartridge is defined within Django, and can be rewritten and customized to serve any number of use cases. The setup process for Cartridge was pretty straightforward, but one recommendation I can make is that once you have a working system up, you should place all of the files for Mezzanine and Cartridge into a source code management system like Git. Like other web applications of this type, the complexity of each change to the system can create a possible point of failure. Once you start making changes to the Mezzanine configuration files and the python for the Models and Views for Cartridge, controlling those changes becomes critical. In this way Mezzanine and Cartridge are spanning the normal continuum of software with the polar extremes of ease of use and customization by providing a good example to start with in the setup, and also exposing the full range of customization to the developer.

Easier IE Site Testing With RemoteIE

Microsoft recently announced a new service which I'm finding very useful. RemoteIE lets you test your sites with IE (currently version 11) on Windows 10 Technical Preview. The service runs in Azure RemoteApp which is available for several clients including Android, iOS and Windows Phone. What's great about this is that you do not have to install and maintain your own virtual machine with VirtualBox or VMWare.


To use RemoteIE you'll need a valid Microsoft account — it's easy to sign up if you don't already one. Once you have an account and have downloaded & installed the Azure RemoteApp client of your choice it's just a matter of starting it up and logging in. Happy Testing!

MySQL to PostgreSQL Migration Tips

I recently was involved in a project to migrate a client's existing application from MySQL to PostgreSQL, and I wanted to record some of my experiences in doing so in the hopes they would be useful for others.

Note that these issues should not be considered exhaustive, but were taken from my notes of issues encountered and/or things that we had to take into consideration in this migration process.

Convert the schema

The first step is to convert the equivalent schema in your PostgreSQL system, generated from the original MySQL.

We used `mysqldump --compatible=postgresql --no-data` to get a dump which matched PostgreSQL's quoting rules. This file still required some manual editing to cleanup some of the issues, such as removing MySQL's "Engine" specification after a CREATE TABLE statement, but this resulted in a script in which we were able to create a skeleton PostgreSQL database with the correct database objects, names, types, etc.

Some of the considerations here include the database collations/charset. MySQL supports multiple collations/charset per database; in this case we ended up storing everything in UTF-8, which matched the encoding of the PostgreSQL database, so there were no additional changes needed here; otherwise, it would have been necessary to note the original encoding of the individual tables and later convert that to UTF-8 in the next step.

We needed to make the following modifications for datatypes:

MySQL Datatype PostgreSQL Datatype
tinyint int
int(NN) int
blob bytea*
datetime timestamp with timezone
int unsigned int**
enum('1') bool
longtext text
varbinary(NN) bytea

* Note: we ended up converting these specific fields to text, just given the data that was stored in these fields in actuality, which just goes to show you should review your data.

** Note: because PostgreSQL does not have unsigned numeric types, if this feature is an important part of your data model you can/should add a CHECK constraint to the column in question to check that the value is non-negative.

A few other syntactic changes; MySQL's UNIQUE KEY in the CREATE TABLE statement needs to just be UNIQUE.

Some of the MySQL indexes were defined as FULLTEXT indexes as well, which was a keyword PostgreSQL did not recognize. We made note of these, then created just normal indexes for the time being, intending to review to what extent these actually needed full text search capabilities.

Some of the AUTO_INCREMENT fields did not get the DEFAULT value set correctly to a sequence, because those types just ended up as integers without being declared a serial field, so we used the following query to correct this:

-- cleanup missing autoincrement fields

datasource AS (
        information_schema.key_column_usage k
        attrelid = k.table_name :: regclass AND
        attname = k.column_name
        adrelid = k.table_name :: regclass AND
        adnum = k.ordinal_position
        table_name IN (
            SELECT table_name::text FROM information_schema.key_column_usage WHERE constraint_name LIKE '%_pkey' GROUP BY table_name HAVING count(table_name) = 1
        ) AND
        adsrc IS NULL AND
        atttypid = 'integer' ::regtype
frags AS (
        quote_ident(table_name || '_' || column_name || '_seq') AS q_seqname,
        quote_ident(table_name) as q_table,
        quote_ident(column_name) as q_col
queries AS (
        'CREATE SEQUENCE ' || q_seqname || ';
' ||
        'ALTER TABLE ' || q_table || ' ALTER COLUMN ' || q_col || $$ SET DEFAULT nextval('$$ || q_seqname || $$');
    $$ ||
        $$SELECT setval('$$ || q_seqname || $$',(SELECT max($$ || q_col || ') FROM ' || q_table || '));
' AS query
    FROM frags
    COALESCE(string_agg(query, E'\n'),$$SELECT 'No autoincrement fixes needed';$$) AS queries FROM queries


Basically the idea is that we look for all table with a defined integer primary key (hand-waving it it by using the _pkey suffix in the constraint name), but without a current default value, then generate the equivalent SQL to create a sequence and set that table's default value to the nextval() for the sequence in question. We also generate SQL to scan that table and set that sequence value to the next appropriate value for the column in question. (Since this is for a migration and we know we'll be the only user accessing these tables we can ignore MVCC.)

Another interesting thing about this script is that we utilize psql's ability to store results in a variable, using the \gset command, then we subsequently execute this SQL by interpolating that corresponding variable in the same script.

Convert the data

The next step was to prepare the data load from a MySQL data-only dump. Using a similar dump recipe as for the initial import, we used: `mysqldump --compatible=postgresql --no-create-info --extended-insert > data.sql` to save the data in a dump file so we could iteratively tweak our approach to cleaning up the MySQL data.

Using our dump file, we attempted a fresh load into the new PostgreSQL database. This failed initially due to multiple issues, including ones of invalid character encoding and stricter datatype interpretations in PostgreSQL.

What we ended up doing was to create a filter script to handle all of the "fixup" issues needed here. This involved decoding the data and reencoding to ensure we were using proper UTF8, performing some context-sensitive datatype conversions, etc.

Additional schema modifications

As we were already using a filter script to process the data dump, we decided to take the opportunity to fixup some warts in the current table definitions. This included some fields which were varchar, but should have actually been numeric or integer; as this was a non-trivial schema (100 tables) we were able to use PostgreSQL's system views to identify a list of columns which should should be numeric and were currently not.

Since this was an ecommerce application, we identified columns that were likely candidates for data type reassignment based on field names *count, *qty, *price, *num.

Once we identified the fields in question, I wrote a script to generate the appropriate ALTER TABLE statements to first drop the default, change the column type, then set the new default. This was done via a mapping between table/column name and desired output type.

Convert the application

The final (and needless to say most involved step) was to convert the actual application itself to work with PostgreSQL. Despite the fact that these databases both speak SQL, we had to come up for solutions for the following issues:

Quotation styles

MySQL is more lax with its quoting styles, so some of this migration involved hunting down differences in quoting styles. The codebase contained lots of double-quoted string literals, which PostgreSQL interprets as identifiers, as well as the difference in quoting of column names (backticks for MySQL, double-quotes for PostgreSQL). These had to be identified wherever they appeared and fixed to use a consistent quoting style.

Specific unsupported syntax differences:


MySQL supports the INSERT ON DUPLICATE KEY syntax. Modifying these queries involved creating a special UPSERT-style function to support the different options in use in the code base. We isolated and categorized the uses of INSERT ON DUPLICATE KEY UPDATE into several categories: those which did a straight record replace and those which did some sort of modification. I wrote a utility script (detailed later in this article) which served to replicate the logic needed to handle this as the application would expect.

Upcoming versions of PostgreSQL are likely to incorporate an INSERT ... ON CONFLICT UPDATE/IGNORE syntax, which would produce a more direct method of handling migration of these sorts of queries.


MySQL's INSERT ... IGNORE syntax allows you to insert a row and effectively ignore a primary key violation, assuming that the rest of the row is valid. You can handle this case via creating a similar UPSERT function as in the previous point. Again, this case will be easily resolved if PostgreSQL adopts the INSERT ... ON CONFLICT IGNORE syntax.


MySQL's REPLACE INTO syntax effectively does a DELETE followed by an INSERT; it basically ensures that a specific version of a row exists for the given primary key value. We handle this case by just modifying these queries to do an unconditional DELETE for the Primary Key in question followed by the corresponding INSERT. We ensure these are done within a single transaction so the result is atomic.


Date interval syntax can be slightly different in MySQL; intervals may be unquoted in MySQL, but must be quoted in PostgreSQL. This project necessitated hunting down several instances to add quoting of specific literal INTERVAL instances.

Function considerations


Many times when you insert a records into a MySQL table, later references to this are found using the last_insert_id() SQL function. These sorts of queries need to be modified to utilize the equivalent functionality using PostgreSQL sequences, such as the currval() function.


MySQL has the GROUP_CONCAT function, which serves as a string "join" of sorts. We emulate this behavior in PostgreSQL by using the string_agg aggregate function with the delimiter of choice.

CONCAT_WS() - expected to be but not an issue; PG has this function

PostgreSQL has included a CONCAT_WS() function since PostgreSQL 9.1, so this was not an issue with the specific migration, but could still be an issue if you are migrating to an older version of PostgreSQL.


This function does not exist directly in PostgreSQL, but can be simulated using to_date(). Note however that the format string argument differs between MySQL and PostgreSQL's versions.


MySQL has a date_format() function which transforms a date type to a string with a given format option. PostgreSQL has similar functionality using the to_char() function; the main difference here lies in the format string specifier.


DateDiff() does not exist in PostgreSQL, this is handled by transforming the function call to the equivalent date manipulation operators using the subtraction (-) operator.

rand() to random()

This is more-or-less a simple function rename, as the equivalent functionality for returning a random float between 0.0 <= x <= 1.0 exists in PostgreSQL and MySQL, it's just what the function name itself is. The other difference is that MySQL supports a scale argument so the random number for rand(N) will be returned between 0.0 <= x <= N, whereas you'd have to scale the result in PostgreSQL yourself, via random() * N.


MySQL has an IF() function which returns the second argument in the case the first argument evaluates to true otherwise returns the third argument. This can be trivially converted from IF(expression1, arg2, arg3) to the equilvalent PostgreSQL syntax: CASE WHEN expression1 THEN arg2 ELSE arg3.


MySQL has a function IFNULL() which returns the first argument if it is not NULL, otherwise it returns the second argument. This can effectively be replaced by the PostgreSQL COALESCE() function, which serves the same purpose.


MySQL has a built-in function called split_part() which allows you to access a specific index of an array delimited by a string. PostgreSQL also has this function, however the split_part() function in MySQL allows the index to be negative, in which case this returns the part from the right-hand side.

in MySQL:
split_part('a banana boat', ' ', -1) => 'boat'
in PostgreSQL:
split_part('a banana boat', ' ', -1) => // ERROR:  field position must be greater than zero

I fixed this issue by creating a custom plpgsql function to handle this case. (In my specific case, all of the negative indexes were -1; i.e., the last element in the array, so I created a function to return only the substring occurring after the last instance of the delimiter.)

Performance considerations

You may need to revisit COUNT(*) queries

MySQL MyISAM tables have a very fast COUNT(*) calculation, owing to queries taking a table lock (which means MySQL can cache the COUNT(*) result itself, since there can be no concurrent updates), while PostgreSQL utilizes MVCC in order to calculate table counts which necessitates a full table scan to see which rows are visible to the specific calling snapshot, so this assumption may need to be revisited in order to calculate an equivalent performant query.


MySQL is much more (*ahem*) flexible when it comes to GROUP BY/aggregate queries, allowing some columns to be excluded in a GROUP BY or an aggregate function. Making the equivalent query in PostgreSQL involves transforming the query from SELECT ... GROUP BY (cols) to a SELECT DISTINCT ON (cols) ... and providing an explicit sort order for the rows.

More notes

Don't be afraid to script things; in fact, I would go so far as to suggest that everything you do should be scripted. This process was complicated and there were lots of moving parts to ensure moved in tandem. There were changes being made on the site itself concurrently, so we were doing testing against a dump of the original database at a specific point-in-time. Having everything scripted ensured that this process was repeatable and testable, and that we could get to a specific point in the process without having to remember anything I'd done off-the-cuff.

In addition to scripting the actual SQL/migrations, I found it helpful to script the solutions to various classifications of problems. I wrote some scripts which I used to create some of the various scaffolding/boilerplate for the tables involved. This included a script which would create an UPSERT function for a specific table given the table name, which was used when replacing the INSERT ON DUPLICATE KEY UPDATE functions. This generated script could then be tailored to handle more complex logic beyond a simple UPDATE. (One instance here is an INSERT ON DUPLICATE KEY UPDATE which increased the count of a specific field in the table instead of replacing the value.)

#!/usr/bin/env perl
# -*-cperl-*-

use strict;
use warnings;

use Data::Dumper;

my $table = shift or die "Usage: $0 <table>\n";
my @cols = @ARGV;

my $dbh = DBI->connect(...);

my @raw_cols = @{ $dbh->column_info(undef, 'public', $table, '%')->fetchall_arrayref({}) };
my %raw_cols = map { $_->{COLUMN_NAME} => $_ } @raw_cols;

die "Can't find table $table\n" unless @raw_cols;

my @missing_cols = grep { ! defined $raw_cols{$_} } @cols;

die "Referenced non-existing columns: @missing_cols\n" if @missing_cols;

my %is_pk;

unless (@cols) {
    @cols = map { $_->{COLUMN_NAME} } @raw_cols;

my @pk_cols = $dbh->primary_key(undef, 'public', $table);

@is_pk{@pk_cols} = (1)x@pk_cols;

my @data_cols = grep { ! $is_pk{$_} } @cols;

die "Missing PK cols from column list!\n" unless @pk_cols == grep { $is_pk{$_} } @cols;
die "No data columns!\n" unless @data_cols;

print <<EOF
    upsert_$table (@{[
    join ', ' => map {
        "_$_ $raw_cols{ $_ }->{pg_type}"
    } @cols
LANGUAGE plpgsql
        UPDATE $table SET @{[
    join ', ' => map { "$_ = _$_" } @data_cols
]} WHERE @{[
    join ' AND ' => map { "$_ = _$_" } @pk_cols
        END IF;
            INSERT INTO $table (@{[join ',' => @cols]}) VALUES (@{[join ',' => map { "_$_" } @cols]});
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing, and loop to try the UPDATE again.

This script created an upsert function from a given table to update all columns by default, also allowing you to create one with a different number of columns upserted.

I also wrote scripts which could handle/validate some of the column datatype changes. Since there were large numbers of columns which were changed, often multiple in the same table, I was able to have this script create a single ALTER TABLE statement with multiple ALTER COLUMN TYPE USING clauses, plus be able to specify the actual method that these column changes were to take place. These included several different approaches, depending on the target data type, but generally were to solve cases where there were fairly legitimate data that was not picked up by PostgreSQL's input parsers. These included how to interpret blank fields as integers (in some cases we wanted it to be 0, in others we wanted it to be NULL), weird numeric formatting (leaving off numbers before or after the decimal point), etc.

We had to fix up in several locations missing defaults for AUTO_INCREMENT columns. The tables were created with the proper datatype, however we had to find tables which matched a specific naming convention and create/associate a sequence/serial column, set the proper default here, etc. (This was detailed above.)

There was a fair amount of iteration and customization in this process, as there was a fair amount of data which was not of the expected format. The process was iterative, and generally involved attempting to alter the table from within a transaction and finding the next datum which the conversion to the expected type did not work. This would result in a modification of the USING clause of the ALTER TABLE ALTER COLUMN TYPE to accommodate some of the specific issues.

In several cases, there were only a couple records which had bad/bunko data, so I included explicit UPDATE statements to update those data values via primary key. While this felt a bit "impure", it was a quick and preferred solution to the issue of a few specific records which did not fit general rules.

Integrate Twilio in Django


Twilio is a powerful HTTP API that allows you to build powerful voice and SMS apps. The goal of this blog post is to help make building the SMS applications as simple as possible in django.

There is a already Twilio Python help library available. The open source twilio-python library lets us to write python code to make HTTP requests to the Twilio API.


The easiest way to install twilio-python library is using pip. Pip is a package manager for python.

Simply run following command in terminal.

$ pip install twilio

Twilio API Credentails

To Integrate twilio API in django application, we need TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN variables. These variables can be found by logging into your Twilio account dashboard. These variables are used to communicate with the Twilio API.

You’ll need to add them to your settings.py file:


Create a New App

We are going to interact with people using SMS, so I prefer to create an app named communication. I am assuming, you've already installed Django.

Run following command in terminal.

$ django-admin.py startapp communcation

We will need to register the new app in our django project.
Add it to your INSTALLED_APPS tuple in your settings.py file:



Create the Model

Now we’ll open up communication/models.py to start creating models for our app.

class SendSMS(models.Model):
    to_number = models.CharField(max_length=30)
    from_number = models.CharField(max_length=30)
    sms_sid = models.CharField(max_length=34, default="", blank=True)
    account_sid = models.CharField(max_length=34, default="", blank=True)
    created_at = models.DateTimeField(auto_now_add=True)
    sent_at = models.DateTimeField(null=True, blank=True)
    delivered_at = models.DateTimeField(null=True, blank=True)
    status = models.CharField(max_length=20, default="", blank=True)

and run the syncdb command after defining the model:

$ python manage.py syncdb

It will create the necessary database tables for our app.

Create utils.py file

Create a new file named utils.py and save in communication/utils.py.

Put the following code in communication/utils.py:

from django.conf import settings

import twilio
import twilio.rest

def send_twilio_message(to_number, body):
    client = twilio.rest.TwilioRestClient(
        settings.TWILIO_ACCOUNT_SID, settings.TWILIO_AUTH_TOKEN)

    return client.messages.create(

Testing send_twilio_message

Open the shell and run following commands.

>>> from communication.utils import send_twilio_message
>>> sms = send_twilio_message('+15005550006', 'Hello Endpointer,')
>>> print sms.sid

Having the sid means that everything in the backend is working fine. And we can proceed to work on the front end.

Create Form

Lets create a form to gather the data.  Now open/create up communication/forms.py to start creating forms for our app. And paste the following code into it:

class SendSMSForm(forms.ModelForm):

    class Meta:
        model = SendSMS
        fields = ('to_number', 'body')

The View CreateView

class SendSmsCreateView(CreateView):
    model = SendSMS
    form_class = SendSMSForm
    template_name = 'communication/sendsms_form.html'
    success_url = reverse_lazy('send_sms')

    def form_valid(self, form):
        number = self.cleaned_data['to_number']
        body = self.cleaned_data['body']
        # call twilio
        sent = send_twilio_message(number, body)

        # save form
        send_sms = form.save(commit=False)
        send_sms.from_number = settings.TWILIO_PHONE_NUMBER
        send_sms.sms_sid = sent.sid
        send_sms.account_sid = sent.account_sid
        send_sms.status = sent.status
        send_sms.sent_at = now()
        if sent.price:
            send_sms.price = Decimal(force_text(sent.price))
            send_sms.price_unit = sent.price_unit

    return super(SendSmsCreateView, self).form_valid(form)

Defining URLS

The URL configuration tells Django how to match a request’s path to your Python code. Django looks for the URL configuration, defined as urlpatterns, in the urls.py file in your project:

from django.conf.urls import patterns, url

from .views import SendSmsCreateView

urlpatterns = patterns('',

Creating the Template

Now that we’ve defined a URL for our list view, we can try it out. Django includes a server suitable for development purposes that you can use to easily test your project:

If you visit the in your browser, though, you’ll see an error: TemplateDoesNotExist.

This is because we have not defined the template file yet. So now create sendsms_form.html file in templates/communication/ and put the following code in it:

{% csrf_token %} {% for field in form %}
{{ field }} {{ field.errors }}
{% endfor %}

Now reload the in your browser. Assuming everything is okay, you should then see the following form:

Fill out the form, and hit the submit button to send your SMS.


Congratulations, your SMS is successfully sent. Good luck!

When Postgres will not start

One of the more frightening things you can run across as a DBA (whether using Postgres or a lesser system) is a crash followed by a complete failure of the database to start back up. Here's a quick rundown of the steps one could take when this occurs.

The first step is to look at why it is not starting up by examining the logs. Check your normal Postgres logs, but also check the filename passed to the --log argument for pg_ctl, as Postgres may not have even gotten far enough to start normal logging. Most of the time these errors are not serious, are fairly self-explanatory, and can be cured easily - such as running out of disk space. When in doubt, search the web or ask in the #postgresql IRC channel and you will most likely find a solution.

Sometimes the error is more serious, or the solution is not so obvious. Consider this problem someone had in the #postgresql channel a while back:

LOG: database system was interrupted while in recovery at 2014-11-03 12:43:09 PST
HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 1883/AF9458E8
LOG: unexpected pageaddr 1882/BAA7C000 in log file 6275, segment 189, offset 10993664
LOG: redo done at 1883/BDA7A9A8
LOG: last completed transaction was at log time 2014-10-25 17:42:53.836929-07
FATAL: right sibling's left-link doesn't match: block 6443 links to 998399 instead of expected 6332 in index "39302035"

As you can see, Postgres has already hinted you may be in deep trouble with its suggestion to use a backup. The Postgres daemon completely fails to start because an index is corrupted. Postgres has recognized that the B-tree index no longer looks like a B-tree should and bails out.

For many errors, the next step is to attempt to start Postgres in single-user mode. This is similar to "Safe mode" in Windows - it starts Postgres in a simplified, bare-bones fashion, and is intended primarily for debugging issues such as a failed startup. This mode is entered by running the 'postgres' executable directly (as opposed to having pg_ctl do it), and passing specific arguments. Here is an example:

$ /usr/bin/postgres --single -D /var/lib/pgsql93/data -P -d 1

This starts up the 'postgres' program (used to be 'postmaster'), enters single-user mode, specifies where the data directory is located, turns off system indexes, and sets the debug output to 1. After it is run, you will have a simple prompt. From here you can fix your problem, such as reindexing bad indexes, that may have caused a normal startup to fail. Use CTRL-d to exit this mode:

$ /usr/bin/postgres --single -D /var/lib/pgsql93/data -P -d 1
NOTICE:  database system was shut down at 2014-11-20 16:51:26 UTC
DEBUG:  checkpoint record is at 0/182B5F8
DEBUG:  redo record is at 0/182B5F8; shutdown TRUE
DEBUG:  next transaction ID: 0/1889; next OID: 12950
DEBUG:  next MultiXactId: 1; next MultiXactOffset: 0
DEBUG:  oldest unfrozen transaction ID: 1879, in database 1
DEBUG:  oldest MultiXactId: 1, in database 1
DEBUG:  transaction ID wrap limit is 2147485526, limited by database with OID 1
DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1

PostgreSQL stand-alone backend 9.3.5
backend> [CTRL-d]
NOTICE:  shutting down
NOTICE:  database system is shut down

If you are not able to fix things with single-user mode, it's time to get serious. This would be an excellent time to make a complete file-level backup. Copy the entire data directory to a different server or at least a different partition. Make sure you get everything in the pg_xlog directory as well, as it may be symlinked elsewhere.

Time to use pg_resetxlog, right? No, not at all. Use of the pg_resetxlog utility should be done as an absolute last resort, and there are still some things you should try first. Your problem may have already been solved - so the next step should be to upgrade Postgres to the latest revision. With Postgres, a revision (the last number in the version string) is always reserved for bug fixes only. Further, changing the revision is almost always as simple as installing a new binary. So if you are running Postgres version 9.0.3, upgrade to the latest in the 9.0 series (9.0.18 as of this writing). Check the release notes, make the upgrade, and try to start up Postgres.

Still stumped? Consider asking for help. For fast, free help, try the #postgresql IRC channel. For slightly slower free help, try the pgsql-general mailing list. For both of these options, the majority of the subscribers are clustered near the US Eastern time zone, so response times will be faster at 3PM New York time versus 3AM New York time. For paid help, you can find a Postgres expert (such as End Point!) at the list of professional services at postgresql.org,

The next steps depend on the error, but another route is to hack the source code for Postgres to work around the error preventing the startup. This could mean, for example, changing a FATAL exception to an WARNING, or other trickery. This is expert-level stuff, to be sure, but done carefully can still be safer than pg_resetxlog. If possible, try this on a copy of the data!

If you have done everything else, it is time to attempt using pg_resetxlog. Please make sure you read the manual page about it before use. Remember this is a non-reversible, possibly data-destroying command! However, sometimes it is the only thing that will work.

If you did manage to fix the problem - at least enough to get Postgres to start - the very next item is to make a complete logical backup of your database. This means doing a full pg_dump right away. This is especially important if you used pg_resetxlog. Dump everything, then restore it into a fresh Postgres cluster (upgrading to the latest revision first if needed!). The pg_dump will not only allow you to create a clean working version of your database, but is a great way to check on the integrity of your data, as it by necessity examines every row of data you have. It will *not* check on the sanity of your indexes, but there are other ways to do that, the simplest being to do a REINDEX DATABASE on each database in your cluster.

All of these steps, including pg_resetxlog, may or may not help. In the "left-link doesn't match" example at the top, nothing was able to fix the problem (not single-user mode, nor a more recent revision, nor pg_resetxlog). It's possible that the data could have been recovered by hacking the source code or using tools to extract the data directly, but that was not necessary as this was a short-lived AWS experiment. The consensus was it was probably a hardware problem. Which goes to show that you can never totally trust your hardware or software, so always keep tested, frequent, and multiple backups nearby!

MongoDB and OpenStack - OSI Days 2014, India

The 11th edition of Open Source India, 2014 was held at Bengaluru, India. The two day conference was filled with three parallel tech talks and workshops which was spread across various Open Source technologies.


In-depth look at Architecting and Building solutions using MongoDB

Aveekshith Bushan & Ranga Sarvabhouman from MongoDB started off the session with a comparison of the hardware cost involved with storage systems in earlier and recent days. In earlier days, the cost of storage hardware was very expensive, so the approach was to filter the data to reduce the size before storing into the database. So we were able to generate results from filtered data and we didn’t have option to process the source data. After the storage became cheap, we can now store the raw data and then we do all our filter/processing and then distribute it.

        Filter -> Store -> Distribute
        Store -> Filter -> Distribute

Here we are storing huge amount of data, so we need a processing system to handle and analyse the data in efficient manner. In current world, the data is growing like anything and 3Vs are phenomenal of growing (Big)Data. We need to handle the huge Volume of Variety of data in a Velocity. MongoDB follows certain things to satisfy the current requirement.

MongoDB simply stores the data as a document without any data type constraints which helps to store huge amount of data quickly. It leaves the constraints checks to the application level to increase the storage speed in database end. But it does recognises the data type after the data is stored as document. In simple words, the philosophy is: Why do we need to check the same things (datatype or other constraints) in two places (application and database)?

MongoDB stores all relations as single document and fetches the data in single disk seek. By avoiding multiple disk seeks, this results in the fastest retrieval of data. Whereas in relational database the relations stored in different tables which leads to multiple disk seek to retrieve the complete data of an entity. And MongoDB doesn’t support joins but it have Reference option to refer another collection(Table) without imposing foreign key constraints.

As per db-engines rankings, MongoDB stays in the top of NoSQL database world. Also it provides certain key features which I have remembered from the session:
    • Sub-documents duplicates the data but it helps to gain the performance(since the storage is cheap, the duplication doesn’t affect much)
    • Auto-sharding (Scalability)
    • Sharding helps parallel access to the system
    • Range Based Sharding 
    • Replica Sets (High availability)
    • Secondary indexes available
    • Indexes are single tunable part of the MongoDB system 
    • Partition across systems 
    • Rolling upgrades
    • Schema free
    • Rich document based queries
    • Read from secondary
When do you need MongoDB?
    • The data grows beyond the system capacity in relational database
    • In a need of performance in online requests
Finally, speakers emphasized to understand use case clearly and choose right features of MongoDB to get effective performance.

OpenStack Mini Conf

A special half day OpenStack mini conference was organised at second half of first day. The talks were spread across basics to in depth of OpenStack project. I have summarised all the talks here to give an idea of OpenStack software platform.

OpenStack is a Open Source cloud computing platform to provision the Infrastructure as a Service(IaaS). There is a wonderful project DevStack out there to set up the OpenStack on development environment in easiest and fastest way. A well written documentation of the OpenStack project clearly explains everything. In addition, anyone can contribute to OpenStack with help of How to contribute guide, also project uses Gerrit review system and Launchpad bug tracking system.

OpenStack have multiple components to provide various features in Infrastructure as a Service. Here is the list of OpenStack components and the purpose of each one.

Nova (Compute) - manages the pool of computer resources
Cinder (Block Storage) - provides the storage volume to machines
Neutron (Network) - manages the networks and IP addresses
Swift (Object Storage) - provides distributed high availability(replication) on storage system.
Glance (Image) - provides a repository to store disk and server images
KeyStone (Identity) - enables the common authentication system across all components
Horizon (Dashboard) - provides GUI for users to interact with OpenStack components
Ceilometer (Telemetry) - provides the services usage and billing reports
Ironic (Bare Metal) - provisions bare metal instead of virtual machines
Sahara (Map Reduce) - provisions hadoop cluster for big data processing

OpenStack services are usually mapped to AWS services to better understand the purpose of the components. The following table depicts the mapping of similar services in OpenStack and AWS:

AWS Console
Elastic Mapreduce

Along with the overview of OpenStack architecture, there were couple of in-depth talks which are listed below with slides.
That was a wonderful Day One of OSI 2014 which helped me to get better understanding of MongoDB and OpenStack.