PostgreSQL UTF-8 Conversion

It's becoming increasingly common for me to be involved in conversion of an old version of PostgreSQL to a new one, and at the same time, from an old "SQL_ASCII" encoding (that is, undeclared, unvalidated byte soup) to UTF-8.

Common ways to do this are to run pg_dumpall and then pipe the output through iconv or recode. When your source data is all pure ASCII, you don't even need that. When it's really all Windows-1252 (a superset of Latin-1, aka ISO-8859-1), it's easy.
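For example, a straightforward Windows-1252 conversion might look like this (file names are placeholders, and exact charset names vary slightly between tools):

$ pg_dumpall > dump.sql
$ iconv -f WINDOWS-1252 -t UTF-8 dump.sql > dump.utf8.sql

or, with recode:

$ recode CP1252..UTF-8 < dump.sql > dump.utf8.sql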

But often the data is stored in various unknown encodings, accumulated from several sources over the years, including some that's already in UTF-8. When you convert with iconv, it dies with an error at the first problem; recode will let you ignore encoding problems, but that leaves junk in your output.

The case I often encounter is fairly easy, but not perfect: lots of ASCII, some Windows-1252, and some UTF-8. Since both pure ASCII and UTF-8 can be detected mechanically, I put together a script to do that detection. It's Perl and uses the nice IsUTF8 module to check character encodings:
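A minimal sketch of that logic, substituting core Encode's strict decoder for the IsUTF8 check (so treat it as illustrative rather than the real thing):

#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode decode_utf8 encode_utf8);

# Any argument (e.g. --test) turns on preview mode.
my $test = @ARGV ? 1 : 0;

while (my $line = <STDIN>) {
    if ($line !~ /[^\x00-\x7F]/) {
        # Pure ASCII is already valid UTF-8; swallowed in preview mode.
        print $line unless $test;
    }
    elsif (eval { decode_utf8($line, Encode::FB_CROAK); 1 }) {
        # Structurally valid UTF-8 passes through (stderr in preview mode).
        if ($test) { print STDERR $line } else { print $line }
    }
    else {
        # Everything else is presumed Windows-1252; convert it.
        print encode_utf8(decode('cp1252', $line));
    }
}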

Pipe input to the script; it handles one line at a time. When run with any arguments (such as --test), it will swallow pure ASCII lines, write lines it thinks are valid UTF-8 to stderr, and convert the remaining presumed Windows-1252 lines to stdout, for manual examination.

If its guesses look correct, run it again with no arguments, and it will write all three classes of line to stdout, with the Windows-1252 ones converted, ready for input to psql in your new UTF-8 encoded database.
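Typical usage might look like this (the script name is made up):

$ ./detect-encoding.pl --test < dump.sql > presumed-cp1252.txt 2> valid-utf8.txt
$ ./detect-encoding.pl < dump.sql | psql newdb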

(Don't forget to munge your pg_dump file to remove any hardcoded declarations of "SQL_ASCII" encoding from CREATE DATABASE statements, or otherwise make sure your database actually is created with UTF-8 encoding!)
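For example, creating the target database with an explicit encoding (the database name is a placeholder):

$ createdb -E UTF8 -T template0 newdb

Cloning template0 instead of template1 sidesteps the encoding-compatibility check that newer PostgreSQL versions apply when the new encoding differs from the template's.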


David Christensen said...

Hi Jon,

I've written this script before myself, and it can be simplified into a one-liner:

$ perl -MEncode -ne 'print encode_utf8(encode_utf8(eval { decode_utf8($_) } || decode(q{cp-1252}, $_)))'

This of course assumes the only errors thrown by decode_utf8() are malformed character issues, but you end up doing the same thing, piping the data through the script, etc. And of course the same caveats apply as far as making sure the database encoding is set up correctly.

Since cp-1252 covers all 256 octets structurally, the decode will always succeed based on range, so practically speaking that's as good as we can get without further knowledge of the underlying encodings. And for a US English environment, particularly in a world that interfaces heavily with certain OSes, this will cover all your common cases.

It would also be nice to have a variant that repeatedly decodes the UTF-8 until it won't decode any more, to handle repeatedly-UTF-8-encoded data; I've seen this situation arise when the backing app is known not to handle its encodings properly.
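A rough sketch of that repeated-decode idea (the helper name is made up, and data that is only accidentally valid UTF-8 will get unwrapped too):

use Encode ();

sub deep_decode_utf8 {
    my ($bytes) = @_;
    while (1) {
        my $chars = eval { Encode::decode_utf8($bytes, Encode::FB_CROAK) };
        return $bytes unless defined $chars;  # malformed: keep the last good round
        return $chars if $chars eq $bytes;    # pure ASCII: nothing left to unwrap
        # A round that yields characters above 0xFF was the final layer.
        return $chars unless utf8::downgrade($chars, 1);
        $bytes = $chars;
    }
}

# usage: print Encode::encode_utf8(deep_decode_utf8($line));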



Jon Jensen said...

David, your one-liner is definitely more elegant than my little script.

The one thing I miss in it is the ability to preview the lines of interest. I haven't yet run into a case where it saved me from passing on junk, but being able to manually review only the lines that the script converted was a nice reassurance that I was making things better, not worse.

By blindly converting the whole file I lose that little sanity check. Otherwise, the one-liner is way better.

I agree that double-decoding UTF-8 will come in handy at times too. Especially if it can be autodetected somehow. :)

David Christensen said...

...and of course, speaking of doubly-encoded utf8, that should be:

$ perl -MEncode -ne 'print encode_utf8(eval { decode_utf8($_) } || decode(q{cp-1252}, $_))'

I can understand liking the "preview"/"changes only" mode; perhaps using the Encode::is_utf8 function would serve this purpose? (I'd be surprised if the IsUTF8 module did much more than expose the SV UTF8 flag.)

Jon Jensen said...

No, that's the big difference. IsUTF8 actually checks the bytes and does what I want; I have not found &Encode::is_utf8 useful in the past.

Jon Jensen said...

Just to clarify, take a look at the very simple source:

David Christensen said...

There's a second argument to Encode::is_utf8 which does a deep-structure check on the underlying bytes. However, I'm not sure when this was introduced; I only actually noticed it today.

From Encode(3pm):

is_utf8(STRING [, CHECK])

[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If CHECK is true, also checks the data in STRING for being well-formed UTF-8. Returns true if successful, false otherwise.

David Christensen said...

And for one final followup on this post: I forgot that the default action for decode_utf8 is to put a substitution character in place of malformed data; to actually handle this as expected, you need to provide the CHECK parameter 1 (== FB_CROAK), leading to a final command of:

$ perl -MEncode -ne 'print encode_utf8(eval { decode_utf8($_, 1) } || decode(q{cp-1252}, $_))' infile.mixed > outfile.utf8