Welcome to End Point's blog

Ongoing observations by End Point people.

Sanitizing supposed UTF-8 data

It's now clear that Unicode has won the character set encoding wars, with UTF-8 by far the most popular encoding and the expected default. In a few more years we'll probably find discussion of other character set encodings arcane, relegated to "data historians" and people working with legacy systems.

But we're not there yet! There's still lots of migration to do before we can forget about everything that's not UTF-8.

Last week I again found myself converting data. This time I was taking data from a PostgreSQL database with no specified encoding (so-called "SQL_ASCII", really just raw bytes), and sending it via JSON to a remote web service. JSON uses UTF-8 by default, and that's what I needed here. Most of the source data was in either UTF-8, ISO Latin-1, or Windows-1252, but some was in non-Unicode Chinese or Japanese encodings, and some was just plain mangled.

At this point I need to remind you of one of the most unusual aspects of UTF-8: it has a limited set of valid forms. Legacy encodings typically used all or most of the 256 code points in their 8-bit space (leaving code point 0 for the traditional ASCII NUL). While UTF-8 is compatible with 7-bit ASCII, it does not allow every possible 8-bit byte in every position. See the Wikipedia summary of invalid byte sequences for what counts as invalid.
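One quick way to see this in Perl is with the core Encode module: decoding with the FB_CROAK check dies on any malformed sequence. This is my own illustration, not code from the migration described here:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Returns true if the byte string is well-formed UTF-8; decode() with
# FB_CROAK dies on any invalid byte sequence, which we trap with eval.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";  # well-formed UTF-8
print is_valid_utf8("caf\xE9")     ? "valid\n" : "invalid\n";  # bare Latin-1 byte
print is_valid_utf8("\xC0\xAF")    ? "valid\n" : "invalid\n";  # overlong encoding of "/"
```

Note that the strict 'UTF-8' encoding name (with the hyphen) is what rejects overlong forms and other invalid sequences; Perl's lax internal 'utf8' encoding is more permissive.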

We had no need to try to fix the truly broken data, but we wanted to convert everything possible to UTF-8 and at the very least guarantee no invalid UTF-8 strings appeared in what we sent.

I previously wrote about converting a PostgreSQL database dump to UTF-8, and used the Perl CPAN module IsUTF8.

I was going to use that again, but looked around and found an even better module, exactly targeting this use case: Encoding::FixLatin, by Grant McLean. Its documentation says it "takes mixed encoding input and produces UTF-8 output" and that's exactly what it does, focusing on input with mixed UTF-8, Latin-1, and Windows-1252.
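The module's public interface is a single exported function. Here is a minimal sketch of the kind of call involved, assuming Encoding::FixLatin is installed; the sample byte strings are my own:

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

binmode STDOUT, ':encoding(UTF-8)';

# Raw byte strings in three different encodings, mixed together the way
# SQL_ASCII data often is:
my @raw = (
    "caf\xC3\xA9",           # already valid UTF-8
    "caf\xE9",               # ISO Latin-1
    "he said \x93hi\x94",    # Windows-1252 curly quotes
);

for my $bytes (@raw) {
    my $fixed = fix_latin($bytes);   # mixed-encoding input in, UTF-8 out
    print $fixed, "\n";
}
```

Each call returns a proper Perl character string, so all three lines print with the accented character and curly quotes intact.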

It worked very well, as advertised. We would have needed a different module to convert the other legacy encodings, but in this case Encoding::FixLatin was good enough and got the vast majority of the data right.

There's even a standalone fix_latin program designed specifically for processing Postgres pg_dump output from legacy encodings, and its documentation has some nice examples of how to use it.

One gotcha is similar to a catch that David Christensen reported with the Encode module in a blog post here about a year ago: if the Perl string already has its internal UTF-8 flag set, Encoding::FixLatin returns it immediately rather than trying to process it. So if you expect it to change anything, make sure the incoming data is a pure byte stream, or turn off the UTF-8 flag first.
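A minimal way to guard against that, sketched from my own testing rather than the module's docs: clear the flag before calling fix_latin(). utf8::downgrade() turns the flag off without changing the string's characters, provided every character fits in one byte:

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

my $string = "caf\xE9";      # Latin-1 bytes
utf8::upgrade($string);      # simulate data arriving with the UTF-8 flag on

# With the flag set, fix_latin() would hand $string back untouched.
# Downgrade to a plain byte string first (this dies if any character
# is above 255, so it is only safe for single-byte data like this):
utf8::downgrade($string);

my $fixed = fix_latin($string);   # now \xE9 really is converted
```

If the data arrives over a socket or from a file opened without an :encoding layer, it will already be a flag-off byte string and no downgrade is needed.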

Along the way I found some other CPAN modules that look useful for cases where I need more manual control than Encoding::FixLatin gives:

  • Search::Tools::UTF8 - test for and/or fix bad ASCII, Latin-1, Windows-1252, and UTF-8 strings
  • Encode::Detect - use Mozilla's universal charset detector and convert to UTF-8
  • Unicode::Tussle - ridiculously comprehensive set of Unicode tools that has to be seen to be believed

Once again Perl's thriving open source/free software community made my day!

3 comments:

Shane said...

Awesome tip. Nate Brunson and I made heavy use of Encoding::FixLatin for the same use case: we were exporting SQL_ASCII "encoded" data from Postgres into Oracle. It was a lifesaver. Beware of performance gotchas, though: I seem to remember Perl trying to do some sort of ioctl operation on every line, which really slowed down our conversions.

chansen said...

You can perform the above much more efficiently, with no performance hit unless the provided data is ill-formed (in which case it falls back to CP1252/Windows-1252). This solution processes the data regardless of the state of the SvUTF8 flag.

https://gist.github.com/1522213

Full disclosure, I'm the author of Unicode::UTF8.

Jon Jensen said...

Shane: Interesting. I'm not currently using it in an I/O-heavy situation but will watch out for that.

Christian, thanks for your comment and example code. If you were to add a fix_latin function to your module, I'd be happy to use that, but as it stands we don't have any performance problems and a single function call is a lot simpler than what's in your gist. :)