Welcome to End Point’s blog

Ongoing observations by End Point people

DBD::Pg, UTF-8, and Postgres client_encoding

Photo by Roger Smith

I've been working on getting DBD::Pg to play nicely with UTF-8, as the current system is suboptimal at best. DBD::Pg is the Perl interface to Postgres, and is the glue code that takes the data from the database (via libpq) and gives it to your Perl program. However, not all data is created equal, and that's where the complications begin.

Currently, everything coming back from the database is, by default, treated as byte soup, meaning no conversion is done, and no strings are marked as utf8 (Perl strings carry an internal 'utf8' flag telling Perl to treat the data as character data rather than raw bytes). If you want strings marked as utf8, you must currently set the pg_enable_utf8 attribute on the database handle like so:

$dbh->{pg_enable_utf8} = 1;

This causes DBD::Pg to scan incoming strings for high bits and mark the string as utf8 if it finds them. There are a few drawbacks to this system:

  • It does this for all databases, even SQL_ASCII!
  • It doesn't do this for everything, e.g. arrays, custom data types, xml.
  • It requires the user to remember to set pg_enable_utf8.
  • It adds overhead as we have to parse every single byte coming back from the database.

Here's one proposal for a new system. Feedback welcome, as this is a tricky thing to get right.

DBD::Pg will examine the client_encoding parameter, and see if it matches UTF8. If it does, then we can assume everything coming back to us from Postgres is UTF-8. Therefore, we'll simply flip the utf8 bit on for all strings. The one exception is bytea data, of course, which we'll read in and dequote into a non-utf8 string. Any non-UTF8 client_encodings (e.g. the monstrosity that is SQL_ASCII) will simply get back a byte soup, with no utf8 markings on our part.
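What that "flipping the utf8 bit" amounts to can be shown in a few lines of plain Perl (an illustration of the idea, not the actual DBD::Pg internals, which work at the C level):

```perl
use strict;
use warnings;
use Encode ();

# Raw UTF-8 octets, exactly as libpq would hand them back for "café".
my $raw = "caf\xc3\xa9";
print length($raw), "\n";   # 5 (bytes)

# Trust client_encoding and just set the utf8 flag -- no validation,
# no per-byte scan. (Encode::_utf8_on is the documented way to do this.)
Encode::_utf8_on($raw);
print length($raw), "\n";   # 4 (characters)
```

The point is the performance win: because we trust client_encoding, the data never has to be scanned or copied.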

The pg_enable_utf8 attribute will remain, so that applications that do their own decoding, or otherwise do not want the utf8 flag set, can forcibly disable it by setting pg_enable_utf8 to 0. Similarly, it can be forced on by setting pg_enable_utf8 to 1. The flag will always trump the client_encoding parameter.
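At connect time, the override would look something like this (a sketch of the proposed interface; the dbname, user, and password are placeholders, and a live server is needed to actually run it):

```perl
use strict;
use warnings;
use DBI;

# Proposed behavior, sketched: an explicit pg_enable_utf8 in the
# connect attributes always trumps whatever client_encoding says.
my $dbh = DBI->connect(
    'dbi:Pg:dbname=test', 'someuser', 'somepass',
    {
        AutoCommit     => 1,
        RaiseError     => 1,
        pg_enable_utf8 => 0,   # force raw bytes, even on a UTF8 connection
    },
);
```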

A further complication is client_encoding itself: what if it defaults to something else? We can set it ourselves upon first connecting, and then if the program changes it after that point, it's on them to deal with the consequences, as DBD::Pg will still assume it is UTF-8 (we don't constantly recheck the parameter).

Someone also raised the issue of marking ASCII-only strings as utf8. While technically this is not correct, it would be nice to avoid having to parse every single byte that comes out of the database to look for high bits. Hopefully, programs requesting data from a UTF-8 database will not be surprised when things come back marked as utf8.
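A quick demonstration of why marking pure ASCII as utf8 is (mostly) harmless, since ASCII is valid UTF-8 by construction:

```perl
use strict;
use warnings;

my $ascii   = "plain ascii";
my $flagged = $ascii;
utf8::upgrade($flagged);   # set the utf8 flag; the ASCII bytes are unchanged

print utf8::is_utf8($flagged) ? "flagged\n" : "not flagged\n";
print $ascii eq $flagged ? "equal\n" : "different\n";   # still equal
```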

Feel free to comment here or on the bug that started it all. Thanks also to David Christensen, who has given me great input on this topic.


Theory said...

Is there any reason it couldn't support other client_encodings? If I set it to latin-1, then DBD::Pg could just decode it to utf8, right?

I know, if it's going to be decoded to Perl's internal utf8 format, one might as well set client_encoding to UTF-8. So maybe it's not worth it to support other encodings?
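(For reference, decoding latin-1 bytes to Perl character data at the application level is a one-liner with the core Encode module, which is part of why pushing the conversion into Postgres via client_encoding is attractive:)

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1 = "caf\xe9";                      # latin-1 octets, e.g. from the database
my $chars  = decode('iso-8859-1', $latin1);  # now proper character data
print length($chars), "\n";                  # 4
```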

David Christensen said...


As I see it, there's really no reason to do anything more than set the client_encoding to 'utf-8' in the PQconnect() call; since Postgres will support converting any server_encoding to UTF-8, this is an easy way to avoid needing to maintain some sort of mapping between Postgres' encoding names and Perl's naming of them. Anything other than SQL_ASCII (aka byte soup) can be sensibly converted with minimal changes to DBD::Pg.

My personal concern is that applications would be unprepared to deal with data that has to this point been returned in the raw. I think that naïve applications will work fine, but those that implement application-level workarounds to convert to Perl's internal format would be affected the most by the change in behavior. I also think this is too useful of a change *not* to be included and enabled by default, so perhaps a major version bump of DBD::Pg would help indicate that something fairly substantial is changing in the interface.



Jon Jensen said...

Greg, you said someone "raised the issue of marking ASCII-only strings as utf8".

What is the issue there? If the client_encoding is set to UTF-8, then what comes out of the database should be marked as UTF-8 in Perl, even if it happens to be using only the ASCII subset of UTF-8.

Or am I missing something there?

Greg Sabino Mullane said...

An issue with setting client_encoding ourselves is: when do we do it? And what if the client does not want us to (a reasonable request)? If we set it on startup, and then the client requests pg_enable_utf8 = 0, we should really revert to the default client_encoding (which we'd have to look up and then apply). We could connect, check what it is, and then change it to UTF-8 if needed, after storing the old value, but that would be a separate transaction/command on every startup. Perhaps neither is a big deal.

But I like the idea of "forcing" UTF-8 and then marking everything as utf8. If someone really wants a separate non-ASCII and non-UTF-8 encoding returned, they can set pg_enable_utf8, we revert the client_encoding, and return the raw bytes to them so they can do what they want with them. Of course, we'll also have to check the server_encoding and simply do nothing if it's SQL_ASCII: no client_encoding setting, and no utf8 string setting.

David C, interesting point about the version number bump; please remind me of that when we get ready to release whatever we finally come up with. :)

Greg Sabino Mullane said...

Jon, I agree, it's almost a silly concern, but it was raised on the bug. Personally, I think scanning for high bits is pointless: it's still UTF-8 (and utf8), even if it's only using a small subset of the available characters (e.g. ASCII).

David Christensen said...


"If we set it on startup, and then the client requests pg_enable_utf8 = 0, we should really revert to the default client_encoding (which we'd have to look up and then apply)."

So is the issue here that pg_enable_utf8 is currently a handle-level attribute, which can be set at any time, not just as a connection parameter? In my personal use, I've always set it up with the ->connect() options hashref, not changed it dynamically.

Perhaps the answer here is to allow the specification of the client_encoding as a connection option/handle attribute, which when set would then issue 'SET client_encoding' (when used on a connected $dbh) or pass the appropriate "options=..." argument to PQconnect() if specified in the ->connect call. However, I'm not sure why you would generally want a specific client_encoding for output. It seems to me that the only reason to care about the database encoding would be that *you* (think you) know what the encoding is but the database doesn't (aka SQL_ASCII). If you care otherwise, just ask for raw output and handle it yourself at the app level.

I'm fairly confident that we could get things working relatively seamlessly with the data coming back *from* the database; however, have we given consideration to what happens on input? If client_encoding is UTF-8, we'd just need to check the utf8 flag on the way in and/or set :utf8 on the filehandle's binmode (assuming, of course, that the app is going to be sending us Perl internal character data). However, it seems that there would be a window for raw hi-bit data (particularly in apps which haven't given particular consideration to character encoding) to sneak in on input, and users would presumably get unexpected "invalid encoding for UTF8" errors when they were not using data explicitly.

(Maybe the answer here is to call utf8::upgrade on any inputted string, or to append it to an empty string with the is_utf8 flag set.) The first option would taint the data from the caller's perspective, so it seems like it's out as an option; the second would incur (I believe) a copy/possible re-encode for the concatenation, which obviously hurts performance. Maybe we'd have to resort to saving the utf8 flag and restoring it after processing a string; I dunno, not a lot of great options here wrt backwards compatibility that don't tank performance. I'll mull it over for a bit more.
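To make the "tainting" concern above concrete: utf8::upgrade changes only the internal representation and the flag on the caller's scalar, never the string's value (a small self-contained demo; 0xE9 is latin-1 é):

```perl
use strict;
use warnings;

my $input = "caf\xe9";   # hi-bit byte, utf8 flag off: latin-1 "café"
utf8::upgrade($input);   # re-encode internals as UTF-8 and set the flag

print length($input), "\n";                               # still 4 characters
print $input eq "caf\xe9" ? "same value\n" : "changed\n"; # value preserved
print utf8::is_utf8($input) ? "flag now on\n" : "flag off\n";
```

The value survives intact, but the caller's variable now carries the utf8 flag, which is exactly the side effect being worried about here.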


David Christensen said...


"What is the issue there? If the client_encoding is set to UTF-8, then what comes out of the database should be marked as UTF-8 in Perl, even if it happens to be using only the ASCII subset of UTF-8.

Or am I missing something there?"

This is one of the more confusing parts of Perl's Unicode handling, IMHO; the UTF-8 flag should have been named something else entirely, as it really just indicates that Perl will see the data as *internally* encoded as UTF-8, specifically to do with the handling of hi-bit characters in the data. The flag does not actually indicate that the data itself is valid UTF-8, as it can be set independently of the data (not recommended unless you know what you're doing, as you can flag an SV with arbitrary data as utf8, which does not automatically convert the octets to a valid UTF-8 representation). String concatenation takes the UTF-8 flag into consideration: if both strings have the same state for the UTF-8 flag, it's essentially a copy of the underlying data, but if one of the strings has the flag set and the other does not, the concatenation has to first upgrade the non-utf8-marked string to actual UTF-8, then do the concatenation, with the result tagged with the UTF-8 flag.
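The concatenation behavior described above can be seen directly (a small demo; 0xE9 is é):

```perl
use strict;
use warnings;

my $flagged = "caf\xe9";
utf8::upgrade($flagged);   # utf8 flag on
my $bytes = " au lait";    # pure ASCII, flag off

# Mixed-flag concatenation: Perl first upgrades the unflagged operand,
# and the result carries the utf8 flag.
my $result = $flagged . $bytes;
print utf8::is_utf8($result) ? "utf8\n" : "bytes\n";
print length($result), "\n";   # 12 characters
```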

I suspect for most use cases it would be fine to have the flag set on pure ASCII (encoding-wise, there's no issue); however, some modules change their behavior depending on the state of the flag, so there could be different code paths taken that wouldn't strictly be needed when dealing with ASCII-only (or legacy 8-bit) data, or modules that refuse to process data with the utf8 flag set (ISTR some issues in the past with Digest::SHA1 as an example; since the algorithm is defined only on bytes and not characters, it's an error to pass wide-character data in). It may be that these modules have been updated to not care about the state of the flag, or to ignore it and only throw an error when encountering a character with code point > 0xFF, so it may not be an actual issue, but at least the potential exists to cause some unexpected behavior.
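The digest issue is easy to reproduce: digest algorithms are defined on octets, so wide-character strings must be encoded to bytes first. (A sketch using the core Digest::SHA as a stand-in for Digest::SHA1:)

```perl
use strict;
use warnings;
use Encode qw(encode);
use Digest::SHA qw(sha1_hex);

my $wide = "snowman \x{2603}";   # contains a code point > 0xFF

# Passing wide characters straight in croaks ("Wide character ...").
my $ok = eval { sha1_hex($wide); 1 };
print $ok ? "no error\n" : "croaked\n";

# Encode to UTF-8 octets first, then digest the bytes.
my $hex = sha1_hex(encode('UTF-8', $wide));
print length($hex), "\n";        # 40 hex digits
```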


Jon Jensen said...

David, to me the argument that some Perl modules (still?) can't handle UTF-8 data is all the more reason why the flag should be consistently set.

If you don't set the UTF-8 flag when only ASCII-subset data is present, you're likely to have code that works most of the time, but when the occasion arises that the database returns some more-than-ASCII string, the code will fail.

Better to fail early and recognize that the module you're depending on isn't suitable, or else switch and request always ASCII encoding, isn't it?

Darren Duncan said...

I'm with those that believe it is best to always set the UTF-8 flag to true when UTF-8 data is requested, regardless of whether only the subset ASCII repertoire is used.

Darren Duncan said...

As an addendum, and I could suggest this to p5p, if the issue with marking ASCII as UTF-8 is about performance (optimized single-byte code paths would be skipped), one possibility that might work is adding another flag for Perl strings that says it is known that the ASCII subset is in use. This extra flag would be false by default, but if some operation in Perl decides to go through the work to check that no high bits are set, it can set the flag to true ... or it could be 3-valued, for known-high-set, known-high-not-set, not-known. Then code deciding to use an optimized path later can just look at the flag to help it decide what to do. Presumably such a change may not be binary compatible so would only come in a major release, or it might not be that useful. But in principle for a type system, I think it would be useful for implementations to be able to mark a value as being known to be of a particular subset of its otherwise declared type, which could help optimization greatly.

Greg Sabino Mullane said...

David: Looks like the canonical way is to transform the data before hitting Digest::SHA1; see the recipe here:

Greg Sabino Mullane said...

Darren: interesting idea, but I doubt it would go over well for Perl 5. I wonder if Perl 6 handles UTF-8 the same way as Perl 5?