Welcome to End Point’s blog


Tail_n_mail and the log_line_prefix curse

One of the problems I had when writing tail_n_mail (a program that parses log files and mails interesting lines to you) was getting the program to understand the format of the Postgres log files. There are quite a few options inside of postgresql.conf that control where the logging goes and what it looks like. The three basic options are to send it to a rotating logfile with a custom prefix at the start of each line, to use syslog, or to write it in CSV format. I'll save a discussion of all the logging parameters for another time, but the important one for this story is log_line_prefix. This is what gets prepended to each log line when using 'stderr' mode (i.e. regular log files, not syslog or csvlog). By default, log_line_prefix is an empty string. This is a very useless default.

What you can put in the log_line_prefix parameter is a string of sprintf style escapes, which Postgres will expand for you as it writes the log. There are a large number of escapes, but only a few are commonly used or useful. Here's a log_line_prefix I commonly use:

log_line_prefix = '%t [%p] %u@%d '

This tells Postgres to print out the timestamp, the PID (process ID) inside square brackets, the current username and database name, and finally a single space to help visually separate the prefix from the rest of the line. The above will generate lines that look like this:

2010-08-06 09:24:57.714 EDT [7229] joy@joymail LOG: execute dbdpg_p7228_5: SELECT count(id) FROM joymail WHERE folder = $1
2010-08-06 09:24:57.714 EDT [7229] joy@joymail DETAIL:  parameters: $1 = '4'

As you might imagine, the customizability of log_line_prefix makes parsing the log files all but impossible without some prior knowledge. I didn't want to go the pgfouine route and make people change their log_line_prefix to a specific setting. I think it's kind of rude to force your database to change its logging to accommodate your tools :). The original quick solution I came up with was a set of predefined regular expressions, from which the user would pick the one that most closely matched their logs. For tail_n_mail to work properly, it needs to pick up at least the PID so it can tell when one statement ends and a new one begins. For example, if you chose "regex #1", the log parsing regex would look like this:

(\d\d\d\d\-\d\d\-\d\d \d\d:\d\d:\d\d).+?(\d+)

This works fine on the example above, and gets us the timestamp and the PID from each line. The stock regexes worked for many of the different log_line_prefixes our clients were using, but I was never very happy with this solution. Not only was it susceptible to failing completely when a client used a log_line_prefix that didn't fit any of the current regexes, but there was no way to know exactly where the prefix ended and the statement began, which is important for the formatting of the output and the canonicalization of similar queries.
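To make the stock-regex approach concrete, here is a quick sketch in Python (tail_n_mail itself is written in Perl, so this is purely illustrative) applying "regex #1" to a %t-style line, i.e. a timestamp without fractional seconds:

```python
import re

# "Regex #1": capture a timestamp, then the first run of digits after it,
# which is assumed to be the PID.
stock_re = re.compile(r'(\d\d\d\d\-\d\d\-\d\d \d\d:\d\d:\d\d).+?(\d+)')

# A %t-style log line (no fractional seconds), similar to the samples above.
line = ('2010-08-06 09:24:57 EDT [7229] joy@joymail LOG:  '
        'execute dbdpg_p7228_5: SELECT count(id) FROM joymail WHERE folder = $1')

m = stock_re.match(line)
print(m.group(1))  # 2010-08-06 09:24:57
print(m.group(2))  # 7229
```

Note how fragile this is: against a %m-style timestamp such as "2010-08-06 09:24:57.714 EDT", the non-greedy .+? stops at the fractional seconds and the second group captures "714" instead of the PID.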

Enter the current solution: building a regex on the fly. Since we don't have a connection to the database at all, merely to the log files, this requires that the user enter their current log_line_prefix. This is a simple entry in the tailnmailrc file that looks just like the entry in postgresql.conf, e.g.:

log_line_prefix = '%t [%p] %u@%d '

The tail_n_mail script uses that variable to build a custom regex specifically tailored to that log_line_prefix, and thus to the Postgres logs being used. Not only can we grab whatever bits we want (currently we only care about the timestamp, %t or %m, and the PID, %p), but we can now cleanly break each line of the log into the prefix and the actual statement. This means the canonicalization/flattening of the queries is more effective, and it allows us to output the prefix information only once. The output of tail_n_mail looks something like this:

Date: Fri Aug  6 11:01:03 2010 UTC
Unique items: 7
Total matches: 85
Matches from [A] /var/log/pg_log/postgresql-2010-08-05.log: 61
Matches from [B] /var/log/pg_log/postgresql-2010-08-06.log: 24

[1] From files A to B (between lines 14,205 of A and 527 of B, occurs 64 times)
First: [A] 2010-08-05 16:52:11 UTC [1602]  postgres@mydb
Last:  [B] 2010-08-06 01:18:14 UTC [20981] postgres@mydb
ERROR: syntax error at or near ")" 
STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (?,?,?))
ERROR: syntax error at or near ")"
STATEMENT: INSERT INTO mytable (id, foo, bar) VALUES (123,'chocolate','donut'));

[2] From file A (line 12,172)
2010-08-05 12:27:48 UTC [2906] bob@otherdb
ERROR: invalid input syntax for type date: "May" 
STATEMENT: UPDATE personnel SET birthdate='May' WHERE id = 1234;

(plus five other entries)

For the first entry in the above example, we are able to show the complete prefix of the log lines where the error first occurred and where it most recently occurred. The next two lines show the "flattened" version of the query that tail_n_mail uses to group together similar errors. We then show a non-flattened example of an actual query from that group. In this case, someone added an extra closing paren in their application somewhere, which gives the same error each time, although the exact output changes depending on the values used. In the second example, because there is only one match, we don't bother to show the flattened version at all.
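The on-the-fly regex construction described above can be sketched roughly as follows. This is Python for illustration only (tail_n_mail is written in Perl), and the escape-to-pattern table is a guess covering just the escapes mentioned in this post:

```python
import re

# Hypothetical mapping from log_line_prefix escapes to regex fragments.
# Only the escapes discussed above are covered; anything else falls back
# to a generic non-whitespace match.
ESCAPES = {
    '%t': r'(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d \w\w\w\w?)',       # timestamp
    '%m': r'(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+ \w\w\w\w?)',  # timestamp + ms
    '%p': r'(\d+)',       # PID
    '%u': r'[\w\-]*',     # user name
    '%d': r'[\w\-]*',     # database name
}

def build_prefix_regex(log_line_prefix):
    """Turn a log_line_prefix setting into a compiled regex whose first
    group is the entire prefix and whose last group is the statement."""
    pattern = ''
    i = 0
    while i < len(log_line_prefix):
        if log_line_prefix[i] == '%' and i + 1 < len(log_line_prefix):
            pattern += ESCAPES.get(log_line_prefix[i:i + 2], r'\S+')
            i += 2
        else:
            pattern += re.escape(log_line_prefix[i])
            i += 1
    return re.compile('(' + pattern + ')(.+)')

prefix_re = build_prefix_regex('%t [%p] %u@%d ')
line = '2010-08-06 09:24:57 EDT [7229] joy@joymail LOG:  statement: SELECT 1'
m = prefix_re.match(line)
print(m.group(2))  # 2010-08-06 09:24:57 EDT  (timestamp)
print(m.group(3))  # 7229                     (PID)
print(m.group(4))  # LOG:  statement: SELECT 1
```

Splitting into m.group(1) (the whole prefix) and the final group (everything after it) is what lets the prefix be reported once while the statement alone is flattened and grouped.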

So in theory tail_n_mail should now be able to handle any Postgres log you care to throw at it (yes, it can read syslog and csvlog format as well). As my coworker pointed out, parsing log files in this way is something that should probably be abstracted into a common module so that other tools like pgsi can take advantage of it as well.


Bruce Momjian said...

Great idea to create a regex based on the log_line_prefix. I was wondering if you should just query the server and get that setting yourself, rather than require the user to supply it.

Also, why did you use the PID and not the session ID? The session ID will be unique even after PID wraparound.

Johan said...


Seconded that it's a great idea. Simplifies things a lot. However, I'm still unable to get tnm to trigger on anything:

Our log_line_prefix looks like this: log_line_prefix = '%t %h %u %d '

Producing log lines such as this (slightly anonymized):
2010-08-06 21:50:05 CEST user DatabaseName ERROR: record "new" has no field "stuff"

We don't use rotating logs, but instead one for each day (log file naming is set up like this: log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log') but that doesn't seem to be the problem since even if I specify a file (with known ERRORs) nothing triggers.

The tnm config file is the default file output with "tail tail_n_mail > tnm.config.txt", with only the FILE part modified (as such: FILE: /var/lib/postgresql/data/pg_log/postgresql-2010-08-07_000000.log).

Using --debug doesn't yield anything useful, beyond "nothing was found". I am using the --dryrun option though, so maybe that has something to do with it. Any ideas?

Greg Sabino Mullane said...

Bruce: This is a log parser program and not designed to query the database (indeed, some of our clients ship logs to another box so contacting the database is impossible). It would also add a lot of overhead and complexity just so we could grab the log_line_prefix. If it were something that changed often, I'd consider adding such a feature, but most log_line_prefixes are "set it and forget it".

I prefer the PID because it is far more useful for day-to-day DBA tasks; it allows for lookups via pg_stat_activity and the ps command. I'm not worried about PID wraparound: TNM only saves the PIDs internally long enough to parse the query, so it should not be a problem.

Greg Sabino Mullane said...

Johan: try running with a --debug flag and see if the regex it is looking for matches your lines. You might also want to edit the source and dump out the regex generated by log_line_prefix. Try adding "die $pgpidre" after line 215 or so.

Johan said...

Hi again. Thank you for taking the time to help me.

The inclusion regex looks like this: Inclusion: (?-xism:ERROR: )|(?-xism:FATAL: )|(?-xism:PANIC: )

I've tried plugging one of the regexes into an egrep, like this:
less logfile.log|egrep (?-xism:ERROR: ) but that didn't work at all. I'm however not very good at grep-ing, so it might as well just be my fault. (greping for just "ERROR: " works, though)

The $pgpidre after line 215 looks like this: (?-xism:((\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d \w\w\w) \S+ \w* \w* )(.+))
Is that what you expected, or should I dump somewhere else?

Also, let me know if you want to do this over email or elsewhere.

Greg Sabino Mullane said...

Johan: I see the problem now: the regex does not account for four-character timezone strings. You can fix it by adding \\w? to the end of the string of \\w's at line 206. I'll also roll out a new version shortly with the fix.
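The fix can be illustrated like this (Python for demonstration; the actual change is to the Perl regex inside tail_n_mail): an optional fourth \w lets four-letter timezone abbreviations such as CEST match.

```python
import re

# Timestamp with a three-letter timezone only: "EDT" matches, "CEST" does not.
old_tz = re.compile(r'\d{4}-\d\d-\d\d \d\d:\d\d:\d\d \w\w\w ')

# Adding an optional fourth \w accepts four-letter zones as well.
new_tz = re.compile(r'\d{4}-\d\d-\d\d \d\d:\d\d:\d\d \w\w\w\w? ')

line = '2010-08-06 21:50:05 CEST user DatabaseName ERROR: record "new" has no field "stuff"'
print(bool(old_tz.match(line)))  # False
print(bool(new_tz.match(line)))  # True
```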

Yes, there is also a mailing list: see

Johan said...

I tried adding \\w? (backslash, backslash, w, question mark - in case the reply gets borked) which didn't quite work. The script started puking error messages like this one: Use of uninitialized value in substitution iterator at line 902, <$fh> line 365.

But with increasing (consecutive) line numbers.

Figuring that the regex might have been disfigured I also tried with a plus, an asterisk and nothing at all (just \\w) but all with the same result.

Greg Sabino Mullane said...

Johan: What version is that? Please try the latest, 1.16.3 and let me know what line, if any, it warns on.

Johan said...

Okay, now I've tried with 1.16.3 (had to download via Github since the regular download link says permission denied), and the error message(s) start with:
"Use of uninitialized value in substitution iterator at line 917, <$fh> line 2."
And continue rolling up for each line (3 and up).

BTW, I've found my way to the IRC channel, so if you want to debug in real time I'm there (on Swedish time, though). Nick is JohanW.

Abbas said...

Since it can be used for other logs: how can I use it for logs which don't have timestamps in them, and which have nothing to do with log_line_prefix? I have already tried with --pgmode=0 and the latest version, 1.26.1.