Getting more clues from Python's logging module

During development of Python web applications, there are a lot of tools that can help provide clues about what caused a bug or exception. For example, Django’s default debug error page prints out local variables at every stack frame when there’s an unhandled exception. The popular django_debugtoolbar adds more information. In a similar vein there […]

During development of Python web applications, there are a lot of tools that
can help provide clues about what caused a bug or exception. For
example, Django’s default debug error page prints out local variables at
every stack frame when there’s an unhandled exception. The popular
django_debugtoolbar adds more information. In a similar vein there are things
like Pyramid’s pyramid_debugtoolbar and its predecessor WebError.

But when troubleshooting production issues, those tools aren’t available, for
good reason: We need the useful information to be exposed to developers, not
overwhelming our end users nor exposing sensitive information to malicious eyes.

So, we turn to logging instead. But that provides a lot less information out
of the box. So at bitly, we often found ourselves in an irritating cycle that
went something like this:

  • Sometime soon after a deploy, we notice an unusual number of exceptions being logged. (Let’s say that old Python chestnut, 'NoneType' object is unsubscriptable.) So we know that we have some code that is expecting a dict or list and getting None instead.

  • We may decide the error is not critical enough to roll back without first trying to diagnose the underlying cause, but of course we want to fix it quickly. The pressure is on.

  • We find the traceback in our logs. It looks like:

Python traceback example

  • But where’s the None? Is it response['data'] or is it one of the entry values? We have no way to see. (Tip: We can actually tell that it’s not response['data']['subtotal']. Do you know why? Would you remember that while under pressure to fix a production bug?)

  • We try, and fail, to provoke the error on our local development or staging environments.

  • Then we add some logging code.

  • We get a quick code review and deploy the logging code.

  • We wait for the error to happen again.

  • We find the log message and realize we forgot to log everything we wanted to see, or we need more contextual information. Go back 3 steps and repeat as needed.

  • Finally we can see what the bad value was and we finally have enough clues that we can understand the cause of the error and fix it.

This is a frustrating amount of overhead when you’re trying to diagnose a
production issue as quickly as possible, even with multiple people helping out.

Fortunately, if you make liberal use of logging.exception() – for example,
if your web framework calls it on any unhandled exception, and if you take care
to call it on unknown exceptions that you deliberately catch – then it’s
actually easy to change our logging configuration such that we get a lot more
useful information logged. And we can do this globally without sprinkling
logging changes throughout our code.

Here’s one possible way to hack your logging config to do this:

Python log handler example

Then, in whatever code sets up your logging configuration, you can add this to
the relevant handler like so:

handler.setFormatter(VerboseExceptionFormatter(log_locals_on_exception=True))

The MAX_LINE_LENGTH and MAX_VARS_LINES values are used to limit the amount
of data we dump out. And so far we’re only logging the local variables from
the innermost frame, on the assumption that those are the most likely to be
useful.

Having deployed this logging configuration, our hypothetical traceback would
look more like this:

Python traceback
example, with locals

Aha. Now we can clearly see that some upstream service failed, and our code is
failing to handle that. And we can take action immediately.

This technique is not a magic crystal ball, certainly; sometimes the important
clues are elsewhere in the stack, or they’re in the current global scope. If
we wanted to, it would be trivial to log locals and globals at every frame,
just by grabbing the values of tb.tb_frame.f_locals and tb.tb_frame.f_globals
during the while tb.tb_next loop. But that could get very verbose, and some
of our services get hit very hard and log a lot, so we haven’t gone that far
yet.

We have already used these enhanced log entries to quickly diagnose a couple of
bugs in production, so we’re happy with this little tweak. Hopefully you will
find it useful too.

If you’d like to get your hands directly on our code, we’re hiring.

Source: Bitly