During development of Python web applications, there are a lot of tools that
can help provide clues about what caused a bug or exception. For
example, Django’s default
debug error page prints out local variables at
every stack frame when there’s an unhandled exception. The popular
django_debugtoolbar adds more information. In a similar vein there are things
like Pyramid’s pyramid_debugtoolbar and its predecessor WebError.
But when troubleshooting production issues, those tools aren’t available, for
good reason: we need the useful information exposed to developers, without
overwhelming our end users or exposing sensitive information to malicious eyes.
So we turn to logging instead. But that provides a lot less information out of
the box, and at bitly we often found ourselves in an irritating cycle that
went something like this:
Sometime soon after a deploy, we notice an unusual number of exceptions being logged. (Let’s say that old Python chestnut, 'NoneType' object is unsubscriptable.) So we know that we have some code that is expecting a dict or list and is getting None instead.
We may decide the error is not critical enough to roll back without first trying to diagnose the underlying cause, but of course we want to fix it quickly. The pressure is on.
We find the traceback in our logs. It looks like:
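The traceback itself isn’t reproduced here; a hypothetical one consistent with the discussion (the file, line number, and code are invented for illustration) might look like:

```
Traceback (most recent call last):
  File "./app/handlers.py", line 123, in process_order
    total = response['data']['subtotal'] + sum(entry['price'] for entry in entries)
TypeError: 'NoneType' object is unsubscriptable
```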
But where’s the None? Is it response['data'] or is it one of the entry values? We have no way to see. (Tip: we can actually tell that it’s not response['data']['subtotal']. Do you know why? Would you remember that while under pressure to fix a production bug?)
We try, and fail, to provoke the error on our local development or staging environments.
Then we add some logging code.
We get a quick code review and deploy the logging code.
We wait for the error to happen again.
We find the log message and realize we forgot to log everything we wanted to see, or we need more contextual information. Go back 3 steps and repeat as needed.
Finally we can see what the bad value was, and we have enough clues to understand the cause of the error and fix it.
This is a frustrating amount of overhead when you’re trying to diagnose a
production issue as quickly as possible, even with multiple people helping out.
Fortunately, if you make liberal use of logging.exception() – for example,
if your web framework calls it on any unhandled exception, and if you take care
to call it on unknown exceptions that you deliberately catch – then it’s
actually easy to change your logging configuration so that a lot more useful
information gets logged. And you can do this globally, without sprinkling
logging changes throughout your code.
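As a concrete example of that precondition, a deliberate catch of unknown exceptions might funnel through logging.exception() like this (the function and exception names are illustrative, not from the post):

```python
import logging

log = logging.getLogger(__name__)

class KnownClientError(Exception):
    """Errors we expect and handle explicitly (hypothetical)."""

def process(request):
    # Hypothetical application logic that can blow up unexpectedly.
    return request['data']['subtotal']

def handle_request(request):
    try:
        return process(request)
    except KnownClientError:
        raise  # let the framework map this to a client-facing error
    except Exception:
        # Unknown failure: logging.exception() records the message at ERROR
        # level *with* the traceback, which is what the formatter trick below
        # in this post hooks into.
        log.exception('Unhandled error processing request')
        return None
```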
Here’s one possible way to hack your logging config to do this:
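The original snippet isn’t reproduced here, but the approach can be sketched as a logging.Formatter subclass that overrides formatException(). The class name and the MAX_LINE_LENGTH constant are illustrative inventions; MAX_VARS_LINES is the cap the paragraph below refers to:

```python
import logging

# Both limits are tuning knobs; these particular numbers are assumptions.
MAX_LINE_LENGTH = 100  # truncate any single "name = value" line beyond this
MAX_VARS_LINES = 30    # dump at most this many local variables

class VerboseExceptionFormatter(logging.Formatter):
    """Append the innermost frame's local variables to formatted tracebacks."""

    def formatException(self, exc_info):
        # Start from the standard traceback text.
        text = super().formatException(exc_info)
        tb = exc_info[2]
        if tb is None:
            return text
        # Walk down to the innermost frame, where the exception was raised.
        while tb.tb_next is not None:
            tb = tb.tb_next
        lines = [text, 'Locals at innermost frame:']
        local_vars = sorted(tb.tb_frame.f_locals.items())[:MAX_VARS_LINES]
        for name, value in local_vars:
            line = '  {} = {!r}'.format(name, value)
            if len(line) > MAX_LINE_LENGTH:
                line = line[:MAX_LINE_LENGTH - 3] + '...'
            lines.append(line)
        return '\n'.join(lines)
```

Once this formatter is attached to a handler, every logging.exception() call that passes through it picks up the extra detail for free.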
Then, in whatever code sets up your logging configuration, you can add this to
the relevant handler like so:
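The wiring code isn’t reproduced here either; assuming a custom formatter class (a hypothetical VerboseExceptionFormatter, importable from wherever your project keeps it), the config fragment might look like:

```python
import logging

# Hypothetical import path; adjust to wherever your custom formatter lives.
from myapp.log_utils import VerboseExceptionFormatter

handler = logging.StreamHandler()
handler.setFormatter(VerboseExceptionFormatter(
    fmt='%(asctime)s %(levelname)s %(name)s %(message)s'))
logging.getLogger().addHandler(handler)
```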
Values such as MAX_VARS_LINES are used to limit the amount
of data we dump out. And so far we’re only logging the local variables from
the innermost frame, on the assumption that those are the most likely to be useful.
Having deployed this logging configuration, our hypothetical traceback would
look more like this:
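With invented code and values for illustration, the enhanced entry might read:

```
Traceback (most recent call last):
  File "./app/handlers.py", line 123, in process_order
    total = response['data']['subtotal'] + sum(entry['price'] for entry in entries)
TypeError: 'NoneType' object is unsubscriptable
Locals at innermost frame:
  entries = []
  response = {'status': 'error', 'error': 'upstream service unavailable', 'data': None}
```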
Aha. Now we can clearly see that some upstream service failed, and our code is
failing to handle that. And we can take action immediately.
This technique is not a magic crystal ball, certainly; sometimes the important
clues are elsewhere in the stack, or they’re in the current global scope. If
we wanted to, it would be trivial to log locals and globals at every frame,
just by grabbing the values of tb.tb_frame.f_locals and tb.tb_frame.f_globals
inside the while tb.tb_next loop. But that could get very verbose, and some
of our services get hit very hard and log a lot, so we haven’t gone that far.
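That every-frame walk can be sketched in a few lines (the helper name is ours, not from the post):

```python
def iter_frame_locals(tb):
    """Yield (function name, locals dict) for every traceback frame, outermost first."""
    # The same `while tb.tb_next`-style walk described above, but keeping
    # every frame's f_locals (f_globals could be captured the same way).
    while tb is not None:
        yield tb.tb_frame.f_code.co_name, dict(tb.tb_frame.f_locals)
        tb = tb.tb_next
```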
We have already used these enhanced log entries to quickly diagnose a couple of
bugs in production, so we’re happy with this little tweak. Hopefully you will
find it useful too.
If you’d like to get your hands directly on our code, we’re hiring.