Thoughts on Logging - Part 1 - Structured Logging
December 06, 2012 at 09:55 PM | categories: Mozilla, logging
For reasons I can't explain, I have a keen interest in computer logging systems. Perhaps it stems from my frustration that such a seemingly trivial task can often be quite complicated. Perhaps it's from the scaling challenges inherent in inevitably needing to analyze terabytes of logged data. Perhaps it's from my experience designing highly-available distributed server applications and needing to use logging for real-time monitoring and decision making. Whatever the root causes, I've always wanted to create a brain dump of my thoughts on logging. So, here we go!
Introduction
Logging. Practically every computer program does it in some form or another. The standard libraries for many programming languages provide a logging facility. But, have you ever stopped to think about the side effects of logging or why you log the way you do?
In this first post, I will briefly define what I consider logging and will talk about something called structured logging.
The Definition of Logging
For the scope of these posts, I will define logging to be:
An emitted stream of distinct messages from a running application
A message can be pretty much anything. Messages are often strings. But, they could be binary data. The transport mechanism is undefined, although most logging systems typically write to a file (eventually).
I should also call out the difference between reading/interpreting logs and transporting them. When I say reading a log, I'm referring to a process where an entity (either human or machine) reads the individual message content and infers something from it. By contrast, a transporter treats messages as a black box and just shuffles bits around. Transporters also typically send or store logging messages somewhere.
It's worth explicitly noting that this definition encompasses both traditional print()-based logging (writing a line to a file or output) as well as things like transaction logs or recording events as new rows in a database.
The Multiple Use Cases of Logging
Before I get to some of the interesting aspects of logging, let's identify some of the use cases of logging (in no particular order):
- Recording application errors (e.g. exceptions and stack traces)
- Low-level information to aid in debugging or human observation
- Application monitoring (includes metrics and alerting)
- Business analytics (use the logged data to help make decisions)
- Security auditing
Now that we have the basics out of the way, let's move on to the crux of this post.
Logging for Humans or Machines
I would argue that every log message is intended to be interpreted by either a human or a machine.
If a log message is only going to be used to help a human debug a system or casually see what's going on, it's intended for humans. If a log message is going to be used for machine analysis later (e.g. monitoring, alerting, business analytics, security auditing, etc), it's intended for machines.
This is an important distinction. If it isn't clear, please re-read the prior paragraphs. Now might also be a good time for you to audit the logs in your life and assign human or machine labels to them.
The Problem
The problem with logging as most of us know it is that it is optimized for human consumption, even if machines are the intended consumer.
At the heart of most logging systems we log a string, a char *, an array of characters that can easily be read by humans. For example, a simple web application might produce log files containing lines like:
2012-11-24T17:32:23.3435 INFO gps@mozilla.com successfully logged in
Human readability is a great feature to have. However, it comes at a price: more difficult machine consumption.
By optimizing for human consumption, we've introduced a new problem: log parsing.
In the above example log message, what the program has done is combine a few distinct fields of data (an event, "logged in"; the username; and the time) into a single text string. This is great for humans glancing at a log file. But that decision forces downstream machine consumers to parse/decode that text back into its original, distinct fields.
I call this style of logging destructured or unstructured logging. We start with distinct fields of data and then destructure them into fully free-form or semi-formed text. By doing so, we lose part of the original data structure and create the need to reconstruct it (via parsing).
This seems silly to me.
Why are we formatting distinct fields of data into unified strings only to have them parsed into their original components again? If lossless and simple round-trip encoding/decoding is your goal, there are much better ways than freely-formed text! Going to text is easy: it's commonly done with printf() or string concatenation. Getting back to the original representation is the real problem.
Every persisted log message destined for machines will need to be decoded to some degree to facilitate analysis. However, by using destructured text, nearly every program and logging system forces this decoding problem onto its users by using an encoding that is difficult to decode.
Every time you are analyzing a log file containing destructured text, you need to solve the parsing/decoding problem. Every time you write an awk, grep, sed, or perl script or use regular expressions to scrape logs, you are solving this problem. How much time do you spend decoding/parsing logs versus actually doing something useful with the data? From my personal experience, a lot. And, parsing destructured text is a very fragile solution. What happens when the string changes, even by just a little? Does that throw off your decoder/parser?
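To make the fragility concrete, here is a hypothetical regex-based parser for the login line above. parseLoginLine and the exact pattern are my own illustration, not taken from any particular tool:

// A regular expression that hard-codes the timestamp layout, the log
// level, and the exact message wording. None of these can safely be
// assumed to stay fixed.
const LOGIN_RE = /^(\S+) INFO (\S+) successfully logged in$/;

function parseLoginLine(line) {
  let match = LOGIN_RE.exec(line);
  if (!match) {
    return null; // Not a login message... or the format changed.
  }
  return {time: match[1], who: match[2]};
}

// Works today:
parseLoginLine("2012-11-24T17:32:23.3435 INFO gps@mozilla.com successfully logged in");
// Silently returns null the day someone rewords the message to
// "logged in successfully" or inserts a new field in the middle.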
Decoding log messages is a hard problem. Furthermore, it's a problem I'd rather not have.
Structured Logging to the Rescue
One of my early professional mentors taught me that one of the best ways to solve a problem is to change the requirements so that the problem no longer exists or becomes easier to solve. (If that didn't immediately click with you, you might want to read it again because I feel it's an invaluable skill and one that many engineers don't employ). Applying that lesson here, how do we eliminate the problem of parsing destructured textual data back into its original fields? Simple: we don't emit destructured text! Instead, we emit data that is easily machine readable, where each individual field is preserved. This is called structured logging.
The primary benefit of structured logging is that you eliminate the parsing problem (or at least you change it to be a solved problem by using a common or well-defined encoding format). No tedious searching for data within free-form text and extracting it. Instead, the fields from the event-producing application are seamlessly preserved and loaded into downstream machine consumers.
A Closer Look
What does structured logging look like? Well, it depends. There is no standard for structured logging. When I first started writing this post in June 2011, a Google search for structured logging didn't yield much. Today, things aren't that much better, although there has been significant activity in the last 15 months. rsyslog has launched project Lumberjack, which has structured logging as a pillar. There is even someone sitting 10 ft from me at Mozilla's SF office doing structured logging for server applications (metlog).
All structured logging systems are different. But, the concepts are mostly the same.
From the perspective of a log-producing application, a function call to produce an unstructured message may look something like:
logger.info(username + " successfully logged in")
Whereas a call to produce that same message in structured form may look like:
logger.info("successful_login", who=username)
You'll immediately notice that the API retains the original pieces of data. In this example, we just have the event name (successful_login) and who (some variable value).
This will look different in every programming language, of course. And, depending on the structured logging mechanism, the common metadata produced with each message (e.g. time and log level) may differ. But the general idea is you preserve the rich data fields as they cross over into the logging facility rather than casting them to a string. Behind the scenes, the structured logger retains the fields (the original structure) and encodes them into something machine readable.
Put another way, structured loggers log objects, not strings.
How objects are encoded again depends on the structured logger implementation. JSON is a well-understood encoding format. An example JSON encoding of our example message might look like:
[1354317157.034534, "INFO", "successful_login", {"who": "gps@mozilla.com"}]
That's an array of current time, log level, event name/message type, and a mapping of the data fields.
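To make this concrete, here is a minimal sketch of a logger emitting that exact format. The StructuredLogger class, its write callback, and the Node-style process.stdout sink in the usage example are all hypothetical illustrations, not an existing API:

// A minimal structured logger that emits the JSON array format
// described above: [time, level, event, fields].
class StructuredLogger {
  constructor(write) {
    // write is any function that persists one encoded line,
    // e.g. appending to a file or sending over a socket.
    this.write = write;
  }

  log(level, event, fields) {
    // Date.now() returns milliseconds; convert to seconds to match
    // the example encoding above.
    let message = [Date.now() / 1000, level, event, fields || {}];
    this.write(JSON.stringify(message) + "\n");
  }

  info(event, fields) {
    this.log("INFO", event, fields);
  }
}

// Usage (assuming a Node-style environment):
let logger = new StructuredLogger(line => process.stdout.write(line));
logger.info("successful_login", {who: "gps@mozilla.com"});
// Emits something like:
// [1354317157.034, "INFO", "successful_login", {"who": "gps@mozilla.com"}]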
Or, you could opt for a more efficient encoding format, such as protocol buffers (which is binary).
The actual encoding doesn't really matter as long as it can easily be decoded back into its original data. A good way to identify this trait is whether you need to implement a decoder/parser more than once per programming language or whether you need to apply different decoding mechanisms based on the content of a message. If you do, you probably don't have a generic enough encoding format.
Going back to implementation details, what about a consumer? In JavaScript, a consumer of the JSON-encoded messages from above might look something like:
for (let line of log) {
  let [time, level, action, fields] = JSON.parse(line);

  if (action == "successful_login") {
    let who = fields.who;
    // Do something interesting.
  } else if (action == "XXX") {
    // ...
  }
}
Compare this with a destructured parser:
for (let line of log) {
  if (line.endsWith("successfully logged in")) {
    // Find the username, which is the third space-delimited field
    // in our example format.
    let username = line.split(" ")[2];
    // Do something interesting.
  } else if (line.indexOf(...) != -1) {
    // ...
  }
}
This example admittedly isn't very complex. But, you can already see some differences. Notably, it takes 1 line to completely parse any message using the structured logger. The unstructured text parser requires multiple lines of code per message type. One of those scales better than the other.
Before I conclude, it's worth mentioning something about event names/enumerations. You may notice I gave the event name a dedicated field in my examples. This is intentional. In a subsequent post, I will talk about the importance of event names/enumerations and schemas. Until then, just assume that having a unique event name is an important aspect of a well-designed structured logging system. Hence its inclusion in my examples above.
Conclusion
I think the vast majority of application logging is done suboptimally because it forces a hard problem of decoding/parsing onto consumers. Given that many logs are intended for machines, I think this is a problem that shouldn't need to exist and the software industry should rethink its current approach to logging by offering first-class structured logging facilities and subsequently encouraging their use.
If application logging retained the data behind messages, the problem of parsing/decoding log messages would largely vanish and we would be left with the real problem: actually doing something useful with logged data. That is the point of logging after all, isn't it?
Now, this may sound like I'm waving my hands a bit. You are right: structured logging by itself is no silver bullet. There are many ways to do it suboptimally and to use it when you shouldn't. In subsequent posts in this series, I will dive deeper into how to properly design a (structured) logging system. I will also investigate some of the performance implications associated with different logging mechanisms. Stay tuned.