I increasingly feel that there should be two separate logs, one for the user with the Error/Warning/Info information (what the user wanted to be done and how well it went) and the other for the developer with the Debug/Trace information (capture key information about what the program was doing, especially on the interface - what files it was looking for, what database calls it did etc.).
I always get complaints from users that there is too much useless stuff in the logs through which they need to go. I also think that the developer log should be structured differently, possibly to be easily read by some automation.
This is a very common pattern in Node.js... look into the `debug` package; it's designed to work as a development aid and can be enabled/disabled per part of the codebase. So you only enable it for the bit you are debugging and unrelated info gets filtered out.
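Python's stdlib `logging` gives you a similar per-component toggle through named loggers. A minimal sketch, assuming hypothetical subsystem names:

```python
import logging

logging.basicConfig(level=logging.INFO)

# Hypothetical subsystem loggers; names follow the usual dotted convention.
db_log = logging.getLogger("myapp.db")
http_log = logging.getLogger("myapp.http")

# Turn on verbose output only for the part being debugged,
# so unrelated noise stays filtered out.
db_log.setLevel(logging.DEBUG)

db_log.debug("connection pool size: %d", 10)  # passes the level check
http_log.debug("request headers parsed")      # dropped (still at INFO)
```

Same idea as `debug`'s per-namespace enabling, just driven by logger names instead of an environment variable.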
Info: What is the program up to. Occasional messages that indicate it is alive and doing what is expected.
Warn: Problems that you know about and wish would not happen, but do anyway. A service call has failed, unexpected input that violates the contract of the method, etc. You may be handling these conditions gracefully and/or returning an error to the caller, but you still want to know when they happen. Usually you should have a ticket open in your bug tracker for these.
Error: Problems that you do not know about. When an error is logged, a developer should look at it. Some kind of action should be taken immediately. Put the fire out if it is an emergency. If not an emergency, create a ticket in your bug tracker, then change the level to warn.
I keep warn and info separate so that I can easily grep/filter on one. I often want to just see info messages to know what the status of things are.
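The split above maps directly onto the standard levels. A hedged sketch in Python (the order-processing scenario and its failure modes are invented for illustration):

```python
import logging

log = logging.getLogger("orders")

def process_order(order_id, payment_ok):
    # Info: the program is alive and doing what is expected.
    log.info("processing order %s", order_id)
    if not payment_ok:
        # Warn: a known, gracefully handled failure; there should probably
        # be a tracker ticket for how often this happens.
        log.warning("payment declined for order %s, returning error", order_id)
        return False
    try:
        ship(order_id)  # hypothetical downstream call
    except Exception:
        # Error: something we did not anticipate; a developer should look.
        log.exception("unexpected failure shipping order %s", order_id)
        return False
    return True

def ship(order_id):
    pass  # stand-in for the real shipping call
```

Keeping warn and info on distinct levels is what makes the grep/filter workflow described above work.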
In an enterprise setting, treating one level of logs as actionable, as GP mentions, is very important for clarifying responsibilities.
If it takes the right pair of eyes on the logs to see a problem for what it is, the whole process slows down and doesn't scale as high. All errors and very chatty warnings should induce me to ask someone what's going on, to make sure it hasn't slipped through the cracks. And they should thank me for pointing it out. Otherwise you reinforce some bad patterns.
I support these 5 levels of logging, but given the discussion below I also would like to highlight a lesson from an SRE talk I once attended that really there are only three levels of alert:
1. DEBUG - Look at this when debugging, otherwise no one will ever see it.
2. TICKET - Open a ticket and have someone on your team look at this when it gets to the top of your queue.
3. PAGER - Drop what you're doing and handle this, now.
Of course, you want to track INFO and ERROR type messages because a sufficient number of them might cause someone to raise a ticket... but at scale, you probably should've built that monitoring already, and just drop INFO/ERROR down to DEBUG.
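One way to read that: the conventional levels collapse into three actions. A toy mapping in Python (the routing thresholds are my assumption, not a standard):

```python
import logging

def alert_action(levelno):
    # Hypothetical routing: CRITICAL pages someone, WARNING/ERROR raise a
    # ticket, and everything below is debug-only noise nobody is alerted on.
    if levelno >= logging.CRITICAL:
        return "PAGER"
    if levelno >= logging.WARNING:
        return "TICKET"
    return "DEBUG"
```

At scale, per the comment above, you might deliberately route ERROR down to "DEBUG" here and rely on metrics-based monitoring instead.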
It's nice to have articles like this to give structure to the conversation, but there's no single solution. Python has default levels of debug, info, warning, error, and critical, but doesn't really give any guidance on what each means. At one company we tried to create some structure, but 3rd-party libraries would flood messages at the "wrong" logging level.
I like these 5 levels, and the 3 levels you talk about seem like a pared-down version that reinforces what each level means.
To illustrate the "no single solution," we tried to use logging for command line output, neither of those systems really has a level for that. We probably should have created a new level between INFO and WARNING because of chatty libraries.
In practice I've found that DEBUG and TRACE are usually best added to code while troubleshooting. When you're writing the code you don't have a good idea what information you'll need when troubleshooting, so the logging ends up overly verbose yet still missing what you'd want. And when you try to stash all logging, either locally or centrally (because adding logging and re-running isn't always an option), it can hurt performance.
I think a good log system should give you the ability to store everything but filter quickly. So for example I'll set up my systems to store everything down to LOG_DEBUG because I might want that information. But when I'm actually looking at the logs I like being able to exclude the DEBUG messages and look at the general behavior, then drill down when I think I understand what the sequence of events is. Good log systems should support this without making me filter as I'm storing the data.
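You can approximate "store everything, filter at read time" with stdlib logging by giving each destination its own threshold. A minimal sketch, assuming an arbitrary file name:

```python
import logging

log = logging.getLogger("app")
log.setLevel(logging.DEBUG)  # the logger itself drops nothing

file_h = logging.FileHandler("app.log")  # arbitrary path for the sketch
file_h.setLevel(logging.DEBUG)           # store everything for later drill-down

console_h = logging.StreamHandler()
console_h.setLevel(logging.INFO)         # general behavior only, no DEBUG noise

log.addHandler(file_h)
log.addHandler(console_h)

log.debug("sequence detail")  # file only
log.info("high-level event")  # file and console
```

The console shows the general behavior; the full file is there when you want to drill into the sequence of events.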
Also, on UNIX systems I wish there were a generic way to pipe the output of a command to the system log while preserving the command's exit code. In bash you can set `pipefail` so that a failure anywhere in the pipe propagates to the calling shell, but in plain shells the exit code of the last command in the pipe is the one that gets kept, which is usually not what you want. It would be really cool if the logger utility captured the exit code from the upstream command and returned it, so that I could use it in the antiquated System V init scripts I have to maintain.
Talking about logging got me a rejection at Google. The very kind 22-year-old interviewing me asked what my thoughts on commenting code were, and I said that almost every time, comments are used as an excuse to show off a piece of overly clever code, so I normally discourage them in favor of logging, since log statements are code comments you can use in production. I realize now that was my first graybeard moment.
The article is a bit confusing, because the title says 5, the article gives 6 (Error, Warning, Info, Debug, Trace, and Fatal), but the article casts doubt on whether either of those is right when it says this:
> I would like to talk about the five (or six, or four) that I find are the most crucial, the fundamental set to base your own levels (and your thinking) on.
Which are the 2 from the 6 given whose status as levels of logging is in doubt?
He orders it in terms of relevance too. Note that the last two levels he states are Trace and Fatal. He talks about how Trace may not always be enabled, or how Fatal doesn't necessarily make sense if such an error needs to be handled manually anyways.
Every developer knows that they should include logging in their systems, but how and what to log is not so well documented.
When you start working on a system (e.g. at a startup) you put the odd log here and there, without too much attention. However, as your system(s) start growing, you have to start paying attention to your logs' structure and semantics, so that they are more useful in ELK, Sumologic, and similar log search engines.
What's nice about something like Elasticsearch is that if you categorize your logging levels, you can more easily drop logs of a certain level at varying times. Debug/verbose/trace logs are a lot less useful after, say, 24 hours and take up a lot of space, while fatal errors should be rare and are probably worth tracking year-over-year or between releases.
I really believe only two levels are necessary most of the time: error, and info. Error is for anything that a dev should be notified about (AKA if you send an error log message, assume someone will be notified about it immediately), and info is for any other useful debugging information, like "we got a request for /foo/bar at 10 pm". I also think structured logging is absolutely essential. Log levels are really just a way to get some of the benefits of structured logging.
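A minimal structured-logging sketch in stdlib Python, in the spirit of the comment above (the JSON shape and field names are my own, not any standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # Emit each record as one JSON object instead of a line of prose,
    # so fields can be filtered and aggregated downstream.
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "path": getattr(record, "path", None),  # custom field via extra=
        })

log = logging.getLogger("web")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request received", extra={"path": "/foo/bar"})
```

Once the record carries fields instead of prose, the level is just one more field to filter on, which is the point being made above.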
Error and fatal are the same thing. If an operation fails and cannot be recovered, that's an error. There is no meaningful difference between an operation failing and the entire program failing. Or, from another perspective, keeping the program running is just another operation that can have fatal or non-fatal errors.
That depends on the nature of the program. We use error and critical for server software. If there is a protocol violation on a particular connection, we log an error and drop the connection, but the program otherwise continues. If the `accept` system call fails, that's critical and the program shuts down. Depending on what we're doing there may be other critical operations, but some software doesn't have any.
A request that causes a 500 error response to the client, should also produce an Error log output on the backend. The request crashed. But, because this is an actor-based system, that crash didn't affect anything other than the request. It was Fatal to that interaction, but not to the system.
A request that causes the entire server process to spontaneously halt, should generate a Fatal log message. That is never supposed to be something that happens. The system is built, on a number of levels, to ensure that it doesn't happen. But it'd be really great, if it does happen, to have enough information to know why.
Ideally, a Fatal log message is accompanied by a crash dump.
This works at a very, very small scale where you are reading your logs by hand. At this point, you don't even need log levels.
In the real world, you're not going to read your logs unless you're trying to investigate an issue. Instead, you're going to monitor your logs. And this kind of monitoring is where WARN and ERROR become helpful.
WARNs are useful because if you run into a WARN-level issue every now and again, it's no big deal, but if you run into a ton of them, that might indicate a problem that's worth fixing. A single WARN log isn't that useful, but a statistically significant number of WARN logs is a monitorable metric. Being able to detect and address issues after they're serious enough to cause elevated WARNs but before they're serious enough to cause customer impact is worthwhile.
ERRORs are useful because then you have a one-size-fits-all way of monitoring for error conditions, if for example, it's an offline job that doesn't result in a 5xx response. You should still monitor for 5xx responses, pipeline job failures, non-zero exit codes from processes, or whatever your application-level "this didn't end well" signal is. But ERROR logging is also a good metric to have if you want to have multiple log-based metrics all in one place.
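One cheap way to turn level counts into a monitorable metric is a handler that tallies records by level. A toy sketch; a real setup would export these counts to a metrics system rather than keep them in memory:

```python
import collections
import logging

class LevelCounter(logging.Handler):
    # Tally records per level so elevated WARN/ERROR rates can be
    # detected without anyone reading individual log lines.
    def __init__(self):
        super().__init__()
        self.counts = collections.Counter()

    def emit(self, record):
        self.counts[record.levelname] += 1

log = logging.getLogger("job")
counter = LevelCounter()
log.addHandler(counter)
log.setLevel(logging.DEBUG)
log.propagate = False  # keep the tally self-contained in this sketch

log.warning("retrying upload")
log.warning("retrying upload")
log.error("upload failed permanently")
```

A single WARN is noise; `counter.counts["WARNING"]` trending up over time is a signal, which is exactly the distinction drawn above.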
The author argues that the error level should not exist because according to him, an error can either be handled or bubble up and logging an error is in a way handling it. If an error is handled, it is not an error anymore.
I think he is overthinking it a bit. Yes, there should be very few places where an error is actually logged, yet that does not mean the level should not exist.
In web applications you very often have one instance of:
logger.exception('Something bad happened')
Sure you can argue that as the client received a 500 it is not an error anymore, but this doesn't help you notice and fix the problem.
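For context, `logger.exception` is Python's stdlib shorthand: called inside an `except` block, it logs at ERROR level and attaches the current traceback. A typical request-handler sketch (the handler, `dispatch`, and the response shape are invented for illustration):

```python
import logging

log = logging.getLogger("web")

def handle_request(request):
    try:
        return dispatch(request)  # hypothetical application logic
    except Exception:
        # Logs at ERROR and includes the traceback of the active
        # exception; the client just gets a generic 500.
        log.exception("Something bad happened")
        return {"status": 500}

def dispatch(request):
    raise ValueError("boom")  # stand-in failure for the sketch
```

The client sees a handled 500, but the ERROR record with its traceback is what lets you notice and fix the underlying bug.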
Any exception that bubbles all the way up to a request-scoped catch block indicates a bug in your handler logic. It should probably crash your program, or at least be reported as a structured event to your exception tracking system. In either case, logger.Info is sufficient to record this event in the application logs.
Why should exception tracking and logging be two separate things? They perform the same job; an exception tracking system is just another way to present log records.
The real problem is that people still associate logs with lines of text in a file, whereas most logging frameworks create full objects containing a ton of information. Once you stop considering logs as text, exception tracking and logging really become the same thing.
Conceptually they are very similar, in the same way that metric data and log data are also very similar. But we don't have good technological solutions to support muxing all streams of observability data together and dealing with them efficiently, either in transport or in storage/query.
Good program architecture in 2018 demands some degree of compromise. In my experience it's best to separate console logging (as in this article) from structured logging, from metrics, from distributed tracing, from exception tracking.
- Debug: should not show up in normal logs but you can turn it on down to a specific log message in a running process without restarting. We use structured logging so we have the additional ability to turn on debug messages down to a specific key=value pair (or set of key=value pairs).
- Info: An interesting event occurred that someone might care about. This could be an error that we were able to handle or just some important event.
- Alert: An error case that should not happen has occurred and someone needs to look into it. In addition to logging the event, an external system should be notified (e.g. create a ticket, page someone, send an email, etc).
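The per-key debug toggle described above could be approximated in stdlib Python with a `logging.Filter`. A toy sketch; the `customer_id` key is hypothetical, and a real system would flip the filter at runtime rather than hard-code it:

```python
import logging

class DebugKeyFilter(logging.Filter):
    # Pass DEBUG records only when they carry a matching key=value pair;
    # everything at INFO and above passes untouched.
    def __init__(self, key, value):
        super().__init__()
        self.key, self.value = key, value

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True
        return getattr(record, self.key, None) == self.value

log = logging.getLogger("svc")
log.setLevel(logging.DEBUG)
log.addFilter(DebugKeyFilter("customer_id", "1234"))  # hypothetical key

log.debug("cart contents", extra={"customer_id": "1234"})  # passes the filter
log.debug("cart contents", extra={"customer_id": "9999"})  # filtered out
```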
"- Debug: should not show up in normal logs but you can turn it on down to a specific log message in a running process without restarting. We use structured logging so we have the additional ability to turn on debug messages down to a specific key=value pair (or set of key=value pairs)."
I always turn on the highest log level unless there is a very strong performance reason against doing so. I have been able to resolve a lot of issues over the years because I had good logging when something went wrong.
I don't really see alerts as logging. They are an integral part of the application's behavior.
I've always preferred levels of logging rather than these arbitrary names (what's the difference between trace and debug, and which one is lower level?).
Give me three levels of logging, 1, 2, and 3, and then warning, and error. Maybe allow to mix and match them, if this makes sense (the typical nomenclature makes them all mutually exclusive, which is dumb).
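Python's named levels are in fact just numbers underneath (DEBUG=10, INFO=20, WARNING=30, ERROR=40, CRITICAL=50), so numeric verbosity levels can be layered on top. A sketch; the VERBOSE1-3 names and numbers are my invention:

```python
import logging

# Hypothetical numeric sub-levels wedged between DEBUG (10) and INFO (20):
for n in (1, 2, 3):
    logging.addLevelName(10 + n, f"VERBOSE{n}")

log = logging.getLogger("tool")
log.setLevel(12)              # show VERBOSE2 and up
log.log(13, "level-3 detail")  # passes the level check
log.log(11, "level-1 detail")  # dropped by the level check
```

Since the comparison is plain `>=` on integers, warning and error keep working alongside the numeric levels rather than being mutually exclusive with them.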
Depending on your Coding Culture, you order them based on "priority" (FATAL would have priority over all others, and therefore numbered as 1), or "importance" (FATAL would be most important, and therefore numbered 5.)
Alternatively, each is a flag in a number so that they can be combined, with FATAL being 1 or 16 depending on priority vs. importance.
I would argue that the names do indeed suffice to tell you their "level" - TRACE is the noisiest, least important, least priority level. DEBUG is next, followed by INFO, WARN, ERROR, FATAL.
The multiple levels are nice, especially for logging errors in production and tracking general information about your software. But in development, logging is often about tracking a particular aspect of your code. You need something that logs to the particular aspect you are working on, and that you can turn off once you move on. Later, you often find yourself back debugging and needing to flip that aspect back on again.
Summary: Log levels of info, warn and error all have a place. Logging to aspects is valuable during development, easy to turn off for production and useful when returning to debug.
Logging is one of the weirder cross cutting concerns, for many reasons but especially including the one you mention.
We had to roll back an upgrade last month because one nonfatal (but still pretty bad) problem was generating a ridiculous amount of log entries. We had no way to turn it off or throttle it, so we rolled back.
I have fantasized for years about a programming language with a small runtime that supports instrumentation of the code, not just for debugging and profiling, but for logging as well. Logging becomes a set of conditional breakpoints stored with or at least near the code and you can tweak them individually.
Something akin to applying the 80/20 rule to Aspect Oriented Programming.
We fake this by implementing logging frameworks that try to short-circuit out cheaply. But that only pays off if you can marshal the arguments to the logger very cheaply, and you still have problems with third-party code, which can't know that you've worked around problems it considers fatal, and is therefore so noisy at the WARN or INFO levels that you can't see your own log entries (Apache Foundation, I'm looking at you).
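The argument-marshalling cost is why logging APIs defer formatting, and why a guard is sometimes still needed on top. A sketch of the two standard idioms in Python, with the expensive call simulated by a counter:

```python
import logging

log = logging.getLogger("perf")
log.setLevel(logging.INFO)
log.addHandler(logging.NullHandler())

calls = {"n": 0}

def expensive_summary():
    calls["n"] += 1  # simulate a costly computation
    return "huge summary"

# Idiom 1: %-style deferral. The string is never built for a dropped
# record, but the argument expression itself still runs.
log.debug("state: %s", expensive_summary())  # calls["n"] is now 1

# Idiom 2: guard with isEnabledFor, so the arguments are never
# marshalled at all when the level is off.
if log.isEnabledFor(logging.DEBUG):
    log.debug("state: %s", expensive_summary())  # never runs here
```

The guard is the cheap short-circuit; it just has to be written at every expensive call site, which is what makes the instrumentation fantasy above appealing.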
"Logpoints are especially useful for injecting logging while debugging production servers that cannot be paused or stopped."
There is a lot more going on with them than you might realize. Not sure how coupled it is to the nodejs/kubernetes/Azure stuff they have built into VSCode, but it seems like a working example of the future. Sorta like time travel debugging in node-chakracore.
Windows has a facility called ETW (Event Tracing for Windows) that lets you set up all sorts of eventing/tracing where the calls are not actually present in the code unless the ETW facility injects the necessary code into the running process. This allows you to dynamically enable and disable event sources based on things like whether or not something is currently subscribed to them. Worth looking into just for the sheer amazement of it if you are not a Windows developer familiar with it.
Wouldn't this be the same as naming loggers? Many logging frameworks allow you to name a logger and just reference it by that name. From there, you can enable or disable named loggers and get this same "aspect" behavior you're talking about.
Netscaler logging deserves its own level of hell. You're given 15 undocumented logging levels, with the suggestion not to leave it running longer than necessary, and that level 11 should be enough for anybody.
Logging is very important because it also allows you to debug/resolve issues easily on production systems. But caring about the appropriate log levels and the ability to configure them at runtime is also super important. It's just insane and ridiculous to log everything on a trace/debug level and stream it to the Expensive Innovative Big Data Machine Learning Smart Log Analytics Product. Yes, I'm looking at you Application Insights.