Thomas Maroulis

much ado about logs

In my last post I touched on the subject of getting the logs from all of your systems, where they are hard to find and analyse, to a place where they can actually be useful. Important though that is, the discussion sidestepped a rather fundamental question: what should you actually put in those logs? After all, it’s not much use being able to analyse hundreds of thousands of log lines if the information you need is not there to begin with.

There is a quote on this topic that, although it might at first sound like one of those theoretical aphorisms that do not survive contact with the real world, I have found to be a very good guideline when deciding what should and should not go into logs. Unfortunately I can’t remember where I first read it, so I can’t credit the original author. Paraphrasing: “Logs should contain enough information that, in the case of a failure, you can determine which line of code the problem is in without having to debug the system.”

On first reading this may sound more like wishful thinking than a pragmatic approach. Software systems are complicated beasts, and the amount of information you would need to log to figure out what went wrong can be prohibitively large. There is obviously a balance to be struck between logging too little and logging so much that you exceed the capacity of your log analysis system to separate the wheat from the chaff. I am not disputing that. What I am arguing is that you can get surprisingly close to the goal of the quote above without overwhelming your logging system.

stack trace magic

The first and most obvious way to satisfy the quote is to log stack traces. These contain a wealth of information that often tells you exactly what you need to know in order to troubleshoot a failure. The problem is they often also tell you ten times as many other things that you couldn’t care less about.

Consider the following (contrived) example…

java.lang.NullPointerException
  at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
  at com.acme.ThingService.doTheThing(ThingService.java:115)
  at com.acme.ThingService$$FastClassByCGLIB$$e7642461.invoke()
  at net.sf.cglib.proxy.MethodProxy.invoke()
  at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint()
  at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()
  at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed()
  at com.acme.OtherThingAspect.doTheOtherThing()
  at sun.reflect.NativeMethodAccessorImpl.invoke0()
  at sun.reflect.NativeMethodAccessorImpl.invoke()
  at sun.reflect.DelegatingMethodAccessorImpl.invoke()
  at java.lang.reflect.Method.invoke()
  at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs()
  at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod()
  at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke()
  at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()
  at org.springframework.aop.interceptor.AbstractTraceInterceptor.invoke()
  at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()
  at org.springframework.transaction.interceptor.TransactionInterceptor.invoke()
  at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()
  at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke()
  at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()
  at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept()
  at com.acme.Controller$$EnhancerByCGLIB$$7cb543e4.doTheThing()
  at com.acme.Controller.doTheThing()

With a bit of effort you can sift through the noise and find the useful information in there, but it is certainly not easy enough that a quick visual scan will tell you what you need to know. Do you care about every single framework method invocation? I think 99.9% of the time the answer is an emphatic no. There will be those situations where your bug is in the framework you are using or in the way two different frameworks interact, but most of the time what you care about is what your code is doing and you are simply trusting the frameworks to do their jobs. If your stack traces look like the one above then you are going to hesitate to log them for less than critical errors to avoid flooding your logs.

Instead, the above stack trace could be made to look like this in your logs, with almost no loss of useful information:

java.lang.NullPointerException
  at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
  at com.acme.ThingService.doTheThing(ThingService.java:115)
  at com.acme.ThingService$$FastClassByCGLIB$$e7642461.invoke()
  ...
  at com.acme.OtherThingAspect.doTheOtherThing()
  ...
  at com.acme.Controller$$EnhancerByCGLIB$$7cb543e4.doTheThing()
  at com.acme.Controller.doTheThing()

In fact, you could even omit the Preconditions frame, as it adds no value over looking at ThingService.java:115 and seeing

Preconditions.checkNotNull(something);
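One way to produce the condensed form is at logging time: walk the throwable’s frames, keep only the ones from packages you actually care about, and collapse each run of framework frames into a single `...` marker. Below is a minimal sketch of that idea; the `KEEP_PREFIXES` whitelist is hypothetical and would be tuned to your own packages, and some logging frameworks can be configured to do similar filtering for you.

```java
import java.util.ArrayList;
import java.util.List;

public class StackTraceTrimmer {

    // Hypothetical whitelist: frames from these packages survive,
    // everything else collapses into a "..." marker.
    private static final String[] KEEP_PREFIXES = {"com.acme.", "com.google.common."};

    static boolean isInteresting(StackTraceElement frame) {
        for (String prefix : KEEP_PREFIXES) {
            if (frame.getClassName().startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the trimmed trace as lines, mirroring the condensed example above. */
    public static List<String> trim(Throwable t) {
        List<String> lines = new ArrayList<>();
        lines.add(t.toString());
        boolean skipping = false;
        for (StackTraceElement frame : t.getStackTrace()) {
            if (isInteresting(frame)) {
                lines.add("  at " + frame);
                skipping = false;
            } else if (!skipping) {
                // Emit "..." once per run of uninteresting frames.
                lines.add("  ...");
                skipping = true;
            }
        }
        return lines;
    }
}
```

Note that the prefix match also keeps the CGLIB-generated `com.acme.ThingService$$FastClassByCGLIB$$...` frames, which is why they appear in the condensed trace above.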

logging for web services

Web services add their own set of challenges when it comes to logging. Whereas before you cared about which piece of code was handling what data, now you often also care about which HTTP request that code was invoked in response to.

Logging the HTTP request path, the URL parameters, the IP the request came from, the response status code, any custom status codes returned in the body, and how long the request took to complete is very simple to implement, and it adds considerable value when you are analysing your logs for patterns or trying to tie a specific failure to a particular client report.
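All of those fields are cheap to capture in whatever filter or interceptor wraps your request handling, and the only real decision is the shape of the log line. A minimal sketch of the formatting step is below; the field names and layout are illustrative, not any standard format.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class AccessLogFormatter {

    /**
     * Formats one access-log line with the fields discussed above:
     * path, URL parameters, client IP, status code and duration.
     */
    public static String format(String method, String path, Map<String, String> params,
                                String clientIp, int status, long durationMs) {
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return String.format("%s %s?%s ip=%s status=%d duration=%dms",
                method, path, query, clientIp, status, durationMs);
    }
}
```

A line like `GET /things?id=42 ip=10.0.0.1 status=500 duration=37ms` is trivially greppable and easy for a log analysis system to parse into fields.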

All of these however are pretty much conventional wisdom so I am not going to dwell on them. Instead I will talk about UUIDs and two places where they can be used to make your life easier.

Imagine a client of your API reports receiving the following HTTP response body:

{
  "status": 500,
  "type": "SOMETHING_BAD™"
}

If you are lucky they are going to give you an exact timestamp of when this error occurred so you can find the corresponding logs. Of course if your systems are busy enough then that timestamp had better be accurate to the millisecond if you are to have any chance of figuring out which logs talk about this. Let’s hope NTP has done its job and that network latency hasn’t made the client’s view of when something happened diverge too much from your system’s view.

Wouldn’t it be much nicer if instead it looked like this?

{
  "status": 500,
  "type": "SOMETHING_BAD™",
  "id": "da75eb41-189e-4f16-9e54-225b2fbbb100"
}

If that same ID is in your logs it is now trivial to tie an API failure to them.

This idea can be taken one step further to account for the fact that your system might actually be a chain of several microservices. If this UUID was generated at the point of failure, then that would potentially only allow you to trace the problem as far as the last service that handled this request. If the error had originated further upstream then you are a long way from debugging the issue.

Instead you could generate the UUID at the first service that receives the HTTP request and then propagate it to the downstream services. Then any logs generated by those services would include that ID and it would become trivial to tie an API failure to possibly multiple logs across multiple services.
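The propagation rule itself is tiny: each service reads the correlation header, generates a fresh UUID only if none arrived, includes the value in every log line, and forwards the same header on outbound calls. A minimal sketch follows; the `X-Request-ID` header name is a common convention rather than a standard, and any header every service agrees on will do.

```java
import java.util.UUID;

public class CorrelationId {

    // Conventional, not standardised: any agreed-upon header name works.
    public static final String HEADER = "X-Request-ID";

    /**
     * Returns the incoming ID if an upstream service supplied one,
     * otherwise generates a fresh UUID because we are the first hop.
     */
    public static String resolve(String incomingHeaderValue) {
        if (incomingHeaderValue == null || incomingHeaderValue.isEmpty()) {
            return UUID.randomUUID().toString();
        }
        return incomingHeaderValue;
    }
}
```

In practice you would stash the resolved value somewhere your logging framework can see it on every line (SLF4J’s MDC is one common choice) and copy it onto every downstream HTTP call.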

That’s it about logs! In my next post we will be moving on to other things. I hope you enjoyed reading. Till next time!

