Observability and Monitoring Best Practices ⋆ integralist

This post aims to discuss key monitoring discussion points and to summarise the relevant best practices when instrumenting application performance monitoring.

Note: we primarily work with Datadog so you’ll see them mentioned a lot throughout this post.

Terminology

There is a lot of confusion around the difference between certain terms such as “observability”, “monitoring”, “instrumentation” and “telemetry”. Let’s start by defining what each of these means…

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs – Wikipedia

In that context, “observability” is the word you use when talking about how well your systems are doing in a broad overarching sense (is the system observable?). Beneath the umbrella term “observability” we’ll then find “monitoring” and “instrumentation”.

Monitoring is the translation of IT metrics into business meaning – Wikipedia

In that context, “monitoring” is the word you use when talking about tools for viewing data that has been recorded by your systems (whether that be time series data, logging etc). These monitoring tools are supposed to help you identify both what has gone wrong and why.

Instrumentation refers to an ability to monitor or measure the level of a product’s performance, to diagnose errors and to write trace information – Wikipedia

In that context, “instrumentation” is the word you use when talking about how you’re recording data to be viewed and monitored.

Telemetry is the process of gathering remote information that is collected by instrumentation – MSDN

In that context, “telemetry” is the word you use when talking about the mechanisms for acquiring the data that has been gathered by your instrumentation (e.g. tools like FluentD or Syslog).

Understand the different types of monitoring

Although most of this document is based around one specific type of monitoring (APM), it’s good to be aware of the various types of monitoring available across an entire system.

Server monitoring:
monitor the health of your servers and ensure they stay operating efficiently.
Configuration change monitoring:
monitor your system configuration to identify if and when changes to your infrastructure impact your application.
Application performance monitoring:
look inside your application and services to make sure they are operating as expected (also known as APM tooling).
Synthetic testing:
real time interactions to verify how your application is functioning from the perspective of your users (hopefully to catch errors before they do).
Alerting:
notify the service owners when problems occur so they can resolve them, minimizing the impact to your customers.

Data collection methods

There are fundamentally two methods for data collection:

Push: sending metrics to an analysis tool.
Pull: exposing a health check endpoint that a centralised tool pulls data from.
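As a rough illustration of the ‘push’ model, the sketch below formats a metric as a DogStatsD-style line and sends it over UDP. The metric name, the tag and the `push_metric` helper are made up for this example, and the host and port assume an agent listening locally on its default port.

```python
import socket
from typing import Dict, Optional


def format_statsd_metric(name: str, value: float, metric_type: str,
                         tags: Optional[Dict[str, str]] = None) -> str:
    """Build one DogStatsD-style line, e.g. 'name:1|c|#key:value'."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags are sorted so the output is stable and easy to compare.
        joined = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{joined}"
    return line


def push_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send the line over UDP (host and port are assumptions for a local agent)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical metric: one request that returned a 200.
print(format_statsd_metric("foo_service.requests", 1, "c", {"status": "200"}))
# prints: foo_service.requests:1|c|#status:200
```

In practice you would normally use your vendor’s client library rather than writing to the socket yourself; the point here is only that the service decides when data is sent.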

When dealing with the ‘pull’ model you’ll hear people suggest that rather than a simple ‘200 OK’ response you should add extra information that gives humans more understanding of the overall state of the service.

This could include details such as whether a database connection was opened successfully. A possible concern is the performance overhead of running those checks on every request to the endpoint, which needs to be accounted for (remember: health endpoints are generally pinged every few minutes).
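A minimal sketch of a richer health response is shown below. The `build_health_report` function, the check names and the JSON shape are all assumptions made for illustration; a real endpoint would be wired into whatever web framework the service already uses.

```python
import json
import time
from typing import Callable, Dict, Tuple


def build_health_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each dependency check and return (HTTP status, JSON body)."""
    results = {}
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            ok = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            ok = False
        results[name] = {
            "ok": ok,
            # Recording the duration makes the overhead of each check visible.
            "duration_ms": round((time.perf_counter() - started) * 1000, 2),
        }
    healthy = all(result["ok"] for result in results.values())
    body = json.dumps({"status": "ok" if healthy else "degraded", "checks": results})
    return (200 if healthy else 503), body


def database_reachable() -> bool:
    # Hypothetical stand-in: a real check would open a connection here.
    return True


status, body = build_health_report({"database": database_reachable})
print(status, body)
```

Keeping each check cheap (and timing it) is one way to keep the overhead concern in view.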

There are also various metric types you can collect data as. Three common ones are:

Counter: an ever increasing value.
Gauge: a point-in-time value (can arbitrarily go up and down).
Histogram: samples observations and counts them in configurable buckets.

Histograms might require a little extra clarification: they sample observations (e.g. request durations) and count different perspectives on the data. In the case of ‘request duration’ you’d likely see the different ‘perspectives’ being: count, avg, median, max and 95th percentile.
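To make the three types more concrete, here is a small sketch. The class and function names are invented for this example, and the 95th percentile uses a simple nearest-rank rule, which may differ from how any particular monitoring tool calculates it.

```python
import statistics
from typing import Dict, List


class Counter:
    """An ever increasing value."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A point-in-time value that can go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


def summarise(samples: List[float]) -> Dict[str, float]:
    """Return the histogram-style 'perspectives' on a set of observations."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank percentile using integer maths: ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    return {
        "count": len(ordered),
        "avg": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }


# Hypothetical request durations in milliseconds, with one slow request.
durations = [120, 80, 95, 110, 2000, 90, 105, 100, 85, 115]
print(summarise(durations))
# avg is 290.0 while the median is 102.5: one slow request skews the mean.
```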

For more information, see these Datadog articles: Metric Types and DogStatsD.

Frontend monitoring

There are two main approaches to frontend monitoring:

Real User Monitoring (RUM).
Synthetic.

The difference between them has to do with the type of traffic that is triggering the data collection. For example, with RUM the requests being processed are from real users, whereas with synthetic monitoring the requests are fake (i.e. they are generated by your own software).

Synthetic monitoring causes data to be collected for analysis, thus allowing you to identify the availability and performance of your system by constructing very specific test cases.
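A synthetic test case can be as simple as a scripted request with an expectation attached. The sketch below assumes a hypothetical `fetch` callable that performs the request and returns an HTTP status code; passing it in keeps the example runnable without a network.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    duration_ms: float


def run_synthetic_check(name: str, fetch: Callable[[], int],
                        max_duration_ms: float) -> CheckResult:
    """Run one scripted request and judge it on status and latency."""
    started = time.perf_counter()
    try:
        status = fetch()
    except Exception:
        # Any failure to complete the request counts as a failed check.
        status = None
    duration_ms = (time.perf_counter() - started) * 1000
    passed = status == 200 and duration_ms <= max_duration_ms
    return CheckResult(name, passed, duration_ms)


# Hypothetical test case: the homepage should respond with a 200 within a second.
result = run_synthetic_check("homepage", lambda: 200, max_duration_ms=1000)
print(result.name, result.passed)
```

Because the traffic is generated by you, the test case can target a very specific path through the system.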

Make it useful, then actionable

Let’s start with a quote from Charity Majors (author of Database Reliability Engineering and CEO of honeycomb.io).

Don’t attempt to “monitor everything”. You can’t. Engineers often waste so much time doing this that they lose track of the critical path, and their important alerts drown in fluff and cruft.

When a monitor triggers an alarm, it should first and foremost be “useful”. Secondly, it should be “actionable”. There should be something you can do to resolve the alarm, and there should also be a set of steps (post-resolution) to prevent that alarm from triggering again.

If the alarm isn’t actionable, then it just becomes noise.

Focus on user impact

Below is a quote from Mike Julian (author of Practical Monitoring and Monitoring Weekly)

Go as deep and wide with your instrumentation as you want, but always be asking yourself, “How will these metrics show me the user impact?”

Or put another way: “Who is your customer”?

Depending on the services you build, your customers might be people who pay money to use your service, or they might be other engineers within your organisation who are consuming an API you’ve developed. Either way, your customer is the most important thing to you and your service. You need to keep them happy.

Most users, whether they are non-technical paying customers or engineers, will have certain criteria that make them happy. You could imagine the sorts of things they’ll ask for to be something like…

I want the service to work as expected
i.e. you should monitor whatever is determined to be a ‘success rate’ for your service.
I want the service to be fast
i.e. you should monitor the service latency.
I want to use ’this much’ of the service
i.e. you should monitor things like ‘requests per second’ - do you have SLAs defined?

In essence you should start by creating monitors for things that have a direct impact on your users.

For example, measuring OS level metrics such as CPU and memory consumption is great for diagnostics & performance analysis, as they help to highlight the underlying system behaviour.

But you should probably not be using these stats for alarming, as their values can have relative meaning in different situations and they don’t necessarily offer much in the way of understanding the root cause of the problem.

Instead, try monitoring 5xx errors and very slow latency times. These metrics are much more useful indicators of problems with the system and real negative user experiences.
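As a sketch of what “user impact” metrics might look like, the function below derives a 5xx rate and a slow-request rate from a batch of requests. The `user_impact` name, the input shape and the one second cut-off are assumptions for illustration only.

```python
from typing import Dict, Iterable, Tuple


def user_impact(requests: Iterable[Tuple[int, float]],
                slow_ms: float = 1000.0) -> Dict[str, float]:
    """Summarise (status code, duration in ms) pairs as user-facing rates."""
    total = errors = slow = 0
    for status, duration_ms in requests:
        total += 1
        if 500 <= status <= 599:
            errors += 1
        if duration_ms > slow_ms:
            slow += 1
    if total == 0:
        return {"error_rate": 0.0, "slow_rate": 0.0}
    return {"error_rate": errors / total, "slow_rate": slow / total}


# Hypothetical sample: one 5xx response and one very slow request out of four.
sample = [(200, 120.0), (503, 90.0), (200, 1500.0), (200, 80.0)]
print(user_impact(sample))
# prints: {'error_rate': 0.25, 'slow_rate': 0.25}
```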

Ultimately, the deeper you go, the more specific your alarms become, and the less useful they are at identifying trends and patterns.

Favour organic changes over static thresholds

Static thresholds such as “the number of errors reported has exceeded N” have a habit of raising false alarms, due typically to unexpected spikes in data (anomalies).

This happens so frequently that most monitoring tools/services (such as Nagios) offer “flapping detection” to help suppress these false alarms.

To help with this, services such as Datadog offer a feature called “recovery thresholds”, which helps to quieten monitor state changes so that when a monitor switches back to the OK state you can be confident the issue has in fact resolved itself.

The way it works is like so: you give Datadog a recovery threshold, a value the metric must reach before the monitor is considered “back to normal”. Once the monitor state switches to ALARM it no longer flip-flops between OK and ALARM while the metric hovers around the alert threshold. For a monitor that alerts on high values, it only goes back to OK once the metric drops below the recovery threshold.
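The idea can be sketched as a small state machine. This is not Datadog’s implementation; the `ThresholdMonitor` class and the threshold values are hypothetical, and it assumes a monitor that alerts on high values.

```python
class ThresholdMonitor:
    """Alert above one threshold, recover only below a lower one."""

    def __init__(self, alert_threshold: float, recovery_threshold: float) -> None:
        if recovery_threshold > alert_threshold:
            raise ValueError("recovery threshold must not exceed alert threshold")
        self.alert_threshold = alert_threshold
        self.recovery_threshold = recovery_threshold
        self.state = "OK"

    def observe(self, value: float) -> str:
        if self.state == "OK" and value > self.alert_threshold:
            self.state = "ALARM"
        elif self.state == "ALARM" and value < self.recovery_threshold:
            # Dipping just under the alert threshold is not enough to recover.
            self.state = "OK"
        return self.state


monitor = ThresholdMonitor(alert_threshold=100, recovery_threshold=60)
print([monitor.observe(v) for v in [50, 120, 90, 110, 70, 55]])
# prints: ['OK', 'ALARM', 'ALARM', 'ALARM', 'ALARM', 'OK']
```

With only the static threshold of 100, the same series would have changed state at 90, 110 and 70, which is the noise the recovery threshold is there to remove.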

They also offer “anomaly detection”, which detects when a metric is behaving differently than it has in the past, taking into account trends, seasonal day-of-week and time-of-day patterns. This can be more useful for organically identifying issues, as it allows for buffer zones around your static thresholds.
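To give a feel for the “buffer zone” idea, here is a deliberately simple sketch that flags a value when it falls outside a band around recent history. It is not one of Datadog’s algorithms and it ignores seasonality entirely; the `is_anomalous` function and the tolerance of three standard deviations are assumptions.

```python
import statistics
from typing import List


def is_anomalous(history: List[float], value: float, tolerance: float = 3.0) -> bool:
    """Flag a value that falls outside mean +/- tolerance * stdev of history."""
    if len(history) < 2:
        # Not enough data to estimate a spread, so never flag.
        return False
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        return value != mean
    return abs(value - mean) > tolerance * spread


# Hypothetical recent values for a metric that normally sits around 100.
recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 103))  # prints: False
print(is_anomalous(recent, 150))  # prints: True
```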

Datadog also offers “outlier monitoring”, which detects when a specific member of a group (e.g. hosts, availability zones, partitions) is behaving unusually compared to the rest. It is useful for noticing when a group that should behave uniformly isn’t doing so.
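One simple way to picture outlier detection is to compare each member with the group’s median, as in the sketch below. The approach (median absolute deviation), the `find_outliers` function and the host names are assumptions chosen for illustration, not a description of what Datadog does internally.

```python
import statistics
from typing import Dict, List


def find_outliers(values_by_member: Dict[str, float], tolerance: float = 3.0) -> List[str]:
    """Return members whose value is far from the group median."""
    values = list(values_by_member.values())
    median = statistics.median(values)
    # Median absolute deviation: the typical distance from the median.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [name for name, v in values_by_member.items() if v != median]
    return [
        name for name, v in values_by_member.items()
        if abs(v - median) > tolerance * mad
    ]


# Hypothetical latency (ms) per host in a group that should behave uniformly.
latency_ms = {"web-1": 200, "web-2": 220, "web-3": 190, "web-4": 210, "web-5": 950}
print(find_outliers(latency_ms))
# prints: ['web-5']
```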

Note: A summary of Datadog’s various detection methods can be found here.

Send critical and noncritical alarms to different channels

At BuzzFeed I work in the software infrastructure team and there we have two separate Slack channels for handling monitoring notifications:

#oo-infra-alarms
#oo-infra-warnings

Only alarms that require immediate review (such as a 5xx monitor) go into #oo-infra-alarms. Everything else is sent to #oo-infra-warnings because, although important, those notifications aren’t surfacing immediately as user issues.

If you don’t do this, then you’ll find people become fatigued by the sheer amount of noise coming from multiple alarms. Especially alarms that are firing due to unactionable anomalies.

To quote (again) Charity Majors…

In the chaotic future we’re all hurtling toward, you need the discipline to have radically fewer paging alerts - not more.

You should also consider sending a monitor’s “warning” state to a different channel for similar reasons. You can define different channels in Datadog using the following template code:

{{#is_alert}}
@slack-my-channel-for-serious-alarms
{{/is_alert}}

{{#is_warning}}
@slack-my-channel-for-just-warnings-to-keep-an-eye-on
{{/is_warning}}

Give context

When a monitor triggers an alarm and you’re on-call that night, you might be unfamiliar with the system and its dependencies. One quick way to help people on-call is to provide them with additional context about the alarm and the affected system.

A general message template could look something like the following…

<Alarm Title>
<Alarm Summary>
<Concise service description>
<Monitoring links>

An example might look something like:

Foo-Service 5xx

Foo-Service is serving a high number of 5xx responses, these will not be cached at CDN, possibly resulting in further cache misses.

Foo-Service serves responsive article pages for the BuzzFeed website (www.buzzfeed.com) and is fronted by a CDN. It has an upstream dependency on Redis and the internal Buzz API v2.

Please refer to the monitoring for more details:

- Logs
- Dashboard
- Timeboard
- Request Breakdown

Alarms highlight the symptom and not the cause. So if at all possible, try to include information or data that might aid the on-call person in identifying the root cause of the issue.
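If you manage many monitors, the message template above can be generated rather than hand-written. The sketch below is one hypothetical way to do that; the `AlarmContext` fields mirror the template, and the URLs are placeholders rather than real links.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AlarmContext:
    title: str
    summary: str
    service_description: str
    monitoring_links: Dict[str, str] = field(default_factory=dict)


def render_alarm_message(ctx: AlarmContext) -> str:
    """Render the title, summary, description and links as a plain-text message."""
    lines = [
        ctx.title, "",
        ctx.summary, "",
        ctx.service_description, "",
        "Please refer to the monitoring for more details:", "",
    ]
    lines.extend(f"- {label}: {url}" for label, url in ctx.monitoring_links.items())
    return "\n".join(lines)


message = render_alarm_message(AlarmContext(
    title="Foo-Service 5xx",
    summary="Foo-Service is serving a high number of 5xx responses.",
    service_description="Foo-Service serves responsive article pages and is fronted by a CDN.",
    # Placeholder URLs: substitute the real log and dashboard links.
    monitoring_links={"Logs": "https://example.com/logs",
                      "Dashboard": "https://example.com/dashboard"},
))
print(message)
```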

Think about data aggregation

When dealing with TSDBs (time series databases) you’ll find they will start aggregating multiple data points into a single data point. This is known as the “roll up” effect.

Note: if you weren’t aware, a TSDB is made up of key/value pairs. The key is the timestamp and the value is the metric value at that point in time.

The problem with ‘rolling up’ data is that it smooths out your data points. Meaning you shift from a graph that has lots of spikes (i.e. a graph that shows every possible false positive), to a graph that covers up those false positive spikes.

On the plus side, rolling up data like this means you get to see the data at a higher level and so patterns of use start to emerge.

There have been examples in the past where important spikes in CPU/memory were not captured because aggregation smoothed them out. It can therefore be useful to look at your data closely (at first), and in some instances to override the default aggregation using Datadog’s .rollup() method.
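The smoothing effect is easy to reproduce. The sketch below is not Datadog’s .rollup() implementation; the `rollup` function and the CPU values are made up to show how the choice of aggregation decides whether a spike survives.

```python
import statistics
from typing import Callable, List, Sequence


def rollup(points: Sequence[float], bucket_size: int,
           aggregate: Callable[[Sequence[float]], float]) -> List[float]:
    """Collapse every bucket_size consecutive points into one aggregated point."""
    return [
        aggregate(points[i:i + bucket_size])
        for i in range(0, len(points), bucket_size)
    ]


# Hypothetical CPU readings (percent) with a single short spike to 95.
cpu = [20, 22, 21, 95, 23, 20, 19, 21, 22, 20, 21, 20]

print(rollup(cpu, 4, statistics.fmean))  # prints: [39.5, 20.75, 20.75]
print(rollup(cpu, 4, max))               # prints: [95, 23, 22]
```

Averaging turns the spike into an unremarkable 39.5, whereas taking the maximum of each bucket keeps it visible at the cost of hiding typical behaviour.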

Ultimately, you’ll need to find the balance that works best for you and your monitoring requirements.

Note: you can read more about this smoothing out process here, as well as the .rollup() method Datadog provides to allow you to control this behaviour.

Know your graphs

We won’t repeat the details here, but suffice to say, each graph in Datadog has a purpose and specialised use case. You should review Datadog’s articles on the various graphs they offer and the whys/when of using them:

UPDATE

I’ve since written a blog post about statistics (aimed at beginners). Understanding the basics of statistics will help you understand more clearly what certain graphs represent and how.

You can read the post here: Statistics and Graphs: The Basics

Map your graphs

It can be useful to order your graphs (within a dashboard/timeboard) chronologically. For example, CDN -> LB -> Service. This can help you mentally model the request flow through your system: the request starts by hitting your CDN layer, it’s then routed inside of your infrastructure and hits a load balancer, and finally that load balancer distributes the request to a specific service node.

It can equally be useful to collate multiple services (and their graphs) within a single overarching dashboard, because when there is a problem in the system you can follow the request flow from start to finish and see where a bottleneck (or anomaly) in one part of the chain is causing a side effect in another.

An alternative approach is to have a dashboard that focuses on the key metrics for a service’s performance, with graphs underneath that monitor the service’s dependencies. So when an engineer gets a call because of an issue that seems to be with their service, they’ll check the dashboard and might see that the issue actually lies with one of their dependencies.

Some companies even take that approach a step further and formalize this process and subsequently define a standardized structure for dashboards (i.e. all dashboards are structurally the same). The benefit of that approach is that people on-call can start at the beginning of a request and then follow the dashboards like a thread until they reach a service that is the root cause of the problem being reported.

Choosing between a metric or log

In order to help individual teams identify whether they should collect data as a metric or as a log, one recommended approach is to ask the following questions:

Is it easier for your team to think about metrics or logs?
Is it more effective for the thing in question to be a log entry or metric?
What questions do you typically ask when debugging?

We can’t answer these questions for you, but we have generally found the following approach works reasonably well as a generic pattern (YMMV)…

Collect an exception as both a log and a metric.
- The log helps add additional context not available in the metric.
- The metric helps with monitoring and triggering alarms.
Log only errors/exceptions.
- “No news is good news”.
- Control other log calls using log levels (so they can be enabled when necessary).
Include unique identifiers with your logs
- This helps to quickly figure out what the log is possibly associated with when looking from a centralized distributed log system which contains many logs aggregated from many distinct services.
Mostly everything else we’ll record as a metric so we can monitor pattern changes.
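The pattern above might look something like the following sketch. The `handle_request` wrapper and the in-memory `error_counts` counter are hypothetical stand-ins; in a real service the counter would be a metric sent to your monitoring tool.

```python
import logging
import uuid
from collections import Counter
from typing import Any, Callable, Optional

logger = logging.getLogger("foo-service")

# Stand-in for a real metrics client: counts exceptions by type.
error_counts: Counter = Counter()


def handle_request(handler: Callable[[], Any], request_id: Optional[str] = None) -> Any:
    """Run a handler, recording any exception as both a metric and a log."""
    request_id = request_id or uuid.uuid4().hex
    try:
        return handler()
    except Exception as exc:
        # Metric: cheap to aggregate and suitable for triggering alarms.
        error_counts[type(exc).__name__] += 1
        # Log: carries the traceback plus a unique identifier for correlation.
        logger.exception("request failed", extra={"request_id": request_id})
        return None


handle_request(lambda: 1 / 0, request_id="req-123")
print(error_counts["ZeroDivisionError"])
# prints: 1
```

Nothing is logged on the success path, in keeping with “no news is good news”.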

Other useful tips

Datadog tags are useful for splitting metrics by type (e.g. status codes).
Datadog events are useful for capturing additional info (e.g. exception message).
99% of the time you want a Timeboard, not a Screenboard.
- Timeboards allow for tracking data points across multiple graphs at once.
Let people know where the dashboards are (e.g. pinned in Slack, linked from runbooks etc).
For latency use the 95th percentile, not just the ‘mean average’.
- Because the mean can miss important slow requests.
Load balancer metrics can also be useful to monitor (especially if service is falling over).