Profiling Go ⋆ Mark McDonnell

Note: I highly recommend also reading this official diagnostics documentation.

Memory Management

Before we dive into the techniques and tools available for profiling Go applications, we should first understand a little bit about its memory model as this can help us to understand what it is we’re seeing in relation to memory consumption.

Go’s implementation is a parallel mark-and-sweep garbage collector. In the traditional mark-and-sweep model, the garbage collector would stop the program from running (i.e. “stop the world”) while it detects unreachable objects and again while it clears them (i.e. deallocates the memory). This is to prevent complications where the running program could end up moving references around during the identification/clean-up phase. This would also cause latency and other issues for users of the program while the GC ran. With Go the GC is executed concurrently, so users don’t notice pauses or delays even though the GC is running.

Types of Profiling

There are a couple of approaches available to us for monitoring performance…

Timers: useful for benchmarking, as well as comparing before and after fixes.
Profilers: useful for high-level verification.

Tools Matrix

	Pros	Cons
ReadMemStats	- Simple, quick and easy. - Only details memory usage.	- Requires code change.
pprof	- Details CPU and Memory. - Remote analysis possible. - Image generation.	- Requires code change. - More complicated API.
trace	- Helps analyse data over time. - Powerful debugging UI. - Visualise problem area easily.	- Requires code change. - UI is complex. - Takes time to understand.

Analysis Steps

Regardless of the tool you use for analysis, a general rule of thumb is to:

Identify a bottleneck at a high-level

For example, you might notice a long running function call.

Reduce the operations

Look at time spent, or number of calls, and figure out an alternative approach.
Look at the number of memory allocations, figure out an alternative approach.

Drill down

Use a tool that gives you data at a lower-level.

Think about more performant algorithms or data structures.
There may also be simpler solutions.
Take a pragmatic look at your code.

Base Example

Let’s begin with a simple program written using Go 1.9.2…

package main

import (
    "log"
)

// bigBytes allocates 100 megabytes
func bigBytes() *[]byte {
    s := make([]byte, 100000000)
    return &s
}

func main() {
    for i := 0; i < 10; i++ {
        s := bigBytes()
        if s == nil {
            log.Println("oh noes")
        }
    }
}

Running this program can take ~0.2 seconds to execute.

So this isn’t a slow program, we’re just using it as a base to measure memory consumption.

ReadMemStats

The easiest way to look what our application is doing with regards to memory allocation is by utilising the MemStats from the runtime package.

In the following snippet we modify the main function to print out specific memory statistics.

func main() {
    var mem runtime.MemStats

    fmt.Println("memory baseline...")

    runtime.ReadMemStats(&mem)
    log.Println(mem.Alloc)
    log.Println(mem.TotalAlloc)
    log.Println(mem.HeapAlloc)
    log.Println(mem.HeapSys)

    for i := 0; i < 10; i++ {
        s := bigBytes()
        if s == nil {
            log.Println("oh noes")
        }
    }

    fmt.Println("memory comparison...")

    runtime.ReadMemStats(&mem)
    log.Println(mem.Alloc)
    log.Println(mem.TotalAlloc)
    log.Println(mem.HeapAlloc)
    log.Println(mem.HeapSys)
}

If I run this program I’ll see the following output:

memory baseline…

2017/10/29 08:51:56 56480
2017/10/29 08:51:56 56480
2017/10/29 08:51:56 56480
2017/10/29 08:51:56 786432

memory comparison...

2017/10/29 08:51:56 200074312
2017/10/29 08:51:56 1000144520
2017/10/29 08:51:56 200074312
2017/10/29 08:51:56 200704000

So we can see the difference between what the go application was using at the point in time that the main function started and when it finished (after allocating a lot of memory via the bigBytes function). The two items we’re most interested in are TotalAlloc and HeapAlloc.

The total allocations shows us the total amount of memory accumulated (this value doesn’t decrease as memory is freed). Whereas the heap allocations indicate the amount of memory at the point in time when the snapshot was taken, and it can include both reachable and unreachable objects (e.g. objects the garbage collector hasn’t freed yet). So it’s important to realise that the amount of memory ‘in use’ could have dropped after the snapshot was taken.

Take a look at the MemStats docs for more properties (inc. Mallocs or Frees).

Pprof

Pprof is a tool for visualization and analysis of profiling data.
It’s useful for identifying where your application is spending its time (CPU and memory).

You can install it using:
go get github.com/google/pprof

To begin with, let’s understand what a “profile” is:

A Profile is a collection of stack traces showing the call sequences that led to instances of a particular event, such as allocation. Packages can create and maintain their own profiles; the most common use is for tracking resources that must be explicitly closed, such as files or network connections. – pkg/runtime/pprof

Now there are a couple of ways to use this tool:

Instrument code within binary to generate a .profile for analysis during development.
Remotely analyse binary via a web server (no .profile is explicitly generated).

Note: the profile file doesn’t have to use a .profile extension (it can be whatever you like)

Generate .profile for analysis during development

In this section we’ll look at profiling both CPU and Memory allocation.
We’ll start with CPU profiling.

CPU Analysis

In the following example we’ve modified the application to import "runtime/pprof" and added the relevant API calls in order to record CPU data:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// bigBytes allocates 10 sets of 100 megabytes
func bigBytes() *[]byte {
	s := make([]byte, 100000000)
	return &s
}

func main() {
	pprof.StartCPUProfile(os.Stdout)
	defer pprof.StopCPUProfile()

	for i := 0; i < 10; i++ {
		s := bigBytes()
		if s == nil {
			log.Println("oh noes")
		}
	}
}

Note: we use os.Stdout to make the example easier (i.e. no need to create a file) We’ll just use the shell’s ability to redirect output to create the profile file instead.

We can then build and run the application and save the profile data to a file:
go build -o app && time ./app > cpu.profile

Finally we can inspect the data interactively using go tool like so:
go tool pprof cpu.profile

From here you’ll see an interactive prompt has started up.
So let’s execute the top command and see what output we get:

(pprof) top
Showing nodes accounting for 180ms, 100% of 180ms total
      flat  flat%   sum%        cum   cum%
     180ms   100%   100%      180ms   100%  runtime.memclrNoHeapPointers /.../src/runtime/memclr_amd64.s
         0     0%   100%      180ms   100%  main.bigBytes /.../code/go/profiling/main.go (inline)
         0     0%   100%      180ms   100%  main.main /.../code/go/profiling/main.go
         0     0%   100%      180ms   100%  runtime.(*mheap).alloc /.../src/runtime/mheap.go
         0     0%   100%      180ms   100%  runtime.largeAlloc /.../src/runtime/malloc.go
         0     0%   100%      180ms   100%  runtime.main /.../src/runtime/proc.go
         0     0%   100%      180ms   100%  runtime.makeslice /.../src/runtime/slice.go
         0     0%   100%      180ms   100%  runtime.mallocgc /.../src/runtime/malloc.go
         0     0%   100%      180ms   100%  runtime.mallocgc.func1 /.../src/runtime/malloc.go
         0     0%   100%      180ms   100%  runtime.systemstack /.../src/runtime/asm_amd64.s

So this suggests the biggest CPU consumer is runtime.memclrNoHeapPointers.

Let’s now view a “line by line” breakdown to see if we can pinpoint the CPU usage further.

We’ll do this by using the list <function regex> command.
We can see from the top command that our main function is available via main.main.

So let’s list any functions within the main namespace:

(pprof) list main\.
Total: 180ms
ROUTINE ======================== main.bigBytes in /.../go/profiling/main.go
         0      180ms (flat, cum)   100% of Total
         .          .      6:   "runtime/pprof"
         .          .      7:)
         .          .      8:
         .          .      9:// bigBytes allocates 10 sets of 100 megabytes
         .          .     10:func bigBytes() *[]byte {
         .      180ms     11:   s := make([]byte, 100000000)
         .          .     12:   return &s
         .          .     13:}
         .          .     14:
         .          .     15:func main() {
         .          .     16:   pprof.StartCPUProfile(os.Stdout)
ROUTINE ======================== main.main in /.../code/go/profiling/main.go
         0      180ms (flat, cum)   100% of Total
         .          .     15:func main() {
         .          .     16:   pprof.StartCPUProfile(os.Stdout)
         .          .     17:   defer pprof.StopCPUProfile()
         .          .     18:
         .          .     19:   for i := 0; i < 10; i++ {
         .      180ms     20:           s := bigBytes()
         .          .     21:           if s == nil {
         .          .     22:                   log.Println("oh noes")
         .          .     23:           }
         .          .     24:   }
         .          .     25:}

OK, so yes we can see that 180ms is spent in the bigBytes function and pretty much all of that function’s time is spent allocating memory via make([]byte, 100000000).

Memory Analysis

Before we move on, let’s look at how to profile our memory consumption.

To do this we’ll change our application slightly so that StartCPUProfile becomes WriteHeapProfile (we’ll also move this call to the bottom of our main function otherwise if we keep it at the top of the function no memory has been allocated at that point). We’ll also remove the StopCPUProfile call altogether (as recording the heap is done as a snapshot rather than an ongoing process like with the CPU profiling):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// bigBytes allocates 10 sets of 100 megabytes
func bigBytes() *[]byte {
	s := make([]byte, 100000000)
	return &s
}

func main() {
	for i := 0; i < 10; i++ {
		s := bigBytes()
		if s == nil {
			log.Println("oh noes")
		}
	}

	pprof.WriteHeapProfile(os.Stdout)
}

Again, we’ll build and execute the application and redirect the stdout to a file (for simplicity), but you could have created the file dynamically within your application if you so choose:
go build -o app && time ./app > memory.profile

At this point we can now run pprof to interactively inspect the memory profile data:
go tool pprof memory.profile

Let’s run the top command and see what the output is:

(pprof) top
Showing nodes accounting for 95.38MB, 100% of 95.38MB total
      flat  flat%   sum%        cum   cum%
   95.38MB   100%   100%    95.38MB   100%  main.bigBytes /...ain.go (inline)
         0     0%   100%    95.38MB   100%  main.main /.../profiling/main.go
         0     0%   100%    95.38MB   100%  runtime.main /.../runtime/proc.go

For a simple example application like we’re using, this is fine as it indicates pretty clearly which function is responsible for the majority of the memory allocation (main.bigBytes).

But if I wanted a more specific breakdown of the data I would execute list main.:

(pprof) list main.
Total: 95.38MB
ROUTINE ======================== main.bigBytes in /.../go/profiling/main.go
   95.38MB    95.38MB (flat, cum)   100% of Total
         .          .      6:   "runtime/pprof"
         .          .      7:)
         .          .      8:
         .          .      9:// bigBytes allocates 10 sets of 100 megabytes
         .          .     10:func bigBytes() *[]byte {
   95.38MB    95.38MB     11:   s := make([]byte, 100000000)
         .          .     12:   return &s
         .          .     13:}
         .          .     14:
         .          .     15:func main() {
         .          .     16:   for i := 0; i < 10; i++ {
ROUTINE ======================== main.main in /.../code/go/profiling/main.go
         0    95.38MB (flat, cum)   100% of Total
         .          .     12:   return &s
         .          .     13:}
         .          .     14:
         .          .     15:func main() {
         .          .     16:   for i := 0; i < 10; i++ {
         .    95.38MB     17:           s := bigBytes()
         .          .     18:           if s == nil {
         .          .     19:                   log.Println("oh noes")
         .          .     20:           }
         .          .     21:   }
         .          .     22:

This indicates where all our memory is allocated on a “line-by-line” basis.

Remotely analyse via web server

In the following example we’ve modified the application to start up a web server and we’ve imported the "net/http/pprof" package which automatically profiles what’s happening.

Note: if your application already uses a web server, then you don’t need to start another. The pprof package will hook into your web server’s multiplexer.

package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof"
	"sync"
)

// bigBytes allocates 10 sets of 100 megabytes
func bigBytes() *[]byte {
	s := make([]byte, 100000000)
	return &s
}

func main() {
	var wg sync.WaitGroup

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	for i := 0; i < 10; i++ {
		s := bigBytes()
		if s == nil {
			log.Println("oh noes")
		}
	}

	wg.Add(1)
	wg.Wait() // this is for the benefit of the pprof server analysis
}

When you build and run this binary, first visit the web server the code is running.
You’ll find that the pprof data is accessible via the path /debug/pprof/:
http://localhost:6060/debug/pprof/

You should see something like:

profiles:
0	block
4	goroutine
5	heap
0	mutex
7	threadcreate

full goroutine stack dump

/debug/pprof/

Where block, goroutine, heap, mutex and threadcreate are all links off to the recorded data; and by ‘recorded data’ I mean they each link through to a different .profile. This isn’t particularly useful though. A tool is needed to process these .profile data files.

We’ll come back to that in a moment, first let’s understand what these five profiles represent:

block: stack traces that led to blocking on synchronization primitives
goroutine: stack traces of all current goroutines
heap: a sampling of all heap allocations
mutex: stack traces of holders of contended mutexes
threadcreate: stack traces that led to the creation of new OS threads

The web server can also generate a 30 second CPU profile, which you can access via http://localhost:6060/debug/pprof/profile (it won’t be viewable in the browser, instead it’ll be downloaded to your file system).

The CPU profile endpoint is not listed when viewing /debug/pprof/ simply because the CPU profile has a special API, the StartCPUProfile and StopCPUProfile functions that stream output to a writer during profiling, hence this hidden endpoint will ultimately download the results to your file system (we looked at how to use this API in the previous section).

The web server can also generate a “trace” file, which you can access via http://localhost:6060/debug/pprof/trace?seconds=5 (again, it’s not listed for similar reasons as the CPU profile - in that it generates file output that is downloaded to your file system). This trace file out requires the use of go tool trace (which we’ll cover in the next section).

Note: more info on pprof options can be found here: golang.org/pkg/net/http/pprof/

If you’re using a custom URL router, you’ll need to register the individual pprof endpoints:

package main

import (
    "net/http"
    "net/http/pprof"
)

func message(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("Hello World"))
}

func main() {
    r := http.NewServeMux()
    r.HandleFunc("/", message)

    r.HandleFunc("/debug/pprof/", pprof.Index)
    r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    r.HandleFunc("/debug/pprof/profile", pprof.Profile)
    r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    r.HandleFunc("/debug/pprof/trace", pprof.Trace)

    http.ListenAndServe(":8080", r)
}

So ideally you would use go tool pprof on the command line.
As this allows you to more easily interpret and interrogate the data interactively.

To do this, you have to run your binary as before and then, in a separate shell, execute:
go tool pprof http://localhost:6060/debug/pprof/<.profile>

For example, let’s look at the memory heap profile data:
go tool pprof http://localhost:6060/debug/pprof/heap

From here you’ll see an interactive prompt has started up:

Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /.../pprof.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
Type: inuse_space
Time: Oct 27, 2017 at 10:01am (BST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Note: you’ll see the “type” is set to inuse_space (meaning how much memory is still in use)

As shown, you can type either help or o to see what’s available to use.

Here are a couple of useful commands:

top: outputs top entries in text form
topK: where K is a number (e.g. top2 would show top two entries)
list <function regex>: outputs top entries in text form

For example, if I execute top then we’d see the following output:

Showing nodes accounting for 95.38MB, 100% of 95.38MB total
      flat  flat%   sum%        cum   cum%
   95.38MB   100%   100%    95.38MB   100%  main.bigBytes /...ain.go (inline)
         0     0%   100%    95.38MB   100%  main.main /.../profiling/main.go
         0     0%   100%    95.38MB   100%  runtime.main /.../runtime/proc.go

For a simple example application like we’re using, this is fine as it indicates pretty clearly which function is responsible for the majority of the memory allocation (main.bigBytes).

But if I wanted a more specific breakdown of the data I would execute list main.main:

Total: 95.38MB
ROUTINE ======================== main.main in /.../profiling/main.go
   95.38MB    95.38MB (flat, cum)   100% of Total
         .          .      8:   "sync"
         .          .      9:)
         .          .     10:
         .          .     11:// bigBytes allocates 10 sets of 100 megabytes
         .          .     12:func bigBytes() *[]byte {
   95.38MB    95.38MB     13:   s := make([]byte, 100000000)
         .          .     14:   return &s
         .          .     15:}
         .          .     16:
         .          .     17:func main() {
         .          .     18:   fmt.Println("starting...")

This indicates where all our memory is allocated on a “line-by-line” basis.

I noted earlier that the default “type” for the heap analysis was “memory still in use”. But there is an alternative type which indicates the amount of memory that was allocated in total throughout the lifetime of the program. You can switch to that mode using the -alloc_space flag like so:

go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap

Let’s see the difference in the output by executing the list command:

(pprof) list main.bigBytes

Total: 954.63MB
ROUTINE ======================== main.bigBytes in /.../go/profiling/main.go
  953.75MB   953.75MB (flat, cum) 99.91% of Total
         .          .      7:   "sync"
         .          .      8:)
         .          .      9:
         .          .     10:// bigBytes allocates 10 sets of 100 megabytes
         .          .     11:func bigBytes() *[]byte {
  953.75MB   953.75MB     12:   s := make([]byte, 100000000)
         .          .     13:   return &s
         .          .     14:}
         .          .     15:
         .          .     16:func main() {
         .          .     17:   var wg sync.WaitGroup

Note: if you wanted to be explicit you could have used the “in use” type like so:
go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap

The reason to choose either -inuse_space or -alloc_space will depend on where your specific concerns are focused. For example, if you’re concerned about garbage collection performance then you’ll want to look at the “allocated” memory (i.e. -alloc_space).

Note: you can also inspect the number of objects (not just their space) with -inuse_objects and -alloc_objects.

Image Generation

You can also generate an image of your analysis data using either the flag -png, -gif or -svg and then redirecting stdout to a filename like so:

go tool pprof -png http://localhost:6060/debug/pprof/heap > data.png

This generates an image that looks like the following (notice how the bigger the box, the more resources it’s consuming - this helps you ‘at a glance’ to identify a potential problem zone):

Note: you can also output as a PDF with -pdf.

Web UI

Finally, there is a interactive web ui coming for pprof in the near future (as of November 2017).

See this post for more details.

But in short you can get the updated pprof tool from GitHub and then execute it with the new flag -http (e.g. -http=:8080).

Trace

Trace is a tool for visualization and analysis of trace data.
It’s suited at finding out what your program is doing over time, not in aggregate.

Note: if you want to track down slow functions, or generally find where your program is spending most of its CPU time, then you should consider using go tool pprof instead.

Let’s first modify our application to utilise tracing…

func main() {
	trace.Start(os.Stdout)
	defer trace.Stop()

	for i := 0; i < 10; i++ {
		s := bigBytes()
		if s == nil {
			log.Println("oh noes")
		}
	}

	var wg sync.WaitGroup
	wg.Add(1)

	var result []byte
	go func() {
		result = make([]byte, 500000000)
		log.Println("done here")
		wg.Done()
	}()

	wg.Wait()
	log.Printf("%T", result)
}

So, as far as utilising tracing, all we’ve done is import "runtime/trace" and then added calls to the trace.Start and trace.Stop functions (we defer the trace.Stop in order to ensure we trace everything our application is doing).

Additionally we create a goroutine and allocate a large 500mb slice of bytes. We wait for the goroutine to complete and then we log the type of the result. We’re doing this just so we have some additional spike data to visualise.

Now let’s re-compile our application, generate the trace data and open it with the trace tool…

$ go build -o app
$ time ./app > app.trace
$ go tool trace app.trace

Note: you can also generate a pprof compatible file from a trace by using the -pprof flag (if you decided you wanted to dynamically inspect the data that way). See the go documentation for more details.

Here’s the output from running go tool trace app.trace:

2017/10/29 09:30:40 Parsing trace...
2017/10/29 09:30:40 Serializing trace...
2017/10/29 09:30:40 Splitting trace...
2017/10/29 09:30:40 Opening browser

You’ll now see your default web browser should have automatically opened to:
http://127.0.0.1:60331

Note: it’s best to use Chrome, as go tool trace is designed to work best with it.

The page that is loaded will show the following list of links:

View trace
Goroutine analysis
Network blocking profile
Synchronization blocking profile
Syscall blocking profile
Scheduler latency profile

Each of these items can give a good insight as to what your application is doing, but we’re most interested in the first one “view trace”. Click it and it’ll give you a complete overview of what your application is doing in an interactive graph.

Note: press <Shift-?> to show shortcut keys, like w and s for zooming in/out.

Goroutines

If you zoom in enough on the graph you’ll see the “goroutines” segment is made up of two colours: a light green (runnable goroutines) and a dark green (running goroutines). If you click on that part of the graph you’ll see the details of that sample in the bottom preview of the screen. It’s interesting to see how, at any given moment, there can be multiple goroutines but not all of them are necessarily running all at once.

So in our example you see the program moves between having one goroutine ready to run, but not actually running (e.g. it’s “runnable”) and then we move towards two goroutines running (e.g. they’re both “running” and so there are no goroutines left marked as “runnable”).

It’s also interesting to see the correlation between the number of goroutines running and the number of actual underlying OS threads being utilised (i.e. the number of threads the goroutines are being scheduled on to).

Threads

Again, if you zoom in enough on the graph you’ll see the “threads” segment is made up of two colours: a light purple (syscalls) and a dark purple (running threads).

What’s interesting about the “heap” segment of the UI is that we can see for a short while we never allocate more (total) than 100mb to the heap because the go garbage collection is running concurrently (we can see it running on various processes/threads) and is clearing up after us.

This makes sense because in our code we allocate 100mb of memory and then assign it to a variable s which is scoped to exist only within the for loop block. Once that loop iteration ends the s value isn’t referenced anywhere else so the GC knows it can clean up that memory.

Heap

We can see as the program moves on that we eventually start seeing some contention and so the total allocated heap becomes 200mb then back and forth between 100mb and 200mb (this is because the GC isn’t running all the time). Then near the end we see the 500mb spike that I added to our code as the total allocated amount of memory in the heap shoots to 600mb.

But at that point, if we click on the spike in heap allocation, we can see in the bottom preview window the “NextGC” run indicates that the total allocated should be zero (which makes sense as that’s the end of the program).

Procs

In the “procs” section of the UI we can see during our large 500mb allocation that Proc 3 has a new goroutine running main.main.func1 (which is our go function that’s doing the allocating).

If you were to “View Options” and then tick “Flow events” you’ll see an arrow from the main.main function going towards the main.main.func1 function running on a separate process/thread (probably a bit difficult to see in the below image, but it’s definitely there).

So it’s good to be able to visually see the correlation between first of all the main.main.func1 goroutine running and the memory allocation occurring at that time, but also being able to see the cause and effect (i.e. the flow) of the program (i.e. knowing what exactly triggered the new goroutine to be spun up).

Conclusion

That’s our tour of various tools for profiling your Go code. Take a look at my earlier article on profiling Python for more of the same.