go tool pprof
is a profiling tool built into Go. It is useful for understanding hotspots in your code. You can profile a running cockroach binary, or a specific Go benchmark.
Basics
This blog post will get you started. This blog post also covers mutex profiles, which are relatively new (and thus not covered in the first post).
...
No profile will give you a smoking gun, pointing you to a particular spot that, when fixed, will make your program magically faster. Instead, the different profiles give you visibility from different angles into a high-dimensional space. You need to think carefully about what you need to measure, formulate your own hypotheses, and use the appropriate profiles to test them.
Pulling profiles from a running Cockroach server
A running cockroach node can directly serve pprof-compatible profiles. These are linked from the Admin UI debug page (http://localhost:8080/#/debug, replacing localhost with the address of a remote cluster). Note that on a multi-node cluster this gives you the profile from the node you talk to, which may or may not be the node(s) you care about. You can also pull a profile directly with go tool pprof ./cockroach http://localhost:8080/debug/pprof/heap?debug=1 (the URLs for the various profile types are listed on the main debug page).
Profiles benefit from optionally specifying the binary being profiled, and this binary must be exact: it has to be the same one that is running on the cluster at the URL, so if you are pointing pprof at a remote Linux machine from your local OS X machine, first copy over the Linux binary. If you also have the exact same revision checked out in your $GOPATH, you will see accurate source code lines, which can be very useful. Be careful about this, because pprof won't give you an error message if the binary doesn't match; it will just happily show you incorrect symbols and you will be scratching your head about why some completely unconnected innocent function is taking up 80% of your CPU.
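For example, here is a minimal sketch of pulling a CPU profile from a remote node while pointing pprof at the matching binary (the host name and binary path are placeholders, and this assumes the standard net/http/pprof CPU endpoint at /debug/pprof/profile):

```
# Copy the exact binary the remote node is running so that symbols line up.
scp user@remote-node:/path/to/cockroach ./cockroach-remote

# Point pprof at that binary and at the node's profile endpoint.
go tool pprof ./cockroach-remote 'http://remote-node:8080/debug/pprof/profile'
```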
Pulling a profile from a Go benchmark
Go benchmarks are nice because they let you isolate a very specific code path, as opposed to wading through the entire running Cockroach system. If you have a benchmark Foo
in package bar
, you can run:
...
Then you can run go tool pprof /tmp/cpu.out to jump into a CPU profile (or whatever other profile you are looking for) of that benchmark run. There is also a -memprofile flag for allocations.
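As a sketch (the package path pkg/bar is a stand-in; -run, -bench, -benchmem, -cpuprofile and -memprofile are standard go test flags), a typical invocation looks something like:

```
# Run only the Foo benchmark in package bar, skipping tests,
# and write CPU and memory profiles to /tmp.
go test -run=- -bench=Foo -benchmem \
    -cpuprofile=/tmp/cpu.out -memprofile=/tmp/mem.out ./pkg/bar

# Inspect either profile.
go tool pprof /tmp/cpu.out
go tool pprof /tmp/mem.out
```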
Which profile to use when?
The following profiles exist:
CPU profile
shows you the functions that use the most CPU time. Strictly speaking, it periodically samples the callstack of the running program, and populates a graph of function calls, with percentages indicating what fraction of the samples were "in" each of the functions.
Unlike all other profile types, CPU profiles sample over a window of time instead of taking a point-in-time snapshot. The default profiling duration is 30 seconds, but it is possible to adjust this by adding the seconds=n
query param to the debug URL.
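For example (assuming the standard net/http/pprof CPU endpoint at /debug/pprof/profile), a one-minute profile can be pulled with:

```
go tool pprof ./cockroach 'http://localhost:8080/debug/pprof/profile?seconds=60'
```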
Heap profile
There are four kinds of heap profiles for allocations made by Go code (for allocations made by C++, see "jemalloc profile" below):
...
In general you probably want to start with inuse over alloc, since CockroachDB uses a lot of memory by design. Lots of small allocations are likely what you're looking to find and minimize (this is inuse_objects), especially if a CPU profile shows that a large chunk of your CPU time is taken up by the garbage collector, or by runtime.growslice or one of the other allocation paths. Of course, if you are profiling a benchmark run that precisely tracks your specific code path from program start, alloc can be more accurate.
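To switch between sample types on the same heap profile, pprof's -sample_index flag can be used; the standard Go heap sample indices are inuse_space, inuse_objects, alloc_space and alloc_objects. A sketch, assuming the standard /debug/pprof/heap endpoint:

```
# Live object counts: good for hunting down lots of small allocations.
go tool pprof -sample_index=inuse_objects ./cockroach 'http://localhost:8080/debug/pprof/heap'

# Total bytes ever allocated: more useful for a benchmark-style run.
go tool pprof -sample_index=alloc_space ./cockroach 'http://localhost:8080/debug/pprof/heap'
```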
Mutex profile
you set a sampling rate (via the environment variable COCKROACH_MUTEX_PROFILE_RATE if you're pulling directly off a Cockroach binary; 0 and 1 are special: 0 disables the mutex profile and 1 captures 100% of mutex contention events, while any other value X samples on average 1 in X events), and then every time a goroutine contends on a mutex, with that probability the time it waited is recorded. This lets you find places in your code where goroutines are waiting. Of course, many times goroutines are waiting by design, but not always, so this one is tricky to use.
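A sketch of how that might look, assuming the standard net/http/pprof endpoint at /debug/pprof/mutex (the rate of 1000 is arbitrary, and the remaining start flags are elided):

```
# Start the node sampling roughly 1 in 1000 mutex contention events.
COCKROACH_MUTEX_PROFILE_RATE=1000 ./cockroach start ...

# Pull the resulting mutex profile.
go tool pprof ./cockroach 'http://localhost:8080/debug/pprof/mutex'
```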
Block profile
like a mutex profile, but on all blocking events (mutexes, condition variables, select statements).
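When block profiling is enabled, it can be pulled the same way as the other profiles (a sketch, assuming the standard net/http/pprof endpoint at /debug/pprof/block):

```
go tool pprof ./cockroach 'http://localhost:8080/debug/pprof/block'
```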
Goroutine profile
shows a snapshot of stack traces for all goroutines running at a given point in time.
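Besides the pprof-format version, the standard endpoint can also dump human-readable stacks directly, which is often all you need:

```
# pprof-format snapshot of all goroutines.
go tool pprof ./cockroach 'http://localhost:8080/debug/pprof/goroutine'

# Plain-text dump of every goroutine's stack, readable without pprof.
curl 'http://localhost:8080/debug/pprof/goroutine?debug=2'
```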
Jemalloc profile
We use jemalloc as our C++ memory allocator, and it comes with a related profiling tool. Use this if you're seeing an unexpected amount of "cgo allocations" in the runtime.go log messages (remember that the RocksDB block cache is expected to use a lot of memory here, up to the limit set on the command line). It's a little harder to use than its Go counterpart: you must start the server with COCKROACH_MEMPROF_INTERVAL=60s MALLOC_CONF=prof:true, and profiles will be written to the log directory every minute. You can then interpret the profiles with jeprof, which can be found in $GOPATH/native/$SOMETHING/jemalloc/bin. Benjamin Darnell is the local jemalloc expert.
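A rough sketch of the workflow (the profile file name under the log directory is made up here, and the remaining start flags are elided):

```
# Start the node with jemalloc profiling; heap profiles are written to the
# log directory every minute.
COCKROACH_MEMPROF_INTERVAL=60s MALLOC_CONF=prof:true ./cockroach start ...

# Interpret one of the dumped profiles against the same binary.
jeprof ./cockroach cockroach-data/logs/jeprof.12345.0.heap
```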
Useful pprof commands
pprof can be frustrating to use and navigate, due to its cryptic error messages and the somewhat arbitrary choice of command or option for each feature. The full list is available via the help command, and eventually you will use a lot more than the ones listed here, but these will get you started:
- web: the most useful command: fires up your browser with an SVG image of the current view of the profile. The options listed below (the ones of the form x=y) manipulate the state of pprof, and subsequent calls to web redraw a new image with the new view.
- pdf: like web, but generates a PDF instead of an SVG. Useful for uploading damning evidence to GitHub, which does not allow SVG uploads.
- top<N>: shows you the top N functions (by whatever metric you are currently profiling). Some people use this, but I find it confusing and prefer to look at the web view.
- list: outputs annotated source for functions matching a regexp. This is great for seeing where within a function resources are being used.
- weblist: like list, but opens in a browser for easier viewing.
- call_tree: by default, pprof draws a directed acyclic graph. This can get a little messy to understand, particularly if you care exactly when a particular hot node in the graph is getting called and it has multiple callers. call_tree draws a tree instead (so child nodes from different callers are separated into different subtrees).
- focus=foo: focus on functions named foo (i.e. draw a root node for each function that matches the regexp foo).
- nodefraction=xyz: by default, pprof tries to intelligently "prune" the graph to show only a subset of nodes so that it stays manageable. Sometimes this is too aggressive, and you want to relax it (and make your browser suffer). Note that on each SVG/PDF, the legend box shows the current threshold (under "Dropped nodes (cum <= xyz)").
- cum: sets pprof into "cumulative" mode. By default pprof starts in flat mode, which shows how much time was spent in the function itself, whereas cum shows how much time was spent in the function including all its children. cum/flat just set the mode, which affects subsequent calls to top<N>, web, etc.
- edgefraction, nodecount: like nodefraction, different ways to prune the graph.
- peek foo: shows info about time spent in callers of foo.
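Putting a few of these together, a typical interactive session might look like this (the function name FooBar is a stand-in):

```
$ go tool pprof ./cockroach 'http://localhost:8080/debug/pprof/profile'
(pprof) top10          # ten hottest functions, flat mode
(pprof) cum            # switch to cumulative mode; affects later commands
(pprof) top10
(pprof) list FooBar    # annotated source for functions matching FooBar
(pprof) web            # render the current view as an SVG in the browser
```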
What's this cgocall that takes up all my CPU?
CockroachDB uses RocksDB. RocksDB is written in C++, and thus all calls to RocksDB go through CGo
, Go's foreign function interface for calling external code written in C. pprof in its default mode does not cross this barrier, and thus all the CGo calls show up as a single block cgocall
. This is not helpful if, for instance, you are benchmarking how to use RocksDB more intelligently.
...
- to make sure your binary is running with the appropriate C++ symbols compiled in, which pprof can use to decipher what lies beyond the CGo barrier.
I'm confused! This is overwhelming!
I was too when I started. Here are some initial tips:
- Start with a cpuprofile. It's the simplest profile.
- Use web a lot. Browsing around a single pprof graph is easier than trying to use the command line.
- Run the program/benchmark with different settings (e.g. pprofing against Cockroach running against a load generator at one setting versus another), look at the graphs in each case, and try to just take in how the graphs look different, rather than trying to navigate pprof over the command line.
- Understand that profiling is difficult precisely because so many hours have already gone into looking at profiles: the low-hanging fruit has mostly been picked! If you're looking for things to optimize, chances are you will not find them in existing code without some deeper digging. Talk to other engineers if you're looking for juicy trails to follow.
See also
This golang blog post explains it extremely well, and this one by Dmitry Vyukov goes into even more detail.