go tool pprof is a profiling tool built into Go. It is useful for understanding hotspots in your code. You can profile a running cockroach binary, or a specific Go benchmark.
This blog post will get you started. This blog post also covers mutex profiles, which are relatively new (and thus not covered in the first blog post).
There are, broadly speaking, three types of profiles that one can examine: CPU profiles, memory profiles, and mutex profiles. CPU profiles show you where CPU time is being spent while your program is running. Memory profiles (which have four subtypes) broadly show you where memory is being used. Mutex profiles show you where code is waiting on a contended mutex.
No profile will give you a smoking gun pointing to a particular spot that, when fixed, will make your program magically faster. Instead, the different profiles give you visibility from different angles into a high-dimensional space. You need to think carefully about what you need to measure, formulate your own hypotheses, and use the appropriate profiles to test them.
Pulling profiles from a running Cockroach server
A running cockroach node can directly serve pprof-compatible profiles. The Admin UI debug page (http://localhost:8080/#/debug, replacing localhost with the URI of a remote cluster) links to them. Do note that on a multi-node cluster this gives you the profile from the node you talk to, which may or may not be the node(s) you care about. You can also fetch them directly with
go tool pprof ./cockroach http://localhost:8080/debug/pprof/heap?debug=1 (the URLs for the various profile types are listed on the debug page).
Profiles benefit from optionally specifying the binary that is being profiled. This binary must be exact: the ./cockroach binary you pass locally must be the same one running on the cluster at the URL, so if you are pointing at a remote Linux machine from your local OS X machine, first copy over the Linux binary. If you also have the exact same revision checked out in your
$GOPATH, you will see accurate source code lines, which can be very useful. Be careful about this, because
pprof won't give you an error message; it will just happily show you incorrect symbols, and you will be scratching your head about why some completely unconnected, innocent function is taking up 80% of your CPU.
Pulling a profile from a go benchmark
Go benchmarks are nice because they let you isolate a very specific code path, as opposed to wading through the entire running Cockroach system. If you have a benchmark
Foo in package
bar, you can run:
make bench PKG=./pkg/bar BENCHES=BenchmarkFoo TESTFLAGS=-cpuprofile=/tmp/cpu.out.
Then you can run go tool pprof /tmp/cpu.out to jump into the CPU profile of that benchmark run (or whatever other profile you asked for). There is also a -memprofile flag for allocations.
Which profile to use when?
The following profiles exist:
cpuprofile: shows you the functions that use the most CPU time. Strictly speaking, it periodically samples the call stack of the running program, and populates a graph of function calls, with percentages indicating what fraction of the samples were "in" each function.
Unlike all other profile types, CPU profiles sample over a window of time instead of taking a point-in-time snapshot. The default profiling duration is 30 seconds, but it is possible to adjust this by adding the
seconds=n query param to the debug URL.
heap profile: there are four kinds of heap profiles for allocations made by Go code (for allocations made by C++, see "jemalloc profile" below):
-inuse_space Display in-use memory size
-inuse_objects Display in-use object counts
-alloc_space Display allocated memory size
-alloc_objects Display allocated object counts
inuse profiles count only allocations that are still live (not yet freed) at the moment the profile is taken;
alloc profiles count all allocations since program start, including memory that has since been freed.
space measures the total size of allocations;
objects counts the total number of objects, regardless of size.
In general you probably want to start with
alloc, since CockroachDB uses a lot of memory by design. Lots of small allocations are likely what you're looking to find and minimize (this is
alloc_objects), especially if a CPU profile shows you that a large chunk of your CPU time is spent in the garbage collector, or in
runtime.growslice or one of the other memory allocators. Of course, if you are profiling a benchmark run that precisely tracks your specific code path from program start,
alloc can be more accurate.
mutex profile: you set a sampling probability (via the environment variable
COCKROACH_MUTEX_PROFILE_RATE if you're pulling directly off a Cockroach binary; 0 and 1 are special: 0 disables the mutex profile and 1 captures 100% of mutex contention events, while for another value X the profiler samples on average 1 in X events), and then every time a goroutine blocks on a mutex, with that probability the profiler records how long it waited. This lets you find places in your code where goroutines are waiting. Of course, many times goroutines are waiting by design, but not always, so this one is tricky to use.
block profile: like a mutex profile, but on all blocking events (mutexes, condition variables, select statements).
goroutine profile: shows a snapshot of the stack traces of all goroutines running at a given point in time.
jemalloc profile: we use jemalloc as our C++ memory allocator, and it includes a related profiling tool. Use this if you're seeing an unexpected amount of "cgo allocations" in the runtime.go log messages (remember that the RocksDB block cache is expected to use a lot of memory here, up to the limit set on the command line). It's a little harder to use than its Go counterpart. You must start the server with
COCKROACH_MEMPROF_INTERVAL=60s MALLOC_CONF=prof:true, and profiles will be written to the log directory every minute. Then you can interpret the profiles with
jeprof, which can be found in
$GOPATH/native/$SOMETHING/jemalloc/bin. Benjamin Darnell is the local jemalloc expert.
pprof can be frustrating to use and navigate, due to its cryptic error messages and the arbitrary choice of command or variable for each feature. The full list of commands is available via
help, and eventually you will use a lot more than the ones listed here, but this will get you started:
What is this cgocall that takes up all my CPU?
CockroachDB uses RocksDB. RocksDB is written in C++, and thus all calls to it go through
cgo, Go's foreign function interface for calling external code written in C. pprof in its default mode does not cross this barrier, and thus all the cgo calls show up as a single block,
cgocall. This is not helpful if, for instance, you are benchmarking how to use RocksDB more intelligently.
To open this black box, you have to:
- run Cockroach/your benchmark on Linux, and
- follow the instructions here to make sure your binary is compiled with the appropriate C++ symbols that pprof can use to decipher what happens beyond the cgo barrier.
I'm confused! This is overwhelming!
I was too when I started. Here are some initial tips:
- Start with a
cpuprofile. It's the simplest profile.
- Use web a lot. Browsing around on a single pprof graph is easier than trying to use the command line.
- Run the program/benchmark with different settings (e.g. pprofing against Cockroach running against a load generator at one setting versus another setting), looking at the graphs in each case, rather than trying to navigate pprof over the command line. Try to just take in how the graphs look different.
- Understand that finding wins through profiling is difficult, because so many engineering hours have already gone into looking at these profiles. The low-hanging fruit has mostly been picked! If you're looking for things to optimize, chances are you will not find them in existing code without some deeper digging. Talk to other engineers if you're looking for juicy trails to follow.
This golang blog post explains it extremely well, and this one by Dmitry Vyukov goes into even more detail.