Note: In the past, Delve had a reputation for panicking at random and falling short of what one expects from a debugger. It has matured since then, and it is now possible to attach to and debug running cockroach clusters, making it a powerful tool to have in your arsenal.
Delve is an extremely useful tool for debugging Go programs. It can attach to a running process given only the process ID. Once attached, you can switch goroutines, jump to a particular stack frame, and inspect state. What follows is a quick tutorial on how to get Delve up and running, attach to a running cockroach process, and look at some state.
This tutorial assumes you have a running cockroach process.
Step 1: Get delve (OSX, Linux):
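Running go get should be the only thing you need to do:
$ go get -u github.com/go-delve/delve/cmd/dlv
With newer Go toolchains (1.18 and later), go get no longer installs binaries; use go install instead:
$ go install github.com/go-delve/delve/cmd/dlv@latest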
Step 2: Attach to process:
Get the process ID of the cockroach node you want to attach to and run:
$ dlv attach <pid>
Note that you must have the source code checked out at the same SHA as the binary you are debugging. Otherwise the file and line information will not match the code you are reading, and the debugging session will not make sense.
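One way to check the SHA requirement above before attaching is to compare the commit reported by the binary with the commit of your checkout. The following is a minimal sketch, not an established workflow. It assumes that cockroach version prints a line labelled "Build Commit ID:" (or "Build SHA-1:" on older releases), and the helper names binary_commit and same_commit are made up for this example.

```shell
#!/bin/sh
# binary_commit: read `cockroach version` output on stdin and print the
# commit hash. ASSUMPTION: the output has a line labelled
# "Build Commit ID:" or, on older releases, "Build SHA-1:".
binary_commit() {
  sed -n -E 's/^Build (Commit ID|SHA-1):[[:space:]]*([0-9a-f]+).*$/\2/p' | head -n 1
}

# same_commit: succeed when one hash is a prefix of the other, so that an
# abbreviated hash still matches a full one. Empty input never matches.
same_commit() {
  a="$1"; b="$2"
  [ -n "$a" ] && [ -n "$b" ] || return 1
  case "$a" in "$b"*) return 0 ;; esac
  case "$b" in "$a"*) return 0 ;; esac
  return 1
}

# Example usage, run from the cockroach checkout:
#   want="$(./cockroach version | binary_commit)"
#   have="$(git rev-parse HEAD)"
#   same_commit "$want" "$have" || echo "checkout $have does not match binary $want" >&2
```

If the two commits differ, check out the binary's commit (git checkout with the printed hash) before running dlv attach.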
Step 3: Do what you came to do:
Now that you are attached, the process is paused. You can jump around and look at things. For example, you can list the currently running goroutines and their IDs:
(dlv) goroutines
[...]
Goroutine 2239 - User: ./pkg/sql/create_stats.go:222 github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startJob (0x58d576a)
[...]
Once you have an interesting goroutine ID (e.g. from stuck goroutine stacks), you can jump to a goroutine:
(dlv) goroutine 2239
Switched from 0 to 2239 (thread 5888500)
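If the interesting goroutine ID is coming from a saved stack dump, the candidates can be pulled out with a small filter before you attach. This is a rough sketch that assumes the dump is in the standard Go text format, where each goroutine starts with a header such as "goroutine 2239 [chan receive, 12 minutes]:". The function name blocked_goroutines and the file name stacks.txt are made up for this example.

```shell
#!/bin/sh
# blocked_goroutines: read a Go stack dump on stdin and print the IDs of
# goroutines whose wait state equals $1 (default: "chan receive").
# Header lines are expected to look like:
#   goroutine 2239 [chan receive, 12 minutes]:
blocked_goroutines() {
  state="${1:-chan receive}"
  awk -v state="$state" '
    $1 == "goroutine" && $2 ~ /^[0-9]+$/ {
      s = $0
      sub(/^[^[]*\[/, "", s)   # drop everything up to and including "["
      sub(/[],].*$/, "", s)    # drop from the first "," or "]" onwards
      if (s == state) print $2
    }'
}

# Example usage (stacks.txt is a hypothetical saved dump):
#   blocked_goroutines "chan receive" < stacks.txt
```

Each printed ID can then be passed to the goroutine command inside dlv.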
Check its stack trace:
(dlv) bt
0  0x00000000040308eb in runtime.gopark
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:303
1  0x0000000004030993 in runtime.goparkunlock
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:308
2  0x00000000040082ae in runtime.chanrecv
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/chan.go:520
3  0x0000000004007f9b in runtime.chanrecv1
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/chan.go:402
4  0x00000000058d576a in github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startJob
   at ./pkg/sql/create_stats.go:222
5  0x0000000005a3f0c3 in github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startExec.func1
   at ./pkg/sql/create_stats.go:76
6  0x000000000405f281 in runtime.goexit
   at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/asm_amd64.s:1333
Jump to an interesting frame:
(dlv) frame 4
> runtime.gopark() /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:303 (PC: 0x40308eb)
Warning: debugging optimized function
Frame 4: ./pkg/sql/create_stats.go:222 (PC: 58d576a)
   217:         Progress: jobspb.CreateStatsProgress{},
   218:     })
   219:     if err != nil {
   220:         return err
   221:     }
=> 222:     return <-errCh
   223: }
   224:
   225: // maxNonIndexCols is the maximum number of non-index columns that we will use
   226: // when choosing a default set of column statistics.
   227: const maxNonIndexCols = 100
And print out some state (here n is the variable name for the *createStatsNode):
(dlv) p n
*github.com/cockroachdb/cockroach/pkg/sql.createStatsNode {
    CreateStats: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.CreateStats {
        Name: "__auto__",
        ColumnNames: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.NameList len: 0, cap: 0, nil,
        Table: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.TableExpr(*github.com/cockroachdb/cockroach/pkg/sql/sem/tree.TableRef) ...,
        AsOf: (*"github.com/cockroachdb/cockroach/pkg/sql/sem/tree.AsOfClause")(0xc007d8eb78),},
    p: *github.com/cockroachdb/cockroach/pkg/sql.planner {
        txn: *(*"github.com/cockroachdb/cockroach/pkg/internal/client.Txn")(0xc00b56c240),
        stmt: *(*"github.com/cockroachdb/cockroach/pkg/sql.Statement")(0xc007d3ab60),
        semaCtx: (*"github.com/cockroachdb/cockroach/pkg/sql/sem/tree.SemaContext")(0xc00b85b460),
        [...]
    run: github.com/cockroachdb/cockroach/pkg/sql.createStatsRun {resultsCh: chan github.com/cockroachdb/cockroach/pkg/sql/sem/tree.Datums 0/0, errCh: chan error 0/0},}
There are a lot of things you can do with dlv, some of which you can start exploring by running help while in dlv. If something is unclear, someone on the #engineering channel will be willing to help. Hopefully these steps can at least get you started in using a powerful debugging tool to cut down the amount of time spent wondering what's going wrong with a cluster.
How to: Get a linux delve binary onto a roachprod node
The instructions in the delve repo require Go to be installed, which the roachprod nodes do not have. The easiest workaround is to cross-compile the delve binary on your development machine. Skip the cloning if you already have the repo:
$ git clone https://github.com/go-delve/delve.git $GOPATH/src/github.com/go-delve/delve
$ cd $GOPATH/src/github.com/go-delve/delve/cmd/dlv
# Build for linux-amd64.
$ GOOS=linux GOARCH=amd64 go build
And then copy that to the roachprod node you're interested in debugging:
$ roachprod put $CLUSTER:<node-id> $GOPATH/src/github.com/go-delve/delve/cmd/dlv/dlv
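If the copied binary refuses to start on the node, one thing worth ruling out is that it was built for the wrong platform, for example a macOS build made without the GOOS and GOARCH settings. The sketch below checks only that a file begins with the ELF magic bytes, which is what a Linux build should have. It does not confirm the architecture, and the helper name is_elf is made up for this example.

```shell
#!/bin/sh
# is_elf: succeed if the file named by $1 starts with the ELF magic bytes
# 0x7f 'E' 'L' 'F'. This only shows that the file is an ELF binary; it
# does NOT verify that it targets amd64.
is_elf() {
  magic="$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d '[:space:]')"
  [ "$magic" = "7f454c46" ]
}

# Example usage, run in the directory where dlv was built:
#   is_elf ./dlv || echo "dlv is not a Linux (ELF) binary; rebuild with GOOS=linux" >&2
```

A binary that fails this check on a macOS development machine was most likely built for the host rather than for the node.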