Note: In the past, Delve has garnered a reputation for randomly panicking and not living up to the expectations of a debugger. However, a couple of years have passed since then, and it is now possible to attach to and debug running cockroach clusters, making it a powerful tool to have in your arsenal.
Delve is an extremely useful tool to debug go programs. It has the ability to attach to a running process by simply specifying the process ID. Once attached to the process, you can do things such as switch goroutines, jump to a certain stack frame, and inspect state. What follows is a quick tutorial on how to get delve up and running, attach to a running cockroach process, and look at some state.
This tutorial assumes you have a running cockroach process.
Debugging A Process
Step 1: Get delve (OSX, Linux):
Running go get
should be the only thing you need to do:
$ go get -u github.com/go-delve/delve/cmd/dlv
Step 2: Attach to process:
Get the process ID of the cockroach node you want to attach to and run:
$ dlv attach <pid>
Note that you must have the source code at the same sha as the binary you are debugging checked out for any debugging to make sense.
Step 3: Do what you came to do:
Now you attached to the process, the process is paused. You can jump around and look at things. For example, you can list the currently running goroutines and their IDs:
(dlv) goroutines [...] Goroutine 2239 - User: ./pkg/sql/create_stats.go:222 github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startJob (0x58d576a) [...]
Once you have an interesting goroutine ID (e.g. from stuck goroutine stacks), you can jump to a goroutine:
(dlv) goroutine 2239 Switched from 0 to 2239 (thread 5888500)
Check its trace:
(dlv) bt 0 0x00000000040308eb in runtime.gopark at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:303 1 0x0000000004030993 in runtime.goparkunlock at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:308 2 0x00000000040082ae in runtime.chanrecv at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/chan.go:520 3 0x0000000004007f9b in runtime.chanrecv1 at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/chan.go:402 4 0x00000000058d576a in github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startJob at ./pkg/sql/create_stats.go:222 5 0x0000000005a3f0c3 in github.com/cockroachdb/cockroach/pkg/sql.(*createStatsNode).startExec.func1 at ./pkg/sql/create_stats.go:76 6 0x000000000405f281 in runtime.goexit at /usr/local/Cellar/go/1.11.5/libexec/src/runtime/asm_amd64.s:1333
Jump to an interesting frame:
(dlv) frame 4 > runtime.gopark() /usr/local/Cellar/go/1.11.5/libexec/src/runtime/proc.go:303 (PC: 0x40308eb) Warning: debugging optimized function Frame 4: ./pkg/sql/create_stats.go:222 (PC: 58d576a) 217: Progress: jobspb.CreateStatsProgress{}, 218: }) 219: if err != nil { 220: return err 221: } => 222: return <-errCh 223: } 224: 225: // maxNonIndexCols is the maximum number of non-index columns that we will use 226: // when choosing a default set of column statistics. 227: const maxNonIndexCols = 100
And print out some state (here n
is the variable name for the *createStatsNode
):
(dlv) p n *github.com/cockroachdb/cockroach/pkg/sql.createStatsNode { CreateStats: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.CreateStats { Name: "__auto__", ColumnNames: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.NameList len: 0, cap: 0, nil, Table: github.com/cockroachdb/cockroach/pkg/sql/sem/tree.TableExpr(*github.com/cockroachdb/cockroach/pkg/sql/sem/tree.TableRef) ..., AsOf: (*"github.com/cockroachdb/cockroach/pkg/sql/sem/tree.AsOfClause")(0xc007d8eb78),}, p: *github.com/cockroachdb/cockroach/pkg/sql.planner { txn: *(*"github.com/cockroachdb/cockroach/pkg/internal/client.Txn")(0xc00b56c240), stmt: *(*"github.com/cockroachdb/cockroach/pkg/sql.Statement")(0xc007d3ab60), semaCtx: (*"github.com/cockroachdb/cockroach/pkg/sql/sem/tree.SemaContext")(0xc00b85b460), [...] run: github.com/cockroachdb/cockroach/pkg/sql.createStatsRun {resultsCh: chan github.com/cockroachdb/cockroach/pkg/sql/sem/tree.Datums 0/0, errCh: chan error 0/0},}
There are a lot of things you can do with dlv
, some of which you can start exploring by running help
while in dlv
. If something is unclear, someone on the #engineering
channel will be willing to help. Hopefully these steps can at least get you started in using a powerful debugging tool to cut down the amount of time spent wondering what's going wrong with a cluster.
One useful trick is that you can dive into any structures you have a pointer to. On most architectures (i.e. in practice nearly always) the receiver of a method in a stack trace corresponds to the first memory address listed for that frame. For example, to print the innards of a *Replica
:
$ dlv attach 24490 Type 'help' for list of commands. (dlv) print (*storage.Replica)(0xc01fed5800).mu.quotaReleaseQueue []int64 len: 1, cap: 1, [977] (dlv) print (*storage.Replica)(0xc01fed5800).mu.proposalQuotaBaseIndex 23 (dlv) print (*storage.Replica)(0xc01fed5800).RangeID 1514 (dlv) exit Would you like to kill the process? [Y/n] n
This can be scripted to extract information from a running process with an interruption that is (hopefully) small enough to not disturb the situation.
Debugging Tests
A common place to drop into a debugger is right in the middle of a unit test. This allows one to quickly iterate in a TDD fashion. With Delve there is a unique sub-command for starting a debugging session with tests running: dlv test
. After executing this command, you break at the start of a specific unit test by issuing a break <TestName>
statement. Finally, if you don't know the name of the test (or don't want to stop dlv
to find it), you can issue the funcs
command with a regex expression to list the tests of similar names. For example, here is a snippet of dropping into a dlv test
session and breaking on a test in the github.com/cockroachlabs/managed-service/pkg/auditlog directory:
$ dlv test Type 'help' for list of commands. (dlv) funcs auditlog.Test* github.com/cockroachlabs/managed-service/pkg/auditlog.TestAuditLogMiddleware_LogSuccessfulQueryWithSensitiveDataToScrub github.com/cockroachlabs/managed-service/pkg/auditlog.TestAuditLogMiddleware_LogSuccessfulQueryWithSensitiveDataToScrub.func1 github.com/cockroachlabs/managed-service/pkg/auditlog.TestAuditLogMiddleware_LogSuccessfulQueryWithSensitiveDataToScrub.func2 github.com/cockroachlabs/managed-service/pkg/auditlog.TestAuditLogMiddleware_LogSuccessfulQueryWithSensitiveDataToScrub.func3 github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery.func1 github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery.func2 github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery.func2.1 github.com/cockroachlabs/managed-service/pkg/auditlog.TestRecordHttpRequest github.com/cockroachlabs/managed-service/pkg/auditlog.TestRecordHttpRespAndErr github.com/cockroachlabs/managed-service/pkg/auditlog.TestRecordRequestAndRecordRespAndErr github.com/cockroachlabs/managed-service/pkg/auditlog.TestScrubNestedField github.com/cockroachlabs/managed-service/pkg/auditlog.TestScrubSensitiveFields (dlv) break TestLogFailedQuery Breakpoint 1 set at 0x1ce06bb for github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery() ./middleware_test.go:83 (dlv) c > github.com/cockroachlabs/managed-service/pkg/auditlog.TestLogFailedQuery() ./middleware_test.go:83 (hits goroutine(78):1 total:1) (PC: 0x1ce06bb) 78: require.Equal(t, nulls.NewString("{}"), entry.Response) // default is null so this confirms the second write succeeded 79: require.Equal(t, nulls.NewString(""), entry.Error) 80: }) 81: } 82: => 83: func TestLogFailedQuery(t *testing.T) { 84: conn, err := pop.Connect("test") 85: require.NoError(t, err) 86: 87: // Simulate authenticated request 88: om, cluster := models.CreateRandomOrgUserAndCluster(t, conn) (dlv)
Getting a Linux Delve binary onto a roachprod node
The instructions on the delve repo require go to be installed, which is something that the roachprod nodes do not have. The easiest way is to build the delve binary on your development machine. Skip the cloning if you already have the repo:
$ git clone https://github.com/go-delve/delve.git $GOPATH/src/github.com/go-delve/delve $ cd $GOPATH/src/github.com/go-delve/delve/cmd/dlv # Build for linux-amd64. $ GOOS=linux GOARCH=amd64 go build
And then copy that to the roachprod node you're interested in debugging:
$ roachprod put $CLUSTER:<node-id> $GOPATH/src/github.com/go-delve/delve/cmd/dlv/dlv