These instructions, while possibly still useful, are out of date: command-line flags have changed and some tools have been phased out. Some of the incantations below have been updated; others have not. Please use common sense and update this document with improvements.
Here is a list of recent Go test failures, in case you want to check out a few. CI jobs file issues when a test fails on a release branch, and an owning team is determined according to TEAMS.yaml and the CODEOWNERS file. That team is then mentioned (to notify it) and the issue is put into the team's GitHub project's triage column.
...
Additionally, many tests avoid having the in-memory server log to the default test output (that output is very noisy and can hide the actual failure messages). Instead, the logs can be found in a directory whose name contains the test name.
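As an illustration (a sketch; the artifacts path and test name below are hypothetical, substitute your own), such a directory can be located with `find`:

```shell
# Hypothetical artifacts tree: the logs for TestFoo land in a
# directory whose name contains the test name.
mkdir -p /tmp/demo-artifacts/logTestFoo42

# Locate log directories for the test by matching on the name.
find /tmp/demo-artifacts -type d -name '*TestFoo*'
```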
Package-level failures
The Go testing framework does not execute testing code in a sandbox. It is thus possible for a test to crash the entire test run, and these failures are typically more difficult to track down. You will likely need to look through the full output to identify the problem. Here are a few things to look out for:
- Data race: the output contains the string `WARNING: DATA RACE`. This applies only to builds that have the race detector enabled. When the race detector finds a race, it lets the package exit nonzero, which translates into an opaque test failure. Luckily, this one is easy to grep for.
- Package crash: this is more difficult. Search the full output for `--- FAIL` lines for a package (not a test), and scan up from there, to hopefully discover that the runtime aborted the process and produced a goroutine dump. Common variants are "test timed out after X" or "panic: X", which you can grep for directly.
- Killed by the OOM killer: this one is tough. The best symptom is the absence of symptoms. Sometimes you will see a "signal: killed" in the output, but this is not guaranteed.
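The signatures above can be grepped for directly in the saved output. A minimal sketch (the log filename and its contents are fabricated for the demo):

```shell
# Fabricated sample of a full test-run output.
cat > /tmp/full_output.log <<'EOF'
=== RUN   TestFoo
WARNING: DATA RACE
--- FAIL: TestFoo (0.01s)
panic: test timed out after 10m0s
EOF

# Scan for the known failure signatures, printing line numbers.
grep -n -E 'WARNING: DATA RACE|--- FAIL|panic:|test timed out|signal: killed' /tmp/full_output.log
```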
Investigating a failure in a remote execution environment
We use EngFlow to execute some tests. See /wiki/spaces/devinf/pages/3217850384 and /wiki/spaces/devinf/pages/3141107902 .
Reproducing a test failure
...
https://www.loom.com/share/b672200008164b42b2763477ceea70da
Update [July 31, 2023]: Make sure you specify `env.TARGET` instead of `env.PKG`, otherwise the build will fail.
stress{,race}
Run `./dev test --stress $PKG --filter "^$TEST"` (for example, `./dev test --stress pkg/something --filter '^MyTestName$'`); ideally on a gceworker (to avoid clogging your workstation).
If this doesn’t yield a reproduction in due time, you could try under race (add --race
flag) or adjust the level of parallelism:
```
PKG=./pkg/what/which TEST=TestFoo P=100 ./dev test --stress $PKG --filter $TEST [--race] -- --jobs $P --local_resources=cpu=$P --local_resources=memory=HOST_RAM
```
Other Notes:
- To print the full logs of the failed test trial, add `--show-logs` to your test cmd. Run `./dev test --help` to see this option. (The CRDB logs are saved to a temp dir by default, but the path to that dir is currently broken.)
- If you're trying to reproduce an instance of a slow or hanging test, add a per-trial timeout (i.e. fail if a test trial takes longer than 5 minutes) by adding `--test-args='-test.timeout 5m'` to your test cmd. `test-args` are passed directly to `go test`, whose binary is executed for every trial in stress; therefore, `--test-args` can be treated as 'per trial' args.
- Build tags are passed like this (note the `--` args separator):

```
./dev test --stress $PKG --filter $TEST -- --define gotags=bazel,gss,deadlock
```
roachprod-stress{,race}
When a gceworker won’t do, you can farm out the stressing to a number of roachprod machines. First create a cluster:
...