These instructions, while possibly still useful, are out of date: command-line flags have changed and tools have been phased out. Some of the incantations below have been updated; others have not. Please use common sense and update this document with improvements.

Here is a list of recent Go test failures, in case you want to check out a few. CI jobs file issues when a test fails on a release branch, and an owning team is determined according to TEAMS.yaml and the CODEOWNERS file. That team is then mentioned (to notify it) and the issue is put into the triage column of the team's GitHub project.

...

Additionally, many tests avoid having the in-memory server log to the default test output (this is very noisy and can hide the actual failure messages). The logs will be found in a directory containing the test name.
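As a rough sketch of how to locate those logs, the helper below lists directories whose name contains a given test name. The function name `find_test_logs` and the idea of searching a locally unpacked artifacts directory are assumptions made for illustration; where the logs actually end up depends on the job.

```shell
# find_test_logs ROOT TEST: list directories under ROOT whose name contains TEST.
# ROOT is wherever the test output was unpacked (an assumption; adjust as needed).
find_test_logs() {
  find "$1" -type d -name "*$2*"
}
```

For example, `find_test_logs ./artifacts TestFoo` would print every matching directory under a hypothetical `./artifacts`.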

Package-level failures

The Go testing framework does not execute testing code in a sandbox. It is thus possible for a test to crash the entire test run, and these failures are typically more difficult to find. You will likely need to look through the full output to identify the problem. Here are a few things to look out for:

  • Data race: the output contains the string WARNING: DATA RACE. This applies only to builds that have the race detector enabled. When the race detector finds a race, it makes the package's test run exit with a nonzero status, which shows up as an opaque test failure. Luckily, this one is easy to grep for.

  • Package crash: this is more difficult. Search the full output for --- FAIL lines for a package (not a test) and read upward from there, to hopefully discover that the runtime aborted the process and produced a goroutine dump. Common cases are “test timed out after X” and “panic: X”, which you can grep for directly.

  • Killed by oomkiller: this one is tough. The best symptom is the absence of symptoms. Sometimes you will see a “Signal: killed” in the output but this is not guaranteed.
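The strings listed above can be searched for in one pass. The following is a minimal sketch that assumes the full test output has been saved to a local file; the function name `scan_test_log` is made up for this example.

```shell
# scan_test_log FILE: print line-numbered matches for the failure
# signatures mentioned above (data race, panic, timeout, kill, --- FAIL).
scan_test_log() {
  grep -n -E \
    -e 'WARNING: DATA RACE' \
    -e 'panic: ' \
    -e 'test timed out after' \
    -e 'Signal: killed' \
    -e '^--- FAIL' \
    -- "$1"
}
```

An empty result does not rule out a package-level failure; as noted above, an OOM kill in particular may leave no trace in the output.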

Investigating a failure in a remote execution environment

We use EngFlow to execute some tests. See /wiki/spaces/devinf/pages/3217850384 and /wiki/spaces/devinf/pages/3141107902 .

Reproducing a test failure

...

https://www.loom.com/share/b672200008164b42b2763477ceea70da

  • Update [July 31, 2023]: Make sure you specify env.TARGET instead of env.PKG; otherwise the build will fail.

stress{,race}

Run ./dev test --stress $PKG --filter "^$TEST", ideally on a gceworker (to avoid clogging your workstation).

If this doesn’t yield a reproduction in due time, you could try running under race (add the --race flag) or adjusting the level of parallelism:

Code Block
PKG=./pkg/what/which
TEST=TestFoo
P=100
./dev test --stress $PKG --filter $TEST [--race] -- --jobs $P --local_resources=cpu=$P --local_resources=memory=HOST_RAM

Other Notes:

  • To print the full logs of the failed test trial, add --show-logs to your test cmd. Run ./dev test --help to see this option. (The CRDB logs are saved to a temp dir by default, but the path to that dir is currently broken.)

  • If you’re trying to reproduce an instance of a slow or hanging test, set a per-trial timeout (e.g. fail if a test trial takes longer than 5 minutes) by adding --test-args='-test.timeout 5m' to your test cmd. test-args are passed directly to the go test binary, which is executed for every trial in stress; therefore, --test-args can be treated as ‘per trial’ args.

  • Build tags are passed like this (note the -- args separator):

    Code Block
    ./dev test --stress $PKG --filter $TEST -- --define gotags=bazel,gss,deadlock
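To make the notes above concrete, here is a small sketch that only prints a stress invocation combining --show-logs with a per-trial timeout, so it can be inspected before running. The helper name `stress_cmd` is hypothetical, the PKG and TEST values are placeholders, and whether these flags still combine this way should be checked against ./dev test --help.

```shell
# Placeholder values; substitute your own package and test name.
PKG=./pkg/what/which
TEST=TestFoo

# stress_cmd PKG TEST TIMEOUT: print (do not run) a stress command that
# shows logs of failed trials and fails any trial exceeding TIMEOUT.
stress_cmd() {
  printf "./dev test --stress %s --filter '^%s\$' --show-logs --test-args='-test.timeout %s'\n" "$1" "$2" "$3"
}

stress_cmd "$PKG" "$TEST" 5m
```

Printing rather than executing keeps the sketch safe to try anywhere; copy the printed line into a shell on the machine where you want to stress.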

roachprod-stress{,race}

When a gceworker won’t do, you can farm out the stressing to a number of roachprod machines. First create a cluster:

...