Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

These instructions, while possibly still useful, are out of date as command line flags have changed and tools been phased out. Some of the incantations below have been updated, others have not. Please use common sense and update this document with improvements.

Here is a list of recent Go test failures, in case you wanted to check out a few. CI jobs file issues when a test fails on a release branch, and an owning team is determined according to TEAMS.yaml and the CODEOWNERS file. That team is then mentioned (to notify it) and the issue is put into the team’s Github project’s triage column.

...

Additionally, many tests avoid having the in-memory server log to the default test output (this is very noisy and can hide the actual failure messages). The logs will be found in a directory containing the test name.

Package-level failures

The Go testing framework does not execute testing code in a sandbox. It is thus possible for a test to crash the entire test run and these failures are typically more difficult to find. You will likely need to look through the full output to identify the problem. Here are a few things to look out for:

  • Data race: contains string WARNING: DATA RACE. This applies only to builds that have the race detector enabled. When the race detector finds a race, it will let the package exit nonzero, which translates into an opaque test failure. Luckily, this one is easy to grep for.

  • Package crash: this is more difficult. Search for --- FAIL lines for a package (not a test) in the full output and up from there, to hopefully discover that the runtime aborted a the process and produced a goroutine dump. One common one encountered is “test timed out after X” or “panic: X” which you can grep for directly.

  • Killed by oomkiller: this one is tough. The best symptom is the absence of symptoms. Sometimes you will see a “Signal: killed” in the output but this is not guaranteed.

Investigating a failure in a remote execution environment

We use EngFlow to execute some tests. See /wiki/spaces/devinf/pages/3217850384 and /wiki/spaces/devinf/pages/3141107902 .

Reproducing a test failure

One common mistake is to forget to unskip the test that you’re stressing (if it is skipped to begin with, as it may be as the test-infra community service team does that to flaky tests). It happens to everyone. Just keep in mind that this is a thing that happens, and if the number of iterations seems to fly up very quickly, ponder whether it’s currently happening to you.

Below we describe how to “manually” stress tests (i.e. on your machine or a gceworker). We also recommend, as a quick first thing to try, to let CI stress the test for you:

https://www.loom.com/share/b672200008164b42b2763477ceea70da

  • Update [July 31, 2023]: Make sure you specify env.TARGET instead of env.PKG , otherwise the build will fail.

stress{,race}

./dev test --stress pkg/something $PKG --filter '^MyTestName$'"^$TEST"; ideally on a gceworker (to avoid clogging your work station).

If this doesn’t yield a reproduction in due time, you could try under race (add --race flag) or adjust the level of parallelism:

Code Block
PKG=./pkg/what/which
TEST=TestFoo
P=100
./dev test --stress $PKG --filter $TEST [--race] -- --jobs $P --local_resources=cpu=$P --local_resources=memory=HOST_RAM 

Other Notes:

  • To print the full logs of the failed test trial, add --show-logs to your test cmd. Run ./dev test --

...

  • help to see this option. (The CRDB logs are default saved to a temp dir, but the path to that dir is currently broken).

  • If you’re trying to reproduce an instance of a slow or hanging test, add a per trial timeout (i.e. fail if a test trial takes longer than 5 minutes), add --test-args='-test.timeout 5m' to your test cmd. test-args are passed directly to go test whose binary is executed for every trial in stress; therefore, --test-args can be treated as ‘per trial’ args.

  • Build tags are passed like this (note the -- args separator):

    Code Block
    ./dev test --stress $PKG --filter $TEST -- --define gotags=bazel,gss,deadlock

roachprod-stress{,race}

When a gceworker won’t do, you can farm out the stressing to a number of roachprod machines. First create a cluster:

...