Error concepts

Programmers do not usually like thinking about errors. Introductory programming assignments tend to be silent about error handling, or at best dismissive of it. For many applications, the “best practice” for error handling is “exit the program as soon as an error is encountered”.

In contrast, in a distributed system, especially in CockroachDB, error definitions and error handling are a critical aspect of product quality.

Here are some of the important things that we care about:

Basic usage in Go

See the sub-page “Error handling basics”, which is also included in the overall Go style guide.

In a nutshell:

Errors and stability

Here are the general rules about how errors are allowed to impact the lifecycle of a network service:

Situation: Error due to user input in request/query; or computational error in the query language

  • Examples:
    • SELECT invalid;
    • SELECT 1/0;
    • HTTP: request for object that does not exist
  • What to do:
    • Return a regular error response to the client.
    • SQL: use a SQLSTATE code (see section below)
    • HTTP: use an appropriate HTTP code
  • Stop client session? No
  • Send crash report to telemetry? No
  • Stop process? No

Situation: Server detects unexpected condition scoped to a single client query. The situation does not correspond to a candidate future feature extension.

  • Examples:
    • Unreachable code was reached
    • Precondition does not hold while processing client-specific input
  • What to do:
    • return errors.AssertionFailedf(…) (or NewAssertionFailureWithWrappedErrf)
  • Stop client session? No
  • Send crash report to telemetry? Automatic for assertion failures
  • Stop process? No

Situation: Server detects unexpected condition scoped to a single client query. The situation is a candidate future feature extension.

  • Examples:
    • Client passes a combination of parameters that is not yet supported.
    • A complex condition arrives in the default or else clause of the code. At the time the code was written, that condition was thought to be impossible, but someone comes up with a counter-example that makes sense.
  • What to do:
    • Find a related issue or file a new one. Make an error with unimplemented.NewWithIssue(…) and refer to the issue. Mark the issue with the labels docs-known-limitation and X-anchored-telemetry.
  • Stop client session? No
  • Send crash report to telemetry? No (although all unimplemented errors automatically get their own, non-crash telemetry too)
  • Stop process? No

Situation: Server detects unexpected invalid state scoped to the client session

  • Examples:
    • Unreachable code was reached
    • Precondition does not hold while processing internal session-bound state
  • What to do:
    • Propagate assertion error to client, see above. (Wrap existing errors)
    • If the error pertains to an admin-only feature, call log.Warningf
  • Stop client session? Yes or make it read-only
  • Send crash report to telemetry? Automatic for assertion failures
  • Stop process? No

Situation: Server detects unexpected invalid state with uncertain scope on a read-only path or a path guaranteed not to persist data

  • Examples:
    • Shared subsystem returns an unexpected error
    • Data returned from disk does not comply with the expected type
    • A read-then-write operation reads invalid data from disk.
  • What to do:
    • Propagate assertion error to client, see above. (Wrap existing errors)
    • Also call log.Errorf
    • Ensure no data is persisted after the error is detected
  • Stop client session? Yes or make it read-only
  • Send crash report to telemetry? Automatic for assertion failures
  • Stop process? No

Situation: Server detects unexpected invalid state on a path that might persist data in storage

  • Examples:
    • The post-conditions during a data persistence operation fail
    • A write operation to a data persistence output fails in a way that doesn’t allow the write to be cancelled (e.g. corruption detected in KV storage, or a write error on a critical log sink).
  • What to do:
    • Call log.Fatalf
  • Stop client session? Automatic by log.Fatal
  • Send crash report to telemetry? Automatic by log.Fatal
  • Stop process? Automatic by log.Fatal
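
The difference between the first two situations shows up at the return site. The following is a minimal sketch using only the Go standard library: assertionFailure is a hypothetical stand-in for what errors.AssertionFailedf produces, and eval, classify and outcome are illustrative names, not CockroachDB APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// assertionFailure is a hypothetical stand-in for the error produced by
// errors.AssertionFailedf in the CockroachDB errors library.
type assertionFailure struct{ msg string }

func (e *assertionFailure) Error() string { return e.msg }

// outcome mirrors the last three columns of the table above.
type outcome struct {
	stopSession, crashReport, stopProcess bool
}

// classify covers only the first two situations: regular errors and
// query-scoped assertion failures.
func classify(err error) outcome {
	var af *assertionFailure
	if errors.As(err, &af) {
		// Report the bug, but keep the session and the process alive.
		return outcome{crashReport: true}
	}
	return outcome{}
}

// eval is a toy evaluator used to produce both kinds of error.
func eval(op string, a, b int) (int, error) {
	switch op {
	case "+":
		return a + b, nil
	case "/":
		if b == 0 {
			// Caused by user input: a regular error.
			return 0, errors.New("division by zero")
		}
		return a / b, nil
	default:
		// A parser (not shown) is assumed to reject other operators,
		// so reaching this branch is a bug in the server.
		return 0, &assertionFailure{msg: "unexpected operator: " + op}
	}
}

func main() {
	for _, op := range []string{"+", "/", "?"} {
		_, err := eval(op, 1, 0)
		fmt.Printf("op=%q err=%v outcome=%+v\n", op, err, classify(err))
	}
}
```

In real code the classification is not written by hand: per the table, the crash report is automatic once the error is an assertion failure.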

Large strings inside error payloads

Be careful not to include arbitrarily large strings inside error payloads.

This can cause excessive memory consumption (even a server crash) and incomplete/truncated crash reports.

When in doubt, only include a prefix up to a maximum length. Use a special character (e.g. the Unicode ellipsis “…”) to indicate that truncation happened.
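
One way to apply this rule is a small truncation helper. The sketch below is an assumption-laden illustration: truncateForError and the 32-byte limit are hypothetical, and it uses the standard library rather than the CockroachDB errors package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxPayloadLen is a hypothetical limit, in bytes, chosen for illustration only.
const maxPayloadLen = 32

// truncateForError returns s unchanged if it is at most maxLen bytes long.
// Otherwise it returns a prefix of at most maxLen bytes followed by "…".
// The cut never splits a multi-byte UTF-8 character. maxLen must be >= 0.
func truncateForError(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	cut := maxLen
	// Back up until the cut lands on the first byte of a UTF-8 sequence.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut] + "…"
}

func main() {
	input := "SELECT * FROM a_table_with_a_very_long_name WHERE x = 1"
	err := fmt.Errorf("invalid input: %s", truncateForError(input, maxPayloadLen))
	fmt.Println(err)
}
```

The limit is counted in bytes because memory consumption is the concern; a limit in characters would work equally well for readability.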

Errors and performance

We work under the assumption that errors are important, yet uncommon.

There are two sides of this “uncommon” coin:

For example:

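Bad:

// The error is constructed on every call, even on the common path that returns no error.
func myFunc(x int) (*result, error) {
   maybeErr := errors.New("hello")
   if x > 10 {
      return nil, maybeErr
   }
   return &result{}, nil
}

Good:

// The error is constructed only on the path that returns it.
func myFunc(x int) (*result, error) {
   if x > 10 {
      return nil, errors.New("hello")
   }
   return &result{}, nil
}

or alternatively, when the error will be tested elsewhere:

// The sentinel is constructed once; WithStack records the stack trace at the return site.
var maybeErr = errors.New("hello")

func myFunc(x int) (*result, error) {
   if x > 10 {
      return nil, errors.WithStack(maybeErr)
   }
   return &result{}, nil
}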
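
The “tested elsewhere” case can be sketched end to end with the standard library alone. This is only an approximation: fmt.Errorf with %w stands in for errors.WithStack (which the standard errors package does not provide), and errTooLarge, check and isTooLarge are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooLarge is a hypothetical sentinel, constructed once at package initialization.
var errTooLarge = errors.New("value too large")

// check returns an error only for x > 10. The wrapping happens on the
// error path only, so the common path does no error-related work.
func check(x int) error {
	if x > 10 {
		// %w keeps errTooLarge in the chain so callers can test for it.
		return fmt.Errorf("check %d: %w", x, errTooLarge)
	}
	return nil
}

// isTooLarge is the "elsewhere" that tests for the sentinel.
func isTooLarge(err error) bool {
	return errors.Is(err, errTooLarge)
}

func main() {
	fmt.Println(isTooLarge(check(42)))
	fmt.Println(isTooLarge(check(1)))
}
```

Comparing with errors.Is rather than == matters here, because the returned error is a wrapper around the sentinel and not the sentinel itself.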

Error messages, hints and codes

Error objects are structured. We use different parts of an error object for different purposes. Care should be taken to not stuff text/data intended for one field into another.

Each field below is listed with what it’s for, followed by an example.

Message (mandatory)

Gives the human user a summary of what happened.

  • The message is for the human user: tell what happened in prose.

  • It’s a summary. Keep it short (yet clear and accurate).

  • The message is about what happened up to the point the error occurred. It should be descriptive about the past / user input.

  • The error message is likely to be embedded in textual contexts that assume a single-line string:

    • Do not start the message with a capital letter or end it with a period.

    • Avoid newline characters.

  • Be open to feedback from users and documentation writers about how to improve the text of the message.

  • There is a single message per error object: composite errors concatenate their messages.

errors.Newf("invalid input: %v", userInput)

SQLSTATE (highly recommended)

A 5-character code meant to inform automation about what happened and what it can do about the error.

  • Try to use the same code as PostgreSQL in an equivalent situation.

  • Only be creative if PostgreSQL has no equivalent or related situation.

  • When you are creative, be mindful that the SQLSTATE codes are organized in categories indicated by the first 2 characters. Use the proper category for your error.

  • Use SQL logic tests to verify that the proper SQLSTATE is returned in known situations.

  • We have special codes:

    • XX000 - internal error; code automatically derived for assertion failures, also triggers a crash report in telemetry when the error flows back to the client.

    • XXUUU - automatically chosen when the error does not announce its own SQLSTATE. We should reduce occurrences of XXUUU over time; a user encountering this code is a signal that our error handling should choose a better one.

    • XXA00 - txn committed but schema change failed. The transaction did commit but a schema change op failed. Manual intervention is likely needed.

    • See pgcode/codes.go for more.

  • PostgreSQL has special codes which are equally special in CockroachDB:

    • 40001: serialization error. The transaction did not commit and can be retried.

    • 40003: statement completion unknown. The transaction may or may not have committed and may or may not be retried. Manual intervention is likely needed.

pgerror.New(pgcode.CheckViolation, "CHECK constraint failed")

or add a SQLSTATE to an existing error:

pgerror.WithCandidateCode(someErr, pgcode.CheckViolation)
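
To make the “first 2 characters” rule and the meaning of 40001 concrete, here is a minimal client-side sketch. It works on plain strings; sqlstateClass and isRetryable are hypothetical helpers, not functions from pgcode or pgerror.

```go
package main

import "fmt"

// sqlstateClass returns the two-character category prefix of a SQLSTATE
// code, or "" if the code is not exactly five characters long.
func sqlstateClass(code string) string {
	if len(code) != 5 {
		return ""
	}
	return code[:2]
}

// isRetryable reports whether automation may simply retry the transaction.
// Per the text above, 40001 means the transaction did not commit and can
// be retried; 40003 is deliberately excluded because the outcome is unknown.
func isRetryable(code string) bool {
	return code == "40001"
}

func main() {
	for _, code := range []string{"40001", "40003", "XX000", "XXUUU"} {
		fmt.Printf("%s class=%s retryable=%t\n", code, sqlstateClass(code), isRetryable(code))
	}
}
```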

Hint (optional, recommended)

Tells the human user about what they can do to resolve the error.

  • The hints are for the human user: write them in prose.

  • Make recommendations about what the user can change to observe a different outcome.

  • Hints are presented to the user in paragraphs:

    • Each hint payload can be multi-line.

    • Use full sentences, with a capital at the beginning and a period at the end.

    • There can be multiple hint payloads. They typically appear under each other.

errors.WithHintf(
  errors.Newf("unknown value: %s", word),
  "Accepted values: %s", strings.Join(possibleValues, ","),
)
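
The idea that an error can carry several hint payloads without changing its message can be pictured with a small wrapper type. This is a simplified stand-in built on the standard library, not the implementation behind errors.WithHintf; withHint, addHint, allHints and example are hypothetical, and the outermost-first ordering is a choice made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// withHint is a hypothetical wrapper that attaches one hint to an error
// without changing its message.
type withHint struct {
	cause error
	hint  string
}

func (w *withHint) Error() string { return w.cause.Error() }
func (w *withHint) Unwrap() error { return w.cause }

// addHint wraps err with a hint.
func addHint(err error, hint string) error {
	return &withHint{cause: err, hint: hint}
}

// allHints collects every hint in the chain, outermost first.
func allHints(err error) []string {
	var hints []string
	for ; err != nil; err = errors.Unwrap(err) {
		if w, ok := err.(*withHint); ok {
			hints = append(hints, w.hint)
		}
	}
	return hints
}

// example builds an error that carries two hints.
func example() error {
	err := errors.New("unknown value: foo")
	err = addHint(err, "Accepted values: bar, baz.")
	return addHint(err, "Values are case-sensitive.")
}

func main() {
	err := example()
	fmt.Println(err)
	for _, h := range allHints(err) {
		fmt.Println("HINT:", h)
	}
}
```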

Detail (optional)

Tells the human user about the details of what happened.

  • The detail field is for the human user: write it in prose.

  • It’s also about what happened in the past.

  • Details are presented to the user in paragraphs:

    • Each detail payload can be multi-line.

    • Use full sentences, with a capital at the beginning and a period at the end.

    • There can be multiple detail payloads. They typically appear under each other.

errors.WithDetailf(
   errors.Newf("invalid keyword: %s", word),
   "Error encountered while processing input:\n%s", multiLineInput)

Errors as API

What does it mean that “errors are part of the documented API”?

There is a careful balance to maintain: users want to have more guarantees, but each guarantee comes with an engineering burden.

Here is how we manage the amount of engineering work:

Checking errors, errors in unit tests

Sensitive data inside error objects

Many error objects are copied into logs, crash reports or other artifacts that are then communicated to CRL Tech Support automatically.

To preserve the confidentiality of our customer data, we are careful to isolate user-provided data from strings that are fixed inside CockroachDB. We call this “redactability”.

See the page Log and error redactability for more details.
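
The principle of keeping fixed strings apart from user-provided data can be illustrated as follows. This is a hypothetical sketch of the idea only, not the actual CockroachDB mechanism: the message type, its methods and the “×” placeholder are all assumptions made for the example.

```go
package main

import "fmt"

// message keeps the fixed format string apart from user-provided arguments.
// This is a hypothetical illustration, not the actual CockroachDB mechanism.
type message struct {
	format string        // fixed inside the program: safe to report
	args   []interface{} // may contain user data: unsafe to report
}

// String renders the full message, for display to the user who supplied the data.
func (m message) String() string {
	return fmt.Sprintf(m.format, m.args...)
}

// Redacted renders the message with every argument replaced by a placeholder.
// It assumes the format string only uses the %v and %s verbs.
func (m message) Redacted() string {
	masked := make([]interface{}, len(m.args))
	for i := range masked {
		masked[i] = "×"
	}
	return fmt.Sprintf(m.format, masked...)
}

func main() {
	m := message{format: "invalid input: %v", args: []interface{}{"alice@example.com"}}
	fmt.Println(m.String())   // what the client sees
	fmt.Println(m.Redacted()) // what a log or crash report could safely contain
}
```

Because the format string is written by a developer and never contains customer data, it can be reported as-is; only the arguments need masking.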

General concepts: