Error concepts and handling in CockroachDB
Error concepts
Programmers do not usually like thinking about errors. Introductory programming assignments are typically silent about error handling, or at best dismissive. For many applications, the “best practice” for error handling is “exit the program as soon as an error is encountered”.
In contrast, in a distributed system, especially in CockroachDB, error definitions and error handling are a critical aspect of product quality.
Here are some of the important things that we care about:
Errors should not cause running servers (e.g. a database node) to terminate immediately. Customers would consider this an unacceptable defect. Correct and deliberate error handling is a core part of product quality and stability.
Users will read the text of error messages, but they cannot be assumed to understand the source code. If an error message is confusing, users will ask confused questions of our tech support. If an error message is misleading, users will ask the wrong questions. Error messages should be clear and accurate, and should avoid referring to source code internals.
Any error visible to one user will likely be visible to dozens, if not thousands, of users eventually. We want our users to understand what they should do about an error on their own, so they do not need to reach out to technical support. For this, we want our error messages to be self-explanatory and include hint annotations. We also make specific error codes (e.g. SQLSTATE) part of our public, documented API for using CockroachDB.
Errors are part of the API and thus error situations should be exercised in unit tests.
Errors make their way to log files and crash reports and can contain user-provided data. We take care to separate customer confidential data from non-confidential data in log files and crash reports, and so we need to distinguish sensitive data inside error objects too.
Basic usage in Go
See the sub-page “Error handling basics”, which is also included in the overall Go style guide.
In a nutshell:
We prefer the use of the CockroachDB errors library at github.com/cockroachdb/errors. This is a superset of Go's own errors and pkg/errors.
use errors.Wrap to add context to an error
handle type assertion failures gracefully with an error, instead of letting Go generate a panic
avoid panics generally, unless in an init function or in a package that uses a disciplined panic-based error handling protocol (and converts panics to errors)
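The three rules above can be sketched as follows. This is a minimal illustration, not code from CockroachDB: it uses only the Go standard library so that it runs without dependencies, fmt.Errorf with %w stands in for errors.Wrap from github.com/cockroachdb/errors, and the names parsePort, asString and safely are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort adds context to a lower-level error before returning it.
// With github.com/cockroachdb/errors this line would be
// errors.Wrap(err, "parsing port"); fmt.Errorf with %w is the
// standard-library stand-in used in this sketch.
func parsePort(s string) (int, error) {
	p, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parsing port: %w", err)
	}
	return p, nil
}

// asString uses the two-value form of a type assertion so that a
// mismatch yields an error instead of a run-time panic.
func asString(v interface{}) (string, error) {
	s, ok := v.(string)
	if !ok {
		return "", fmt.Errorf("expected string, got %T", v)
	}
	return s, nil
}

// safely converts a panic raised inside fn into an ordinary error,
// which is what a panic-based protocol must do at its boundary.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	_, err := parsePort("80a")
	fmt.Println(err)
	_, err = asString(42)
	fmt.Println(err)
	fmt.Println(safely(func() { panic("boom") }))
}
```

In each case the caller receives an error value that it can inspect, annotate or return to the client, and the process keeps running.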
Errors and stability
Here are the general rules about how errors are allowed to impact the lifecycle of a network service:
Situation | Examples | What to do | Stop client session? | Send crash report to telemetry? | Stop process? |
---|---|---|---|---|---|
Error due to user input in request/query; or computational error in the query language | HTTP: request for object that does not exist | Return a regular error response to the client. SQL: use a SQLSTATE code (see section below). HTTP: use an appropriate HTTP code. | No | No | No |
Server detects unexpected condition scoped to a single client query. The situation does not correspond to a candidate future feature extension. | Unreachable code was reached. Precondition does not hold while processing client-specific input. | (or | No | Automatic for assertion failures | No |
Server detects unexpected condition scoped to a single client query. The situation is a candidate future feature extension. | Client passes a combination of parameters that is not yet supported. A complex condition arrives in the | Find a related issue or file a new one. Make an error with | No | No | No |
Server detects unexpected invalid state scoped to the client session | Unreachable code was reached Precondition does not hold while processing internal session-bound state | Propagate assertion error to client, see above. (Wrap existing errors) If the error pertains to an admin-only feature, call | Yes or make it read-only | Automatic for assertion failures | No |
Server detects unexpected invalid state with uncertain scope on a read-only path or a path guaranteed not to persist data | Shared subsystem returns an unexpected error. Data returned from disk does not conform to the expected type. A read-then-write operation reads invalid data from disk. | Propagate assertion error to client, see above. (Wrap existing errors) Also call Ensure no data is persisted after the error is detected | Yes or make it read-only | Automatic for assertion failures | No |
Server detects unexpected invalid state on a path that might persist data in storage | The post-conditions during a data persistence operation fail. A write operation to a data persistence output fails in a way that doesn’t allow the write to be cancelled (e.g. corruption detected in KV storage, or a write error on a critical log sink). | Call | Automatic by | Automatic by | Automatic by |
Large strings inside error payloads
Be careful not to include arbitrarily large strings inside error payloads.
This can cause excessive memory consumption (even a server crash) and incomplete/truncated crash reports.
A copy of the SQL syntax input by the SQL client is usually OK.
Placeholder values or the body of COPY statements can be more tricky.
Be especially careful with data loaded from storage.
Be careful of data generated from SQL built-in functions or subqueries.
When in doubt, only include a prefix up to a maximum length. Use a special character (e.g. the Unicode ellipsis “…”) to indicate that truncation happened.
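The prefix-plus-ellipsis rule can be sketched with a small helper. The function name truncateForError and the limit used in main are hypothetical and chosen for illustration; a real limit would depend on the payload. The helper cuts on a UTF-8 boundary so that the truncated string stays valid text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateForError returns s unchanged when it fits in maxBytes.
// Otherwise it returns the longest prefix of at most maxBytes bytes
// that ends on a rune boundary, followed by an ellipsis that marks
// the truncation.
func truncateForError(s string, maxBytes int) string {
	if maxBytes < 0 {
		maxBytes = 0
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// Back up until the byte at the cut position starts a rune, so
	// that no multi-byte character is split in half.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut] + "…"
}

func main() {
	stmt := "SELECT * FROM a_table_with_a_very_long_name WHERE x = 1"
	// Only the bounded prefix goes into the error message.
	err := fmt.Errorf("cannot plan statement %q", truncateForError(stmt, 16))
	fmt.Println(err)
}
```

Applying the limit at the point where the error is constructed keeps the bound in force no matter where the error object travels afterwards.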
Errors and performance
We work under the assumption that errors are important but uncommon.
There are two sides to this “uncommon” coin:
Error handling does not need to be optimized for performance. For example, we tolerate a moderate amount of string processing and heap allocations to construct error objects.
Error objects should not be constructed on the common path. Only construct errors when needed.
For example:
Bad:

    func myFunc(x int) (*result, error) {
        // The error object is constructed on every call, even when unused.
        maybeErr := errors.New("hello")
        if x > 10 {
            return nil, maybeErr
        }
        return &result{}, nil
    }

Good:

    func myFunc(x int) (*result, error) {
        if x > 10 {
            // The error object is constructed only on the error path.
            return nil, errors.New("hello")
        }
        return &result{}, nil
    }

Or alternatively, when the error will be tested elsewhere:

    var maybeErr = errors.New("hello")

    func myFunc(x int) (*result, error) {
        if x > 10 {
            return nil, errors.WithStack(maybeErr)
        }
        return &result{}, nil
    }
Error messages, hints and codes
Error objects are structured. We use different parts of an error object for different purposes. Care should be taken to not stuff text/data intended for one field into another.
Field | What it’s for | Example |
---|---|---|
Message (mandatory) | Tells the human user a summary of what happened. | |
SQLSTATE (highly recommended) | A 5-character code meant to inform automation about what happened and what it can do about the error. | or add a SQLSTATE to an existing error: |
Hint (optional, recommended) | Tells the human user about what they can do to resolve the error. | |
Detail (optional) | Tells the human user about the details of what happened. | |
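The separation between the four fields can be illustrated with a small stand-in type. Everything in this sketch is hypothetical: structuredError, newTableNotFound, sqlStateOf and hintOf are invented for illustration and are not the CockroachDB implementation, which attaches hints and details to existing errors with functions such as errors.WithHint and errors.WithDetail from github.com/cockroachdb/errors. The code 42P01 is the PostgreSQL SQLSTATE for an undefined table.

```go
package main

import (
	"errors"
	"fmt"
)

// structuredError is a hypothetical stand-in that keeps each piece
// of information in its own field instead of packing everything
// into the message string.
type structuredError struct {
	Message  string // summary of what happened, for humans
	SQLState string // 5-character code, for automation
	Hint     string // what the user can do about it
	Detail   string // further details of what happened
}

// Error returns only the message; the other fields stay separate.
func (e *structuredError) Error() string { return e.Message }

// newTableNotFound builds an error with every field filled in.
func newTableNotFound(name string) error {
	return &structuredError{
		Message:  fmt.Sprintf("relation %q does not exist", name),
		SQLState: "42P01", // undefined_table
		Hint:     "Check the spelling of the table name.",
		Detail:   "The name was looked up in the current database.",
	}
}

// sqlStateOf returns the code carried by err, or "" if there is none.
// Automation should branch on this value, never on the message text.
func sqlStateOf(err error) string {
	var se *structuredError
	if errors.As(err, &se) {
		return se.SQLState
	}
	return ""
}

// hintOf returns the hint carried by err, or "" if there is none.
func hintOf(err error) string {
	var se *structuredError
	if errors.As(err, &se) {
		return se.Hint
	}
	return ""
}

func main() {
	err := newTableNotFound("users")
	fmt.Println("ERROR:", err)
	fmt.Println("SQLSTATE:", sqlStateOf(err))
	fmt.Println("HINT:", hintOf(err))
}
```

Because the message contains only the summary, the hint and detail can be reworded or shown separately by a client without changing what the code means to automation.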
Errors as API
What does it mean that “errors are part of the documented API”?
Whether an error can occur for given input situations is documented.
If an API is documented not to return an error, then users can consider CockroachDB defective if an error is returned.
The set of possible errors is documented for these input situations.
If an API returns an error that was not documented as possible, then users can consider CockroachDB (or its documentation) defective.
What to do when a given error occurs is documented.
If an API returns an error with no clear “next steps”, then users can consider CockroachDB (or its documentation) defective.
There is a careful balance to maintain: users want to have more guarantees, but each guarantee comes with an engineering burden.
Here is how we manage the amount of engineering work:
We do not guarantee nor document the specific text of error messages, hints and details as part of our error API.
We emphasize “an error can occur” as the guarantee, not “this specific error will occur”.
Specific guarantees are expressed over the SQLSTATE values. These are unit tested.
Conversely, engineers are free to improve / extend / modify messages, hints and details without approval by the documentation and product team.
In some cases (this is a legacy case, which we strive to avoid nowadays), the guarantee includes a keyword at the first position in the message. For example “restart_transaction”.
Mention when new SQLSTATE values are introduced, or when a single error case has been broken down into multiple alternatives, inside a release note in the commit message.
Checking errors, errors in unit tests
Messages should not be considered stable:
inside Go code, use errors.Is, errors.As and errors.HasType / HasInterface, not .Error() == “…” or strings.Contains(…Error(), “…”)
in SQL logic tests, use regular expressions that only match the “important” part of a message
In unit tests:
Check SQLSTATE values using SQL logic tests (error pgcode ….)
In Go unit tests, use testutils.IsError()
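The difference between identity checks and message matching can be sketched as follows. The example uses the standard library's errors.Is so that it runs without dependencies; errors.Is in github.com/cockroachdb/errors is called the same way. The names errNotFound, lookup, isNotFound and isNotFoundFragile are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNotFound is a sentinel error. Suppose an earlier release
// worded it "object not found" and the text was later improved.
var errNotFound = errors.New("no such object")

// lookup always fails in this sketch and wraps the sentinel with
// context about the key that was requested.
func lookup(key string) error {
	return fmt.Errorf("looking up %q: %w", key, errNotFound)
}

// isNotFound checks the identity of the error. It keeps working
// when the message text changes or when more context is wrapped
// around the sentinel.
func isNotFound(err error) bool {
	return errors.Is(err, errNotFound)
}

// isNotFoundFragile matches on the old message text. It silently
// stopped working when the message was reworded.
func isNotFoundFragile(err error) bool {
	return strings.Contains(err.Error(), "object not found")
}

func main() {
	err := lookup("k1")
	fmt.Println("errors.Is check:", isNotFound(err))
	fmt.Println("string check:   ", isNotFoundFragile(err))
}
```

The string-based check fails without any compile-time or test-time signal at the call site, which is why message text is excluded from the stability guarantees.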
Sensitive data inside error objects
Many error objects are copied into logs, crash reports or other artifacts that are then communicated to CRL Tech Support automatically.
To preserve the confidentiality of our customer data, we are careful to isolate user-provided data from strings that are fixed inside CockroachDB. We call this “redactability”.
See the page Log and error redactability for more details.
General concepts:
When something is potentially sensitive / confidential, we call it “unsafe” and it is automatically deleted / redacted out when sent to CRL Tech support.
This conservative approach maximally protects customer confidentiality.
It takes extra work to include bits of known-safe data in errors, but doing so makes the error reports more useful during troubleshooting.
The CockroachDB errors library already knows about redactability and helps engineers as follows:
The first literal string argument to errors.New, errors.Newf, errors.Wrap etc. is automatically considered non-confidential / non-redactable.
Most “simple” numeric values are automatically considered non-redactable.
All string values passed as positional arguments to error constructors and annotation functions are considered sensitive and thus redactable.
More non-redactability for values passed to error constructors is possible via the SafeFormatter interface (see implementations of SafeFormat throughout the source code).
Error objects used as input to a new error object are decomposed into redactable and non-redactable bits automatically.
Errors constructed outside of cockroachdb/errors, e.g. via fmt.Errorf, are considered sensitive and thus fully redactable.
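The rules above can be made concrete with a toy model of redaction. This is a simplified sketch under stated assumptions, not the mechanism used by the CockroachDB errors library: the type safe, the functions formatRedactable and redact, and the marker characters are all hypothetical and chosen for illustration. The format string is treated as safe, numbers are treated as safe, and every other argument is enclosed in markers unless the caller explicitly declares it safe.

```go
package main

import (
	"fmt"
	"regexp"
)

// safe marks a string that the caller declares non-confidential.
type safe string

// Markers that delimit potentially sensitive spans in this sketch.
const (
	markOpen  = "‹"
	markClose = "›"
)

// formatRedactable formats like fmt.Sprintf, but encloses every
// argument that is not known to be safe in markers. A complete
// implementation would also escape markers inside the values.
func formatRedactable(format string, args ...interface{}) string {
	marked := make([]interface{}, len(args))
	for i, a := range args {
		switch v := a.(type) {
		case safe:
			marked[i] = string(v)
		case int, int32, int64, uint, uint32, uint64, float32, float64:
			marked[i] = v // simple numeric values are treated as safe
		default:
			marked[i] = markOpen + fmt.Sprint(v) + markClose
		}
	}
	return fmt.Sprintf(format, marked...)
}

// sensitiveSpan matches one marked span, markers included.
var sensitiveSpan = regexp.MustCompile(markOpen + "[^" + markClose + "]*" + markClose)

// redact replaces the contents of every marked span with a
// placeholder, leaving the safe parts of the message intact.
func redact(s string) string {
	return sensitiveSpan.ReplaceAllString(s, markOpen+"×"+markClose)
}

func main() {
	msg := formatRedactable("user %s exceeded quota %d on node %s",
		"alice", 42, safe("n1"))
	fmt.Println(msg)
	fmt.Println(redact(msg))
}
```

The default in this model is conservative: a value is stripped unless something positively identifies it as safe, which mirrors the approach described above.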
Copyright (C) Cockroach Labs.
Attention: This documentation is provided on an "as is" basis, without warranties or conditions of any kind, either express or implied, including, without limitation, any warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose.