Tracing operations with Jaeger, Zipkin and OpenTelemetry

CockroachDB has extensive verbose logging and distributed tracing instrumentation built in. One way to use this instrumentation is through third-party trace collectors like Jaeger and Zipkin. CRDB can be instructed to trace everything it does and to send all the traces to a collector. Enabling tracing also activates all the log messages, at all verbosity levels, because traces include the log messages printed in the respective trace context.

Note that enabling full tracing is expensive both in terms of CPU usage and memory footprint, and is not suitable for high-throughput production environments.

There are several options for routing traces to a third-party collector, listed below. All of them rely on the fact that CRDB's Tracer can be configured to tee everything to the OpenTelemetry tracer; OpenTelemetry is quickly being embraced as the lingua franca of observability tools.

  1. Output traces to a collector that speaks the OTLP protocol. For example, Lightstep supports this, as do special builds of Jaeger. This can be enabled with the trace.opentelemetry.collector cluster setting.
  2. Output traces to the OpenTelemetry Collector, which can in turn route them to a lot of other tools. The OTEL Collector is a canonical collector, speaking the OTLP protocol, that can buffer traces and perform some processing on them before exporting them to every tool in the universe (including Jaeger, Zipkin and other OTLP tools). This is again enabled with the trace.opentelemetry.collector cluster setting.
  3. Output traces to Jaeger or Zipkin using their native protocols. This is implemented by using the Jaeger and Zipkin dedicated "exporters" from the otel SDK. Enabling the Jaeger exporter is done through the trace.jaeger.agent cluster setting. Enabling the Zipkin exporter is done through the trace.zipkin.collector cluster setting.
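The mapping from the three options above to their cluster settings can be sketched as a small shell helper that prints the SQL to run. This is only an illustration: the function name and the short option names (otlp, jaeger, zipkin) are hypothetical, and the example addresses assume the conventional OTLP/gRPC port 4317 and Zipkin port 9411. Only the three setting names come from the list above.

```shell
# Sketch: map a routing option to its cluster setting and print the SQL.
# trace_setting_sql and the option names are hypothetical helpers, not part
# of CockroachDB or roachprod.
trace_setting_sql() {
  local exporter="$1" addr="$2" setting
  case "$exporter" in
    otlp)   setting="trace.opentelemetry.collector" ;;  # options 1 and 2
    jaeger) setting="trace.jaeger.agent" ;;             # option 3, Jaeger
    zipkin) setting="trace.zipkin.collector" ;;         # option 3, Zipkin
    *) echo "unknown exporter: $exporter" >&2; return 1 ;;
  esac
  printf "SET CLUSTER SETTING %s = '%s';\n" "$setting" "$addr"
}

# Assumed example addresses; adjust to wherever your collector listens.
trace_setting_sql otlp localhost:4317
trace_setting_sql zipkin localhost:9411
```

The printed statements can be pasted into any SQL shell connected to the cluster.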

When playing around and wanting to look at some traces, the simplest thing to do is to use Jaeger or Zipkin. Jaeger has a better UI, so we'll use that as an example. To run a Jaeger instance locally in a container, make sure Docker is running on your system and then run the following incantation:

docker run -d --name jaeger -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one:latest

This runs the latest version of Jaeger and forwards two ports to the container: 6831 (UDP) is the trace ingestion port and 16686 is the UI port. By default, Jaeger will store all received traces in memory.

Now let's run CRDB and generate some traces. To see distributed traces in all their glory, the simplest thing is to use roachprod local. Create a cluster with:

roachprod create local -n 3
roachprod put local cockroach
roachprod start local

To enable trace generation do:

roachprod sql local:1
SET CLUSTER SETTING trace.jaeger.agent = 'localhost:6831';

Or even simpler, you can start the cluster with

roachprod start local --env=COCKROACH_JAEGER=localhost

and then you don't need to set the cluster setting.

Now go to http://localhost:16686, select the CockroachDB service, and you should be seeing traces streaming in.

Jaeger's memory storage works well for small use cases, but can result in OOMs when collecting many traces over a long period of time. Luckily, Jaeger also supports disk-backed local storage using Badger (not Pebble; we'll give them a pass on this, for now). To use this, start Jaeger by running the following adjusted Docker command:

docker run -d --name jaeger \
-e SPAN_STORAGE_TYPE=badger -e BADGER_EPHEMERAL=false \
-e BADGER_DIRECTORY_VALUE=/badger/data -e BADGER_DIRECTORY_KEY=/badger/key \
-v /mnt/data1/jaeger:/badger \
-p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one:latest

Play around looking for some traces. A few things:

  • Instead of wading through log messages in an unstructured fashion, the logs are now laid out in a nice tree format based on how the contexts were passed around. This is great! The tree also crosses machine boundaries, so you don't have to look at three different flat .log files trying to sync up events. This greatly speeds up your debugging.
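Since full tracing is expensive, it is worth switching the exporters off once you are done exploring. A minimal sketch follows, assuming that resetting each setting returns it to its default, and that the default is empty (no exporter) unless the node was started with an environment override such as COCKROACH_JAEGER. The helper name is hypothetical.

```shell
# Sketch: print SQL that resets the three trace exporter settings.
# disable_tracing_sql is a hypothetical helper. Assumption: each setting's
# default is empty unless an environment override was given at startup.
disable_tracing_sql() {
  local setting
  for setting in trace.jaeger.agent trace.zipkin.collector trace.opentelemetry.collector; do
    printf 'RESET CLUSTER SETTING %s;\n' "$setting"
  done
}

disable_tracing_sql
```

As before, the output is meant to be pasted into a SQL shell such as the one opened by roachprod sql local:1.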

An older version of this guide instructed readers to run Jaeger with the COLLECTOR_ZIPKIN_HOST_PORT=9411 environment variable set. This variable is no longer needed when using the trace.jaeger.agent setting. It asked Jaeger to accept the Zipkin protocol, back when we didn't have native support for the Jaeger protocol.

Copyright (C) Cockroach Labs.
Attention: This documentation is provided on an "as is" basis, without warranties or conditions of any kind, either express or implied, including, without limitation, any warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose.