All About Large Scale Cluster Testing

From a New Joiner’s Perspective.

Executive Summary

Purpose

  • The Large Scale Cluster Testing program validates CockroachDB at “customer-like” size and shape (for example, clusters of 300 nodes across multiple zones, with ~1.2PB of storage). This ensures we discover and fix scale and resilience issues internally before customers experience them. We run real workloads (like TPCC), measure throughput/latency and recovery, capture evidence, and drive fixes into upcoming releases. The work is organized as an iterative SDLC (Software Development Life Cycle) so it’s repeatable and continuously improving.

Overview

What is Large Scale Cluster Testing?

  • It’s a cross-functional program to prove CockroachDB’s performance and resilience at production scale. We set up a large cluster (e.g., 300 nodes; 16 vCPU; 64GB RAM per node; multiple stores and zones), run real workloads and operations (TPCC, YCSB, backups, restores, CDC, schema changes), measure what matters, and capture evidence for engineering, product, and sales. The goal is to validate capabilities and identify bottlenecks proactively.
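A quick sanity check on the shape above: the raw capacity follows directly from node count and store layout. This is a minimal sketch (the exact store sizes for a given cycle come from that cycle's config sheet):

```python
# Sanity-check the raw capacity implied by a cluster shape.
# Values mirror the example shape (300 nodes, 2 stores x 2 TB per node).

def raw_capacity_tb(nodes: int, stores_per_node: int, store_tb: float) -> float:
    """Total raw storage across the cluster, in TB."""
    return nodes * stores_per_node * store_tb

total_tb = raw_capacity_tb(nodes=300, stores_per_node=2, store_tb=2.0)
print(f"{total_tb:.0f} TB ≈ {total_tb / 1000:.1f} PB")  # 1200 TB ≈ 1.2 PB
```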

Why it matters

  • Customers expand and renew when they trust we can run their future workloads. A credible 300-node story unlocks growth for strategic accounts (e.g., customers targeting 100–1000 nodes). Testing reveals issues in our lab—so we can fix or document limits—before customers encounter them. Findings drive near-term roadmap (e.g., 25.3/25.4) and provide proof for Sales and PMM.

Iterative model (Agile SDLC)

  • We run in cycles: define (objectives) → design (scenarios/configs) → implement (provision/instrument) → validate (workloads/resilience) → fix (triage/backport) → report (engineering and executive) → improve (process, criteria, plans) → repeat. This cadence ensures continuous improvement across runs (e.g., March, May, August).

Who’s Who (Owners and Accountability)

Test Engineering / DRP (Execution)

  • The hands-on team that builds clusters, orchestrates workloads, schedules fault injection, sets up observability, captures artifacts (debug.zip, tsdump, pprof/traces), and writes detailed engineering reports. They also open Jira issues with evidence and re-test fixes.

CRDB Engineering (Product Areas)

  • Domain owners for KV, SQL Foundations, SQL Queries, Storage, DR (Backup/Restore), CDC, and Observability. They triage findings, produce fixes/backports, and partner with DRP to validate improvements at scale.

Cloud Ops / Cloud Engineering (Environments)

  • Provisions cloud environments (GCP/AWS/IBM), enforces access and security, and controls cost. They turn a design spec (e.g., 300-node, multi-zone, store layout) into a working cluster and keep it safe and manageable.

PM/PMM/Sales Engineering (Why and KPIs)

  • Define the “why” (customer/competitive needs) and top-level KPIs/success criteria, ensure the outputs translate to credible external claims, and equip the field with the story (executive summaries, receipts, constraints).

Product Security/Compliance (Security)

  • Runs vulnerability scans and helps ensure compliance coverage is adequate for the target scale and scenarios; they sign off on security work when it is in scope.

Infra Platform Engineering (Provisioning Patterns)

  • Defines provisioning patterns (e.g., storage/network density), guardrails, and cost guidance. Helps DRP/Cloud Ops choose cluster shapes that simulate customer deployments while remaining practical to operate.

R&D Operations (Program)

  • Owns the program cadence and hygiene: charters, milestones, sprint/ops planning, planning accuracy, cross-team dependencies, and reporting structures (dashboards, milestone health updates). They ensure the right people are engaged at the right time and that work flows across teams efficiently.

Step 1 – PM / PMM / Sales Engineering

Purpose: Define why the test exists.

Technical Actions

  • Translate customer and field asks into measurable test goals (e.g., 300 nodes | 1.2 PB | multi-zone | TPCC steady state 30–60% CPU | CDC/backup targets).

  • Specify workload dimensions:

    • TPCC warehouses / client partitions

    • CDC sink counts + initial scan volume

    • Backup/restore data volume and RF

  • Create Jira Epic + acceptance criteria (label: O-25.x-scale-testing).

Inputs → Customer scenarios, competitive benchmarks, prior cycle results.
Outputs → Objectives doc + KPI list + scenario constraints.
Handoff → R&D Ops (team charter setup)
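One way to keep Step 1 objectives measurable is to record each KPI with an explicit target and direction, so later phases can check results mechanically. A sketch (field names and the sample targets are illustrative, not a real schema):

```python
# Illustrative KPI record for Step 1 objectives. Each KPI carries a target
# and a direction, so pass/fail is mechanical rather than judgment-based.
from dataclasses import dataclass

@dataclass
class Kpi:
    name: str
    target: float
    unit: str
    higher_is_better: bool = True

    def passes(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target

# Example objectives (targets are placeholders for illustration only).
objectives = [
    Kpi("steady_state_cpu_pct", 60.0, "%", higher_is_better=False),
    Kpi("tpcc_tpmc", 1_000_000, "tpmC"),
]
```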


Step 2 – R&D Ops

Purpose: Plan how the program will run.

Technical Actions

  • Draft program charter + timeline + milestones + weekly sprint/ops reviews.

  • Publish hygiene templates (runbook, “Baseline Green” checklist, dashboard schema).

  • Set dependency tracking in Jira Plans / Confluence matrix.

Inputs → Step 1 objectives + capacity windows from Cloud Ops.
Outputs → Charter doc | timeline | meeting cadence | Go/No-Go review slot.
Handoff → DRP / Test Eng + DB Teams


Step 3 – DRP / Test Engineering + DB Teams

Purpose: Design what we will test.

Technical Actions

  • Build base cluster spec (e.g., n2-standard-16, 300 nodes, 2 stores × 2 TB, RF=5).

  • Prepare cluster settings plan (WAL failover, admission control, store bandwidth).

  • Define workloads + observability stack + artifact schedule.

  • Create pass/fail criteria per phase (baseline, perf, resilience, security).

Outputs → Base config sheet | scenario matrix | metrics plan | artifact schedule.
Handoff → R&D Ops / Stakeholders (for review)
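The base config sheet can also carry a fit check: at RF=5, a planned dataset's on-disk footprint should leave headroom against the raw capacity. A sketch, where the 80 TB logical figure is an assumption chosen so the replicated footprint matches the ~400 TB backup/restore targets mentioned elsewhere in this doc:

```python
# Sketch: check that a planned dataset fits the base spec at the chosen RF.
# The 80 TB logical size is an illustrative assumption (pre-compression).

def replicated_tb(logical_tb: float, rf: int) -> float:
    """On-disk footprint of a dataset replicated rf ways (pre-compression)."""
    return logical_tb * rf

raw_tb = 300 * 2 * 2                  # nodes x stores/node x TB/store = 1200
on_disk = replicated_tb(80, rf=5)     # 400 TB on disk
utilization = on_disk / raw_tb        # ~0.33, leaving headroom for rebalancing
```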


Step 4 – R&D Ops + All Stakeholders

Purpose: Align ownership and dependencies.

Technical Actions

  • Build RACI matrix (showing who owns fixtures, CDC sinks, backup buckets, schema metrics).

  • Define pre-reqs (security defaults, budget limits, bucket paths, access lists).

Outputs → Final dependency sheet | owner assignments | approvals.
Handoff → Cloud Ops + Infra Platform Eng


Step 5 – Cloud Ops + Infra Platform Engineering

Purpose: Plan and guarantee capacity.

Technical Actions

  • Finalize regions/zones, IOPS, network topology, OS image, IAM roles, volume sizes.

  • Define provisioning SLA and access model (dbconsole, cert distribution, VPC rules).

Outputs → Provisioning plan | SLA targets | cost guardrails | network diagram.
Handoff → Product Security / Compliance (if in scope)


Step 6 – Product Security / Compliance

Purpose: Ensure readiness for secure testing.

Technical Actions

  • Validate encryption in transit and at rest, logging policy, vulnerability scan schedule.

  • Confirm SOC2 / HIPAA coverage and evidence collection process.

Outputs → Security + compliance approval notes | scan plan | contacts.
Handoff → DRP / Test Eng (environment request)


Step 7 – DRP / Test Engineering

Purpose: Request the environment.

Technical Actions

  • Submit cluster spec to Cloud Ops (300 nodes, zones, stores, PD sizes, OS image).

  • Attach access lists (SSH keys, dbconsole SSO groups, bucket ACLs).

  • Include timeline for data seed start.

Outputs → Provisioning request ticket / form with full specs.
Handoff → Cloud Ops + Infra


Step 8 – Cloud Ops + Infra

Purpose: Build the cluster.

Technical Actions

  • Provision nodes, configure IAM + network + TLS, apply OS hardening.

  • Deliver admin/SQL endpoints, dbconsole URLs, certs, firewall rules, cost monitors.

Outputs → Live cluster credentials + handover doc (ports, IPs, cert paths).
Handoff → DRP / Test Eng


Step 9 – DRP / Test Engineering

Purpose: Validate base health.

Technical Actions

  • Stage binaries (cockroach + workload), start cluster with agreed flags.

  • Enable Datadog/Looker dashboards, dbconsole, debug.zip/tsdump (verify impact).

  • Run SQL reachability, replication checks, and liveness tests.

Outputs → Healthy cluster + baseline dashboards + connectivity verified.
Handoff → Load Phase / DB SMEs
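The liveness portion of Step 9 amounts to classifying every node by heartbeat age and gating on "all live." A sketch of that logic; the thresholds and status names here are illustrative, not CockroachDB's actual liveness parameters:

```python
# Sketch of the Step 9 liveness gate: classify nodes by heartbeat age and
# require every node to be live before load starts. Thresholds are
# illustrative assumptions, not CockroachDB's real liveness settings.

def liveness_status(seconds_since_heartbeat: float,
                    suspect_after: float = 10.0,
                    dead_after: float = 300.0) -> str:
    if seconds_since_heartbeat < suspect_after:
        return "live"
    if seconds_since_heartbeat < dead_after:
        return "suspect"
    return "dead"

def cluster_is_healthy(heartbeat_ages: list[float]) -> bool:
    """Baseline gate: every node must report as live."""
    return all(liveness_status(age) == "live" for age in heartbeat_ages)
```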


Step 10 – DRP / Test Engineering

Purpose: Prepare data & baseline.

Technical Actions

  • Load fixtures/imports (e.g., TPCC 4M warehouses), then adjust RF if planned.

  • Verify data distribution, range counts, and baseline metrics.

Outputs → Baseline dashboards | initial SQL/latency report.
Handoff → DB SMEs for readiness check
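Seeding at this scale is usually schedule-critical, so a back-of-envelope load-time estimate helps plan the window. A sketch; the 100 MB/s per-node ingest rate is purely an assumption for illustration:

```python
# Back-of-envelope seeding-time estimate for the Step 10 fixture load.
# The per-node ingest rate is an illustrative assumption, not a measured value.

def load_hours(logical_tb: float, nodes: int, mb_per_sec_per_node: float) -> float:
    total_mb = logical_tb * 1_000_000           # TB -> MB (decimal units)
    aggregate_rate = nodes * mb_per_sec_per_node  # cluster-wide ingest, MB/s
    return total_mb / aggregate_rate / 3600

hours = load_hours(logical_tb=80, nodes=300, mb_per_sec_per_node=100)
# With these assumed numbers the load finishes in under an hour.
```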


Step 11 – DRP / Test Eng + DB SMEs

Purpose: Confirm readiness (Baseline Green).

Technical Actions

  • Run low-rate TPCC to verify throughput & latency.

  • Validate raft commit latency, gossip callbacks, SQLLiveness.

  • Tune cluster settings if required.

Outputs → Signed “Baseline Green” checklist | settings diff document.
Handoff → R&D Ops + PM / PMM for Go/No-Go
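The "Baseline Green" checklist lends itself to automation: compare each measured metric against its agreed limit and report any failures. A sketch in which the metric names and thresholds are illustrative assumptions:

```python
# Sketch of an automated "Baseline Green" check for Step 11.
# Metric names and limits below are illustrative assumptions.

BASELINE_GATES = {
    "raft_commit_latency_p99_ms": 50.0,
    "sql_p99_latency_ms": 200.0,
    "underreplicated_ranges": 0.0,
}

def baseline_green(measured: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (green?, list of failing gates). Missing metrics fail closed."""
    failures = [name for name, limit in BASELINE_GATES.items()
                if measured.get(name, float("inf")) > limit]
    return (not failures, failures)
```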


Step 12 – R&D Ops + PM / PMM

Purpose: Authorize execution.

Technical Actions

  • Conduct Go/No-Go review (check Gates A–C).

  • Freeze configs + publish run calendar + Slack channel for updates.

Outputs → Official “Go” decision | frozen configs | execution timeline.
Handoff → Scale Test Execution Phase
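Freezing configs at Go/No-Go is easier to enforce if the frozen settings are fingerprinted, so any later run can fail fast on drift. A simple sketch (not an official tool; the settings shown are examples):

```python
# Sketch: fingerprint the frozen configs from Step 12 so later runs can
# detect drift. A simple approach for illustration, not an official tool.
import hashlib
import json

def fingerprint(settings: dict) -> str:
    """Stable, order-independent hash of a settings map."""
    canonical = json.dumps(settings, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

frozen = fingerprint({"replication_factor": 5, "wal_failover": "among-stores"})
# Recompute before each run; a mismatch means the frozen config changed.
```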

SDLC Phases (End-to-End)

4.1 Requirements and Analysis

  • Purpose: Convert customer and competitive needs into measurable, testable objectives. We avoid vague goals and set concrete KPIs (e.g., “TPCC steady-state tpmC at 30–60% CPU,” “Restore throughput per node,” “CDC lag under X”).

  • Inputs and owners: PM/PMM gather requirements; R&D Ops frames milestones; DRP + DB leaders translate into technical acceptance criteria and scenarios; stakeholders agree up front on pass/fail thresholds.

  • Outputs: A concise set of objectives (e.g., “qualify 300 nodes ~1.2PB steady-state; exercise backups/restores ~400+ TB; run schema changes on 100+ TB while foreground traffic is on”), success criteria per area, and a dependency/RACI map (who needs to do what, when).

  • References: Executive Summary for outcomes (top-line receipts) and Dependencies doc for handoffs.

4.2 Solution and Test Design

  • Purpose: Turn goals into reproducible scenarios with observability and phase gates.

  • Key design decisions:

    • Base configs: e.g., GCE n2-standard-16, 300 nodes, multi-zone, 2 stores per node (2TB each), ~1.2PB total, TLS/logging enabled. This shape is chosen for realism and repeatability.

    • Phases: baseline/health → functional/workloads → performance/stress → resilience/recovery → security/compliance. Each phase has explicit acceptance criteria.

    • Observability plan: Define dashboards (Datadog/Looker), dbconsole page-load SLAs, debug.zip/tsdump/pprof cadence (and its impact), and “must-have” KPIs (QPS/tpmC, p95/p99 latency, CPU/IOPS, raft/scheduler metrics, CDC lag/commit, backup/restore speeds, protected timestamp behavior).

    • Chaos pacing: Always get to steady-state (stable throughput, predictable latencies, healthy raft/scheduler/liveness) before injecting chaos (disk stall/node kill/network partition). Schedule chaos deliberately; avoid “always-on chaos” because it masks subtle signals and burns engineering time.

  • Outputs: Base configuration docs, scenario plan (who runs what, when), success metrics and pass/fail, artifact schedule (what to capture and when).
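The chaos-pacing rule above ("steady-state before chaos") can be made operational by checking throughput stability over a recent window before a fault is injected. A sketch; the 5% variation band is an assumption, not a program standard:

```python
# Sketch of the "steady-state before chaos" gate: only inject faults once
# throughput over a recent window is stable. The 5% coefficient-of-variation
# band is an illustrative assumption.
from statistics import mean, pstdev

def is_steady(throughput_window: list[float], max_cv: float = 0.05) -> bool:
    """Stable if relative variation (stddev / mean) stays within the band."""
    m = mean(throughput_window)
    return m > 0 and pstdev(throughput_window) / m <= max_cv
```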

4.3 Implementation (Environments and Instrumentation)

  • Purpose: Build secure, observable test environments that match the design.

  • Steps:

    • Provision clusters (via Cloud Ops) with requested node count/shape, zones, stores; enable TLS/logging and access; stage cockroach and workload binaries; configure dashboards and dbconsole.

    • Seed large data quickly: Use workload fixtures/imports to load millions of TPCC warehouses so data volume is realistic (e.g., 4M warehouses; RF=5 may yield ~400+ TB on disk). Confirm data distribution and basic health.

  • Outputs: A running, monitored cluster with “green” baselines (health and performance) and credible data volume—ready for execution phases.

4.4 Verification and Validation (Execution at Scale)

  • Purpose: Prove behavior under realistic load and failures; collect receipts tied to criteria.

  • What we run (examples):

    • Functional/workload: Steady TPCC/YCSB/Bank/Roachmart; large imports; schema changes (ADD COLUMN/CREATE INDEX) on 100+ TB tables while foreground traffic is on; online full and incremental backups; CDC initial scans to multiple sinks (Kafka, Webhook, GCS).

    • Performance/stress: Reach target CPU/QPS/tpmC and hold steady-state; watch scheduler latency, raft metrics, and SQLLiveness traffic to ensure system balance and no runaway tail latencies.

    • Resilience/recovery: Fault injection (disk stalls—WAL failover should engage; node kill; network partition), rolling restarts; PIT restore and large restore (~400+ TB) while tracking per-node rates and retries; measure RTO/RPO and foreground impact.

    • Security/compliance: Run scans and coverage if in scope for the cycle.

  • Evidence (what we capture):

    • KPIs: throughput (QPS/tpmC), CPU/IOPS, p95/p99 latencies, dbconsole page-load SLAs, protected timestamp age, CDC lag/commit latency, backup/restore throughput, raft/scheduler metrics (e.g., log commit latency, send queues), range health.

    • Artifacts: dashboards (links/screens), debug.zip, tsdump, pprof/traces, job logs; store these centrally and link them in Jira issues and reports.

  • Outputs: Pass/fail per scenario and a clear list of findings/anomalies, each with attached evidence for triage.
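For the resilience scenarios, RTO falls out of the run timeline: the time from fault injection until KPIs return to steady state. A sketch of that computation; the event names are illustrative:

```python
# Sketch: derive RTO for a resilience scenario from timestamped run events.
# Event names ("fault_injected", "steady_state_restored") are illustrative.
from datetime import datetime

def rto_seconds(events: dict[str, str]) -> float:
    fmt = "%Y-%m-%dT%H:%M:%S"
    injected = datetime.strptime(events["fault_injected"], fmt)
    recovered = datetime.strptime(events["steady_state_restored"], fmt)
    return (recovered - injected).total_seconds()

rto = rto_seconds({
    "fault_injected": "2025-05-01T10:00:00",
    "steady_state_restored": "2025-05-01T10:04:30",
})  # 270.0 seconds
```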

4.5 Defect Triage, Remediation, and Re‑test

  • Purpose: Turn findings into product improvements and verified fixes fast.

  • Typical issue themes and fixes (examples):

    • KV: raft scheduler latency spikes; gossip/meta1/meta2 split behavior; occasionally bursty liveness; fixes/backports scheduled into 25.3/25.4.

    • SQL Foundations: large index backfill duration; roadmap item for distributed SST merge to accelerate backfills at very large table sizes.

    • DR (Backup/Restore/Fast Restore): retry heuristics for split/scatter and downloads; scale link workers; fast-restore retry logic for long-running downloads; several landed or planned in 25.3/25.4.

    • CDC: catch-up scan scalability, back-pressure, enriched envelope/metadata, graceful handling across rolling restarts/upgrades.

    • Observability: dbconsole page SLAs at scale; debug.zip/tsdump cadence tuning to avoid host latency.

  • Process:

    • DRP files Jira with artifacts and clear reproduction; product teams triage, fix/backport; DRP re-tests the exact scenario; document any remaining limitations with plan/timeline.

  • Outputs: Verified fixes, documented limitations, and a list of release-bound work tied directly to the receipts (e.g., targeted for 25.3/25.4).
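Several of the DR fixes above come down to the same pattern: retry transient failures with backoff instead of failing a long-running job. A generic sketch of that pattern, not the actual CockroachDB implementation:

```python
# Sketch of retry-with-exponential-backoff, the pattern behind the
# fast-restore retry fixes described above. Generic illustration only;
# not the actual CockroachDB restore code.
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (e.g., a dropped download)."""

def with_retries(op, attempts: int = 5, base_delay: float = 1.0):
    for i in range(attempts):
        try:
            return op()
        except TransientError:
            if i == attempts - 1:
                raise  # exhausted the budget; surface the failure
            time.sleep(base_delay * 2 ** i)  # 1s, 2s, 4s, ...
```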

4.6 Reporting and Release Readiness

  • Purpose: Publish what’s true with evidence; align the roadmap and external claims.

  • Artifacts:

    • Engineering Report (deep dive): KPIs by area, execution details, metrics, artifacts, issues and their status, re-test results. This is the technical source of truth.

    • Executive Summary (narrative): What worked (e.g., 2× faster IMPORT vs. 25.1, stable steady-state), what didn’t (e.g., fast restore failure at 400+ TB without retries), and what’s planned (e.g., 25.3/25.4 improvement list). This is the leadership/sales-facing story.

    • Dashboards and milestone health: Ongoing visibility for stakeholders.

  • Outputs: Credible claims, clear near-term release priorities, and a shared understanding of risks and limits.

4.7 Operations and Continuous Improvement

  • Purpose: Make each cycle more predictable, faster, and more insightful.

  • Practices:

    • Weekly ops/sprint planning and milestone reviews; improve planning accuracy (buffer unplanned work; cleaner ownership handoffs); explicitly separate steady-state windows from chaos windows for clearer signals.

  • Outputs: Sharper criteria, cleaner scenarios, better capacity planning, and faster fix → re-test loops; a healthier, more repeatable program.


Glossary

  • TPCC

    • An OLTP benchmark simulating order-entry workloads (new orders, payments, deliveries). We use it to measure steady-state throughput (tpmC/QPS) and p95/p99 latency without redesigning apps. Why it matters: It’s widely understood and comparable across runs.

  • Warehouses (TPCC)

    • Units of dataset size in TPCC. More warehouses means more data and workload scale. Why it matters: We use it to scale up data volume (e.g., 4M warehouses) and workload intensity in a controlled way.

  • CDC (Change Data Capture)

    • Streams changes (inserts/updates/deletes) to sinks like Kafka, Webhook, or GCS. Key metrics: lag (max_behind_nanos) and commit latency. Why it matters: Real customers rely on CDC to power downstream systems; we must prove it scales.

  • Backup/Restore / Fast Restore

    • Backup: copy data to cloud storage; Restore: bring it back. Fast Restore optimizes download/ingestion (and needs robust retries). Why it matters: RTO/RPO at large data sizes are critical; testing exposes retry gaps and throughput limits to fix.

  • PIT Restore (Point-in-Time)

    • Restores the cluster to a specific time. Why it matters: Recovery accuracy and speed for real incidents (e.g., operator error, data corruption).

  • Protected Timestamp

    • Prevents GC from removing data needed by long-running jobs (backfills, restores). We track its “age” to ensure it doesn’t grow unbounded and block GC for too long.

  • SQLLiveness

    • Tracks the liveness of SQL instances. Surges in SQLLiveness traffic can create scheduling pressure; we watch its behavior at scale.

  • Raft / Scheduler Latency

    • Raft is the consensus protocol that keeps replicas consistent. Scheduler latency is a measure of how quickly the system processes Raft work and other tasks. Why it matters: spikes can cause leadership churn and tail latencies.

  • Debug.zip / tsdump

    • Debug.zip: a compressed bundle of logs and diagnostics; tsdump: raw time-series export. Why it matters: they’re the evidence packages engineers use for root cause analysis—schedule them thoughtfully to avoid overhead.

  • Roachprod / Roachtest / Workload / csv-server

    • Roachprod: cluster lifecycle management; Roachtest: orchestrates multi-node tests; Workload: runs benchmarks like TPCC; csv-server: streams CSVs for fast data generation. Why it matters: these tools make large-scale testing feasible and repeatable.

 

Copyright (C) Cockroach Labs.
Attention: This documentation is provided on an "as is" basis, without warranties or conditions of any kind, either express or implied, including, without limitation, any warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose.