Large Scale Cluster Testing Program

1. Program Overview & Objectives

The Large Scale Cluster Testing program is Cockroach Labs’ cornerstone initiative for validating CockroachDB and Cockroach Cloud under genuine, production-scale conditions. The current cycle benchmarks clusters of up to 300 nodes (single-region, multi-zone, >440TB of data, ~1.2PB total storage), directly matching enterprise and cloud customer requirements.

Key Objectives:

  • Certify CockroachDB for massive, resilient cluster deployments, scaling both horizontally (node count) and vertically (resources per node).

  • Discover scale-induced system bottlenecks, reliability gaps, and operational vulnerabilities (e.g., scheduler latency, SQLLiveness bursts, backup/restore, replica movement).

  • Deliver actionable findings that set engineering priorities, driving rapid fixes, the long-term product roadmap, and customer/competitive alignment.

  • Ensure continuous quality improvement, cost-effective test delivery, and readiness for future AI-driven analytics.


2. Stakeholders & Organizational Structure (Who’s Who)

Team/Stakeholder | Key Roles (Sample) | Responsibilities | Collaboration / Dependencies
--- | --- | --- | ---
Engineering Operations (R&D Ops) | Sandipan Banerjee, Colin Minihan | Program management, process definition, milestone/velocity tracking | All functions for planning, release, dashboard
Test Engineering / DRP | Stan Rosenberg, Bhaskarjyoti Bora, Dipti Joshi | Cluster orchestration, automation, workload setup, execution, findings | Cloud Ops, Engineering, DB teams
CRDB Engineering (Database Platform) | Michael Wang, Alex Lunev, Chandra Thumuluru | Core DB development, issue triage/remediation, reliability engineering | Dependent on Test Eng, works with DRP
Cloud Ops / Cloud Eng | Lakshmi Kannan, Abhishek | Cloud environment setup (IBM/AWS/GCP), resource management | Test Eng for reqs, Infra Ops for hardware
Product Management/PMM/Sales Eng | Andy Woods, Dipti Joshi, Raymond Austin, Chris Casano | Test goal definition, customer workload intake, criteria/benchmarking, KPI tracking | All DB teams, field/customer scenario intake
Product Security/Compliance | Biplav Saraf, Pritesh Lahoti, Mike Geehan | Security validation, compliance coverage, audit guidance, vulnerability scans | Engineering, Cloud Ops, Compliance
Infrastructure Platform Eng | Rima Deodhar, Infra Ops team | Hardware/storage/net provisioning, cost controls | Supports environments and resource planning


3. Program Structure & Processes

Multi-Phase Validation Cycles

Each test campaign progresses through iterative phases, ensuring robust coverage and continuous improvement.

  1. Planning & Resource Allocation:

    • Detailed cluster specs, workload definitions, environment scheduling, milestone setting.

  2. Scale Deploy & Baseline Validation:

    • Cluster launch (300 nodes, 16 vCPU/64GB RAM per node, 2 stores/node, ~1.2PB total).

    • Connectivity checks, baseline metrics, health dashboards (a readiness-check sketch follows this list).

  3. Functional & Workload Testing:

    • Benchmark workloads (TPC-C, YCSB, Bank, Roachmart), import scenarios, backup/restore, failover validation.

  4. Performance & Stress Testing:

    • Load testing: 100k SQL QPS, up to 4M warehouses, 600TB post-replication (CPU: ~20–40%).

    • Evaluate scheduler, SQLLiveness, replica scalability.

  5. Resilience & Recovery Workflows:

    • Fault injection, backup/restore validation (e.g., 440TB in 4.5 hours, RF=5), replica recovery, change data capture.

  6. Security & Compliance Testing:

    • Vulnerability scans, compliance suite execution for SOC2/HIPAA.

  7. Issue Triage & Revalidation:

    • Findings logged in JIRA (O-25.1-scale-testing), ownership/iteration cycles.

  8. Reporting:

    • Technical reports, executive summaries, dashboard/KPI updates, milestone health templates.
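
As an illustration of the phase 2 connectivity check, the sketch below polls each node's HTTP readiness endpoint in parallel. This is a minimal sketch only: the node list, port (8080 is the CockroachDB default), and timeout are illustrative assumptions, not the program's actual tooling.

```go
// healthcheck.go: poll each node's readiness endpoint before kicking off workloads.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	// Hypothetical node addresses; in practice these would come from roachprod or an inventory file.
	nodes := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} // ...up to 300 entries

	client := &http.Client{Timeout: 5 * time.Second}
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		unhealthy []string
	)

	for _, n := range nodes {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			// /health?ready=1 returns 200 only when the node is live and accepting SQL connections.
			resp, err := client.Get(fmt.Sprintf("http://%s:8080/health?ready=1", node))
			if err != nil {
				mu.Lock()
				unhealthy = append(unhealthy, node)
				mu.Unlock()
				return
			}
			defer resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				mu.Lock()
				unhealthy = append(unhealthy, node)
				mu.Unlock()
			}
		}(n)
	}
	wg.Wait()

	if len(unhealthy) > 0 {
		fmt.Printf("%d/%d nodes not ready: %v\n", len(unhealthy), len(nodes), unhealthy)
		return
	}
	fmt.Printf("all %d nodes ready\n", len(nodes))
}
```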

Continuous Improvement

  • Weekly ops/sprint planning, milestone reviews, dashboard-driven reporting, and process improvement initiatives (e.g., DRP SME, shared responsibility model, enhanced metrics hygiene).


4. Technical Infrastructure & Tools

Cluster Environments

  • Standard Testbed:

    • 300 nodes

    • 16 vCPU, 64GB RAM per node

    • 2 stores/node (2TB/store)

    • ~1.2PB total storage

    • Single/multi-region, multi-zone

  • Environments:

    • IBM Cloud, AWS, GCP (ephemeral clusters for cost control)

    • On-prem infrastructure for specific density or topology scenarios
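
As a quick sanity check, the per-node spec listed above multiplies out to the quoted total (a minimal sketch, assuming decimal terabytes):

```go
// capacity.go: back-of-the-envelope check that the per-node spec adds up to ~1.2PB raw.
package main

import "fmt"

func main() {
	const (
		nodes         = 300
		storesPerNode = 2
		tbPerStore    = 2.0 // 2TB per store, as listed above (decimal TB assumed)
	)
	totalTB := float64(nodes*storesPerNode) * tbPerStore
	fmt.Printf("raw storage: %.0f TB (~%.1f PB)\n", totalTB, totalTB/1000)
	// Prints: raw storage: 1200 TB (~1.2 PB)
}
```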

Tooling Stack

  • roachtest: Automated distributed test orchestration

  • roachprod: Cluster lifecycle management (build, deploy, tear-down)

  • Workload Generators: TPCC, YCSB, Bank, Roachmart (for real and synthetic tests)

  • csv-server: Rapid fixture/test data generation at scale

  • Dashboards/Status Boards: Real-time visibility (Datadog, Looker, custom dashboards)

  • JIRA: Issue tracking, process governance, fix cycles
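
To make the tooling flow concrete, the sketch below drives a hypothetical campaign from Go by shelling out to roachprod and the workload generators. The cluster name, node count, connection URL, and every CLI flag are illustrative assumptions and vary by roachprod/cockroach version; the program's actual orchestration runs through roachtest rather than ad-hoc scripts like this.

```go
// provision_and_load.go: rough sketch of a test-campaign driver.
// All commands and flags below are assumptions for illustration; consult the
// roachprod and `cockroach workload` help output for the exact interface.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// run executes a command, prints its combined output, and fails fast on error.
func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	out, err := cmd.CombinedOutput()
	fmt.Printf("$ %s %v\n%s\n", name, args, out)
	if err != nil {
		log.Fatalf("%s failed: %v", name, err)
	}
}

func main() {
	const cluster = "scale-test-300" // hypothetical cluster name

	// 1. Provision an ephemeral cluster (cloud provider/machine-type flags omitted).
	run("roachprod", "create", cluster, "-n", "300")

	// 2. Stage and start the CockroachDB binary on every node.
	run("roachprod", "stage", cluster, "cockroach")
	run("roachprod", "start", cluster)

	// 3. Initialize and run a benchmark workload against node 1.
	//    The connection URL and warehouse count are placeholders.
	pgurl := "postgres://root@<node-1-address>:26257?sslmode=disable"
	run("cockroach", "workload", "init", "tpcc", "--warehouses=1000", pgurl)
	run("cockroach", "workload", "run", "tpcc", "--warehouses=1000", "--duration=30m", pgurl)

	// 4. Tear the cluster down when the campaign phase completes.
	run("roachprod", "destroy", cluster)
}
```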


5. Metrics, Outcomes & Reporting

Metrics & Operational Data

  • Cluster Performance Outputs:

    • TPC-C: 100k SQL QPS @ 300 nodes, 4M warehouses, ~600TB data

    • Backup: Full backup of 440TB in 4.5 hours; restore metrics tracked (a throughput conversion is sketched after this list)

    • CDC: Initial scan of 127TB in 10 hours (7 sinks)

  • Issue Rates:

    • 9–13 issues found per cycle (Q1 FY26; target = 24), highlighting gaps in test coverage and triage accuracy

    • All findings are linked to epics/stories, with fix/retest accountability

  • Planning Accuracy:

    • Test Eng+DRP: ~60–70% (industry target: 85–90%)

    • DB Server: ~60%, Cloud Admin: ~55%

  • Observability:

    • Health metrics: CPU, IOPS, QPS, error counts; automated failure alerting and dashboards

  • Success Criteria Tracking:

    • Benchmarks, customer workload pass/fail, compliance checklists; dashboards visible to all stakeholders
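
For context, the headline backup and CDC figures above convert to rough sustained rates as follows (a back-of-the-envelope sketch assuming decimal terabytes and ignoring ramp-up and retries):

```go
// throughput.go: convert the headline backup/CDC figures into sustained rates.
package main

import "fmt"

// rates converts "tb terabytes moved in hours" into aggregate GB/s and per-node MB/s.
// Decimal units are assumed (1 TB = 1000 GB).
func rates(tb, hours float64, nodes int) (gbPerSec, mbPerSecPerNode float64) {
	seconds := hours * 3600
	gbPerSec = tb * 1000 / seconds
	mbPerSecPerNode = gbPerSec * 1000 / float64(nodes)
	return
}

func main() {
	// Full backup: 440TB in 4.5 hours across 300 nodes.
	g, m := rates(440, 4.5, 300)
	fmt.Printf("backup:   ~%.1f GB/s aggregate, ~%.0f MB/s per node\n", g, m) // ~27.2 GB/s, ~91 MB/s
	// CDC initial scan: 127TB in 10 hours (7 sinks) across 300 nodes.
	g, m = rates(127, 10, 300)
	fmt.Printf("CDC scan: ~%.1f GB/s aggregate, ~%.0f MB/s per node\n", g, m) // ~3.5 GB/s, ~12 MB/s
}
```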

Reporting Practices

  • Engineering Reports: Technical deep dives for each test, including KPIs and issue status

  • Executive Summaries: Rollup for R&D and senior stakeholders (issue highlights, metric trajectories)

  • Dashboards: 24/7 health and progress tracking

  • Milestone Health Templates: Used for rolling status, blockers, and risks


6. Dependencies in Practice: Key Scenarios

  • Environment Provisioning: Test Engineering requests large cluster from Cloud Ops (IBM/AWS/GCP); Infra Ops handles storage, network, and cost.

    • Delays affect test launch and require sprints to realign.

  • Issue Handoffs: Test findings are transferred to CRDB Engineering/Product Security for root cause analysis and fix; then revalidated by Test Eng.

  • Customer Workload Modeling: Product Mgmt/Field Engineering provide scenario specs to Test Engineering for ingest/benchmark; results drive product direction.

  • Compliance Audit: Compliance/Product Security depends on clusters set up by Test Engineering; signoffs required before release.


7. Current Programs & Improvement Initiatives

Active Programs

  • 300-Node Scale Tests: Recurring campaigns, multi-phase execution, real-world performance and adversity testing

  • Customer Workload Intake & Simulation: Formal intake process via Jira/Confluence for high-impact, real customer cases (Wasabi, Ory, etc.)

  • Performance Under Adversity (PUA): Recent results show a >10x latency reduction vs. earlier versions

  • Multi-Phase Milestone Cycles: Each test includes planning, deployment, functional/load/resilience/backup/security/compliance, fix/retest, reporting

Process Improvements

  • Shared Responsibility: DB teams and component SMEs own features, criteria, investigation—not just reactive support

  • DRP SME Program: Engineers cross-trained to write operations, expand test coverage, accelerate bug discovery

  • Metrics & Observability Upgrades: Enhanced metric tagging, better debug artifacts, Datadog/Looker dashboard upgrades


8. Challenges, Gaps & Risks

  • Issue Discovery Rates: Currently below target (Q1 FY26: 9–13/cycle vs. target of 24); ongoing review to recalibrate coverage

  • Planning Accuracy: Sits at ~60–70%; aiming to close the gap toward the industry-standard 85–90%

  • Resource Constraints: Testbed capacity and scheduling vs backlog; budget controls for large clusters

  • Documentation Gaps: Some processes (success criteria trackers, full compliance/security suite detail) referenced but not fully published

  • Improvement Needs: Clear assignment of cross-team owners (“drivers”), metrics hygiene, sustained process review


9. Summary Table: Key Program Facts

Aspect | Details/Values
--- | ---
Cluster Specs | 300 nodes, 16 vCPU/64GB RAM per node, 2TB/store; ~1.2PB
Test Workloads | TPCC, YCSB, Bank, Roachmart; bulk import, backup/restore
Key Outputs | Engineering reports, executive summaries, dashboards, JIRA
Metrics | QPS: 100k at 300 nodes; Backup: 440TB/4.5h; CDC: 127TB/10h
Issue Rate | 9–13/cycle (target: 24); all issues tracked, triaged, retested
Observability | CPU, IOPS, error rate, QPS; dashboards live 24/7
Planning Accuracy | Test Eng/DRP: 60–70%; DB Server: 60%; Cloud Admin: 55%
Dependencies | Cross-team handoffs (Cloud Ops, Eng, PM, Compliance, Security)
Cost/Quality | Disposable clusters, automation, resource dashboards
AI Readiness | Robust telemetry, reporting, dashboarding; foundation set


10. Reference Documentation & Links

1. Core Program Reports & Findings


2. Program Planning & Quality Management


3. Validation, Coordination & Sprint Processes


4. Technical Reference & How-To


5. Base Configurations & Test Scenarios


6. Dependencies, RACI, and Team Coordination


7. Metrics, Planning Accuracy & Roadmaps