1. Program Overview & Objectives
The Large Scale Cluster Testing program is Cockroach Labs’ cornerstone initiative for validating CockroachDB and Cockroach Cloud under genuine, production-scale conditions. The current cycle benchmarks clusters of up to 300 nodes (single-region, multi-zone, >440 TB of data, ~1.2 PB total storage), directly matching enterprise and cloud customer requirements.
Key Objectives:
- Certify CockroachDB for massive, resilient cluster deployments, both horizontal (node count) and vertical (resources per node).
- Discover scale-induced system bottlenecks, reliability gaps, and operational vulnerabilities (e.g., scheduler latency, SQLLiveness bursts, backup/restore, replica movement).
- Deliver engineering priorities through actionable findings, driving rapid fixes, the long-term product roadmap, and customer/competitive alignment.
- Ensure continuous quality improvement, cost-effective test delivery, and readiness for future AI-driven analytics.
2. Stakeholders & Organizational Structure (Who’s Who)
| Team/Stakeholder | Key Roles (Sample) | Responsibilities | Collaboration / Dependencies |
|---|---|---|---|
| Engineering Operations (R&D Ops) | Sandipan Banerjee, Colin Minihan | Program management, process definition, milestone/velocity tracking | All functions for planning, release, dashboards |
| Test Engineering / DRP | Stan Rosenberg, Bhaskarjyoti Bora, Dipti Joshi | Cluster orchestration, automation, workload setup, execution, findings | Cloud Ops, Engineering, DB teams |
| CRDB Engineering (Database Platform) | Michael Wang, Alex Lunev, Chandra Thumuluru | Core DB development, issue triage/remediation, reliability engineering | Depends on Test Eng; works with DRP |
| Cloud Ops / Cloud Eng | Lakshmi Kannan, Abhishek | Cloud environment setup (IBM/AWS/GCP), resource management | Test Eng for requirements; Infra Ops for hardware |
| Product Management / PMM / Sales Eng | Andy Woods, Dipti Joshi, Raymond Austin, Chris Casano | Test goal definition, customer workload intake, criteria/benchmarking, KPI tracking | All DB teams; field/customer scenario intake |
| Product Security / Compliance | Biplav Saraf, Pritesh Lahoti, Mike Geehan | Security validation, compliance coverage, audit guidance, vulnerability scans | Engineering, Cloud Ops, Compliance |
| Infrastructure Platform Eng | Rima Deodhar, Infra Ops team | Hardware/storage/network provisioning, cost controls | Supports environments and resource planning |
3. Program Structure & Processes
Multi-Phase Validation Cycles
Each test campaign progresses through iterative phases, ensuring robust coverage and continuous improvement.
Planning & Resource Allocation:
- Detailed cluster specs, workload definitions, environment scheduling, milestone setting.
Scale Deploy & Baseline Validation:
- Cluster launch (300 nodes, 16 vCPU/64 GB RAM per node, 2 stores/node, ~1.2 PB total); a minimal deploy sketch follows this phase.
- Connectivity checks, baseline metrics, health dashboards.
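To make the deploy step concrete, here is a minimal sketch that shells out to the roachprod CLI (assuming it is installed and cloud credentials are configured); the cluster name, machine type, and lifetime are illustrative placeholders, not the program’s recorded settings.

```python
# Hypothetical deploy helper: asks roachprod (the cluster lifecycle tool
# named in Section 4) for a 300-node testbed. Cluster name, machine type,
# and lifetime are illustrative assumptions.
import subprocess

CLUSTER = "scale-300"  # hypothetical cluster name

subprocess.run(
    [
        "roachprod", "create", CLUSTER,
        "--nodes", "300",                        # 300-node testbed
        "--gce-machine-type", "n2-standard-16",  # 16 vCPU / 64 GB RAM class
        "--lifetime", "24h",                     # ephemeral, for cost control
    ],
    check=True,
)
```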
Functional & Workload Testing:
- Benchmark workloads (TPC-C, YCSB, Bank, Roachmart), import scenarios, backup/restore, failover validation; an illustrative workload invocation follows.
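The invocation below is a sketch of such a benchmark run using the cockroach workload generator; the warehouse count, duration, and connection URL are assumed placeholder values, not the campaign’s actual parameters.

```python
# Illustrative TPC-C invocation via the `cockroach workload` generator.
# Warehouse count, duration, and connection URL are placeholders.
import subprocess

pgurl = "postgres://root@10.0.0.1:26257?sslmode=disable"  # hypothetical node

# Load the TPC-C fixture, then drive the workload against the cluster.
subprocess.run(
    ["cockroach", "workload", "init", "tpcc", "--warehouses", "10000", pgurl],
    check=True,
)
subprocess.run(
    ["cockroach", "workload", "run", "tpcc",
     "--warehouses", "10000", "--duration", "30m", pgurl],
    check=True,
)
```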
Performance & Stress Testing:
- Load testing: 100k SQL QPS, up to 4M warehouses, ~600 TB post-replication (CPU: ~20–40%).
- Evaluation of scheduler behavior, SQLLiveness, and replica scalability.
Resilience & Recovery Workflows:
- Fault injection, backup/restore validation (e.g., 440 TB in 4.5 hours at RF=5), replica recovery, change data capture; a backup/changefeed sketch follows.
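A minimal sketch of the backup and CDC steps as CockroachDB SQL, issued here through the psycopg driver; the bucket URI, Kafka sink, and table name are assumptions for illustration, not the program’s configured targets.

```python
# Sketch of backup and changefeed validation as plain CockroachDB SQL.
# Requires `pip install psycopg`; URIs and table names are placeholders.
import psycopg

BACKUP_URI = "s3://example-bucket/scale-test?AUTH=implicit"  # hypothetical sink

with psycopg.connect("postgres://root@10.0.0.1:26257") as conn:
    conn.autocommit = True  # BACKUP/CHANGEFEED cannot run inside a transaction
    conn.execute(f"BACKUP INTO '{BACKUP_URI}'")  # full-cluster backup
    conn.execute(
        "CREATE CHANGEFEED FOR TABLE tpcc.order_line "
        "INTO 'kafka://10.0.0.9:9092'"  # hypothetical Kafka sink
    )
```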
Security & Compliance Testing:
- Vulnerability scans, compliance suite execution for SOC 2/HIPAA.
Issue Triage & Revalidation:
- Findings logged in JIRA (O-25.1-scale-testing), with ownership assignment and iteration cycles.
Reporting:
- Technical reports, executive summaries, dashboard/KPI updates, milestone health templates.
Continuous Improvement
Weekly ops/sprint planning, milestone reviews, dashboard-driven reporting, and process improvement initiatives (e.g., the DRP SME program, a shared responsibility model, and enhanced metrics hygiene).
4. Technical Infrastructure & Tools
Cluster Environments
Standard Testbed:
- 300 nodes
- 16 vCPU, 64 GB RAM per node
- 2 stores/node (2 TB/store)
- ~1.2 PB total storage
- Single/multi-region, multi-zone
Environments:
- IBM Cloud, AWS, GCP (ephemeral clusters for cost control)
- On-prem infrastructure for specific density or topology scenarios
Tooling Stack
- roachtest: automated distributed test orchestration
- roachprod: cluster lifecycle management (build, deploy, tear down); a lifecycle sketch follows this list
- Workload generators: TPC-C, YCSB, Bank, Roachmart (for real and synthetic tests)
- csv-server: rapid fixture/test data generation at scale
- Dashboards/status boards: real-time visibility (Datadog, Looker, custom dashboards)
- JIRA: issue tracking, process governance, fix cycles
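A minimal sketch of that ephemeral lifecycle, assuming a cluster created as in the deploy sketch in Section 3; subcommand spellings and flags should be confirmed against roachprod --help.

```python
# Ephemeral-cluster lifecycle around a test run: stage binaries, start the
# cluster, and always destroy it afterward so large testbeds don't accrue
# cost. The cluster name is a placeholder carried over from earlier sketches.
import subprocess

CLUSTER = "scale-300"  # hypothetical cluster name

def sh(*args: str) -> None:
    """Run a command, raising on failure."""
    subprocess.run(args, check=True)

sh("roachprod", "stage", CLUSTER, "cockroach")  # push the cockroach binary
sh("roachprod", "start", CLUSTER)               # start CockroachDB on all nodes
try:
    # ... run roachtest workloads and collect metrics here ...
    pass
finally:
    sh("roachprod", "destroy", CLUSTER)         # tear down to stop spend
```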
5. Metrics, Outcomes & Reporting
Metrics & Operational Data
Cluster Performance Outputs:
- TPC-C: 100k SQL QPS at 300 nodes, 4M warehouses, ~600 TB data
- Backup: full backup of 440 TB in 4.5 hours; restore metrics tracked
- CDC: initial scan of 127 TB in 10 hours (7 sinks); worked throughput figures appear after this list
Issue Rates:
- 9–13 issues found per cycle (Q1 FY26; target: 24), highlighting gaps in test coverage and triage accuracy
- All findings are linked to epics/stories, with fix/retest accountability
Planning Accuracy:
- Test Eng + DRP: ~60–70% (industry target: 85–90%)
- DB Server: ~60%; Cloud Admin: ~55%
Observability:
- Health metrics: CPU, IOPS, QPS, error counts; automated failure alerting and dashboards
Success Criteria Tracking:
- Benchmarks, customer workload pass/fail, compliance checklists; dashboards visible to all stakeholders
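As a sanity check, the snippet below derives the throughput implied by the figures above, using only the reported numbers (440 TB in 4.5 hours, 127 TB in 10 hours, 300 nodes).

```python
# Back-of-envelope throughput implied by the reported metrics.
backup_tb, backup_h = 440, 4.5   # full backup: 440 TB in 4.5 hours
cdc_tb, cdc_h = 127, 10          # CDC initial scan: 127 TB in 10 hours
nodes = 300

backup_tb_per_h = backup_tb / backup_h                       # ~98 TB/h
per_node_mb_s = backup_tb * 1e6 / (backup_h * 3600) / nodes  # ~90 MB/s/node

print(f"Backup: {backup_tb_per_h:.0f} TB/h (~{per_node_mb_s:.0f} MB/s per node)")
print(f"CDC initial scan: {cdc_tb / cdc_h:.1f} TB/h across 7 sinks")
```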
Reporting Practices
- Engineering Reports: technical deep dives for each test, including KPIs and issue status
- Executive Summaries: rollups for R&D and senior stakeholders (issue highlights, metric trajectories)
- Dashboards: 24/7 health and progress tracking
- Milestone Health Templates: used for rolling status, blockers, and risks
6. Dependencies in Practice: Key Scenarios
- Environment Provisioning: Test Engineering requests large clusters from Cloud Ops (IBM/AWS/GCP); Infra Ops handles storage, network, and cost. Provisioning delays push out test launches and force sprint replanning.
- Issue Handoffs: test findings are handed off to CRDB Engineering/Product Security for root-cause analysis and fixes, then revalidated by Test Engineering.
- Customer Workload Modeling: Product Management/Field Engineering provide scenario specs to Test Engineering for ingest/benchmarking; results drive product direction.
- Compliance Audit: Compliance/Product Security depend on clusters set up by Test Engineering; sign-offs are required before release.
7. Current Programs & Improvement Initiatives
Active Programs
- 300-Node Scale Tests: recurring campaigns, multi-phase execution, real-world performance and adversity testing
- Customer Workload Intake & Simulation: formal intake process via JIRA/Confluence for high-impact, real customer cases (Wasabi, Ory, etc.)
- Performance Under Adversity (PUA): recent results show >10x latency reduction vs. earlier versions
- Multi-Phase Milestone Cycles: each test includes planning, deployment, functional/load/resilience/backup/security/compliance testing, fix/retest, and reporting
Process Improvements
- Shared Responsibility: DB teams and component SMEs own features, criteria, and investigation, not just reactive support
- DRP SME Program: engineers cross-trained to write operations, expand test coverage, and accelerate bug discovery
- Metrics & Observability Upgrades: enhanced metric tagging, better debug artifacts, Datadog/Looker dashboard upgrades
8. Challenges, Gaps & Risks
- Issue Discovery Rates: currently below target (Q1 FY26: 9–13 per cycle vs. a target of 24); ongoing review to recalibrate coverage
- Planning Accuracy: sits at ~60–70%; aiming to close the gap toward the industry-standard 85–90%
- Resource Constraints: testbed capacity and scheduling vs. backlog; budget controls for large clusters
- Documentation Gaps: some processes (success criteria trackers, full compliance/security suite detail) are referenced but not fully published
- Improvement Needs: clear assignment of cross-team owners (“drivers”), metrics hygiene, sustained process review
9. Summary Table: Key Program Facts
| Aspect | Details/Values |
|---|---|
| Cluster Specs | 300 nodes, 16 vCPU/64 GB RAM per node, 2 TB/store; ~1.2 PB total |
| Test Workloads | TPC-C, YCSB, Bank, Roachmart; bulk import, backup/restore |
| Key Outputs | Engineering reports, executive summaries, dashboards, JIRA |
| Metrics | QPS: 100k at 300 nodes; Backup: 440 TB/4.5 h; CDC: 127 TB/10 h |
| Issue Rate | 9–13/cycle (target: 24); all issues tracked, triaged, retested |
| Observability | CPU, IOPS, error rate, QPS; dashboards live 24/7 |
| Planning Accuracy | Test Eng/DRP: 60–70%; DB Server: 60%; Cloud Admin: 55% |
| Dependencies | Cross-team handoffs (Cloud Ops, Eng, PM, Compliance, Security) |
| Cost/Quality | Disposable clusters, automation, resource dashboards |
| AI Readiness | Robust telemetry, reporting, dashboarding; foundation set |
10. Reference Documentation & Links
1. Core Program Reports & Findings
- 300 Node Cluster Scale Test Report and Findings – DRAFT: https://docs.google.com/document/d/1uigYdL_ftkkfm1Lt5cnuJgNq2OEBqzq9Q81n8ldVqfo
- 150 Node Scale Test Report and Findings: https://docs.google.com/document/d/1jnU4x2-WNorYfXYm33hMh7SmegU556Ys4EANIu6nhYI
- 300 Nodes Scale Test Executive Summary: https://docs.google.com/document/d/1yfFcrTKNz5nFU1ZtXxdsjpD5SgN7h43-xESYczDg7ko
2. Program Planning & Quality Management
- FY2026 H2 Quality Plan (DRP/Test Engineering): https://docs.google.com/document/d/1dZHMjj-cB-VqWK5-jJ_dF_MyQLKN6Vuo5mk4SLH2jMA
3. Validation, Coordination & Sprint Processes
- Weekly Ops Planning and Sprint Planning: https://docs.google.com/document/d/16xgFmeXJCuKFtYf_yHMz0srnTJBJqj9gMcFTmdJkXSY
4. Technical Reference & How-To
- Large Scale Testing Technical Reference (Confluence page: “Large scale testing”)
5. Base Configurations & Test Scenarios
- 300 Node Scale Test Base Configuration – May 2025 – Draft: https://docs.google.com/document/d/19JKFVTQnqD0Gi3zudWJ6pjxayfNQMCi4P3DlAvEgam8
- 300 Node Scale Test Base Configuration and Scenarios – March 2025 – Draft: https://docs.google.com/document/d/1F8whp9JUaXoT2DKCF3GNSZc13y_e80MjE-nrXEyBX_A
6. Dependencies, RACI, and Team Coordination
- Large Scale Cluster Testing Dependencies: https://docs.google.com/document/d/1eBNNARqNrKnWUq_Ppo_Q_HxdpyTTFwegz6J5BB0XOiM
7. Metrics, Planning Accuracy & Roadmaps
- Planning Accuracy Assessment – Test Engineering: https://docs.google.com/document/d/1FkM6MhzyluSztcJU146c2zposLQPTdXPp1uGYOmCBuk
- Q2Q3 and Q3 FY26 R&D Roadmaps (Presentation): https://docs.google.com/presentation/d/1t0D8xfLopaBrCYjacSTJa5RVsQZ-1XctrsM3z8uaCI0