There’s currently a limit on the number of rows a SQL transaction can write before it becomes “too large to commit”; the reasons for this limitation are explained below. It is surprisingly easy for an unsuspecting SQL user to find herself in this situation: statements like INSERT INTO <table> SELECT … and CREATE TABLE <table> AS SELECT … can easily exceed the current limit of 100k rows. Even if the client hasn’t explicitly opened a transaction, CRDB internally runs every SQL statement in an implicit transaction. This codelab will modify the execution of CREATE TABLE … AS … so that it no longer runs into this limitation. The point of the exercise is to familiarize the reader with parts of the CRDB codebase, not the actual design of the solution.
Let’s start by observing the problem. First, let’s build and start a cockroach server:
user@bash: make build
user@bash: ./cockroach start --insecure --logtostderr
In another terminal, let’s open a SQL shell and attempt to create a large table:
user@bash: ./cockroach sql --insecure
# Welcome to the cockroach SQL interface.
# All statements must be terminated by a semicolon.
# To exit: CTRL + D.
root@:26257/> create database codelab;
CREATE DATABASE
root@:26257/> CREATE TABLE codelab.series AS SELECT generate_series FROM generate_series(1, 100100);
pq: kv/txn_coord_sender.go:423: transaction is too large to commit: 100108 intents
Intermezzo 1: Transactions and Intents
crdb is a transactional database, and this is reflected at all levels of our tiered architecture - sql, kv, storage. A kv-level transaction is logically identified by a UUID. KV requests carry this ID around so that the storage level can use it when evaluating reads and writes. A transactional write (a write that is part of a txn T) of value v to key k places an intent on k. An intent is stored in our on-disk storage engine (RocksDB) like a committed value, but it carries extra information: the intent contains v and the txn’s ID. The point of the intent is to block other conflicting transactions that try to access the key until T commits or rolls back. When T finally commits or rolls back, all its intents need to be resolved. The KV request causing a transaction to finish is EndTransaction{commit: true/false, intent_spans:...}. This request carries with it information about all the key spans that contain intents laid down by the finishing txn; all these intents will be cleaned up during the execution of the EndTransaction.
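To make that request concrete, here’s a hedged Go sketch of what an EndTransaction carrying intent spans might look like if built by hand. The field names mirror roachpb.EndTransactionRequest as it shows up later in this codelab (et.IntentSpans), but the construction is simplified for illustration; in practice the TxnCoordSender populates this request for us.

// Illustrative only; types are from github.com/cockroachdb/cockroach/pkg/roachpb,
// and the details of construction are approximate.
et := &roachpb.EndTransactionRequest{
	Commit: true, // false would roll the transaction back instead
	IntentSpans: []roachpb.Span{
		{Key: roachpb.Key("a")},                           // an intent on a single key
		{Key: roachpb.Key("x"), EndKey: roachpb.Key("z")}, // intents somewhere in a key range
	},
}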
With some idea about what intents are, we can now try to explain the “transaction is too large to commit” limitation. We generally don’t like very large KV requests; a request needs to be loaded into memory by the node evaluating it, and the resulting Raft command must be loaded into memory by each replica that’s applying it. All these points where large memory allocations are made are problematic, and the ones related to Raft commands even more so - once the command has been written to the Raft log, all the replicas are forced to apply it, and their execution cannot diverge. There’s no way for an individual replica to say “this is too big for me” and bail. So, if a command is indeed too large for a particular replica, that replica is forever going to crash trying to apply it and the cluster will be generally unhappy. So, in the particular case of EndTransaction requests, we limit their size by putting a limit on the number of intent key spans that they can carry. Generally, higher levels of crdb code control the sizes of the Raft commands that are going to be proposed; however, in the case of EndTransaction, that’s not as true: the EndTransaction request is automatically populated with the intent key spans. So, it makes sense for the number of these key spans to have a limit.
To understand where this limit is enforced and what we’ll need to do to work around it, it’ll be helpful to have a primer on how Cockroach write requests are executed.
Intermezzo 2: Cockroach KV requests
The SQL layer interacts with the data by means of KV requests. These requests are batched and sent through a hierarchy of layers until they make their way to the range holding the data in question. This sending happens through each layer calling into a lower-level implementation of the client.Sender interface. Each Sender does some processing on the batch and then sends it along to the next layer. Notable Senders are the DistSender, which splits the requests in a batch into sub-requests for different ranges and then reassembles their results, and the TxnCoordSender, which does bookkeeping for transactional writes. Relevant to our quest, the TxnCoordSender keeps track of the keys being written by all the batches that are part of a transaction, as these batches go through it, and attaches the corresponding intent key spans to the EndTransaction request that finishes that transaction.
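The Sender chain is easier to picture with the interface in front of us; this is approximately what client.Sender looks like (the exact signature may have drifted, so treat it as a sketch):

// Approximate shape of the client.Sender interface: each layer
// (TxnCoordSender, DistSender, ...) does its own processing on the batch
// and then calls Send on the next, lower-level Sender.
type Sender interface {
	Send(ctx context.Context, ba roachpb.BatchRequest) (*roachpb.BatchResponse, *roachpb.Error)
}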
Each request has a Header (e.g. check out one random request). This header indicates either a key or a range of keys that the request needs to operate on. Based on this span, the request is routed to the right range.
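As a rough illustration (not the exact proto layout), the span in a request’s header is just a start key and an optional end key, and routing amounts to finding the range(s) whose boundaries cover that span:

// Rough illustration only; the real request headers embed a span like this,
// but the exact proto fields differ.
span := roachpb.Span{
	Key:    roachpb.Key("a"), // start key
	EndKey: roachpb.Key("c"), // end key; empty for single-key requests
}
// The DistSender looks up the range descriptor(s) covering [Key, EndKey)
// and sends the request (or per-range sub-requests) to the right place.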
Intermezzo 3: Terminology
We’ve used a couple of terms in this document without offering any explanation; to find our way around the crdb codebase, these terms need to be made more precise.
- Request: a request refers to a KV request, namely a BatchRequest (even single requests are wrapped in batches before being sent to KV). These are protobuf messages that will be routed to the data range on which they’re supposed to operate and eventually return a response. They’re split into read and write requests; besides the types of results these categories produce, an important difference is that only writes need to go through Raft.
- Evaluation: when a request arrives at its destination range, it is evaluated. For reads, this evaluation means actually reading data from the storage layer and packaging it into a response. For writes, evaluation means figuring out exactly what low-level changes are to be made to the storage-level data (e.g. a DeleteRange request will cause a bunch of tombstone records to be written to storage) and serializing all these changes into a command. Evaluation happens on the current lease holder replica of the range in question.
- Command: a command is the result of evaluating a write request. It contains a binary representation of changes to be made to storage (RocksDB). It is proposed to Raft and, if accepted, it will subsequently be applied on all the replicas of the range.
- Proposing: a command is proposed to the Raft group formed by all the replicas of a range. If a quorum of the replicas accept it (according to the Raft consensus protocol rules), the command is written to the Raft log - at this point the command has occupied one position in the log. This log is constantly applied by all the members of the Raft group (i.e. all the range’s replicas).
- Application: commands are taken off the log by the Raft machinery and applied to the local RocksDB instance. This involves deserializing the changes marshalled inside the command and… applying them to the storage engine. After application, the RocksDB instance on the specific replica has been changed according to the initial write request that triggered the whole saga. The sketch after this list ties evaluation, proposal, and application together.
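Here is that sketch: purely illustrative Go pseudocode for the write path, with names invented for this codelab (they are not real crdb APIs).

// Purely illustrative; none of these names exist in the crdb codebase.
package main

type request struct{ key, value string }
type command []byte // serialized storage-level changes

// evaluate runs on the lease holder: figure out the low-level changes the
// request implies and serialize them into a command.
func evaluate(r request) command { return command(r.key + "=" + r.value) }

// propose hands the command to the range's Raft group; once a quorum
// accepts it, the command occupies one position in the Raft log.
func propose(c command) error { return nil }

// apply runs on every replica: deserialize the command and apply its
// changes to the local RocksDB instance.
func apply(c command) { _ = c }

func main() {
	cmd := evaluate(request{key: "a", value: "1"}) // evaluation
	if err := propose(cmd); err != nil {           // proposal
		return
	}
	apply(cmd) // application (on each replica, asynchronously)
}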
With this information about crdb requests in mind, let’s get to our objective and the code. So, we know that CREATE TABLE <table> AS SELECT … lays down too many intents and, as a result, its transaction fails to commit. Who exactly is preventing us from committing? With the crdb codebase checked out, let’s grep for the error:
./pkg/kv/txn_coord_sender.go: return roachpb.NewErrorf("transaction is too large to commit: %d intents", len(et.IntentSpans))
Ah, txn_coord_sender.go. That makes sense: we know that the TxnCoordSender is in charge of keeping track of all the intents that a transaction created. Looking at the code, we see that the number of intents being compared to maxIntents is coming from txnMeta := tc.txnMu.txns[txnID]. tc.txnMu.txns is a map in which the TxnCoordSender accumulates data for each txn, as txns send requests through the TxnCoordSender. Here’s where this map is updated with intents generated by each batch. How does the TxnCoordSender know what intents have been laid down exactly? It’s not like requests have a notion of intents, exactly. Different types of write requests have information about what keys they might write to (e.g. conditional requests such as a CPut might or might not write something). What the code currently does is look at the Header of each request and assume that intents might have been laid down on any key in the key span indicated by the request’s header. As with anything in crdb, the actual code is more complicated, but it’s all around here. OK, so the TxnCoordSender peeks at all the requests going through it on behalf of a transaction, accumulates the key spans from each write request’s header, and, for our CREATE TABLE statement, we somehow end up with too many of these key spans.

What spans exactly is the TxnCoordSender ending up with? Let’s theorize first, and then we’ll convince ourselves that the theory is correct. It’d be a good guess to think that the TxnCoordSender ends up with a key span for each row being inserted in the new table. Before we delve into the sql-layer code that generates the KV requests and verify that this is indeed the case, let’s jump ahead and also propose a solution: it’d be correct to just attach a single range of intents to the EndTransaction that’s committing the table creation; that range would be the whole key space of the table. One big key range would be a lot better than a gazillion small ranges: for one, it’d make the TxnCoordSender happy and it wouldn’t block us from committing any more. More importantly, using a single range for cleaning up the intents is a smart idea - remember that we’re afraid of large requests, not so much of requests having to ultimately perform a lot of work. See Appendix A for a discussion of how this single EndTransaction request is executed and what it turns into.

Before we talk about how we’d actually code the solution proposed here, let’s note that this idea of replacing individual intent keys with one big range is not clearly a general win - you don’t want to be too pessimistic about what range needs to be scanned for intents to be cleaned up. If a txn writes keys “a” and “z”, you don’t want it to need to scan the potentially billions of keys in between. However, in the case of a brand new table, we know that all keys pertaining to this table will have intents, and that those intents all belong to the current transaction. So, a key span encompassing the whole table key space is exactly what we need to scan for intents.
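As a sketch of where we’re headed (not the final patch), the single span we’d want to hand to the EndTransaction covers the whole key space of the new table. Something along these lines, where tableDesc stands for the new table’s descriptor; keys.MakeTablePrefix and Key.PrefixEnd are existing helpers, but exactly where this span gets plumbed in is what the rest of the codelab works out.

// Sketch: one intent span covering the entire key space of the new table.
tablePrefix := roachpb.Key(keys.MakeTablePrefix(uint32(tableDesc.ID)))
tableSpan := roachpb.Span{
	Key:    tablePrefix,
	EndKey: tablePrefix.PrefixEnd(), // first key past the table's key space
}
_ = tableSpan // to be attached to the committing EndTransaction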
One more thing still before we start coding: we owe ourselves more convincing that the intents accumulated by the TxnCoordSender are what we think they are (i.e. one per SQL row being inserted in the new table). For that, we need to understand a bit about how the SQL layer works.
Intermezzo 4: SQL execution
CRDB contains a SQL compiler with a front-end (parsing and semantic analysis), middle-end (logical plan construction) and back-end (plan execution). We’ll focus on the execution here. The logical plan of a SQL statement is represented as a tree of planNodes. These nodes double as executable structures implementing the Volcano model: they have Start()/Next() methods that run through all the rows that each node wants to pass up to its parent. To get an intuition about these planNodes, let’s look at the plan created for a SELECT statement:
SELECT * FROM customers WHERE State LIKE 'C%' AND strpos(address, 'Infinite') != 0 ORDER BY Name;
This slightly contrived example is supposed to return customers from states starting with "C" and whose address contains the string "Infinite". To get excited, let's see the query plan for this statement:
root@:26257> EXPLAIN(EXPRS) SELECT * FROM customers WHERE State LIKE 'C%' and strpos(address, 'Infinite') != 0 order by name;
+-------+------------+--------+----------------------------------+
| Level | Type | Field | Description |
+-------+------------+--------+----------------------------------+
| 0 | sort | | |
| 0 | | order | +name |
| 1 | index-join | | |
| 2 | scan | | |
| 2 | | table | customers@SI |
| 2 | | spans | /"C"-/"D" |
| 2 | | filter | state LIKE 'C%' |
| 2 | scan | | |
| 2 | | table | customers@primary |
| 2 | | filter | strpos(address, 'Infinite') != 0 |
+-------+------------+--------+----------------------------------+
So the plan produced for this query, from top (highest-level) to bottom, looks like:
sortNode -> indexJoinNode -> scanNode (index)
                          -> scanNode (PK)
When executing this query, the runner will iterate through the rows produced by the root node (the sortNode) which, in turn, will read all the rows from the indexJoinNode, which will read from the scanNodes that actually interface with the KV layer for reading data.
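It helps to keep the planNode interface in mind while reading this code. Approximately (some methods elided, and the exact signatures may have drifted), each node looks like this, and the runner drives the root node by calling Next until it returns false:

// Approximate shape of the planNode interface; details elided.
type planNode interface {
	Start(ctx context.Context) error        // prepare the node for execution
	Next(ctx context.Context) (bool, error) // advance to the next row; false when exhausted
	Values() parser.Datums                  // the row the node is currently positioned on
	Close(ctx context.Context)              // release resources
}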
Now, with some mental picture of how a SQL query is executed, let’s see how a CREATE TABLE… AS… is executed. CREATE TABLE is compiled into a createTableNode; if an AS clause is present, this node does something quite odd internally: at runtime, in its Start() method, it creates and executes an insertNode which actually populates the new table. So, to understand what intents are created by CREATE TABLE AS, we will dig into what intents are created by an insertNode (which is also what INSERT statements get compiled to). The insertNode is a bit complicated; we’ll have to dig through a couple of layers. But it’ll all be worth it, as we’ll see how data is created in CRDB. Let’s start in insertNode.Next(), which is called in a loop by the parent until it says that its job is done. It starts with this line:
if next, err := n.run.rows.Next(ctx); !next {
...
}
So, we’ll be iterating through all the rows coming from n.run.rows, which is the insertNode's source. In our case, the source had been set by the createTableNode as the execution plan for the AS … clause - the root of this plan is a selectNode. The insertNode.Next() code proceeds to massage a row from the source into a row suitable for the new table, and eventually calls:
_, err = n.tw.row(ctx, rowVals)
This n.tw is a tableWriter, which is where the journey to producing KV values begins. As we’ll see, the n.tw.row(…) call batches a bunch of KV pairs to be written in the database later. tableWriter is an interface; the implementation we care about here is the tableInserter. tableInserter is a pretty thin layer on top of a RowInserter, but it does two important things: it creates one of those KV batch requests that we’ve mentioned before, and it later (in tableInserter.finalize()) sends that batch for execution. This finalize() is called by the insertNode only after the source has been exhausted; this means that everything the insertNode wants to insert is batched in one big request. This is not great (issues #15713, #16180). As we’ve talked about in our discussion of the TxnCoordSender, the write requests in this BatchRequest will result in intent key spans that get tracked. Let’s dig deeper to see what these write requests are, moving on to the RowInserter that the tableInserter uses for batching each row. The magic happens in RowInserter.InsertRow(): this fella is aware of the schema of the table in which we’re inserting, the primary key, the secondary indexes, etc., and so it is in charge of encoding the correct raw bytes for each of these. A discussion of these encodings is beyond our scope; see Structured data encoding in CockroachDB SQL. What we’re interested in are the KV requests generated. We see that, for each row, there’s a call to putFn
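To connect this back to the intents question, here is a hedged sketch of the kind of per-row writes that end up in the batch; the key and value construction is heavily simplified compared to the real encoding in RowInserter, and encodePrimaryKey/encodeRowValue are hypothetical helpers that exist only in this sketch.

// Heavily simplified: each inserted row turns into one or more key/value
// writes added to a client.Batch (one per index in reality). Each such
// write is what later shows up as an intent key span in the TxnCoordSender.
b := txn.NewBatch()
for _, row := range rows {
	key := encodePrimaryKey(row) // hypothetical helper, stands in for the real key encoding
	val := encodeRowValue(row)   // hypothetical helper, stands in for the real value encoding
	b.CPut(key, val, nil)        // conditional put: expect no existing value
}
if err := txn.Run(ctx, b); err != nil { // in the real code, tableInserter.finalize() sends the batch
	return err
}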