[CASSANDRA-21354][trunk] Avoid serialization and deserialization for coordinator-local single partition data read #4789

Open

netudima wants to merge 4 commits into apache:trunk from netudima:read_noser_proto_trunk

Conversation

@netudima
Contributor

netudima commented May 7, 2026

Currently, when we execute a local read we fetch data from SSTables and Memtables using a merging iterator and write it to a byte buffer. Later, when we build the CQL response, we deserialize that data to iterate over it again as part of the coordinator logic. So when data is read locally we allocate row and cell objects twice: once during the read from SSTables/Memtables and once during deserialization by the coordinator logic. This is a typical scenario, because drivers usually send requests directly to replicas.

The idea of the optimization: for a single-partition read of a small number of rows, we can keep the data in memory and avoid the double allocation of row objects.

A system property limits the number of rows kept in memory in this scenario (to avoid putting too much pressure on the GC through the extended lifetime of these objects and their promotion to the old generation). The property also allows disabling the logic entirely in case of any issues.
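
A minimal sketch of how such a knob could be read, assuming a hypothetical property name and the convention that a non-positive value switches the optimization off (neither detail is confirmed by the patch):

    // Hypothetical property name and default; the patch's actual values may differ.
    private static final int MAX_IN_MEMORY_ROWS =
        Integer.getInteger("cassandra.local_read_max_rows_in_memory", 100);

    // Assumption: a non-positive value disables the in-memory path entirely.
    private static boolean inMemoryReadEnabled()
    {
        return MAX_IN_MEMORY_ROWS > 0;
    }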

We cannot know the number of rows in advance, so we read the first N rows into memory; if more rows remain, we serialize the remainder to a buffer and later, when we need to iterate over the result, concatenate an iterator over the in-memory rows with an iterator over the deserialized remainder.
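
A minimal sketch of that shape using plain Java iterators (the patch operates on Cassandra's UnfilteredRowIterator instead; all names here are illustrative):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.NoSuchElementException;

    final class InMemoryPrefix
    {
        // Drain up to maxRows elements into memory; whatever remains on
        // `source` is what would be serialized into the overflow buffer.
        static <T> List<T> takePrefix(Iterator<T> source, int maxRows)
        {
            List<T> prefix = new ArrayList<>();
            while (prefix.size() < maxRows && source.hasNext())
                prefix.add(source.next());
            return prefix;
        }

        // Concatenate the in-memory prefix with the deserialized overflow
        // when the combined result is iterated.
        static <T> Iterator<T> concat(List<T> prefix, Iterator<T> overflow)
        {
            Iterator<T> first = prefix.iterator();
            return new Iterator<T>()
            {
                public boolean hasNext() { return first.hasNext() || overflow.hasNext(); }
                public T next()
                {
                    if (first.hasNext())
                        return first.next();
                    if (overflow.hasNext())
                        return overflow.next();
                    throw new NoSuchElementException();
                }
            };
        }
    }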

patch by Dmitry Konstantinov; reviewed by TBD for CASSANDRA-21354

netudima added 2 commits May 7, 2026 16:38
…partition data read

Contributor

@frankgh frankgh left a comment


There is a potential data-resurrection issue with this patch. I have added a few comments; please take a look.

Comment thread src/java/org/apache/cassandra/db/ReadCommand.java
    {
        // all rows go directly into the overflow buffer
        testMultipleRows(0, 3);
    }
Contributor


We are missing a test that has an RT marker between the in-memory rows and the overflow rows. I don't think we handle this case correctly, and we have a potential data resurrection issue here. Can we add this test?

Suggested change
    }

    @Test
    public void inMemoryResponseWithRangeTombstoneBetweenInMemoryAndOverflow()
    {
        // RT at [1, 2] sits between the in-memory rows (0) and the overflow rows (3-4)
        testWithTombstones(1, 5, 1, 2);
    }
}

The fix I suggest is in org.apache.cassandra.db.ReadResponse.InMemoryDataResponse.LimitedUnfilteredRowIterator#computeNext, to ensure the RT open/close pairs are never split: we keep them either entirely in memory or entirely in the overflow:

            private boolean insideOpenMarker = false;

            @Override
            protected Unfiltered computeNext()
            {
                if (!wrapped.hasNext())
                    return endOfData();

                // Never stop while a range tombstone is open: its open/close
                // pair must stay entirely in memory or entirely in overflow.
                if (rowCount >= maxRows && !insideOpenMarker)
                {
                    hasOverflow = true;
                    return endOfData();
                }

                Unfiltered next = wrapped.next();
                if (next.isRangeTombstoneMarker())
                {
                    RangeTombstoneMarker marker = (RangeTombstoneMarker) next;
                    insideOpenMarker = marker.isOpen(isReverseOrder);
                }
                else
                    rowCount++; // only rows count toward the in-memory limit
                return next;
            }
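
Otherwise a range deletion whose open marker lands in the in-memory slice while its close marker lands in the overflow buffer could round-trip as an unterminated tombstone, and the rows it covers could resurrect.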

Contributor Author


Thank you for highlighting the issue!

Contributor Author


I've applied a very similar change and extended the test coverage for this logic.

Comment thread src/java/org/apache/cassandra/db/ReadResponse.java Outdated