[CASSANDRA-21354][trunk] Avoid serialization and deserialization for coordinator-local single partition data read #4789
Conversation
Currently, when we execute a local read we fetch data from SSTables and Memtables using a merging iterator and write it to a byte buffer. Later, when we build a CQL response, we deserialize the data back in order to iterate over it as part of the coordinator logic. So when we read data locally we allocate rows and cells twice: once during the read from SSTables/Memtables and once during deserialization by the coordinator logic (this is a typical scenario, because drivers usually send requests to replicas).

The idea of the optimization: if we do a single-partition read of a small number of rows, we can keep the data in memory and avoid the double allocation of row objects.

A system property limits the number of rows we keep in memory in this scenario (to avoid too much GC pressure caused by the extended lifetime of these objects and their promotion to the old generation). The property also allows the logic to be disabled in case of any issues.

We cannot know the number of rows in advance, so we read the first N rows into memory; if anything remains, we serialize the rest to a buffer and then concatenate the iterator over the in-memory rows with the deserialized one when we need to iterate over the result.

patch by Dmitry Konstantinov; reviewed by TBD for CASSANDRA-21354
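The read-N-then-overflow scheme described above can be sketched in isolation. The class and names below (`InMemorySplit`, `inMemory`, `overflow`) are hypothetical, not from the patch; a plain list stands in for the serialized overflow buffer:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Self-contained sketch of the read-N-then-overflow idea. InMemorySplit is a
// hypothetical name; in the real patch the overflow part is serialized to a
// byte buffer rather than kept in a second list.
public class InMemorySplit<T>
{
    final List<T> inMemory = new ArrayList<>(); // first maxRows elements, kept as live objects
    final List<T> overflow = new ArrayList<>(); // stand-in for the serialized remainder

    InMemorySplit(Iterator<T> source, int maxRows)
    {
        while (source.hasNext())
        {
            T next = source.next();
            if (inMemory.size() < maxRows)
                inMemory.add(next);
            else
                overflow.add(next); // the real code would serialize here
        }
    }

    // Concatenated view, analogous to joining the in-memory iterator with
    // the iterator deserialized from the overflow buffer.
    Iterator<T> iterator()
    {
        Iterator<T> first = inMemory.iterator();
        Iterator<T> second = overflow.iterator();
        return new Iterator<T>()
        {
            public boolean hasNext() { return first.hasNext() || second.hasNext(); }
            public T next() { return first.hasNext() ? first.next() : second.next(); }
        };
    }

    public static void main(String[] args)
    {
        InMemorySplit<Integer> split = new InMemorySplit<>(List.of(1, 2, 3, 4, 5).iterator(), 3);
        List<Integer> all = new ArrayList<>();
        split.iterator().forEachRemaining(all::add);
        // prints "[1, 2, 3] [4, 5] [1, 2, 3, 4, 5]"
        System.out.println(split.inMemory + " " + split.overflow + " " + all);
    }
}
```

The consumer only ever sees the concatenated iterator, so the split is transparent; the trade-off is that the in-memory part keeps row objects alive for the full duration of the response, which is why the row limit is configurable.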
frankgh
left a comment
There is a potential issue with data resurrection with this patch. I have added a few comments, please take a look.
{
    // all rows go directly into the overflow buffer
    testMultipleRows(0, 3);
}
We are missing a test that has an RT marker between the in-memory rows and the overflow rows. I don't think we are handling this case correctly; we have a potential data resurrection issue here. Can we add this test?
    }
}

@Test
public void inMemoryResponseWithRangeTombstoneBetweenInMemoryAndOverflow()
{
    // RT at [1, 2] sits between the in-memory rows (0) and the overflow rows (3-4)
    testWithTombstones(1, 5, 1, 2);
}
The fix I suggest is in org.apache.cassandra.db.ReadResponse.InMemoryDataResponse.LimitedUnfilteredRowIterator#computeNext, to ensure the RT open/close pairs are never split: we either keep them entirely in memory or entirely in the overflow:
private boolean insideOpenMarker = false;

@Override
protected Unfiltered computeNext()
{
    if (!wrapped.hasNext())
        return endOfData();

    // Defer the cut-off while a range tombstone is open, so that an
    // open/close marker pair is never split between the in-memory rows
    // and the overflow buffer.
    if (rowCount >= maxRows && !insideOpenMarker)
    {
        hasOverflow = true;
        return endOfData();
    }

    Unfiltered next = wrapped.next();
    if (next.isRangeTombstoneMarker())
    {
        RangeTombstoneMarker marker = (RangeTombstoneMarker) next;
        insideOpenMarker = marker.isOpen(isReverseOrder);
    }
    else
    {
        rowCount++;
    }
    return next;
}
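To see why deferring the cut-off prevents the split, the same rule can be exercised standalone. `MarkerAwareLimit` below is a hypothetical illustration, not patch code: the strings "open"/"close" stand in for range tombstone bound markers and everything else for rows (a real RangeTombstoneBoundaryMarker both closes and re-opens a range, which `marker.isOpen(isReverseOrder)` handles in the actual suggestion; the sketch only models plain bounds):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Standalone illustration of the deferred cut-off rule: the row limit check is
// skipped while insideOpenMarker is true, so an open/close marker pair always
// lands entirely in memory or entirely in the overflow.
public class MarkerAwareLimit
{
    static List<List<String>> split(Iterator<String> wrapped, int maxRows)
    {
        List<String> inMemory = new ArrayList<>();
        List<String> overflow = new ArrayList<>();
        int rowCount = 0;
        boolean insideOpenMarker = false;
        boolean overflowed = false;
        while (wrapped.hasNext())
        {
            if (!overflowed && rowCount >= maxRows && !insideOpenMarker)
                overflowed = true; // cut-off point: everything after this spills over
            String next = wrapped.next();
            if (next.equals("open"))
                insideOpenMarker = true;
            else if (next.equals("close"))
                insideOpenMarker = false;
            else
                rowCount++;
            (overflowed ? overflow : inMemory).add(next);
        }
        return List.of(inMemory, overflow);
    }

    public static void main(String[] args)
    {
        // With maxRows = 1 the cut-off lands before "open", never between the markers.
        // prints "[[r0], [open, r1, close, r2]]"
        System.out.println(split(List.of("r0", "open", "r1", "close", "r2").iterator(), 1));
    }
}
```

Without the insideOpenMarker guard, maxRows = 2 would cut between "r1" and "close", leaving an unclosed deletion range in memory and an unopened close marker in the overflow, which is the data resurrection scenario the review describes.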
Thank you for highlighting the issue!
I've applied a very similar change and extended the test coverage for this logic.