## Background
The sync process generally involves two steps:

1. Downloading new operations incrementally into local storage (`ps_oplog`).
2. Applying the downloaded operations to the local data tables (the `sync_local` step).
The big motivation for the split is consistency: The user sees no changes directly during step 1, instead getting an atomic update to the next (or first) checkpoint in the sync_local step.
The user experience in step 1 is generally good: data is downloaded incrementally, and we report progress on it. But when there is a lot of data, the `sync_local` step can take a long time to complete (minutes in some cases), blocks all writes, and gives no feedback on progress.
I'd like to address this in two ways:

1. Make the `sync_local` step faster.
2. Make the `sync_local` step incremental, so it can report progress and optionally avoid blocking writes for its full duration.
## Current implementation
Currently, we have three indexes on `ps_oplog`. We use the indexes in these ways:

- `ps_oplog_key`: de-duplicates operations on the same source row in the same bucket, to avoid storing the entire history.
- `ps_oplog_row`: groups operations by target row, allowing us to de-duplicate multiple copies of the same target row if it is synced via multiple buckets or via multiple source rows within the same bucket (the latter is an edge case where the synced id is not unique).
- `ps_oplog_opid`: used to find unapplied changes from `ps_oplog` during the `sync_local` step, allowing us to efficiently perform incremental updates.

Then we also have this table:
This is used to:
We combine those to compute all updated rows for sync_local:
And then query the latest data for each row (in the same query, using the above CTE):
This is slightly more complex with partial checkpoints, which I'm not covering in detail here.
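As a rough, runnable sketch of this two-part computation (the real schema and queries differ; the column set, the `last_applied_op` bookkeeping, and the final data lookup are simplified stand-ins):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Simplified stand-ins for the real tables (columns reduced for illustration).
CREATE TABLE ps_buckets (name TEXT PRIMARY KEY, last_applied_op INTEGER);
CREATE TABLE ps_oplog (bucket TEXT, op_id INTEGER, row_type TEXT, row_id TEXT, data TEXT);
CREATE INDEX ps_oplog_opid ON ps_oplog (bucket, op_id);
CREATE INDEX ps_oplog_row ON ps_oplog (row_type, row_id);
CREATE TABLE ps_updated_rows (row_type TEXT, row_id TEXT, PRIMARY KEY (row_type, row_id));
""")

# One bucket with two synced operations; both arrived after the last sync_local.
con.execute("INSERT INTO ps_buckets VALUES ('b1', 1)")
con.executemany("INSERT INTO ps_oplog VALUES (?, ?, ?, ?, ?)", [
    ("b1", 1, "todos", "t1", '{"description": "old"}'),
    ("b1", 2, "todos", "t1", '{"description": "new"}'),
])

# Shape of the current sync_local query: compute updated rows from unapplied
# oplog entries (via the ps_oplog_opid index) unioned with ps_updated_rows,
# then pick the latest data per row, grouping on ps_oplog.
rows = con.execute("""
WITH updated_rows AS (
  SELECT DISTINCT o.row_type, o.row_id
    FROM ps_buckets AS buckets
    JOIN ps_oplog AS o
      ON o.bucket = buckets.name AND o.op_id > buckets.last_applied_op
  UNION
  SELECT row_type, row_id FROM ps_updated_rows
)
SELECT u.row_type, u.row_id,
       (SELECT data FROM ps_oplog o
         WHERE o.row_type = u.row_type AND o.row_id = u.row_id
         ORDER BY o.op_id DESC LIMIT 1) AS data
  FROM updated_rows u
""").fetchall()
print(rows)  # -> [('todos', 't1', '{"description": "new"}')]
```

Note how the first branch of the CTE has to scan `ps_oplog_opid` and then look up each hit in `ps_oplog` — the extra work that part 1 below removes.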
## Proposal

### Part 1
Remove the `ps_oplog_opid` index. Instead, when we sync a new operation, also insert it into `ps_updated_rows`. This simplifies the query for `updated_rows` to a simple `SELECT row_type, row_id FROM ps_updated_rows`.

The direct advantage is that this should be slightly faster to query: we can iterate through the rows from `ps_updated_rows` directly, whereas SQLite currently needs to (1) iterate through the `ps_oplog_opid` index, then (2) look up those rows in `ps_oplog` to get the `row_type` and `row_id`.

It is not strictly required to remove the `ps_oplog_opid` index here, but doing so offsets the write overhead from writing more rows to `ps_updated_rows`. Removing it does have implications for partial checkpoints - see below.

We'd also need to change the implementation of `powersync_trigger_resync()` - the current implementation relies on just setting `last_applied_op = 0` for all buckets.
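A minimal sketch of this change, again using a simplified stand-in schema (illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Illustrative, simplified schema; note there is no ps_oplog_opid index.
CREATE TABLE ps_oplog (bucket TEXT, op_id INTEGER, row_type TEXT, row_id TEXT, data TEXT);
CREATE TABLE ps_updated_rows (row_type TEXT, row_id TEXT, PRIMARY KEY (row_type, row_id));
""")

def insert_operation(bucket, op_id, row_type, row_id, data):
    """On sync, write the operation and record the target row as updated."""
    con.execute("INSERT INTO ps_oplog VALUES (?, ?, ?, ?, ?)",
                (bucket, op_id, row_type, row_id, data))
    con.execute("INSERT OR IGNORE INTO ps_updated_rows VALUES (?, ?)",
                (row_type, row_id))

insert_operation("b1", 1, "todos", "t1", '{"description": "old"}')
insert_operation("b1", 2, "todos", "t1", '{"description": "new"}')

# updated_rows is now a plain table scan instead of an index scan plus lookups:
updated = con.execute("SELECT row_type, row_id FROM ps_updated_rows").fetchall()
print(updated)  # -> [('todos', 't1')]
```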
#### Partial checkpoints

Partial checkpoints aren't covered directly by the above. These are more tricky, since they need to separately keep track of the priorities of changes, to filter incremental updates by specific priorities. I believe we can't even store the priority on `ps_updated_rows` directly, since the priority for an entire bucket can change at any point.

One option is to just keep the `ps_oplog_opid` index and use the current approach for partial checkpoints, but that's not ideal.

It could work to instead store the relevant bucket(s) on `ps_updated_rows`, which may require a re-design of that table. For example:

Initially, I had the primary key on `(row_type, row_id, bucket)`, with the idea that we can efficiently group on `(row_type, row_id)` and just filter on `bucket`, accepting some overhead for partial checkpoints to do that filter. But from actual tests, it appears unnecessary to do that grouping here: we can read all rows from `ps_updated_rows`, then do the grouping on `ps_oplog`, where we need to do it anyway.

### Part 2 (probably not)
Update: This appears unnecessary given the gains from part 1 above.
A previous idea I had: for bulk sync, we can make sync faster by not computing `updated_rows` at all, but instead re-syncing the entire `ps_oplog` table, grouped by `(row_type, row_id)` using the `ps_oplog_row` index.

The tricky part is knowing when this is faster than the incremental version. We could keep track of the total `ps_oplog` count versus the number of rows in `ps_updated_rows`, and switch over when, for example, `count(ps_oplog) * 0.5 > count(ps_updated_rows)`. Note that counting the tables directly can be expensive by itself, so we'd need to persist separate counters for this.

TODO: Test whether this is actually faster once we have implemented part 1, since that already removes the extra scan through `ps_oplog_opid` and the lookups in `ps_oplog`. This may also prevent optimizations from part 3.

### Part 3: Incremental/chunked sync_local
Once we only use `ps_updated_rows` to keep track of rows that need to be copied, we can use it to do the process in separate chunks:

1. Copy the latest data for a chunk of rows from `ps_updated_rows`.
2. Delete those processed rows from `ps_updated_rows`.

We can still keep the atomic nature of `sync_local` by wrapping all of this in a single transaction. But the client can control that, which means the client can also report progress (we may need to track progress counters if we want to report an actual percentage).
In theory a client could even opt to not wrap that in a single transaction, to avoid the blocking behavior, at the cost of losing consistency properties - see details below.
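A rough sketch of the chunked flow, assuming a single `todos` data table and committing per chunk rather than in one outer transaction (both of these would be client policy, not fixed behavior):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.isolation_level = None  # autocommit; we manage transactions explicitly
con.executescript("""
CREATE TABLE ps_oplog (bucket TEXT, op_id INTEGER, row_type TEXT, row_id TEXT, data TEXT);
CREATE TABLE ps_updated_rows (row_type TEXT, row_id TEXT, PRIMARY KEY (row_type, row_id));
CREATE TABLE todos (id TEXT PRIMARY KEY, data TEXT);
""")
for i in range(10):
    con.execute("INSERT INTO ps_oplog VALUES ('b1', ?, 'todos', ?, ?)",
                (i + 1, f"t{i}", f'{{"n": {i}}}'))
    con.execute("INSERT OR IGNORE INTO ps_updated_rows VALUES ('todos', ?)", (f"t{i}",))

CHUNK_SIZE = 4
total = con.execute("SELECT count(*) FROM ps_updated_rows").fetchone()[0]
done = 0
while True:
    con.execute("BEGIN")  # one transaction per chunk: writes are only blocked briefly
    chunk = con.execute(
        "SELECT row_type, row_id FROM ps_updated_rows LIMIT ?", (CHUNK_SIZE,)
    ).fetchall()
    if not chunk:
        con.execute("COMMIT")
        break
    for row_type, row_id in chunk:
        # Copy the latest synced data for this row into the data table.
        latest = con.execute(
            "SELECT data FROM ps_oplog WHERE row_type = ? AND row_id = ? "
            "ORDER BY op_id DESC LIMIT 1", (row_type, row_id)).fetchone()
        con.execute("INSERT OR REPLACE INTO todos VALUES (?, ?)",
                    (row_id, latest[0] if latest else None))
        con.execute("DELETE FROM ps_updated_rows WHERE row_type = ? AND row_id = ?",
                    (row_type, row_id))
    done += len(chunk)
    con.execute("COMMIT")
    print(f"sync_local progress: {done}/{total}")

remaining = con.execute("SELECT count(*) FROM ps_updated_rows").fetchone()[0]
synced = con.execute("SELECT count(*) FROM todos").fetchone()[0]
```

Wrapping the whole loop in one outer transaction restores the current atomic behavior; committing per chunk trades consistency for responsiveness, as discussed below.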
## Optional consistency
For standard full checkpoints, our consistency behavior is always well-defined: all data tables atomically switch to a checkpoint, and only once all local changes have been uploaded and acknowledged via a write checkpoint.
When using bucket priorities, we relax some of those properties, to get more responsiveness. Currently:
The changes above can make some of this behavior configurable:
### Deletes in partial checkpoints
If we're tracking specific buckets in `ps_updated_rows`, we can optionally sync deletes in partial checkpoints:

Update after discussion with @simolus3: Arguably, the case of moving data between different-priority buckets is much more of an edge case than needing consistency within a bucket / sync stream, so we should consider whether deletes in partial checkpoints should become the new default. It could be considered a breaking change, though.
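A sketch of what that could look like, assuming the redesigned `ps_updated_rows` with a `bucket` column and a hypothetical `priority` column on `ps_buckets` (all names illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Illustrative schema: buckets carry a priority, and updated rows track
-- which bucket they were changed in.
CREATE TABLE ps_buckets (name TEXT PRIMARY KEY, priority INTEGER);
CREATE TABLE ps_oplog (bucket TEXT, op_id INTEGER, row_type TEXT, row_id TEXT, data TEXT);
CREATE TABLE ps_updated_rows (row_type TEXT, row_id TEXT, bucket TEXT,
                              PRIMARY KEY (row_type, row_id, bucket));
""")
con.executemany("INSERT INTO ps_buckets VALUES (?, ?)", [("prio0", 0), ("prio3", 3)])
# t1 was removed from the prio0 bucket (no oplog data left): a pending delete.
# t2 still has data, but only in the prio3 bucket.
con.executemany("INSERT INTO ps_updated_rows VALUES (?, ?, ?)",
                [("todos", "t1", "prio0"), ("todos", "t2", "prio3")])
con.execute("INSERT INTO ps_oplog VALUES ('prio3', 1, 'todos', 't2', '{}')")

def partial_updated_rows(priority):
    # Rows changed in buckets at or above this priority (lower number = higher).
    return con.execute("""
        SELECT DISTINCT u.row_type, u.row_id FROM ps_updated_rows u
          JOIN ps_buckets b ON b.name = u.bucket
         WHERE b.priority <= ?""", (priority,)).fetchall()

# A priority-0 partial checkpoint now sees t1, whose data is gone from
# ps_oplog -- i.e. a delete we could apply without waiting for the full
# checkpoint.
print(partial_updated_rows(0))  # -> [('todos', 't1')]
```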
### Avoid overwriting local changes
With priority 0, we can avoid overwriting local changes: skip updating rows as long as there is a local entry in `ps_updated_rows` with `bucket = 0` (a local write).

That would mean that if a row was updated locally, any synced updates to it would be blocked until the changes are uploaded and the write checkpoint is synced back. Effectively, it disables the priority 0 behavior for those specific rows.
It would still be in the realm of "eventual consistency" properties for priority 0, but it would avoid the "flicker" currently seen.
Implementing this change is likely to improve apparent consistency in all affected cases, though perhaps there is an edge case where the current behavior is desirable?
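A sketch of that filter, again assuming the redesigned `ps_updated_rows` with a `bucket` column and `bucket = 0` as the marker for local writes (illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Redesigned table from part 1, with the bucket tracked per updated row.
-- bucket = 0 is used here as the marker for a local (not yet uploaded) write.
CREATE TABLE ps_updated_rows (row_type TEXT, row_id TEXT, bucket INTEGER,
                              PRIMARY KEY (row_type, row_id, bucket));
""")
con.executemany("INSERT INTO ps_updated_rows VALUES (?, ?, ?)", [
    ("todos", "t1", 1),  # synced change from bucket 1
    ("todos", "t2", 1),  # synced change from bucket 1
    ("todos", "t2", 0),  # t2 also has a pending local write
])

# Only apply synced updates for rows without a pending local write:
applyable = con.execute("""
SELECT DISTINCT row_type, row_id FROM ps_updated_rows AS u
 WHERE bucket != 0
   AND NOT EXISTS (SELECT 1 FROM ps_updated_rows AS l
                    WHERE l.row_type = u.row_type AND l.row_id = u.row_id
                      AND l.bucket = 0)
""").fetchall()
print(applyable)  # -> [('todos', 't1')]
```

Here `t2` stays untouched until its local write is uploaded and the write checkpoint comes back, which is exactly the "no flicker" behavior described above.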
### Incremental sync_local
Applying the entire `sync_local` step in one transaction is important if we want to maintain the current consistency properties. But we could add an option to relax those properties, avoiding blocking all local writes for a long time and getting better responsiveness:

For each individual row, data would be atomically updated from one checkpoint to the next. But the overall local data would not be consistent: different rows would be updated at different times.
We could make this configurable by priority level:
I'm not sure whether that will actually help though: Full checkpoints would still be blocking in that case, so you're only removing that blocking behavior for a small number of cases.
## Comparing with JourneyApps Platform
The PowerSync sync system was designed as an evolution of the JourneyApps Platform sync system, but with much stronger consistency properties.
The JourneyApps Platform sync system effectively has "eventual consistency" only. More specifically:
It is effectively similar to PowerSync with all buckets as priority 0, coupled with "Deletes in partial checkpoints", "Avoid overwriting local changes" and "Incremental sync_local" as described above.
Despite the big reduction in consistency properties, we still have apps syncing hundreds of thousands of rows per client, with no reports of issues caused by sync inconsistencies in practice.
All of that is to say: there could be a valid case for relaxing the consistency properties as an opt-in option.