New ScyllaDB key space partitioning #4049
Base: `06-02-optimize_scylladb_s_batch_writes`
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. (This stack of pull requests is managed by Graphite.)
@@ -243,6 +243,8 @@ impl TestContextFactory for ScyllaDbContextFactory {
    let config = ScyllaDbStore::new_test_config().await?;
    let namespace = generate_test_namespace();
    let store = ScyllaDbStore::recreate_and_connect(&config, &namespace).await?;
    // TODO(#4065): Remove this once we enforce exclusive mode for views.
I don't think we're going to remove the line you added.
Ah, you’re right, I guess the tests lacking these calls are an orthogonal issue. I’ll remove the comment tomorrow
back to your queue as we iterate on the design
Motivation

ScyllaDB hashes each partition key into a token. Each token range is assigned to a shard, and each shard is pinned to a specific CPU by default. This means that having very few partitions is bad for ScyllaDB's performance.

Right now we have just one partition key: the `root_key`. For views, which are our mutable data in the DB, `root_key` will be set and we'll be in "exclusive mode". For everything else (`Certificate`s, `ConfirmedBlock`s, `Blob`s, etc.), everything lives on the same partition. For example, when we're doing 1M TPS, several thousand blocks, as well as certificates, will be created every second. The current scheme won't scale to those numbers.
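To illustrate why a single partition key pins all of this load onto one shard, here is a small simulation. It is not ScyllaDB's actual token mechanism (which uses Murmur3 tokens and vnode ownership); the hash function and shard count are stand-ins.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Toy model: hash the partition key into a "token" and map it to one of
/// `num_shards` shards. Each shard is pinned to a CPU, so a shard is the unit
/// of parallelism.
fn shard_for(partition_key: &[u8], num_shards: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    partition_key.hash(&mut hasher);
    hasher.finish() % num_shards
}

fn main() {
    let num_shards = 8;

    // One shared partition key: every row lands on the same shard/CPU.
    let single: HashSet<u64> = (0..1000)
        .map(|_| shard_for(b"the_only_partition", num_shards))
        .collect();
    assert_eq!(single.len(), 1);

    // Many partition keys: the same rows spread across multiple shards.
    let spread: HashSet<u64> = (0..1000u32)
        .map(|i| shard_for(&i.to_be_bytes(), num_shards))
        .collect();
    assert!(spread.len() > 1);
    println!("single: {} shard(s), spread: {} shard(s)", single.len(), spread.len());
}
```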
Proposal

New schema proposal: instead of a `root_key`, we'll have a `partition_key`. This `partition_key` will have two modes: exclusive mode (mutable data) and non-exclusive mode (immutable data). The former will work exactly how the previous `root_key` schema worked.

For exclusive mode, the `partition_key`'s first byte will be `0`, indicating the mode. The rest of the bytes will be the `root_key` that was provided when calling `open_exclusive`. Everything else should work as it previously did.

For non-exclusive mode, the `partition_key`'s first byte will be `1`. The rest of the partition key will be a prefix of the key, of a predetermined length (set through a configuration parameter). This mode will have some reservations:

- … (`partition_key`'s prefix size)
- In `read_multi_values_internal` and `contains_keys_internal` we currently group the keys by `partition_key` and execute one query per `partition_key`, in parallel. This is done to keep the queries token-aware across partitions.
- In `find_keys_by_prefix_internal` and `find_key_values_by_prefix_internal`, if the prefix is smaller than the `partition_key`'s prefix size, we'll do a full table scan. This happens infrequently enough that we're willing to take the performance hit.
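A minimal sketch of the two encodings and the per-partition grouping described above. The function names (`exclusive_partition_key`, `shared_partition_key`, `group_by_partition`) and the prefix length are hypothetical illustrations, not the PR's actual API:

```rust
use std::collections::HashMap;

// Hypothetical tag bytes for the two modes described in the proposal.
const EXCLUSIVE_TAG: u8 = 0; // mutable data: one partition per root_key
const SHARED_TAG: u8 = 1;    // immutable data: partitioned by key prefix

/// Exclusive mode: tag byte 0 followed by the root_key given to open_exclusive.
fn exclusive_partition_key(root_key: &[u8]) -> Vec<u8> {
    let mut pk = Vec::with_capacity(1 + root_key.len());
    pk.push(EXCLUSIVE_TAG);
    pk.extend_from_slice(root_key);
    pk
}

/// Non-exclusive mode: tag byte 1 followed by a fixed-length prefix of the key.
/// The prefix length would come from a configuration parameter.
fn shared_partition_key(key: &[u8], prefix_len: usize) -> Vec<u8> {
    assert!(key.len() >= prefix_len, "key shorter than the partition prefix");
    let mut pk = Vec::with_capacity(1 + prefix_len);
    pk.push(SHARED_TAG);
    pk.extend_from_slice(&key[..prefix_len]);
    pk
}

/// Grouping step like the one multi-key reads use: bucket keys by partition
/// key so one token-aware query can be issued per partition, in parallel.
fn group_by_partition(keys: &[Vec<u8>], prefix_len: usize) -> HashMap<Vec<u8>, Vec<Vec<u8>>> {
    let mut groups: HashMap<Vec<u8>, Vec<Vec<u8>>> = HashMap::new();
    for key in keys {
        groups
            .entry(shared_partition_key(key, prefix_len))
            .or_default()
            .push(key.clone());
    }
    groups
}

fn main() {
    // Two keys sharing a 2-byte prefix land in the same partition; a third does not.
    let keys = vec![vec![1, 2, 3], vec![1, 2, 4], vec![9, 9, 9]];
    let groups = group_by_partition(&keys, 2);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[&vec![SHARED_TAG, 1, 2]].len(), 2);
    assert_eq!(exclusive_partition_key(b"view")[0], EXCLUSIVE_TAG);
    println!("partitions: {}", groups.len());
}
```

With this layout, exclusive-mode and non-exclusive-mode keys can never collide, since their first bytes differ.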
Test Plan

CI, plus benchmarking to check performance. Some tests had to be altered because they didn't respect the invariant that if we're using a `root_key`, we should be in exclusive mode.

Follow ups
There are several follow ups here:

- Move the `Certificate` and `ConfirmedBlock` `BaseKey`s closer together, as well as `Blob` and `BlobState`.
- Use `find_keys_by_prefix_internal` for prefix deletes, and take the perf hit of the full table scan, as these seem to be currently infrequent.
- Enforce for `View`s (maybe on `load`) that they must always be in exclusive mode.
- `WritableKeyValueStore` should have `write_value` and `write_multi_values` methods, analogous to `ReadableKeyValueStore`. Batches should be used when atomicity is wanted, or when we want to save network requests to the DB.

Release Plan