Skip to content

(2.12) Read-after-write & monotonic reads #6970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 31, 2025

Conversation

MauriceVanVeen
Copy link
Member

@MauriceVanVeen MauriceVanVeen commented Jun 13, 2025

See ADR: JetStream Read-after-Write for context, problem statement, and design.

Resolves #6557

Signed-off-by: Maurice van Veen [email protected]

@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner June 13, 2025 10:27
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/direct-get-consistent-read branch 3 times, most recently from 7286c89 to 270af06 Compare June 15, 2025 07:20
@derekcollison
Copy link
Member

Hard problem for sure..

I like the opt-in, but why not just make it such that the leader is the only one that listens on a non-DQ subscriber that direct gets, with the read after write opt-in, utilizes?

The redirect seems interesting, but not sure its practical. For one, we could just wait til we feel we are caught up from a replica standpoint (normally this will be faster than a redirect IMO).

@MauriceVanVeen
Copy link
Member Author

I like the opt-in, but why not just make it such that the leader is the only one that listens on a non-DQ subscriber that direct gets, with the read after write opt-in, utilizes?

Multiple reasons for this, one is that KV's design requires direct get to be used. Currently you can do a workaround where you disable this, but this is not intended (and doesn't guarantee fresh data either). @ripienaar might have more info about the design aspect as well.

Just the leader answering is the MsgGet we already have (although it's a different format under the JS API). But, then we haven't solved guaranteeing that an old leader doesn't send stale data. Which can happen during network partitions, or even during a quick publish and message get while leaders are changing.

For one, we could just wait til we feel we are caught up from a replica standpoint

That may be true, but hard to know how long to wait. And if we wait at all, wouldn't redirecting to the leader not just be faster/simpler?
Especially for cases where a follower is behind enough and is churning through backlog of catchup, or generally slow on applies.
Because mirrors can also answer DirectGet if enabled, we also need some way to have the guarantee there. So this proposal solves for both cases.

@derekcollison
Copy link
Member

Adding a new opt in such that direct gets can be only to the leader with new API feels correct.

I understand what you are saying about a new leader who is behind, but I am not sure the redirect to the leader would help that either IMO. WDYT?

@MauriceVanVeen
Copy link
Member Author

MauriceVanVeen commented Jun 16, 2025

The idea is specifically to still have replicas be allowed to respond. That's (likely) why DirectGet also is a requirement for our KV, and disabling DirectGet is considered to be a work around.

Just to clarify, it is an old/outdated leader that can still respond as well as the new leader. So even if we'd only allow "the leader" to respond we can't make any guarantees still. Because an old leader would still be "allowed" to respond and send stale data back to the client.

And because we don't want reads to become heavy and go through Raft, the Nats-Min-Last-Sequence header was introduced. Exposing it in read responses and allowing to require the minimum on subsequent read requests is what makes us able to guarantee read-after-write and monotonic reads.

This PR changes two read paths, namely the MsgGet and DirectGet paths:

  • With MsgGet, if a leader is old/outdated, it knows to delay the error response to let the real leader respond to the client faster. If the error does reach the client, it can simply retry.
  • With DirectGet this error logic is also there, but there's no delay. Not sure if we support delaying non-API error responses?

For the user it would be simpler to only use this header and have consistent reads, than to also need to set a stream setting that's not considered to be KV-compatible. (Although arguably, having the user flip a bool in a read request for it to go through Raft would be easiest for them, but I think that's a tradeoff we don't want to make?)

Note that reads only get redirected to the leader if the follower or mirror knows it doesn't have the data. If it does have the data, it can simply respond knowing its answer is correct. So DirectGet is still effective most of the time, and helps spread the load.

@derekcollison
Copy link
Member

The idea is specifically to still have replicas be allowed to respond. That's (likely) why DirectGet also is a requirement for our KV, and disabling DirectGet is considered to be a work around.

DirectGet has two parts, it does not encode the payload and we allow DQ semantics. The DQ semantics are for horizontal scalability with lots of concurrent GETS.

Just to clarify, it is an old/outdated leader that can still respond as well as the new leader. So even if we'd only allow "the leader" to respond we can't make any guarantees still. Because an old leader would still be "allowed" to respond and send stale data back to the client.

Meaning the new leader has not caught up its state? I am not following, if a new leader is elected that does not have all the data that the previous one did, we will sync followers to our new state no?

And because we don't want reads to become heavy and go through Raft, the Nats-Min-Last-Sequence header was introduced. Exposing it in read responses and allowing to require the minimum on subsequent read requests is what makes us able to guarantee read-after-write and monotonic reads.

I like the concept for sure, but if I had a single app (needed in your solution), I would solve read after write a much different way that did not depend on the external system (The NATS servers).

This PR changes two read paths, namely the MsgGet and DirectGet paths:

  • With MsgGet, if a leader is old/outdated, it knows to delay the error response to let the real leader respond to the client faster. If the error does reach the client, it can simply retry.
  • With DirectGet this error logic is also there, but there's no delay. Not sure if we support delaying non-API error responses?

We could if we believe there is a good reason to do so.

For the user it would be simpler to only use this header and have consistent reads, than to also need to set a stream setting that's not considered to be KV-compatible. (Although arguably, having the user flip a bool in a read request for it to go through Raft would be easiest for them, but I think that's a tradeoff we don't want to make?)

My suggestion does not turn off direct get, it simply changes it to have only one subscriber, the leader. So 100% KV compatible, what you lose is the horizontal scalability of replicas being able to answer also.

Note that reads only get redirected to the leader if the follower or mirror knows it doesn't have the data. If it does have the data, it can simply respond knowing its answer is correct. So DirectGet is still effective most of the time, and helps spread the load.

Like I said, I like the idea with the header of the last sequence number. Would need to see it fleshed out in API form of course.

I do think in most misses we should wait for a short period of time and recheck last, and if we are still behind we can forward to the leader.

@MauriceVanVeen
Copy link
Member Author

if a new leader is elected that does not have all the data that the previous one did, we will sync followers to our new state no?

This is not possible (because it would mean data loss). For someone to become leader they need to have the most up-to-date Raft log. This does not mean that it has applied all entries yet though. During a leader change you'll see the old leader have an up-to-date log and have applied most of those entries. The new leader will also have an up-to-date log, but it will not be up-to-date on applies, so it first needs to wait with responding to read/write until it has quorum on ALL uncommitted entries in its log. That ensures the new leader is in a consistent state before allowing read/writes.

Meaning the new leader has not caught up its state?

The new leader always has sufficient data in its log, but it first needs to get quorum and apply them. Because the previous leader did not inform them yet that those entries could be applied.

One could now write new data to the new leader and get a PubAck, but the old leader can still respond! (At least until it's informed of the leader change, which can take longer depending on RTT and partitions)
This violates read-after-write if the state on the old leader was stale. If you're only reading and not writing, this would also violate monotonic reads.

My suggestion does not turn off direct get, it simply changes it to have only one subscriber, the leader.

Above is also why just having "the leader" respond does not work. Because multiple servers can think they are leader. (Even if for a short period, but the property can be violated within that period)

I like the concept for sure, but if I had a single app (needed in your solution), I would solve read after write a much different way that did not depend on the external system (The NATS servers).

Not sure how you can achieve that without involving the servers that store the state?
The server that's responding to a read needs to know it is allowed to respond with some data. That can be either because it successfully went through Raft, so it's ordered and a valid response at that time in the log. Or with such a header so it can check inline by itself without needing Raft and without needing to be leader per se.
Not sure if there are alternatives?

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/direct-get-consistent-read branch from 270af06 to 300c5c5 Compare June 25, 2025 13:18
@MauriceVanVeen
Copy link
Member Author

MauriceVanVeen commented Jun 25, 2025

I like the idea with the header of the last sequence number. Would need to see it fleshed out in API form of course.

Have implemented this in the Go client almost fully. Halfway through I realised.. consumers also don't guarantee read-after-write!
This means you could do kv.Create and then a kv.Keys. You'd expect the just-created key is reflected in the returned keys list, but currently there are no guarantees.

The PR is updated now to:

  • Not do any direct/msg get read request redirecting. That's fully removed. Either you are okay with getting an error if you try to read on a follower/mirror that's just slightly behind to satisfy read-after-write, or you disable MirrorDirect or AllowDirect so only the leader can answer.
  • The direct/msg get JSON request bodies now contain MinLastSeq uint64 'json:"min_last_seq,omitempty"' as an option, this turned out to be easier to implement in clients than passed as a Nats-Min-Last-Sequence header.
  • Consumer create requests can now also specify MinLastSeq. The consumer will then wait before delivering any messages, until the stream has stored at least up to that last sequence. That satisfies the kv.Create-kv.Keys case as well.
  • The additional exposing of Nats-Min-Last-Sequence on the read request was redundant, because with message gets you already get the sequence, similarly when consuming messages. So those sequences can just be used, no need for an additional field.

In terms of API, it looks like this:

// Write
r, err := kv.Put(ctx, "key", []byte("value"))

// Read request
kve, err := kv.Get(ctx, "key", jetstream.MinLastRevision(r))

// Watch/consumer
kl, err := kv.ListKeys(ctx, jetstream.MinLastRevision(r))

By specifying the MinLastRevision (or MinLastSequence when using a stream normally), you can be sure your read request will be rejected by a follower if it can't be satisfied, or the follower will wait to deliver you messages from the consumer until it's up-to-date.

This satisfies read-after-write and monotonic reads when combining the write and read paths.

@MauriceVanVeen
Copy link
Member Author

Here's the Go client implementation in a branch.
It has full support for both the old and new JetStream API's, as well as for KV. Have not yet touched the ObjectStore API, but that could get similar support to support read-after-write, etc.

@MauriceVanVeen MauriceVanVeen changed the title (2.12) Direct Get read-after-write & monotonic reads (2.12) Read-after-write & monotonic reads Jun 25, 2025
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/direct-get-consistent-read branch from 300c5c5 to 663e294 Compare July 3, 2025 10:53
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/direct-get-consistent-read branch from 663e294 to da76ee2 Compare July 11, 2025 09:08
@MauriceVanVeen MauriceVanVeen marked this pull request as draft July 11, 2025 09:19
Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM but some questions.

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/direct-get-consistent-read branch 3 times, most recently from 74230a8 to 4a5aa79 Compare July 18, 2025 14:25
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/direct-get-consistent-read branch from 4a5aa79 to 2d11e56 Compare July 29, 2025 12:10
@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review July 29, 2025 12:38
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/direct-get-consistent-read branch from 05bc7d7 to 4f61a7d Compare July 30, 2025 14:53
Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/direct-get-consistent-read branch from 4f61a7d to dace607 Compare July 31, 2025 15:00
@neilalexander neilalexander merged commit 46dec81 into main Jul 31, 2025
90 of 92 checks passed
@neilalexander neilalexander deleted the maurice/direct-get-consistent-read branch July 31, 2025 16:05
neilalexander added a commit that referenced this pull request Aug 13, 2025
Remove read-after-write/monotonic reads for 2.12.

> Although the design (read-after-write/monotonic reads/linearizable)
gives a lot of control and flexibility, we'll need to decide whether
this adds a bunch of complexity which would be (too) hard to understand
by an end-user, in which case we might need to revisit this.
#7146 (comment)

The current design is very flexible, but adds a ton of complexity that
needs to be handled on a per-request basis. We've discussed whether this
should instead be on a per-stream basis. Where on a replicated
stream/KV/ObjectStore the reads are serializable by default, but one can
opt-in to linearizable for that whole stream. That would mean client
code doesn't need to change at all to get this guarantee, only the
stream setting would need to be updated.

We'll need to discuss this further, and likely introduce this early in
the next 2.14 (skipping 2.13) release cycle.

Relates to #6557
Relates to PRs: #6970,
#7146

Signed-off-by: Maurice van Veen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KV Read-after-Write Consistency
4 participants