(2.12) Atomic batch: support large batches #7067


Open
MauriceVanVeen wants to merge 5 commits into main from maurice/batch-large

Conversation

MauriceVanVeen
Member

This PR adds support for batches that are larger than a single append entry. Batches are first accumulated before replicating; once all expected checks pass, the batch is proposed (this was already the case). New in this PR: when applying the batch, the entries are staged in memory until we see the commit message. If the commit message is not seen, or gaps are detected, the batch is rejected to prevent partially applying it.

New batchMsgOp and batchCommitMsgOp ops were introduced to "wrap" streamMsgOp and compressedStreamMsgOp so they also carry the batchId and batchSeq, without needing to decompress/decode the message to get at the raw headers when doing consistency checks prior to commit.

Applying a streamMsgOp/compressedStreamMsgOp is extracted into a new applyStreamMsgOp function. This supports doing consistency checks prior to commit when batching, and then applying the whole batch in one go.
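As an illustration of the wrapping idea, here is a minimal, self-contained sketch of how such a wrapper op could prepend the batch metadata to an opaque (possibly compressed) inner message. The byte layout and names below are assumptions for illustration only, not the server's actual encoding.

```go
// Illustrative only: one possible layout for a wrapper op that carries the
// batch id and sequence in front of an opaque (possibly compressed) inner
// message, so consistency checks can read the batch metadata without
// decoding the message itself. Not the server's actual wire format.
package opsketch

import (
	"bytes"
	"encoding/binary"
	"errors"
)

// encodeBatchMsgOp writes: uint16 id length, id bytes, uint64 batch sequence,
// then the untouched inner op (e.g. a compressed stream message op).
func encodeBatchMsgOp(batchID string, batchSeq uint64, inner []byte) []byte {
	var buf bytes.Buffer
	var scratch [8]byte
	binary.LittleEndian.PutUint16(scratch[:2], uint16(len(batchID)))
	buf.Write(scratch[:2])
	buf.WriteString(batchID)
	binary.LittleEndian.PutUint64(scratch[:8], batchSeq)
	buf.Write(scratch[:8])
	buf.Write(inner)
	return buf.Bytes()
}

// decodeBatchMsgOp reads the metadata back and returns the inner op as-is.
func decodeBatchMsgOp(b []byte) (batchID string, batchSeq uint64, inner []byte, err error) {
	if len(b) < 2 {
		return "", 0, nil, errors.New("short buffer")
	}
	idLen := int(binary.LittleEndian.Uint16(b[:2]))
	b = b[2:]
	if len(b) < idLen+8 {
		return "", 0, nil, errors.New("short buffer")
	}
	batchID, b = string(b[:idLen]), b[idLen:]
	batchSeq = binary.LittleEndian.Uint64(b[:8])
	return batchID, batchSeq, b[8:], nil
}
```

The point is only that the batch metadata sits in front of the wrapped op, so batchId/batchSeq can be compared cheaply across entries before anything is decompressed.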

Resolves #6978

Signed-off-by: Maurice van Veen [email protected]

@MauriceVanVeen
Member Author

The PR is ready for review, but I've put it in draft since I need to fix one bug: if you constantly use batching and the batches are spread over multiple append entries, the applied index would not move up, which would prevent making snapshots.

Instead of fully blocking n.Applied while there's an active batch across append entries, we need to track the ce.Index prior to the batch starting so we can gradually move n.Applied up.
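A rough sketch of what that could look like (the applyState type and its fields are invented here for illustration; n.Applied and ce.Index are the server fields referenced above):

```go
// Illustrative only: the applied index can still advance up to just before an
// in-flight batch instead of being blocked entirely while the batch spans
// multiple append entries. Field names are assumptions, not the actual raft
// state in the server.
package appliedsketch

type applyState struct {
	applied    uint64 // last index safely reported as applied
	batchStart uint64 // index of the first entry of the active batch
	inBatch    bool
}

// advance is called for each committed entry at index idx.
func (s *applyState) advance(idx uint64, batchActive bool) uint64 {
	if !batchActive {
		// No batch in flight (or it just committed): everything is applied.
		s.inBatch = false
		s.applied = idx
		return s.applied
	}
	if !s.inBatch {
		s.inBatch, s.batchStart = true, idx
	}
	// Only report progress up to just before the batch started; the batch's
	// own entries are reported once its commit entry has been applied.
	if s.batchStart > 0 && s.batchStart-1 > s.applied {
		s.applied = s.batchStart - 1
	}
	return s.applied
}
```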

@derekcollison
Member

Have not looked at this specific PR, but if we stage and wait to send the batch to replicas this will be a large seesaw in throughput as the app waits a longer period of time for the ack of the commit since we have to replicate all messages after the commit. WDYT?

@MauriceVanVeen
Member Author

> Have not looked at this specific PR, but if we stage and wait to send the batch to replicas this will be a large seesaw in throughput as the app waits a longer period of time for the ack of the commit since we have to replicate all messages after the commit. WDYT?

I'm thinking there's a slight misunderstanding here around what happens where. But I could also be misinterpreting what you're asking. Let me explain the current flow.

This is how it works already on main:

  • Many publishers can publish batches to the stream leader.
  • Once a publisher says "commit", the stream leader does all consistency checks and only afterward proposes to all followers.

Below is added in this PR, and happens after the above:

  • On the follower(/leader) apply path, we do a consistency check to confirm the leader didn't change halfway when proposing the batch.
  • All followers do this complete batch/consistency check, and then apply as normal. This can happen safely without needing to consult the leader.

So, we need to do consistency checks prior to replicating, AND the follower needs to do a consistency check that the batch wasn't abandoned mid-proposal.
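To make the follower-side check concrete, here is a rough sketch under assumed names (batchEntry, applier, and so on are not the PR's types):

```go
// Illustrative only: the follower-side staging described above. Entries of a
// batch are buffered until the commit entry is seen, then applied in one go;
// a gap or an abandoned batch causes the staged entries to be rejected.
package applysketch

import "errors"

type batchEntry struct {
	id     string
	seq    uint64
	commit bool // true on the last entry of the batch
}

var errRejected = errors.New("batch abandoned or has gaps, rejecting staged entries")

type applier struct {
	staged []batchEntry
}

// onEntry is called for each applied raft entry that belongs to a batch.
// (What happens to the new entry after a rejection is omitted here.)
func (a *applier) onEntry(e batchEntry, apply func(batchEntry) error) error {
	// A different batch id or an out-of-order sequence means the previous
	// batch was abandoned mid-proposal (e.g. the leader changed): reject it.
	if n := len(a.staged); n > 0 && (a.staged[0].id != e.id || a.staged[n-1].seq+1 != e.seq) {
		a.staged = nil
		return errRejected
	}
	a.staged = append(a.staged, e)
	if !e.commit {
		return nil // keep staging until the commit entry shows up
	}
	// Commit seen with no gaps: apply the whole batch, no leader round-trip.
	for _, s := range a.staged {
		if err := apply(s); err != nil {
			return err
		}
	}
	a.staged = nil
	return nil
}
```

The key property is that nothing from the batch is applied until the commit entry has been seen locally, so a partially proposed batch never becomes visible.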

@MauriceVanVeen
Member Author

What I think you meant is that instead of replicating after the commit as seen on the leader, we replicate prior to seeing the commit, and then we either invalidate the batch if it turns out to be invalid or we commit it. That should be somewhat faster since we don't need to first replicate and then apply.

I do want to consider and try out that approach, but I'd first want to walk down this implementation path. This approach is going to be fully safe and give all the guarantees the batch needs.
The other approach might be faster, but will be less safe without additional fixes.

What I'd prefer we do:

  • Continue the current implementation path that will have the consistency and safety required.
  • Do benchmarking on this approach. (I already know it is faster than PublishAsync but it can be done more thoroughly)
  • Assert this approach really is 100% safe, not only with unit tests, but also when run under Antithesis.
  • Then look into implementing the alternative approach mentioned above, and do the same benchmarks and testing. If it's faster and just as safe, I'd be happy to use it. (Likely the implementation will not differ too much, so we can start here and extend to it later.)

@derekcollison
Member

Yes, so this means a large latency spike when an app sends COMMIT and waits on that ack, since we have to replicate all of the batch only on COMMIT.

We could replicate as we go, and then commit would check everything (if deterministic) and each message would already (essentially) be ready on a commit.


@MauriceVanVeen
Member Author

There are some additional issues I can foresee when doing that; for example, counters can't have this optimistic replication because they change the headers and message body on commit, so those can't be replicated first.
I think those issues are solvable, and primarily batched counter updates would need to opt out.

I do want to note, though, that there's no large latency spike on commit! If the batch fits in a single append entry, the latency will be exactly the same as with PublishAsync. Point taken, though, on potentially having better latencies with optimistic replication; we'll need to measure whether that's indeed the case, and by how much.

@derekcollison
Member

Agreed, if they fit into one AE we should not see any effect. And under $SYS from a server that message can be any size (though we prefer not to go over 8M). But once hoisted into a user account it could be subject to max payload restrictions.

@MauriceVanVeen force-pushed the maurice/batch-large branch 2 times, most recently from 6531e0f to d8e92fd on July 16, 2025 10:33
@MauriceVanVeen marked this pull request as ready for review July 16, 2025 11:34
@MauriceVanVeen requested a review from a team as a code owner July 16, 2025 11:34
"fmt"
"strconv"
"testing"
"time"

"github.com/klauspost/compress/s2"

Member

remove new line here.

Member Author

Done

continue
} else if batchActiveId != _EMPTY_ {
// If a batch is abandoned without a commit, reject it.
mset.batchMu.Lock()
Member

Keep repeating ourselves here, maybe these are functions?

@MauriceVanVeen
Member Author
Jul 21, 2025

With the move into a separate batchApply struct (d85ff64), the batchActiveId variable that needed to be kept up-to-date could be removed.

As mentioned in #7067 (comment), I could only introduce a rejectBatchState() that's used in two places; in all other cases the lock needs to be held for longer.


// clearBatchStateLocked clears in-memory apply-batch-related state.
// mset.batchMu lock should be held.
func (mset *stream) clearBatchStateLocked() {
Member

Maybe introduce a non-Locked version that acquires the proper lock, calls these functions, and releases it.

Member Author

I've introduced a rejectBatchState(), which is now used in the two places where that's the only thing being done.
The others all need the lock to be held for a longer period.
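For context on the Locked/unlocked split being discussed, a minimal sketch of the shape such helpers typically take (the fields are placeholders, not the actual batch state):

```go
// Illustrative only: the Locked/unlocked helper split discussed above.
// The fields are placeholders; the locking pattern is the point.
package lockingsketch

import "sync"

type batchApply struct {
	mu      sync.Mutex
	id      string
	entries int
}

// clearBatchStateLocked resets the staged state. The caller must hold b.mu.
func (b *batchApply) clearBatchStateLocked() {
	b.id, b.entries = "", 0
}

// rejectBatchState is the unlocked convenience wrapper: it takes the lock,
// clears the state, and releases the lock. Callers that already hold the
// lock for a longer critical section call the Locked variant directly.
func (b *batchApply) rejectBatchState() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.clearBatchStateLocked()
}
```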

server/stream.go Outdated
batches *batching

// State to check for batch completeness before applying it.
batchMu sync.Mutex
Member

Should these all be in their own struct that we alloc only when we see the first batch?

@MauriceVanVeen
Member Author
Jul 21, 2025

Moved into a separate batchApply struct (d85ff64).
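As a sketch of the lazy-allocation idea from this thread (the stream and batchApply shapes below are assumptions, not the PR's actual types):

```go
// Illustrative only: allocate the batch-apply state the first time a batch
// entry is seen, so streams that never use batches carry no extra state.
package lazysketch

import "sync"

type batchApply struct {
	mu      sync.Mutex
	id      string
	entries int
}

type stream struct {
	mu    sync.Mutex
	batch *batchApply // nil until the first batch entry arrives
}

// batchApplyState returns the per-stream batch state, allocating on first use.
func (s *stream) batchApplyState() *batchApply {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.batch == nil {
		s.batch = &batchApply{}
	}
	return s.batch
}
```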

Successfully merging this pull request may close these issues: Batch publish - large batch support (#6978)