Simulate impact of shard movement using shard-level write load #131406


Conversation

@nicktindall (Contributor) commented Jul 17, 2025

I've been back and forth on this a bit, but I think going for something simple is best. When we start receiving shard write load estimates from the nodes, we should be able to plug those in and this should "just work" (assuming I've understood shard-level write load correctly).

We ignore queue latency in the modelling because I don't think we're going to look at it in the decider, and I can't see how we could estimate how it would change in response to shard movements (it's a function of the amount the node is overloaded AND how long it's been like that, and back-pressure should ideally keep a lid on it).

@nicktindall added the >non-issue and :Distributed Coordination/Allocation labels Jul 17, 2025
@elasticsearchmachine added the Team:Distributed Coordination and v9.2.0 labels Jul 17, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall requested a review from mhl-b July 17, 2025 06:11

public class WriteLoadPerShardSimulator {

    private final ObjectFloatMap<String> writeLoadDeltas;
Contributor:

nit: simulatedNodesLoad?

@nicktindall (author):

👍 I changed to simulatedWriteLoadDeltas, we only store the delta from the reported/original write load here. The idea there is that if no delta is present, we can just return the original NodeUsageStatsForThreadPools instance.

    }
}
writeShardsOnNode.forEach(
    shardId -> writeLoadPerShard.computeIfAbsent(shardId, k -> new Average()).add(writeUtilisation / writeShardsOnNode.size())
Contributor:

Do you equally divide the write load across all write shards on the node?

@nicktindall (author):

Yeah, this is just a stop-gap until we get actual shard loads, which should work as a drop-in replacement.
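
For illustration, a self-contained sketch of that even split (ShardId and Average here are simplified stand-ins for the Elasticsearch types, not the PR's code):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class EvenSplitSketch {
        record ShardId(String index, int id) {}

        // Simplified stand-in for the Average accumulator used in the PR.
        static class Average {
            private double sum;
            private long count;
            void add(double value) { sum += value; count++; }
            double get() { return count == 0 ? 0.0 : sum / count; }
        }

        public static void main(String[] args) {
            double writeUtilisation = 0.75; // node-level write pool utilisation
            List<ShardId> writeShardsOnNode = List.of(new ShardId("logs", 0), new ShardId("logs", 1));

            Map<ShardId, Average> writeLoadPerShard = new HashMap<>();
            writeShardsOnNode.forEach(
                shardId -> writeLoadPerShard
                    .computeIfAbsent(shardId, k -> new Average())
                    .add(writeUtilisation / writeShardsOnNode.size())
            );
            // Each shard is attributed 0.75 / 2 = 0.375 of the node's utilisation;
            // averaging smooths the estimate across samples.
            writeLoadPerShard.forEach((shard, avg) -> System.out.println(shard + " -> " + avg.get()));
        }
    }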

@mhl-b (Contributor) commented Jul 17, 2025:

OK, I was thinking maybe we should have some heuristic based on already-available data; otherwise the signal-to-noise ratio is too low. It's not uncommon to have hundreds of shards, and an even split attributes little to no load to any single shard.

For example, use a shard-size heuristic: the larger the shard, the more likely it is to carry write load. Say we linearly increase the weight of a shard as its size approaches 15GB, then decrease it as the size approaches 30GB, since we would (most of the time) roll the shard over: if size < 15GB then size/15GB else max(0, 1 - size/30GB).
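
As written, that weight function might look like the following sketch (the 15GB/30GB thresholds come from the comment above; note that, taken literally, the weight drops from ~1.0 to 0.5 at the 15GB boundary):

    // Sketch of the proposed size-based heuristic: weight ramps up linearly
    // towards 15GB, then tapers off towards 30GB where rollover is expected.
    static double writeLoadWeight(double sizeGb) {
        if (sizeGb < 15.0) {
            return sizeGb / 15.0;                  // size/15GB
        }
        return Math.max(0.0, 1.0 - sizeGb / 30.0); // max(0, 1 - size/30GB)
    }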

@nicktindall (author):

We'll have actual shard write loads shortly. Hopefully we can avoid all this guessing entirely.

#131496

@mhl-b (Contributor) left a comment:

LGTM

@nicktindall (author):

I might hold off merging until we get #131496 merged; I think we can then avoid fudging the shard write loads.

@nicktindall changed the title from "Estimate impact of shard movement using node-level write load" to "Simulate impact of shard movement using shard-level write load" Jul 21, 2025
@DiannaHohensee (Contributor) left a comment:

I left one comment where I'm concerned there might be a bug, but the other requests are just improvements.

> We ignore queue latency in the modelling because I don't think we're going to look at it in the decider, and I can't see how we could estimate how it would change in response to shard movements (it's a function of the amount the node is overloaded AND how long it's been like that, and back-pressure should ideally keep a lid on it).

I was originally imagining that we could (in future, not now) collect some per-shard stats for queuing, and make some kind of estimate of additional shard write load based on that, like auto-scaling except at the shard instead of the node level. But it may turn out that we don't need something like that; we'll probably see how it goes in production. And I haven't explored the feasibility of collecting a stat like that.

Alternatively, we could choose to be more generous with how much write load is moved away from a node, based on the queue latency: we don't know how much load to attribute to a particular shard, but we could extrapolate that when the queue latency is X seconds, we then need to move off X seconds of additional thread execution time translated into shard write load (which is thread execution time).
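
A hypothetical sketch of that second idea, with purely illustrative names (nothing here is from the PR):

    // Treat X seconds of observed queue latency as X extra execution-seconds
    // that must be shed, expressed as write load (average busy threads) over
    // an assumed observation window.
    static double extraWriteLoadToShed(double queueLatencySeconds, double observationWindowSeconds) {
        // e.g. 6s of latency over a 60s window ~= 0.1 extra busy threads to move off
        return queueLatencySeconds / observationWindowSeconds;
    }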


public class WriteLoadPerShardSimulator {

    private final ObjectDoubleMap<String> simulatedWriteLoadDeltas;
Contributor:

Is there a performance gain over Map<String, Double>? I'm wondering why use it, essentially.

@mhl-b (Contributor) commented Jul 23, 2025:

Because Double is a boxed value, a pointer to a heap-allocated double (https://docs.oracle.com/javase/tutorial/java/data/autoboxing.html). Unfortunately. This is why you see separate classes working with generic (boxed) and primitive collections, for performance reasons. Boxed values are tracked by the GC, take more space (pointer plus value), and require a dereference. And can be null :(

IntStream s;            // stream of int rather than Integer
LongStream s;           // stream of long
Arrays.binarySearch();  // 17 overloads covering each primitive plus the generic (boxed) form
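
To make the contrast concrete, a small sketch assuming the carrotsearch HPPC library that provides the ObjectDoubleMap seen in the code context above:

    import com.carrotsearch.hppc.ObjectDoubleHashMap;
    import com.carrotsearch.hppc.ObjectDoubleMap;

    import java.util.HashMap;
    import java.util.Map;

    class BoxedVsPrimitive {
        static void demo() {
            Map<String, Double> boxed = new HashMap<>();
            boxed.put("node_0", 0.25);      // autoboxes 0.25 into a heap-allocated Double
            double b = boxed.get("node_0"); // unboxes; would NPE if the key were absent

            ObjectDoubleMap<String> primitive = new ObjectDoubleHashMap<>();
            primitive.put("node_0", 0.25);       // stores the primitive double inline
            double p = primitive.get("node_0");  // returns 0.0 for a missing key, never null
        }
    }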

Contributor:

https://openjdk.org/jeps/402 - somewhere in the future, when Brian Goetz celebrates his 80th birthday, I guess :D

@nicktindall (author):

If they can somehow evolve their way to "valhalla" without breaking backward compatibility and without making the language horribly inconsistent that'll be a marvel of modern engineering.

.flatMap(index -> IntStream.range(0, 3).mapToObj(shardNum -> new ShardId(index.getIndex(), shardNum)))
.collect(Collectors.toUnmodifiableMap(shardId -> shardId, shardId -> randomDoubleBetween(0.1, 5.0, true)))
)
.build();
Contributor:

Could you log the ClusterInfo as a string? There isn't any debug information to look at if any of the tests fail (I think?), and some logging of the values might help.

@nicktindall (author):

I don't think that will add value here: a lot of numbers go into the calculations, all the values are randomised, and it's a unit test with no concurrency, so failures should be reliably reproducible with the seed. I would like to leave the logging to whoever's troubleshooting the failure.

final var writeLoadPerShardSimulator = new WriteLoadPerShardSimulator(allocation);

// Relocate a random shard from node_0 to node_1
final var randomShard = randomFrom(StreamSupport.stream(allocation.routingNodes().node("node_0").spliterator(), false).toList());
Contributor:

Log randomShard? For debug purposes; then we can match it with the ClusterInfo I suggested logging in createRoutingAllocation.

@nicktindall (author):

Again, given that this is a unit test with no concurrency, any failure should be reliably reproducible. I'm going to skip logging here, on the assumption that someone investigating a failure can log the things they're interested in.


public void simulateShardStarted(ShardRouting shardRouting) {
    final Double writeLoadForShard = writeLoadsPerShard.get(shardRouting.shardId());
    if (writeLoadForShard != null) {
Contributor:

Could you test the case where a shard write load is 0/null, as would be reported for a non-data-stream index shard?

@nicktindall (author):

This will be covered by an existing test, I renamed it to make that clearer in 58e84a2.

That test randomly nullifies the thread pool stats or the write load stats for the objects in the test.
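
Roughly the pattern being described, as a sketch (randomUsageStats is a hypothetical helper; randomBoolean/randomDoubleBetween are the usual ESTestCase-style utilities seen elsewhere in this PR):

    // Randomly nullify thread pool stats and zero out write loads so that the
    // simulator's null/zero handling is exercised by the same test.
    NodeUsageStatsForThreadPools threadPoolStats = randomBoolean() ? null : randomUsageStats();
    double shardWriteLoad = randomBoolean() ? 0.0 : randomDoubleBetween(0.1, 5.0, true);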

@nicktindall (author):

I also increased the likelihood of the utilisation numbers and write loads of being zeros in a072aaf

We also will not return a simulated utilisation below 0.0, because that would be nonsense.

Contributor:

Oh, nice change to max(X, 0.0) to avoid the negative numbers

It seems possible for the node-level and shard-level write loads not to line up exactly. E.g. the latest node-level write load reported for nodeA is 0, while it holds a shardA whose peak shard write load is >0. If we then move shardA away from nodeA, we'd go negative. Shards can be moved for reasons other than write load balancing, so maybe it could happen 🧐

@nicktindall (author):

Yeah, I think that's something we'll have to live with. I don't think it should matter too much for the purpose of the simulation.
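
The clamp under discussion, as a one-line sketch (illustrative names, not the PR's code):

    // Never report negative utilisation, even when a stale node-level reading
    // plus a departing shard's negative delta would dip below zero.
    static double simulatedUtilisation(double reportedUtilisation, double writeLoadDelta) {
        return Math.max(0.0, reportedUtilisation + writeLoadDelta);
    }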

@DiannaHohensee (Contributor) left a comment:

Sorry for the delay -- I didn't quite manage to finish reviewing on Friday.

LGTM 👍 Just some minor comments. Nice to have this piece :)

Maps.copyMapWithAddedOrReplacedEntry(
    entry.getValue().threadPoolUsageStatsMap(),
    "write",
    replaceWritePoolStats(entry.getValue(), simulatedWriteLoadDeltas.get(entry.getKey()))
Contributor:

Suggested change:
- replaceWritePoolStats(entry.getValue(), simulatedWriteLoadDeltas.get(entry.getKey()))
+ replaceWritePoolStats(entry.getValue(), simulatedNodeWriteLoadDeltas.get(entry.getKey()))

private final Map<ShardId, Double> writeLoadsPerShard;

public ShardMovementWriteLoadSimulator(RoutingAllocation routingAllocation) {
    this.originalNodeUsageStatsForThreadPools = routingAllocation.clusterInfo().getNodeUsageStatsForThreadPools();
Contributor:

If a copy were made here, there would be no need to generate the node stats on demand. The original stats are never needed again in their original form.

The only reason to keep it I can think of would be stats reporting down the line -- e.g. what did this balancer round accomplish.

Anyway, more of a note than a request. Doesn't seem like a big deal.

@nicktindall (author):

I kept the originals + deltas so that we can just return the unmodified NodeUsageStatsForThreadPools for any node that wasn't involved in shard movement. I think this should keep garbage to a minimum. Also, if we kept adding and subtracting the write load values we might introduce rounding errors from the floating-point arithmetic.
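
A sketch of that originals-plus-deltas shape (illustrative names; NodeStats stands in for NodeUsageStatsForThreadPools, and the HPPC primitive map is assumed):

    import com.carrotsearch.hppc.ObjectDoubleHashMap;
    import com.carrotsearch.hppc.ObjectDoubleMap;

    import java.util.Map;

    class OriginalsPlusDeltasSketch {
        record NodeStats(double writeUtilisation) {}

        private final Map<String, NodeStats> original;              // reported stats, never mutated
        private final ObjectDoubleMap<String> deltas = new ObjectDoubleHashMap<>();

        OriginalsPlusDeltasSketch(Map<String, NodeStats> original) {
            this.original = original;
        }

        void simulateMove(String fromNode, String toNode, double shardWriteLoad) {
            deltas.addTo(fromNode, -shardWriteLoad);
            deltas.addTo(toNode, shardWriteLoad);
        }

        NodeStats statsFor(String nodeId) {
            NodeStats stats = original.get(nodeId);
            if (deltas.containsKey(nodeId) == false) {
                return stats; // untouched node: return the original instance, no garbage
            }
            // Apply the accumulated delta once, clamped at zero.
            return new NodeStats(Math.max(0.0, stats.writeUtilisation() + deltas.get(nodeId)));
        }
    }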

@nicktindall merged commit ab2e654 into elastic:main Jul 29, 2025
33 checks passed
@nicktindall deleted the ES-12000_add_write_load_modeling_to_balancer branch July 29, 2025 04:39
szybia added a commit to szybia/elasticsearch that referenced this pull request Jul 29, 2025