Pauseless Garbage Collector (Question) #115627
81 comments · 17 replies
-
Tagging subscribers to this area: @dotnet/gc

Issue details: Is there any reason why something like the Pauseless Garbage Collector, which exists for Java from Azul, was never implemented for .NET? https://www.azul.com/products/components/pgc/
-
I'll add the generic response that the .NET GC supports functionality that Java's GCs do not, and this can complicate or invalidate optimizations that other GCs can take advantage of. For example, Java's GC doesn't support interior pointers while the .NET GC does.
-
I'd like to see the GC team write up some insight about the benefits and challenges/drawbacks of these options. What's theoretically possible but just too complex or too low priority to implement? Which features/goals fundamentally conflict?
-
These types of GCs typically trade throughput for shorter pause times. For example, they often use GC read barriers that make accessing object reference fields significantly slower. If you would like to understand the problem space, read The Garbage Collection Handbook. It has a full chapter dedicated to real-time garbage collectors. There is nothing fundamental preventing building these types of garbage collectors for .NET. It is just a lot of work to build a production-quality garbage collector. We do not see significant demand for these types of garbage collectors in .NET. Building alternative garbage collectors with very different performance tradeoffs has not been at the top of the core .NET team's priority list. I would love to see the .NET community experimenting with alternative garbage collectors with very different performance tradeoffs. That is how Azul came to be - Azul's garbage collector that you have linked to is not built by the core Java team.
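To make the read-barrier cost concrete, here is a toy sketch in plain C# (the `Cell` and `Barrier` names are hypothetical, and this is not how any production GC is implemented): a Brooks-style relocating collector routes every reference load through a per-object forwarding pointer, and that extra dependent load on every access is the throughput tax mentioned above.

```csharp
using System;

// Hypothetical sketch of a Brooks-style read barrier. Every object carries a
// forwarding pointer that normally points to itself; when the collector
// relocates the object, it re-points it at the new copy. Mutators always load
// through the forwarding pointer, so they transparently see relocated objects.
public class Cell
{
    public Cell Forward;  // forwarding pointer: self, unless relocated
    public int Value;
    public Cell() => Forward = this;
}

public static class Barrier
{
    // The "read barrier": one extra load before every reference access.
    public static Cell Read(Cell c) => c.Forward;

    public static void Demo()
    {
        var obj = new Cell { Value = 42 };
        Console.WriteLine(Read(obj).Value); // reads through the barrier

        // The "collector" relocates the object: copy it, then swing the pointer.
        var copy = new Cell { Value = obj.Value };
        obj.Forward = copy;

        // The stale reference now transparently resolves to the new copy.
        Console.WriteLine(ReferenceEquals(Read(obj), copy)); // True
    }
}
```

This is why such collectors can move objects while mutator threads keep running, at the price of slowing down every reference read.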
-
I guess that having more and more official, simultaneously supported GCs would also increase the amount of work needed to maintain and improve them all, putting even more burden on the GC team, which would mean that the existing GCs would be improved more slowly.
-
Yeah, but a pauseless collector seems to me to open up a whole new area where .NET could be used. Even if it is slower, being deterministic, without pauses every now and then, could be a huge benefit for some applications.
-
Right. If it were to follow the Azul model, it would not impact the core GC team much. I believe that the core Java GC team does not spend any cycles on the Azul GC. The Azul GC is maintained by Azul, a company with a closed-source business model.
It comes down to numbers and opportunity costs. For example, how many new developers can a pauseless GC bring to .NET? It is hard to make the numbers work.
-
For what it's worth, beware that buying the book as an eBook from the official publisher only gives access to it through the VitalSource service. There is no way to download the book except through the DRM-encumbered software, and they managed to block my account before I was even able to read a single page (no explanation given; the service just responds with a 401 error and logs me out). If you want to get the book, get it as a physical book or through Amazon Kindle and save yourself the trouble.
-
I am rather surprised to hear that! I'd love to see an experimental GC, so I'm certainly quite biased. But I'd imagine predictability of latency is a very significant concern for a number of large user bases. Game development (Unity) comes to mind, of course, as do many areas in finance and algorithmic trading. Sorry, that's it for my sales pitch; in short, I'd definitely love to see experimentation in this area.
-
Latency of a GC can be a concern in many of the same ways that latency of RAII can be a concern. Having a GC, including one that can "stop the world", is not itself strictly a blocker, and it may be of interest to note that many of the broader, well-known game engines themselves use GCs (many, but not all, of which are incremental rather than pauseless).

Most people's experience with .NET and a GC in environments like game dev has, up until this point, been with either the legacy Mono GC or the Unity GC, neither of which can really be compared with the performance, throughput, latency, or various other metrics of the precise GC that ships with RyuJIT.

Having some form of incremental GC is likely still interesting, especially if it can be coordinated to run more in places where the CPU isn't doing "important" work (such as when you're awaiting a dispatched GPU task to finish executing), but it's hardly a requirement with an advanced modern GC, especially if you're appropriately taking memory management into consideration by utilizing pools, spans/views, and other similar techniques (just as you'd have to in C++ to limit RAII or free overhead).
-
In essence, using a pool is no different from manually allocating memory. It does not reduce the mental burden of the manual management required to allocate and reclaim memory. Of course, it is also necessary to explore safe programming methods similar to Rust's. This is a popular article about GC.
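As a concrete illustration of that burden, here is a sketch using the real `ArrayPool<T>.Shared` BCL API (the `PoolDemo` wrapper is a hypothetical name for illustration): pooling removes the allocation, but the caller now owns the rent/return pairing, exactly like `malloc`/`free`.

```csharp
using System;
using System.Buffers;

public static class PoolDemo
{
    // Renting from ArrayPool<T>.Shared avoids allocating a fresh array per
    // call, but the caller now owns the buffer's lifetime: forgetting Return
    // leaks it from the pool, and touching it after Return is a
    // use-after-free-style bug. This is the manual-management burden the
    // comment above describes.
    public static int SumOfSquares(ReadOnlySpan<int> input)
    {
        int[] buffer = ArrayPool<int>.Shared.Rent(input.Length); // may be larger than requested
        try
        {
            for (int i = 0; i < input.Length; i++)
                buffer[i] = input[i] * input[i];

            int sum = 0;
            for (int i = 0; i < input.Length; i++)
                sum += buffer[i];
            return sum;
        }
        finally
        {
            ArrayPool<int>.Shared.Return(buffer); // the manual "free"
        }
    }
}
```

For example, `PoolDemo.SumOfSquares(new[] { 1, 2, 3 })` computes 14 without allocating a new array on each call, but only because the `try`/`finally` keeps the rent/return pair balanced.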
-
It would be nice to see how this changes in newer versions of .NET, and also how Java compares (with both its default collector and the pauseless collector mentioned here).
-
Maybe an option like the incremental GC being adopted by Unity is feasible here, where a "full GC" is broken up into a sequence of several "partial GCs" (i.e. the GC work is done incrementally), so that although the total pause time doesn't change, each individual pause can be minimized to nearly pauseless levels. cc: @Maoni0
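A toy sketch of that idea (illustrative only; `Node` and `IncrementalMark` are made-up names, and a real incremental collector also needs write barriers so mutations made between slices are not missed - that part is omitted here): marking keeps a work list and runs for at most a fixed time budget per step, then yields back to the mutator.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// A heap object in this toy model: a mark bit plus outgoing references.
public class Node
{
    public bool Marked;
    public List<Node> Children = new();
}

public static class IncrementalMark
{
    // Grey set: objects discovered but not yet scanned.
    public static readonly Queue<Node> WorkList = new();

    // Run one marking slice. Returns true when the whole graph is marked,
    // false when the time budget ran out and marking should resume later
    // (e.g. next frame).
    public static bool Step(TimeSpan budget)
    {
        var sw = Stopwatch.StartNew();
        while (WorkList.Count > 0)
        {
            if (sw.Elapsed > budget) return false; // yield back to the mutator

            var n = WorkList.Dequeue();
            if (n.Marked) continue;
            n.Marked = true;
            foreach (var child in n.Children)
                WorkList.Enqueue(child);
        }
        return true; // marking complete
    }
}
```

Usage would be to enqueue the roots once, then call `Step` with a small budget each frame until it returns true: the total marking work is unchanged, but no single pause exceeds the budget.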
-
I'm a game/engine dev on osu!. We've used C# all the way from .NET 3.5 to .NET 8, and have fully rewritten the game over the years, which has brought new challenges in balancing features that wouldn't have been possible before against what works best with the .NET GC. By far our greatest fight has been with the GC - it is definitely a felt presence and at the forefront of everything we do. I've personally gone pretty deep into minimising pauses with issues such as #48937, #12717, and #76290, but as a team we've always been very conscious about allocations because our main loop runs at potentially 1000Hz, or historically even more than that. What we've found works best for us is turning on
Where it breaks down, however, is areas that require allocs such as menus. This GC mode will cause terrible stutters when doing anything remotely intensive, meaning that we have to very carefully switch GC modes at opportune moments to get the best of both worlds, and sometimes those worlds are intertwined.
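The mode switching described here can be expressed with the real `GCSettings.LatencyMode` and `GC.TryStartNoGCRegion` BCL APIs. A minimal sketch (the `GcModes` helper is hypothetical, not the poster's actual code):

```csharp
using System;
using System.Runtime;

public static class GcModes
{
    // During gameplay: ask the GC to defer blocking collections.
    public static void EnterLatencySensitiveSection() =>
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;

    // In menus/loading: allow full collections again and clean up eagerly
    // so the debt is paid where a stutter is least noticeable.
    public static void EnterAllocationHeavySection()
    {
        GCSettings.LatencyMode = GCLatencyMode.Interactive;
        GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
    }

    // For a strictly bounded critical section: a no-GC region guarantees no
    // collection as long as allocations stay under the requested budget.
    public static void RunWithoutGc(long budgetBytes, Action action)
    {
        if (!GC.TryStartNoGCRegion(budgetBytes))
        {
            action(); // could not reserve the budget; run normally
            return;
        }
        try { action(); }
        finally
        {
            // EndNoGCRegion throws if the region already ended (e.g. the
            // budget was exceeded), so check the mode first.
            if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                GC.EndNoGCRegion();
        }
    }
}
```

The trade-off the comment describes is visible here: each mode only moves the GC work around, so the transitions have to be timed for moments when a pause is affordable.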
-
@smoogipoo it seems to me that you should be working directly with MS folks on this. Your expertise in gamedev would help so many people out. Stuttering in Unity, for example, has been a blemish on C# for a very long time. It gives people the impression that C# is just a bad language, which is absolutely disastrous to the community as a whole as more people move away from these tools and end up using other languages.
-
It should be caused by splitting the GC work into much smaller grains, not by doing more work because of "more GCs".
-
@AlgorithmsAreCool We're getting ready to deploy in production (at some scale, to get more feedback), so I've added a GHA workflow to build it for our sake. If it works for you, you can find the artifacts in the releases here: https://github.com/ppy/Satori/releases - I've added arm64 for you, and otherwise the repo should be a good reference for how to do the same :) Not sure if this is something @VSadov would be interested in having in his repo (it's a bit tailored to our needs) - let me know if so!
-
It looks like I've introduced a bug in my last change that enables However, with THP (Transparent Huge Pages) enabled on Linux, which is the default, the commit uses 2MB granularity. I am still committing in 2MB chunks while reserving 1MB + some extra. I am a bit surprised that it did not fail in regular tests. As a temporary workaround you may try
-
I added some GC pause time and framerate analysis to tModLoader and tried out the Satori GC. On the current GC, our allocation rate is ~12MB/sec, with a G0 every 1.6sec, a G1 every 4 G0s, and a G2 every ~1 minute. Pause times are ~500us for G0, ~1000us with G1, and 2000us when G2 kicks in. Unfortunately the random nature of our game means it's hard to get a true apples-to-apples comparison, and Satori GC doesn't report to the @VSadov could Satori support Graphs below, 60fps. Ignore the last ~40 frames; the application loses focus when taking a screenshot.
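When full EventPipe counters are awkward to wire up, one coarse in-process alternative for this kind of per-frame analysis (a sketch; `PauseSampler` is a hypothetical helper) is to sample the delta of `GC.GetTotalPauseDuration()`, a real API available since .NET 7, once per frame:

```csharp
using System;

public static class PauseSampler
{
    // Running total of GC pause time at the last sample point.
    static TimeSpan _last = GC.GetTotalPauseDuration();

    // Call once per frame: returns the GC pause time accrued since the
    // previous call, which attributes pauses to individual frames (though
    // unlike an EventPipe session it cannot give per-GC histograms or
    // distinguish generations).
    public static TimeSpan SampleFrame()
    {
        TimeSpan total = GC.GetTotalPauseDuration(); // .NET 7+
        TimeSpan delta = total - _last;
        _last = total;
        return delta;
    }
}
```

In a game loop one would log or graph the frames where `SampleFrame()` exceeds the frame budget, which is enough to correlate stutters with collections.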
-
Is there a way to test Satori GC on Android devices by any chance?
-
If you mean with a .NET Android or MAUI app then no, they use different runtimes. Android support for CoreCLR is being added in .NET 10, but it also needs some GC bridge work specific to Android, so even if Satori were ported it might need Android-specific fixes.
-
@Chicken-Bones Would it help if I bundled up the GC pause monitoring code that I/VSadov used in my benchmarks? You just call it in Main or wherever and it will record pause information into an HDR histogram that you can report on. Also, with your allocation rate being so low (40KB/s) I am surprised your current GC pauses are so high.
-
This time I created another GC-aware test: BinaryTree.

Test code:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Linq;
using System.Runtime;
using System.Runtime.CompilerServices;
using System.Threading;
using Microsoft.Diagnostics.NETCore.Client;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Analysis;
using Microsoft.Diagnostics.Tracing.Parsers;

class Program
{
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static void Main()
    {
        var pauses = new List<double>();

        // Listen to this process's own GC events via EventPipe and record
        // the pause duration of every completed GC.
        var client = new DiagnosticsClient(Environment.ProcessId);
        EventPipeSession eventPipeSession = client.StartEventPipeSession([new("Microsoft-Windows-DotNETRuntime",
            EventLevel.Informational, (long)ClrTraceEventParser.Keywords.GC)], false);
        var source = new EventPipeEventSource(eventPipeSession.EventStream);
        source.NeedLoadedDotNetRuntimes();
        source.AddCallbackOnProcessStart(proc =>
        {
            proc.AddCallbackOnDotNetRuntimeLoad(runtime =>
            {
                runtime.GCEnd += (p, gc) =>
                {
                    if (p.ProcessID == Environment.ProcessId)
                    {
                        pauses.Add(gc.PauseDurationMSec);
                    }
                };
            });
        });

        Console.WriteLine($"Warm up start");
        for (var i = 0; i < 100; i++)
            Test(15);
        Console.WriteLine($"Warm up done");
        pauses.Clear();

        // Settle the heap before the measured run.
        // GCSettings.LatencyMode = GCLatencyMode.LowLatency;
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Aggressive, true, true);
        GC.WaitForPendingFinalizers();
        GC.WaitForFullGCComplete();
        Thread.Sleep(5000);

        new Thread(() => source.Process()).Start();
        Console.WriteLine($"Execution Start");
        Test(22);
        Console.WriteLine($"Execution Done");
        source.StopProcessing();

        Console.WriteLine($"Max GC Pause: {pauses.Max()}ms");
        Console.WriteLine($"Average GC Pause: {pauses.Average()}ms");
        pauses.Sort();
        Console.WriteLine($"P99.9 GC Pause: {pauses.Take((int)(pauses.Count * 0.999)).Max()}ms");
        Console.WriteLine($"P99 GC Pause: {pauses.Take((int)(pauses.Count * 0.99)).Max()}ms");
        Console.WriteLine($"P95 GC Pause: {pauses.Take((int)(pauses.Count * 0.95)).Max()}ms");
        Console.WriteLine($"P90 GC Pause: {pauses.Take((int)(pauses.Count * 0.9)).Max()}ms");
        Console.WriteLine($"P80 GC Pause: {pauses.Take((int)(pauses.Count * 0.8)).Max()}ms");
        Console.WriteLine($"Total GC Pause: {pauses.Sum()}ms");
        Console.WriteLine($"GC Count: G0 {GC.CollectionCount(0)}, G1 {GC.CollectionCount(1)}, G2 {GC.CollectionCount(2)}");
        using (var process = Process.GetCurrentProcess())
        {
            Console.WriteLine($"Peak WorkingSet: {process.PeakWorkingSet64} bytes");
        }

        // Keep forcing full GCs so the steady-state working set can be observed.
        Console.WriteLine($"Force GC...");
        while (true)
        {
            using var process = Process.GetCurrentProcess();
            Console.WriteLine($"...WorkingSet After GC: {process.WorkingSet64} bytes");
            GC.Collect(GC.MaxGeneration, GCCollectionMode.Default);
            Thread.Sleep(5000);
        }
    }

    static void Test(int size)
    {
        var bt = new BinaryTrees.Benchmarks();
        var sw = Stopwatch.StartNew();
        bt.ClassBinaryTree(size);
        if (size == 22)
            Console.WriteLine($"Elapsed: {sw.Elapsed.TotalMilliseconds}ms");
    }
}

public class BinaryTrees
{
    class ClassTreeNode
    {
        class Next { public required ClassTreeNode left, right; }
        readonly Next? next;

        ClassTreeNode(ClassTreeNode left, ClassTreeNode right) =>
            next = new Next { left = left, right = right };

        public ClassTreeNode() { }

        internal static ClassTreeNode Create(int d)
        {
            return d == 1 ? new ClassTreeNode(new ClassTreeNode(), new ClassTreeNode())
                          : new ClassTreeNode(Create(d - 1), Create(d - 1));
        }

        internal int Check()
        {
            int c = 1;
            var current = next;
            while (current != null)
            {
                c += current.right.Check() + 1;
                current = current.left.next;
            }
            return c;
        }
    }

    public class Benchmarks
    {
        const int MinDepth = 4;

        public int ClassBinaryTree(int maxDepth)
        {
            var longLivedTree = ClassTreeNode.Create(maxDepth);
            var nResults = (maxDepth - MinDepth) / 2 + 1;
            for (int i = 0; i < nResults; i++)
            {
                var depth = i * 2 + MinDepth;
                var n = 1 << maxDepth - depth + MinDepth;
                var check = 0;
                for (int j = 0; j < n; j++)
                {
                    check += ClassTreeNode.Create(depth).Check();
                }
            }
            return longLivedTree.Check();
        }
    }
}
```

Result (collapsed result tables for each configuration):
- Workstation GC
- Server GC
- DATAS GC
- Satori GC
- Satori GC (Low Latency)

Note that the GC counts here mistakenly include the GCs run during warmup. Finally, this is a scenario where Satori didn't perform well by default in terms of throughput (which is reflected in the execution time), but it's still far better than Workstation GC. By the way, Satori becomes a monster that beats any other GC on almost all metrics if I set

Result with Satori (Interactive,
cc @VSadov: you may be interested in the result with
-
There were many questions and suggestions in this thread which makes it hard to keep track of all the things.
Satori is based on .NET 8.0 now, so no Android support. When it is rebased onto a version of .NET that supports Android (is that 10?), it would be interesting to consider.
I think Satori is done with the phases:
And now we are at the phase:
Any further plans would really depend on usefulness vs. tradeoffs in the context of real apps. I think we touched on this much earlier in this thread. The only way right now to have an alternative GC is in a fork of the runtime.
-
Right. Some scenarios/apps benefit a lot from gen0 and some not so much. The escape check in the barriers has some cost, but the benefit is allowing cheaper nonblocking GCs. If the latter is not happening because too many objects escape, then the cost of the escape analysis is wasted. In the context of Roslyn compiling itself I see a gen0 to gen1 ratio of 15 to 1, so gen0 is useful. That is with Roslyn aggressively using pools. There is a threshold at which escape analysis turns itself off if too much escapes. Maybe it needs some tuning. Logged: VSadov/Satori#47
-
Later this year, a team I'm on will work on a tail-latency-sensitive project. The existing Satori numbers are so good that it's not even a question whether it's worth attempting to integrate it before entertaining the brute-force "let's just amortize all allocations" route. The support/fixes/maintenance effort pales in comparison to the advantage a low-pause GC brings. The only requirement at this point is "does not crash or corrupt the heap" - if the project succeeds we will be using Satori for that deployment.

There are plenty of features that .NET teams have invested effort in with arguably smaller upside, compared to a GC which can benefit the whole ecosystem at large. I don't know what the politics within the .NET teams are, but not immediately allocating more resources to Satori seems to me like a lapse of judgement (apologies if this comes across too strongly; I understand this was first of all a personal project, but that does not mean .NET management should not play their cards well).

If anything, the existence of Satori GC makes the WKS GC no longer needed, as it's worse in practically every scenario. .NET could opt for "throughput-optimized" and "latency-optimized" GC feature switches instead, completely dropping WKS for good. It is a substantial amount of work, but the numbers are clearly even better than what the async2 experiment demonstrated comparatively, so it would be a shame if Satori GC is not discussed as a new but important priority for .NET 11.
-
WKS is still capable of keeping a much smaller working set, as shown in most benchmarks. It definitely still has its place, as "resource-optimized".
The story can be somewhat more complex when the two GCs use drastically different implementations, rather than the same implementation with different optimizations.
-
I want to take a moment to say thank you, @VSadov. You have done multiple amazing things by not only creating this project, but also by putting a bow on it so that laymen can build the repo and get the binaries without deep knowledge of the runtime. THANK YOU for making such an impressive and accessible project.
I certainly agree we need as much data from actual applications as possible. But in turn, a big way to get the community to kick the tires would be to get it into a position where more people can test it. There is a segment of .NET developers, especially in the game community but also in other latency-sensitive domains, that has been hoping for more competitive latency options in the .NET GC landscape for over a decade. Those developers have gone to great extremes to pool all their objects and avoid angering the GC at all costs, at who knows what cost to application complexity. And although the current SVR GC provides excellent performance characteristics for many workloads, these preliminary numbers from Satori suggest a new Pareto frontier on the axes of latency, throughput, and heap size. As neon said, I'm sure there are complex decisions behind how much time and energy to invest in various experiments, but it feels like even the current level of experimentation is producing such compelling results that they deserve investment.
-
Should we click "move to discussion" on this issue? Then you get threading.
-
Not sure how it works, but by the sound of it, it is exactly what we want. Where is the button?
-
Is there any reason why something like the Pauseless Garbage Collector, which exists for Java from Azul, was never implemented for .NET?
https://www.azul.com/products/components/pgc/
https://www.artima.com/articles/azuls-pauseless-garbage-collector