Kepler Performance Analysis #391
Replies: 3 comments
I updated the method to collect CPU frequency using BPF instead of reading kernel files. This update brings a big performance improvement.
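For context, here is a minimal sketch (not Kepler's actual code) of the kind of per-CPU sysfs read that the BPF-based collection replaces, assuming the standard Linux cpufreq layout under /sys/devices/system/cpu:

```go
// Minimal sketch: reading the current frequency of every CPU from sysfs.
// Each call opens and parses one file per CPU, which is the per-sample
// overhead a BPF-based approach avoids.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCPUFreqsFromSysfs returns the current frequency (kHz) per CPU by reading
// scaling_cur_freq under the standard cpufreq sysfs layout.
func readCPUFreqsFromSysfs() (map[int]uint64, error) {
	paths, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")
	if err != nil {
		return nil, err
	}
	freqs := make(map[int]uint64, len(paths))
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // CPU may be offline
		}
		khz, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
		if err != nil {
			continue
		}
		// Extract the CPU index from ".../cpuN/cpufreq/scaling_cur_freq".
		cpuDir := filepath.Base(filepath.Dir(filepath.Dir(p)))
		id, err := strconv.Atoi(strings.TrimPrefix(cpuDir, "cpu"))
		if err != nil {
			continue
		}
		freqs[id] = khz
	}
	return freqs, nil
}

func main() {
	freqs, err := readCPUFreqsFromSysfs()
	if err != nil {
		panic(err)
	}
	fmt.Println(freqs)
}
```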
The next steps are to improve the …
Kepler Performance Analysis
Goal
Issue #365 pointed to a possible performance regression after changing metrics from per-pod to per-container.
This discussion aims to shed light on Kepler's performance.
To understand performance, let's look at three different code versions: v0.3 (per pod metrics), the current code before the GPU update, and the current code with the GPU update.
For the analysis, we are going to use the go pprof tool to profile CPU and memory (heap) usage for 60s.
CPU Analysis
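Before the per-version results, here is a hedged sketch of how such 60s CPU and heap profiles can be collected, assuming Kepler is built with Go's standard net/http/pprof handlers enabled; the localhost:6060 address is an assumption, not Kepler's actual configuration:

```go
// Sketch: expose Go's standard pprof HTTP endpoints so CPU and heap profiles
// can be fetched for a fixed window (e.g. 60s).
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this listener running, a 60s CPU profile can be fetched with:
	//   go tool pprof -seconds 60 http://localhost:6060/debug/pprof/profile
	// and allocation counts from a heap profile with:
	//   go tool pprof -alloc_objects http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```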
v0.3 (per pod metrics)

go tool pprof cpu.prof
before gpu update

go tool pprof cpu.prof
with gpu update

go tool pprof cpu.prof
CPU Analysis Comments
We can see that the time spent in the runtime.cgocall function has increased after the update. In fact, we see a lot of runtime activity, which often indicates Garbage Collector (GC) activity.
This is also the conclusion from issue #381: the GC can be the root cause of the performance degradation.
There are several reasons why the GC can become more intensive, and the main one is that it must free the heap memory of many objects.
Heap escape is a known issue in our code; we have a test that always fails, and we need to improve this in the future.
However, we need to understand which functions are creating the most objects and what is happening in the code.
We'll discuss this in the next section.
Note that to mitigate the GC problem we can reduce the amount of heap memory we allocate, and we also have the option of making the GC run less often via the GOGC variable, as sketched below. However, raising GOGC increases program memory usage.
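As an illustration of the GOGC trade-off, the GC target can be raised either with the GOGC environment variable or programmatically; the value 200 below is an arbitrary example, not a recommendation for Kepler:

```go
// Illustration only: relaxing the GC target to trade memory for less frequent
// collections. GOGC=200 (or debug.SetGCPercent(200)) lets the heap grow to
// roughly 3x the live set before a collection instead of the default 2x
// (GOGC=100), at the cost of higher memory usage.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOGC=200.
	old := debug.SetGCPercent(200)
	fmt.Printf("GC target changed from %d%% to 200%%\n", old)
}
```

Whether the extra memory is worth it for Kepler would need to be validated with the same pprof measurements.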
Memory analysis
v0.3 (per pod metrics)


go tool pprof -alloc_objects mem.prof
before gpu update


go tool pprof -alloc_objects mem.prof
with gpu update


go tool pprof -alloc_objects mem.prof
MEM Analysis Comments
We can see in the results that the functions allocating the most objects are getCPUCoreFrequency and kubelet ListPods.
In a high-level analysis, ListPods appears to be the critical function that is creating more objects after the upgrade, creating up to 3.2x more objects than in version v0.3.
ListPods is called when the pod information is not in the cache, so we need to call the kubelet API to get this information.
Notice that in ListPods, we iterate through the list of all pods to find the pods with the target ID. Therefore, this function becomes more expensive when the system has more pods running.
While we need a deeper look into this, the issue could be related to the cache, so we may need to improve it (see the sketch below).
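As a rough illustration of the caching idea (hypothetical names, not Kepler's actual API), a lookup keyed by container ID could fall back to the full kubelet pod list only on a cache miss:

```go
// Hypothetical sketch: keep a map from container ID to pod info and only fall
// back to a full kubelet pod list on a cache miss. listPodsFromKubelet is a
// placeholder, not Kepler's actual API.
package main

import "sync"

type PodInfo struct {
	Name      string
	Namespace string
}

type podCache struct {
	mu    sync.RWMutex
	byCID map[string]PodInfo // container ID -> pod info
}

// listPodsFromKubelet stands in for the expensive call to the kubelet API that
// returns all pods on the node together with their container IDs.
func listPodsFromKubelet() (map[string]PodInfo, error) {
	// ... call the kubelet /pods endpoint and flatten containers -> pods ...
	return map[string]PodInfo{}, nil
}

// Lookup returns the pod for a container ID, refreshing the whole cache only
// when the ID is unknown. This avoids iterating the full pod list on every
// metric update, which is what makes ListPods expensive on busy nodes.
func (c *podCache) Lookup(containerID string) (PodInfo, bool, error) {
	c.mu.RLock()
	info, ok := c.byCID[containerID]
	c.mu.RUnlock()
	if ok {
		return info, true, nil
	}
	fresh, err := listPodsFromKubelet()
	if err != nil {
		return PodInfo{}, false, err
	}
	c.mu.Lock()
	c.byCID = fresh
	c.mu.Unlock()
	info, ok = fresh[containerID]
	return info, ok, nil
}

func main() {
	c := &podCache{byCID: map[string]PodInfo{}}
	if info, ok, err := c.Lookup("abc123"); err == nil && ok {
		_ = info
	}
}
```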
Another solution we discussed earlier is the introduction of a new component/proxy that will watch for resources (pods, jobs, etc.). This proxy will act as a cache and will transmit event updates to Kepler instances, where events will be filtered by node name.
The main reason to create a proxy for the apiserver is to avoid overloading it with List operations when multiple Kepler instances are restarted. With this solution, we could remove ListPods and ensure consistency in the code, as we would be able to promptly identify deleted pods.
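To illustrate the watch-and-filter idea (this is not the proposed proxy itself, and the names here are assumptions), a client-go shared informer restricted to the local node with a field selector would deliver pod add/delete events without repeated List calls:

```go
// Sketch of the watch-based idea: subscribe to pod events filtered by node
// name instead of listing all pods on every lookup.
package main

import (
	"log"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodeName := os.Getenv("NODE_NAME") // assumption: injected via the downward API

	// Only watch pods scheduled on this node, so apiserver load and local
	// memory scale with the node, not the cluster.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			log.Printf("pod added: %s/%s", pod.Namespace, pod.Name)
		},
		DeleteFunc: func(obj interface{}) {
			// Deleted pods are reported promptly, addressing the consistency
			// concern mentioned above.
			if pod, ok := obj.(*corev1.Pod); ok {
				log.Printf("pod deleted: %s/%s", pod.Namespace, pod.Name)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep running
}
```

A standalone proxy would move this watch out of each Kepler instance, but the node-name filtering idea is the same.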
Latency analysis
The logs show how long Kepler took to update all metrics; num is the number of pods or containers, depending on the code version.
v0.3 (per pod metrics)
with gpu update

Latency Analysis Comments
Sometimes the updated code is taking longer to update all the metrics. We need a deeper analysis to fully understand why.
However, the number of containers is probably not the reason for the performance degradation, as the difference in num is minimal.
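For reference, a minimal sketch (hypothetical function names) of how such a latency log line can be produced:

```go
// Sketch: time the full metric-update pass and report the duration together
// with the number of containers (or pods) processed.
package main

import (
	"log"
	"time"
)

// updateAllMetrics stands in for the per-interval metric update; it returns
// how many containers (or pods) were processed.
func updateAllMetrics() int {
	time.Sleep(50 * time.Millisecond) // placeholder work
	return 42
}

func main() {
	start := time.Now()
	num := updateAllMetrics()
	log.Printf("updated all metrics in %v, num=%d", time.Since(start), num)
}
```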
Signed-off-by: Marcelo Amaral [email protected]