Commit cefe652

docs: add Exporter overview (#1636)
Signed-off-by: Maryam Tahhan <[email protected]>
1 parent d16be0d commit cefe652

5 files changed

+465
-2
lines changed

doc/design/images/exporter-seq.svg

Lines changed: 1 addition & 0 deletions

doc/design/images/exporter.png

222 KB
Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
1+
# Kepler Exporter Design
2+
3+
This document provides a simple overview of the design of the Kepler
4+
Exporter, a tool for measuring and reporting the power utilization of a
5+
system and its running processes.
6+
7+
> **_NOTE_**: This guide is not intended to replace the detailed Kepler
8+
[overview](https://sustainable-computing.io/usage/deep_dive/) but rather provide
9+
a starting point for a new contributor to this repo.
10+
11+
Kepler monitors system processes by tracking task switches in the Kernel and
12+
logging stats. These stats are then used to estimate the power usage of the system
13+
and its associated processes. Kepler collects power data using:
14+
15+
- eBPF/Hardware Counters
16+
- Real-time Component Power Meters (e.g., RAPL)
17+
- Platform Power Meters (ACPI/IPMI, etc.)
18+
19+
Below is a high-level representation of the Kepler Exporter components:
20+
21+
![Exporter](../images/exporter.png)
22+
23+
Metrics in Kepler can be broken down into 2 categories:
24+
25+
1. Resource metrics.
26+
1. Energy metrics.
27+
28+
## Exporter Introduction
29+
30+
Package: `cmd/exporter`
31+
32+
The `Exporter` is the main Kepler program, performing the following operations:
33+
34+
- Starting various power collection implementations needed to collect metrics from
35+
the platform and its components (DRAM, uncore, core, package).
36+
- Creating a BPF exporter.
37+
- Creating a collector manager to collect and expose the collected metrics.
38+
- Creating an HTTP endpoint that exposes metrics to Prometheus.
39+
40+
Below is the startup sequence for the Kepler Exporter:
41+
42+
![Exporter startup](../images/exporter-seq.svg)
43+
44+
> **_NOTE_**: How system power consumption metrics are collected depends on
45+
the environment in which Kepler is deployed (bare metal or VM).
46+
For more details on this please see the [detailed documentation][1].
47+
48+
The following sections will cover the main functionality of the various Kepler
49+
components.
50+
51+
## BPF Exporter
52+
53+
Package: `pkg/bpf`
54+
55+
The BPF exporter is created in the main Kepler program and passed to
56+
the collector manager when it is instantiated:
57+
58+
```golang
m := manager.New(bpfExporter)
```
61+
62+
The role of the BPF exporter is to set up the BPF programs that collect
63+
the low-level resource metrics associated with each process. Its
64+
functionality includes:
65+
66+
1. Modifying the eBPF program sampling rate and number of CPUs.
67+
1. Loading the eBPF program.
68+
1. Attaching the `KeplerSchedSwitchTrace` eBPF program.
69+
1. Attaching the `KeplerIrqTrace` eBPF program if `config.ExposeIRQCounterMetrics`
70+
is enabled.
71+
1. Initializing `enabledSoftwareCounters`.
72+
1. Attaching the `KeplerReadPageTrace` eBPF program.
73+
1. Attaching the `KeplerWritePageTrace` eBPF program.
74+
1. If `config.ExposeHardwareCounterMetrics` is enabled it creates the following
75+
hardware events:
76+
1. CpuInstructionsEventReader
77+
1. CpuCyclesEventReader
78+
1. CacheMissEventReader
79+
80+
It also initializes `enabledHardwareCounters`.
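
The conditional steps in the list above can be sketched as a small planning function. This is an illustration only: the `Config` struct and `attachPlan` function are hypothetical stand-ins, and the code merely returns names rather than loading or attaching real eBPF programs.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the Kepler config flags named above.
type Config struct {
	ExposeIRQCounterMetrics      bool
	ExposeHardwareCounterMetrics bool
}

// attachPlan returns, in order, the names of the eBPF programs and hardware
// event readers that would be set up for the given configuration.
func attachPlan(cfg Config) []string {
	plan := []string{"KeplerSchedSwitchTrace"}
	if cfg.ExposeIRQCounterMetrics {
		plan = append(plan, "KeplerIrqTrace")
	}
	plan = append(plan, "KeplerReadPageTrace", "KeplerWritePageTrace")
	if cfg.ExposeHardwareCounterMetrics {
		plan = append(plan,
			"CpuInstructionsEventReader",
			"CpuCyclesEventReader",
			"CacheMissEventReader",
		)
	}
	return plan
}

func main() {
	for _, name := range attachPlan(Config{ExposeIRQCounterMetrics: true}) {
		fmt.Println(name)
	}
}
```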
81+
82+
## CollectorManager
83+
84+
The Kepler Exporter (`cmd/exporter`) creates an instance of the `CollectorManager`.
85+
The `CollectorManager` contains the following items:
86+
87+
- `StatsCollector` that is responsible for collecting resource and energy consumption
88+
metrics. It uses the model implementation to estimate the process energy (total
89+
and per component) and node energy (using the resource stats).
90+
- `PrometheusCollector` which is a Prometheus exporter that exposes the Kepler
91+
metrics on a Prometheus-friendly URL.
92+
- `Watcher` that watches the Kubernetes API server for pod events.
93+
94+
```golang
type CollectorManager struct {
	// StatsCollector is responsible for collecting resource and energy
	// consumption metrics and calculating them when needed.
	StatsCollector *collector.Collector

	// PrometheusCollector implements the external Collector interface
	// provided by the Prometheus client.
	PrometheusCollector *exporter.PrometheusExporter

	// Watcher registers with the Kubernetes API server to watch for pod
	// events, adding or removing pods from the ContainerStats map.
	Watcher *kubernetes.ObjListWatcher
}
```
106+
107+
On initialization the `CollectorManager` also creates the power estimator models.
108+
109+
### StatsCollector
110+
111+
Package: `pkg/manager`
112+
113+
`StatsCollector` is responsible for updating the following stats:
114+
115+
- Node Stats
116+
- Process stats
117+
- Container stats
118+
- VM stats
119+
120+
> **_NOTE_**: These stats are updated by various subsystems.
121+
122+
When the [collector manager](#collectormanager) is started by the Kepler Exporter,
123+
it kicks off an endless loop that updates the stats periodically.
124+
125+
For process statistics, the Process collector uses a BPF process collector to
126+
retrieve information collected by the Kepler BPF programs and stored in BPF
127+
maps. This information includes the process/executable name, PID, cgroup,
128+
CPU cycles, CPU instructions, cache misses, and cache hits. The BPF process
129+
collector checks if these processes belong to a VM or container. It also
130+
aggregates all the Kernel processes' metrics (which have a cgroup
131+
ID of 1 and a PID of 1). If GPU statistics are available per process,
132+
the stats are extended to include GPU compute and memory utilization.
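
One reading of the kernel-process aggregation described above is sketched here. The `Sample` type and `aggregate` function are hypothetical simplifications; the real BPF map entries carry more fields, and the actual collector logic may differ.

```go
package main

import "fmt"

// Sample is a hypothetical, simplified view of one per-process BPF map entry.
type Sample struct {
	PID       uint64
	CgroupID  uint64
	CPUCycles uint64
}

// kernelID is the cgroup ID and PID under which kernel work is grouped,
// following the description above.
const kernelID = 1

// aggregate sums CPU cycles per PID, folding every sample whose cgroup ID
// is kernelID into a single entry keyed by PID kernelID.
func aggregate(samples []Sample) map[uint64]uint64 {
	out := make(map[uint64]uint64)
	for _, s := range samples {
		pid := s.PID
		if s.CgroupID == kernelID {
			pid = kernelID
		}
		out[pid] += s.CPUCycles
	}
	return out
}

func main() {
	samples := []Sample{
		{PID: 10, CgroupID: 1, CPUCycles: 100},
		{PID: 11, CgroupID: 1, CPUCycles: 50},
		{PID: 200, CgroupID: 7, CPUCycles: 30},
	}
	totals := aggregate(samples)
	fmt.Println("kernel cycles:", totals[kernelID])
	fmt.Println("pid 200 cycles:", totals[200])
}
```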
133+
134+
Node energy stats are also retrieved (if collection is supported). These stats
135+
include the underlying component stats (core, uncore, dram, package, gpus, ...),
136+
as well as the overall platform stats (Idle + Dynamic energy), and the
137+
process energy consumption. The process energy consumption is estimated
138+
using its resource utilization and the node components' energy consumption.
139+
140+
`StatsCollector` eventually passes all the metrics it collects to `pkg/model` through
141+
`UpdateProcessEnergy()`, which estimates the power consumption of each process.
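
To make the idea of estimating process energy from resource utilization concrete, here is a minimal ratio-based sketch. It is not the `pkg/model` code: `splitEnergy` is a hypothetical helper that attributes a node-level energy reading to processes in proportion to a single usage counter, and attributing zero when no usage was recorded is an assumption of this sketch.

```go
package main

import "fmt"

// splitEnergy divides nodeEnergy among processes in proportion to each
// process's share of the total resource usage (for example, CPU cycles).
// If no usage was recorded, every process is attributed zero.
func splitEnergy(nodeEnergy float64, usage map[string]float64) map[string]float64 {
	total := 0.0
	for _, u := range usage {
		total += u
	}
	out := make(map[string]float64, len(usage))
	for name, u := range usage {
		if total == 0 {
			out[name] = 0
			continue
		}
		out[name] = nodeEnergy * u / total
	}
	return out
}

func main() {
	// Usage values and the 1000 mJ node reading are made-up inputs.
	usage := map[string]float64{"proc-a": 300, "proc-b": 100}
	energy := splitEnergy(1000, usage)
	fmt.Println("proc-a mJ:", energy["proc-a"])
	fmt.Println("proc-b mJ:", energy["proc-b"])
}
```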
142+
143+
> **_NOTE_**: For details on the Ratio Power Model, refer to this [explanation][2].
144+
145+
### PrometheusCollector
146+
147+
Package: `pkg/manager`
148+
149+
`PrometheusCollector` supports multiple collectors: container, node, VM and process.
150+
The various collectors implement the `prometheus.Collector` interface. Each of these
151+
collectors fetches the Kepler metrics and exposes them on a Prometheus-friendly URL.
152+
In Kepler the stats structures are shared between the PrometheusCollector(s) and
153+
the StatsCollector(s).
154+
155+
The `prometheus.Collector` interface defines the following functions:
156+
157+
- `Describe` sends the super-set of all possible descriptors of metrics
158+
collected by this Collector to the provided channel and returns once
159+
the last descriptor has been sent.
160+
- `Collect` is called by the Prometheus registry when collecting metrics.
161+
The implementation sends each collected metric via the provided channel
162+
and returns once the last metric has been sent.
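
The channel-based contract of `Describe` and `Collect` can be sketched without the Prometheus client library. In the example below, `Desc`, `Metric`, and `Collector` are simplified stand-ins that only mirror the shape of the real `prometheus` types; `nodeCollector`, `gather`, and the metric name are hypothetical.

```go
package main

import "fmt"

// Desc and Metric are simplified stand-ins for prometheus.Desc and
// prometheus.Metric; they are NOT the client library types.
type Desc struct{ Name, Help string }

type Metric struct {
	Desc  *Desc
	Value float64
}

// Collector mirrors the shape of the prometheus.Collector interface.
type Collector interface {
	Describe(chan<- *Desc)
	Collect(chan<- Metric)
}

// nodeCollector is a hypothetical collector exposing one node-level value.
type nodeCollector struct {
	desc   *Desc
	joules func() float64 // reads the current value from shared stats
}

// Describe sends the collector's only descriptor and returns.
func (c *nodeCollector) Describe(ch chan<- *Desc) { ch <- c.desc }

// Collect sends the current metric value and returns.
func (c *nodeCollector) Collect(ch chan<- Metric) {
	ch <- Metric{Desc: c.desc, Value: c.joules()}
}

// gather plays the role of the registry: it calls Collect in a goroutine
// and drains the channel until Collect has returned.
func gather(c Collector) []Metric {
	ch := make(chan Metric)
	go func() {
		c.Collect(ch)
		close(ch)
	}()
	var out []Metric
	for m := range ch {
		out = append(out, m)
	}
	return out
}

func main() {
	c := &nodeCollector{
		desc:   &Desc{Name: "example_node_joules_total", Help: "Illustrative counter."},
		joules: func() float64 { return 12.5 },
	}
	for _, m := range gather(c) {
		fmt.Println(m.Desc.Name, m.Value)
	}
}
```

Because `Collect` reads values through a function at scrape time, the collector can share its underlying stats with whatever component updates them, which matches the shared stats structures described above.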
163+
164+
## Power Model Estimator
165+
166+
The power model estimator estimates power usage from the low-level resource stats.
167+
168+
> **_NOTE_**: For more details, please see the [power estimation documentation][3].
169+
170+
[1]: https://sustainable-computing.io/usage/deep_dive/#collecting-system-power-consumption-vms-versus-bms
171+
[2]: https://sustainable-computing.io/usage/deep_dive/#ratio-power-model-explained
172+
[3]: https://sustainable-computing.io/kepler_model_server/power_estimation/
