Skip to content

feat(device/cpu): aggregate multi-socket zones into single zone #2183

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

sthaha
Copy link
Collaborator

@sthaha sthaha commented Jun 23, 2025

This commit implements AggregatedZone to consolidate multiple EnergyZones with same name (e.g., package zones in multi-socket systems) into single zone. The aggregation also handles counter wrapping and is transparent to the caller. This enables power attribution across multi-socket systems while maintaining compatibility with single-socket deployments.

Additionally it also solves the zone label which used to have the index suffix as a hack.

Key changes:

  • New AggregatedZone type that sums energy from multiple zones of the same type
  • Proper handling of counter wrapping at MaxEnergy
  • Thread-safe energy aggregation with overflow protection

image

@sthaha sthaha force-pushed the feat-aggregate-zone branch from 00d830d to 5c778cb Compare June 23, 2025 15:35
@github-actions github-actions bot added the chore Routine tasks or maintenance label Jun 23, 2025
Copy link

codecov bot commented Jun 23, 2025

Codecov Report

Attention: Patch coverage is 96.36364% with 4 lines in your changes missing coverage. Please review.

Project coverage is 92.39%. Comparing base (2c428ac) to head (bdce7f3).
Report is 4 commits behind head on reboot.

Files with missing lines Patch % Lines
internal/device/energy_zone.go 94.73% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           reboot    #2183      +/-   ##
==========================================
- Coverage   92.41%   92.39%   -0.02%     
==========================================
  Files          37       38       +1     
  Lines        3809     3907      +98     
==========================================
+ Hits         3520     3610      +90     
- Misses        230      236       +6     
- Partials       59       61       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sthaha sthaha force-pushed the feat-aggregate-zone branch from 5c778cb to fc6c018 Compare June 26, 2025 05:09
@github-actions github-actions bot removed the chore Routine tasks or maintenance label Jun 26, 2025
@sthaha sthaha force-pushed the feat-aggregate-zone branch from fc6c018 to b5e6898 Compare June 26, 2025 05:22
@sthaha sthaha requested review from vimalk78 and vprashar2929 June 26, 2025 09:58
@sthaha sthaha marked this pull request as ready for review June 26, 2025 10:00
@github-actions github-actions bot added the chore Routine tasks or maintenance label Jul 1, 2025
This commit implements `AggregatedZone` to consolidate multiple  EnergyZones
with same name (e.g., package zones in multi-socket systems) into single
zone. The aggregation also handles  counter wrapping and is transparent to the
caller. This enables power attribution across multi-socket systems while
maintaining compatibility with single-socket deployments.

Additionally it also solves the zone label which used to have the index
suffix as a hack.

Key changes:
- New `AggregatedZone` type that sums energy from multiple zones of the same type
- Proper handling of counter wrapping at MaxEnergy
- Thread-safe energy aggregation with overflow protection

Signed-off-by: Sunil Thaha <[email protected]>
@sthaha sthaha force-pushed the feat-aggregate-zone branch from 79f25bd to bdce7f3 Compare July 1, 2025 04:29
@github-actions github-actions bot removed the chore Routine tasks or maintenance label Jul 1, 2025
@vprashar2929 vprashar2929 added the feat A new feature or enhancement label Jul 1, 2025
@vprashar2929
Copy link
Collaborator

Tested on multi-socket machine.

Available RAPL domains on machine:

root@aa4a13c3:/home/ghrunner/kepler/compose/dev# for d in /sys/devices/virtual/powercap/intel-rapl/intel-rapl:*; do   [ -f "$d/name" ] && echo "${d#/sys/devices/virtual/powercap/intel-rapl/}: $(cat "$d/name")";   for s in "$d"/intel-rapl:*; do     [ -f "$s/name" ] && echo "${s#/sys/devices/virtual/powercap/intel-rapl/}: $(cat "$s/name")";   done; done
intel-rapl:0: package-0
intel-rapl:0/intel-rapl:0:0: dram
intel-rapl:1: package-1
intel-rapl:1/intel-rapl:1:0: dram

Kepler creating aggregated zone:

kepler-dev-1  | time=2025-07-01T12:47:56.812Z level=DEBUG source=internal/device/rapl_sysfs_power_meter.go:177 msg="Created aggregated zone" service=rapl name=package zone_count=2 zones="[package-0 package-1]"
kepler-dev-1  | time=2025-07-01T12:47:56.812Z level=DEBUG source=internal/device/rapl_sysfs_power_meter.go:177 msg="Created aggregated zone" service=rapl name=dram zone_count=2 zones="[dram-0 dram-1]"

Comparing against node exporter

Prometheus-Time-Series-Collection-and-Processing-Server-07-01-2025_06_24_PM

@vprashar2929
Copy link
Collaborator

@sthaha Right now, if I want to filter only the package-0 zone, it's not possible?

@sthaha
Copy link
Collaborator Author

sthaha commented Jul 2, 2025

@sthaha Right now, if I want to filter only the package-0 zone, it's not possible?

Thats right ... but if you run a version before this you should be able to see the values.


}

// Multiple zones with same name - create AggregatedZone
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an option to not aggregate in some rare case?

Copy link
Collaborator Author

@sthaha sthaha Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question and I think we will need that when/if kepler supports measuring individual socket consumption and is able to attribute that to process running on those sockets.

Or are you talking about some other requirement that we may have at the moment?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some reason we cant think of now. the default can be to aggregate, but if needed a config change (unexposed perhaps) will stop the aggregation.

Comment on lines +98 to +99
az.mu.Lock()
defer az.mu.Unlock()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need mutex? sysfsRaplZone didn't need this

@vimalk78 vimalk78 merged commit 5ad91c3 into sustainable-computing-io:reboot Jul 3, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat A new feature or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants