
Prevent stacking of power requests #1023

Conversation

@daniel-zullo-frequenz (Contributor) commented Aug 5, 2024

The power distributing actor processes one power request at a time to prevent multiple requests for the same components from being sent to the microgrid API concurrently. Previously, this could cause the request channel receiver to fill up if power requests arrived faster than they could be processed. Even worse, requests could be processed late, causing unexpected behavior for applications setting power requests. Moreover, the actor blocked power requests for different sets of components from being processed while any request was in flight.

This patch ensures that the actor processes requests for different sets of components concurrently, one request at a time per set, and keeps track of the latest pending request when a request for the same set of components is already being processed. The pending request is overwritten by the latest received request for the same set of components, and the actor processes it once the in-flight request for those components finishes.
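
For illustration, a minimal self-contained sketch of this queueing behavior (all names here, such as Request, distribute and handle, are hypothetical stand-ins, not the actual SDK identifiers):

    import asyncio
    from dataclasses import dataclass

    @dataclass
    class Request:
        """Stand-in for the SDK's power request (illustrative only)."""
        component_ids: frozenset[int]
        power_watts: float

    pending: dict[frozenset[int], Request] = {}
    running: dict[frozenset[int], asyncio.Task[None]] = {}

    async def distribute(request: Request) -> None:
        """Placeholder for the call to the microgrid API."""
        await asyncio.sleep(0.1)

    def handle(request: Request) -> None:
        req_id = frozenset(request.component_ids)
        if req_id in running:
            # A request for these components is already in flight:
            # overwrite any pending request so only the latest survives.
            pending[req_id] = request
            return
        running[req_id] = asyncio.create_task(_process(req_id, request))

    async def _process(req_id: frozenset[int], request: Request) -> None:
        await distribute(request)
        del running[req_id]
        # If a newer request arrived while distributing, run it next.
        if next_request := pending.pop(req_id, None):
            running[req_id] = asyncio.create_task(_process(req_id, next_request))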

@github-actions bot added the part:tests Affects the unit, integration and performance (benchmarks) tests and part:actor Affects an actor or the actors' utilities (decorator, etc.) labels Aug 5, 2024
@daniel-zullo-frequenz force-pushed the fix/power-distributing-actor branch from e1cdce4 to 85d0409 on August 5, 2024 20:16
@github-actions bot added the part:docs Affects the documentation label Aug 5, 2024
@daniel-zullo-frequenz (Contributor, Author)

@shsms or @llucax, would you mind having an initial look at this draft?

@llucax (Contributor) left a comment

I didn't check the docs updates yet; I'm focusing on the code first.


def cleanup(task: asyncio.Task[None]) -> None:
    task_name = task.get_name()
    assert task_name in power_tasks, "Task not found in power_tasks"
Contributor

I would probably log an error here and return. I guess bringing the whole thing down doesn't make sense just because we can't clean something up. Or maybe this is just ignored, because it's called via add_done_callback() 🤔

In any case, I think logging an error is the right approach.
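
A rough sketch of what that could look like (assuming a module-level _logger, and that the callback's job is to drop the finished task from power_tasks):

    import asyncio
    import logging

    _logger = logging.getLogger(__name__)
    power_tasks: dict[str, asyncio.Task[None]] = {}

    def cleanup(task: asyncio.Task[None]) -> None:
        task_name = task.get_name()
        if task_name not in power_tasks:
            # Raising here wouldn't bring the actor down anyway (asyncio
            # routes callback exceptions to the event loop's exception
            # handler), so log the inconsistency and keep going.
            _logger.error("Task %s not found in power_tasks", task_name)
            return
        del power_tasks[task_name]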

Contributor (Author)

I was mainly using this to find programming errors; the logic is wrong if this ever happens. Anyway, we can also log it as an error.

Contributor (Author)

Updated

Contributor

Yeah, I agree the logic is wrong, but it doesn't seem like a very bad thing if it happens for some reason. It only means someone, somehow, removed the task, or it wasn't added in the first place, and we can keep working with what we have.

Contributor (Author)

Right, I saw your point earlier and updated it. The assertion was mainly useful for me while coding and debugging the fix.

power_tasks[task_name] = asyncio.create_task(
    self._component_manager.distribute_power(request),
    name=task_name,
)
power_tasks[task_name].add_done_callback(cleanup)
Contributor

This seems to actually be a perfect use case for a class I'm planning to add to core soon™️ as part of the revamp of background service (I'm splitting out the task-handling code to have a TaskGroup that is more useful for us for cases like this).

Contributor (Author)

Sounds great!

Comment on lines 148 to 154
if task_name in power_tasks:
    result = Error(
        request=request,
        msg="A request for the same components is being processed. Skipped.",
    )
    await self._result_sender.send(result)
    continue
Contributor

Wouldn't it be better to cancel the ongoing request and do the new one instead? Maybe not, because we could end up with no request ever finishing if we get a burst of requests?

Contributor

I agree that the latest one needs to be executed, but instead of cancelling the old one, I think we should keep the latest request and send it as soon as the running one is complete. That is, discard only the older requests and execute the latest.

@daniel-zullo-frequenz (Contributor, Author) commented Aug 6, 2024

> Maybe not, because we could end up with no request ever finishing if we get a burst of requests?

Indeed, my thought here was to avoid entering a loop of cancelling the in-progress request whenever a new one arrives. This is a bit tricky because we don't really know the state of the request currently being processed, so I decided to ignore/skip the new request. This is the point we all need to agree on. @shsms, any thoughts on this?

> Wouldn't it be better to cancel the ongoing request and do the new one instead?

According to the (outdated) documentation, the actor used to do this. The current state (without this patch) is that the actor just queues up requests.

Contributor (Author)

@shsms I hadn't seen your comment when I replied to this thread.

Contributor

Absolutely, I agree with the reasoning above on why cancelling doesn't make sense. We shouldn't cancel.

Contributor (Author)

Updated to keep track of the latest pending power request

Comment on lines 149 to 153
result = Error(
    request=request,
    msg="A request for the same components is being processed. Skipped.",
)
await self._result_sender.send(result)
Contributor

I think we shouldn't call this an Error. I guess we need a new result type, maybe Discarded.

But I would rather not send a message at all: keep the latest request, discard just the older ones, and execute the latest request as soon as the running one is complete.
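
For illustration, such a result type might look like this (a hypothetical shape, not an actual SDK type):

    from dataclasses import dataclass
    from typing import Any

    @dataclass(frozen=True)
    class Discarded:
        """Hypothetical result: the request was superseded before it ran."""
        request: Any  # the power request that was discarded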

Contributor (Author)

> But I would rather not send a message at all: keep the latest request, discard just the older ones, and execute the latest request as soon as the running one is complete.

Interesting. So you'd like to keep the current in-progress request (without cancelling it) and queue only the latest request that couldn't be processed yet because a similar one is in progress. It makes sense to me. @llucax, what do you think about this idea?

Contributor (Author)

This is no longer sending the Error; instead it keeps track of the latest received power request.

Contributor

The idea of finishing the current request and keeping only the latest one to execute next is great!

About not sending responses: I don't remember how responses work, but shouldn't you send a response for every request? If not, then great; if yes, I also agree Discarded is better than Error.

Contributor (Author)

@shsms, your input is needed here.

My understanding is that a response only needs to be sent if the request is processed (whether it succeeded, partially failed, errored, etc.). The current patch works as a request replacement if the previous request was still queued up.

Contributor

Yes. Ever since the power manager was introduced, there has not been a one-to-one mapping between user proposals and responses from the power distributor. So just a log message should be enough, saying that a new request was received before the previous one could start executing, so it was discarded. And I think we have something of this form.


@daniel-zullo-frequenz force-pushed the fix/power-distributing-actor branch 2 times, most recently from a24e784 to 657bb46 on August 7, 2024 08:55
@daniel-zullo-frequenz (Contributor, Author)

@llucax @shsms I haven't finished updating the documentation. Would you mind having another look, just focusing on the code?

@daniel-zullo-frequenz self-assigned this Aug 7, 2024
@daniel-zullo-frequenz added the priority:high Address this as soon as possible label Aug 7, 2024
@daniel-zullo-frequenz force-pushed the fix/power-distributing-actor branch from 77655e1 to 9d4ad72 on August 7, 2024 15:18
@daniel-zullo-frequenz marked this pull request as ready for review August 7, 2024 15:18
@daniel-zullo-frequenz requested a review from a team as a code owner August 7, 2024 15:18
@daniel-zullo-frequenz requested review from Marenz and removed request for a team August 7, 2024 15:18
@daniel-zullo-frequenz (Contributor, Author)

Updated documentation

@daniel-zullo-frequenz force-pushed the fix/power-distributing-actor branch from 9d4ad72 to 0a731af on August 7, 2024 15:31

@llucax (Contributor) commented Aug 8, 2024

Just as a side note, I find the code a bit convoluted, but I think we can address that in the future.

@daniel-zullo-frequenz (Contributor, Author)

Replaced the task name previously used as a key to identify processing and pending requests; the request ID, which is a frozenset of the component IDs, is now used instead. This should make the code flexible enough to address #1030
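
For illustration, the request ID is derived roughly like this (the field name is an assumption):

    # A hashable, order-independent ID: requests targeting the same set
    # of components map to the same key.
    req_id = frozenset(request.component_ids)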

@daniel-zullo-frequenz (Contributor, Author)

> Just as a side note, I find the code a bit convoluted, but I think we can address that in the future.

Ok, let me know if you have some ideas to improve it and I can at least create an issue.
My attempt to use the high-level asyncio functionality failed for our use case. I also tried to use the task set provided by the actor, but unfortunately it wasn't suitable for our use case either.

@llucax (Contributor) commented Aug 9, 2024

> Ok, let me know if you have some ideas to improve it and I can at least create an issue.

I actually tried to suggest something where we abstract the ongoing requests in an object that holds the task and the pending request (if any), but it was still complicated to implement like that because of how we need to keep track of the tasks. I think the new PersistentTaskGroup in core can help a lot here, but it is not there yet :)

@daniel-zullo-frequenz (Contributor, Author)

> I actually tried to suggest something where we abstract the ongoing requests in an object that holds the task and the pending request (if any), but it was still complicated to implement like that because of how we need to keep track of the tasks. I think the new PersistentTaskGroup in core can help a lot here, but it is not there yet :)

I had a quick look and made a comment there to check whether we can make it 100% suitable for the use case in this PR. Please be aware that I probably misunderstood the PersistentTaskGroup capabilities.

In any case I can create an issue in the SDK to refactor the PowerDistributingActor using PersistentTaskGroup.

@llucax (Contributor) commented Aug 9, 2024

I actually stayed on this issue, thinking about it again...

[.... one hour later, due to some of the changes you did with the request ID :D ...]

It was a good exercise to see how to apply the new class. The result is pretty different from what I had in mind before, and it doesn't look particularly simple either in retrospect, but I think it is more efficient, because we only keep one task per req_id as long as we have requests queued up, instead of starting and finishing a task for every pending request.

    async def _run(self) -> None:
        await self._component_manager.start()

        async with PersistentTaskGroup() as group:
            async for request in self._requests_receiver:
                self._handle_request(request, group)
                # Drain tasks that already finished, without blocking,
                # just to surface their exceptions.
                async for task in group.as_completed(timeout=0):
                    try:
                        task.result()
                    except Exception:  # pylint: disable=broad-except
                        _logger.exception("Failed power request task: %s", task.get_name())

    def _handle_request(self, request: ..., group: ...) -> None:
        req_id = self._get_request_id(request)

        if pending_request := self._pending_requests.get(req_id):
            _logger.debug(
                "Pending request: %s, overwritten with request: %s",
                pending_request,
                request,
            )
            # A task for this req_id is already running: just replace
            # the stored request so only the latest one gets executed.
            self._pending_requests[req_id] = request
            return

        self._pending_requests[req_id] = request
        group.create_task(
            self._distribute_power(req_id),
            name=f"{type(self).__name__}:{request}",
        )

    async def _distribute_power(self, req_id: ...) -> None:
        while request := self._pending_requests.get(req_id):
            await self._component_manager.distribute_power(request)
            # Clear the entry only if no newer request replaced it
            # while we were distributing.
            if self._pending_requests.get(req_id) is request:
                del self._pending_requests[req_id]

At some point I want to add a way to get as_completed() as a Receiver, so we can put it in a select(), like:

        async with PersistentTaskGroup() as group:
            completed_tasks = CompletedTaskReceiver(group)  # This needs to be added to channels
            async for selected in select(self._requests_receiver, completed_tasks):
                if selected_from(self._requests_receiver, selected):
                    self._handle_request(selected.message, group)
                elif selected_from(completed_tasks, selected):
                    task = selected.message
                    try:
                        task.result()
                    except Exception:  # pylint: disable=broad-except
                        _logger.exception("Failed power request task: %s", task.get_name())

But for now we can live with using it with timeout=0 so it doesn't block; we only ACK finished tasks after handling a power request, which should be OK, since I guess we receive requests often enough.

@llucax (Contributor) left a comment

LGTM. Will leave the final approval to @shsms as he had some comments too.

@daniel-zullo-frequenz (Contributor, Author)

@llucax the final result looks great! I'll create an issue referencing your draft to address it once PersistentTaskGroup is available

@shsms previously approved these changes Aug 9, 2024
@shsms (Contributor) left a comment

LGTM as well.

@daniel-zullo-frequenz (Contributor, Author)

Thanks, I'll rebase it onto the latest v1.x.x since there are some conflicts with the release notes.

daniel-zullo-frequenz and others added 4 commits August 9, 2024 12:06
The documentation was updated to reflect the current
state of the actor.

Signed-off-by: Daniel Zullo <[email protected]>
Add an entry to the release notes about preventing stacking
of power requests, to avoid delays in processing when requests
arrive faster than they can be processed.

Signed-off-by: Daniel Zullo <[email protected]>
The power distributing actor processes one power request at a time
to prevent multiple requests for the same components from being sent
to the microgrid API concurrently. Previously, this could lead to
the request channel receiver becoming full if power requests
arrived faster than they could be processed. Even worse, the
requests could be processed late, causing unexpected behavior
for applications setting power requests. Moreover, the actor was
blocking power requests with different sets of components from being
processed if there was any existing request.

This patch ensures that the actor processes one request at a time
for different sets of components and keeps track of the latest pending
request if there is an existing request with the same set of
components being processed. The pending request will be overwritten
by the latest received request with the same set of components,
and the actor will process it once the request with the same components
is done processing.

Signed-off-by: Daniel Zullo <[email protected]>
Changed log type for ignored requests to avoid spamming, as there are many reasons a request
might be ignored.

Co-authored-by: Leandro Lucarella <[email protected]>
Signed-off-by: daniel-zullo-frequenz <[email protected]>
@daniel-zullo-frequenz (Contributor, Author)

Just for the record, created #1032 to improve the actor code

@daniel-zullo-frequenz added this pull request to the merge queue Aug 9, 2024
Merged via the queue into frequenz-floss:v1.x.x with commit e2a792c Aug 9, 2024
18 checks passed
@daniel-zullo-frequenz deleted the fix/power-distributing-actor branch August 9, 2024 10:57
Labels
part:actor Affects an actor or the actors' utilities (decorator, etc.) part:docs Affects the documentation part:tests Affects the unit, integration and performance (benchmarks) tests priority:high Address this as soon as possible