Improved Kubernetes orchestrator pod caching #3719
Conversation
Documentation Link Check Results: ✅ Absolute links check passed
Mostly questions, looks good overall.
# In very rare cases, there can be multiple placeholder runs for
# the same deployment. By ordering by the orchestrator_run_id, we
# make sure that we use the placeholder run with the matching
# orchestrator_run_id if it exists, before falling back to the
# placeholder run without any orchestrator_run_id provided.
# Note: This works because both SQLite and MySQL consider NULLs
# to be lower than any other value. If we add support for other
# databases (e.g. PostgreSQL, which considers NULLs to be greater
# than any other value), we need to potentially adjust this.
.order_by(desc(PipelineRunSchema.orchestrator_run_id))
Good to know!
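As an aside, SQLAlchemy can make the NULL-ordering intent explicit instead of relying on per-database defaults. A minimal sketch, assuming `nullslast` fits into this query layer (note that SQLite only supports the `NULLS LAST` syntax from version 3.30 onwards, so this is an assumption about supported database versions):

```python
from sqlalchemy import desc, nullslast, select

from zenml.zen_stores.schemas import PipelineRunSchema

# Sketch: order so runs with a non-NULL orchestrator_run_id come first,
# with NULL rows explicitly pushed to the end. This makes the fallback
# behaviour portable to PostgreSQL, where NULLs would otherwise sort
# before non-NULL values under DESC ordering.
query = select(PipelineRunSchema).order_by(
    nullslast(desc(PipelineRunSchema.orchestrator_run_id))
)
```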
# As a last resort, we try to reuse the docstring/source code
# from the cached step run. This is part of the cache key
# computation, so it must be identical to the one we would have
# computed ourselves.
if request.source_code is None:
    request.source_code = cached_step_run.source_code
if request.docstring is None:
    request.docstring = cached_step_run.docstring
Can you explain this part a bit? This wasn't necessary before, and `_get_docstring_and_source_code` already attempts to fetch these from various sources (the step instance and the run template). Why is it absolutely necessary now that these be populated in the request, if it wasn't before?
This was never necessary (both of those fields are optional). It just now occurred to me that we can simply reuse the code/docstring of the cached step if we can't figure it out any other way, so I added it. Otherwise, all steps that got cached from the orchestrator pod would not have any code/docstring associated with them, because:
- the code to import the step instance most likely doesn't exist in the orchestrator pod
- there is no run template involved
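To make the fallback order concrete, here is a hypothetical sketch of the resolution chain described above. The helper name and attribute access are assumptions for illustration; the real logic lives in `_get_docstring_and_source_code` plus the snippet from this diff:

```python
import inspect
from typing import Optional, Tuple


def resolve_docstring_and_source_code(
    step_instance,    # may be None if the step code can't be imported
    run_template,     # may be None if no template is involved
    cached_step_run,  # the step run whose cached outputs are being reused
) -> Tuple[Optional[str], Optional[str]]:
    """Sketch of the fallback chain; names and attributes are assumed."""
    docstring: Optional[str] = None
    source_code: Optional[str] = None

    # 1. Preferred: derive both from the step instance, if importable.
    if step_instance is not None:
        docstring = step_instance.__doc__
        source_code = inspect.getsource(step_instance.entrypoint)

    # 2. Next: fall back to the run template, if one is involved.
    if run_template is not None:
        docstring = docstring or run_template.docstring
        source_code = source_code or run_template.source_code

    # 3. Last resort (new in this PR): reuse the cached step run's values.
    #    They were part of the same cache key computation, so they match
    #    what we would have computed ourselves.
    docstring = docstring or cached_step_run.docstring
    source_code = source_code or cached_step_run.source_code

    return docstring, source_code
```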
src/zenml/integrations/kubernetes/step_operators/kubernetes_step_operator.py (resolved)
step_runs = Client().list_run_steps(
    size=1,
    pipeline_run_id=pipeline_run.id,
    name=step_name,
)
This will scale badly, I know it. For 1000 steps, it will make 1000 calls to the API. I think we might need a new endpoint that fetches the status of all steps in one go, or maybe we should move this whole thing behind the REST API.
On second thought, this feels like something that should happen server-side when the pipeline run status is updated... the whole approach needs to be reconsidered.
My change here only improves upon the previous implementation, which fetched all steps of the pipeline run. My solution instead only fetches the steps whose pod is actually in a failed state. However, I definitely agree that we could batch this into a few requests instead of fetching each step separately. I'll change it.
I'm now fetching only relevant steps, and doing so in batched chunks.
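For reference, one way the batched fetching could look. This is a sketch only: it assumes ZenML's list filters accept a `oneof:` operator with a JSON-encoded list on the `name` field, which may not hold in every version; the helper name and chunk size are illustrative:

```python
import json
from typing import List
from uuid import UUID

from zenml.client import Client


def fetch_step_runs_in_chunks(
    pipeline_run_id: UUID, step_names: List[str], chunk_size: int = 50
):
    """Fetch step runs in batched chunks instead of one call per step."""
    step_runs = []
    for i in range(0, len(step_names), chunk_size):
        chunk = step_names[i : i + chunk_size]
        page = Client().list_run_steps(
            size=len(chunk),
            pipeline_run_id=pipeline_run_id,
            # Assumption: the `name` filter accepts the `oneof:` operator
            # with a JSON-encoded list of values.
            name=f"oneof:{json.dumps(chunk)}",
        )
        step_runs.extend(page.items)
    return step_runs
```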
Describe changes

This PR:
- Changes what we consider placeholder runs: Previously, a run without an `orchestrator_run_id` in the database was considered a placeholder run. This PR changes that behaviour and instead uses the status of a run to decide whether the run is a placeholder run (runs with state `initializing` are placeholder runs). This change was necessary so that the orchestrator pod can create a pipeline run (with an orchestrator run ID) for scheduled runs to enable the orchestrator pod caching.
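In code terms, the new check could look roughly like this. A sketch based on the description above; the helper name and exact enum member are assumptions:

```python
from zenml.enums import ExecutionStatus


def is_placeholder_run(run) -> bool:
    # New behaviour: a run is a placeholder while it is still initializing,
    # regardless of whether an orchestrator_run_id has already been set.
    # (Sketch; the helper name and enum member are assumed.)
    return run.status == ExecutionStatus.INITIALIZING
```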
Pre-requisites

Please ensure you have done the following:
- My branch is based on `develop` and the open PR is targeting `develop`. If your branch wasn't based on develop, read the contribution guide on rebasing your branch to develop.

Types of changes