Skip to content

Conversation

Danil-Grigorev
Copy link
Member

@Danil-Grigorev Danil-Grigorev commented Mar 28, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implement AdditionalControlPlaneLBPorts field for the additional inbound control-plane load balancer port configuration, allowing external access for user specified ports on the CP machine.

additionalAPIServerLBPorts:
- name: RKE2
   port: 9345

This change also improves UX while using CAPZ with RKE2 provider, by allowing to specify only additional security rules on the AzureCluster resource, and making existing security rule defaults for APIServer and SSH port more permissive and based on destination port selection.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5511

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

- Added `AdditionalAPIServerLBPorts` field for the additional inbound control-plane load balancer port configuration, allowing to access user specified ports on the CP machines.
- Improved `securityRules` selection for the `RKE2` provider. Defaulting now checks for ports `6443` and `22` in the existing list and appends them if missing, allowing users to specify only additional rules instead of the full set, when an additional rule is needed.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 28, 2025
@k8s-ci-robot k8s-ci-robot requested review from Jont828 and nojnhuh March 28, 2025 14:27
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 28, 2025
Copy link

codecov bot commented Mar 28, 2025

Codecov Report

Attention: Patch coverage is 90.16393% with 6 lines in your changes missing coverage. Please review.

Project coverage is 52.91%. Comparing base (1c6a5d3) to head (04c4b55).
Report is 43 commits behind head on main.

Files with missing lines Patch % Lines
api/v1beta1/types.go 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5525      +/-   ##
==========================================
+ Coverage   52.61%   52.91%   +0.29%     
==========================================
  Files         272      272              
  Lines       29485    29515      +30     
==========================================
+ Hits        15513    15617     +104     
+ Misses      13165    13086      -79     
- Partials      807      812       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 28, 2025
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from bec7326 to fbe11d7 Compare March 28, 2025 14:37
@@ -893,6 +893,20 @@ func (s SubnetSpec) IsIPv6Enabled() bool {
return false
}

// GetSecurityRuleByDestination returns security group rule, which matches provided destination ports.
func (s SubnetSpec) GetSecurityRuleByDestination(ports string) *SecurityRule {
if s.SecurityGroup.SecurityRules == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapping the below for loop in if s.SecurityGroup.SecurityRules != nil { is equivalent and allows us to remove the add'l return nil response (we don't need a specific nil return for this condition because the default return before exiting the function is nil)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed, thanks.

@@ -1020,9 +1020,18 @@ func (s *ClusterScope) SetControlPlaneSecurityRules() {
if !s.ControlPlaneEnabled() {
return
}
if s.ControlPlaneSubnet().SecurityGroup.SecurityRules == nil {

subnet := s.ControlPlaneSubnet()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this flow where are we adding any user-provided security rules? (for example if a user specifies TCP 9345)

or is that elsewhere and the purpose of this change is to filter out 22 and apiserver port if it's not included?

Copy link
Member Author

@Danil-Grigorev Danil-Grigorev Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flow performs only defaulting on top of the user-provided set of rules, which may not be empty. This way having a field populated does not always need to specify all allowed ports, only additional ones or overrides.

I’m currently validating how this works in tandem with ClusterClass definitions we have. The desired goal is to permit specifying registrationMethod: control-plane-endpoint and allow RKE2 to register on the allowed port. But ideally this should follow internal name resolution, so that the traffic wouldn’t go through external load balancer.

If it doesn’t work, maybe it would still require additional LB rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable to have the additional ports of the APIServer LB opened-up/have an opinion about.

Could you please elaborate on why the security rules should not be updated too ?

Copy link
Member Author

@Danil-Grigorev Danil-Grigorev Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, maybe I’m not fully understanding. I will try to give my thoughts on this:

I was attempting to describe the flow of how the rules are processed. We only apply defaults on top of what the user provides, so the user can still specify ports like TCP 9345 explicitly in security rules. We’re not overriding or stripping anything out, just adding fallback values when needed. LB rules were needed as well, since CP endpoint is external from the cluster perspective.

Current set of changes is what was required to make cluster healthy, and added e2e test displays that. My previous comment was a set of assumptions, or “paths to explore later” if it doesn’t work.

@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch 3 times, most recently from 452799e to df2c355 Compare April 1, 2025 11:16
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from df2c355 to cf4c84b Compare April 1, 2025 15:41
@Danil-Grigorev
Copy link
Member Author

/test ?

@k8s-ci-robot
Copy link
Contributor

@Danil-Grigorev: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-build
/test pull-cluster-api-provider-azure-ci-entrypoint
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-test
/test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-azure-apidiff
/test pull-cluster-api-provider-azure-apiserver-ilb
/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-conformance
/test pull-cluster-api-provider-azure-conformance-custom-builds
/test pull-cluster-api-provider-azure-conformance-dual-stack-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-ipv6-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-workload-upgrade
/test pull-cluster-api-provider-azure-load-test-custom-builds
/test pull-cluster-api-provider-azure-windows-custom-builds
/test pull-cluster-api-provider-azure-windows-with-ci-artifacts

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-azure-apidiff
pull-cluster-api-provider-azure-apiversion-upgrade
pull-cluster-api-provider-azure-build
pull-cluster-api-provider-azure-ci-entrypoint
pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-e2e-aks
pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 1, 2025
@Danil-Grigorev Danil-Grigorev changed the title ✨ Allow more permissive extensibility for securityRules ✨ Allow more permissive extensibility for securityRules and additional CP LoadBalancers Apr 1, 2025
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch 4 times, most recently from d744062 to 0d5e30f Compare April 2, 2025 11:55
Copy link
Member

@nawazkh nawazkh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great PR, a lot of work has been put in here. Thank you @Danil-Grigorev !
I am glad that the optional RKE2 test is working!

If I could request, could you please add unit test cases to cluster_test.go and loadbalancers/spec_test.go to support your changes.
I also added some comments below, and would love to hear your thoughts!

@@ -111,9 +111,22 @@ type NetworkSpec struct {
// +optional
ControlPlaneOutboundLB *LoadBalancerSpec `json:"controlPlaneOutboundLB,omitempty"`

// AdditionalControlPlaneLBPorts is the configuration for the additional inbound control-plane load balancer ports
// +optional
AdditionalControlPlaneLBPorts []LoadBalancerPort `json:"additionalControlPlaneLBPorts,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of renaming AdditionalControlPlaneLBPorts to AdditionalAPIServerLBPorts ?

CAPZ has APIServerLB, NodeOutboundLB and ControlplaneOutboundLB. Since the additional ports will be used in APIServerLB, it is better, I believe, to have the additional ports renamed to AdditionalAPIServerLBPorts

@@ -893,6 +906,17 @@ func (s SubnetSpec) IsIPv6Enabled() bool {
return false
}

// GetSecurityRuleByDestination returns security group rule, which matches provided destination ports.
func (s SubnetSpec) GetSecurityRuleByDestination(ports string) *SecurityRule {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the current implementation does not seem to be iterating over a comma-separated list of ports ?

Comment on lines 78 to 81

// AdditionalControlPlaneLBPorts is the configuration for the additional inbound control-plane load balancer ports
// +optional
AdditionalControlPlaneLBPorts []LoadBalancerPort `json:"additionalControlPlaneLBPorts,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@@ -1020,9 +1020,18 @@ func (s *ClusterScope) SetControlPlaneSecurityRules() {
if !s.ControlPlaneEnabled() {
return
}
if s.ControlPlaneSubnet().SecurityGroup.SecurityRules == nil {

subnet := s.ControlPlaneSubnet()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable to have the additional ports of the APIServer LB opened-up/have an opinion about.

Could you please elaborate on why the security rules should not be updated too ?

Comment on lines 1033 to 1035
if subnet.SecurityGroup.SecurityRules == nil {
s.AzureCluster.Spec.NetworkSpec.UpdateControlPlaneSubnet(subnet)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add more context on setting s.AzureCluster.Spec.NetworkSpec.UpdateControlPlaneSubnet(subnet) three times in this function ?
Could we not update it once in the end of the function ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but by equivalent aggregation of ifs evaluating to true. Since the method is only setting struct default values, I assumed the cost is negligible, for the improved readability and confidence of data being persisted.

}
count := 0
for _, machine := range machineList.Items {
if machine.Status.NodeRef != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have a more granular check here instead of checking for not nil ?

Copy link
Member Author

@Danil-Grigorev Danil-Grigorev Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to Ready condition, which:

is true if the Machine's deletionTimestamp is not set, Machine's BootstrapConfigReady, InfrastructureReady,
	// NodeHealthy and HealthCheckSucceeded

But the main goal of the check is to verify that CP nodes joined cluster in the expected quantity. Previous nil check fulfilled that too…

@@ -141,22 +146,22 @@ spec:
name: my-subnet-cp-nsg
securityRules:
- name: "allow_ssh"
description: "allow SSH"
description: "Deny SSH"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would benefit if we show an overriding scenario the example you added that adds 9345 to the APIServer LB.

Updating the rule to Deny could throw off users in the first glance.

@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from 0091001 to 34ffcbb Compare April 10, 2025 16:21
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from 34ffcbb to b408363 Compare April 10, 2025 16:29
@Danil-Grigorev
Copy link
Member Author

/retest

@nawazkh
Copy link
Member

nawazkh commented Apr 10, 2025

/test pull-cluster-api-provider-azure-e2e

Signed-off-by: Danil-Grigorev <[email protected]>
@Danil-Grigorev
Copy link
Member Author

/test pull-cluster-api-provider-azure-e2e-optional

@Danil-Grigorev
Copy link
Member Author

/test pull-cluster-api-provider-azure-e2e-optional

Copy link
Member

@nawazkh nawazkh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you for putting this together, @Danil-Grigorev , Kudos!
/lgtm
/approve

Comment on lines +246 to +249
name string
cluster *clusterv1.Cluster
azureCluster *infrav1.AzureCluster
expectedRuleCount int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reformatting this! 🙌🏼

@@ -1108,6 +1109,59 @@ var _ = Describe("Workload cluster creation", func() {
})
})

Context("Creating RKE2 clusters using clusterclass [OPTIONAL]", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ed5b341a8c6e354b764340bd8828be88a2b55f1b

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nawazkh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2025
@nawazkh
Copy link
Member

nawazkh commented Apr 17, 2025

Could you please squash your commits ?
Hold for squash.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2025
@Danil-Grigorev
Copy link
Member Author

Could you please squash your commits ? Hold for squash. /hold

/label tide/merge-method-squash

Sure, would that work @nawazkh, or do you want a manual squash?

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 18, 2025
@nawazkh
Copy link
Member

nawazkh commented Apr 18, 2025

We usually do a manual squash, but this should work too!
Thank you @Danil-Grigorev
/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 18, 2025
@k8s-ci-robot k8s-ci-robot merged commit 65c2014 into kubernetes-sigs:main Apr 18, 2025
32 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Apr 18, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in CAPZ Planning Apr 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Implement arbitrary additional ingress rules to support CAPRKE2
5 participants