Fix SecurityGroup availability status by counting security group rules #651

gndrmnn · 2026-01-21T13:46:48Z

Consider this minimal example:

---
apiVersion: openstack.k-orc.cloud/v1alpha1
kind: SecurityGroup
metadata:
  name: security-group
spec:
  cloudCredentialsRef:
    cloudName: openstack
    secretName: cloud-config
  resource:
    name: security-group
    description: k-orc generated
    rules:
      - description: rule k-orc
        direction: ingress
        protocol: icmp
        # ethertype: IPv4
        # ethertype: IPv6
        remoteIPPrefix: "0.0.0.0/0"
        portRange:
          min: 1
          max: 42

If we select ethertype is IPv4, everything is fine and the security group rules as well as the security group is created successfully.

However, if we select ethertype is IPv6 the remoteIPPrefix field is (expectedly) rejected by openstack with a http error. The security group is already created at this point and, despite the rule error, enters a true availability state.

With this commit SecurityGroups now count their specified rules and compare them to the number of rules in their ORC status and the openstack resource. As the security group rules are only ever part of one security group, this should be enough to reliably determine, if all rules have been successfully created.

As the mismatch between remoteIPPrefix and ethertype is apparently to be solved "client-side" by ORC in the future, I am not sure, if and how to add a test for this? I think the ultimate reason why a security group rule is rejected by openstack in the future is not known yet, right?

Fixes #120

mandre

Comparing the length of security group rules between the spec and the status indeed seems like the right approach for deciding whether the security group should be considered as available.

However I wonder if we should not reschedule a reconcile similarly to what we do in server or volume for example. That's very likely the reason why the flamingo job failed.

gndrmnn · 2026-01-22T10:15:37Z

However I wonder if we should not reschedule a reconcile similarly to what we do in server or volume for example. That's very likely the reason why the flamingo job failed.

Got it.

Do you want me to add tests for this change? If so, is it possible to simulate an openstack error using kuttl?

mandre · 2026-01-27T10:19:39Z

internal/controllers/securitygroup/status.go

+			return metav1.ConditionFalse, progress.WaitingOnOpenStack(progress.WaitingOnReady, securityGroupAvailablePollingPeriod)
+		}
+
+		if len(resourceSpec.Rules) != len(resourceStatus.Rules) {


We should not be comparing the number of rules in the spec and in the status, because they can be different: the security group inherits the default security group rules.
Instead, we should check that the desired state (the rules in the spec) accurately represents the observed state (the rules in the status) AND account for the default security rules. The actuator should already do exactly that for the rules update.

We should not be comparing the number of rules in the spec and in the status, because they can be different: the security group inherits the default security group rules.

I am not sure I fully understand. So every security group implicitly inherits the rules of the security group default, right? But this is not shown explicitly in openstack security group show, or the Horizon Web UI. Do I understand this correctly?

But it seems like ORC also does not currently use the (inherited) default rules to populate the status of the security group, so that should not actually be a problem. Unless this is unwanted behaviour and ORC is supposed to do that in the future, in which case the change in this PR will indeed break.

Instead, we should check that the desired state (the rules in the spec) accurately represents the observed state (the rules in the status) AND account for the default security rules. The actuator should already do exactly that for the rules update.

Yes, my understanding is that the actuator will take care of making sure the rules are matching from a content perspective. That being said, the rules should only become part of the security group resource status if and when they match. So to update the Available status it should be enough to make sure the number of rules is the same, without duplicating the validation logic of the rule's content, or am I not seeing the problem right now?

OK, this confused me as well so I ran some tests:

when creating a security group, we indeed inherit the default security group rules, this is just how OpenStack works.

however, the first thing the controller does is to delete the default rules, meaning that the user never sees these rules in the status.

also, it appears that OpenStack is not trying to add the rules back, which is good.

I believe this is the expected behavior, and makes the experience with ORC deterministic.

This is effectively a consequence of #174, and this comment in the actuator should now be removed because it's outdated.

All that to say that in most cases, we can indeed check if the resource is fully provisioned by comparing the number of rules.

OK, this confused me as well so I ran some tests:

when creating a security group, we indeed inherit the default security group rules, this is just how OpenStack works.

however, the first thing the controller does is to delete the default rules, meaning that the user never sees these rules in the status.

also, it appears that OpenStack is not trying to add the rules back, which is good.

I believe this is the expected behavior, and makes the experience with ORC deterministic.

Got it. At first I thought you meant like some sort of implicit inheritence

This is effectively a consequence of #174, and this comment in the actuator should now be removed because it's outdated.

I amended the commit to remove the comment

Thanks! I'm still a unsure about the ResourceAvailableStatus() changes, but we'll definitely want to get rid of the outdated and confusing comment.

mandre · 2026-01-27T10:25:04Z

internal/controllers/securitygroup/status.go

 		}
 	}

-	// SecurityGroup is available as soon as it exists


On second thoughts, I wonder if this is really what we want: the security group is indeed available right away, meaning we can do operations on it. The Available condition is accurate.

The Progressing condition, on the other hand, should reflect that we still have security group rules being provisioned, and that the desired state does not yet match the observed state.

Issue #120 states that the current way to represent the Available status in security groups is undesired. As far as I understand, security group rules can only ever be part of a single security group. And they can not be cross-referenced. So I would agree with #120 in that a security group is only "available and matches the spec" when the security groups are actually created.

However, this design decision has to be decided by the project according to it's own goals.

I agree, this hinges on understanding what 'Available' most usefully means in the API. We certainly can proceed to create resources as soon as the SecurityGroup exists, but is that creating risk?

Another thought: we create a SecurityGroup, it's fully reconciled and available. We then modify its Rules. Should it still be Available in the period while the new spec has not been applied?

Is it sufficient that a user should wait for 'Available=True AND Progressing=False'?

I don't have the answers, unfortunately. I recommend writing down all the use cases and edge conditions you can think of and deciding what the correct status should be in all cases. If it can't be defined, does that mean we need to add a new Condition?

The more I think about it, the more I believe Available=True should mean we can operate on the resource. Granted, securitygroup is a bit special because we've convoluted 2 resources into 1 controller (security groups and security group rules).

My current thinking is that:

the security group itself is available for use right away upon creation. Neutron did not bother adding a status field for it.

the security group rules, in our representation in ORC, are attributes to the security group. Their creation/update are reflected by the Progressing condition.

Likewise, I'm of the opinion that the Available condition of the security group should not change when the rules are being updated. This can again be tracked via the Progressing condition.

This is what we say in the docs:

Every ORC resource reports its status through two conditions: Available (whether the resource is ready for use) and Progressing (whether ORC is still working on it).

And also, about the ResourceAvailableStatus() method:

This sets the value of the Available condition. This should not return true unless the resource is completely ready to be used.

We should clarify what "ready to be used" means. I'll check how the other controllers implement the ResourceAvailableStatus() method.

Looking at the existing controllers, they're all returning Available as soon as they're created, except for the resources that have a status field. For these we're looking at the following conditions:

Floating IP, active and down statuses

Image, active status

Network, active status

Port, active and down status

Router, active status

Server, active status

Volume, active and in-use statuses

I agree with the spirit of this PR, but I think a better implementation would be to add a third condition rather than modifying Available.

The problem is Available=True, Progressing=True doesn't tell you what it's progressing on. Maybe you don't care if it's updating a tag, but you do care if a security rule failed. With the current situation tag updates a indistinguishable from rule creation error and doesn't provide enough info to the user:

Scenario Available Progressing User can distinguish?

SG created, rules pending True True No - could be tags updating

SG created, rule failed True True No - looks like still progressing

SG created, tag update in progress True True No - same as above

I would propose to add a third condition for subresources. A SubResourcesReady condition for controllers with sub-resources (SecurityGroup→Rules, Router→Interfaces, etc.):

Available: Main resource exists and can be operated on

Progressing: Controller is actively reconciling something

SubResourcesReady: All sub-resources (rules, interfaces, attachments) are in desired state

This gives users clear semantics:

Scenario Available Progressing SubResourcesReady

SG created, rules pending True True False

SG created, rule failed True False False (reason: rule error)

SG fully ready True False True

Tag update in progress True True True

This pattern could apply to other controllers too making it easy to understand sub resource failures and act on them, providing a better user experience. WDYT? @mandre @mdbooth

SecurityGroups now count their specified rules and compare them to the number of rules in their ORC status and the openstack resource. As the security group rules are only ever part of one security group, this should be enough to reliably determine if all rules have been successfully created. On-behalf-of: SAP nils.gondermann@sap.com

github-actions bot added the semver:patch No API change label Jan 21, 2026

mandre reviewed Jan 21, 2026

View reviewed changes

gndrmnn force-pushed the fix_issue_120 branch from 496537f to e0b8b29 Compare January 22, 2026 10:10

mandre reviewed Jan 27, 2026

View reviewed changes

gndrmnn force-pushed the fix_issue_120 branch from e0b8b29 to a24560b Compare January 27, 2026 14:35

Scenario	Available	Progressing	User can distinguish?
SG created, rules pending	True	True	No - could be tags updating
SG created, rule failed	True	True	No - looks like still progressing
SG created, tag update in progress	True	True	No - same as above

Fix SecurityGroup availability status by counting security group rules #651

Are you sure you want to change the base?

Fix SecurityGroup availability status by counting security group rules #651

Uh oh!

Conversation

gndrmnn commented Jan 21, 2026

Uh oh!

mandre left a comment

Choose a reason for hiding this comment

Uh oh!

gndrmnn commented Jan 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mandre Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eshulman2 Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mandre Jan 27, 2026 •

edited

Loading

eshulman2 Jan 27, 2026 •

edited

Loading