75d387c
slack-bot: add optional Activity Type to Slack modals and Jira filing
deepsm007 Apr 20, 2026
25ed6e5
remove the vendor folder
droslean Apr 16, 2026
8d45148
Merge pull request #5109 from droslean/vendors-folder-rip
openshift-merge-bot[bot] Apr 22, 2026
3db905f
prowgen: allow ci-operator config to skip operator presubmits
Prucek Apr 22, 2026
71a44ec
added Claude skill to maange vault group membership
pruan-rht Apr 22, 2026
36c8fd9
promote: resolve QCI digest post-mirror via oc image info to pin spec…
deepsm007 Apr 22, 2026
7032dee
Merge pull request #5115 from deepsm007/slack-modals-activity-type-de…
openshift-merge-bot[bot] Apr 22, 2026
72ca748
Store ephemeral cluster credentials in a Secret instead of the status
amisstea Apr 21, 2026
4e20444
add cluster pool troubleshooting skill for hosted-mgmt
deepsm007 Apr 22, 2026
4a37ae5
Merge pull request #5122 from pruan-rht/add_vault_group_member_skill
openshift-merge-bot[bot] Apr 22, 2026
e280879
feat(slack-bot): automate support-request Jira workflow
jmguzik Apr 17, 2026
be5724b
Merge pull request #5123 from deepsm007/fix/quay-is-digest-from-status
openshift-merge-bot[bot] Apr 23, 2026
9b7393d
Post merge fixes for activity type
jmguzik Apr 23, 2026
e1c7f20
Merge pull request #5125 from deepsm007/cluster-pool-troubleshooting-…
openshift-merge-bot[bot] Apr 23, 2026
37cda1a
Merge pull request #5121 from Prucek/prowgen-skip-presubmits
openshift-merge-bot[bot] Apr 23, 2026
260ebf2
Merge pull request #5111 from jmguzik/jira-ops
openshift-merge-bot[bot] Apr 23, 2026
3042633
prowgen: allow ci-operator config to enable CSI secrets store
Prucek Apr 23, 2026
5bb1e4f
Merge pull request #5120 from Prucek/prowgen-csi
openshift-merge-bot[bot] Apr 23, 2026
0dc0fdf
Merge pull request #5124 from amisstea/KFLUXVNGD-899
openshift-merge-bot[bot] Apr 23, 2026
ece2da2
image-detector: support /image commit directive to force image builds
jmguzik Apr 10, 2026
191feee
build(deps): bump sigs.k8s.io/prow to 5aca44b7f08f (#5059)
Prucek Apr 23, 2026
5b997bd
Added an option to set a body for a PR
Apr 24, 2026
f3270da
Merge pull request #5131 from hector-vido/dispatcher-pr-body
openshift-merge-bot[bot] Apr 24, 2026
3619cf1
prowgen: allow ci-operator config to set private and expose
Prucek Apr 23, 2026
8e2bfbf
Merge pull request #5127 from Prucek/prowgen-private-expose
openshift-merge-bot[bot] Apr 24, 2026
c7c8d03
Add CodeRabbit review configuration
Prucek Apr 24, 2026
def1c62
Merge pull request #5132 from Prucek/coderabbit-config
openshift-merge-bot[bot] Apr 24, 2026
751ad93
Merge pull request #5105 from jmguzik/image-detector
openshift-merge-robot Apr 24, 2026
dec125d
promote: skip digest-resolve for non-release namespaces
deepsm007 Apr 24, 2026
a636842
cmd/ci-secret-bootstrap: Log Secret change keys
wking Apr 23, 2026
9fa14a6
Merge pull request #5133 from deepsm007/promote-skip-ci-ns-digest-res…
openshift-merge-bot[bot] Apr 25, 2026
d40783c
Add multiple underscore replacements
psalajova Apr 27, 2026
2a2401c
Merge pull request #5135 from psalajova/replace-multiple-underscores-gsm
openshift-merge-bot[bot] Apr 27, 2026
dbedc11
feat(ecc): terminate ci-operator gracefully
danilo-gemoli Apr 27, 2026
4fc23ac
Merge pull request #5126 from danilo-gemoli/feat/ephemeralcluster/ci-…
openshift-merge-bot[bot] Apr 27, 2026
77434bf
Merge pull request #5130 from wking/log-secret-change-keys
openshift-merge-bot[bot] Apr 27, 2026
14f099c
Remove deprecated tools and image
jmguzik Apr 28, 2026
b84aa82
prowgen: allow ci-operator config to set max_concurrency
Prucek Apr 28, 2026
5ad440b
Merge pull request #5140 from Prucek/prowgen-max-concurrency
openshift-merge-bot[bot] Apr 28, 2026
353a4a0
feat(prowgen): utilize sparse checkout
Prucek Apr 28, 2026
2ae3e95
promotion-quay: digest via sed; KUBECACHEDIR for promotion pod
deepsm007 Apr 28, 2026
a3598d6
Merge pull request #4990 from Prucek/prowgen-sparse
openshift-merge-bot[bot] Apr 29, 2026
7e81b78
fix(prowgen): only set oauth_token_secret when sparse checkout is active
Prucek Apr 29, 2026
fcadc6b
Merge pull request #5143 from Prucek/prowgen-sparse-fix
openshift-merge-robot Apr 29, 2026
d6675f7
Merge pull request #5138 from jmguzik/removal
openshift-merge-robot Apr 29, 2026
633bf07
Revert "prowgen: utilize sparse checkout"
Prucek Apr 29, 2026
3d0c952
Merge pull request #5145 from Prucek/revert-sparse-checkout
openshift-merge-robot Apr 29, 2026
5fd6f44
config-brancher: skip derived release-* configs for etcd on main/master
jmguzik Apr 29, 2026
b94d869
Merge pull request #5142 from deepsm007/promotion-quay-jsonpath-digest
openshift-merge-bot[bot] Apr 29, 2026
dbe2e97
Merge pull request #5144 from jmguzik/config-brancher-etcd
openshift-merge-robot Apr 30, 2026
fc2c7e9
feat(api): konflux cluster profile ownership
danilo-gemoli Apr 30, 2026
70184f0
prowgen: allow per-test slack reporter config in ci-operator config
Prucek Apr 30, 2026
f7be39b
Always include kubeAdminPassword in EphemeralCluster credentials Secret
amisstea Apr 23, 2026
7a666b5
Add dptp collection
psalajova Apr 27, 2026
c7267a5
Merge pull request #5148 from danilo-gemoli/feat/ecc/konflux-cluster-…
openshift-merge-bot[bot] Apr 30, 2026
596c82d
ci-operator-config-mirror: set prowgen.private for openshift-priv mir…
deepsm007 Apr 30, 2026
8eebb60
Merge pull request #5150 from deepsm007/fix/mirror-prowgen-private
openshift-merge-bot[bot] May 1, 2026
0085d9a
Forward SubSteps() through wrapper steps
mdbooth May 1, 2026
2df81d8
Merge pull request #5129 from amisstea/KFLUXVNGD-899
openshift-merge-bot[bot] May 2, 2026
c5a07ea
NO-JIRA: add permanent exception for 5.0 periodics.yaml
neisw May 3, 2026
c710215
NO-JIRA: address feedback
neisw May 3, 2026
876888f
Merge pull request #5152 from neisw/release-5.0-periodics
openshift-merge-bot[bot] May 4, 2026
0966091
Merge pull request #5136 from psalajova/gsm-config-add-dptp-collection
openshift-merge-bot[bot] May 4, 2026
cd5e241
promotion-quay: retry digest-tag when QCI digest moves after mirror
deepsm007 May 4, 2026
9f3563c
Merge pull request #5154 from deepsm007/promotion-quay-digest-retries
openshift-merge-bot[bot] May 4, 2026
9617f36
feat(check-cluster-profiles-config): add normalize option
danilo-gemoli May 5, 2026
483a4a2
Merge pull request #5153 from danilo-gemoli/feat/check-cluster-profil…
openshift-merge-bot[bot] May 5, 2026
209c761
fix(diffs): ignore unexported prowconfig.Retry fields in cmp.Diff
deepsm007 May 5, 2026
236c68d
chore: bump sigs.k8s.io/prow to 92d6574a4509
deepsm007 May 5, 2026
333b70e
Merge pull request #5156 from deepsm007/fix-rehearse-retry-panic
openshift-merge-bot[bot] May 5, 2026
4468929
Merge pull request #5151 from openshift-cloud-team/substep-reporting
openshift-merge-bot[bot] May 5, 2026
3d1ca73
chore(cluster-profiles): deprecate old config schema
danilo-gemoli May 6, 2026
dfef785
chore: make generate
danilo-gemoli May 6, 2026
6f19cfa
Enable possibility to use capabilities in images jobs
jmguzik Jul 15, 2025
b1cad87
Merge pull request #5158 from danilo-gemoli/chore/cluster-profile/dep…
openshift-merge-bot[bot] May 6, 2026
3b79e40
coderabbit: add high-level summary instructions
Prucek May 6, 2026
c0b53c5
Fix issue DPTP-4756 add STS hub-account role chaining
bear-redhat Apr 8, 2026
1c94b7d
fix(check-cluster-profiles-config): diff with pointers
danilo-gemoli May 6, 2026
7a79153
Merge pull request #5162 from danilo-gemoli/fix/check-cluster-profile…
openshift-merge-robot May 6, 2026
81f171f
Merge pull request #4964 from jmguzik/cherry-pick-approved
openshift-merge-bot[bot] May 7, 2026
1aba780
Merge pull request #5149 from Prucek/prowgen-slack-reporter
openshift-merge-bot[bot] May 7, 2026
651357b
Merge pull request #5159 from Prucek/coderabbit-summary
openshift-merge-bot[bot] May 7, 2026
0e28254
Add new profile: azure-perfscale-qe
mehabhalodiya May 7, 2026
325d36e
pj-rehearse: add concurrency control, changed-files prefilter, and dr…
jmguzik May 7, 2026
53463e8
Merge pull request #5164 from mehabhalodiya/add-azure-perfscale-qe
openshift-merge-bot[bot] May 7, 2026
582d460
Merge pull request #5165 from jmguzik/pj-reharse-change
openshift-merge-bot[bot] May 7, 2026
b2000cb
NO-JIRA: remove release-release template
neisw May 7, 2026
976745e
feat(ci-operator): integrate byoip with cluster profile sets
danilo-gemoli May 8, 2026
3dd1e85
Merge pull request #5169 from danilo-gemoli/fix/byoip/integrate-with-…
openshift-merge-bot[bot] May 8, 2026
56f03f2
payload-testing: ignore non-created issue comment events
petr-muller May 9, 2026
bf87206
multi-pr-prow-plugin: ignore non-created issue comment events
petr-muller May 9, 2026
a9a5e09
backport-verifier: ignore non-created issue comment events
petr-muller May 9, 2026
8038421
pipeline-controller: ignore non-created issue comment events
petr-muller May 9, 2026
e47f9fa
Merge pull request #5168 from neisw/job-name-generator-deprecate-peri…
openshift-merge-bot[bot] May 10, 2026
1afea1e
prowgen: support inline slack reporter config for images jobs (#5166)
Prucek May 11, 2026
df3d50e
Merge pull request #5170 from petr-muller/pj-rehearse-edits
openshift-merge-bot[bot] May 11, 2026
74d595c
resolve digest-only tags and clear reference on stable import
deepsm007 May 11, 2026
6ebae95
Merge pull request #5172 from deepsm007/release-import-stable-local-r…
openshift-merge-bot[bot] May 12, 2026
c1b5e76
feat(ci-operator-checkconfig): add cluster profile sets allowlist
danilo-gemoli May 12, 2026
aeb8c9d
Merge pull request #5171 from danilo-gemoli/feat/ci-operator-checkcon…
openshift-merge-bot[bot] May 12, 2026
a78851f
chore: remove cluster-init from check-breaking-changes
danilo-gemoli May 12, 2026
9a4b9c7
Merge pull request #5176 from danilo-gemoli/chore/remove-cluster-init…
openshift-merge-bot[bot] May 12, 2026
8b31a56
Merge pull request #5094 from bear-redhat/issue/DPTP-4756
openshift-merge-bot[bot] May 12, 2026
93a1b9e
Utilize clonerefs sparse checkout (#5161)
Prucek May 13, 2026
a14f0f7
prowgen: remove .config.prowgen support
Prucek May 7, 2026
2b4f143
prowgen: remove ProwgenInfo wrapper, use Metadata directly
Prucek May 15, 2026
8506637
fix: gofmt alignment in jobbase_test.go
Prucek May 15, 2026
273 changes: 273 additions & 0 deletions .claude/.claude-plugin/skills/Troubleshooting/Cluster-pools/SKILL.md
@@ -0,0 +1,273 @@
# Skill: Cluster Pool Troubleshooting on hosted-mgmt

## Purpose

Debug Hive-managed cluster pools running on the `hosted-mgmt` OpenShift cluster.
All commands use `oc --context hosted-mgmt` — never assume the current context is correct.

---

## Step 0 — Identify the pool and its namespace

Pool definitions live in the `release` repo under:

```
clusters/hosted-mgmt/hive/pools/<owner-namespace>/
```

The `namespace:` field in each `*_clusterpool.yaml` is the namespace on hosted-mgmt where
Hive creates ClusterDeployments, provision pods, and related resources.

Common pool namespaces:

| Owner directory            | Namespace on hosted-mgmt            |
|----------------------------|-------------------------------------|
| `openshift-ci/`            | `ci-cluster-pool`                   |
| `cvp/`                     | `cvp-cluster-pool`                  |
| `serverless/`              | `serverless-cluster-pool`           |
| `rhdh/`                    | `rhdh-cluster-pool`                 |
| `konflux/`                 | `konflux-cluster-pool`              |
| `openshift-observability/` | `obs-cluster-pool`                  |
| `rh-openshift-ecosystem/`  | `rhoe-cluster-pool`                 |
| *(others)*                 | Check the YAML's `namespace:` field |

To find the namespace for an unknown pool, run from the `release` repo root:

```bash
grep -r "namespace:" clusters/hosted-mgmt/hive/pools/<owner-dir>/
```
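When an owner directory holds several pool files, it can help to print each file next to its namespace. The following is a minimal sketch, not part of the skill itself: the function name `list_pool_namespaces` is hypothetical, and it assumes every `*_clusterpool.yaml` contains a plain `namespace: <value>` line.

```bash
# list_pool_namespaces DIR
# Prints "<file>: <namespace>" for every *_clusterpool.yaml in DIR.
# Hypothetical helper; assumes a simple "namespace: value" line per file.
list_pool_namespaces() {
  local dir=$1 f ns
  for f in "$dir"/*_clusterpool.yaml; do
    [ -e "$f" ] || continue   # glob matched nothing
    # First line whose first field is exactly "namespace:"
    ns=$(awk '$1 == "namespace:" { print $2; exit }' "$f")
    printf '%s: %s\n' "$(basename "$f")" "${ns:-<none>}"
  done
}
```

Usage from the `release` repo root: `list_pool_namespaces clusters/hosted-mgmt/hive/pools/<owner-dir>`.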

Set shell variables for convenience throughout the debug session:

```bash
NS=ci-cluster-pool # replace with the actual namespace
CTX=hosted-mgmt
```
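Before relying on `$CTX`, it may be worth checking that the context actually exists in the local kubeconfig. This is a sketch under that assumption; `require_context` is a hypothetical helper name, and it only uses `oc config get-contexts -o name`.

```bash
# require_context NAME
# Returns 0 if NAME is a context in the current kubeconfig, 1 otherwise.
# Hypothetical helper for illustration.
require_context() {
  local ctx=$1
  # -x: whole-line match, -F: treat the name as a fixed string
  if oc config get-contexts -o name | grep -qxF -- "$ctx"; then
    return 0
  fi
  echo "context '$ctx' not found in kubeconfig" >&2
  return 1
}
```

For example, `require_context "$CTX" || exit 1` at the top of a debug script stops early instead of failing on every later command.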

---

## Step 1 — Inspect the ClusterPool

```bash
# List all pools in the namespace
oc --context $CTX -n $NS get clusterpool

# Describe the specific pool to check status conditions and ready/standby counts
oc --context $CTX -n $NS describe clusterpool <pool-name>
```

Key fields to check in the output:

- `Status.Ready` — number of unclaimed clusters that are running and ready to claim.
- `Status.Standby` — unclaimed clusters that are installed but hibernating.
- `Status.Size` — number of unclaimed clusters currently created for the pool.
- `Conditions` — look for `CapacityAvailable: False`, `AllClustersCurrent: False`, or `MissingDependencies: True`.
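The condition check can also be scripted instead of read from `describe` output. The sketch below assumes `CTX` and `NS` are set as in Step 0; `pool_unhealthy_conditions` is a hypothetical name, and it flags only `CapacityAvailable=False` and `MissingDependencies=True`.

```bash
# pool_unhealthy_conditions POOL
# Prints "<type>=<status> (<reason>)" for conditions in a problem state.
# Hypothetical helper; assumes CTX and NS are already set.
pool_unhealthy_conditions() {
  local pool=$1
  oc --context "$CTX" -n "$NS" get clusterpool "$pool" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' |
  awk -F'\t' '
    ($1 == "CapacityAvailable" && $2 == "False") ||
    ($1 == "MissingDependencies" && $2 == "True") { print $1 "=" $2 " (" $3 ")" }'
}
```

Empty output means neither of the two conditions is in its problem state; it does not prove the pool is healthy.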

---

## Step 2 — List ClusterDeployments in the namespace

```bash
# Overview of all ClusterDeployments: installed flag, install restarts, and active provision
oc --context $CTX -n $NS get clusterdeployment \
  -o custom-columns="NAME:.metadata.name,INSTALLED:.spec.installed,RESTARTS:.status.installRestarts,PROVISION:.status.provisionRef.name"

# List only deployments that are not installed (still provisioning or failed)
oc --context $CTX -n $NS get clusterdeployment \
  -o jsonpath='{range .items[?(@.spec.installed==false)]}{.metadata.name}{"\n"}{end}'
```

For a suspicious deployment:

```bash
oc --context $CTX -n $NS describe clusterdeployment <name>
```

Look for:

- `Conditions` block — `ProvisionFailed`, `DNSNotReady`, `AuthenticationCertificateNotAvailable`.
- `Status.InstallRestarts` — high count means repeated failures.
- `Status.ProvisionRef` — points to the active ClusterProvision.
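To spot repeated failures across the whole namespace, the restart count can be filtered against a threshold. This is a sketch only: `high_restart_cds` is a hypothetical name, the default threshold of 3 is an arbitrary assumption, and `CTX` and `NS` are expected to be set as in Step 0.

```bash
# high_restart_cds [THRESHOLD]
# Prints "<name>\t<installRestarts>" for ClusterDeployments whose restart
# count is above THRESHOLD (default 3, an arbitrary choice).
# Hypothetical helper; assumes CTX and NS are already set.
high_restart_cds() {
  local threshold=${1:-3}
  oc --context "$CTX" -n "$NS" get clusterdeployment \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.installRestarts}{"\n"}{end}' |
  # A missing installRestarts field prints as empty; "+ 0" treats it as 0.
  awk -F'\t' -v t="$threshold" '($2 + 0) > t { print $1 "\t" $2 }'
}
```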

---

## Step 3 — Check ClusterProvisions

```bash
# List provisions in the namespace
oc --context $CTX -n $NS get clusterprovision

# Describe a specific provision
oc --context $CTX -n $NS describe clusterprovision <provision-name>
```

Key fields:

- `Spec.Stage` — `initializing`, `provisioning`, `complete`, or `failed`.
- `Conditions` — `JobCreated`, `Initialized`, `Succeeded`.
- `Status.AdminKubeconfigSecret` / `Status.AdminPasswordSecret` — populated only on success.
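A quick tally of provisions per stage shows whether failures are isolated or widespread. The sketch below assumes the stage is exposed at `.spec.stage`, as in the Hive v1 API types, and that `CTX` and `NS` are set as in Step 0; `provision_stage_counts` is a hypothetical name.

```bash
# provision_stage_counts
# Prints "<stage>: <count>" for all ClusterProvisions in the namespace.
# Hypothetical helper; assumes CTX and NS are set and the stage is in .spec.stage.
provision_stage_counts() {
  oc --context "$CTX" -n "$NS" get clusterprovision \
    -o jsonpath='{range .items[*]}{.spec.stage}{"\n"}{end}' |
  sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```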

---

## Step 4 — Find provision and deprovision pods

Hive spawns `hive-install-*` pods for provisioning and `hive-deprovision-*` pods for deprovisioning.

```bash
# All install/deprovision pods — check phase at a glance
oc --context $CTX -n $NS get pod \
-l hive.openshift.io/job-type \
--sort-by='.status.startTime'

# Filter to pods that are neither Running nor Succeeded (Pending, Failed, or Unknown)
oc --context $CTX -n $NS get pod \
-l hive.openshift.io/job-type \
--field-selector=status.phase!=Succeeded,status.phase!=Running
```

Common pod label selectors:

```bash
# All Hive job pods (provision and deprovision) for a specific ClusterDeployment
oc --context $CTX -n $NS get pod \
-l hive.openshift.io/cluster-deployment-name=<cd-name>

# Deprovision pods
oc --context $CTX -n $NS get pod \
-l hive.openshift.io/cluster-deployment-name=<cd-name>,hive.openshift.io/job-type=deprovision
```
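The selectors above can be combined into a per-job-type summary. This is a minimal sketch: `job_pod_summary` is a hypothetical name, it assumes `CTX` and `NS` are set as in Step 0, and it relies on `-L` appending the label value as the last column of the default `get pod` output.

```bash
# job_pod_summary
# Prints "<job-type> <status>: <count>" for all Hive job pods.
# Hypothetical helper; assumes CTX and NS are already set.
job_pod_summary() {
  oc --context "$CTX" -n "$NS" get pod -l hive.openshift.io/job-type \
    -L hive.openshift.io/job-type --no-headers |
  # $3 is the STATUS column; $NF is the label column added by -L.
  # $NF is used because the RESTARTS column may contain spaces.
  awk '{ count[$NF " " $3]++ } END { for (k in count) print k ": " count[k] }' |
  sort
}
```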

---

## Step 5 — Get error messages from provision/deprovision logs

```bash
# Last 100 log lines from a failing install pod (the hiveutil container has the installer output)
oc --context $CTX -n $NS logs <pod-name> -c hiveutil --tail=100

# If the pod has multiple containers, list them first
oc --context $CTX -n $NS get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'

# Then fetch logs per container
oc --context $CTX -n $NS logs <pod-name> -c <container-name> --tail=200

# Previous run logs (useful if the pod restarted)
oc --context $CTX -n $NS logs <pod-name> -c <container-name> --previous

# Grep for known error patterns
oc --context $CTX -n $NS logs <pod-name> -c hiveutil 2>&1 | grep -iE "error|fail|timeout|denied|quota"
```

For deprovision pods the main container is usually named `deprovision`:

```bash
oc --context $CTX -n $NS logs <deprovision-pod-name> -c deprovision --tail=200
```
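For long logs, counting matches per keyword gives a first impression before reading individual lines. The sketch below works on a saved log file; `error_keyword_counts` is a hypothetical name, and the keyword list simply mirrors the grep pattern used above.

```bash
# error_keyword_counts FILE
# Prints "<keyword>: <number of matching lines>" for each known error keyword.
# Hypothetical helper; matching is case-insensitive.
error_keyword_counts() {
  local file=$1 kw
  for kw in error fail timeout denied quota; do
    # grep -c prints 0 (and exits non-zero) when nothing matches
    printf '%s: %s\n' "$kw" "$(grep -ic -- "$kw" "$file")"
  done
}
```

A possible workflow is to save the output first, for example `oc --context $CTX -n $NS logs <pod-name> -c hiveutil > install.log`, then run `error_keyword_counts install.log`.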

---

## Step 6 — Check all other pods in the namespace for issues

```bash
# Overview of pod health in the namespace
oc --context $CTX -n $NS get pod --sort-by='.status.startTime'

# Pods not in Running or Succeeded state
oc --context $CTX -n $NS get pod \
--field-selector=status.phase!=Running,status.phase!=Succeeded

# Describe a troubled pod
oc --context $CTX -n $NS describe pod <pod-name>
```

In `describe` output look for:

- `Events` section — `FailedScheduling`, `BackOff`, `Evicted`.
- `State.Waiting.Reason` — `CrashLoopBackOff`, `ImagePullBackOff`, `ErrImagePull`.
- `Last State.Terminated.Reason` (for example `OOMKilled`) and `Exit Code`.

```bash
# Quickly surface pods in CrashLoopBackOff or whose last termination was OOMKilled
oc --context $CTX -n $NS get pod -o json | \
  jq -r '.items[] | select(any(.status.containerStatuses[]?;
      .state.waiting.reason == "CrashLoopBackOff" or .lastState.terminated.reason == "OOMKilled"))
    | .metadata.name'
```

---

## Step 7 — Check Hive controller pods (hive namespace)

Problems are sometimes in the Hive controller itself, not the pool namespace.

```bash
# Hive runs in the hive namespace by default
oc --context $CTX -n hive get pod

# Controller logs (filter for the pool/CD name you are debugging)
oc --context $CTX -n hive logs -l control-plane=hive-controllers --tail=200 | grep <cd-name>

# HiveConfig status
oc --context $CTX get hiveconfig hive -o yaml | grep -A 20 "conditions:"
```

---

## Step 8 — Check recent events in the pool namespace

```bash
# All events sorted by time — fast way to see what went wrong recently
oc --context $CTX -n $NS get events \
--sort-by='.lastTimestamp' | tail -40

# Filter to Warning events only
oc --context $CTX -n $NS get events \
--field-selector=type=Warning \
--sort-by='.lastTimestamp' | tail -20
```
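When the namespace is noisy, grouping Warning events by reason makes the dominant failure stand out. This is a sketch under the assumption that `CTX` and `NS` are set as in Step 0; `warning_reason_counts` is a hypothetical name.

```bash
# warning_reason_counts
# Prints "<count>\t<reason>" for Warning events, most frequent first.
# Hypothetical helper; assumes CTX and NS are already set.
warning_reason_counts() {
  oc --context "$CTX" -n "$NS" get events --field-selector=type=Warning \
    -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' |
  sort | uniq -c |
  # Sort by count (descending), then by reason name for stable output
  sort -k1,1nr -k2 |
  awk '{ print $1 "\t" $2 }'
}
```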

---

## Quick-reference: common failure patterns

| Symptom | Where to look | Likely cause |
|---------|---------------|--------------|
| Pool `Ready` count stuck at 0 | `describe clusterpool` conditions | Quota exhaustion, cloud creds invalid, image pull failure |
| `ProvisionFailed` on ClusterDeployment | `describe clusterdeployment` + provision pod logs | Cloud API error, DNS failure, machine quota |
| Install pod `OOMKilled` | `describe pod` + previous logs | Container exceeded its memory limit, or insufficient node resources on hosted-mgmt |
| Deprovision pod stuck | Deprovision pod logs | Cloud resource still exists, creds issue |
| `ImagePullBackOff` on install pod | `describe pod` events | Hive installer image not accessible |
| High `InstallRestarts` | ClusterDeployment status | Transient cloud errors or a persistent config problem |

---

## References and further reading

- **Hive troubleshooting guide** (official):
<https://github.com/openshift/hive/blob/master/docs/troubleshooting.md>

- **Hive ClusterPool documentation**:
<https://github.com/openshift/hive/blob/master/docs/clusterpools.md>

- **Hive ClusterDeployment documentation**:
<https://github.com/openshift/hive/blob/master/docs/using-hive.md>

- **Hive API types** (ClusterPool, ClusterDeployment, ClusterProvision conditions):
<https://github.com/openshift/hive/tree/master/apis/hive/v1>

- **Pool definitions in the `release` repo** (namespace mapping source of truth):
`clusters/hosted-mgmt/hive/pools/`

- **Test Platform SOP — cluster pool issues** (internal):
<https://docs.google.com/document/d/1bBkqR1kMmulGSVbbRv2EQ4T86H7oMWNn4y_AKDpDlzE> *(check #forum-testplatform for the current link)*

---

## Tips

- Always confirm you are talking to the right cluster before running write operations:
  ```bash
  oc --context $CTX whoami --show-server
  ```
- The `hosted-mgmt` context must be present in your kubeconfig.
  Refresh it with the cluster-login skill if you get auth errors.
- Pool namespaces usually follow the pattern `<owner>-cluster-pool`, but confirm from the YAML:
  some owners (e.g. `openshift-ci`) use `ci-cluster-pool` instead.