WINC-1778: [ote] Migrate logging & monitoring tests to OTE (batch 3/6) by weinliu · Pull Request #3921 · openshift/windows-machine-config-operator

weinliu · 2026-04-02T23:10:21Z

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (3 tests):

OCP-33779: Retrieving Windows node logs (oc adm node-logs)
OCP-33783: Enable must gather on Windows node [Slow][Disruptive]
OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]

Planned migration batches (WINC-1778):

Batch	Category	Tests	Status
0	wmco golang version	37362	✅ merged
1 (PR #3920)	Node Basics	32554, 32615, 33612	🔄 in review
2 (PR #3922)	Networking	28632, 32273, 31276	🔄 in review
3 (PR #3921)	Logging & Monitoring	33779, 33783, 73595	🔄 in review
4	WICD, Workloads & Misc (no SSH)	50403, 60814, 25593, 42204, 38186	📋 planned
5	SSH-dependent tests	54711, 56354, 76765, 48873, 65980, 66670, 68320	⏳ pending SSH decision
Special	Needs special cluster	74760, 38188	⏳ pending

Test Results (OCP 4.21 nightly, IPI on AWS)

Test	Result	Duration
OCP-33779	✅ passed	2.6s
OCP-33783	✅ passed	366s (must-gather)
OCP-73595	✅ passed	0.4s

Full test output

OCP-33779

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]

OCP-33783

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]

OCP-73595

[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]

Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

openshift-ci-robot · 2026-04-02T23:10:27Z

@weinliu: This pull request references WINC-1778 which is a valid jira issue.

Details

In response to this:

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (5 tests):

OCP-33779: Retrieving Windows node logs (oc adm node-logs)

OCP-33783: Enable must gather on Windows node [Slow][Disruptive]

OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]

OCP-70922: Monitor CPU, Memory, and Filesystem graphs for Windows Pods

OCP-73752: Monitor Network In, and Network Out graphs for Windows Pods

OCP-70922 and OCP-73752 query Thanos directly via the thanos-querier route
using a prometheus-k8s service account token, with no dependency on
compat_otp.NewPrometheusMonitor.

Planned migration batches (WINC-1778):

Batch Category Tests

1 Node Basics 32554, 32615, 33612

2 Networking 28632, 32273, 31276, 39451

3 (this PR) Logging & Monitoring 33779, 33783, 73595, 70922, 73752

4 WICD & Services 54711, 50403, 56354, 76765, 33794

5 Proxy 65980, 66670, 68320

6 Workloads & Misc 25593, 42204, 48873, 38186, 38188, 60814, 74760

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration

OCP-33779 ✅ passed 2.6s

OCP-33783 ✅ passed 366s (must-gather)

OCP-73595 ✅ passed 0.4s

OCP-70922 ⚠️ Thanos query works; metric availability depends on recording rules —

OCP-73752 ⚠️ Thanos query works; metric availability depends on recording rules —
Full test output (33779, 33783, 73595)

OCP-33779
[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]
OCP-33783
[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]
OCP-73595
[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]
Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-04-02T23:10:36Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: b4fc80e1-bd60-43f4-876f-b030b5d229ba

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This pull request extends the WMCO (Windows Machine Config Operator) test suite with additional observability and validation checks. A new batch of five test specifications was registered in the extension test harness, each invoking a corresponding check function. A new logging_monitoring.go module was introduced containing five exported check functions that validate Windows node logs, must-gather artifacts, filesystem metrics, and pod-level CPU/memory/network/filesystem metrics via Thanos queries. Supporting utilities include Thanos query construction with OAuth token handling and JSON result parsing. The integration leverages the oc CLI client for cluster inspection and metrics aggregation.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Go Best Practices & Build Tags	⚠️ Warning	File `logging_monitoring.go` lacks build tag directive required for Windows-only test code and contains HTTP status code validation gaps in queryThanos function.	Add `//go:build windows` directive before package declaration. Validate `resp.StatusCode == http.StatusOK` immediately after response receipt and replace `InsecureSkipVerify: true` with proper TLS verification using in-cluster certificates.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Security: Secrets, Ssh & Csr	✅ Passed	The new test code does not handle secrets, SSH connections, or CSR approvals; it only executes CLI commands and processes output.
Kubernetes Controller Patterns	✅ Passed	This pull request modifies only test utilities (wmco-tests-ext/main.go and logging_monitoring.go), not controller implementations, so controller pattern validation is not applicable.
Windows Service Management	✅ Passed	Windows Service Management check is not applicable; PR implements test validation only, performing no SCM interactions or service lifecycle operations.
Platform-Specific Requirements	✅ Passed	PR migrates logging/monitoring tests to OTE framework without introducing platform-specific infrastructure provisioning or constraints; tests validate cluster behavior across all cloud providers.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-04-02T23:10:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: weinliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ote/OWNERS~~ [weinliu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-04-02T23:12:03Z

@weinliu: This pull request references WINC-1778 which is a valid jira issue.

Details

In response to this:

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (5 tests):

OCP-33779: Retrieving Windows node logs (oc adm node-logs)

OCP-33783: Enable must gather on Windows node [Slow][Disruptive]

OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]

OCP-70922: Monitor CPU, Memory, and Filesystem graphs for Windows Pods

OCP-73752: Monitor Network In, and Network Out graphs for Windows Pods

OCP-70922 and OCP-73752 query Thanos directly via the thanos-querier route
using a prometheus-k8s service account token, with no dependency on
compat_otp.NewPrometheusMonitor.

Planned migration batches (WINC-1778):

Batch Category Tests

1 Node Basics 32554, 32615, 33612

2 Networking 28632, 32273, 31276, 39451

3 (this PR) Logging & Monitoring 33779, 33783, 73595, 70922, 73752

4 WICD & Services 54711, 50403, 56354, 76765, 33794

5 Proxy 65980, 66670, 68320

6 Workloads & Misc 25593, 42204, 48873, 38186, 38188, 60814, 74760

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration

OCP-33779 ✅ passed 2.6s

OCP-33783 ✅ passed 366s (must-gather)

OCP-73595 ✅ passed 0.4s

OCP-70922 ⚠️ Thanos query works; metric availability depends on recording rules —

OCP-73752 ⚠️ Thanos query works; metric availability depends on recording rules —
Full test output (33779, 33783, 73595)

OCP-33779
[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]
OCP-33783
[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]
OCP-73595
[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]
Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)
Summary by CodeRabbit

Tests

Added extended validation tests for Windows node monitoring, including checks for node logs, pod metrics (CPU, memory, network, filesystem), filesystem graphs, and must-gather collection.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (1)

ote/test/extended/logging_monitoring.go (1)
23-35: Avoid re-fetching token and route on every metric query

queryThanos runs oc create token and oc get route each call, which scales poorly across pods/metrics. Resolve these once per check and reuse.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ote/test/extended/logging_monitoring.go` around lines 23 - 35, queryThanos
currently calls oc create token and oc get route on every invocation which is
inefficient; refactor so the token and thanosHost are retrieved once and reused:
add a helper like getThanosCredentials (or use a sync.Once-backed package-level
cache) that calls oc.AsAdmin().Output for the token and route a single time,
then change queryThanos to accept token and thanosHost as parameters (or read
the cached values) and update its callers to supply them; ensure error handling
remains and that the cache refresh strategy (if any) is documented where you add
getThanosCredentials/queryThanos changes.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ote/test/extended/logging_monitoring.go`:
- Around line 60-68: thanosHasResult currently only unmarshals into Data.Result
and treats any non-result as "no metric"; update it to also unmarshal/inspect
top-level fields (e.g., status and error) from the Thanos response and return an
error when status == "error" or an error field is present. Specifically, extend
the result struct (or unmarshal into a wrapper) to include Status string and
Error (or ErrorType/ ErrorObject) fields, check those after json.Unmarshal of
response in thanosHasResult, and when an error-status is detected return a
formatted error containing the Thanos error details instead of returning false.
- Line 234: The PromQL selector passed to queryThanos is only filtering by pod
(query+`{pod="`+pod+`"}`), which can match same-named pods in other namespaces;
update the selector to include namespace by changing the call(s) to include
namespace=" + wmcoNamespace (e.g.
query+`{pod="`+pod+`",namespace="`+wmcoNamespace+`"}`) so the query is scoped to
the WMCO namespace; update both occurrences where queryThanos is invoked with
the pod-only selector (including the call around the symbol queryThanos at the
earlier occurrence and the similar call at the later occurrence).
- Around line 46-55: The current client.Do(req) response handling treats non-200
HTTP responses as normal payloads; after performing client.Do(req) and before
returning the body in the code around resp, err := client.Do(req) / defer
resp.Body.Close(), add a status check: if resp.StatusCode is not 200 (or not in
the expected 2xx range) read a short snippet of resp.Body (e.g., first ~200
bytes), and return a formatted error that includes resp.Status and the trimmed
snippet to fail fast on auth/outage errors; keep the existing error wraps for
io.ReadAll and ensure resp.Body is still closed (defer) after the nil err check.
- Around line 153-154: The test currently uses a fixed destDir (destDir :=
"/tmp/must-gather-33783") which can collide across runs; replace this by
creating a unique temp directory via os.MkdirTemp (e.g., call os.MkdirTemp("",
"must-gather-") and assign to destDir) and keep the deferred cleanup via defer
os.RemoveAll(destDir) to remove the created directory; update references that
use destDir accordingly and handle the MkdirTemp error (fail the test or
t.Fatal) before deferring RemoveAll.
- Around line 41-46: The HTTP client currently disables TLS verification, has no
timeout, and doesn't validate HTTP status codes; update the client creation to
remove InsecureSkipVerify (use default TLS config or the cluster CA), add a
sensible timeout on the http.Client (e.g., Client{Timeout: ...}), and after
client.Do(req) check resp != nil then validate resp.StatusCode is 2xx—on non-2xx
return a clear error (including status and body) and ensure resp.Body is closed
before returning; reference the Transport/TLSClientConfig/InsecureSkipVerify,
http.Client{Timeout: ...}, and the client.Do(req)/resp handling code paths to
locate and fix the logic.

---

Nitpick comments:
In `@ote/test/extended/logging_monitoring.go`:
- Around line 23-35: queryThanos currently calls oc create token and oc get
route on every invocation which is inefficient; refactor so the token and
thanosHost are retrieved once and reused: add a helper like getThanosCredentials
(or use a sync.Once-backed package-level cache) that calls oc.AsAdmin().Output
for the token and route a single time, then change queryThanos to accept token
and thanosHost as parameters (or read the cached values) and update its callers
to supply them; ensure error handling remains and that the cache refresh
strategy (if any) is documented where you add getThanosCredentials/queryThanos
changes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 04d2ffb6-21e0-43ba-8be2-0b21061beaba

📥 Commits

Reviewing files that changed from the base of the PR and between 13e7a27 and 4956075.

📒 Files selected for processing (2)

ote/cmd/wmco-tests-ext/main.go
ote/test/extended/logging_monitoring.go

coderabbitai · 2026-04-02T23:17:05Z

+	client := &http.Client{
+		Transport: &http.Transport{
+			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
+		},
+	}
+	resp, err := client.Do(req)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the full context of the queryThanos function and surrounding code cat -n ote/test/extended/logging_monitoring.go | head -80

Repository: openshift/windows-machine-config-operator

Length of output: 3080

🏁 Script executed:

# Check if there are any environment-specific comments or configurations explaining the insecure client rg -A 5 -B 5 "InsecureSkipVerify" ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 442

🏁 Script executed:

# Check if CA certificates or kubelet configuration is available in the test environment rg -i "ca.*cert|kubelet|tls.*config" ote/test/extended/ | head -20

Repository: openshift/windows-machine-config-operator

Length of output: 940

🏁 Script executed:

# Look for any timeout or context cancellation mechanisms in queryThanos or its callers rg -A 10 "func queryThanos" ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 555

🏁 Script executed:

# Check if this test runs against real/self-signed certs or mocked endpoints rg -i "thanos|prometheus" ote/test/extended/logging_monitoring.go | grep -E "(url|endpoint|host|cert)" | head -10

Repository: openshift/windows-machine-config-operator

Length of output: 179

🏁 Script executed:

# Check if CA certificates are typically available in OpenShift clusters # Look for any documentation or other test patterns handling TLS rg -i "kubernetes.io.*ca|serviceaccount.*ca" ote/test/ | head -10

Repository: openshift/windows-machine-config-operator

Length of output: 67

🏁 Script executed:

# Check if there are other HTTP clients in the test code with proper TLS handling rg -B 3 -A 3 "http.Client\{" ote/test/

Repository: openshift/windows-machine-config-operator

Length of output: 617

🏁 Script executed:

# Check if there are timeout patterns used elsewhere in test code rg "time.Second|Timeout:" ote/test/ | head -20

Repository: openshift/windows-machine-config-operator

Length of output: 67

🏁 Script executed:

# Look at the actual Thanos API usage - does the code check status codes? cat -n ote/test/extended/logging_monitoring.go | sed -n '46,56p'

Repository: openshift/windows-machine-config-operator

Length of output: 442

🏁 Script executed:

# Check if there's context passed to queryThanos that could provide timeout info rg "queryThanos\(" ote/test/extended/logging_monitoring.go | head -5

Repository: openshift/windows-machine-config-operator

Length of output: 260

Secure Thanos client with proper TLS verification, timeout, and error handling

The HTTP client is missing three critical safeguards:

Line 43: InsecureSkipVerify: true disables certificate validation, exposing the bearer token to MITM attacks. OpenShift clusters provide proper CA certificates in-cluster; use them.

Lines 41–45: No timeout configured. The client will hang indefinitely if Thanos is slow or unresponsive, causing test flakiness.

Line 46–55: No HTTP status code validation. Errors like 401 (auth failures) or 500 (server errors) are silently treated as valid responses and passed to the JSON parser.

Suggested fix

import ( "context" "crypto/tls" "encoding/json" "fmt" "io" "net/http" "net/url" "os" "strconv" "strings" + "time" @@ client := &http.Client{ + Timeout: 30 * time.Second, Transport: &http.Transport{ - TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec + TLSClientConfig: &tls.Config{}, }, } resp, err := client.Do(req) if err != nil { return "", fmt.Errorf("thanos query failed: %w", err) } + if resp.StatusCode != http.StatusOK { + return "", fmt.Errorf("thanos returned %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode)) + } defer resp.Body.Close()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

client := &http.Client{

Transport: &http.Transport{

TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec

},

}

resp, err := client.Do(req)

import (

"context"

"crypto/tls"

"encoding/json"

"fmt"

"io"

"net/http"

"net/url"

"os"

"strconv"

"strings"

"time"

)

client := &http.Client{

Timeout: 30 * time.Second,

Transport: &http.Transport{

TLSClientConfig: &tls.Config{},

},

}

resp, err := client.Do(req)

if err != nil {

return "", fmt.Errorf("thanos query failed: %w", err)

}

if resp.StatusCode != http.StatusOK {

return "", fmt.Errorf("thanos returned %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode))

}

defer resp.Body.Close()

🧰 Tools

🪛 ast-grep (0.42.0)

[warning] 42-42: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{InsecureSkipVerify: true}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ote/test/extended/logging_monitoring.go` around lines 41 - 46, The HTTP client currently disables TLS verification, has no timeout, and doesn't validate HTTP status codes; update the client creation to remove InsecureSkipVerify (use default TLS config or the cluster CA), add a sensible timeout on the http.Client (e.g., Client{Timeout: ...}), and after client.Do(req) check resp != nil then validate resp.StatusCode is 2xx—on non-2xx return a clear error (including status and body) and ensure resp.Body is closed before returning; reference the Transport/TLSClientConfig/InsecureSkipVerify, http.Client{Timeout: ...}, and the client.Do(req)/resp handling code paths to locate and fix the logic.

coderabbitai · 2026-04-02T23:17:05Z

+	resp, err := client.Do(req)
+	if err != nil {
+		return "", fmt.Errorf("thanos query failed: %w", err)
+	}
+	defer resp.Body.Close()
+	body, err := io.ReadAll(resp.Body)
+	if err != nil {
+		return "", fmt.Errorf("failed to read thanos response: %w", err)
+	}
+	return string(body), nil


⚠️ Potential issue | 🟠 Major

Validate HTTP status codes from Thanos

On Line 46, non-200 responses are currently treated like normal payloads, which can misclassify auth/outage failures as “no metric”. Fail fast on unexpected status with a short body snippet.

Suggested fix

resp, err := client.Do(req) if err != nil { return "", fmt.Errorf("thanos query failed: %w", err) } defer resp.Body.Close() body, err := io.ReadAll(resp.Body) if err != nil { return "", fmt.Errorf("failed to read thanos response: %w", err) } + if resp.StatusCode != http.StatusOK { + snippet := string(body) + if len(snippet) > 200 { + snippet = snippet[:200] + } + return "", fmt.Errorf("thanos query returned HTTP %d: %s", resp.StatusCode, snippet) + } return string(body), nil

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

resp, err := client.Do(req)

if err != nil {

return "", fmt.Errorf("thanos query failed: %w", err)

}

defer resp.Body.Close()

body, err := io.ReadAll(resp.Body)

if err != nil {

return "", fmt.Errorf("failed to read thanos response: %w", err)

}

return string(body), nil

resp, err := client.Do(req)

if err != nil {

return "", fmt.Errorf("thanos query failed: %w", err)

}

defer resp.Body.Close()

body, err := io.ReadAll(resp.Body)

if err != nil {

return "", fmt.Errorf("failed to read thanos response: %w", err)

}

if resp.StatusCode != http.StatusOK {

snippet := string(body)

if len(snippet) > 200 {

snippet = snippet[:200]

}

return "", fmt.Errorf("thanos query returned HTTP %d: %s", resp.StatusCode, snippet)

}

return string(body), nil

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ote/test/extended/logging_monitoring.go` around lines 46 - 55, The current client.Do(req) response handling treats non-200 HTTP responses as normal payloads; after performing client.Do(req) and before returning the body in the code around resp, err := client.Do(req) / defer resp.Body.Close(), add a status check: if resp.StatusCode is not 200 (or not in the expected 2xx range) read a short snippet of resp.Body (e.g., first ~200 bytes), and return a formatted error that includes resp.Status and the trimmed snippet to fail fast on auth/outage errors; keep the existing error wraps for io.ReadAll and ensure resp.Body is still closed (defer) after the nil err check.

coderabbitai · 2026-04-02T23:17:05Z

+	var result struct {
+		Data struct {
+			Result []interface{} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal([]byte(response), &result); err != nil {
+		return false, fmt.Errorf("failed to parse thanos response: %w", err)
+	}
+	return len(result.Data.Result) > 0, nil


⚠️ Potential issue | 🟠 Major

Handle Thanos API error payloads explicitly

thanosHasResult only checks data.result. If Thanos returns {"status":"error",...}, this is reported as missing metric instead of query failure. Parse and validate top-level status/error fields.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ote/test/extended/logging_monitoring.go` around lines 60 - 68, thanosHasResult currently only unmarshals into Data.Result and treats any non-result as "no metric"; update it to also unmarshal/inspect top-level fields (e.g., status and error) from the Thanos response and return an error when status == "error" or an error field is present. Specifically, extend the result struct (or unmarshal into a wrapper) to include Status string and Error (or ErrorType/ ErrorObject) fields, check those after json.Unmarshal of response in thanosHasResult, and when an error-status is detected return a formatted error containing the Thanos error details instead of returning false.

coderabbitai · 2026-04-02T23:17:05Z

+	destDir := "/tmp/must-gather-33783"
+	defer os.RemoveAll(destDir)


⚠️ Potential issue | 🟠 Major

Use a unique must-gather destination directory

On Line 153, fixed path /tmp/must-gather-33783 can collide across concurrent runs and stale leftovers. Use os.MkdirTemp and defer cleanup of the created dir.

Suggested fix

- destDir := "/tmp/must-gather-33783" - defer os.RemoveAll(destDir) + destDir, err := os.MkdirTemp("", "must-gather-33783-*") + if err != nil { + return fmt.Errorf("failed to create temp must-gather directory: %w", err) + } + defer os.RemoveAll(destDir) - msg, err := oc.AsAdmin().Output("adm", "must-gather", "--dest-dir="+destDir) + msg, err := oc.AsAdmin().Output("adm", "must-gather", "--dest-dir="+destDir)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

destDir := "/tmp/must-gather-33783"

defer os.RemoveAll(destDir)

destDir, err := os.MkdirTemp("", "must-gather-33783-*")

if err != nil {

return fmt.Errorf("failed to create temp must-gather directory: %w", err)

}

defer os.RemoveAll(destDir)

msg, err := oc.AsAdmin().Output("adm", "must-gather", "--dest-dir="+destDir)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ote/test/extended/logging_monitoring.go` around lines 153 - 154, The test currently uses a fixed destDir (destDir := "/tmp/must-gather-33783") which can collide across runs; replace this by creating a unique temp directory via os.MkdirTemp (e.g., call os.MkdirTemp("", "must-gather-") and assign to destDir) and keep the deferred cleanup via defer os.RemoveAll(destDir) to remove the created directory; update references that use destDir accordingly and handle the MkdirTemp error (fail the test or t.Fatal) before deferring RemoveAll.

coderabbitai · 2026-04-02T23:17:06Z

+		storage, err := oc.AsAdmin().Output("get", "node", host,
+			"-o=jsonpath={.status.allocatable['ephemeral-storage']}")
+		if err != nil {
+			return fmt.Errorf("failed to get ephemeral storage for node %s: %w", host, err)
+		}
+		val, err := strconv.ParseInt(strings.TrimSuffix(storage, "Ki"), 10, 64)
+		if err != nil {
+			return fmt.Errorf("failed to parse storage value %q for node %s: %w", storage, host, err)
+		}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file fd -t f "logging_monitoring.go" | head -5

Repository: openshift/windows-machine-config-operator

Length of output: 124

🏁 Script executed:

# Check the file size before reading wc -l ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 128

🏁 Script executed:

# Read the specific lines in question sed -n '190,210p' ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 735

🏁 Script executed:

# Check current imports in the file head -30 ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 918

🏁 Script executed:

# Search for other uses of resource.ParseQuantity in the codebase to see if this pattern is used elsewhere rg "resource\.ParseQuantity" --type go

Repository: openshift/windows-machine-config-operator

Length of output: 332

🏁 Script executed:

# Check if k8s.io/apimachinery is already a dependency in the project fd -t f "go.mod" | xargs head -50

Repository: openshift/windows-machine-config-operator

Length of output: 2834

🏁 Script executed:

# Check what dependencies ote tests typically need grep -r "import" ote/test/extended/*.go | head -20

Repository: openshift/windows-machine-config-operator

Length of output: 171

🏁 Script executed:

# Look for any other quantity/resource parsing in ote module rg -t go "ParseQuantity|ParseInt|resource\." ote/

Repository: openshift/windows-machine-config-operator

Length of output: 198

🏁 Script executed:

# Check if ote module would be able to add k8s.io/apimachinery as a dependency by checking its usage elsewhere grep -r "k8s.io" ote/ --type go

Repository: openshift/windows-machine-config-operator

Length of output: 202

🏁 Script executed:

# Verify the exact Kubernetes quantity format that gets returned from the jsonpath query # by looking at how ephemeral-storage is documented in similar code rg -t go "ephemeral-storage" -B 3 -A 5

Repository: openshift/windows-machine-config-operator

Length of output: 8946

Use resource.ParseQuantity to handle all Kubernetes quantity formats, not just Ki

The current code breaks for quantities returned in formats other than Ki (e.g., Gi, G, Mi, M). The jsonpath query for ephemeral-storage returns a Kubernetes quantity, which can be expressed with binary (Ki, Mi, Gi, Ti, Pi, Ei) or decimal (K, M, G, T, P, E) suffixes. Replace the fragile strings.TrimSuffix(storage, "Ki") approach with resource.ParseQuantity().

This requires adding k8s.io/apimachinery to ote/go.mod (the root module already has it at v0.35.2, and the pattern is already used in pkg/services/services.go). The fix also updates the sign check from numeric to use Quantity.Sign():

Suggested fix

import ( @@ "strconv" "strings" "github.com/openshift/windows-machine-config-operator/ote/test/extended/cli" + "k8s.io/apimachinery/pkg/api/resource" ) @@ - val, err := strconv.ParseInt(strings.TrimSuffix(storage, "Ki"), 10, 64) + q, err := resource.ParseQuantity(storage) if err != nil { return fmt.Errorf("failed to parse storage value %q for node %s: %w", storage, host, err) } - if val <= 0 { - return fmt.Errorf("expected positive storage value for node %s, got %d", host, val) + if q.Sign() <= 0 { + return fmt.Errorf("expected positive storage value for node %s, got %q", host, storage) }

coderabbitai · 2026-04-02T23:17:06Z

+
+	for _, pod := range pods {
+		for metricName, query := range metrics {
+			resp, err := queryThanos(oc, query+`{pod="`+pod+`"}`)


⚠️ Potential issue | 🟠 Major

Scope PromQL to WMCO namespace to avoid false positives

Both queries filter only by pod. A same-named pod in another namespace can satisfy the check. Add namespace="<wmcoNamespace>" to the selector.

Suggested fix

- resp, err := queryThanos(oc, query+`{pod="`+pod+`"}`) + resp, err := queryThanos(oc, query+`{pod="`+pod+`",namespace="`+wmcoNamespace+`"}`) @@ - resp, err := queryThanos(oc, query+`{pod="`+pod+`"}`) + resp, err := queryThanos(oc, query+`{pod="`+pod+`",namespace="`+wmcoNamespace+`"}`)

Also applies to: 272-272

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ote/test/extended/logging_monitoring.go` at line 234, The PromQL selector passed to queryThanos is only filtering by pod (query+`{pod="`+pod+`"}`), which can match same-named pods in other namespaces; update the selector to include namespace by changing the call(s) to include namespace=" + wmcoNamespace (e.g. query+`{pod="`+pod+`",namespace="`+wmcoNamespace+`"}`) so the query is scoped to the WMCO namespace; update both occurrences where queryThanos is invoked with the pod-only selector (including the call around the symbol queryThanos at the earlier occurrence and the similar call at the later occurrence).

openshift-ci-robot · 2026-04-02T23:47:50Z

@weinliu: This pull request references WINC-1778 which is a valid jira issue.

Details

In response to this:

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (3 tests):

OCP-33779: Retrieving Windows node logs (oc adm node-logs)

OCP-33783: Enable must gather on Windows node [Slow][Disruptive]

OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]

Planned migration batches (WINC-1778):

Batch Category Tests

1 Node Basics 32554, 32615, 33612

2 Networking 28632, 32273, 31276, 39451

3 (this PR) Logging & Monitoring 33779, 33783, 73595

4 WICD & Services 54711, 50403, 56354, 76765, 33794

5 Proxy 65980, 66670, 68320

6 Workloads & Misc 25593, 42204, 48873, 38186, 38188, 60814, 74760

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration

OCP-33779 ✅ passed 2.6s

OCP-33783 ✅ passed 366s (must-gather)

OCP-73595 ✅ passed 0.4s
Full test output

OCP-33779
[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]
OCP-33783
[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]
OCP-73595
[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]
Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

weinliu · 2026-04-03T00:25:50Z

/retest security

openshift-ci · 2026-04-03T00:25:54Z

@weinliu: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test aws-e2e-operator

/test azure-e2e-operator

/test azure-e2e-upgrade

/test ci-bundle-wmco-bundle

/test gcp-e2e-operator

/test images

/test lint

/test nutanix-e2e-operator

/test platform-none-vsphere-e2e-operator

/test security

/test unit

/test vsphere-disconnected-e2e-operator

/test vsphere-e2e-operator

/test vsphere-proxy-e2e-operator

/test wicd-unit-vsphere

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-windows-machine-config-operator-master-ci-bundle-wmco-bundle

pull-ci-openshift-windows-machine-config-operator-master-images

pull-ci-openshift-windows-machine-config-operator-master-lint

pull-ci-openshift-windows-machine-config-operator-master-security

pull-ci-openshift-windows-machine-config-operator-master-unit

Details

In response to this:

/retest security

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci-robot · 2026-04-03T22:37:42Z

@weinliu: This pull request references WINC-1778 which is a valid jira issue.

Details

In response to this:

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (3 tests):

OCP-33779: Retrieving Windows node logs (oc adm node-logs)

OCP-33783: Enable must gather on Windows node [Slow][Disruptive]

OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]

Planned migration batches (WINC-1778):

Batch Category Tests Status

0 wmco golang version 37362 ✅ merged

1 (PR #3920) Node Basics 32554, 32615, 33612 🔄 in review

2 (PR #3922) Networking 28632, 32273, 31276 🔄 in review

3 (PR #3921) Logging & Monitoring 33779, 33783, 73595 🔄 in review

4 WICD, Workloads & Misc (no SSH) 50403, 60814, 25593, 42204, 38186 📋 planned

5 SSH-dependent tests 54711, 56354, 76765, 48873, 65980, 66670, 68320 ⏳ pending SSH decision

Special Needs special cluster 74760, 38188 ⏳ pending

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration

OCP-33779 ✅ passed 2.6s

OCP-33783 ✅ passed 366s (must-gather)

OCP-73595 ✅ passed 0.4s
Full test output

OCP-33779
[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]
OCP-33783
[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]
OCP-73595
[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]
Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Migrate the following logging & monitoring smokerun tests from openshift-tests-private to the OTE extension binary: - OCP-33779: Retrieving Windows node logs (oc adm node-logs) - OCP-33783: Enable must gather on Windows node [Slow][Disruptive] - OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial] - OCP-70922: Monitor CPU, Memory, and Filesystem graphs for Windows Pods - OCP-73752: Monitor Network In, and Network Out graphs for Windows Pods OCP-70922 and OCP-73752 query Thanos directly via the thanos-querier route using a prometheus-k8s service account token, avoiding the compat_otp PrometheusMonitor dependency.

openshift-ci · 2026-04-06T02:08:44Z

@weinliu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-bot · 2026-04-09T19:06:31Z

Closing, please open a separate request if this work is still required.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 2, 2026

openshift-ci bot requested review from mansikulkarni96 and rrasouli April 2, 2026 23:10

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2026

coderabbitai bot reviewed Apr 2, 2026

View reviewed changes

weinliu force-pushed the WINC-1778-ote-batch3-logging-monitoring branch from 4956075 to a7e9d6b Compare April 2, 2026 23:47

This was referenced Apr 3, 2026

WINC-1778: [ote] Migrate node-basics tests to OTE (batch 1/6) #3920

Closed

WINC-1778: [ote] Migrate networking tests to OTE (batch 2/6) #3922

Closed

weinliu mentioned this pull request Apr 3, 2026

WINC-1778: [ote] Migrate WICD, workloads & misc tests to OTE (batch 4) #3923

Closed

weinliu force-pushed the WINC-1778-ote-batch3-logging-monitoring branch from a7e9d6b to 23749c5 Compare April 6, 2026 01:22

openshift-bot closed this Apr 9, 2026

-	client := &http.Client{
-		Transport: &http.Transport{
-			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
-		},
-	}
-	resp, err := client.Do(req)
+import (
+	"context"
+	"crypto/tls"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net/http"
+	"net/url"
+	"os"
+	"strconv"
+	"strings"
+	"time"
+)
+	client := &http.Client{
+		Timeout: 30 * time.Second,
+		Transport: &http.Transport{
+			TLSClientConfig: &tls.Config{},
+		},
+	}
+	resp, err := client.Do(req)
+	if err != nil {
+		return "", fmt.Errorf("thanos query failed: %w", err)
+	}
+	if resp.StatusCode != http.StatusOK {
+		return "", fmt.Errorf("thanos returned %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode))
+	}
+	defer resp.Body.Close()

		destDir := "/tmp/must-gather-33783"
		defer os.RemoveAll(destDir)

Conversation

weinliu commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Results (OCP 4.21 nightly, IPI on AWS)

Uh oh!

openshift-ci-robot commented Apr 2, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Results (OCP 4.21 nightly, IPI on AWS)

Uh oh!

coderabbitai bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

❌ Failed checks (1 warning)

Uh oh!

openshift-ci bot commented Apr 2, 2026

Uh oh!

openshift-ci-robot commented Apr 2, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Results (OCP 4.21 nightly, IPI on AWS)

Summary by CodeRabbit

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci-robot commented Apr 2, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Results (OCP 4.21 nightly, IPI on AWS)

Uh oh!

weinliu commented Apr 3, 2026

Uh oh!

openshift-ci bot commented Apr 3, 2026

Uh oh!

openshift-ci-robot commented Apr 3, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Results (OCP 4.21 nightly, IPI on AWS)

Uh oh!

openshift-ci bot commented Apr 6, 2026

Uh oh!

openshift-bot commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

weinliu commented Apr 2, 2026 •

edited

Loading

openshift-ci-robot commented Apr 2, 2026 •

edited by openshift-ci bot

Loading

coderabbitai bot commented Apr 2, 2026 •

edited

Loading

openshift-ci-robot commented Apr 2, 2026 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Apr 2, 2026 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Apr 3, 2026 •

edited by openshift-ci bot

Loading