Skip to content

WINC-1778: [ote] Migrate logging & monitoring tests to OTE (batch 3/6)#3921

Closed
weinliu wants to merge 1 commit intoopenshift:masterfrom
weinliu:WINC-1778-ote-batch3-logging-monitoring
Closed

WINC-1778: [ote] Migrate logging & monitoring tests to OTE (batch 3/6)#3921
weinliu wants to merge 1 commit intoopenshift:masterfrom
weinliu:WINC-1778-ote-batch3-logging-monitoring

Conversation

@weinliu
Copy link
Copy Markdown
Contributor

@weinliu weinliu commented Apr 2, 2026

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (3 tests):

  • OCP-33779: Retrieving Windows node logs (oc adm node-logs)
  • OCP-33783: Enable must gather on Windows node [Slow][Disruptive]
  • OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]

Planned migration batches (WINC-1778):

Batch Category Tests Status
0 wmco golang version 37362 ✅ merged
1 (PR #3920) Node Basics 32554, 32615, 33612 🔄 in review
2 (PR #3922) Networking 28632, 32273, 31276 🔄 in review
3 (PR #3921) Logging & Monitoring 33779, 33783, 73595 🔄 in review
4 WICD, Workloads & Misc (no SSH) 50403, 60814, 25593, 42204, 38186 📋 planned
5 SSH-dependent tests 54711, 56354, 76765, 48873, 65980, 66670, 68320 ⏳ pending SSH decision
Special Needs special cluster 74760, 38188 ⏳ pending

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration
OCP-33779 ✅ passed 2.6s
OCP-33783 ✅ passed 366s (must-gather)
OCP-73595 ✅ passed 0.4s
Full test output

OCP-33779

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]

OCP-33783

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]

OCP-73595

[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]

Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 2, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 2, 2026

@weinliu: This pull request references WINC-1778 which is a valid jira issue.

Details

In response to this:

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (5 tests):

  • OCP-33779: Retrieving Windows node logs (oc adm node-logs)
  • OCP-33783: Enable must gather on Windows node [Slow][Disruptive]
  • OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]
  • OCP-70922: Monitor CPU, Memory, and Filesystem graphs for Windows Pods
  • OCP-73752: Monitor Network In, and Network Out graphs for Windows Pods

OCP-70922 and OCP-73752 query Thanos directly via the thanos-querier route
using a prometheus-k8s service account token, with no dependency on
compat_otp.NewPrometheusMonitor.

Planned migration batches (WINC-1778):

Batch Category Tests
1 Node Basics 32554, 32615, 33612
2 Networking 28632, 32273, 31276, 39451
3 (this PR) Logging & Monitoring 33779, 33783, 73595, 70922, 73752
4 WICD & Services 54711, 50403, 56354, 76765, 33794
5 Proxy 65980, 66670, 68320
6 Workloads & Misc 25593, 42204, 48873, 38186, 38188, 60814, 74760

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration
OCP-33779 ✅ passed 2.6s
OCP-33783 ✅ passed 366s (must-gather)
OCP-73595 ✅ passed 0.4s
OCP-70922 ⚠️ Thanos query works; metric availability depends on recording rules
OCP-73752 ⚠️ Thanos query works; metric availability depends on recording rules
Full test output (33779, 33783, 73595)

OCP-33779

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]

OCP-33783

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]

OCP-73595

[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]

Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 2, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: b4fc80e1-bd60-43f4-876f-b030b5d229ba

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request extends the WMCO (Windows Machine Config Operator) test suite with additional observability and validation checks. A new batch of five test specifications was registered in the extension test harness, each invoking a corresponding check function. A new logging_monitoring.go module was introduced containing five exported check functions that validate Windows node logs, must-gather artifacts, filesystem metrics, and pod-level CPU/memory/network/filesystem metrics via Thanos queries. Supporting utilities include Thanos query construction with OAuth token handling and JSON result parsing. The integration leverages the oc CLI client for cluster inspection and metrics aggregation.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Go Best Practices & Build Tags ⚠️ Warning File logging_monitoring.go lacks build tag directive required for Windows-only test code and contains HTTP status code validation gaps in queryThanos function. Add //go:build windows directive before package declaration. Validate resp.StatusCode == http.StatusOK immediately after response receipt and replace InsecureSkipVerify: true with proper TLS verification using in-cluster certificates.
✅ Passed checks (4 passed)
Check name Status Explanation
Security: Secrets, Ssh & Csr ✅ Passed The new test code does not handle secrets, SSH connections, or CSR approvals; it only executes CLI commands and processes output.
Kubernetes Controller Patterns ✅ Passed This pull request modifies only test utilities (wmco-tests-ext/main.go and logging_monitoring.go), not controller implementations, so controller pattern validation is not applicable.
Windows Service Management ✅ Passed Windows Service Management check is not applicable; PR implements test validation only, performing no SCM interactions or service lifecycle operations.
Platform-Specific Requirements ✅ Passed PR migrates logging/monitoring tests to OTE framework without introducing platform-specific infrastructure provisioning or constraints; tests validate cluster behavior across all cloud providers.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: weinliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 2, 2026

@weinliu: This pull request references WINC-1778 which is a valid jira issue.

Details

In response to this:

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (5 tests):

  • OCP-33779: Retrieving Windows node logs (oc adm node-logs)
  • OCP-33783: Enable must gather on Windows node [Slow][Disruptive]
  • OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]
  • OCP-70922: Monitor CPU, Memory, and Filesystem graphs for Windows Pods
  • OCP-73752: Monitor Network In, and Network Out graphs for Windows Pods

OCP-70922 and OCP-73752 query Thanos directly via the thanos-querier route
using a prometheus-k8s service account token, with no dependency on
compat_otp.NewPrometheusMonitor.

Planned migration batches (WINC-1778):

Batch Category Tests
1 Node Basics 32554, 32615, 33612
2 Networking 28632, 32273, 31276, 39451
3 (this PR) Logging & Monitoring 33779, 33783, 73595, 70922, 73752
4 WICD & Services 54711, 50403, 56354, 76765, 33794
5 Proxy 65980, 66670, 68320
6 Workloads & Misc 25593, 42204, 48873, 38186, 38188, 60814, 74760

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration
OCP-33779 ✅ passed 2.6s
OCP-33783 ✅ passed 366s (must-gather)
OCP-73595 ✅ passed 0.4s
OCP-70922 ⚠️ Thanos query works; metric availability depends on recording rules
OCP-73752 ⚠️ Thanos query works; metric availability depends on recording rules
Full test output (33779, 33783, 73595)

OCP-33779

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]

OCP-33783

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]

OCP-73595

[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]

Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

Summary by CodeRabbit

  • Tests
  • Added extended validation tests for Windows node monitoring, including checks for node logs, pod metrics (CPU, memory, network, filesystem), filesystem graphs, and must-gather collection.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
ote/test/extended/logging_monitoring.go (1)

23-35: Avoid re-fetching token and route on every metric query

queryThanos runs oc create token and oc get route each call, which scales poorly across pods/metrics. Resolve these once per check and reuse.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ote/test/extended/logging_monitoring.go` around lines 23 - 35, queryThanos
currently calls oc create token and oc get route on every invocation which is
inefficient; refactor so the token and thanosHost are retrieved once and reused:
add a helper like getThanosCredentials (or use a sync.Once-backed package-level
cache) that calls oc.AsAdmin().Output for the token and route a single time,
then change queryThanos to accept token and thanosHost as parameters (or read
the cached values) and update its callers to supply them; ensure error handling
remains and that the cache refresh strategy (if any) is documented where you add
getThanosCredentials/queryThanos changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ote/test/extended/logging_monitoring.go`:
- Around line 60-68: thanosHasResult currently only unmarshals into Data.Result
and treats any non-result as "no metric"; update it to also unmarshal/inspect
top-level fields (e.g., status and error) from the Thanos response and return an
error when status == "error" or an error field is present. Specifically, extend
the result struct (or unmarshal into a wrapper) to include Status string and
Error (or ErrorType/ ErrorObject) fields, check those after json.Unmarshal of
response in thanosHasResult, and when an error-status is detected return a
formatted error containing the Thanos error details instead of returning false.
- Line 234: The PromQL selector passed to queryThanos is only filtering by pod
(query+`{pod="`+pod+`"}`), which can match same-named pods in other namespaces;
update the selector to include namespace by changing the call(s) to include
namespace=" + wmcoNamespace (e.g.
query+`{pod="`+pod+`",namespace="`+wmcoNamespace+`"}`) so the query is scoped to
the WMCO namespace; update both occurrences where queryThanos is invoked with
the pod-only selector (including the call around the symbol queryThanos at the
earlier occurrence and the similar call at the later occurrence).
- Around line 46-55: The current client.Do(req) response handling treats non-200
HTTP responses as normal payloads; after performing client.Do(req) and before
returning the body in the code around resp, err := client.Do(req) / defer
resp.Body.Close(), add a status check: if resp.StatusCode is not 200 (or not in
the expected 2xx range) read a short snippet of resp.Body (e.g., first ~200
bytes), and return a formatted error that includes resp.Status and the trimmed
snippet to fail fast on auth/outage errors; keep the existing error wraps for
io.ReadAll and ensure resp.Body is still closed (defer) after the nil err check.
- Around line 153-154: The test currently uses a fixed destDir (destDir :=
"/tmp/must-gather-33783") which can collide across runs; replace this by
creating a unique temp directory via os.MkdirTemp (e.g., call os.MkdirTemp("",
"must-gather-") and assign to destDir) and keep the deferred cleanup via defer
os.RemoveAll(destDir) to remove the created directory; update references that
use destDir accordingly and handle the MkdirTemp error (fail the test or
t.Fatal) before deferring RemoveAll.
- Around line 41-46: The HTTP client currently disables TLS verification, has no
timeout, and doesn't validate HTTP status codes; update the client creation to
remove InsecureSkipVerify (use default TLS config or the cluster CA), add a
sensible timeout on the http.Client (e.g., Client{Timeout: ...}), and after
client.Do(req) check resp != nil then validate resp.StatusCode is 2xx—on non-2xx
return a clear error (including status and body) and ensure resp.Body is closed
before returning; reference the Transport/TLSClientConfig/InsecureSkipVerify,
http.Client{Timeout: ...}, and the client.Do(req)/resp handling code paths to
locate and fix the logic.

---

Nitpick comments:
In `@ote/test/extended/logging_monitoring.go`:
- Around line 23-35: queryThanos currently calls oc create token and oc get
route on every invocation which is inefficient; refactor so the token and
thanosHost are retrieved once and reused: add a helper like getThanosCredentials
(or use a sync.Once-backed package-level cache) that calls oc.AsAdmin().Output
for the token and route a single time, then change queryThanos to accept token
and thanosHost as parameters (or read the cached values) and update its callers
to supply them; ensure error handling remains and that the cache refresh
strategy (if any) is documented where you add getThanosCredentials/queryThanos
changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 04d2ffb6-21e0-43ba-8be2-0b21061beaba

📥 Commits

Reviewing files that changed from the base of the PR and between 13e7a27 and 4956075.

📒 Files selected for processing (2)
  • ote/cmd/wmco-tests-ext/main.go
  • ote/test/extended/logging_monitoring.go

Comment thread ote/test/extended/logging_monitoring.go Outdated
Comment on lines +41 to +46
client := &http.Client{
Transport: &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
},
}
resp, err := client.Do(req)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the full context of the queryThanos function and surrounding code
cat -n ote/test/extended/logging_monitoring.go | head -80

Repository: openshift/windows-machine-config-operator

Length of output: 3080


🏁 Script executed:

# Check if there are any environment-specific comments or configurations explaining the insecure client
rg -A 5 -B 5 "InsecureSkipVerify" ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 442


🏁 Script executed:

# Check if CA certificates or kubelet configuration is available in the test environment
rg -i "ca.*cert|kubelet|tls.*config" ote/test/extended/ | head -20

Repository: openshift/windows-machine-config-operator

Length of output: 940


🏁 Script executed:

# Look for any timeout or context cancellation mechanisms in queryThanos or its callers
rg -A 10 "func queryThanos" ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 555


🏁 Script executed:

# Check if this test runs against real/self-signed certs or mocked endpoints
rg -i "thanos|prometheus" ote/test/extended/logging_monitoring.go | grep -E "(url|endpoint|host|cert)" | head -10

Repository: openshift/windows-machine-config-operator

Length of output: 179


🏁 Script executed:

# Check if CA certificates are typically available in OpenShift clusters
# Look for any documentation or other test patterns handling TLS
rg -i "kubernetes.io.*ca|serviceaccount.*ca" ote/test/ | head -10

Repository: openshift/windows-machine-config-operator

Length of output: 67


🏁 Script executed:

# Check if there are other HTTP clients in the test code with proper TLS handling
rg -B 3 -A 3 "http.Client\{" ote/test/

Repository: openshift/windows-machine-config-operator

Length of output: 617


🏁 Script executed:

# Check if there are timeout patterns used elsewhere in test code
rg "time.Second|Timeout:" ote/test/ | head -20

Repository: openshift/windows-machine-config-operator

Length of output: 67


🏁 Script executed:

# Look at the actual Thanos API usage - does the code check status codes?
cat -n ote/test/extended/logging_monitoring.go | sed -n '46,56p'

Repository: openshift/windows-machine-config-operator

Length of output: 442


🏁 Script executed:

# Check if there's context passed to queryThanos that could provide timeout info
rg "queryThanos\(" ote/test/extended/logging_monitoring.go | head -5

Repository: openshift/windows-machine-config-operator

Length of output: 260


Secure Thanos client with proper TLS verification, timeout, and error handling

The HTTP client is missing three critical safeguards:

  1. Line 43: InsecureSkipVerify: true disables certificate validation, exposing the bearer token to MITM attacks. OpenShift clusters provide proper CA certificates in-cluster; use them.
  2. Lines 41–45: No timeout configured. The client will hang indefinitely if Thanos is slow or unresponsive, causing test flakiness.
  3. Line 46–55: No HTTP status code validation. Errors like 401 (auth failures) or 500 (server errors) are silently treated as valid responses and passed to the JSON parser.
Suggested fix
 import (
 	"context"
 	"crypto/tls"
 	"encoding/json"
 	"fmt"
 	"io"
 	"net/http"
 	"net/url"
 	"os"
 	"strconv"
 	"strings"
+	"time"
@@
 	client := &http.Client{
+		Timeout: 30 * time.Second,
 		Transport: &http.Transport{
-			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
+			TLSClientConfig: &tls.Config{},
 		},
 	}
 	resp, err := client.Do(req)
 	if err != nil {
 		return "", fmt.Errorf("thanos query failed: %w", err)
 	}
+	if resp.StatusCode != http.StatusOK {
+		return "", fmt.Errorf("thanos returned %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode))
+	}
 	defer resp.Body.Close()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
client := &http.Client{
Transport: &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
},
}
resp, err := client.Do(req)
import (
"context"
"crypto/tls"
"encoding/json"
"fmt"
"io"
"net/http"
"net/url"
"os"
"strconv"
"strings"
"time"
)
client := &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{
TLSClientConfig: &tls.Config{},
},
}
resp, err := client.Do(req)
if err != nil {
return "", fmt.Errorf("thanos query failed: %w", err)
}
if resp.StatusCode != http.StatusOK {
return "", fmt.Errorf("thanos returned %d: %s", resp.StatusCode, http.StatusText(resp.StatusCode))
}
defer resp.Body.Close()
🧰 Tools
🪛 ast-grep (0.42.0)

[warning] 42-42: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{InsecureSkipVerify: true}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ote/test/extended/logging_monitoring.go` around lines 41 - 46, The HTTP
client currently disables TLS verification, has no timeout, and doesn't validate
HTTP status codes; update the client creation to remove InsecureSkipVerify (use
default TLS config or the cluster CA), add a sensible timeout on the http.Client
(e.g., Client{Timeout: ...}), and after client.Do(req) check resp != nil then
validate resp.StatusCode is 2xx—on non-2xx return a clear error (including
status and body) and ensure resp.Body is closed before returning; reference the
Transport/TLSClientConfig/InsecureSkipVerify, http.Client{Timeout: ...}, and the
client.Do(req)/resp handling code paths to locate and fix the logic.

Comment thread ote/test/extended/logging_monitoring.go Outdated
Comment on lines +46 to +55
resp, err := client.Do(req)
if err != nil {
return "", fmt.Errorf("thanos query failed: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return "", fmt.Errorf("failed to read thanos response: %w", err)
}
return string(body), nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate HTTP status codes from Thanos

On Line 46, non-200 responses are currently treated like normal payloads, which can misclassify auth/outage failures as “no metric”. Fail fast on unexpected status with a short body snippet.

Suggested fix
 	resp, err := client.Do(req)
 	if err != nil {
 		return "", fmt.Errorf("thanos query failed: %w", err)
 	}
 	defer resp.Body.Close()
 	body, err := io.ReadAll(resp.Body)
 	if err != nil {
 		return "", fmt.Errorf("failed to read thanos response: %w", err)
 	}
+	if resp.StatusCode != http.StatusOK {
+		snippet := string(body)
+		if len(snippet) > 200 {
+			snippet = snippet[:200]
+		}
+		return "", fmt.Errorf("thanos query returned HTTP %d: %s", resp.StatusCode, snippet)
+	}
 	return string(body), nil
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
resp, err := client.Do(req)
if err != nil {
return "", fmt.Errorf("thanos query failed: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return "", fmt.Errorf("failed to read thanos response: %w", err)
}
return string(body), nil
resp, err := client.Do(req)
if err != nil {
return "", fmt.Errorf("thanos query failed: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return "", fmt.Errorf("failed to read thanos response: %w", err)
}
if resp.StatusCode != http.StatusOK {
snippet := string(body)
if len(snippet) > 200 {
snippet = snippet[:200]
}
return "", fmt.Errorf("thanos query returned HTTP %d: %s", resp.StatusCode, snippet)
}
return string(body), nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ote/test/extended/logging_monitoring.go` around lines 46 - 55, The current
client.Do(req) response handling treats non-200 HTTP responses as normal
payloads; after performing client.Do(req) and before returning the body in the
code around resp, err := client.Do(req) / defer resp.Body.Close(), add a status
check: if resp.StatusCode is not 200 (or not in the expected 2xx range) read a
short snippet of resp.Body (e.g., first ~200 bytes), and return a formatted
error that includes resp.Status and the trimmed snippet to fail fast on
auth/outage errors; keep the existing error wraps for io.ReadAll and ensure
resp.Body is still closed (defer) after the nil err check.

Comment thread ote/test/extended/logging_monitoring.go Outdated
Comment on lines +60 to +68
var result struct {
Data struct {
Result []interface{} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal([]byte(response), &result); err != nil {
return false, fmt.Errorf("failed to parse thanos response: %w", err)
}
return len(result.Data.Result) > 0, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle Thanos API error payloads explicitly

thanosHasResult only checks data.result. If Thanos returns {"status":"error",...}, this is reported as missing metric instead of query failure. Parse and validate top-level status/error fields.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ote/test/extended/logging_monitoring.go` around lines 60 - 68,
thanosHasResult currently only unmarshals into Data.Result and treats any
non-result as "no metric"; update it to also unmarshal/inspect top-level fields
(e.g., status and error) from the Thanos response and return an error when
status == "error" or an error field is present. Specifically, extend the result
struct (or unmarshal into a wrapper) to include Status string and Error (or
ErrorType/ ErrorObject) fields, check those after json.Unmarshal of response in
thanosHasResult, and when an error-status is detected return a formatted error
containing the Thanos error details instead of returning false.

Comment thread ote/test/extended/logging_monitoring.go Outdated
Comment on lines +153 to +154
destDir := "/tmp/must-gather-33783"
defer os.RemoveAll(destDir)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use a unique must-gather destination directory

On Line 153, fixed path /tmp/must-gather-33783 can collide across concurrent runs and stale leftovers. Use os.MkdirTemp and defer cleanup of the created dir.

Suggested fix
-	destDir := "/tmp/must-gather-33783"
-	defer os.RemoveAll(destDir)
+	destDir, err := os.MkdirTemp("", "must-gather-33783-*")
+	if err != nil {
+		return fmt.Errorf("failed to create temp must-gather directory: %w", err)
+	}
+	defer os.RemoveAll(destDir)
 
-	msg, err := oc.AsAdmin().Output("adm", "must-gather", "--dest-dir="+destDir)
+	msg, err := oc.AsAdmin().Output("adm", "must-gather", "--dest-dir="+destDir)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
destDir := "/tmp/must-gather-33783"
defer os.RemoveAll(destDir)
destDir, err := os.MkdirTemp("", "must-gather-33783-*")
if err != nil {
return fmt.Errorf("failed to create temp must-gather directory: %w", err)
}
defer os.RemoveAll(destDir)
msg, err := oc.AsAdmin().Output("adm", "must-gather", "--dest-dir="+destDir)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ote/test/extended/logging_monitoring.go` around lines 153 - 154, The test
currently uses a fixed destDir (destDir := "/tmp/must-gather-33783") which can
collide across runs; replace this by creating a unique temp directory via
os.MkdirTemp (e.g., call os.MkdirTemp("", "must-gather-") and assign to destDir)
and keep the deferred cleanup via defer os.RemoveAll(destDir) to remove the
created directory; update references that use destDir accordingly and handle the
MkdirTemp error (fail the test or t.Fatal) before deferring RemoveAll.

Comment on lines +195 to +203
storage, err := oc.AsAdmin().Output("get", "node", host,
"-o=jsonpath={.status.allocatable['ephemeral-storage']}")
if err != nil {
return fmt.Errorf("failed to get ephemeral storage for node %s: %w", host, err)
}
val, err := strconv.ParseInt(strings.TrimSuffix(storage, "Ki"), 10, 64)
if err != nil {
return fmt.Errorf("failed to parse storage value %q for node %s: %w", storage, host, err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
fd -t f "logging_monitoring.go" | head -5

Repository: openshift/windows-machine-config-operator

Length of output: 124


🏁 Script executed:

# Check the file size before reading
wc -l ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 128


🏁 Script executed:

# Read the specific lines in question
sed -n '190,210p' ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 735


🏁 Script executed:

# Check current imports in the file
head -30 ote/test/extended/logging_monitoring.go

Repository: openshift/windows-machine-config-operator

Length of output: 918


🏁 Script executed:

# Search for other uses of resource.ParseQuantity in the codebase to see if this pattern is used elsewhere
rg "resource\.ParseQuantity" --type go

Repository: openshift/windows-machine-config-operator

Length of output: 332


🏁 Script executed:

# Check if k8s.io/apimachinery is already a dependency in the project
fd -t f "go.mod" | xargs head -50

Repository: openshift/windows-machine-config-operator

Length of output: 2834


🏁 Script executed:

# Check what dependencies ote tests typically need
grep -r "import" ote/test/extended/*.go | head -20

Repository: openshift/windows-machine-config-operator

Length of output: 171


🏁 Script executed:

# Look for any other quantity/resource parsing in ote module
rg -t go "ParseQuantity|ParseInt|resource\." ote/

Repository: openshift/windows-machine-config-operator

Length of output: 198


🏁 Script executed:

# Check if ote module would be able to add k8s.io/apimachinery as a dependency by checking its usage elsewhere
grep -r "k8s.io" ote/ --type go

Repository: openshift/windows-machine-config-operator

Length of output: 202


🏁 Script executed:

# Verify the exact Kubernetes quantity format that gets returned from the jsonpath query
# by looking at how ephemeral-storage is documented in similar code
rg -t go "ephemeral-storage" -B 3 -A 5

Repository: openshift/windows-machine-config-operator

Length of output: 8946


Use resource.ParseQuantity to handle all Kubernetes quantity formats, not just Ki

The current code breaks for quantities returned in formats other than Ki (e.g., Gi, G, Mi, M). The jsonpath query for ephemeral-storage returns a Kubernetes quantity, which can be expressed with binary (Ki, Mi, Gi, Ti, Pi, Ei) or decimal (K, M, G, T, P, E) suffixes. Replace the fragile strings.TrimSuffix(storage, "Ki") approach with resource.ParseQuantity().

This requires adding k8s.io/apimachinery to ote/go.mod (the root module already has it at v0.35.2, and the pattern is already used in pkg/services/services.go). The fix also updates the sign check from numeric to use Quantity.Sign():

Suggested fix
 import (
@@
 	"strconv"
 	"strings"
 
 	"github.com/openshift/windows-machine-config-operator/ote/test/extended/cli"
+	"k8s.io/apimachinery/pkg/api/resource"
 )
@@
-		val, err := strconv.ParseInt(strings.TrimSuffix(storage, "Ki"), 10, 64)
+		q, err := resource.ParseQuantity(storage)
 		if err != nil {
 			return fmt.Errorf("failed to parse storage value %q for node %s: %w", storage, host, err)
 		}
-		if val <= 0 {
-			return fmt.Errorf("expected positive storage value for node %s, got %d", host, val)
+		if q.Sign() <= 0 {
+			return fmt.Errorf("expected positive storage value for node %s, got %q", host, storage)
 		}

Comment thread ote/test/extended/logging_monitoring.go Outdated

for _, pod := range pods {
for metricName, query := range metrics {
resp, err := queryThanos(oc, query+`{pod="`+pod+`"}`)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Scope PromQL to WMCO namespace to avoid false positives

Both queries filter only by pod. A same-named pod in another namespace can satisfy the check. Add namespace="<wmcoNamespace>" to the selector.

Suggested fix
-			resp, err := queryThanos(oc, query+`{pod="`+pod+`"}`)
+			resp, err := queryThanos(oc, query+`{pod="`+pod+`",namespace="`+wmcoNamespace+`"}`)
@@
-			resp, err := queryThanos(oc, query+`{pod="`+pod+`"}`)
+			resp, err := queryThanos(oc, query+`{pod="`+pod+`",namespace="`+wmcoNamespace+`"}`)

Also applies to: 272-272

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ote/test/extended/logging_monitoring.go` at line 234, The PromQL selector
passed to queryThanos is only filtering by pod (query+`{pod="`+pod+`"}`), which
can match same-named pods in other namespaces; update the selector to include
namespace by changing the call(s) to include namespace=" + wmcoNamespace (e.g.
query+`{pod="`+pod+`",namespace="`+wmcoNamespace+`"}`) so the query is scoped to
the WMCO namespace; update both occurrences where queryThanos is invoked with
the pod-only selector (including the call around the symbol queryThanos at the
earlier occurrence and the similar call at the later occurrence).

@weinliu weinliu force-pushed the WINC-1778-ote-batch3-logging-monitoring branch from 4956075 to a7e9d6b Compare April 2, 2026 23:47
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 2, 2026

@weinliu: This pull request references WINC-1778 which is a valid jira issue.

Details

In response to this:

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (3 tests):

  • OCP-33779: Retrieving Windows node logs (oc adm node-logs)
  • OCP-33783: Enable must gather on Windows node [Slow][Disruptive]
  • OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]

Planned migration batches (WINC-1778):

Batch Category Tests
1 Node Basics 32554, 32615, 33612
2 Networking 28632, 32273, 31276, 39451
3 (this PR) Logging & Monitoring 33779, 33783, 73595
4 WICD & Services 54711, 50403, 56354, 76765, 33794
5 Proxy 65980, 66670, 68320
6 Workloads & Misc 25593, 42204, 48873, 38186, 38188, 60814, 74760

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration
OCP-33779 ✅ passed 2.6s
OCP-33783 ✅ passed 366s (must-gather)
OCP-73595 ✅ passed 0.4s
Full test output

OCP-33779

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]

OCP-33783

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]

OCP-73595

[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]

Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@weinliu
Copy link
Copy Markdown
Contributor Author

weinliu commented Apr 3, 2026

/retest security

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 3, 2026

@weinliu: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test aws-e2e-operator
/test azure-e2e-operator
/test azure-e2e-upgrade
/test ci-bundle-wmco-bundle
/test gcp-e2e-operator
/test images
/test lint
/test nutanix-e2e-operator
/test platform-none-vsphere-e2e-operator
/test security
/test unit
/test vsphere-disconnected-e2e-operator
/test vsphere-e2e-operator
/test vsphere-proxy-e2e-operator
/test wicd-unit-vsphere

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-windows-machine-config-operator-master-ci-bundle-wmco-bundle
pull-ci-openshift-windows-machine-config-operator-master-images
pull-ci-openshift-windows-machine-config-operator-master-lint
pull-ci-openshift-windows-machine-config-operator-master-security
pull-ci-openshift-windows-machine-config-operator-master-unit
Details

In response to this:

/retest security

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 3, 2026

@weinliu: This pull request references WINC-1778 which is a valid jira issue.

Details

In response to this:

Summary

Migrates the Logging & Monitoring batch of WINC tests as part of WINC-1778.

Batch 3 — Logging & Monitoring (3 tests):

  • OCP-33779: Retrieving Windows node logs (oc adm node-logs)
  • OCP-33783: Enable must gather on Windows node [Slow][Disruptive]
  • OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]

Planned migration batches (WINC-1778):

Batch Category Tests Status
0 wmco golang version 37362 ✅ merged
1 (PR #3920) Node Basics 32554, 32615, 33612 🔄 in review
2 (PR #3922) Networking 28632, 32273, 31276 🔄 in review
3 (PR #3921) Logging & Monitoring 33779, 33783, 73595 🔄 in review
4 WICD, Workloads & Misc (no SSH) 50403, 60814, 25593, 42204, 38186 📋 planned
5 SSH-dependent tests 54711, 56354, 76765, 48873, 65980, 66670, 68320 ⏳ pending SSH decision
Special Needs special cluster 74760, 38188 ⏳ pending

Test Results (OCP 4.21 nightly, IPI on AWS)

Test Result Duration
OCP-33779 ✅ passed 2.6s
OCP-33783 ✅ passed 366s (must-gather)
OCP-73595 ✅ passed 0.4s
Full test output

OCP-33779

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33779-[wmco] Retrieving Windows node logs [OTP]", "duration": 2631, "startTime": "2026-04-02 22:57:23.184498 UTC", "endTime": "2026-04-02 22:57:25.816161 UTC", "result": "passed", "output": ""}]

OCP-33783

[{"name": "[sig-windows] Windows_Containers Author:sgao-Smokerun-Critical-33783-[wmco] Enable must gather on Windows node [Slow][Disruptive] [OTP]", "duration": 366717, "startTime": "2026-04-02 22:59:45.240881 UTC", "endTime": "2026-04-02 23:05:51.958555 UTC", "result": "passed", "output": ""}]

OCP-73595

[{"name": "[sig-windows] Windows_Containers Author:weinliu-Smokerun-Medium-73595-[wmco] Verify Display of Filesystem Graphs (metrics) for Windows Nodes [Serial] [OTP]", "duration": 424, "startTime": "2026-04-02 22:57:31.315903 UTC", "endTime": "2026-04-02 22:57:31.740254 UTC", "result": "passed", "output": ""}]

Cluster: winc-regression 6211 (OCP 4.21.0-0.nightly-2026-04-02, IPI on AWS)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Migrate the following logging & monitoring smokerun tests from
openshift-tests-private to the OTE extension binary:

- OCP-33779: Retrieving Windows node logs (oc adm node-logs)
- OCP-33783: Enable must gather on Windows node [Slow][Disruptive]
- OCP-73595: Verify Display of Filesystem Graphs for Windows Nodes [Serial]
- OCP-70922: Monitor CPU, Memory, and Filesystem graphs for Windows Pods
- OCP-73752: Monitor Network In, and Network Out graphs for Windows Pods

OCP-70922 and OCP-73752 query Thanos directly via the thanos-querier route
using a prometheus-k8s service account token, avoiding the compat_otp
PrometheusMonitor dependency.
@weinliu weinliu force-pushed the WINC-1778-ote-batch3-logging-monitoring branch from a7e9d6b to 23749c5 Compare April 6, 2026 01:22
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 6, 2026

@weinliu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Copy Markdown

Closing, please open a separate request if this work is still required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants