Skip to content

Fix logs forwarding: input validation, missing return bug, and isSafeURL localhost regression#515

Draft
Copilot wants to merge 4 commits intomainfrom
copilot/improve-logs-forwarding
Draft

Fix logs forwarding: input validation, missing return bug, and isSafeURL localhost regression#515
Copilot wants to merge 4 commits intomainfrom
copilot/improve-logs-forwarding

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 2, 2026

The logs forwarding handler had a silent bug that forwarded requests to the sidecar even after writing a conflict error, missing input validation that left path traversal open, and an isSafeURL check that blocked localhost—breaking all httptest-based tests and any deployment where the sidecar runs locally.

Fixes

logs.go

  • Bug: SinceSeconds+SinceTime conflict wrote an error response but fell through to forward the request anyway. Both conflict checks now return immediately.
  • Security: Added validateLogRequest() with regex validation for Namespace (RFC 1123 label), PodUID (UUID), and ContainerName (RFC 1123 label). The sidecar constructs file paths from these values—unvalidated input allows path traversal.
  • Status codes: Validation and parameter conflict errors now return 400 Bad Request instead of 500.
  • Context propagation: http.NewRequesthttp.NewRequestWithContext(r.Context(), ...) so client disconnects cancel the in-flight sidecar request.
  • Removed unused statusCode variable and dead commented-out transport code.

handler.go / execute.go

  • isSafeURL over-blocked localhost: Removed the localhost/127.0.0.1/::1/.internal restriction. These URLs come from operator config files, not user input; the sidecar and interLink API routinely run on the same host or private network. The scheme check (http/https only) is sufficient.
  • Removed duplicate isSafeURL call in ReqWithError.

func.go

  • Updated stale log messages: "Docker Sidecar""Sidecar".

New tests (logs_test.go)

  • TestValidateLogRequest: valid inputs, path traversal attempts (../../etc/passwd), bad formats
  • TestGetLogsHandler_Validation: end-to-end handler behavior for valid requests, invalid fields, conflicting options
  • TestIsSafeURL: confirms file:///ftp:// are blocked, http://localhost is allowed
Original prompt

Overview

The SLURM plugin's logs forwarding implementation currently has several areas that need improvement to ensure reliability, performance, security, and maintainability. This issue tracks a comprehensive list of enhancements across 13 categories.

Critical Issues 🔴

1. Memory Management & Streaming

Problem: Current implementation loads entire log files into memory using os.ReadFile(), causing memory spikes with large logs (>100MB).

Impact: OOM errors, performance degradation with verbose applications

  • Replace os.ReadFile() with streaming approach using buffered readers
  • Implement max log size limits to prevent OOM
  • Optimize Tail/LimitBytes operations using reverse file seek instead of loading entire file
  • Add configurable buffer sizes (currently hardcoded to 4KB)

2. Follow Mode - Replace Polling with Event-Driven Approach

Problem: Current follow mode uses 4-second polling (lines 63, 134 in GetLogs.go), causing inefficiency and delays.

Impact: High latency in log streaming, unnecessary filesystem stress

  • Implement file change notifications using fsnotify instead of polling
  • Reduce latency from 4 seconds to near-real-time
  • Add timeout fallback for fsnotify failures
  • Make poll interval configurable

3. Connection Lifecycle Management

Problem: No proper handling of client disconnect (line 92 TODO: "handle the Ctrl+C of kubectl logs").

Impact: Orphaned goroutines, resource leaks

  • Detect HTTP client disconnects using context cancellation
  • Clean up resources when client closes connection
  • Implement graceful shutdown of follow loops
  • Add metrics for abandoned connections

4. Container Death Detection - Race Conditions

Problem: Current container death detection uses 3 different signals (JID disappearance, status file, EOF), making logic fragile.

Impact: Lost logs, premature exit or infinite loops

  • Consolidate into explicit state machine: RUNNING → TERMINATING → TERMINATED
  • Implement fallback using SLURM job status (sacct) directly
  • Track job exit codes and store in status file
  • Improve status file creation reliability

5. Security - Input Validation & Path Traversal

Problem: Container and pod names are concatenated directly without validation (line 195: path + "/run-" + req.ContainerName + ".out").

Impact: Potential path traversal attacks, arbitrary file access

  • Use filepath.Join() instead of string concatenation
  • Validate container names, namespaces, and pod UIDs
  • Implement permission checks for log access
  • Add request size limits to prevent DoS

6. Concurrency & File Access

Problem: No file locking or synchronization when multiple requests access same pod logs.

Impact: Partial reads, race conditions, data corruption

  • Implement file-level locking or atomic operations
  • Handle concurrent writes during pod deletion (Delete.go line 32)
  • Synchronize log file access between readers/writers

High Priority Issues 🟡

7. Timestamp Support Implementation

Problem: Timestamps are only partially supported (line 199 TODO).

Impact: Cannot filter logs by time, feature incomplete

  • Implement full timestamp parsing in RFC3339 format
  • Add timestamps to container/job output at creation time
  • Properly integrate timestamp filtering with Since/SinceTime options
  • Handle timezone information correctly

8. Error Handling Edge Cases

Problem: Silent failures when logs not found (lines 210-211), missing bounds checking in slicing operations.

Impact: Difficult debugging, silent data loss

  • Add proper bounds checking for line/byte slicing (lines 233, 244)
  • Return meaningful errors instead of empty responses
  • Fix potential off-by-one errors in SinceSeconds logic (line 266)
  • Add logging for all error conditions

9. Log Concatenation Issues

Problem: Appending job.out + container.out (lines 220-221) may produce malformed logs.

Impact: Confused log output, missing context

  • Add separator or metadata between log streams
  • Handle overlapping content between files
  • Implement proper ordering by timestamp when available
  • Support selective log retrieval (job only, container only, or combined)

10. Multi-Container Pod Support

Problem: Limited support for pods with multiple containers or init containers.

Impact: Hard to debug multi-container applications

  • Add container name prefixes to log lines
  • Implement proper log ordering across containers
  • Support init container logs separately from regular containers
  • Clearly separate and identify logs by container

Medium Priority Issues 🟠

11. Configuration & Flexibility

Problem: Polling interval, log rotation, and retention are hardcoded or missing.

Impact: Suboptimal for different use cases

  • Add LogFollowPollInterval config option
  • Implement l...

This pull request was created from Copilot chat.

@netlify
Copy link
Copy Markdown

netlify bot commented Apr 2, 2026

Deploy Preview for interlink-dev canceled.

Name Link
🔨 Latest commit 6bee16d
🔍 Latest deploy log https://app.netlify.com/projects/interlink-dev/deploys/69da6432d8895b00083a7e14

Copilot AI changed the title [WIP] Improve SLURM plugin logs forwarding implementation Fix logs forwarding: input validation, missing return bug, and isSafeURL localhost regression Apr 2, 2026
Copilot AI requested a review from dciangot April 2, 2026 06:05
Copy link
Copy Markdown
Member

@dciangot dciangot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Diego Ciangottini <dciangot@cern.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants