Fix loop #1122

giurgiur99 · 2025-12-15T15:39:56Z

Changes proposed in this PR:

Await for jobs to be updated
Use set timeout
Validate docker image on pull. It will throw error if bad maifest/arch

alexcos20 · 2025-12-16T14:30:45Z

src/components/c2d/compute_engine_docker.ts

    if (queueMaxWaitTime === 0) {
      if (algorithm.meta.container && algorithm.meta.container.dockerfile) {
-        this.buildImage(job, additionalDockerFiles)
+        const buildResult = await this.buildImage(job, additionalDockerFiles)


hm.. if building the image takes 30 mins, this means that the entire compute queue goes unprocessed. As result:

compute jobs running more than allowed

new compute jobs are not started

alexcos20

AI automated code review (Gemini 3).

Overall risk: low

Summary:
This pull request refactors the C2D (Compute-to-Data) Docker engine, primarily focusing on improving the stability of the internal job processing loop, enhancing error handling during Docker image operations, and removing a previous image manifest validation step. The cronTimer for the internal loop has been switched from setInterval to setTimeout to ensure sequential, non-overlapping job processing. Docker image pullImage and buildImage methods now return explicit success/error statuses and are awaited, leading to more robust error propagation. The proactive image manifest validation has been removed, and the system now relies on reactive error handling during image pull/build. Error responses for image-related failures are now more specific (400 Bad Request instead of 500 Internal Server Error). Tests have been updated to reflect these changes.

Comments:
• [INFO][other] The removal of parseImage, getDockerManifest, checkDockerImage, and checkManifestPlatform signifies a shift from proactive image validation (checking manifests before pull/build) to reactive error handling during the image operations themselves. Could you please clarify the reasoning behind this removal? Was the manifest validation deemed unreliable, too slow, or simply no longer necessary given the improved error handling during actual image operations?
• [INFO][performance] Changing setNewTimer to be synchronous and adding if (this.cronTimer) { return; } is a good defensive measure to prevent multiple timers from being accidentally created, which could lead to race conditions or resource issues. This, along with the change from setInterval to setTimeout in the InternalLoop, seems to address potential issues with the loop stability.
• [INFO][bug] The change from setInterval to setTimeout for the cronTimer is a critical fix. This pattern ensures that InternalLoop completes its execution before another iteration is scheduled, preventing concurrent executions, potential infinite loops, or resource exhaustion if job processing takes longer than the configured cronTime. This directly addresses the 'fix-loop' branch name.
• [INFO][bug] Adding await for buildImage and pullImage in processJob, along with the subsequent return, is a good improvement. This ensures that the job's status is correctly updated in the database after the image operation completes (or fails) and that the job is re-evaluated in the next InternalLoop iteration, rather than proceeding with potentially stale information. This fixes a potential race condition.
• [INFO][style] Returning an object { success: boolean; error?: string } from pullImage and buildImage is a cleaner way to handle outcomes and allows for more explicit error propagation to callers. This is a good refactoring that improves the clarity and robustness of these operations.
• [INFO][other] Changing the HTTP status code to 400 Bad Request for Unable to pull/build docker image errors is a good improvement for client feedback. It clearly indicates that the request itself had an issue (e.g., bad image reference) rather than a server-side internal error.

giurgiur99 added 4 commits December 15, 2025 17:25

try fix

0b205f1

comment image validation

1098c9c

use set timeout instead of interval

97f1cc1

check image during pull

555d94e

giurgiur99 marked this pull request as ready for review December 16, 2025 12:14

giurgiur99 requested review from alexcos20, andreip136, bogdanfazakas, denisiuriet and ndrpp as code owners December 16, 2025 12:14

alexcos20 reviewed Dec 16, 2025

View reviewed changes

alexcos20 added the Status: DoNotMerge label Jan 6, 2026

alexcos20 reviewed Jan 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix loop #1122

Fix loop #1122

giurgiur99 commented Dec 15, 2025 •

edited

Loading

Uh oh!

alexcos20 Dec 16, 2025 •

edited

Loading

Uh oh!

alexcos20 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix loop #1122

Are you sure you want to change the base?

Fix loop #1122

Conversation

giurgiur99 commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcos20 Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alexcos20 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

giurgiur99 commented Dec 15, 2025 •

edited

Loading

alexcos20 Dec 16, 2025 •

edited

Loading