Skip to content

Conversation

@giurgiur99
Copy link
Contributor

@giurgiur99 giurgiur99 commented Dec 15, 2025

Fixes #1121

Changes proposed in this PR:

  • Await for jobs to be updated
  • Use set timeout
  • Validate docker image on pull. It will throw error if bad maifest/arch

@giurgiur99 giurgiur99 marked this pull request as ready for review December 16, 2025 12:14
if (queueMaxWaitTime === 0) {
if (algorithm.meta.container && algorithm.meta.container.dockerfile) {
this.buildImage(job, additionalDockerFiles)
const buildResult = await this.buildImage(job, additionalDockerFiles)
Copy link
Member

@alexcos20 alexcos20 Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm.. if building the image takes 30 mins, this means that the entire compute queue goes unprocessed. As result:

  • compute jobs running more than allowed
  • new compute jobs are not started

Copy link
Member

@alexcos20 alexcos20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI automated code review (Gemini 3).

Overall risk: low

Summary:
This pull request refactors the C2D (Compute-to-Data) Docker engine, primarily focusing on improving the stability of the internal job processing loop, enhancing error handling during Docker image operations, and removing a previous image manifest validation step. The cronTimer for the internal loop has been switched from setInterval to setTimeout to ensure sequential, non-overlapping job processing. Docker image pullImage and buildImage methods now return explicit success/error statuses and are awaited, leading to more robust error propagation. The proactive image manifest validation has been removed, and the system now relies on reactive error handling during image pull/build. Error responses for image-related failures are now more specific (400 Bad Request instead of 500 Internal Server Error). Tests have been updated to reflect these changes.

Comments:
• [INFO][other] The removal of parseImage, getDockerManifest, checkDockerImage, and checkManifestPlatform signifies a shift from proactive image validation (checking manifests before pull/build) to reactive error handling during the image operations themselves. Could you please clarify the reasoning behind this removal? Was the manifest validation deemed unreliable, too slow, or simply no longer necessary given the improved error handling during actual image operations?
• [INFO][performance] Changing setNewTimer to be synchronous and adding if (this.cronTimer) { return; } is a good defensive measure to prevent multiple timers from being accidentally created, which could lead to race conditions or resource issues. This, along with the change from setInterval to setTimeout in the InternalLoop, seems to address potential issues with the loop stability.
• [INFO][bug] The change from setInterval to setTimeout for the cronTimer is a critical fix. This pattern ensures that InternalLoop completes its execution before another iteration is scheduled, preventing concurrent executions, potential infinite loops, or resource exhaustion if job processing takes longer than the configured cronTime. This directly addresses the 'fix-loop' branch name.
• [INFO][bug] Adding await for buildImage and pullImage in processJob, along with the subsequent return, is a good improvement. This ensures that the job's status is correctly updated in the database after the image operation completes (or fails) and that the job is re-evaluated in the next InternalLoop iteration, rather than proceeding with potentially stale information. This fixes a potential race condition.
• [INFO][style] Returning an object { success: boolean; error?: string } from pullImage and buildImage is a cleaner way to handle outcomes and allows for more explicit error propagation to callers. This is a good refactoring that improves the clarity and robustness of these operations.
• [INFO][other] Changing the HTTP status code to 400 Bad Request for Unable to pull/build docker image errors is a good improvement for client feedback. It clearly indicates that the request itself had an issue (e.g., bad image reference) rather than a server-side internal error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Monitor loop is over-checking for running jobs

3 participants