-
Notifications
You must be signed in to change notification settings - Fork 19
Fix loop #1122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix loop #1122
Conversation
| if (queueMaxWaitTime === 0) { | ||
| if (algorithm.meta.container && algorithm.meta.container.dockerfile) { | ||
| this.buildImage(job, additionalDockerFiles) | ||
| const buildResult = await this.buildImage(job, additionalDockerFiles) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm.. if building the image takes 30 mins, this means that the entire compute queue goes unprocessed. As result:
- compute jobs running more than allowed
- new compute jobs are not started
alexcos20
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AI automated code review (Gemini 3).
Overall risk: low
Summary:
This pull request refactors the C2D (Compute-to-Data) Docker engine, primarily focusing on improving the stability of the internal job processing loop, enhancing error handling during Docker image operations, and removing a previous image manifest validation step. The cronTimer for the internal loop has been switched from setInterval to setTimeout to ensure sequential, non-overlapping job processing. Docker image pullImage and buildImage methods now return explicit success/error statuses and are awaited, leading to more robust error propagation. The proactive image manifest validation has been removed, and the system now relies on reactive error handling during image pull/build. Error responses for image-related failures are now more specific (400 Bad Request instead of 500 Internal Server Error). Tests have been updated to reflect these changes.
Comments:
• [INFO][other] The removal of parseImage, getDockerManifest, checkDockerImage, and checkManifestPlatform signifies a shift from proactive image validation (checking manifests before pull/build) to reactive error handling during the image operations themselves. Could you please clarify the reasoning behind this removal? Was the manifest validation deemed unreliable, too slow, or simply no longer necessary given the improved error handling during actual image operations?
• [INFO][performance] Changing setNewTimer to be synchronous and adding if (this.cronTimer) { return; } is a good defensive measure to prevent multiple timers from being accidentally created, which could lead to race conditions or resource issues. This, along with the change from setInterval to setTimeout in the InternalLoop, seems to address potential issues with the loop stability.
• [INFO][bug] The change from setInterval to setTimeout for the cronTimer is a critical fix. This pattern ensures that InternalLoop completes its execution before another iteration is scheduled, preventing concurrent executions, potential infinite loops, or resource exhaustion if job processing takes longer than the configured cronTime. This directly addresses the 'fix-loop' branch name.
• [INFO][bug] Adding await for buildImage and pullImage in processJob, along with the subsequent return, is a good improvement. This ensures that the job's status is correctly updated in the database after the image operation completes (or fails) and that the job is re-evaluated in the next InternalLoop iteration, rather than proceeding with potentially stale information. This fixes a potential race condition.
• [INFO][style] Returning an object { success: boolean; error?: string } from pullImage and buildImage is a cleaner way to handle outcomes and allows for more explicit error propagation to callers. This is a good refactoring that improves the clarity and robustness of these operations.
• [INFO][other] Changing the HTTP status code to 400 Bad Request for Unable to pull/build docker image errors is a good improvement for client feedback. It clearly indicates that the request itself had an issue (e.g., bad image reference) rather than a server-side internal error.
Fixes #1121
Changes proposed in this PR: