Modernize, minimize, and cleanup containers #129
base: develop
Conversation
eflumerf
left a comment
Things look reasonable to me, however I do not have any kind of testing environment so I can't speak too much about functionality.
You can always manually inspect the environment
Better sanity checking for generate_postgresql_partitions.sh
* Fix Python 3.9 compatibility by replacing union syntax with Optional

Co-authored-by: jcpunk <3534830+jcpunk@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jcpunk <3534830+jcpunk@users.noreply.github.com>
Autogenerated by CodeRabbit:

📊 Executive Summary

This PR represents a major modernization initiative touching every service in the repository. The changes fall into five key categories:

1. 🐋 Docker Infrastructure Overhaul
2. 🔐 Security Hardening (Kubernetes)
3. 🔧 Configuration Standardization
4. 💻 Code Modernization
5. 📉 Service Changes & Removals

🗺️ Human Reviewer Roadmap

Phase 1: Infrastructure & Build System (Est. 45 min). Goal: Understand the new build and deployment model
Phase 2: Breaking Changes & Migration (Est. 60 min). Goal: Identify impact on your systems
Phase 3: Service-by-Service Review (Est. 90 min). Goal: Validate code quality and logic changes
Phase 4: Tooling & Edge Cases (Est. 20 min). Goal: Understand supporting tools
Phase 5: Testing & Validation Plan (Est. 30 min). Goal: Create deployment validation checklist

🚦 Risk Assessment
📝 Pre-Merge Checklist
⏱️ Estimated Review Time

I'll now trigger a complete re-review of all commits to provide fresh insights.

✅ Actions performed: Full review triggered.
Pull request overview
This PR modernizes the microservices container infrastructure with significant improvements to security, maintainability, and best practices. The changes focus on containerization improvements, database access pattern updates, and shell script hardening.
Key Changes:
- Migrated from AlmaLinux 9 to AlmaLinux 10 (requires verification) with multi-stage Docker builds separating dependencies from application code
- Enhanced Kubernetes security posture by enforcing restricted pod security standards, adding read-only root filesystems, and implementing proper user/group configurations
- Modernized database access by migrating from legacy query patterns to SQLAlchemy 2.0's select() API and consolidating database URIs
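To illustrate the last point, here is a minimal sketch of the 1.x-to-2.0 query migration. The `RunNumber` model and in-memory SQLite engine are hypothetical stand-ins; the real models and the consolidated `DATABASE_URI` live in the individual services.

```python
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class RunNumber(Base):  # hypothetical stand-in for a service model
    __tablename__ = "run_numbers"
    rn = Column(Integer, primary_key=True)
    flag = Column(String)

# The real services build the engine from DATABASE_URI instead
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(RunNumber(rn=1, flag="TEST"))
    session.commit()

    # Legacy 1.x pattern: session.query(RunNumber).filter_by(rn=1).first()
    # SQLAlchemy 2.0 pattern:
    stmt = select(RunNumber).where(RunNumber.rn == 1)
    row = session.execute(stmt).scalar_one()
```

The 2.0 `select()` style makes the statement an inspectable object that works identically in Core and ORM contexts, which is the main motivation for the migration.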
Reviewed changes
Copilot reviewed 42 out of 43 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| dockerfiles/Dockerfile.dependencies | New multi-stage build for dependency layer using AlmaLinux 10 base |
| dockerfiles/Dockerfile | Application layer that builds on dependencies image |
| dockerfiles/requirements.txt | Streamlined Python dependencies with updated versions |
| .github/workflows/build_docker_layers.yaml | New CI/CD workflow with conditional dependency rebuilds |
| entrypoint_functions.sh | Enhanced variable validation with sensitive data redaction |
| runregistry-rest/, runnumber-rest/, elisa-logbook/* | Database access modernization and security improvements |
| ers-protobuf-dbwriter/dbwriter.py | Complete rewrite using SQLAlchemy with retry logic |
| opmon-protobuf-dbwriter/dbwriter.py | Consolidated database configuration using URI pattern |
| */deployment.yaml | Enhanced security contexts and resource management |
| tools/sample_postgresql_partitions_for_table.sh | Added input validation and transaction wrapping |
| tools/postgresql_partitions_for_all_tables.sh | New batch processing script for partition management |
```python
if table_recreated:
    # Already tried recreating, this is a persistent schema issue
    logging.error("Table already recreated, schema issue persists")
if attempt >= MAX_RETRIES:
    logging.error("Failed to deliver issue after all retry attempts")
    logging.error(pb_json.MessageToJson(chain))
    raise
# Wait a bit before retrying
time.sleep(0.1 * attempt)
continue
```
Copilot AI · Jan 6, 2026
The condition checks if table_recreated is True, but table_recreated is only set within the except ProgrammingError block. If a ProgrammingError occurs again after table recreation, the code will correctly detect it. However, the logic could be clearer by moving the table_recreated check after attempting recreation, not before, to avoid confusion about when it's actually being used to prevent infinite recreation loops.
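One way to act on this suggestion is to recreate the table at most once and fail fast on a second schema error. This is only a sketch: `write_row`, `recreate_table`, and the local `ProgrammingError` class are stand-ins for the real dbwriter pieces and SQLAlchemy's exception.

```python
import logging
import time

MAX_RETRIES = 5

class ProgrammingError(Exception):
    """Stand-in for sqlalchemy.exc.ProgrammingError in this sketch."""

def deliver_with_retry(write_row, recreate_table, chain_json="{}"):
    table_recreated = False
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            write_row()
            return True
        except ProgrammingError:
            if table_recreated:
                # Second schema failure after recreation: persistent issue
                logging.error("Table already recreated, schema issue persists")
            else:
                recreate_table()
                table_recreated = True
                continue  # retry immediately after recreating
        except Exception:
            pass  # transient error, fall through to backoff
        if attempt >= MAX_RETRIES:
            logging.error("Failed to deliver after all retry attempts")
            logging.error(chain_json)
            raise RuntimeError("delivery failed after %d attempts" % MAX_RETRIES)
        time.sleep(0.1 * attempt)
    return False
```

Keeping the recreate-once decision inside the `except ProgrammingError` branch makes it obvious that `table_recreated` exists solely to prevent an infinite recreation loop.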
```yaml
needs: [build-dependencies]
if: |
  always() &&
  (needs.build-dependencies.result == 'success' || needs.build-dependencies.result == 'skipped')
```
Copilot AI · Jan 6, 2026
The job build-microservices has needs: [build-dependencies] which means it will always wait for the build-dependencies job, even if build-dependencies is skipped. The if condition handles this with needs.build-dependencies.result == 'skipped', but this creates unnecessary job dependencies. Consider restructuring to make the dependency conditional only when build-dependencies actually runs.
```yaml
(github.event_name == 'push' && github.ref == format('refs/heads/{0}', github.event.repository.default_branch) && (
  contains(github.event.head_commit.modified, 'dockerfiles/requirements.txt') ||
  contains(github.event.head_commit.added, 'dockerfiles/requirements.txt') ||
  contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile.dependencies') ||
  contains(github.event.head_commit.added, 'dockerfiles/Dockerfile.dependencies') ||
  contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile') ||
  contains(github.event.head_commit.added, 'dockerfiles/Dockerfile')
))
```
Copilot AI · Jan 6, 2026
The workflow condition on lines 36-42 checks for modified/added files in the head_commit, but this may not work correctly for all push events. The github.event.head_commit object is not always populated (e.g., for force pushes). Consider using a paths filter at the workflow level or checking out the code and using git diff to determine if relevant files changed.
Suggested change:

```yaml
(github.event_name == 'push' && github.ref == format('refs/heads/{0}', github.event.repository.default_branch) &&
  github.event.head_commit != null &&
  (
    contains(github.event.head_commit.modified, 'dockerfiles/requirements.txt') ||
    contains(github.event.head_commit.added, 'dockerfiles/requirements.txt') ||
    contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile.dependencies') ||
    contains(github.event.head_commit.added, 'dockerfiles/Dockerfile.dependencies') ||
    contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile') ||
    contains(github.event.head_commit.added, 'dockerfiles/Dockerfile')
  ))
```
```shell
ensure_required_variables "DATABASE_URI"
```

```diff
-exec gunicorn -b 0.0.0.0:5000 --workers=1 --worker-class=gevent --timeout 5000000000 --log-level=debug rest:app
+exec gunicorn -b 0.0.0.0:5000 --workers=1 --worker-class=gevent --timeout 9000 --log-level=debug rest:app
```
Copilot AI · Jan 6, 2026
The timeout value of 9000 seconds (2.5 hours) is extremely high for a web service. This could lead to resource exhaustion as connections hold workers for extended periods. Consider using a more reasonable timeout value (e.g., 300-600 seconds) or implementing proper request timeout handling at the application level instead of relying on such a high gunicorn timeout.
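If the timeout does need to stay tunable, gunicorn's config file is itself Python, so one hedged option is to set a modest default there and let deployments override it. The 600-second default and the `GUNICORN_TIMEOUT` variable name are assumptions for illustration, not part of this PR.

```python
# gunicorn.conf.py -- sketch of an overridable worker timeout
import os

bind = "0.0.0.0:5000"
workers = 1
worker_class = "gevent"
loglevel = "debug"
# Default to something far more modest than 9000 s; deployments that truly
# serve long-running requests can raise GUNICORN_TIMEOUT explicitly.
timeout = int(os.environ.get("GUNICORN_TIMEOUT", "600"))
```

gunicorn picks this file up automatically when it sits in the working directory, or via `-c gunicorn.conf.py`, which keeps the entrypoint command short.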
```python
def utc_now():
    # Returns naive UTC datetime matching your existing schema (no timezone)
    return datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
```
Copilot AI · Jan 6, 2026
The deprecated datetime.datetime.utcnow() has been replaced with timezone-aware datetime that's then made naive by removing tzinfo. While this works, consider keeping the timezone information throughout the application for better timestamp handling, especially in a distributed system. If naive timestamps are required for database compatibility, document why tzinfo is being removed.
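A sketch of what keeping the timezone through the application could look like, stripping `tzinfo` only at the database boundary. The function names here are illustrative, not the ones in the PR.

```python
import datetime

def utc_now_aware() -> datetime.datetime:
    """Timezone-aware replacement for the deprecated utcnow()."""
    return datetime.datetime.now(datetime.timezone.utc)

def utc_now_naive() -> datetime.datetime:
    """Naive variant kept only for columns defined without a timezone;
    tzinfo is stripped at the last moment, not throughout the app."""
    return utc_now_aware().replace(tzinfo=None)
```

Working with aware datetimes internally avoids silent off-by-timezone bugs in a distributed system, while the naive helper documents exactly where and why the schema forces the information to be dropped.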
```python
def start_new_thread(
    self, subject: str, body: str, command: str, author: str, systems: list[str]
```
Copilot AI · Jan 6, 2026
The type hint list[str] uses Python 3.9+ syntax. Ensure the base Docker image and all deployment environments support Python 3.9 or later, otherwise use typing.List[str] for backward compatibility.
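For reference, the backward-compatible spelling looks like this. The function body is a trivial placeholder; only the annotation style matters here.

```python
# On interpreters older than 3.9, builtin generics like list[str] fail when
# annotations are evaluated at runtime; typing.List works everywhere, and
# `from __future__ import annotations` defers evaluation on 3.7+ as well.
from __future__ import annotations

from typing import List

def start_new_thread_compat(subject: str, systems: List[str]) -> str:
    """Same signature shape as start_new_thread, 3.8-compatible spelling."""
    return f"{subject}: {', '.join(systems)}"
```

Since this PR pins the Python version through the dependencies image, the builtin-generic syntax is fine as long as that image ships 3.9 or later; the `typing.List` form is only needed if older interpreters must be supported.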
```python
def get_login(self, service: str, user: Optional[str] = None):
    for auth in self.authentications:
        if service == auth.service:
            return auth
    self.log.error(f"Couldn't find login for service: {service}")
    if user is None or user == auth.username:
        return auth
```
Copilot AI · Jan 6, 2026
The function get_login has been modified to accept an optional user parameter but the old logic checking username against auth.user should check auth.username (as seen on line 201) for consistency. Verify this change doesn't break existing callers.
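A sketch of the behaviour the reviewer is describing, with a hypothetical `Auth` dataclass standing in for the real authentication objects. Note the attribute is `username`, not `user`, per the review comment.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Auth:  # hypothetical stand-in for the real credential object
    service: str
    username: str

def get_login(authentications: List[Auth], service: str,
              user: Optional[str] = None) -> Optional[Auth]:
    """Return the first credential matching the service, and the
    username too when one is explicitly requested."""
    for auth in authentications:
        if service == auth.service and (user is None or user == auth.username):
            return auth
    return None  # caller logs "Couldn't find login for service: ..."

creds = [Auth("elisa", "alice"), Auth("elisa", "bob")]
```

With this shape, omitting `user` preserves the old first-match behaviour, so existing callers are unaffected by the new parameter.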
```shell
  echo "dep_tag=$(git rev-parse --short HEAD)" >> "${GITHUB_OUTPUT}"
else
  echo "dep_tag=latest" >> "${GITHUB_OUTPUT}"
```
Copilot AI · Jan 6, 2026
The logic for determining dep_tag uses the result of the build-dependencies job. When skipped, it falls back to 'latest', but this might not be the correct tag if build-dependencies was explicitly skipped due to no dependency changes. The 'latest' tag might not exist yet or might be stale. Consider pulling the actual latest commit SHA of the dependencies image from the container registry or using a more robust fallback strategy.
Suggested change:

```shell
  # Dependencies image was rebuilt for this commit, so use current short SHA
  echo "dep_tag=$(git rev-parse --short HEAD)" >> "${GITHUB_OUTPUT}"
else
  # Dependencies were not rebuilt; find the most recent dependencies image tag from GHCR
  GH_ORG="DUNE-DAQ"
  PACKAGE_NAME="microservices_dependencies"
  API_URL="https://api.github.com/orgs/${GH_ORG}/packages/container/${PACKAGE_NAME}/versions?per_page=1"
  # Fetch latest container package version metadata from GitHub Container Registry
  latest_tag=$(
    curl -fsSL \
      -H "Authorization: Bearer ${{ github.token }}" \
      -H "Accept: application/vnd.github+json" \
      -H "X-GitHub-Api-Version: 2022-11-28" \
      "${API_URL}" | \
    jq -r '.[0].metadata.container.tags[] | select(test("^[0-9a-f]{7}$"))' | \
    head -n 1
  )
  if [[ -n "${latest_tag}" && "${latest_tag}" != "null" ]]; then
    echo "dep_tag=${latest_tag}" >> "${GITHUB_OUTPUT}"
  else
    # Fallback: use 'latest' if no SHA-like tag was found
    echo "dep_tag=latest" >> "${GITHUB_OUTPUT}"
  fi
```
```shell
python3 ./logbook.py
```

```shell
exec gunicorn -b 0.0.0.0:5005 --workers=1 --worker-class=gevent --timeout 9000 --log-level=debug logbook:app
```
Copilot AI · Jan 6, 2026
The timeout value of 9000 seconds (2.5 hours) is extremely high for a web service. This could lead to resource exhaustion as connections hold workers for extended periods. Consider using a more reasonable timeout value (e.g., 300-600 seconds) or implementing proper request timeout handling at the application level instead of relying on such a high gunicorn timeout.
Suggested change:

```diff
-exec gunicorn -b 0.0.0.0:5005 --workers=1 --worker-class=gevent --timeout 9000 --log-level=debug logbook:app
+GUNICORN_TIMEOUT="${GUNICORN_TIMEOUT:-600}"
+exec gunicorn -b 0.0.0.0:5005 --workers=1 --worker-class=gevent --timeout "${GUNICORN_TIMEOUT}" --log-level=debug logbook:app
```
There is a lot going on here.
I recommend reviewing the commits one at a time rather than the full diff as a blob.
In theory, each commit is a rational step with clear, self-contained logic.
Copilot and CodeRabbit identified bits of code that probably never worked. These could likely be dropped, but that would require an expert to review.
The changes to the application-specific code should probably be reviewed.
With NP04 offline right now, I couldn't check some things against the current runtime.
To be clear, I really only care about the containers...
I've tried to set the docker build step for the `microservices` container to run after the `microservices_dependencies` container. Those workflows won't really work until this PR is merged.