Double pg-dump (2Gi) & increase hg-repos (175Gi) #2222
Conversation
📝 Walkthrough

Kubernetes manifests in the production and restore-scripts directories receive storage allocation updates.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deployment/production/backup.yaml (1)
37-41: ⚠️ Potential issue | 🟠 Major

Add `ephemeral-storage` resource requests/limits alongside the 2Gi `emptyDir` volumes.

Lines 101 and 65 (kopia-maint.yaml) establish a 2Gi cap on the `pg-dump` emptyDir volume, but without `ephemeral-storage` resource requests/limits on the consuming containers, these pods can still be scheduled onto disk-constrained nodes and face eviction under disk pressure. The postgres-dump container (lines 37–41) and create-snapshots container (lines 65–69) in backup.yaml, as well as both containers in kopia-maint.yaml, need explicit ephemeral-storage constraints:
Suggested manifest update:

```diff
 resources:
   requests:
     memory: 150Mi
+    ephemeral-storage: 2300Mi
   limits:
     memory: 150Mi
+    ephemeral-storage: 2300Mi
@@
 resources:
   requests:
     memory: 1Gi
+    ephemeral-storage: 512Mi
   limits:
     memory: 1Gi
+    ephemeral-storage: 512Mi
```

Also applies to the kopia-maint.yaml containers (lines 7–15 and 32–40).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deployment/production/backup.yaml` around lines 37 - 41, The postgres-dump and create-snapshots containers in backup.yaml (and both containers defined in kopia-maint.yaml) lack ephemeral-storage resource requests/limits, so add ephemeral-storage requests: 2Gi and limits: 2Gi to each container's resources block (the same place where memory: 150Mi is set) so the containers declare the 2Gi emptyDir disk usage and will not be evicted under node disk pressure; update the resources section for the containers named postgres-dump, create-snapshots and the two containers in kopia-maint.yaml to include ephemeral-storage: 2Gi for both requests and limits.
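The change the prompt describes can be sketched as a per-container `resources` block; this is a minimal sketch, not the actual manifest, and everything other than the container name and the 150Mi/2Gi values mentioned in the review is assumed:

```yaml
# Sketch only: shows where ephemeral-storage sits relative to the
# existing memory request/limit in a container's resources block.
containers:
  - name: postgres-dump
    resources:
      requests:
        memory: 150Mi
        ephemeral-storage: 2Gi   # declare the emptyDir's worst-case disk usage
      limits:
        memory: 150Mi
        ephemeral-storage: 2Gi   # exceeding the limit evicts this pod deliberately,
                                 # rather than via unpredictable node disk pressure
```

Setting the request lets the scheduler avoid disk-constrained nodes; setting the limit scopes any eviction to this pod instead of whatever the kubelet picks under node pressure.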
🧹 Nitpick comments (1)
deployment/restore-scripts/kopia-maint.yaml (1)
7-40: Harden container securityContext selectively based on image capabilities.
`postgres:15-alpine` can reliably run with `runAsNonRoot: true` (UID/GID 70) and should be hardened. However, `kopia/kopia:latest` (current Ubuntu-based builds) does not support `runAsNonRoot: true` without custom configuration or entrypoints; it defaults to root with no explicit USER directive, and official documentation typically requires root or privileged access for FUSE-based features.

Consider hardening postgres fully:

```yaml
securityContext:
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  runAsUser: 70
  runAsGroup: 70
  fsGroup: 70
  capabilities:
    drop: ["ALL"]
```

For kopia, either: (1) keep the container as-is if FUSE or privileged features are needed, or (2) if root is unnecessary, apply `allowPrivilegeEscalation: false` and `capabilities.drop: ["ALL"]` without `runAsNonRoot: true`, and verify compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deployment/restore-scripts/kopia-maint.yaml` around lines 7 - 40, Add per-container securityContext entries: for the pg container set allowPrivilegeEscalation: false, runAsNonRoot: true, runAsUser: 70, runAsGroup: 70, fsGroup: 70 and drop all capabilities to harden Postgres (reference the container named "pg"); for the kopia container either leave it unchanged if FUSE/privileged behavior is required or add allowPrivilegeEscalation: false and capabilities.drop: ["ALL"] but do NOT set runAsNonRoot for "kopia" unless you verify the image supports a non-root user; update the pod spec's containers section where "name: pg" and "name: kopia" are defined and ensure secrets and volume mounts remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deployment/production/hg-repos-volume.yaml`:
- Line 9: The PVC resize request currently sets storage: 175Gi but may silently
fail if the bound StorageClass (storageClassName: weekly-snapshots-retain-4)
does not have allowVolumeExpansion: true; update the manifest and workflow to
either (1) validate that the StorageClass used by this PVC has
allowVolumeExpansion: true before applying the patch, or (2) switch the PVC to a
StorageClass that supports expansion, or (3) implement a safe migration path
that creates a new PVC with 175Gi, copies data, and swaps claims; reference the
PVC definition that includes storage: 175Gi and the storageClassName
weekly-snapshots-retain-4 when making these changes.
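One way to verify option (1), assuming `kubectl` access to the cluster, is `kubectl get storageclass weekly-snapshots-retain-4 -o jsonpath='{.allowVolumeExpansion}'`, which prints `true` only when expansion is enabled. The field lives on the StorageClass object itself; a sketch (the provisioner shown is a placeholder, the real one is cluster-specific):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: weekly-snapshots-retain-4
provisioner: kubernetes.io/no-provisioner  # placeholder; actual provisioner unknown
allowVolumeExpansion: true                 # required for in-place PVC resize
```

Without `allowVolumeExpansion: true`, patching the PVC's `spec.resources.requests.storage` to 175Gi is rejected or silently ignored depending on the Kubernetes version, which is why the prompt suggests validating the StorageClass first.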
---
Outside diff comments:
In `@deployment/production/backup.yaml`:
- Around line 37-41: The postgres-dump and create-snapshots containers in
backup.yaml (and both containers defined in kopia-maint.yaml) lack
ephemeral-storage resource requests/limits, so add ephemeral-storage requests:
2Gi and limits: 2Gi to each container's resources block (the same place where
memory: 150Mi is set) so the containers declare the 2Gi emptyDir disk usage and
will not be evicted under node disk pressure; update the resources section for
the containers named postgres-dump, create-snapshots and the two containers in
kopia-maint.yaml to include ephemeral-storage: 2Gi for both requests and limits.
---
Nitpick comments:
In `@deployment/restore-scripts/kopia-maint.yaml`:
- Around line 7-40: Add per-container securityContext entries: for the pg
container set allowPrivilegeEscalation: false, runAsNonRoot: true, runAsUser:
70, runAsGroup: 70, fsGroup: 70 and drop all capabilities to harden Postgres
(reference the container named "pg"); for the kopia container either leave it
unchanged if FUSE/privileged behavior is required or add
allowPrivilegeEscalation: false and capabilities.drop: ["ALL"] but do NOT set
runAsNonRoot for "kopia" unless you verify the image supports a non-root user;
update the pod spec's containers section where "name: pg" and "name: kopia" are
defined and ensure secrets and volume mounts remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9bb12a29-f2b3-4c28-980a-ad2bc4c86630
📒 Files selected for processing (3)
- deployment/production/backup.yaml
- deployment/production/hg-repos-volume.yaml
- deployment/restore-scripts/kopia-maint.yaml
```diff
 resources:
   requests:
-    storage: 150Gi
+    storage: 175Gi
```
Please don't bundle raising the hg-repos and pg-dump volumes in one commit; these volumes serve two quite different purposes: pg-dump is temporary space to store postgres dumps during backup, while hg-repos is the linguistic data we wish to back up. Therefore it is important to get the backup issue resolved before doing anything with the hg-repos volume. Also, does hg-repos even need increasing yet?
@tim-eves thanks for the feedback!
As explained in the issue description, I do think increasing the size of hg-repos makes sense. It's time. I feel like I was fairly conservative (i.e. +25Gi), because a deep clean should happen at some time, but that will be a decent amount of work, because we've already deleted the obvious large junk.
And I have increased the volumes in 2 separate commits. I'd plan to merge them that way too. I suppose it could have been separate PRs. But...considering all that, do you approve of these changes?
Kopia backups are apparently failing (according to @g3mackay) with:
And failure emails mention:
Regarding hg-repos
About 6 months ago we deleted all the obvious large junk.
I'm sure there's still some cleanup that could happen, but it would be quite a bit more effort this time.
This simple size bump seems reasonable and I'm now on holiday for a week.
Regarding pg-dump
We're definitely putting way more stuff in the DB these days: the entire data/change history of every project that starts using FW-Lite.
While I'm sure there's some cleanup that could happen, it will definitely continue to grow and grow.
Regarding debugging the backup failures
I have not been able to view or find any logs on any of the failed jobs. That puzzles me and makes it guesswork on my side.
It also seems that every time the job is triggered, it first fails and then succeeds, which puzzles me as well.