MM-67613 Adding encryption key rotation with graceful fallback#664
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughPlugin snapshots the previous encryption key on config changes, spawns async re-encryption of KV-stored user tokens (migrating tokens or forcing per-user disconnects on failures), adds connection gating, hardens PKCS#7 unpadding, records re-encryption audit types, and adds comprehensive tests. Changes
Sequence DiagramsequenceDiagram
participant Admin as Admin/User
participant Config as Configuration
participant Plugin as Plugin
participant KV as KV Store
participant UserDB as User Data & Mappings
Admin->>Config: Update encryption key
Config->>Config: Detect key change, snapshot previous key
Config->>Plugin: Async: reEncryptUserData(newKey, previousKey)
Plugin->>KV: Acquire cluster mutex
Plugin->>KV: List keys matching `_usertoken` (paged)
loop for each token key
KV-->>Plugin: return token key
Plugin->>KV: Get encrypted token
KV-->>Plugin: encrypted blob
alt decryptable with new key
Plugin->>Plugin: skip (already migrated)
else decryptable with previous key
Plugin->>Plugin: decrypt with previous key
Plugin->>Plugin: re-encrypt with new key
Plugin->>KV: KVCompareAndSet store re-encrypted token
Plugin->>Plugin: increment migrated_count
else decryption fails
Plugin->>Plugin: forceDisconnectUser(userID)
Plugin->>UserDB: delete `_usertoken`, `_userinfo`, mappings
Plugin->>Plugin: increment force_disconnect_count
end
end
Plugin->>KV: Release cluster mutex
Plugin->>Plugin: Publish audit params/result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
| if err := p.setDefaultInstance(instanceName); err != nil { | ||
| return p.getCommandResponse(args, err.Error(), true), nil | ||
| } |
There was a problem hiding this comment.
Note: This setting of default instance behavior is going to be removed once #663 is merged and this PR syncs to the master branch
There was a problem hiding this comment.
🧹 Nitpick comments (1)
server/plugin.go (1)
1074-1077: Consider whether KV read errors should trigger disconnect.Currently, if
p.client.KV.Getfails with an error (not just nil data), the function returns(false, nil)- silently skipping the user without migration or disconnect. A transient KV error could leave a user's token encrypted with the old key, causing auth failures on subsequent requests.This fail-open approach may be intentional to avoid mass disconnections on transient KV issues. If so, consider adding a warning log for the error case to aid debugging:
🔧 Suggested improvement
var tokenBytes []byte -if err := p.client.KV.Get(kvKey, &tokenBytes); err != nil || tokenBytes == nil { +if err := p.client.KV.Get(kvKey, &tokenBytes); err != nil { + p.client.Log.Warn("Failed to read token during re-encryption, skipping user", + "user_id", userID, "error", err.Error()) + return false, nil +} +if tokenBytes == nil { return false, nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/plugin.go` around lines 1074 - 1077, The KV read silently ignores errors; update the block that calls p.client.KV.Get(kvKey, &tokenBytes) so that if err != nil you emit a warning log with the error and kvKey (e.g. via the plugin's logger, such as p.logger or p.log) before returning (false, nil), rather than failing silently — keep the existing behavior (no disconnect) but add the log to surface transient KV failures that could block token migration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@server/plugin.go`:
- Around line 1074-1077: The KV read silently ignores errors; update the block
that calls p.client.KV.Get(kvKey, &tokenBytes) so that if err != nil you emit a
warning log with the error and kvKey (e.g. via the plugin's logger, such as
p.logger or p.log) before returning (false, nil), rather than failing silently —
keep the existing behavior (no disconnect) but add the log to surface transient
KV failures that could block token migration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: ff9d4ba4-4a43-4a65-84e5-dfee8aae5852
📒 Files selected for processing (7)
server/audit.goserver/command.goserver/command_test.goserver/configuration.goserver/plugin.goserver/reencrypt_test.goserver/utils.go
# Conflicts: # server/command.go
ghost
left a comment
There was a problem hiding this comment.
Thanks @avasconcelos114 a few comments!
| if previousEncryptionKey != "" && configuration.EncryptionKey != "" && | ||
| previousEncryptionKey != configuration.EncryptionKey { | ||
| newKey := configuration.EncryptionKey | ||
| go p.reEncryptUserData(newKey, previousEncryptionKey) |
There was a problem hiding this comment.
reEncryptUserData runs in a goroutine. If OnConfigurationChange fires again before re-encryption completes a second goroutine could start with a different previousEncryptionKey that is actually the new key from the first rotation and data corruption. Consider adding a check inside reEncryptUserData to verify that previousEncryptionKey still matches what's in the KV store
There was a problem hiding this comment.
Ah to prevent such a scenario I applied cluster.NewMutex to ensure only one goroutine handles the re-encryption and we don't end up with two competing goroutines, and in the event that we have two quick succession of migrations happening, the forceDisconnectUser flow should kick in and notify users to reconnect their accounts
|
|
||
| // Idempotency check: if the token can already be decrypted with the new key, it was already migrated. | ||
| if _, err := decrypt([]byte(newEncryptionKey), tokenStr); err == nil { | ||
| return false, nil |
There was a problem hiding this comment.
There's a small chance decrypting with the wrong key produces valid padding, causing a false-positive. Consider adding a version prefix to re-encrypted values instead.
| return false, err | ||
| } | ||
|
|
||
| reEncrypted, err := encrypt([]byte(newEncryptionKey), plainJSON) |
There was a problem hiding this comment.
If decrypt returns garbage that passes padding corrupted data gets re encrypted silently. Add json.Unmarshal validation before re-encrypting.
There was a problem hiding this comment.
I think this also resolves the false-positive issue you listed above so I'll keep this as a way to resolve both, thanks for the suggestion!
| } | ||
|
|
||
| auditRec.Success() | ||
| auditRec.AddEventResultState(result) |
There was a problem hiding this comment.
Even if ForceDisconnectCount == len(allKeys) audit says success. Mark as failure orpartial when disconnects occurred.
There was a problem hiding this comment.
Oh yes this makes more sense, I'll adjust this
edgarbellot
left a comment
There was a problem hiding this comment.
Thank you @avasconcelos114 ! I just have one comment :)
| p.configurationLock.RUnlock() | ||
| newGitlabGroup := strings.TrimSpace(configuration.GitlabGroup) | ||
|
|
||
| p.setConfiguration(configuration, serverConfiguration) |
There was a problem hiding this comment.
There's a window between setConfiguration (211) activating the new key and reEncryptUserData (216) finishing. During that window, getGitlabUserTokenByMattermostID (476-497) tries to decrypt with the new key, but the tokens are still encrypted with the old one - so every user gets a decryption failure until their individual token is re-encrypted. In deployments with many users this could be a noticeable outage.
Would it make sense to add a fallback decryption attempt with the previous key inside getGitlabUserTokenByMattermostID (486-489)? Something like storing the previous key temporarily in the config (e.g. PreviousEncryptionKey) and, if decrypt with the current key fails, retrying with the previous one before returning an error. That way users stay functional while the background migration runs.
An alternative would be to run the re-encryption synchronously before activating the new key, but that could block OnConfigurationChange for a while depending on user count, so the fallback approach seems more practical.
There was a problem hiding this comment.
Ah yep that makes sense to me, I'll add this temporary config value as a fallback and this should make the process smoother, thanks for the suggestion!
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/configuration.go`:
- Around line 44-47: Persist the previous encryption key and an in-progress
migration marker to the plugin KV so async re-encryption can survive restarts:
extend the Configuration struct's PreviousEncryptionKey persistence (or add a
RotationInProgress flag) and on rotation start write {PreviousEncryptionKey,
RotationInProgress=true} to KV; modify reEncryptUserData to read the persisted
state at startup and resume/finish re-encryption when RotationInProgress is
true; on successful completion clear RotationInProgress and remove/overwrite
PreviousEncryptionKey in KV. Ensure functions/methods that initiate rotation and
perform re-encryption (e.g., reEncryptUserData and wherever the rotation is
started) read/write these KV keys atomically so the goroutine can resume after
process restart.
In `@server/reencrypt_test.go`:
- Around line 136-147: The test "wrong-key decrypt garbage does not pass unpad"
is using a probabilistic assertion that decrypt must always return an error;
instead make it deterministic by accepting either a decrypt error or a JSON
unmarshal failure on the decrypted payload. Update the test around the calls to
encrypt/decrypt (functions encrypt and decrypt) to: call decrypt into a
plaintext variable, if decrypt returns an error accept it, otherwise attempt
json.Unmarshal on the decrypted plaintext into oauth2.Token and assert that
unmarshalling fails (or assert that the token fields are not usable). This
ensures the test passes deterministically when ciphertext decrypts to non-JSON
garbage or when decrypt fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b74df689-51dd-43c7-8fd8-20c682ad6b45
📒 Files selected for processing (4)
server/command_test.goserver/configuration.goserver/plugin.goserver/reencrypt_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- server/plugin.go
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/reencrypt_test.go`:
- Around line 330-334: The tests TestReEncryptUserData_MultiplePages and
TestReEncryptUserData_PageErrorAfterFirstPage currently use Maybe() for
KVGet/KVSetWithOptions so per-token work can be skipped; modify each test to
record and assert that at least one page-0 key was processed by replacing or
augmenting the api.On("KVSetWithOptions", ...) mock: attach a .Run(func(args
mock.Arguments){ /* inspect args[1] or key arg and increment a local counter
when it matches a page-0 token (use encryptedTokenWithKey or known key prefix)
*/ }) and keep the Return(...).Maybe() (or use .Once().Return(...) if
appropriate), then after calling ReEncryptUserData assert counter > 0 to prove
page-0 processing, and call api.AssertExpectations(t) to ensure mocked
expectations are checked; do this in both tests
(TestReEncryptUserData_MultiplePages and
TestReEncryptUserData_PageErrorAfterFirstPage).
- Around line 445-459: The tests
TestGetGitlabUserTokenByMattermostID_FallbackToPreviousKey and
TestGetGitlabUserTokenByMattermostID_BothKeysFail set .Once() on the API mock
for KVGet but never verify those expectations; update each test to call
api.AssertExpectations(t) (or api.AssertCalled/AssertNumberOfCalls as
appropriate) after the existing require/assert checks to ensure the KVGet mock
was actually invoked as expected, referencing the API mock variable api and the
KVGet expectation for kvKey.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 34353ec7-d85c-4e03-a36f-aeb763612d70
📒 Files selected for processing (1)
server/reencrypt_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/plugin.go`:
- Around line 1062-1071: Add tracking for users skipped due to KV read errors by
extending the audit result and incrementing it when reEncryptUserToken returns
(false, nil). Update the ReEncryptUserDataAuditResult struct in server/audit.go
to include a SkippedCount (uint or int), then in server/plugin.go where the loop
over allKeys calls reEncryptUserToken, change the logic so: if reErr != nil keep
existing behavior (increment ForceDisconnectCount), else if migrated increment
MigratedCount, else increment result.SkippedCount for the (false, nil) case;
also update any serialization/consumers of ReEncryptUserDataAuditResult to
handle the new field.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 62822ea5-da25-4a89-966a-e2b56368b525
📒 Files selected for processing (2)
server/plugin.goserver/reencrypt_test.go
|
@coderabbitai approve |
✅ Actions performedComments resolved and changes approved. |
edgarbellot
left a comment
There was a problem hiding this comment.
Perfectly addressed, as always! Thank you Andre!
|
@nang2049 Could you approve this PR again? GitHub disabled the merge button with a message that it needs an approval from |
Summary
This PR applies a more graceful behavior for rotation of encryptionkey values in the settings.
This PR adds a rotation of encrypted values alongside a fallback if the rotation fails that clear user sessions and notifies the user via DM. It also fills a gap in backwards compatibility, where a user is unable to run
/gitlab connectunless instances have been installed, even when the global plugin settings have all the correct values set upTicket Link
https://mattermost.atlassian.net/browse/MM-67613
QA Steps
The two main flows for this are for either a successful or a failed rotation, with:
Success case
/gitlab me) to confirm things continue to work normallyFailed rotation case
/gitlab connectin order to reconnect accounts/gitlab connectworks normally and the user is able to connect their account and continue to use the plugin normallyChange Impact: 🟠 Medium
Reasoning: The PR modifies encryption/decryption, token re-encryption, and session/disconnect behavior — areas tied to authentication and data migration — but includes extensive unit tests and a clear fallback that limits broad failures.
Regression Risk: Medium — changes touch KV storage, crypto (padding/unpad), session cleanup, and connection gating. Tests cover many flows (including success, multiple failure modes, and fallback), but real-world migration edge cases (partial KV/CAS failures, multi-node concurrency, and forced disconnects) could surface in production.
** QA Recommendation:** Targeted manual QA required: (1) verify normal operations with no rotation; (2) validate successful rotation end-to-end (tokens re-encrypted, users remain connected); (3) simulate rotation failure and confirm affected users are disconnected, receive the plugin DM, and can reconnect via /gitlab connect. Given session-impacting behavior, do not skip manual QA.
Generated by CodeRabbitAI