Skip to content

MM-67613 Adding encryption key rotation with graceful fallback#664

Merged
avasconcelos114 merged 8 commits into
masterfrom
MM-67613
Apr 20, 2026
Merged

MM-67613 Adding encryption key rotation with graceful fallback#664
avasconcelos114 merged 8 commits into
masterfrom
MM-67613

Conversation

@avasconcelos114
Copy link
Copy Markdown
Contributor

@avasconcelos114 avasconcelos114 commented Apr 2, 2026

Summary

This PR applies a more graceful behavior for rotation of encryptionkey values in the settings.

This PR adds a rotation of encrypted values alongside a fallback if the rotation fails that clear user sessions and notifies the user via DM. It also fills a gap in backwards compatibility, where a user is unable to run /gitlab connect unless instances have been installed, even when the global plugin settings have all the correct values set up

Ticket Link

https://mattermost.atlassian.net/browse/MM-67613

QA Steps

The two main flows for this are for either a successful or a failed rotation, with:
Success case

  1. Install the build with rotation logic present
  2. Connect an account to a Gitlab instance
  3. Go on the System Console and regenerate the encryption key for the Gitlab plugin
  4. Exit the system console and run a gitlab plugin command (e.g. /gitlab me) to confirm things continue to work normally

Failed rotation case

  1. Install a build without rotation logic
  2. Connect an account to a gitlab instance
  3. regenerate the encryption key for the gitlab plugin
  4. exit the system console and run a plugin command to confirm that the command fails due to auth errors
  5. Install a build with the rotation logic
  6. Regenerate the encryption key and confirm through the system logs that the rotation failed
  7. Exit the system console and confirm that a DM has been received by the plugin's bot notifying that the user has to run /gitlab connect in order to reconnect accounts
  8. Running /gitlab connect works normally and the user is able to connect their account and continue to use the plugin normally

Change Impact: 🟠 Medium

Reasoning: The PR modifies encryption/decryption, token re-encryption, and session/disconnect behavior — areas tied to authentication and data migration — but includes extensive unit tests and a clear fallback that limits broad failures.

Regression Risk: Medium — changes touch KV storage, crypto (padding/unpad), session cleanup, and connection gating. Tests cover many flows (including success, multiple failure modes, and fallback), but real-world migration edge cases (partial KV/CAS failures, multi-node concurrency, and forced disconnects) could surface in production.

** QA Recommendation:** Targeted manual QA required: (1) verify normal operations with no rotation; (2) validate successful rotation end-to-end (tokens re-encrypted, users remain connected); (3) simulate rotation failure and confirm affected users are disconnected, receive the plugin DM, and can reconnect via /gitlab connect. Given session-impacting behavior, do not skip manual QA.

Generated by CodeRabbitAI

@avasconcelos114 avasconcelos114 self-assigned this Apr 2, 2026
@avasconcelos114 avasconcelos114 requested a review from a team as a code owner April 2, 2026 14:10
@avasconcelos114 avasconcelos114 added the 2: Dev Review Requires review by a core committer label Apr 2, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Plugin snapshots the previous encryption key on config changes, spawns async re-encryption of KV-stored user tokens (migrating tokens or forcing per-user disconnects on failures), adds connection gating, hardens PKCS#7 unpadding, records re-encryption audit types, and adds comprehensive tests.

Changes

Cohort / File(s) Summary
Audit Types
server/audit.go
Added ReEncryptUserDataAuditParams (total_users) and ReEncryptUserDataAuditResult (migrated_count, force_disconnect_count) with Auditable() exporters.
Connect handling & tests
server/command.go, server/command_test.go
Added canConnect() gating in connect flow to return ephemeral "No instance is configured" when connections are not allowed; updated connect tests to cover default instance, legacy config, and no-instance scenarios.
Configuration trigger
server/configuration.go
Snapshot prior EncryptionKey into non-persisted PreviousEncryptionKey when changing keys; spawn goroutine to call reEncryptUserData(newKey, previousKey) if keys differ and non-empty.
Re-encryption core
server/plugin.go
Added canConnect(); key-rotation workflow: reEncryptUserData(), reEncryptUserToken(), forceDisconnectUser(); updated token retrieval to attempt decrypt with previous key; uses cluster mutex, paged KV listing, per-token migration, KV CAS updates, forced-disconnect cleanup, and audit recording.
Padding utilities
server/utils.go
Hardened unpad() to handle empty input and validate padding length/range and padding bytes; returns explicit errors for invalid padding.
Tests
server/reencrypt_test.go
New extensive tests covering PKCS#7 unpad validation, token re-encryption flows (happy path, already-migrated, nil token, errors, CAS behavior), KV pagination, forced-disconnect cleanup, and fallback decryption with previous key.

Sequence Diagram

sequenceDiagram
    participant Admin as Admin/User
    participant Config as Configuration
    participant Plugin as Plugin
    participant KV as KV Store
    participant UserDB as User Data & Mappings

    Admin->>Config: Update encryption key
    Config->>Config: Detect key change, snapshot previous key
    Config->>Plugin: Async: reEncryptUserData(newKey, previousKey)

    Plugin->>KV: Acquire cluster mutex
    Plugin->>KV: List keys matching `_usertoken` (paged)

    loop for each token key
        KV-->>Plugin: return token key
        Plugin->>KV: Get encrypted token
        KV-->>Plugin: encrypted blob

        alt decryptable with new key
            Plugin->>Plugin: skip (already migrated)
        else decryptable with previous key
            Plugin->>Plugin: decrypt with previous key
            Plugin->>Plugin: re-encrypt with new key
            Plugin->>KV: KVCompareAndSet store re-encrypted token
            Plugin->>Plugin: increment migrated_count
        else decryption fails
            Plugin->>Plugin: forceDisconnectUser(userID)
            Plugin->>UserDB: delete `_usertoken`, `_userinfo`, mappings
            Plugin->>Plugin: increment force_disconnect_count
        end
    end

    Plugin->>KV: Release cluster mutex
    Plugin->>Plugin: Publish audit params/result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • edgarbellot
  • ogi-m

Poem

🐰 I nibble keys both old and new,
I hop through bytes and tidy you,
If tokens trip, I softly mend,
Or clear the burrow, send a friend,
Hooray — I re-encrypt and hop anew!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.59% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately summarizes the main change: implementing encryption key rotation with a graceful fallback mechanism when rotation fails, which is the primary objective across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread server/command.go Outdated
Comment on lines +343 to +345
if err := p.setDefaultInstance(instanceName); err != nil {
return p.getCommandResponse(args, err.Error(), true), nil
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This setting of default instance behavior is going to be removed once #663 is merged and this PR syncs to the master branch

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
server/plugin.go (1)

1074-1077: Consider whether KV read errors should trigger disconnect.

Currently, if p.client.KV.Get fails with an error (not just nil data), the function returns (false, nil) - silently skipping the user without migration or disconnect. A transient KV error could leave a user's token encrypted with the old key, causing auth failures on subsequent requests.

This fail-open approach may be intentional to avoid mass disconnections on transient KV issues. If so, consider adding a warning log for the error case to aid debugging:

🔧 Suggested improvement
 var tokenBytes []byte
-if err := p.client.KV.Get(kvKey, &tokenBytes); err != nil || tokenBytes == nil {
+if err := p.client.KV.Get(kvKey, &tokenBytes); err != nil {
+    p.client.Log.Warn("Failed to read token during re-encryption, skipping user",
+        "user_id", userID, "error", err.Error())
+    return false, nil
+}
+if tokenBytes == nil {
     return false, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/plugin.go` around lines 1074 - 1077, The KV read silently ignores
errors; update the block that calls p.client.KV.Get(kvKey, &tokenBytes) so that
if err != nil you emit a warning log with the error and kvKey (e.g. via the
plugin's logger, such as p.logger or p.log) before returning (false, nil),
rather than failing silently — keep the existing behavior (no disconnect) but
add the log to surface transient KV failures that could block token migration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/plugin.go`:
- Around line 1074-1077: The KV read silently ignores errors; update the block
that calls p.client.KV.Get(kvKey, &tokenBytes) so that if err != nil you emit a
warning log with the error and kvKey (e.g. via the plugin's logger, such as
p.logger or p.log) before returning (false, nil), rather than failing silently —
keep the existing behavior (no disconnect) but add the log to surface transient
KV failures that could block token migration.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ff9d4ba4-4a43-4a65-84e5-dfee8aae5852

📥 Commits

Reviewing files that changed from the base of the PR and between 4176e8c and 7cd5fc9.

📒 Files selected for processing (7)
  • server/audit.go
  • server/command.go
  • server/command_test.go
  • server/configuration.go
  • server/plugin.go
  • server/reencrypt_test.go
  • server/utils.go

@avasconcelos114 avasconcelos114 added the 3: Security Review Review requested from Security Team label Apr 2, 2026
@avasconcelos114 avasconcelos114 requested review from a user and edgarbellot April 2, 2026 14:17
Copy link
Copy Markdown

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @avasconcelos114 a few comments!

Comment thread server/configuration.go Outdated
if previousEncryptionKey != "" && configuration.EncryptionKey != "" &&
previousEncryptionKey != configuration.EncryptionKey {
newKey := configuration.EncryptionKey
go p.reEncryptUserData(newKey, previousEncryptionKey)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reEncryptUserData runs in a goroutine. If OnConfigurationChange fires again before re-encryption completes a second goroutine could start with a different previousEncryptionKey that is actually the new key from the first rotation and data corruption. Consider adding a check inside reEncryptUserData to verify that previousEncryptionKey still matches what's in the KV store

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah to prevent such a scenario I applied cluster.NewMutex to ensure only one goroutine handles the re-encryption and we don't end up with two competing goroutines, and in the event that we have two quick succession of migrations happening, the forceDisconnectUser flow should kick in and notify users to reconnect their accounts

Comment thread server/plugin.go Outdated

// Idempotency check: if the token can already be decrypted with the new key, it was already migrated.
if _, err := decrypt([]byte(newEncryptionKey), tokenStr); err == nil {
return false, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a small chance decrypting with the wrong key produces valid padding, causing a false-positive. Consider adding a version prefix to re-encrypted values instead.

Comment thread server/plugin.go
return false, err
}

reEncrypted, err := encrypt([]byte(newEncryptionKey), plainJSON)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If decrypt returns garbage that passes padding corrupted data gets re encrypted silently. Add json.Unmarshal validation before re-encrypting.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also resolves the false-positive issue you listed above so I'll keep this as a way to resolve both, thanks for the suggestion!

Comment thread server/plugin.go
}

auditRec.Success()
auditRec.AddEventResultState(result)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if ForceDisconnectCount == len(allKeys) audit says success. Mark as failure orpartial when disconnects occurred.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes this makes more sense, I'll adjust this

Copy link
Copy Markdown

@edgarbellot edgarbellot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @avasconcelos114 ! I just have one comment :)

Comment thread server/configuration.go Outdated
p.configurationLock.RUnlock()
newGitlabGroup := strings.TrimSpace(configuration.GitlabGroup)

p.setConfiguration(configuration, serverConfiguration)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a window between setConfiguration (211) activating the new key and reEncryptUserData (216) finishing. During that window, getGitlabUserTokenByMattermostID (476-497) tries to decrypt with the new key, but the tokens are still encrypted with the old one - so every user gets a decryption failure until their individual token is re-encrypted. In deployments with many users this could be a noticeable outage.

Would it make sense to add a fallback decryption attempt with the previous key inside getGitlabUserTokenByMattermostID (486-489)? Something like storing the previous key temporarily in the config (e.g. PreviousEncryptionKey) and, if decrypt with the current key fails, retrying with the previous one before returning an error. That way users stay functional while the background migration runs.

An alternative would be to run the re-encryption synchronously before activating the new key, but that could block OnConfigurationChange for a while depending on user count, so the fallback approach seems more practical.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yep that makes sense to me, I'll add this temporary config value as a fallback and this should make the process smoother, thanks for the suggestion!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/configuration.go`:
- Around line 44-47: Persist the previous encryption key and an in-progress
migration marker to the plugin KV so async re-encryption can survive restarts:
extend the Configuration struct's PreviousEncryptionKey persistence (or add a
RotationInProgress flag) and on rotation start write {PreviousEncryptionKey,
RotationInProgress=true} to KV; modify reEncryptUserData to read the persisted
state at startup and resume/finish re-encryption when RotationInProgress is
true; on successful completion clear RotationInProgress and remove/overwrite
PreviousEncryptionKey in KV. Ensure functions/methods that initiate rotation and
perform re-encryption (e.g., reEncryptUserData and wherever the rotation is
started) read/write these KV keys atomically so the goroutine can resume after
process restart.

In `@server/reencrypt_test.go`:
- Around line 136-147: The test "wrong-key decrypt garbage does not pass unpad"
is using a probabilistic assertion that decrypt must always return an error;
instead make it deterministic by accepting either a decrypt error or a JSON
unmarshal failure on the decrypted payload. Update the test around the calls to
encrypt/decrypt (functions encrypt and decrypt) to: call decrypt into a
plaintext variable, if decrypt returns an error accept it, otherwise attempt
json.Unmarshal on the decrypted plaintext into oauth2.Token and assert that
unmarshalling fails (or assert that the token fields are not usable). This
ensures the test passes deterministically when ciphertext decrypts to non-JSON
garbage or when decrypt fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b74df689-51dd-43c7-8fd8-20c682ad6b45

📥 Commits

Reviewing files that changed from the base of the PR and between 86737b4 and d3ade68.

📒 Files selected for processing (4)
  • server/command_test.go
  • server/configuration.go
  • server/plugin.go
  • server/reencrypt_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/plugin.go

Comment thread server/configuration.go
Comment thread server/reencrypt_test.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/reencrypt_test.go`:
- Around line 330-334: The tests TestReEncryptUserData_MultiplePages and
TestReEncryptUserData_PageErrorAfterFirstPage currently use Maybe() for
KVGet/KVSetWithOptions so per-token work can be skipped; modify each test to
record and assert that at least one page-0 key was processed by replacing or
augmenting the api.On("KVSetWithOptions", ...) mock: attach a .Run(func(args
mock.Arguments){ /* inspect args[1] or key arg and increment a local counter
when it matches a page-0 token (use encryptedTokenWithKey or known key prefix)
*/ }) and keep the Return(...).Maybe() (or use .Once().Return(...) if
appropriate), then after calling ReEncryptUserData assert counter > 0 to prove
page-0 processing, and call api.AssertExpectations(t) to ensure mocked
expectations are checked; do this in both tests
(TestReEncryptUserData_MultiplePages and
TestReEncryptUserData_PageErrorAfterFirstPage).
- Around line 445-459: The tests
TestGetGitlabUserTokenByMattermostID_FallbackToPreviousKey and
TestGetGitlabUserTokenByMattermostID_BothKeysFail set .Once() on the API mock
for KVGet but never verify those expectations; update each test to call
api.AssertExpectations(t) (or api.AssertCalled/AssertNumberOfCalls as
appropriate) after the existing require/assert checks to ensure the KVGet mock
was actually invoked as expected, referencing the API mock variable api and the
KVGet expectation for kvKey.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 34353ec7-d85c-4e03-a36f-aeb763612d70

📥 Commits

Reviewing files that changed from the base of the PR and between d3ade68 and 8ce0215.

📒 Files selected for processing (1)
  • server/reencrypt_test.go

Comment thread server/reencrypt_test.go Outdated
Comment thread server/reencrypt_test.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/plugin.go`:
- Around line 1062-1071: Add tracking for users skipped due to KV read errors by
extending the audit result and incrementing it when reEncryptUserToken returns
(false, nil). Update the ReEncryptUserDataAuditResult struct in server/audit.go
to include a SkippedCount (uint or int), then in server/plugin.go where the loop
over allKeys calls reEncryptUserToken, change the logic so: if reErr != nil keep
existing behavior (increment ForceDisconnectCount), else if migrated increment
MigratedCount, else increment result.SkippedCount for the (false, nil) case;
also update any serialization/consumers of ReEncryptUserDataAuditResult to
handle the new field.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 62822ea5-da25-4a89-966a-e2b56368b525

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce0215 and 992090e.

📒 Files selected for processing (2)
  • server/plugin.go
  • server/reencrypt_test.go

Comment thread server/plugin.go
@avasconcelos114
Copy link
Copy Markdown
Contributor Author

@coderabbitai approve

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

✅ Actions performed

Comments resolved and changes approved.

@avasconcelos114 avasconcelos114 requested review from a user and edgarbellot April 9, 2026 18:45
Copy link
Copy Markdown

@edgarbellot edgarbellot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfectly addressed, as always! Thank you Andre!

Copy link
Copy Markdown

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Copy Markdown

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@avasconcelos114 avasconcelos114 added the 3: QA Review Requires review by a QA tester label Apr 10, 2026
@avasconcelos114 avasconcelos114 requested a review from ogi-m April 10, 2026 15:27
Copy link
Copy Markdown

@ogi-m ogi-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ogi-m ogi-m added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer 3: QA Review Requires review by a QA tester 3: Security Review Review requested from Security Team labels Apr 15, 2026
@avasconcelos114
Copy link
Copy Markdown
Contributor Author

@nang2049 Could you approve this PR again? GitHub disabled the merge button with a message that it needs an approval from mattermost/integrations code owners

@avasconcelos114 avasconcelos114 merged commit a97e4bd into master Apr 20, 2026
17 checks passed
@avasconcelos114 avasconcelos114 deleted the MM-67613 branch April 20, 2026 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

4: Reviews Complete All reviewers have approved the pull request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants