This document covers operational procedures for the admin dashboard: how to deploy with admin enabled, how operators log in, how to roll out role / key / TLS changes safely, and which runbooks fire when something breaks.
For the configuration reference (per-flag semantics, hard-error
startup conditions, audit log shape, troubleshooting catalogue),
read docs/admin.md first — this doc assumes you have
already skimmed it.
For design rationale, see
docs/design/2026_04_24_implemented_admin_dashboard.md.
Before turning the admin listener on, prepare the following on a machine you control (not on a shared bastion):
Exactly 64 raw bytes. The validator rejects any other length at startup with a precise error.
# Generate 64 random bytes, base64-encode with standard padding (88 chars).
openssl rand 64 | base64 > admin-hs256.b64
# Or with RawURLEncoding (86 chars, no padding) — both are accepted.
openssl rand 64 | base64 | tr '+/' '-_' | tr -d '=' > admin-hs256.b64Verify the length:
# Standard padding: 88 chars including newline trailer = 89 bytes total
wc -c admin-hs256.b64
# Decoded byte count:
base64 -d admin-hs256.b64 | wc -c # → 64Distribute the same file to every node — under /etc/elastickv/
on each remote host with 0400 permissions owned by the runtime
user. The dashboard read paths re-validate JWTs locally on each
node, so a key mismatch breaks follower → leader forwarding the
moment a follower is elected leader.
The same -s3CredentialsFile the data-plane S3 adapter consumes.
Add an entry per admin operator (each operator gets a distinct
access key):
{
"credentials": [
{"access_key_id": "AKIA_ADMIN_ALICE", "secret_access_key": "..."},
{"access_key_id": "AKIA_OBSERVER_BOB", "secret_access_key": "..."}
]
}The dashboard authenticates against this file using the same
subtle.ConstantTimeCompare the SigV4 path uses; secrets must
match exactly (trailing whitespace counts).
Decide who gets full (mutate) vs read-only (inspect) and write
the access keys into two comma-separated lists. The same key must
NOT appear in both — startup rejects an overlap with a hard error.
ADMIN_FULL_ACCESS_KEYS="AKIA_ADMIN_ALICE"
ADMIN_READ_ONLY_ACCESS_KEYS="AKIA_OBSERVER_BOB"
The role index is checked on every state-changing request, not
only at login. Promotions and demotions take effect on the next
admin write — but the JWT freezes the role at login (the live store
can only narrow the JWT's role, never widen it; see
docs/admin.md §Roles).
If the listener will bind to anything other than 127.0.0.1 /
::1, prepare a PEM cert + key pair on each remote host. Both
files must be readable by the runtime user. Hot-reload is not
supported — rotating certs requires a full container restart, so
prefer cert lifetimes that align with your existing rolling-update
cadence.
If you genuinely need plaintext on a non-loopback bind for a
short-lived test rig, see §3.3 below for the two-flag escape hatch
(both ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=true and
ADMIN_ALLOW_INSECURE_DEV_COOKIE=true are needed — the former
without the latter is a broken-end-to-end configuration).
Use scripts/rolling-update.sh for both first-time deploys and
subsequent rollouts. The script bind-mounts the referenced files
into the container read-only, validates that each path exists on
the remote host before stopping the previous container, and passes
the corresponding flags to the daemon.
Copy scripts/rolling-update.env.example and fill in the
ADMIN_* block:
# admin-cluster.env
NODES="n1=raft-1.example,n2=raft-2.example,n3=raft-3.example"
IMAGE="ghcr.io/bootjp/elastickv:v0.42.0"
SSH_USER="deploy"
S3_CREDENTIALS_FILE="/etc/elastickv/s3-credentials.json"
ADMIN_ENABLED="true"
ADMIN_ADDRESS="0.0.0.0:8080"
ADMIN_FULL_ACCESS_KEYS="AKIA_ADMIN_ALICE"
ADMIN_READ_ONLY_ACCESS_KEYS="AKIA_OBSERVER_BOB"
ADMIN_SESSION_SIGNING_KEY_FILE="/etc/elastickv/admin-hs256.b64"
ADMIN_TLS_CERT_FILE="/etc/elastickv/admin-tls.crt"
ADMIN_TLS_KEY_FILE="/etc/elastickv/admin-tls.key"Place the credentials file, signing key, and TLS pair on every remote host at the configured paths before invoking the script — the validation pass aborts the rollout on the first node where any referenced path is missing, so you do not get partial-cluster states.
ROLLING_UPDATE_ENV_FILE=admin-cluster.env ./scripts/rolling-update.shThe script rolls one node at a time:
- Transfers Raft leadership away from the node (if it is leader).
- Stops the existing container.
- Starts a new container with the admin flags + bind-mounts.
- Waits for the gRPC health endpoint to come up.
- Moves on to the next node after
ROLLING_DELAY_SECONDS.
If a node fails the health check, the script exits non-zero before touching the next node — the cluster is left with at most one partially-updated node, which preserves quorum on a 3-node deployment.
After the rollout finishes, hit the listener from a host that can reach the admin port:
# Loopback (TLS optional)
curl -sf http://127.0.0.1:8080/admin/healthz
# Non-loopback (TLS required by default)
curl -sf https://raft-1.example:8080/admin/healthzThe endpoint returns 204 No Content when the listener is up.
- Open
https://<admin-host>:<admin-port>/admin/in a browser. - Enter the access key + secret pair from
-s3CredentialsFile. The SPA POSTs to/admin/api/v1/auth/login, the server validates the credentials with constant-time comparison, and on success sets two cookies — the session JWT and the CSRF double-submit token, bothSecure; HttpOnly; SameSite=Strict; Path=/admin/. - The dashboard becomes available. Every state-changing request carries the CSRF token in a header so a cross-site page cannot force an action against the session.
| HTTP code | Cause | Remediation |
|---|---|---|
401 invalid_credentials |
access key + secret pair did not match -s3CredentialsFile |
Verify the credentials file is the one the running process loaded (it is read once at startup) and that the secret matches exactly. |
403 forbidden (login) |
credentials matched, but the access key is not in -adminFullAccessKeys / -adminReadOnlyAccessKeys |
Add the key to one of the role flags and restart every node. |
403 forbidden (write) |
session is read-only | Move the access key into -adminFullAccessKeys, then restart every node. |
503 leader_unavailable |
Raft mid-election | Re-issue after a moment; the SPA does not auto-retry today (see docs/admin.md §Election-period 503). |
The leader's audit log distinguishes 401 vs 403 with the precise
reason; the dashboard does not surface it for security. See
docs/admin.md §Audit log for the slog shapes.
Loopback only (the default 127.0.0.1:8080 bind): no TLS
required. Cookies are minted with Secure=true; modern browsers
accept them on the loopback origin without TLS via the
loopback-is-trusted policy. If a specific browser refuses the
cookie, set ADMIN_ALLOW_INSECURE_DEV_COOKIE=true to drop the
Secure flag.
Plaintext non-loopback (test rig only): set both
ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=true AND
ADMIN_ALLOW_INSECURE_DEV_COOKIE=true. The former bypasses the
startup TLS guard; the latter drops Secure from the cookie. Each
flag is independent on purpose so a misconfigured deployment fails
closed on either axis instead of silently downgrading both.
- Edit
ADMIN_FULL_ACCESS_KEYS/ADMIN_READ_ONLY_ACCESS_KEYSin your env file. The key must NOT appear in both lists — startup will reject the overlap. - If the key is being added, also add the access key + secret to
S3_CREDENTIALS_FILEon every node. (Removing a key only requires the role-flag edit; the credentials file can be trimmed later.) - Re-run
scripts/rolling-update.sh. - The change takes effect on each node immediately at restart. During the rollout window, sessions whose JWT was minted on a not-yet-restarted node will see role enforcement against the stale list on those nodes — operators may briefly see 403s on read-only operations replayed against a not-yet-updated node.
If you cannot tolerate the rollout-window inconsistency, drain admin traffic before starting the rollout (close all dashboard tabs and wait for the JWT TTL to expire on every issued token).
The cluster supports a two-key verification window so a rotation does not invalidate existing sessions.
primary previous
before rotation: key A (none)
rotation step 1: key B key A ← roll out
rotation step 2: key B (none) ← roll out (after JWT TTL)
Procedure:
- Generate the new key:
openssl rand 64 | base64 > admin-hs256-new.b64. - Copy
admin-hs256-new.b64to every node at a stable path (e.g./etc/elastickv/admin-hs256.next.b64) and the previous key remains at its existing path. - Edit the env file:
ADMIN_SESSION_SIGNING_KEY_FILE="/etc/elastickv/admin-hs256.next.b64" ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE="/etc/elastickv/admin-hs256.b64" - Roll the cluster. New tokens are minted with the new key; existing tokens minted under the previous key continue to verify until they expire.
- Wait at least one full session TTL (default 1h — see
sessionTTLininternal/admin/auth_handler.go) to ensure every previously-issued JWT has expired. - Promote the new key to be the only key — clear the previous-
file variable and rename the path on each node:
ADMIN_SESSION_SIGNING_KEY_FILE="/etc/elastickv/admin-hs256.b64" # ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE= - Roll the cluster again. The previous-key file can now be deleted from every node.
Hot-reload is not implemented (out of scope for the dashboard's maintenance model). Rotation steps:
- Place the new cert + key at a temporary path on every node
(e.g.
/etc/elastickv/admin-tls.next.crt/admin-tls.next.key). - Edit the env file to point at the new paths.
- Roll the cluster.
- Once all nodes have started with the new cert, delete the old files at your leisure.
A bind change is a regular config change — edit ADMIN_ADDRESS
and roll. Operators with the dashboard open will see their next
request fail (the old listener is gone); reload the SPA.
If you are switching from loopback-only to a reachable bind, you
must prepare TLS material (§1.4) before flipping the bind.
The startup hard-error fires the moment a non-loopback bind is
configured without TLS or without the
ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK flag. The script's pre-flight
validation will catch a missing TLS pair before stopping the
existing container.
Set ADMIN_ENABLED="false" (or remove ADMIN_ENABLED from the
env file entirely) and roll. The script emits no admin flags and
no admin bind-mounts; the daemon comes up with the listener off.
The signing key file and TLS material can stay on disk — they
only have effect when --adminEnabled is passed.
AdminDeleteBucket (and the SigV4 DeleteBucket path) follows
the standard S3 contract: the bucket must be empty. The dashboard
returns 409 BucketNotEmpty if any object exists.
Race-window contract change: a PutObject that returns 200
OK during the empty-probe → commit window of an AdminDeleteBucket
running concurrently can have its data swept along with the
bucket meta. The delete commits a DEL_PREFIX safety net across
every per-bucket key family in the same atomic
OperationGroup, so any object that landed in the race window
is tombstoned at the same commitTS rather than left as an
unreachable orphan. The behaviour is intentional — the
alternative was orphan objects that no API can enumerate or
remove. See
docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md
for the full race analysis.
Operationally:
- For planned bucket deletes, pause writes against the bucket
before issuing the delete. The 5-second window between the
empty-probe and the commit is small but non-zero; pausing
writes guarantees no in-flight
PutObjectis swept. - For emergency bucket deletes (e.g. cleaning up a
compromised bucket while traffic is still live), the safety
net guarantees the cluster reaches a clean state immediately
— but accept that any client whose
PutObjectwas in flight may have received200 OKfor data that no longer exists. - Re-creating a bucket with the same name after delete is safe.
BucketGenerationKeysurvives the delete, so the new bucket starts at a strictly higher generation; any object that ever escaped a previous delete (under the old generation prefix) stays invisible to the new bucket.
Daemon hard-errored at startup. Fix on the affected node:
- Verify
S3_CREDENTIALS_FILEis set in the env file. - SSH to the node, confirm the file exists and contains at least one entry.
- Re-run the rollout.
Daemon hard-errored. Either supply
ADMIN_TLS_CERT_FILE/ADMIN_TLS_KEY_FILE, or accept the
plaintext escape hatch in §3.3.
Expected during a role-flag change rollout (§4.1). The dashboard does not auto-refresh the role index — operators need to log in again on a node with the updated list. Wait for the rollout to finish, then ask the affected operator to refresh the SPA.
Almost always a cookie-Secure mismatch:
- Listener is plaintext (loopback or
ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=true) - But the cookie is
Secure=true(default — the browser refuses to send it back over HTTP)
Fix by setting ADMIN_ALLOW_INSECURE_DEV_COOKIE=true and rolling.
For a production deployment, use TLS instead.
The follower's LeaderForwarder could not reach a leader. This is
expected during Raft elections — see docs/admin.md §Election-period 503.
Investigate Raft leader-churn logs first; persistent 503s usually
mean the cluster has lost quorum.
docs/admin.md— per-flag configuration reference, audit log shapes, troubleshooting catalogue.docs/design/2026_04_24_implemented_admin_dashboard.md— design rationale, acceptance criteria, out-of-scope follow-ups.scripts/rolling-update.sh— the rollout driver this doc references throughout.scripts/rolling-update.env.example— starting point for your<env>.envfile.