sourcegraph · marcleblanc2 · Jun 11, 2026 · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026
@@ -25,4 +25,5 @@ wheels/
 !.env.example
 !.markdownlint-cli2.yaml
 !maps-example.yaml
+!tests/tests.yaml
 !tests/e2e/fixtures/**/maps.yaml
@@ -1,5 +1,10 @@
 # AGENTS.md
 
+## Reference materials
+
+- GraphQL schema and database migrations (changes to SQL schema) are available in
+  <https://github.com/sourcegraph/artifacts>
+
 ## Linting
 
 ```bash
@@ -26,21 +31,40 @@ uv run src-auth-perms-sync --help
 
 ## Testing
 
-- First run a dry-run (default behaviour, without `--apply` flag) against a Sourcegraph instance
+All testing runs through one entrypoint: `tests/run.py`. Output goes to the
+console and to a per-run log file under `logs/`. Each level runs only its
+own checks.
 
 ```bash
-uv run src-auth-perms-sync [--get]
-uv run src-auth-perms-sync --set maps.yaml --full
-uv run src-auth-perms-sync --restore backups/<source>/<run>/before.json
+# Fast, no network (also what the pre-commit hook runs):
+# lint, format, pyright, unit + fixture tests, CLI rejection matrix,
+# randomized permission invariants
+uv run tests/run.py
+
+# End-to-end runs against the .env test instance with independent GraphQL
+# read-back verification, and a wheel install smoke test
+uv run tests/run.py --live
+
+# Run a subset: comma-delimited test names, substring match
+uv run tests/run.py --live full-overwrite-unions
+uv run tests/run.py --live wheel,baseline
+
+# Repeated timed runs with Jaeger trace retention, RSS sampling,
+# optional kubectl load monitoring, and baseline comparison
+uv run tests/run.py --performance --repeat 3
+uv run tests/run.py --performance --baseline-command "uvx src-auth-perms-sync@latest" \
+  --fail-on-memory-regression-percent 10
+
+# Regenerate fixture goldens after editing tests/e2e/fixtures/ cases
+uv run tests/run.py --update-golden
 ```
 
-- Read the output, and evaluate the expected changes
-- If the expected changes look correct
-  - Run with the `--apply` flag against the test instance
-  - Read and evaluate the output for expected changes
-  - Run with the `--restore` flag against the test instance
-  - Always inspect the before / after snapshots in
-    `src-auth-perms-sync-runs/<endpoint>/backups/` afterward to confirm the diff matches what you expected
+- Fixture cases live in `tests/e2e/fixtures/<case>/` — see the README there
+  for the format. Add cases there to cover new mapping behaviors.
+- For manual verification against a real instance, dry-run first (no
+  `--apply`), read the planned changes, then `--apply` on a scratch instance
+  and inspect the before/after snapshots under
+  `src-auth-perms-sync-runs/<endpoint>/runs/`.
 
 ## Release process
 

@@ -1,5 +1,22 @@
 # TODO
 
+## Medium priority: extend SAML-group live coverage to org sync
+
+tests/setup.py now fabricates SAML accounts with synthetic groups
+(`perms-sync-test-eng` / `perms-sync-test-sales`, see tests/setup.yaml).
+saml-group-live covers permission mapping; add a seeded
+`sync-saml-orgs --apply` live case that maps those groups to a throwaway
+org and asserts membership is added AND removed (today's
+sync-saml-orgs-apply only covers the single real Okta user, add-only).
+
+## Decide: pendingBindIDs / usersWithPendingPermissions
+
+The CLI cannot create pending permissions (it validates users exist), but
+snapshots record `pending_bindIDs`, and setup.py / the live hygiene check
+report (never delete) any that appear. Decide whether "grant before first
+login" is a customer need; if not, consider dropping the snapshot field.
+See the thread discussion 2026-06-11.
+
 ## High priority: Remote trigger on demand
 
 - Sourcegraph webhook for new user coming in v7.4.0
@@ -21,14 +38,24 @@
 ## High priority: Reduce worst-case full-permission sync load
 
 - Use the stress-run evidence in
-  [memory-efficiency.md](./memory-efficiency.md)
+  [engineering-requests.md](./engineering-requests.md)
   to request Sourcegraph bulk explicit-permission read and write APIs.
+  New evidence 2026-06-10: the whole-instance apply (1,150 repo
+  overwrites x 10,002 bindIDs each at parallelism 16) crashed the test
+  instance's Postgres ("connection refused", "unexpected EOF"); the
+  client circuit breaker opened and the harness restored cleanly. That
+  stress cycle is now opt-in: `uv run tests/run.py --live "full cycle"`.
 - Add an explicit destructive/performance-test mode to the e2e runner so giant
   stress runs can skip or defer full restore cleanup when the goal is finding
   the server-side breaking point.
 - Revisit full snapshot capture once Sourcegraph exposes a bulk read path;
   replace aliased `User.permissionsInfo.repositories(source: API)` calls before
   raising concurrency further.
+- `get --repos <name>` still scans every user's explicit grants to find one
+  repo's holders (~400 s at 10k users). A repo-centric read
+  (`repository.permissionsInfo.users` + site-admin disambiguation, as the
+  test harness already does) would make it seconds — see the repo-centric
+  section below.
 
 ## Low priority: Repo-centric path, when users > repos, or for cross-checking
 

@@ -0,0 +1,270 @@
+# Engineering requests
+
+Use this when opening Sourcegraph Engineering issues from memory-efficiency
+evidence. Capture steps stay in [memory-efficiency.md](./memory-efficiency.md);
+this file keeps the request-ready problem statement, evidence, proposed API
+shape, and copy/paste issue text.
+
+## Requested Sourcegraph changes
+
+1. Add a bulk GraphQL read path for explicit API repository permissions.
+2. Add a cheaper presence/filter path for users without explicit API repo
+   permissions.
+3. Add Jaeger spans / metrics around the new store methods and around current
+   `ListUserPermissions` / `CountUserPermissions` paths.
+4. Follow up with a bulk overwrite API for large full-set applies.
+
+## Current trace findings
+
+Current `src-auth-perms-sync` snapshots explicit API grants by calling
+`User.permissionsInfo.repositories(source: API)` through aliased
+`UserExplicitReposBatch` queries. It requests only permission repo IDs, then
+hydrates names separately with `RepositoryNamesByID`.
+
+A focused traced batch for one user with 19 explicit repos showed per-user
+fanout even when only IDs were requested:
+
+| User aliases | CLI request | Jaeger spans | `LoadUserPermissions` | `sql.conn.query` |
+| ---: | ---: | ---: | ---: | ---: |
+| 1 | 398ms | 13 | 1 | 7 |
+| 25 | 508ms | 157 | 25 | 127 |
+| 100 | 1,185ms | 607 | 100 | 502 |
+
+The second hydration query also fans out. A traced `RepositoryNamesByID` query
+for 19 repos produced 46 spans, including 19 `repos.Get` spans and 22
+`sql.conn.query` spans.
+
+An older trace shape that resolved repository objects directly inside
+`permissionsInfo.repositories` showed the per-repo resolver fanout more
+dramatically:
+
+| Request shape | Root GraphQL span | Jaeger fanout |
+| --- | ---: | --- |
+| 25 user aliases, 19 explicit repos each | ~770ms | 475 `repos.Get`, 603 `sql.conn.query` |
+| 100 user aliases, 19 explicit repos each | ~3,769ms | 1,900 `repos.Get`, 2,403 `sql.conn.query` |
+
+Together these point to Sourcegraph server-side GraphQL / DB resolver fanout,
+not local Python CPU. Larger batches reduce request count but can increase
+per-request resolver and SQL work enough to cause timeouts on the test
+instance.
+
+One live-instance behavior is expected: if Sourcegraph returns a GraphQL
+application error showing that a repo/user disappeared between planning and the
+mutation, `src-auth-perms-sync` logs a skipped mutation and continues. The next
+scheduled run will re-plan against the then-current users/repos. Other GraphQL
+application errors still fail normally.
+
+## Stress-run evidence
+
+A prior hard stress map used about 10,001 users and about 1,000 repos, planning
+roughly 10 million explicit grants. That run showed Sourcegraph-side read and
+write costs were the bottleneck. `pg_stat_statements` attributed most database
+time to explicit-permissions helpers:
+
+| Sourcegraph operation | Calls | Total time | Mean time |
+| --- | ---: | ---: | ---: |
+| `permsStore.ListUserPermissions` | 19,974 | 30,862.6s | 1,545ms |
+| `permsStore.upsertUserRepoPermissions-range1` | 472 | 1,178.8s | 2,497ms |
+
+Compared with focused traces at normal scale, `ListUserPermissions` became much
+slower under the large explicit-perms state. This reinforces that the CLI needs
+better Sourcegraph bulk read and write APIs for very large explicit permission
+sets.
+
+## Concurrent-operator evidence (2026-06-10)
+
+Four `src-auth-perms-sync` processes ran full explicit-permissions captures
+concurrently against the 10k-user / 50k-repo test instance (each at
+`--parallelism 8`, `--explicit-permissions-batch-size 25`), while a fifth ran a
+small `set` command. Instance: single `pgsql-0` on an 8-core node.
+
+Observed during the concurrent captures:
+
+- `pgsql-0` CPU (`kubectl top`): 7,636–7,683 millicores of 8,000 (saturated).
+- `frontend` / `gitserver` CPU: 124–138m / 2–3m (idle bystanders).
+- `pg_stat_activity`: 29 active statements, all
+  `permsStore.ListUserPermissions`, **zero wait events** — pure CPU, no lock
+  contention.
+- `pg_stat_statements`: `permsStore.ListUserPermissions` at 24,026 calls,
+  27,635.6s total, 1,150ms mean.
+- Per-client capture throughput: 23 users/sec solo → 2–4 users/sec at 4-way
+  concurrency.
+- Aggregate throughput: 8–16 users/sec at 4-way — **below the 23 users/sec a
+  single client achieves alone** (negative scaling).
+- ALB (CloudWatch): no 5xx, no rejected connections — the edge and frontend
+  are not the bottleneck.
+- Collateral failure: the fifth client's queries exceeded the 60s read timeout
+  under this load; 5 retry attempts exhausted; its run failed with exit 1.
+
+Implications for the engineering request:
+
+- A single per-user `permissionsInfo.repositories(source: API)` read costs
+  roughly 0.3–0.4s of Postgres CPU at this state size (1,150ms mean execution
+  under contention), so one operator at modest parallelism can saturate the
+  database by itself, and two concurrent operators degrade each other below
+  single-operator throughput.
+- Timeout/retry behavior amplifies the problem: once statements exceed the
+  client read timeout, retries re-run the same expensive queries, adding load
+  exactly when the database is saturated.
+- A bulk read API (one query returning explicit grants for many users or for
+  whole repos) would replace ~10,000 × ~1s statements per capture with a
+  single scan, and would also make concurrent operators viable.
+
+## Sourcegraph codepath findings
+
+[Deep Search findings](https://sourcegraph.sourcegraph.com/deepsearch/52a24164-1eb3-4db1-a92d-e320ef1c7557)
+from `github.com/sourcegraph/sourcegraph`:
+
+- Schema: `cmd/frontend/graphqlbackend/authz.graphql` exposes
+  `User.permissionsInfo.repositories(source: PermissionSource)`.
+- `UserResolver.PermissionsInfo` enters
+  `cmd/frontend/internal/authz/resolvers/resolver.go` and calls
+  `db.Perms().LoadUserPermissions(ctx, userID)` before the repositories
+  connection is resolved.
+- `userPermissionsInfoResolver.Repositories` in
+  `cmd/frontend/internal/authz/resolvers/permissions_info.go` uses the generic
+  connection resolver, so `nodes` and `totalCount` can evaluate separately.
+- Each permission node's `Repository()` resolver calls `db.Repos().Get`,
+  creating an N+1 query pattern for repository hydration.
+- Even when the client asks only for permission repo IDs, each aliased user
+  still runs `LoadUserPermissions` and several SQL queries. Current
+  `src-auth-perms-sync` then hydrates repository names separately through
+  `node(id)`, which also resolves as one `repos.Get` per repository ID.
+- `internal/database/perms_store.go` has bulk write helpers for setting repo
+  permissions, but the read path uses per-user connection queries and repo
+  resolver fanout.
+
+## Proposed bulk read API
+
+`src-auth-perms-sync` needs to snapshot explicit API permissions for many
+users. Today it calls `User.permissionsInfo.repositories(source: API)` with
+GraphQL aliases. This is correct, but expensive at scale.
+
+Request a bulk read API for explicit permissions. GraphQL semantics make this a
+query, not a mutation:
+
+```graphql
+type ExplicitRepositoryPermission {
+  userID: ID!
+  repositoryID: ID!
+  repositoryName: String!
+  updatedAt: DateTime!
+}
+
+extend type Query {
+  explicitRepositoryPermissionsForUsers(
+    userIDs: [ID!]!
+    source: PermissionSource = API
+  ): [ExplicitRepositoryPermission!]!
+}
+```
+
+Back it with one SQL shape per user batch:
+
+```sql
+SELECT urp.user_id, urp.repo_id, repo.name, urp.updated_at
+FROM user_repo_permissions urp
+JOIN repo ON repo.id = urp.repo_id AND repo.deleted_at IS NULL
+WHERE urp.user_id = ANY($1)
+  AND urp.source = 'api'
+ORDER BY urp.user_id, repo.name;
+```
+
+Important requirements:
+
+- Return compact scalar data, not `Repository` GraphQL objects, to avoid
+  per-repo resolver hydration.
+- Enforce the same authorization policy as the current user permissions
+  resolver.
+- Support batching / pagination for large user lists.
+- Add Jaeger spans around the new store method and around existing
+  `ListUserPermissions` / `CountUserPermissions` so future investigations do
+  not require inferring work from `sql.conn.query` spans alone.
+
+Expected benefit: replace hundreds or thousands of per-repo resolver SQL spans
+per request with one indexed `user_repo_permissions` join per user batch.
+
+## Proposed presence/filter API
+
+The `get --users-without-explicit-perms` path also needs a cheaper presence
+check. Today it has to ask
+`User.permissionsInfo.repositories(source: API, first: 1)` for every candidate
+user, in aliased batches. Recent test runs show the client can parallelize
+those batches, but the Sourcegraph frontend / load balancer can still return
+502/503s under that resolver load. Add one or both direct APIs:
+
+```graphql
+type ExplicitRepositoryPermissionPresence {
+  userID: ID!
+  hasExplicitRepositoryPermissions: Boolean!
+}
+
+extend type Query {
+  explicitRepositoryPermissionPresenceForUsers(
+    userIDs: [ID!]!
+    source: PermissionSource = API
+  ): [ExplicitRepositoryPermissionPresence!]!
+
+  usersWithoutExplicitRepositoryPermissions(
+    createdAt: DateTimeFilter
+    source: PermissionSource = API
+    first: Int
+    after: String
+  ): UserConnection!
+}
+```
+
+Expected benefit: `src-auth-perms-sync get --users-without-explicit-perms` can
+either check explicit-permission presence for candidate users in one indexed
+batch query, or ask Sourcegraph for the filtered user set directly instead of
+probing every user through the expensive permissions connection resolver.
+
+## Bulk overwrite follow-up
+
+The stress profile also needs attention on the write path. A purpose-built bulk
+overwrite API that accepts many repo/user edges at once, streams or stages the
+input server-side, and avoids repeated per-repo permission reconciliation would
+make worst-case full syncs much safer.
+
+## Copy/paste request
+
+Title: Add a bulk GraphQL read path for explicit repository permissions
+
+Problem: `src-auth-perms-sync` must snapshot explicit API repo permissions for
+many users. The only current GraphQL read path is
+`User.permissionsInfo.repositories(source: API)`. Current traces show this is
+per-user work even when the client asks only for repo IDs: 25 aliases produced
+25 `LoadUserPermissions` spans and 127 SQL spans; 100 aliases produced 100
+`LoadUserPermissions` spans and 502 SQL spans. The client must then hydrate
+repository names separately; a 19-repo `RepositoryNamesByID` query produced 19
+`repos.Get` spans and 22 SQL spans. Older traces that resolved repository
+objects directly inside `permissionsInfo.repositories` produced 475 `repos.Get`
+spans for 25 aliases and 1,900 for 100 aliases. Larger batches and higher
+concurrency therefore increase server-side resolver/SQL fanout enough to cause
+timeouts instead of improving throughput.
+
+Request: add a bulk explicit-permissions read API that accepts many user IDs and
+returns compact permission edges (`userID`, `repositoryID`, `repositoryName`,
+`updatedAt`) for `source: API`, without resolving full `Repository` GraphQL
+objects. A single indexed query over `user_repo_permissions` joined to `repo`
+should be enough for each user batch. Also add a cheaper presence/filter path
+for `get --users-without-explicit-perms`: either `userID -> has explicit API
+repo permissions` for many users, or a direct query for users without explicit
+API repo permissions, optionally filtered by `createdAt`.
+
+Acceptance criteria:
+
+- One request can fetch explicit API repo permissions for many users.
+- The response includes repository ID and name without triggering per-repo
+  `db.Repos().Get` resolver calls.
+- The implementation preserves current authorization checks.
+- The store method and resolver have Jaeger spans/metrics that make per-batch
+  latency visible.
+- `src-auth-perms-sync` can replace its aliased
+  `User.permissionsInfo.repositories(source: API)` calls with this API.
+- `src-auth-perms-sync get --users-without-explicit-perms` can stop probing
+  every candidate user through `User.permissionsInfo.repositories(source: API,
+  first: 1)`.
+- Follow-up: evaluate a bulk overwrite API for large full-set applies. The
+  stress run planned roughly 10 million grants and observed
+  `permsStore.upsertUserRepoPermissions-range1` averaging about 2.5s per call.