Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -25,4 +25,5 @@ wheels/
!.env.example
!.markdownlint-cli2.yaml
!maps-example.yaml
!tests/tests.yaml
!tests/e2e/fixtures/**/maps.yaml
46 changes: 35 additions & 11 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,10 @@
# AGENTS.md

## Reference materials

- GraphQL schema and database migrations (changes to SQL schema) are available in
<https://github.com/sourcegraph/artifacts>

## Linting

```bash
Expand All @@ -26,21 +31,40 @@ uv run src-auth-perms-sync --help

## Testing

- First run a dry-run (default behaviour, without `--apply` flag) against a Sourcegraph instance
All testing runs through one entrypoint: `tests/run.py`. Output goes to the
console and to a per-run log file under `logs/`. Each level runs only its
own checks.

```bash
uv run src-auth-perms-sync [--get]
uv run src-auth-perms-sync --set maps.yaml --full
uv run src-auth-perms-sync --restore backups/<source>/<run>/before.json
# Fast, no network (also what the pre-commit hook runs):
# lint, format, pyright, unit + fixture tests, CLI rejection matrix,
# randomized permission invariants
uv run tests/run.py

# End-to-end runs against the .env test instance with independent GraphQL
# read-back verification, and a wheel install smoke test
uv run tests/run.py --live

# Run a subset: comma-delimited test names, substring match
uv run tests/run.py --live full-overwrite-unions
uv run tests/run.py --live wheel,baseline

# Repeated timed runs with Jaeger trace retention, RSS sampling,
# optional kubectl load monitoring, and baseline comparison
uv run tests/run.py --performance --repeat 3
uv run tests/run.py --performance --baseline-command "uvx src-auth-perms-sync@latest" \
--fail-on-memory-regression-percent 10

# Regenerate fixture goldens after editing tests/e2e/fixtures/ cases
uv run tests/run.py --update-golden
```

- Read the output, and evaluate the expected changes
- If the expected changes look correct
- Run with the `--apply` flag against the test instance
- Read and evaluate the output for expected changes
- Run with the `--restore` flag against the test instance
- Always inspect the before / after snapshots in
`src-auth-perms-sync-runs/<endpoint>/backups/` afterward to confirm the diff matches what you expected
- Fixture cases live in `tests/e2e/fixtures/<case>/` — see the README there
for the format. Add cases there to cover new mapping behaviors.
- For manual verification against a real instance, dry-run first (no
`--apply`), read the planned changes, then `--apply` on a scratch instance
and inspect the before/after snapshots under
`src-auth-perms-sync-runs/<endpoint>/runs/`.

## Release process

Expand Down
29 changes: 28 additions & 1 deletion dev/TODO.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,22 @@
# TODO

## Medium priority: extend SAML-group live coverage to org sync

tests/setup.py now fabricates SAML accounts with synthetic groups
(`perms-sync-test-eng` / `perms-sync-test-sales`, see tests/setup.yaml).
saml-group-live covers permission mapping; add a seeded
`sync-saml-orgs --apply` live case that maps those groups to a throwaway
org and asserts membership is added AND removed (today's
sync-saml-orgs-apply only covers the single real Okta user, add-only).

## Decide: pendingBindIDs / usersWithPendingPermissions

The CLI cannot create pending permissions (it validates users exist), but
snapshots record `pending_bindIDs`, and setup.py / the live hygiene check
report (never delete) any that appear. Decide whether "grant before first
login" is a customer need; if not, consider dropping the snapshot field.
See the thread discussion 2026-06-11.

## High priority: Remote trigger on demand

- Sourcegraph webhook for new user coming in v7.4.0
Expand All @@ -21,14 +38,24 @@
## High priority: Reduce worst-case full-permission sync load

- Use the stress-run evidence in
[memory-efficiency.md](./memory-efficiency.md)
[engineering-requests.md](./engineering-requests.md)
to request Sourcegraph bulk explicit-permission read and write APIs.
New evidence 2026-06-10: the whole-instance apply (1,150 repo
overwrites x 10,002 bindIDs each at parallelism 16) crashed the test
instance's Postgres ("connection refused", "unexpected EOF"); the
client circuit breaker opened and the harness restored cleanly. That
stress cycle is now opt-in: `uv run tests/run.py --live "full cycle"`.
- Add an explicit destructive/performance-test mode to the e2e runner so giant
stress runs can skip or defer full restore cleanup when the goal is finding
the server-side breaking point.
- Revisit full snapshot capture once Sourcegraph exposes a bulk read path;
replace aliased `User.permissionsInfo.repositories(source: API)` calls before
raising concurrency further.
- `get --repos <name>` still scans every user's explicit grants to find one
repo's holders (~400 s at 10k users). A repo-centric read
(`repository.permissionsInfo.users` + site-admin disambiguation, as the
test harness already does) would make it seconds — see the repo-centric
section below.

## Low priority: Repo-centric path, when users > repos, or for cross-checking

Expand Down
270 changes: 270 additions & 0 deletions dev/engineering-requests.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,270 @@
# Engineering requests

Use this when opening Sourcegraph Engineering issues from memory-efficiency
evidence. Capture steps stay in [memory-efficiency.md](./memory-efficiency.md);
this file keeps the request-ready problem statement, evidence, proposed API
shape, and copy/paste issue text.

## Requested Sourcegraph changes

1. Add a bulk GraphQL read path for explicit API repository permissions.
2. Add a cheaper presence/filter path for users without explicit API repo
permissions.
3. Add Jaeger spans / metrics around the new store methods and around current
`ListUserPermissions` / `CountUserPermissions` paths.
4. Follow up with a bulk overwrite API for large full-set applies.

## Current trace findings

Current `src-auth-perms-sync` snapshots explicit API grants by calling
`User.permissionsInfo.repositories(source: API)` through aliased
`UserExplicitReposBatch` queries. It requests only permission repo IDs, then
hydrates names separately with `RepositoryNamesByID`.

A focused traced batch for one user with 19 explicit repos showed per-user
fanout even when only IDs were requested:

| User aliases | CLI request | Jaeger spans | `LoadUserPermissions` | `sql.conn.query` |
| ---: | ---: | ---: | ---: | ---: |
| 1 | 398ms | 13 | 1 | 7 |
| 25 | 508ms | 157 | 25 | 127 |
| 100 | 1,185ms | 607 | 100 | 502 |

The second hydration query also fans out. A traced `RepositoryNamesByID` query
for 19 repos produced 46 spans, including 19 `repos.Get` spans and 22
`sql.conn.query` spans.

An older trace shape that resolved repository objects directly inside
`permissionsInfo.repositories` showed the per-repo resolver fanout more
dramatically:

| Request shape | Root GraphQL span | Jaeger fanout |
| --- | ---: | --- |
| 25 user aliases, 19 explicit repos each | ~770ms | 475 `repos.Get`, 603 `sql.conn.query` |
| 100 user aliases, 19 explicit repos each | ~3,769ms | 1,900 `repos.Get`, 2,403 `sql.conn.query` |

Together these point to Sourcegraph server-side GraphQL / DB resolver fanout,
not local Python CPU. Larger batches reduce request count but can increase
per-request resolver and SQL work enough to cause timeouts on the test
instance.

One live-instance behavior is expected: if Sourcegraph returns a GraphQL
application error showing that a repo/user disappeared between planning and the
mutation, `src-auth-perms-sync` logs a skipped mutation and continues. The next
scheduled run will re-plan against the then-current users/repos. Other GraphQL
application errors still fail normally.

## Stress-run evidence

A prior hard stress map used about 10,001 users and about 1,000 repos, planning
roughly 10 million explicit grants. That run showed Sourcegraph-side read and
write costs were the bottleneck. `pg_stat_statements` attributed most database
time to explicit-permissions helpers:

| Sourcegraph operation | Calls | Total time | Mean time |
| --- | ---: | ---: | ---: |
| `permsStore.ListUserPermissions` | 19,974 | 30,862.6s | 1,545ms |
| `permsStore.upsertUserRepoPermissions-range1` | 472 | 1,178.8s | 2,497ms |

Compared with focused traces at normal scale, `ListUserPermissions` became much
slower under the large explicit-perms state. This reinforces that the CLI needs
better Sourcegraph bulk read and write APIs for very large explicit permission
sets.

## Concurrent-operator evidence (2026-06-10)

Four `src-auth-perms-sync` processes ran full explicit-permissions captures
concurrently against the 10k-user / 50k-repo test instance (each at
`--parallelism 8`, `--explicit-permissions-batch-size 25`), while a fifth ran a
small `set` command. Instance: single `pgsql-0` on an 8-core node.

Observed during the concurrent captures:

- `pgsql-0` CPU (`kubectl top`): 7,636–7,683 millicores of 8,000 (saturated).
- `frontend` / `gitserver` CPU: 124–138m / 2–3m (idle bystanders).
- `pg_stat_activity`: 29 active statements, all
`permsStore.ListUserPermissions`, **zero wait events** — pure CPU, no lock
contention.
- `pg_stat_statements`: `permsStore.ListUserPermissions` at 24,026 calls,
27,635.6s total, 1,150ms mean.
- Per-client capture throughput: 23 users/sec solo → 2–4 users/sec at 4-way
concurrency.
- Aggregate throughput: 8–16 users/sec at 4-way — **below the 23 users/sec a
single client achieves alone** (negative scaling).
- ALB (CloudWatch): no 5xx, no rejected connections — the edge and frontend
are not the bottleneck.
- Collateral failure: the fifth client's queries exceeded the 60s read timeout
under this load; 5 retry attempts exhausted; its run failed with exit 1.

Implications for the engineering request:

- A single per-user `permissionsInfo.repositories(source: API)` read costs
roughly 0.3–0.4s of Postgres CPU at this state size (1,150ms mean execution
under contention), so one operator at modest parallelism can saturate the
database by itself, and two concurrent operators degrade each other below
single-operator throughput.
- Timeout/retry behavior amplifies the problem: once statements exceed the
client read timeout, retries re-run the same expensive queries, adding load
exactly when the database is saturated.
- A bulk read API (one query returning explicit grants for many users or for
whole repos) would replace ~10,000 × ~1s statements per capture with a
single scan, and would also make concurrent operators viable.

## Sourcegraph codepath findings

[Deep Search findings](https://sourcegraph.sourcegraph.com/deepsearch/52a24164-1eb3-4db1-a92d-e320ef1c7557)
from `github.com/sourcegraph/sourcegraph`:

- Schema: `cmd/frontend/graphqlbackend/authz.graphql` exposes
`User.permissionsInfo.repositories(source: PermissionSource)`.
- `UserResolver.PermissionsInfo` enters
`cmd/frontend/internal/authz/resolvers/resolver.go` and calls
`db.Perms().LoadUserPermissions(ctx, userID)` before the repositories
connection is resolved.
- `userPermissionsInfoResolver.Repositories` in
`cmd/frontend/internal/authz/resolvers/permissions_info.go` uses the generic
connection resolver, so `nodes` and `totalCount` can evaluate separately.
- Each permission node's `Repository()` resolver calls `db.Repos().Get`,
creating an N+1 query pattern for repository hydration.
- Even when the client asks only for permission repo IDs, each aliased user
still runs `LoadUserPermissions` and several SQL queries. Current
`src-auth-perms-sync` then hydrates repository names separately through
`node(id)`, which also resolves as one `repos.Get` per repository ID.
- `internal/database/perms_store.go` has bulk write helpers for setting repo
permissions, but the read path uses per-user connection queries and repo
resolver fanout.

## Proposed bulk read API

`src-auth-perms-sync` needs to snapshot explicit API permissions for many
users. Today it calls `User.permissionsInfo.repositories(source: API)` with
GraphQL aliases. This is correct, but expensive at scale.

Request a bulk read API for explicit permissions. GraphQL semantics make this a
query, not a mutation:

```graphql
type ExplicitRepositoryPermission {
userID: ID!
repositoryID: ID!
repositoryName: String!
updatedAt: DateTime!
}

extend type Query {
explicitRepositoryPermissionsForUsers(
userIDs: [ID!]!
source: PermissionSource = API
): [ExplicitRepositoryPermission!]!
}
```

Back it with one SQL shape per user batch:

```sql
SELECT urp.user_id, urp.repo_id, repo.name, urp.updated_at
FROM user_repo_permissions urp
JOIN repo ON repo.id = urp.repo_id AND repo.deleted_at IS NULL
WHERE urp.user_id = ANY($1)
AND urp.source = 'api'
ORDER BY urp.user_id, repo.name;
```

Important requirements:

- Return compact scalar data, not `Repository` GraphQL objects, to avoid
per-repo resolver hydration.
- Enforce the same authorization policy as the current user permissions
resolver.
- Support batching / pagination for large user lists.
- Add Jaeger spans around the new store method and around existing
`ListUserPermissions` / `CountUserPermissions` so future investigations do
not require inferring work from `sql.conn.query` spans alone.

Expected benefit: replace hundreds or thousands of per-repo resolver SQL spans
per request with one indexed `user_repo_permissions` join per user batch.

## Proposed presence/filter API

The `get --users-without-explicit-perms` path also needs a cheaper presence
check. Today it has to ask
`User.permissionsInfo.repositories(source: API, first: 1)` for every candidate
user, in aliased batches. Recent test runs show the client can parallelize
those batches, but the Sourcegraph frontend / load balancer can still return
502/503s under that resolver load. Add one or both direct APIs:

```graphql
type ExplicitRepositoryPermissionPresence {
userID: ID!
hasExplicitRepositoryPermissions: Boolean!
}

extend type Query {
explicitRepositoryPermissionPresenceForUsers(
userIDs: [ID!]!
source: PermissionSource = API
): [ExplicitRepositoryPermissionPresence!]!

usersWithoutExplicitRepositoryPermissions(
createdAt: DateTimeFilter
source: PermissionSource = API
first: Int
after: String
): UserConnection!
}
```

Expected benefit: `src-auth-perms-sync get --users-without-explicit-perms` can
either check explicit-permission presence for candidate users in one indexed
batch query, or ask Sourcegraph for the filtered user set directly instead of
probing every user through the expensive permissions connection resolver.

## Bulk overwrite follow-up

The stress profile also needs attention on the write path. A purpose-built bulk
overwrite API that accepts many repo/user edges at once, streams or stages the
input server-side, and avoids repeated per-repo permission reconciliation would
make worst-case full syncs much safer.

## Copy/paste request

Title: Add a bulk GraphQL read path for explicit repository permissions

Problem: `src-auth-perms-sync` must snapshot explicit API repo permissions for
many users. The only current GraphQL read path is
`User.permissionsInfo.repositories(source: API)`. Current traces show this is
per-user work even when the client asks only for repo IDs: 25 aliases produced
25 `LoadUserPermissions` spans and 127 SQL spans; 100 aliases produced 100
`LoadUserPermissions` spans and 502 SQL spans. The client must then hydrate
repository names separately; a 19-repo `RepositoryNamesByID` query produced 19
`repos.Get` spans and 22 SQL spans. Older traces that resolved repository
objects directly inside `permissionsInfo.repositories` produced 475 `repos.Get`
spans for 25 aliases and 1,900 for 100 aliases. Larger batches and higher
concurrency therefore increase server-side resolver/SQL fanout enough to cause
timeouts instead of improving throughput.

Request: add a bulk explicit-permissions read API that accepts many user IDs and
returns compact permission edges (`userID`, `repositoryID`, `repositoryName`,
`updatedAt`) for `source: API`, without resolving full `Repository` GraphQL
objects. A single indexed query over `user_repo_permissions` joined to `repo`
should be enough for each user batch. Also add a cheaper presence/filter path
for `get --users-without-explicit-perms`: either `userID -> has explicit API
repo permissions` for many users, or a direct query for users without explicit
API repo permissions, optionally filtered by `createdAt`.

Acceptance criteria:

- One request can fetch explicit API repo permissions for many users.
- The response includes repository ID and name without triggering per-repo
`db.Repos().Get` resolver calls.
- The implementation preserves current authorization checks.
- The store method and resolver have Jaeger spans/metrics that make per-batch
latency visible.
- `src-auth-perms-sync` can replace its aliased
`User.permissionsInfo.repositories(source: API)` calls with this API.
- `src-auth-perms-sync get --users-without-explicit-perms` can stop probing
every candidate user through `User.permissionsInfo.repositories(source: API,
first: 1)`.
- Follow-up: evaluate a bulk overwrite API for large full-set applies. The
stress run planned roughly 10 million grants and observed
`permsStore.upsertUserRepoPermissions-range1` averaging about 2.5s per call.
Loading