Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions dev/TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,13 @@ See the thread discussion 2026-06-11.
- Use the stress-run evidence in
[engineering-requests.md](./engineering-requests.md)
to request Sourcegraph bulk explicit-permission read and write APIs.
2026-06-12: presence-probe resolver internals and measured costs added
there (see "Presence-check resolver internals"); request is ready to
submit. Client-side, `set --users-without-explicit-perms` now matches
rules before probing and hydrates users in aliased batches (5,210s →
15s on the 10k-user instance), but `get --users-without-explicit-perms`
still probes every active user — only a server-side presence/filter API
fixes that.
New evidence 2026-06-10: the whole-instance apply (1,150 repo
overwrites x 10,002 bindIDs each at parallelism 16) crashed the test
instance's Postgres ("connection refused", "unexpected EOF"); the
Expand Down
46 changes: 46 additions & 0 deletions dev/engineering-requests.md
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,52 @@ from `github.com/sourcegraph/sourcegraph`:
permissions, but the read path uses per-user connection queries and repo
resolver fanout.

## Presence-check resolver internals (2026-06-12)

Measured on the 10k-user / 50k-repo test instance, the presence probe
`User.permissionsInfo.repositories(source: API, first: 1)` costs 225–350ms of
server work per user, and alias batching barely helps (21,004 single-user
probes averaged 351.6ms; 25-user batches averaged 5,616ms ≈ 224.7ms/user). A
single `set --users-without-explicit-perms` run probing all 10,002 users at
batch size 1 spent 4,269s of its 5,210s total in these probes.

Reading the resolver code in `github.com/sourcegraph/sourcegraph` explains why
`first: 1` does not make the probe cheap:

- `UserResolver.PermissionsInfo` (resolver.go) **unconditionally** calls
`db.Perms().LoadUserPermissions`, which runs
`SELECT ... FROM user_repo_permissions WHERE user_id = $1` with **no LIMIT
and no source filter**, only to compute the parent object's `source` /
`updatedAt` fields. The rows are discarded afterward.
- `userPermissionsInfoResolver.Repositories` (permissions_info.go) builds a
CTE (`reposPermissionsInfoQueryFmt` in perms_store.go) that **materializes
every repo accessible to the user** — a full `repo` ⋈
`external_service_repos` ⋈ `external_services` join with the correlated
authz `EXISTS` predicate and an `ORDER BY` — before the outer query applies
`urp.source = 'API'` and the LIMIT. `first: 1` becomes `LIMIT 2` on the
outer query only; the CTE is not short-circuited.
- Requesting `totalCount` adds a second independent execution of the same CTE
(`CountUserPermissions`).
- `user_repo_permissions` has only two indexes:
`(user_id, user_external_account_id, repo_id)` unique and `(repo_id)`.
Nothing covers `source`, so even the row-level filter cannot use an index.

The cheap query Sourcegraph could run instead is a single indexed scan:

```sql
SELECT DISTINCT user_id FROM user_repo_permissions WHERE source = 'api';
```

Client-side mitigation shipped in `src-auth-perms-sync` (2026-06-12):
`set --users-without-explicit-perms` now matches maps.yaml user selectors
locally BEFORE probing, so probes scale with the rule-matched user count
instead of the instance's user count, and user hydration runs as aliased
25-user batches instead of one `UserByID` request per user. The remaining
inherent cost — ~225ms × probed user — is exactly what the
presence/filter API requested below would remove, and
`get --users-without-explicit-perms` still has to probe every active user
because its semantics are instance-wide.

## Current explicit-permissions REST API findings

[Deep Search findings](https://sourcegraph.sourcegraph.com/deepsearch/972e0964-1b41-4805-8774-65dad2ad58c6)
Expand Down
99 changes: 69 additions & 30 deletions src/src_auth_perms_sync/permissions/command.py
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,26 @@ def _mapping_context_with_rules(
)


def _users_matching_any_mapping_rule(
context: permission_types.MappingContext,
users: list[shared_types.User],
) -> list[shared_types.User]:
"""Return users matched by at least one mapping rule's user selector."""
return [
user
for user in users
if any(
permissions_mapping.user_matches_user_selector(
mapping_rule["users"],
user,
context.providers,
context.saml_groups_attribute_names,
)
for mapping_rule in context.mapping_rules
)
]


def _mapping_rules_matching_selected_users(
context: permission_types.MappingContext,
users: list[shared_types.User],
Expand Down Expand Up @@ -506,22 +526,17 @@ def _hydrate_site_user_candidates(
return []

log.info(
"Hydrating Sourcegraph metadata for %d selected user candidate(s) with parallelism=%d ...",
"Hydrating Sourcegraph metadata for %d selected user candidate(s) "
"in batches of %d with parallelism=%d ...",
len(candidates),
permissions_sourcegraph.USER_HYDRATION_BATCH_SIZE,
parallelism,
)

def hydrate_user(candidate: shared_types.SiteUserCandidate) -> shared_types.User | None:
return permissions_sourcegraph.get_user_by_id(
client,
candidate["id"],
include_emails=include_emails,
include_account_data=include_account_data,
)

hydrated_users = run_context.parallel_map(
hydrate_user,
candidates,
hydrated_users = permissions_sourcegraph.get_users_by_ids(
client,
[candidate["id"] for candidate in candidates],
include_emails=include_emails,
include_account_data=include_account_data,
parallelism=parallelism,
worker_pool=worker_pool,
progress_label="Hydrated selected Sourcegraph user metadata",
Expand Down Expand Up @@ -852,31 +867,55 @@ def cmd_set_additive_users_without_explicit_perms(
include_user_account_data = permissions_mapping.mapping_rules_need_saml_account_data(
mapping_rules
)
candidate_selection = (
permissions_sourcegraph.list_site_user_candidates_without_explicit_repos(
client,
created_after_filter,
batch_size=explicit_permissions_batch_size,
parallelism=parallelism,
worker_pool=worker_pool,
)
)
candidates = candidate_selection.candidates
log.info(
"Selected %d active user candidate(s) without explicit repo permissions; "
"skipped %d with existing explicit permissions.",
len(candidates),
candidate_selection.explicit_user_count,
# Match mapping rules BEFORE the explicit-permission check: the
# per-user `permissionsInfo.repositories(source: API, first: 1)`
# probe costs ~0.2-0.4s of server CPU per user regardless of
# batching (the resolver materializes every accessible repo before
# the LIMIT applies), while rule matching is local. Checking only
# rule-matched users keeps the expensive probes proportional to the
# maps.yaml scope instead of the whole instance.
candidates = permissions_sourcegraph.list_site_user_candidates(
client,
created_after_filter,
parallelism=parallelism,
worker_pool=worker_pool,
)

users = _hydrate_site_user_candidates(
all_users = _hydrate_site_user_candidates(
client,
candidates,
include_emails=include_user_emails,
include_account_data=include_user_account_data,
parallelism=parallelism,
worker_pool=worker_pool,
)
matched_users = _users_matching_any_mapping_rule(context, all_users)
log.info(
"%d / %d active user(s) match a mapping rule's user selector.",
len(matched_users),
len(all_users),
)
explicit_user_ids: set[str] = set()
if matched_users:
log.info(
"Checking %d matched user(s) for existing explicit repo permissions "
"in batches of %d ...",
len(matched_users),
explicit_permissions_batch_size,
)
explicit_user_ids = permissions_sourcegraph.user_ids_with_explicit_repos(
client,
[user["id"] for user in matched_users],
batch_size=explicit_permissions_batch_size,
parallelism=parallelism,
worker_pool=worker_pool,
)
users = [user for user in matched_users if user["id"] not in explicit_user_ids]
log.info(
"Selected %d matched user(s) without explicit repo permissions; "
"skipped %d with existing explicit permissions.",
len(users),
len(explicit_user_ids),
)
if not users:
_run_additive_apply(
client,
Expand Down
26 changes: 26 additions & 0 deletions src/src_auth_perms_sync/permissions/queries.py
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,32 @@ def query_user_by_id(
"""


def users_by_ids_batch_query(
batch_size: int,
*,
include_emails: bool = False,
include_account_data: bool = True,
) -> str:
"""Hydrate many users in one request via aliased `node()` lookups.

Replaces one `UserByID` round trip per user: the per-request overhead
dominates user hydration, so batching cuts request count by the batch
size with the same per-user fields.
"""
fields = user_fields(include_emails=include_emails, include_account_data=include_account_data)
variables = ", ".join(f"$user{index}: ID!" for index in range(batch_size))
aliases = "".join(
f"""
user{index}: node(id: $user{index}) {{
... on User {{
{fields}
}}
}}"""
for index in range(batch_size)
)
return f"query UsersByIDBatch({variables}) {{{aliases}\n}}"


QUERY_USER_BY_USERNAME = query_user_by_username()
QUERY_USER_BY_EMAIL = query_user_by_email()
QUERY_USER_BY_ID = query_user_by_id()
Expand Down
45 changes: 45 additions & 0 deletions src/src_auth_perms_sync/permissions/sourcegraph.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
log = logging.getLogger(__name__)
SITE_USER_CANDIDATE_PAGE_SIZE = 1000
REPOSITORY_PAGE_SIZE = 1000
USER_HYDRATION_BATCH_SIZE = 25


@dataclass(frozen=True)
Expand Down Expand Up @@ -209,6 +210,50 @@ def get_user_by_id(
return cast(shared_types.User | None, data.get("node"))


def get_users_by_ids(
client: src.SourcegraphClient,
user_ids: Sequence[str],
*,
include_emails: bool = False,
include_account_data: bool = True,
parallelism: int = 1,
worker_pool: ThreadPoolExecutor | None = None,
progress_label: str | None = None,
) -> list[shared_types.User | None]:
"""Hydrate User nodes by GraphQL ID in aliased batches.

Returns one entry per requested ID, in order; `None` marks users that
no longer exist.
"""

def fetch_batch(batch: Sequence[str]) -> list[shared_types.User | None]:
data = client.graphql(
queries.users_by_ids_batch_query(
len(batch),
include_emails=include_emails,
include_account_data=include_account_data,
),
cast(src.JSONDict, {f"user{index}": user_id for index, user_id in enumerate(batch)}),
follow_pages=False,
)
users: list[shared_types.User | None] = []
for index in range(len(batch)):
node = src.json_dict(data.get(f"user{index}"))
users.append(cast(shared_types.User, node) if node.get("id") else None)
return users

users: list[shared_types.User | None] = []
for batch_users in run_context.parallel_map(
fetch_batch,
_batches(tuple(user_ids), USER_HYDRATION_BATCH_SIZE),
parallelism=parallelism,
worker_pool=worker_pool,
progress_label=progress_label,
):
users.extend(batch_users)
return users


def list_site_user_candidates(
client: src.SourcegraphClient,
created_after: str | None,
Expand Down
7 changes: 7 additions & 0 deletions tests/e2e/case_runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -233,6 +233,13 @@ def graphql(
return {"user": self._graphql_user_by_email(variable_values["email"])}
if "query UserByID" in query:
return {"node": self._graphql_user_by_id(variable_values["id"])}
if "query UsersByIDBatch" in query:
hydrated: dict[str, Any] = {}
index = 0
while f"user{index}" in variable_values:
hydrated[f"user{index}"] = self._graphql_user_by_id(variable_values[f"user{index}"])
index += 1
return hydrated
if "query SiteUsers" in query:
return {"site": {"users": self._site_users(variable_values)}}
if "query UserExplicitRepoExistsBatch" in query:
Expand Down
Loading