Summary
Please backport upstream commit f91c7bb ("sdap: eliminate O(N²) loop in
sdap_add_incomplete_groups()") to the sssd-2-9-4 branch. This fix addresses a
significant performance regression visible on RHEL 8 deployments running sssd 2.9.4
in environments with large numbers of LDAP groups.
I believe this is a major contributing factor to the WATCHDOG death spiral that causes
denial of service for users once the cache reaches a certain size. New writes to the cache
on login cause long delays, which trigger the WATCHDOG to terminate the process. The
process then relaunches, the cycle repeats, and the user is never able to log in.
Problem
sdap_add_incomplete_groups() uses a two-pass design:
- Walk sysdb_groupnames[], check each against sysdb, build a missing[] list.
- For each missing group, scan the entire ldap_groups[] array again to find matching LDAP attributes.
When the cache is cold (e.g., after a service restart or cache flush), every group is
missing, so the inner scan runs N times over an N-element array — O(N²) complexity.
In production environments with hundreds or thousands of groups per user this causes
id / getent lookups to stall noticeably and can cause timeouts that propagate to
PAM/SSH authentication.
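To make the cost concrete, here is a minimal toy model of the second pass, not the sssd code itself. The `toy_ldap_group` struct and `two_pass_comparisons()` are hypothetical names introduced only for illustration, and the early `break` on a match is an assumption about the loop shape. The function counts how many name comparisons the rescan performs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an LDAP group entry; NOT the sssd struct. */
struct toy_ldap_group {
    const char *name;
};

/*
 * Toy model of the second pass: for every name in missing[], rescan
 * ldap_groups[] from the start until a matching entry is found.
 * Returns the total number of strcmp() calls performed.
 */
size_t two_pass_comparisons(const char *const *missing, size_t n_missing,
                            const struct toy_ldap_group *ldap_groups,
                            size_t n_groups)
{
    size_t comparisons = 0;

    for (size_t i = 0; i < n_missing; i++) {
        for (size_t j = 0; j < n_groups; j++) {
            comparisons++;
            if (strcmp(missing[i], ldap_groups[j].name) == 0) {
                break; /* assumed early exit once the entry is found */
            }
        }
    }
    return comparisons;
}
```

With a cold cache, every one of the N groups is missing and each is found at a distinct position, so even with the early exit the count is N(N+1)/2: 10 comparisons for 4 groups, and about 50 million for 10,000 groups. In this model the count grows quadratically with the number of groups.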
The upstream commit message also notes a secondary effect: the repeated
sdap_get_group_primary_name() calls inside the inner loop caused excess talloc
allocations that thrashed ldb/tdb cache pages, slowing the subsequent
sysdb_update_members() call in sdap_initgr_common_store() as well.
Upstream fix
Commit f91c7bb (merged 2026-02-16, authored by Alexey Tikhonov, reviewed by
Justin Stephenson and Sumit Bose) replaces the two-pass design with a single O(N)
loop that iterates ldap_groups[] directly: for each entry, check sysdb and, if
missing, create the incomplete entry immediately. The sysdb_groupnames parameter is
removed as it is no longer needed.
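The single-pass shape described above can be sketched as follows. This is a toy model under stated assumptions, not the upstream patch: `toy_entry`, its `in_cache` flag (standing in for the result of a sysdb lookup) and `add_incomplete_single_pass()` are hypothetical names, and setting the flag stands in for creating the incomplete entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an LDAP group entry; NOT the sssd struct. */
struct toy_entry {
    const char *name;
    bool in_cache; /* models the outcome of a sysdb lookup for this group */
};

/*
 * Toy single-pass loop: visit each LDAP entry exactly once and, if it is
 * not cached yet, "create" it on the spot. Returns how many entries were
 * created; *visits (if non-NULL) is incremented once per entry visited.
 */
size_t add_incomplete_single_pass(struct toy_entry *ldap_groups,
                                  size_t n_groups, size_t *visits)
{
    size_t created = 0;

    for (size_t i = 0; i < n_groups; i++) {
        if (visits != NULL) {
            (*visits)++;
        }
        if (!ldap_groups[i].in_cache) {
            /* Stands in for creating the incomplete group entry. */
            ldap_groups[i].in_cache = true;
            created++;
        }
    }
    return created;
}
```

Each entry is visited once regardless of how many groups are missing, so the loop itself is linear in the number of LDAP groups. No separate list of missing names has to be built or matched back against the array.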
Files changed:
| File | Change |
| --- | --- |
| src/providers/ldap/sdap_async_groups.c | Remove sysdb_groupnamelist call-site |
| src/providers/ldap/sdap_async_initgroups.c | Restructure main loop |
| src/providers/ldap/sdap_async_private.h | Drop sysdb_groupnames from prototype |
Why sssd-2-9-4 specifically
RHEL 8 ships sssd 2.9.4 and will not receive newer minor versions. Without a
backport to the sssd-2-9-4 branch, this fix cannot reach RHEL 8 users through a
normal errata update. The fix is a pure algorithmic improvement with no new
dependencies or behavior changes — it is a low-risk candidate for backport.
Proof-of-concept backport
I have already applied this patch to sssd-2-9-4 locally. The cherry-pick required
a manual conflict resolution in sdap_async_initgroups.c (the 2-9-4 branch carries a
slightly different structure in the if (use_id_mapping) block), but the resolved
result is semantically identical to the upstream version. I'm working on testing it on my
RHEL 8 test system, running the same 10k lookups and measuring performance. If the team
prefers a different backport, I'm happy to test that instead.
References
- Upstream commit: f91c7bbc38e41eeb31f2132acc7263bd4ac9d47c
- Target branch: sssd-2-9-4