Part of: #408
Story: Scheduler Leader Election with pg_advisory_lock
[Conversation Reference: "I think it's cleaner that one server takes over scheduling work" and "we will have to have a system to prevent duplicate schedules"]
Story Overview
Objective: Implement leader election using PostgreSQL's pg_advisory_lock so that exactly one cluster node runs the scheduler (RefreshScheduler, DataRetentionScheduler, and other periodic tasks). The advisory lock is held on a dedicated PostgreSQL connection -- if the connection drops (node crash), the lock is automatically released and another node can acquire it.
User Value: Periodic tasks (repo refresh, data retention cleanup, etc.) run exactly once across the cluster, not once per node. No duplicate scheduling, no missed schedules.
Acceptance Criteria
AC1: Advisory Lock Acquisition for Leadership
Scenario: One node acquires the scheduler leadership lock.
Given multiple CIDX nodes are starting up
When each node attempts to acquire pg_advisory_lock(lock_id)
Then exactly one node succeeds (returns true)
And the winning node starts all scheduler services
And losing nodes do NOT start scheduler services
And losing nodes continue serving queries normally
Technical Requirements:
- `LeaderElectionService` in `src/code_indexer/server/services/leader_election_service.py`
- Acquire `pg_advisory_lock(lock_id)` on a DEDICATED connection (not from the pool)
- Lock ID: `0x434944585F4C4452` ("CIDX_LDR" in hex)
- Use `pg_try_advisory_lock(lock_id)` to check availability (non-blocking)
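As an illustration of the acquire semantics (not the real implementation), a `threading.Lock` acquired with `blocking=False` mirrors `pg_try_advisory_lock`: exactly one caller wins, and every other caller gets `False` immediately instead of blocking:

```python
import threading

# Stand-in for the advisory lock; acquire(blocking=False) mirrors the
# non-blocking, exactly-one-winner behavior of pg_try_advisory_lock.
scheduler_lock = threading.Lock()
results: dict = {}

def try_acquire_leadership(node: str) -> None:
    results[node] = scheduler_lock.acquire(blocking=False)

threads = [
    threading.Thread(target=try_acquire_leadership, args=(node,))
    for node in ("node-a", "node-b", "node-c")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly one node becomes leader; the others keep serving queries.
winners = [node for node, won in results.items() if won]
print(f"{len(winners)} leader elected out of {len(results)} nodes")
```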
AC2: Scheduler Services Gated by Leadership
Scenario: Only the leader node runs periodic schedulers.
Given Node A is the leader
When scheduler startup runs
Then Node A starts: RefreshScheduler, DataRetentionScheduler, and other periodic services
And Node B (follower) does NOT start these schedulers
And Node B's startup log says: "Not leader, skipping scheduler startup"
And Node A's startup log says: "Acquired leadership, starting schedulers"
Technical Requirements:
- `LeaderElectionService.is_leader()` returns bool
- Check `is_leader()` before starting each scheduler
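A minimal gating sketch under the requirement above. The fake election and scheduler classes are illustrative stand-ins; only `is_leader()` and the two log lines come from this story:

```python
import logging

logger = logging.getLogger("cidx.startup")

def start_schedulers(election, schedulers) -> bool:
    """Start periodic schedulers only if this node holds leadership."""
    if not election.is_leader():
        logger.info("Not leader, skipping scheduler startup")
        return False
    logger.info("Acquired leadership, starting schedulers")
    for scheduler in schedulers:
        scheduler.start()
    return True

# Illustrative stand-ins for LeaderElectionService and RefreshScheduler.
class FakeElection:
    def __init__(self, leader: bool) -> None:
        self._leader = leader
    def is_leader(self) -> bool:
        return self._leader

class FakeScheduler:
    def __init__(self) -> None:
        self.started = False
    def start(self) -> None:
        self.started = True

leader_sched, follower_sched = FakeScheduler(), FakeScheduler()
start_schedulers(FakeElection(True), [leader_sched])
start_schedulers(FakeElection(False), [follower_sched])
print(leader_sched.started, follower_sched.started)
```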
AC3: Dedicated Connection for Lock Holding
Scenario: The advisory lock is held on a connection that stays open.
Given the leader node holds pg_advisory_lock on a dedicated connection
When the connection remains open
Then the lock is held continuously
And the lock is NOT released between operations (session-level lock)
And the connection is separate from the connection pool used by backends
Technical Requirements:
- Use `pg_advisory_lock` (session-level, not transaction-level)
- Hold the lock on a dedicated connection, separate from the backend connection pool
AC4: Automatic Lock Release on Node Failure
Scenario: If the leader crashes, the lock is automatically released.
Given Node A is the leader holding the advisory lock
When Node A's process crashes (kill -9, OOM, etc.)
Then the PostgreSQL connection is dropped
And PostgreSQL automatically releases the session-level advisory lock
And the lock becomes available for other nodes to acquire
Technical Requirements:
- TCP keepalives on the lock connection: `keepalives=1, keepalives_idle=10, keepalives_interval=5, keepalives_count=3`
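These map one-to-one onto libpq connection parameters, which psycopg forwards as keyword arguments. A sketch of the settings and the failure-detection window they imply (the connection URL in the comment is a placeholder):

```python
# libpq TCP keepalive settings from AC4. With these values a silently
# dead leader is detected after roughly 10s idle plus 3 probes at 5s
# apart, after which PostgreSQL drops the session and frees the lock.
LOCK_CONN_KWARGS = {
    "keepalives": 1,           # enable TCP keepalives
    "keepalives_idle": 10,     # idle seconds before the first probe
    "keepalives_interval": 5,  # seconds between unanswered probes
    "keepalives_count": 3,     # failed probes before declaring death
}

# Usage sketch (URL is hypothetical):
# conn = psycopg.connect("postgresql://cidx@db/cidx", autocommit=True,
#                        **LOCK_CONN_KWARGS)

detection_seconds = (
    LOCK_CONN_KWARGS["keepalives_idle"]
    + LOCK_CONN_KWARGS["keepalives_interval"] * LOCK_CONN_KWARGS["keepalives_count"]
)
print(detection_seconds)
```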
Implementation Status
Technical Implementation Details
File Structure
src/code_indexer/server/services/
leader_election_service.py # LeaderElectionService
Advisory Lock Pattern
```python
from typing import Optional

import psycopg


class LeaderElectionService:
    # 64-bit advisory lock key: the bytes "CIDX_LDR" as hex.
    LOCK_ID = 0x434944585F4C4452

    def __init__(self, postgres_url: str):
        self._postgres_url = postgres_url
        self._lock_conn: Optional[psycopg.Connection] = None
        self._is_leader = False

    def try_acquire_leadership(self) -> bool:
        """Attempt to become the leader. Non-blocking."""
        self._lock_conn = psycopg.connect(
            self._postgres_url,
            autocommit=True,
            options="-c statement_timeout=5000",
            # TCP keepalives so a crashed leader's dead connection is
            # detected and its session-level lock released (see AC4).
            keepalives=1,
            keepalives_idle=10,
            keepalives_interval=5,
            keepalives_count=3,
        )
        row = self._lock_conn.execute(
            "SELECT pg_try_advisory_lock(%s)", (self.LOCK_ID,)
        ).fetchone()
        self._is_leader = bool(row[0])  # True if this session won the lock
        if not self._is_leader:
            self._lock_conn.close()
            self._lock_conn = None
        return self._is_leader

    def is_leader(self) -> bool:
        return self._is_leader

    def release(self) -> None:
        """Graceful release on shutdown: closing the session releases the
        session-level advisory lock."""
        if self._lock_conn:
            self._lock_conn.close()
            self._lock_conn = None
        self._is_leader = False
```
Startup Integration
1. Create connection pool (shared backends)
2. Run migrations
3. Validate NFS mount
4. Try acquire leadership <-- THIS STORY
5. If leader: start schedulers
6. If not leader: skip schedulers, log info
7. Start query serving (all nodes)
8. Start heartbeat service (all nodes)
9. Start reconciliation service (all nodes)
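Steps 4-9 above can be sketched as a single startup function; everything except the leadership check is represented by placeholder service objects (names are illustrative):

```python
def node_startup(election, leader_only_services, all_node_services) -> bool:
    """Sketch of startup steps 4-9: leadership gates only the schedulers;
    query serving, heartbeat, and reconciliation start on every node."""
    is_leader = election.try_acquire_leadership()  # step 4, non-blocking
    if is_leader:
        for svc in leader_only_services:           # step 5: schedulers
            svc.start()
    # step 6 on followers: skip schedulers and fall through
    for svc in all_node_services:                  # steps 7-9: every node
        svc.start()
    return is_leader

# Placeholder stand-ins to show the follower path.
class FakeElection:
    def __init__(self, wins: bool) -> None:
        self._wins = wins
    def try_acquire_leadership(self) -> bool:
        return self._wins

class Svc:
    def __init__(self) -> None:
        self.started = False
    def start(self) -> None:
        self.started = True

sched, query = Svc(), Svc()
node_startup(FakeElection(False), [sched], [query])
print(sched.started, query.started)
```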
Leader-Only vs All-Node Services
| Service | Leader Only | All Nodes |
| --- | --- | --- |
| RefreshScheduler | YES | - |
| DataRetentionScheduler | YES | - |
| MetaDescriptionHook timer | YES | - |
| DependencyMapScheduler | YES | - |
| Query serving | - | YES |
| Job claiming/execution | - | YES |
| Heartbeat | - | YES |
| Reconciliation | - | YES |
| Health endpoint | - | YES |
Testing Requirements
- Automated: Single node acquires leadership successfully.
- Automated: Second node fails to acquire leadership (non-blocking, returns false).
- Automated: Leader release makes lock available again.
- Automated: Scheduler services only start on leader node.
- Manual E2E: Start 2 nodes, verify only one runs schedulers (check logs), kill leader, verify follower acquires leadership (Story 16).
Definition of Done