Stop RPC materializer retries when ACL token is not found#23606

Open
melkor217 wants to merge 1 commit into
hashicorp:mainfrom
melkor217:cursor/stop-rpc-materializer-acl-not-found-retries

@melkor217 melkor217 commented May 25, 2026

Summary

  • Treat ACL not found as a terminal error in RPCMaterializer, matching the existing LocalMaterializer and agent/cache behavior
  • After a workload token is deleted, stop client-agent streaming health subscriptions from retrying; previously they kept retrying until idle eviction (~20 minutes)
  • Notify blocked waiters once before exiting so in-flight queries can complete

Fixes repeated agent.rpcclient.health: subscribe call failed errors reported in #22515 (Nomad workload identity / consul-template).

Test plan

  • go test ./agent/submatview/... -count=1
  • Deploy patched Consul client agent on a Nomad client node
  • Run a job with {{ range service "..." }} template, stop the allocation, confirm no ongoing subscribe call failed / ACL not found errors in Consul logs

Made with Cursor

Treat ACL not found as a terminal subscribe error on client agents so
streaming health subscriptions exit instead of retrying until idle eviction.

Co-authored-by: Cursor <cursoragent@cursor.com>

hashicorp-cla-app Bot commented May 25, 2026

CLA assistant check
All committers have signed the CLA.

@melkor217
Author

Before the fix (Consul 1.22.5):

# journalctl -u consul | grep agent.rpcclient.health | tail -n 3
May 25 09:33:04 hashicorp-test-app02.int.tsum.com consul[1830819]: {"@level":"error","@message":"subscribe call failed","@module":"agent.rpcclient.health","@timestamp":"2026-05-25T09:33:04.140330+03:00","err":"rpc error: code = Unknown desc = ACL not found","error":"rpc error: code = Unknown desc = ACL not found","failure_count":9,"key":"server-in","topic":2}
May 25 09:34:03 hashicorp-test-app02.int.tsum.com consul[1830819]: {"@level":"error","@message":"subscribe call failed","@module":"agent.rpcclient.health","@timestamp":"2026-05-25T09:34:03.209473+03:00","err":"rpc error: code = Unknown desc = ACL not found","error":"rpc error: code = Unknown desc = ACL not found","failure_count":10,"key":"server-in","topic":2}
May 25 09:35:34 hashicorp-test-app02.int.tsum.com consul[1830819]: {"@level":"error","@message":"subscribe call failed","@module":"agent.rpcclient.health","@timestamp":"2026-05-25T09:35:34.466510+03:00","err":"rpc error: code = Unknown desc = ACL not found","error":"rpc error: code = Unknown desc = ACL not found","failure_count":11,"key":"server-in","topic":2}
{
  "@level": "error",
  "@message": "subscribe call failed",
  "@module": "agent.rpcclient.health",
  "@timestamp": "2026-05-25T09:35:34.466510+03:00",
  "err": "rpc error: code = Unknown desc = ACL not found",
  "error": "rpc error: code = Unknown desc = ACL not found",
  "failure_count": 11,
  "key": "server-in",
  "topic": 2
}

@melkor217
Author

After the fix:

# journalctl -u consul | grep agent.rpcclient.health | tail -n 3
May 25 09:42:28 hashicorp-test-app02.int.tsum.com consul[1830819]: {"@level":"error","@message":"subscribe call failed","@module":"agent.rpcclient.health","@timestamp":"2026-05-25T09:42:28.860266+03:00","err":"rpc error: code = Canceled desc = grpc: the client connection is closing","error":"rpc error: code = Canceled desc = grpc: the client connection is closing","failure_count":4,"key":"server-in","topic":2}
May 25 09:42:28 hashicorp-test-app02.int.tsum.com consul[1830819]: {"@level":"error","@message":"subscribe call failed","@module":"agent.rpcclient.health","@timestamp":"2026-05-25T09:42:28.973672+03:00","err":"rpc error: code = Canceled desc = grpc: the client connection is closing","error":"rpc error: code = Canceled desc = grpc: the client connection is closing","failure_count":4,"key":"server-in","topic":2}
May 25 09:45:21 hashicorp-test-app02.int.tsum.com consul[4058732]: {"@level":"error","@message":"subscribe call failed","@module":"agent.rpcclient.health","@timestamp":"2026-05-25T09:45:21.531086+03:00","err":"rpc error: code = Unknown desc = ACL not found","error":"rpc error: code = Unknown desc = ACL not found","failure_count":1,"key":"server-in","topic":2}

Only the first ACL not found error appears, with failure_count=1; it is not retried.

@melkor217 melkor217 marked this pull request as ready for review May 25, 2026 07:11
@melkor217 melkor217 requested review from a team as code owners May 25, 2026 07:11