Skip to content

collect metrics on relay success and latency#67

Merged
dhfang merged 16 commits into
mainfrom
df/relay-metrics
Apr 24, 2026
Merged

collect metrics on relay success and latency#67
dhfang merged 16 commits into
mainfrom
df/relay-metrics

Conversation

@dhfang
Copy link
Copy Markdown
Contributor

@dhfang dhfang commented Apr 17, 2026

Instrument relayer client to collect metrics on relay success/failure counts and relay request latency.

johnnylarner and others added 13 commits March 13, 2026 11:01
Remove the aggregate Err field from SentTx in both cosmos and ethereum
chains. Rename SourceErr to BroadcastErr and RelayerErr to
PostBroadcastErr to align with the txMode abstraction. Add Failed() and
Error() methods that derive overall status from the specific error
fields and TxResponse.
…elay config

- add MsgIFTTransfer as a case in both ethereum and cosmos TxFactory
  BuildTxs/CreateMsg, with baseline support for high-throughput runs
- replace implicit HandlePostBroadcast with explicit RelayConfig that
  specifies a relayer URL and which message types should be relayed
- remove txMode interface and mode.go from both chains
- move relayer config from ift.relayer to top-level relay in spec
- use SuggestGasTipCap instead of hardcoded 2 gwei default for EVM
  gas fee estimation
- fix EVM IFT timeout to use Unix seconds instead of nanoseconds
@dhfang dhfang requested a review from a team as a code owner April 17, 2026 22:23
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR instruments the relayer gRPC client with three Prometheus metrics: a success counter, a failure counter (both labeled by chain_id), and a per-attempt duration histogram. The Cosmos runner correctly gates relay metric creation behind spec.MetricsEnabled; the Ethereum runner does not, unconditionally calling iftrelayer.NewMetrics() and registering metrics in the global Prometheus registry regardless of the operator's preference.

  • Ethereum runner ignores MetricsEnabled: iftrelayer.NewMetrics() is called unconditionally at line 181 of chains/ethereum/runner/runner.go, unlike the Cosmos runner which guards it with if spec.MetricsEnabled. This will also panic if two Ethereum runners with relay enabled are created in the same process (e.g., in integration tests), because prometheus.MustRegister panics on duplicate registration.

Confidence Score: 4/5

Safe to merge after fixing the unconditional metric registration in the Ethereum runner.

One P1 finding: the Ethereum runner bypasses the MetricsEnabled guard and unconditionally registers relay metrics, inconsistent with the Cosmos runner and liable to panic if two relay-enabled Ethereum runners are created in the same process.

chains/ethereum/runner/runner.go — unconditional iftrelayer.NewMetrics() call needs MetricsEnabled guard

Important Files Changed

Filename Overview
ift/relayer/metrics.go New file defining Prometheus Success/Failure counters and Duration histogram for relay operations; uses MustRegister on global registry which panics on duplicate registration.
ift/relayer/client.go Wires metrics into SubmitTxHash: Duration recorded per attempt, Success on first successful relay, Failure on context cancellation and after all retries exhausted — logic is correct.
chains/cosmos/runner/runner.go Correctly guards relay metrics creation behind spec.MetricsEnabled before passing to NewGRPCClient.
chains/ethereum/runner/runner.go Unconditionally calls iftrelayer.NewMetrics() regardless of spec.MetricsEnabled, inconsistent with the Cosmos runner and will panic on duplicate registration in the same process.

Sequence Diagram

sequenceDiagram
    participant Runner
    participant GRPCClient
    participant PrometheusRegistry
    participant RelayerService

    Runner->>GRPCClient: NewGRPCClient(cfg, chainID, metrics)
    GRPCClient-->>Runner: *GRPCClient

    loop up to maxRelayRetries (15)
        Runner->>GRPCClient: SubmitTxHash(ctx, txHash)
        GRPCClient->>RelayerService: Relay(callCtx, RelayRequest)
        GRPCClient->>PrometheusRegistry: Duration.Observe(elapsed)
        alt success
            GRPCClient->>PrometheusRegistry: Success.Inc()
            GRPCClient-->>Runner: nil
        else error
            note over GRPCClient: store lastErr, retry after 3s
        end
        alt ctx cancelled during retry delay
            GRPCClient->>PrometheusRegistry: Failure.Inc()
            GRPCClient-->>Runner: ctx.Err()
        end
    end
    GRPCClient->>PrometheusRegistry: Failure.Inc()
    GRPCClient-->>Runner: error (all retries exhausted)
Loading

Comments Outside Diff (1)

  1. chains/ethereum/runner/runner.go, line 181-186 (link)

    P1 Relay metrics ignore MetricsEnabled flag

    iftrelayer.NewMetrics() is called unconditionally here, while the Cosmos runner correctly guards the same call with if spec.MetricsEnabled. This has two consequences: relay metrics are always registered to the global Prometheus registry even when the operator disables metrics, and if two Ethereum runners are constructed in the same process with spec.Relay != nil (e.g., in parallel tests), prometheus.MustRegister will panic on the second registration.

Reviews (1): Last reviewed commit: "collect metrics on relay success and lat..." | Re-trigger Greptile

Comment thread ift/relayer/metrics.go
Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10},
}, []string{"chain_id"}),
}
prometheus.MustRegister(m.Success, m.Failure, m.Duration)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 MustRegister panics on duplicate registration

prometheus.MustRegister panics if the same metric name is registered twice in the default global registry. If NewMetrics() is called more than once in the same process — for example, in integration tests that instantiate a runner for each test case — the second call will panic. Consider using prometheus.Register with an already-registered check, or accepting a prometheus.Registerer as a parameter so callers can use isolated registries for tests.

@dhfang dhfang force-pushed the df/relay-metrics branch from 4e14063 to 3f2c7c9 Compare April 23, 2026 18:41
@dhfang dhfang changed the base branch from plat-374/extended-ift-transfer-integration-test to df/ift-relay-support April 23, 2026 18:44
Base automatically changed from df/ift-relay-support to main April 24, 2026 21:28
# Conflicts:
#	chains/cosmos/runner/runner.go
#	chains/ethereum/ift/contract.go
#	chains/ethereum/runner/runner.go
#	ift/relayer/client.go
@dhfang dhfang merged commit 8b5382e into main Apr 24, 2026
3 checks passed
@dhfang dhfang deleted the df/relay-metrics branch April 24, 2026 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants