Increasing CR with each retransmission #10359

Open
NomDeTom wants to merge 28 commits into meshtastic:develop from NomDeTom:CRCRRCRRR
Conversation

Collaborator

@NomDeTom NomDeTom commented May 1, 2026

This adds dynamic coding-rate escalation for retransmissions, and introduces explicit tracking of retransmission-chain failures.

Net effect: retransmissions are now adaptive and observable, with clearer runtime logs plus a simple failure-trend metric you can compare before/after CR strategy changes.

Addresses #8114

Specifically requires #10120

🤝 Attestations

  • I have tested that my proposed changes behave as described.
  • I have tested that my proposed changes do not cause any obvious regressions on the following devices:
    • Heltec (Lora32) V3
    • LilyGo T-Deck
    • LilyGo T-Beam
    • RAK WisBlock 4631
    • Seeed Studio T-1000E tracker card
    • Other (please specify below)

NomDeTom and others added 20 commits April 10, 2026 01:00
…presets

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
…smission logic

Co-authored-by: Copilot <copilot@github.com>
…g rate restoration

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@github-actions github-actions Bot added needs-review Needs human review enhancement New feature or request labels May 1, 2026
Contributor

Komzpa commented May 3, 2026

I compared this with #10327 and the Meshtasticator modeling history.

First, current native CI seems to have a real compile break unrelated to the retry CR logic: test_admin_radio still uses the old positional RegionInfo initializers after defaultPreset was inserted before name, so strings like "TEST_US_SPACED" are being compiled where a ModemPreset is now expected.

Second, I think there is an airtime accounting bug in RadioLibInterface::completeSending(): restoreTemporaryCodingRateOverride() runs before getPacketTime(p), so TX airtime is calculated after restoring the base CR, not with the temporary retransmission CR that was actually used. That will undercount airtime/utilization for CR7/CR8 retries, exactly where the extra airtime matters.
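
The ordering issue can be sketched as follows. This is not the firmware code: `RadioState` and this `completeSending` are hypothetical stand-ins, and the airtime estimate uses the time-on-air formula from the SX127x datasheet with explicit header, CRC on, low-data-rate optimisation off, and a 16-symbol preamble, all of which are assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>

// Payload symbol count from the SX127x datasheet time-on-air formula.
// cr is the coding-rate denominator (5..8 for 4/5..4/8). Explicit header,
// CRC on, and low-data-rate optimisation off are assumed.
inline int payloadSymbols(int payloadLen, int sf, int cr)
{
    const double num = 8.0 * payloadLen - 4.0 * sf + 28.0 + 16.0;
    const int blocks = std::max(0, static_cast<int>(std::ceil(num / (4.0 * sf))));
    return 8 + blocks * cr;
}

// Time on air in milliseconds, including the preamble.
inline double airtimeMs(int payloadLen, int sf, double bwHz, int cr, int preambleSymbols = 16)
{
    const double symbolMs = std::ldexp(1.0, sf) / bwHz * 1000.0; // 2^sf / BW
    return (preambleSymbols + 4.25 + payloadSymbols(payloadLen, sf, cr)) * symbolMs;
}

// Hypothetical radio state: the configured CR and the CR of the packet on air.
struct RadioState {
    int sf;
    double bwHz;
    int baseCr;
    int activeCr;
};

// Account for airtime with the CR that was actually transmitted, and only
// then drop the temporary override.
inline double completeSending(RadioState &radio, int payloadLen)
{
    const double usedMs = airtimeMs(payloadLen, radio.sf, radio.bwHz, radio.activeCr);
    radio.activeCr = radio.baseCr; // restore after measuring, not before
    return usedMs;
}
```

With these assumptions, a 50-byte payload at SF11 and 250 kHz is 58 payload symbols at CR 4/5 and 88 at CR 4/8, roughly 641 ms versus 887 ms on air. Measuring after the restore would log the smaller figure for a CR8 retry, an undercount of about 38%.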

For context, there is another version of this idea in #10327 that I have been testing through Meshtasticator. The simulator-side history is in meshtastic/Meshtasticator#71 and the comparison/repro workflow is in meshtastic/Meshtasticator#74.

The main thing we found there is that “increase CR on retransmission” is not uniformly good. It helps when the loss looks like quiet/weak-link loss, but it can hurt when the loss is collision/congestion-driven because the longer airtime creates more collision pressure.
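
One rough way to see the collision-pressure effect is the textbook pure (unslotted) ALOHA model, used here purely as an illustrative assumption. It is not the Meshtasticator collision model, and it treats every packet as having the same airtime.

```cpp
#include <cmath>

// Probability that a packet of the given airtime overlaps no other packet,
// under a pure (unslotted) ALOHA model with Poisson arrivals. This textbook
// model is an assumption here; it is not the Meshtasticator collision model.
inline double collisionFreeProbability(double packetsPerSecond, double airtimeSeconds)
{
    // Vulnerable window is two airtimes: one before and one after the start.
    return std::exp(-2.0 * packetsPerSecond * airtimeSeconds);
}
```

In this model the collision-free probability falls exponentially with airtime, so a longer high-CR retry is more exposed at the same offered load. That is the trade-off against the extra coding gain on a weak link.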

A few variants that looked tempting performed badly enough that I would avoid baking them in as unconditional behavior:

  • CR6 on idle first attempts (idle6) was worse in dense stress: reach went 2.53% -> 2.22% versus the tuned DCR variant.
  • Broad CR6 bias could improve reach in lighter traffic, but still lost useful delivery versus static CR5.
  • Always using a CR6 floor for direct relays looked good in a 60s DM run, but fell apart in the dense 10s/period-2 guardrail: useful delivery dropped to 916 versus 1026 for the hybrid DCR variant and 1049 for static CR5.
  • CR8 needed to be rare/budgeted. The tuned 92-node Batumi run ended up mostly CR5/CR6, with no CR8 in that run, while still improving reach from 5.82% to 6.11% with TX airtime only moving 6.28% -> 6.36%.

So I think the risky part here is the unconditional policy shape: first retry base + 1, second and later retries CR8, regardless of channel pressure or whether the previous loss looked like congestion. The simulator-tested version ended up treating retransmission CR as a quiet-loss/final-retry/packet-context decision, with busy/congested clamps.
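
For reference, the unconditional shape described here can be written as a small function. This is a sketch of the policy as summarised in this thread, not the PR's code; the function name and the attempt numbering are assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Coding rates are written as the denominator of 4/x, so 5..8.
constexpr int kMaxCr = 8;

// retryAttempt: 0 for the first transmission, 1 for the first retry, and so on.
inline int unconditionalRetryCr(int baseCr, int retryAttempt)
{
    assert(baseCr >= 5 && baseCr <= kMaxCr && retryAttempt >= 0);
    if (retryAttempt == 0)
        return baseCr;
    if (retryAttempt == 1)
        return std::min(baseCr + 1, kMaxCr);
    return kMaxCr; // second and later retries jump straight to CR8
}
```

With a CR5 base this yields 5, 6, 8, 8 across four attempts. Nothing in the function looks at channel state, which is the concern raised above.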

The useful overlap I took from this PR is making the retry attempt explicit in PendingPacket instead of inferring it from sender role. I think that part is a good cleanup.
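
A minimal sketch of that cleanup, assuming a trimmed-down `PendingPacket` with hypothetical field names (the real struct carries more state and may name these differently):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down PendingPacket; field names are illustrative.
struct PendingPacket {
    uint8_t initialRetransmissions; // countdown value when the packet was queued
    uint8_t numRetransmissions;     // countdown value now
};

// 0 = no transmission consumed yet, 1 = one consumed, and so on.
inline int retryAttempt(const PendingPacket &p)
{
    assert(p.numRetransmissions <= p.initialRetransmissions);
    return p.initialRetransmissions - p.numRetransmissions;
}
```

Because the attempt number comes from the packet's own counters, the same calculation works for any sender role or starting countdown.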

Komzpa added a commit to Komzpa/firmware that referenced this pull request May 3, 2026
Store the initial retransmission countdown in PendingPacket so DCR can derive retry attempts from packet state instead of sender role assumptions.

Inspired-by: @NomDeTom in meshtastic#10359
Collaborator Author

NomDeTom commented May 3, 2026

> First, current native CI seems to have a real compile break unrelated to the retry CR logic: test_admin_radio still uses the old positional RegionInfo initializers after defaultPreset was inserted before name, so strings like "TEST_US_SPACED" are being compiled where a ModemPreset is now expected.

Possibly: that test break comes from the re-regionalisation changes, which are part of my preferred base for this PR. I'll have to go around again and check.

> Second, I think there is an airtime accounting bug in RadioLibInterface::completeSending(): restoreTemporaryCodingRateOverride() runs before getPacketTime(p), so TX airtime is calculated after restoring the base CR, not with the temporary retransmission CR that was actually used. That will undercount airtime/utilization for CR7/CR8 retries, exactly where the extra airtime matters.

This is a good catch: the need to respect the order of operations wasn't obvious to me, and my limited testing didn't pick that up.

> For context, there is another version of this idea in #10327 that I have been testing through Meshtasticator. The simulator-side history is in meshtastic/Meshtasticator#71 and the comparison/repro workflow is in meshtastic/Meshtasticator#74.
>
> The main thing we found there is that “increase CR on retransmission” is not uniformly good. It helps when the loss looks like quiet/weak-link loss, but it can hurt when the loss is collision/congestion-driven because the longer airtime creates more collision pressure.
>
> A few variants that looked tempting performed badly enough that I would avoid baking them in as unconditional behavior:
>
>   • CR6 on idle first attempts (idle6) was worse in dense stress: reach went 2.53% -> 2.22% versus the tuned DCR variant.
>   • Broad CR6 bias could improve reach in lighter traffic, but still lost useful delivery versus static CR5.
>   • Always using a CR6 floor for direct relays looked good in a 60s DM run, but fell apart in the dense 10s/period-2 guardrail: useful delivery dropped to 916 versus 1026 for the hybrid DCR variant and 1049 for static CR5.
>   • CR8 needed to be rare/budgeted. The tuned 92-node Batumi run ended up mostly CR5/CR6, with no CR8 in that run, while still improving reach from 5.82% to 6.11% with TX airtime only moving 6.28% -> 6.36%.

This change hasn't been extensively tested with packet success ratios, but I think it's important to add something to make subsequent retries more effective. How are you determining the difference on-device between link-related failure and collision-related failure?

Edit: changed some words so I make sense.

NomDeTom and others added 2 commits May 3, 2026 20:32
Co-authored-by: Copilot <copilot@github.com>
…tests

Co-authored-by: Copilot <copilot@github.com>
Collaborator Author

NomDeTom commented May 6, 2026

@Komzpa I'm considering rewriting this to give it some "dwell time" at the higher CR, if it improves packet ack rates. This would go something like: raise the CR each retry and remember it for future packets, and then reduce it by 1 if subsequently 5/5 packets are acked first time.

I could put a fall-back behaviour if no packets are acked - falling back to 5, 7, 8 for the 3 retries might be better if the node isn't well connected at all.
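
The dwell-time idea could be sketched as a small state holder. This is one possible reading of the proposal, not an implementation from the PR: the class and method names are hypothetical, and the "no packets are acked" fallback sequence is left out.

```cpp
#include <algorithm>
#include <cassert>

class DwellCodingRate
{
  public:
    explicit DwellCodingRate(int baseCr) : baseCr_(baseCr), currentCr_(baseCr)
    {
        assert(baseCr >= 5 && baseCr <= 8);
    }

    // CR to use for the next new packet.
    int current() const { return currentCr_; }

    // A retry was needed: step up and remember it for later packets.
    void onRetry()
    {
        currentCr_ = std::min(currentCr_ + 1, 8);
        firstTryAcks_ = 0;
    }

    // A packet was ACKed on its first transmission.
    void onFirstTryAck()
    {
        if (currentCr_ == baseCr_)
            return; // already at the preset, nothing to decay
        if (++firstTryAcks_ >= kAcksToStepDown) {
            --currentCr_;
            firstTryAcks_ = 0;
        }
    }

  private:
    static constexpr int kAcksToStepDown = 5; // the "5/5" from the proposal
    int baseCr_;
    int currentCr_;
    int firstTryAcks_ = 0;
};
```

Any retry resets the streak, so the step down only happens after five consecutive first-try ACKs, and the rate never drops below the preset.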

Does that sound better? Closer to your testing?

Contributor

Komzpa commented May 6, 2026

I reran this against a rebased Meshtasticator stack, including the current capture-aware collision / empirical PHY-loss model and a policy mode that matches the current PR behavior (base + 1 on first retry, CR8 on second+ retry).

Current PR head tested: 47adeef07a8fde70cc375bd2f4c05a6121c8446b.

One CI note first: native PlatformIO tests are still red, but this looks like the regionalization test compile break rather than this CR policy directly. test_admin_radio still has old positional RegionInfo initializers after defaultPreset was inserted before name, so strings such as "TEST_US_SPACED" are being compiled where a ModemPreset is now expected. The native simulator job is green.

Method

I compared four policies on two scenarios, with the same simulator physics flags in every run:

  • static: baseline static preset CR
  • firmware10359: current PR retry policy
  • dcr: context-aware retry CR policy from the Meshtasticator experiments
  • dcr_dtp: same DCR plus dynamic TX power reduction for strong relay/control packets

Common run shape:

  • --simtime-seconds 60
  • --period-seconds 5
  • --phy-loss-model
  • --capture-collision-model

Scenarios:

  • batumi: 92-node real-mesh preset with terrain, clutter, and link calibration.
  • burning_man: generated Black Rock City-style preset with 123 nodes and 3 elevated routers. I intentionally ran it through normal loraMesh.py --preset burning_man; it is no longer a separate simulator with custom radio physics. The event-specific inputs are geometry, clutter raster, per-node TX power, and generic per-node enclosure loss.

Results

Batumi

| policy | reach % | useful % | TX airtime % | sent | collisions | PHY loss | CR5/6/7/8 % |
|---|---|---|---|---|---|---|---|
| static | 5.57 | 48.12 | 5.95 | 867 | 1646 | 6099 | 100/0/0/0 |
| firmware10359 | 5.55 | 47.33 | 6.16 | 884 | 1669 | 6117 | 83/10/0/6 |
| dcr | 5.96 | 47.60 | 6.15 | 907 | 1739 | 6202 | 83/13/4/0 |
| dcr_dtp | 6.04 | 49.47 | 6.27 | 918 | 1715 | 6292 | 83/13/4/0 |

Delta vs static:

| policy | reach pp | useful pp | TX airtime pp | collisions |
|---|---|---|---|---|
| firmware10359 | -0.02 | -0.79 | +0.21 | +23 |
| dcr | +0.39 | -0.53 | +0.20 | +93 |
| dcr_dtp | +0.47 | +1.34 | +0.32 | +69 |

Burning Man

| policy | reach % | useful % | TX airtime % | sent | collisions | PHY loss | CR5/6/7/8 % |
|---|---|---|---|---|---|---|---|
| static | 7.60 | 59.00 | 3.91 | 1900 | 4448 | 623 | 100/0/0/0 |
| firmware10359 | 7.42 | 59.87 | 4.05 | 1872 | 4632 | 574 | 85/12/0/3 |
| dcr | 7.23 | 56.71 | 4.00 | 1822 | 4230 | 574 | 90/10/0/0 |
| dcr_dtp | 7.47 | 59.70 | 4.06 | 1896 | 4103 | 598 | 91/9/0/0 |

Delta vs static:

| policy | reach pp | useful pp | TX airtime pp | collisions |
|---|---|---|---|---|
| firmware10359 | -0.19 | +0.88 | +0.14 | +184 |
| dcr | -0.37 | -2.29 | +0.09 | -218 |
| dcr_dtp | -0.13 | +0.71 | +0.15 | -345 |

Quick visual

Reach %
Batumi       static 5.57 | fw10359 5.55 | dcr 5.96 | dcr+dtp 6.04
Burning Man  static 7.60 | fw10359 7.42 | dcr 7.23 | dcr+dtp 7.47

Useful %
Batumi       static 48.12 | fw10359 47.33 | dcr 47.60 | dcr+dtp 49.47
Burning Man  static 59.00 | fw10359 59.87 | dcr 56.71 | dcr+dtp 59.70

TX airtime %
Batumi       static 5.95 | fw10359 6.16 | dcr 6.15 | dcr+dtp 6.27
Burning Man  static 3.91 | fw10359 4.05 | dcr 4.00 | dcr+dtp 4.06

Readout

This does not convince me that the current unconditional escalation is ready as-is.

In Batumi, firmware10359 spends more airtime and gets slightly worse reach/useful delivery than static CR. In Burning Man, it improves useful delivery a bit, but reach drops and collision count rises. The CR8 share is small in these runs, but the unconditional direction is still visible: the current policy can spend extra airtime without a clear delivery win.

The dwell-time idea you mentioned sounds closer to the right shape than this PR, but I would still want guards around it:

  • Do not treat every missed ACK as weak-link loss. In a busy mesh, missed ACK can mean collision/congestion.
  • Avoid raising CR while local channel utilization / queue pressure is high.
  • Make CR8 rare and budgeted, preferably final-retry or strong weak-link evidence only.
  • If you keep per-peer dwell state, decay it down after first-try ACK success, but also clamp or reset it when the channel looks busy.
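
The guards above could combine along these lines. This is a hedged sketch only: `RetryContext`, the `weakLinkEvidence` flag, and the 25% utilization threshold are hypothetical placeholders rather than tuned values or firmware names, and neither the CR8 budget nor per-peer dwell state is modelled.

```cpp
#include <algorithm>
#include <cassert>

struct RetryContext {
    int baseCr;            // preset coding rate, 5..8
    int retryAttempt;      // 0 = first transmission
    int maxRetryAttempt;   // index of the final retry
    float channelUtilPct;  // local channel utilization estimate
    bool weakLinkEvidence; // hypothetical signal, e.g. low SNR from this peer
};

// Placeholder threshold, not a tuned value.
constexpr float kBusyUtilPct = 25.0f;

inline int guardedRetryCr(const RetryContext &c)
{
    assert(c.baseCr >= 5 && c.baseCr <= 8);
    // First transmission, or a busy channel: a missed ACK may be a collision,
    // so do not add airtime.
    if (c.retryAttempt == 0 || c.channelUtilPct >= kBusyUtilPct)
        return c.baseCr;
    // CR8 only on the final retry or with weak-link evidence.
    if (c.retryAttempt >= c.maxRetryAttempt || c.weakLinkEvidence)
        return 8;
    // Otherwise a single step up, capped at CR7 unless the preset is higher.
    return std::max(c.baseCr, std::min(c.baseCr + 1, 7));
}
```

The busy-channel check comes first on purpose: when utilization is high, even a final retry with weak-link evidence stays at the base CR.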

For your fallback idea: 5, 7, 8 for retries when no packets are ACKing may be reasonable as a disconnected/weak-link rescue path, but I would not use it as the normal retry path without congestion checks.

Komzpa added a commit to Komzpa/firmware that referenced this pull request May 14, 2026
Store the initial retransmission countdown in PendingPacket so DCR can derive retry attempts from packet state instead of sender role assumptions.

Inspired-by: @NomDeTom in meshtastic#10359
Komzpa added a commit to Komzpa/firmware that referenced this pull request May 24, 2026
Store the initial retransmission countdown in PendingPacket so DCR can derive retry attempts from packet state instead of sender role assumptions.

Inspired-by: @NomDeTom in meshtastic#10359