edgeHub doesn't delete expired messages with RefCount > 0 #7509

@AnSu11NTUITY

Description

Expected Behavior

If a custom module A is down (stopped, failed, hung, or idle) while another custom module B keeps sending messages to it, edgeHub should delete those messages from its storage once their TTL has expired, so that the disk does not fill up.

Current Behavior

Even after their TTL has expired, messages remain in the edgeHub storage.

Steps to Reproduce

  1. Have one module publish a large number of messages to another module.
  2. Make the receiving module crash a few seconds after every startup (after 20 retries, edgeAgent no longer restarts the module).
  3. Check the size of the edgeHub storage directory.
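
For step 3, a small script makes it easier to watch the storage grow over time. This is only a sketch: the helper name `directory_size_bytes` is made up for illustration, and the path is the one reported in the Logs section of this issue, so adjust it if your edgeHub storage is mounted elsewhere.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files below `path` (0 if it does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                # Skip symlinks; ignore files that disappear while walking,
                # since the database may rewrite its files in the background.
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                pass
    return total


if __name__ == "__main__":
    # Assumption: storage path as reported in this issue.
    storage = "/var/lib/aziot/storage/edgeHub/"
    print(f"{directory_size_bytes(storage) / 1024**2:.1f} MiB")
```

Running this periodically (for example once a minute) while the receiving module is down should show whether the size keeps increasing past the configured TTL.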

Context (Environment)

Output of iotedge check

independent

Device Information

  • Host OS [e.g. Ubuntu 22.04, Windows Server IoT 2019]: Ubuntu 22.04 and Debian 12
  • Architecture [e.g. amd64, arm32, arm64]: amd64 and arm64
  • Container OS [e.g. Linux containers, Windows containers]: Linux containers

Runtime Versions

  • aziot-edged [run iotedge version]: 1.5
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.5.19
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.5.19
  • Docker/Moby [run docker version]: 20.10.11+azure-2

Logs

There are already more than 1 million messages in the edgeHub RocksDB store:

edgehub_messages_sent_total{iothub="prodIotHubNtuity.azure-devices.net",edge_device="a17f9e19-1f4a-4168-8f56-2badb1a88646",instance_number="1b3417d6-ded1-4752-a57c-16765ff477d6",from="a17f9e19-1f4a-4168-8f56-2badb1a88646/Modbus",to="a17f9e19-1f4a-4168-8f56-2badb1a88646/ProtocolAbstraction",from_route_output="samples",to_route_input="samples",priority="2000000000",ms_telemetry="True"} 1040373
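
To track this counter per route without reading the raw line, the sample can be parsed with a few lines of Python. This is a rough sketch, assuming the metric is in the Prometheus text format shown above (metric name, labels in braces, then the value, no trailing timestamp). The function name `parse_sample` is hypothetical, and the sample line below is shortened to the `from` and `to` labels.

```python
import re

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One label pair; allows backslash-escaped characters inside the quoted value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value) for one labelled sample line, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


# Shortened version of the sample from this issue.
sample = (
    'edgehub_messages_sent_total{'
    'from="a17f9e19-1f4a-4168-8f56-2badb1a88646/Modbus",'
    'to="a17f9e19-1f4a-4168-8f56-2badb1a88646/ProtocolAbstraction"'
    '} 1040373'
)
name, labels, value = parse_sample(sample)
print(name, labels["to"].split("/")[-1], int(value))
```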

This is an example message from the directory /var/lib/aziot/storage/edgeHub/ that should already have been deleted because it has expired. According to Copilot, messages are not deleted while their RefCount is still > 0:
Image

Additional Information

I think Copilot has already found the root cause. Here is my investigation:
Image

The problematic code may be the following, where messages do not seem to be deleted until their RefCount reaches 0:
https://github.com/Azure/iotedge/blob/main/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.Core/storage/MessageStore.cs#L324
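
To make the suspected behavior concrete, here is a simplified model in Python. It is not the edgeHub code (which is C#) and does not claim to reproduce what `MessageStore.cs` actually does. The names `StoredMessage`, `cleanup_as_reported`, and `cleanup_expected` are hypothetical and only illustrate the difference between what I observe and what I would expect.

```python
from dataclasses import dataclass


@dataclass
class StoredMessage:
    enqueued_at: float  # seconds since some reference point
    ref_count: int      # number of endpoints that still need this message


def is_expired(msg, now, ttl_seconds):
    return now - msg.enqueued_at > ttl_seconds


def cleanup_as_reported(messages, now, ttl_seconds):
    """Behavior described in this issue: a message survives cleanup as long
    as ref_count > 0, so `now` and `ttl_seconds` have no effect on it."""
    return [m for m in messages if m.ref_count > 0]


def cleanup_expected(messages, now, ttl_seconds):
    """Expected behavior: a message survives only if it is still referenced
    AND its TTL has not expired."""
    return [
        m for m in messages
        if m.ref_count > 0 and not is_expired(m, now, ttl_seconds)
    ]


messages = [
    StoredMessage(enqueued_at=0.0, ref_count=1),   # expired, receiver is down
    StoredMessage(enqueued_at=90.0, ref_count=1),  # still within TTL
    StoredMessage(enqueued_at=0.0, ref_count=0),   # already delivered
]
print(len(cleanup_as_reported(messages, now=100.0, ttl_seconds=60.0)))
print(len(cleanup_expected(messages, now=100.0, ttl_seconds=60.0)))
```

In this model, the first message is the one that accumulates on disk: its receiver never acknowledges it, so its reference count never drops to 0, and it is kept even though its TTL has passed.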
