Expose `C_SEGMENT_SIZE_BYTES` and `C_INDEX_SIZE_BYTES` gauges by lukas8219 · Pull Request #218 · rabbitmq/osiris

lukas8219 · 2026-05-19T16:07:27Z

What

Add both metrics for C_SEGMENT_SIZE_BYTES and C_INDEX_SIZE_BYTES so we can enable on Management UI and Prometheus.

How

2 new fields
Increment on each write
Decrement after each retention eval cycle
Calculate the initial sizes doing stat at each file at init

Below is AI-generated

Benchmark with Mac Book M3 Max PRO (32GB RAM 10-4 CPU)

Phase	Result
Fill (50 GB, 101 segments)	~1.25 GB/s sustained, stable across all segments
Restart (101 segments)	9 ms avg (16 ms cold, 6 ms warm) — `sum_log_sizes` is 101 × 2 stat calls, negligible
Retention run 1 (51 segments deleted, 25 GB)	406 ms
Retention runs 2–5 (idempotent, 0 deletes)	2–3 ms

The 406 ms on the first retention run is the actual file deletion of 51 segments. Runs 2–5 (index scan only, nothing to delete) are 2–3 ms regardless of how many segments remain — that confirms the O(remaining) scan is cheap at this scale.

At 100 GB you'd expect ~800 ms for a full 50% delete run, and the same 2–4 ms for idempotent evaluations. Nothing alarming.

Closes #161

lukas8219 · 2026-05-19T16:09:05Z

                            %% local segments
-                            [begin ok = delete_segment_from_index(I) end
-                             || I <- IdxFiles],
+                            [delete_segment_from_index(I) || I <- IdxFiles],


Not sure we pattern match from above is really necessary?

Yeah looks like it isn't since delete_segment_from_index/1 already hard matches on ok

the-mikedavis

I took a look and I have a few nits but I think this is the right approach. I'm not an authority here so I'd wait for feedback from Arnaud / David / Karl before acting on my opinions (especially the index file part).

the-mikedavis · 2026-05-20T00:36:25Z

-         {segments, ?C_SEGMENTS, counter, "Number of segments"}
+         {segments, ?C_SEGMENTS, counter, "Number of segments"},
+         {segment_size_bytes, ?C_SEGMENT_SIZE_BYTES, counter, "Total size of all segment files in bytes"},
+         {index_size_bytes, ?C_INDEX_SIZE_BYTES, counter, "Total size of all index files in bytes"}


I'm not sure it's worthwhile to track index data. It would mean an extra fstat for index files during recovery and retention to gather the size data. Since chunk headers are at minimum 48 bytes per chunk, the segment will always be larger (48 > 29 for each index record). If these metrics aim to answer a question of "which stream is taking up all of my data" I think segment size metrics are enough.

I think there's a separate question of "are my chunks so small they're hurting performance" and that could be answered by a bytes-per-chunk metric instead. Tracking index data here would indirectly hint towards that rather than answer it clearly.

I am not strong on any claim here as I also believe that for big clusters, with loads of streams, that's not a cheap price to pay.

@the-mikedavis
To track or not to track it: I thought maybe, if the overhead is not big, to keep going with both. Easier to catch bugs if users have this level of visibility.
Now that was myself assuming there's a chance the index file could have holes/get bloated somehow - which could have been naive as Streams are running for lots of years and there's not a single complaint about it.

I'll good with removing the index tracking from this PR

the-mikedavis · 2026-05-20T00:57:45Z

                            %% local segments
-                            [begin ok = delete_segment_from_index(I) end
-                             || I <- IdxFiles],
+                            [delete_segment_from_index(I) || I <- IdxFiles],


Yeah looks like it isn't since delete_segment_from_index/1 already hard matches on ok

the-mikedavis · 2026-05-20T01:07:31Z

+            counters:put(Cnt, ?C_SEGMENT_SIZE_BYTES, InitSegBytes),
+            counters:put(Cnt, ?C_INDEX_SIZE_BYTES, InitIdxBytes),


I haven't looked very deeply but I'm not sure this accounts for the seeking to Size shorter than the file length after a hard crash + data loss. There could be garbled garbage sitting at the end of a segment. This would account for those extra bytes permanently

You are referring to the maybe_fix_corrupted_files @the-mikedavis ? If so, the InitXBytes is calculated after that call so the truncates should be applied already.

Actually I'm thinking of first_and_last_seginfos/1, I believe the size it calculates for the last segment file is based on the last valid position in the index file. Since the segment file is written before the index, Size might be smaller than the actual segment file size if the writer terminates between writing the segment and writing the index, or if there is data loss from a power loss. If we set the counter here to the file size (rather than the size according to Size) then we'll always be overestimating: below we set file:position(SegFd, Size) and file:truncate(SegFd). The segment data counter will always contain the truncated bytes.

… entire eval flow

the-mikedavis · 2026-05-28T17:14:03Z

-         {segments, ?C_SEGMENTS, counter, "Number of segments"}
+         {segments, ?C_SEGMENTS, counter, "Number of segments"},
+         {segment_size_bytes, ?C_SEGMENT_SIZE_BYTES, counter, "Total size of all segment files in bytes"},
+         {index_size_bytes, ?C_INDEX_SIZE_BYTES, counter, "Total size of all index files in bytes"}


It looks like all of these are counter currently but I think most of them (maybe all of them?) should be gauge. Maybe it's outside the scope of this change though since the existing ones all use counter

Agree for a future PR

Copilot

Pull request overview

This PR adds two new size metrics to Osiris logs—total segment bytes and total index bytes—so they can be surfaced in the Management UI and exported to Prometheus. The metrics are initialized on startup via filesystem stat, incremented on writes, and decremented after retention deletes segments.

Changes:

Extend osiris_log counters with segment_size_bytes and index_size_bytes, including initialization, write-time increments, and retention-time decrements.
Extend retention evaluation to return deleted segment/index byte counts so stream processes can update the new metrics.
Add Common Test coverage for fresh logs, writes, restart, and retention effects on the new size counters.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`src/osiris_log.erl`	Adds the two new counters, initializes them at `init`, updates them on write, and decrements them based on retention deletions.
`src/osiris_retention.erl`	Adjusts scheduling pattern match to handle the extended retention evaluation return value.
`src/osiris.hrl`	Increases the base log counter field count to accommodate the two new counters.
`test/osiris_log_SUITE.erl`	Adds tests validating the new counters across init/write/restart/retention flows and adapts retention tests to new return shape.
`test/osiris_SUITE.erl`	Adds an integration test ensuring retention decreases `segment_size_bytes` at the writer counter level.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

the-mikedavis

This is looking good to me 👍. I only have nitpicks / style comments left.

Before merging though I think we should discuss/decide whether index file bytes are worthy of tracking. I would lean against it but I don't have a strong opinion either way.

the-mikedavis · 2026-06-02T18:07:10Z

 -spec evaluate_retention(file:filename_all(), [retention_spec()]) ->
    {range(), FirstTimestamp :: osiris:timestamp(),
-     NumRemainingFiles :: non_neg_integer()}.
+     NumRemainingFiles :: non_neg_integer(),
+     {DeletedSegBytes :: non_neg_integer(), DeletedIndexBytes :: non_neg_integer()}}.


I think this tuple is getting a little bit large. It could probably be a record or map instead? (I like maps :)

the-mikedavis · 2026-06-02T18:09:01Z

-    % convert to binary for faster operations later
-    % mostly in segment_from_index_file/1


I think we can keep this comment. Unrelated to this PR I think we should explore replacing all list strings in Osiris with binaries. They're much much more compact for all text like stream names and directories, especially ASCII. I will throw something on my todo list to look at it.

Yes. My bad. AI-slopped hard on this one

- Restored comments about future performance improvements - Removed Index Tracking - Moved to Maps instead of Tuple

Expose SEGMENT_SIZE_BYTES and SEGMENT_SIZE_BYTES gauges

faf46d1

lukas8219 commented May 19, 2026

View reviewed changes

Comment thread src/osiris_log.erl Outdated

lukas8219 commented May 19, 2026

View reviewed changes

Comment thread src/osiris_log.erl Outdated

the-mikedavis requested changes May 20, 2026

View reviewed changes

lukas8219 added 2 commits May 20, 2026 12:27

Adjusted osiris_retention:schedule/3 and added simple test case for…

7e01092

… entire eval flow

Removed unnecessary commentaries

4530f93

the-mikedavis reviewed May 28, 2026

View reviewed changes

kjnilsson requested a review from Copilot May 29, 2026 10:50

Copilot started reviewing on behalf of kjnilsson May 29, 2026 10:50 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

Adjusted counter after truncate call and tail-recursive optimizations

5a6721e

lukas8219 requested a review from the-mikedavis June 1, 2026 14:44

the-mikedavis reviewed Jun 2, 2026

View reviewed changes

Adjusted to Mike comments.

8a3c74f

- Restored comments about future performance improvements - Removed Index Tracking - Moved to Maps instead of Tuple

lukas8219 requested a review from the-mikedavis June 6, 2026 00:30

		counters:put(Cnt, ?C_SEGMENT_SIZE_BYTES, InitSegBytes),
		counters:put(Cnt, ?C_INDEX_SIZE_BYTES, InitIdxBytes),

		% convert to binary for faster operations later
		% mostly in segment_from_index_file/1

Conversation

lukas8219 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

How

Below is AI-generated

Benchmark with Mac Book M3 Max PRO (32GB RAM 10-4 CPU)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

the-mikedavis left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lukas8219 Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

the-mikedavis May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

the-mikedavis left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lukas8219 commented May 19, 2026 •

edited

Loading

lukas8219 Jun 2, 2026 •

edited

Loading

the-mikedavis May 28, 2026 •

edited

Loading