mon,mds: make standby-replay state more capable #4

Draft: ifed01 wants to merge 980 commits into master from wip-ifed-many-standby-replay-mds

ifed01 commented Dec 6, 2025

This allows two standby-replay daemons and permits CephFS scrubbing on a standby-replay MDS.

To fully benefit from this PR, use the 'ceph mds freeze ' command to 'freeze' a specific MDS. A frozen MDS stays in standby-replay mode permanently, while the other daemons cycle through their states as designed. The FS therefore keeps functioning properly with an additional standby-replay daemon, which can monitor the FS in parallel.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

cbodley and others added 30 commits November 7, 2025 09:14
cmake/rgw: WITH_RADOSGW_POSIX depends on WITH_RADOSGW_DBSTORE

Reviewed-by: Kefu Chai <tchaikov@gmail.com>
Fix single backticks to double backticks to properly end the inline
preformatted formatting. Fixes the formatting overflowing until the next
occurrence of double backticks seen in rendered docs, URL:
https://docs.ceph.com/en/latest/radosgw/config-ref/#confval-rgw_scheduler_type

Add full stops that seemed to be missing in desc attribute.
Use singular word "value" in desc attribute when there's only one
possible other value.

Remove unnecessary "the".

Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
Expose rbd_default_clone_format option which has a fairly comprehensive
description (much more verbose than most other options, anyway).  This
should help with understanding the difference between clone v1 and v2.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
doc/rbd/rbd-config-ref: add clone settings section

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
…dosgw

src/common: Fix text formatting in options/rgw.yaml.in
* Introduce qa/clusters/crimson
  4 deployment clusters (1/2/3/4 nodes) options same as classic.

* Symlink all cluster dirs to the common dir above
  For now keep using only 1/2, we could add 3/4 later on.

* Move to "crimson cpu num" instead of specifying
  "crimson cpu set" set.
  - We expect users to mostly use this option for deploying
    clusters, so use this as testing default.

* remove "crimson bluestore cpu set" which is responsible for
  cpu pinning exclusiveness in seastar/alien cores.

* ignore "for optimal performance" cluster warning now that we
  no longer pin cpus for testing.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
This is a preparation step for adding new backend options
testing (e.g. caching type)

* move crimson's objectstore yamls from qa/config to
  qa/objectstore/crimson
* Use the entire qa/objectstore/crimson where possible
  instead of symlinking each backend definition

See:
```
├── objectstore_tool
│   ├── clusters
│   ├── crimson-supported-all-distro -> .qa/distros/crimson-supported-all-distro/
│   ├── deploy
│   ├── objectstore
│   └── tasks
├── perf
│   ├── clusters
│   ├── crimson-supported-all-distro -> .qa/distros/crimson-supported-all-distro/
│   ├── deploy
│   ├── objectstore -> .qa/objectstore/crimson
│   ├── settings
│   └── workloads
├── rbd
│   ├── clusters
│   ├── crimson-supported-all-distro -> .qa/distros/crimson-supported-all-distro/
│   ├── deploy
│   ├── objectstore -> .qa/objectstore/crimson
│   └── tasks
├── singleton
│   ├── all
│   ├── crimson-supported-all-distro -> .qa/distros/crimson-supported-all-distro/
│   └── objectstore -> .qa/objectstore/crimson
├── thrash
│   ├── 0-size-min-size-overrides
│   ├── 1-pg-log-overrides
│   ├── 2-recovery-overrides
│   ├── clusters
│   ├── crimson-supported-all-distro -> .qa/distros/crimson-supported-all-distro/
│   ├── deploy
│   ├── objectstore
│   ├── thrashers
│   └── workloads
```

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
The directories which symlink to the common crimson objectstore
will now also use 2q/lru randomly.

Fixes: https://tracker.ceph.com/issues/72302

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
as rbm is not yet supported with the tool. Disable it
properly.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
When scheduling jobs with --sha1 instead of -c, the ceph-ci
branch used is 'main'. However, ceph-ci doesn't actually have
a main branch, so use the ceph.git main branch instead.

```
Command failed on smithi116 with status 8: "wget -q -O
/home/ubuntu/cephtest/admin_socket_client.0/objecter_requests --
'http://git.ceph.com/?p=ceph-ci.git;a=blob_plain;f=src/test/admin_socket/objecter_requests;hb=main' && chmod u=rx -- /home/ubuntu/cephtest/admin_socket_client.0/objecter_requests"
```

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Bluestore already runs thrash/default

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
https://tracker.ceph.com/issues/67446 is merged, so we should be able
to start testing Seastore similarly to the `crimson-rados/thrash` suite, which
uses ceph_test_rados and rados bench.
crimson-rados-experimental is a copy of crimson-rados thrash with only
objectstore changes.
Once the experimental suite is ready, we could add seastore to
crimson-rados/thrash and remove crimson-rados/thrash_seastore_* variants.

See: https://tracker.ceph.com/issues/71237

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Fixes: https://tracker.ceph.com/issues/73766
Signed-off-by: Abhishek Desai <abhishek.desai1@ibm.com>
…realm-zonegroup

mgr/dashboard: Carbonize Administration module > Create Realm/Zone group/zone

Reviewed-by: Afreen Misbah <afreen@ibm.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
…ert-msg

mgr/dashboard: fix upgrade's cluster alerts popover

Reviewed-by: Afreen Misbah <afreen@ibm.com>
crimson-rados/perf/clusters used fixed-2 though only a single node
was used.

To preserve the current behavior:
* move to the correct fixed-1 definition
* introduce ignorelist yaml as previously included in
  perf/clusters

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
…processing on construction.

These variables are initialized in the s3select/CSV flow, and no local valgrind run has discovered any issue related to them.
However, valgrind reports produced by teuthology sometimes point to run_s3select_on_csv as containing an UninitCondition warning.

Signed-off-by: galsalomon66 <gal.salomon@gmail.com>
Fixed a race condition in the Inotify class where the ev_loop() thread
and caller threads (add_watch/remove_watch) were accessing the
wd_callback_map and wd_remove_map hash maps without synchronization.

This caused a segfault during hash table operations when one thread
was reading from the map while another was modifying it, leading to
iterator invalidation and memory corruption.

Backtrace from the crash:
  Frame 5: file::listing::Inotify::ev_loop()+0x190
  Frame 4: ankerl::unordered_dense::v3_1_0::detail::table::find()
  Crash: Memory access violation during WatchRecord lookup

The fix adds:
- A mutex (map_mutex) to protect both hash maps
- Lock guards in add_watch() and remove_watch() during map modifications
- Lock guard in ev_loop() with proper copying of watch record data to
  avoid holding the lock during callbacks and prevent use-after-free

See https://jenkins.ceph.com/job/ceph-pull-requests/169774/testReport/junit/projectroot.src.test/rgw/unittest_rgw_posix_driver/

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
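
The locking pattern described in this commit message can be sketched roughly as follows. This is an illustration only, not the actual Inotify class: `WatchRegistry`, `WatchRecord`, and `dispatch()` are hypothetical names, a single map stands in for both maps, and `std::unordered_map` stands in for the ankerl hash map.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the per-watch data kept by the real class.
struct WatchRecord {
  std::string path;
  std::function<void(const std::string&)> callback;
};

class WatchRegistry {
  std::mutex map_mutex;  // protects wd_callback_map
  std::unordered_map<int, WatchRecord> wd_callback_map;

public:
  void add_watch(int wd, WatchRecord rec) {
    std::lock_guard<std::mutex> lock(map_mutex);
    wd_callback_map[wd] = std::move(rec);
  }

  void remove_watch(int wd) {
    std::lock_guard<std::mutex> lock(map_mutex);
    wd_callback_map.erase(wd);
  }

  // Meant to be called from the event-loop thread.
  // Returns true if a callback was found and invoked.
  bool dispatch(int wd) {
    std::optional<WatchRecord> copy;
    {
      std::lock_guard<std::mutex> lock(map_mutex);
      auto it = wd_callback_map.find(wd);
      if (it != wd_callback_map.end()) {
        copy = it->second;  // copy the record while the lock is held
      }
    }
    if (!copy) {
      return false;
    }
    // The lock is released here, so the callback may call add_watch() or
    // remove_watch() without deadlocking, and the copy stays valid even
    // if the map entry is erased in the meantime.
    copy->callback(copy->path);
    return true;
  }
};
```

Copying the record costs one `std::function` copy per event, but in exchange no map iterator or reference outlives the critical section.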
…eanup

* in case of a prefix per source, this prevents leaking this object
* in case of a shared prefix, this prevents data loss when other source
  buckets try to commit an already committed temporary object
* when updating the "last committed" attribute, the object must exist.
  this is so that commit without rollover (in case of cleanup) won't
  recreate the deleted object
* some refactoring of try-catch code to have less nesting

Fixes: https://tracker.ceph.com/issues/73675

Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
The scrubber calls PG::_scan_rollback_obs() to clean up obsolete
rollback objects. This function may queue a transaction to
delete such objects.

The commit modifies the scrubber, so that no rescheduling of
the scrub is mandated if no transaction was queued.

Fixes: https://tracker.ceph.com/issues/73773
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
qa/tasks: make the cephadm and vstart_runner tasks aware of watchdog
The ModeCollector class is used to collect values
of some type 'key', each associated with some object
identified by an 'ID'. The collector reports the 'mode'
value - the value associated with the largest number
of distinct IDs.

The results structure returned by the collector specifies
one of three possible mode_status_t values:

- no_mode_value - No clear victory for any value

- mode_value - we have a winner, but it has less than half of the
  samples

- authorative_value - more than half of the samples are of the same
  value

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
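
The behaviour described for ModeCollector can be illustrated with a small sketch. It is not the class added by the commit: `ModeSketch`, `insert()`, `find_mode()`, and `result_t` are hypothetical names, and two details are assumptions made here: a later value for the same ID replaces the earlier one, and "no clear victory" means either no samples or a tie for first place. Only the three `mode_status_t` names are taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>

enum class mode_status_t { no_mode_value, mode_value, authorative_value };

template <typename ID, typename K>
class ModeSketch {
  std::map<ID, K> by_id;  // one value per ID, so each ID is counted once

public:
  struct result_t {
    mode_status_t status;
    std::optional<K> key;  // the mode, if there is one
    std::size_t count;     // number of IDs holding the most common value
  };

  void insert(const ID& id, const K& key) { by_id.insert_or_assign(id, key); }

  result_t find_mode() const {
    // Count how many distinct IDs carry each value.
    std::map<K, std::size_t> freq;
    for (const auto& [id, key] : by_id) {
      ++freq[key];
    }
    std::optional<K> best;
    std::size_t best_count = 0;
    bool tie = false;
    for (const auto& [key, n] : freq) {
      if (n > best_count) {
        best = key;
        best_count = n;
        tie = false;
      } else if (n == best_count) {
        tie = true;
      }
    }
    if (!best || tie) {
      return {mode_status_t::no_mode_value, std::nullopt, best_count};
    }
    if (2 * best_count > by_id.size()) {
      return {mode_status_t::authorative_value, best, best_count};
    }
    return {mode_status_t::mode_value, best, best_count};
  }
};
```

With three IDs reporting 7, 7, and 9, this sketch returns `authorative_value` for 7. Adding two more IDs with other distinct values leaves 7 as the mode but downgrades the status to `mode_value`.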
…nces

mgr, mon, osdc: pass complex parameters by rvalue reference

Reviewed-by: Adam Emerson <aemerson@redhat.com>
Reviewed-By: J. Eric Ivancich <ivancich@redhat.com>
osd/scrub: scanning the rollbacks not mandating a reschedule

Reviewed-by: Samuel Just <sjust@redhat.com>
The entire subsuite is pinned by the centos_latest.yaml symlink, so the
stanza in memcheck.yaml is redundant.  Removing it allows experimenting
with other distros just by varying the symlink target.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Co-authored-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
common: ModeCollector: locating the value of the mode

Reviewed-by: Alex Ainscow <aainscow@uk.ibm.com>
that were left out by mistake in the previous commit.

(the previous commit: 3efcdbf)

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
aainscow and others added 22 commits November 26, 2025 11:28
The can_serve_replica_read() function is called by replica to determine whether there are
any uncommitted writes.  If such writes exist, then the system will reject the IO to avoid
the risk of reading data from a write which may yet be rolled back.

The same code is going to be useful for EC direct reads.

The string_view code is not expensive.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
This was not necessary prior to direct reads, but is essential when the
client needs to know which shard the read came from.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
This allows a backend to expose how an object offset/length translates to
an offset/length on a particular shard.

For Replica, this is trivial.

For EC, this means looking up the start and end offsets, then translating
this to shard address space.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
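
As an illustration of the translation this commit message describes, the sketch below maps an object offset to a shard and a shard-local offset. It assumes a plain striped layout with `k` data shards and a fixed `chunk_size`, which is an assumption of this example rather than something stated in the commit. `ShardExtent`, `replica_translate()`, and `ec_translate()` are hypothetical names, and the sketch clamps the extent at the end of the chunk instead of handling ranges that span several shards.

```cpp
#include <cassert>
#include <cstdint>

struct ShardExtent {
  unsigned shard;        // index of the data shard holding the first byte
  std::uint64_t offset;  // offset within that shard's address space
  std::uint64_t length;  // bytes available on that shard from offset
};

// Replica: every copy holds the whole object, so the mapping is the identity.
inline ShardExtent replica_translate(std::uint64_t off, std::uint64_t len) {
  return {0u, off, len};
}

// EC, assuming simple striping: the object is cut into chunks of chunk_size
// bytes that are dealt round-robin to the k data shards.
inline ShardExtent ec_translate(std::uint64_t off, std::uint64_t len,
                                unsigned k, std::uint64_t chunk_size) {
  const std::uint64_t stripe_width = static_cast<std::uint64_t>(k) * chunk_size;
  const std::uint64_t stripe = off / stripe_width;     // which stripe
  const std::uint64_t in_stripe = off % stripe_width;  // position inside it
  const unsigned shard = static_cast<unsigned>(in_stripe / chunk_size);
  const std::uint64_t in_chunk = in_stripe % chunk_size;
  // Each stripe contributes exactly one chunk to every shard.
  const std::uint64_t shard_off = stripe * chunk_size + in_chunk;
  // Clamp to the end of the current chunk; later bytes live on another shard.
  const std::uint64_t max_len = chunk_size - in_chunk;
  return {shard, shard_off, len < max_len ? len : max_len};
}
```

For example, with `k = 4` and `chunk_size = 4096`, object offset 20580 falls in the second stripe, on shard 1, at shard offset 4196.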
Sparse reads for EC are simple to implement, as the code is essentially
identical to that of replica, with some address translation.

When doing a direct read in EC, only a single OSD is involved.
As such we can do the more performant sync read, rather than
an async read.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
This function is necessary for balanced reads and as such is required for EC too.

Rename the function to make sense, given this change of purpose, but the
functionality does not change.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
For direct read failures, the locking is such that we cannot
immediately send a new IO without deadlocking. This new interface
allows an op to be sent as an asio post.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
This parameter is not used by the _calc_target code.  It is being
removed just to clean up the code, as we are making some changes
to _calc_target in later stages of the split io PR.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
The functionality is not altered by this commit.

In the future we want to post-process split-ios after
recombining the read data.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
…r shard

This will eventually be used by SplitIo to direct ops to the correct OSD.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
…it op

When splitting ops, certain additional sub-ops (e.g. get xattr) can simply be passed
through to the child op.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
This will provide the ability for Objecter to split up
certain ops and distribute them to the OSDs directly if
that provides a performance advantage.

This is experimental code and is switched off unless the
magic pool flags are enabled. These magic pool flags were
pushed in an earlier commit in the same PR.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
…02511

doc: Fix Sphinx warning about theme option
…e-202511

doc/releases: Fix Sphinx warning in tentacle.rst
…ephfs

doc/cephfs: Small improvements in fscrypt.rst
cmake: disable WITH_BREAKPAD on power arch

Reviewed-by: Kefu Chai <k.chai@proxmox.com>
debian: include rgw-gap-list manpage and rgw-policy-check in ceph-common

Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
Reviewed-by: Matan Breizman <mbreizma@ibm.com>
…tial_fix

librbd: rbd_aio_write_with_crc32c store CRC32C with initial value -1 to match msgr2 validation

Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
EC Direct Reads: First PR, background work

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Bill Scales <bill_scales@uk.ibm.com>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Comment thread src/mon/MDSMonitor.cc
Comment on lines +2448 to +2455
int followables = mds_map.is_followable(rank);
if (followables < 2 ) {
  dout(1) << " setting mds." << info->global_id
          << " to follow mds rank " << rank << dendl;
  fsmap.assign_standby_replay(info->global_id, fs.get_fscid(), rank);
  do_propose = true;
  changed = true;
  break;
  //break;
Is it about having at least 2 standby-replay mds?

Author

Ideally this should be configurable, but for now it allows up to 2 MDSs in standby-replay mode.
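
One way the hard-coded limit of 2 could become a setting is sketched below. This is only an illustration of the idea raised in the reply, not code from the PR. `StandbyReplayAssigner`, `max_followers_per_rank`, `try_assign()`, and the type aliases are hypothetical, and the sketch assumes the limit is applied per followed rank.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using rank_id_t = int;
using daemon_id_t = unsigned long;

// Hypothetical, simplified stand-in for the monitor's bookkeeping of which
// daemons follow which rank in standby-replay mode.
struct StandbyReplayAssigner {
  std::size_t max_followers_per_rank;  // hypothetical option; the PR uses 2
  std::map<rank_id_t, std::vector<daemon_id_t>> followers;

  // Returns true if the daemon was assigned to follow the rank, false if
  // the rank already has the maximum number of standby-replay followers.
  bool try_assign(daemon_id_t gid, rank_id_t rank) {
    auto& f = followers[rank];
    if (f.size() >= max_followers_per_rank) {
      return false;
    }
    f.push_back(gid);
    return true;
  }
};
```

With `max_followers_per_rank` set to 2, a third daemon asking to follow the same rank is refused, while it can still be assigned to a different rank.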

@sajibreadd-croit sajibreadd-croit left a comment

LGTM

sajibreadd-croit pushed a commit that referenced this pull request Feb 11, 2026
…yed static"

```
Jan 20 09:27:16 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: AddressSanitizer:DEADLYSIGNAL
Jan 20 09:27:16 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: =================================================================
Jan 20 09:27:16 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: ==3==ERROR: AddressSanitizer: stack-overflow on address 0x7b512f6c8dd8 (pc 0x0000046e7a72 bp 0x7b512de7c900 sp 0x7b512f6c8dd8 T0)
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]:     #0 0x0000046e7a72 in get_global_options() (/usr/bin/ceph-osd-crimson+0x46e7a72) (BuildId: 2a86043f51c9be9cb19801e276fb3ee36239556a)
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]:     #1 0x0000046e540e in build_options() (/usr/bin/ceph-osd-crimson+0x46e540e) (BuildId: 2a86043f51c9be9cb19801e276fb3ee36239556a)
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]:     #2 0x0000033b7949 in get_ceph_options() (/usr/bin/ceph-osd-crimson+0x33b7949) (BuildId: 2a86043f51c9be9cb19801e276fb3ee36239556a)
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]:     #3 0x000003440540 in md_config_t::md_config_t(ConfigValues&, ConfigTracker const&, bool) (/usr/bin/ceph-osd-crimson+0x3440540) (BuildId: 2a860>
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]:     #4 0x0000046856a8 in crimson::common::ConfigProxy::ConfigProxy(EntityName const&, std::basic_string_view<char, std::char_traits<char> >) (/usr>
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]:     #5 0x000000eb6cb5 in seastar::shared_ptr_count_for<crimson::common::ConfigProxy>::shared_ptr_count_for<EntityName&, std::__cxx11::basic_string>
..
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]:     #40 0x000000ed6434 in seastar::future<int> seastar::futurize<int>::apply<crimson::osd::_get_early_config(int, char const**)::{lambda()#1}::ope>
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]:     #41 0x000000ed672b in seastar::async<crimson::osd::_get_early_config(int, char const**)::{lambda()#1}::operator()() const::{lambda()#1}>(seast>
```

This reverts commit 1ab0a8c.

Fixes: https://tracker.ceph.com/issues/74481

Signed-off-by: Matan Breizman <mbreizma@redhat.com>