feat(umbp): chunked DRAM MR registration for NICs with limited max_mr_size #251

Open
maning00 wants to merge 4 commits into main from feat/dram-mr-chunked-registration

@maning00 maning00 commented Apr 3, 2026

Summary

  • Split DRAM tier ibv_reg_mr registration into chunks to prevent abort when
    the NIC's max_mr_size is smaller than DRAMTier capacity
  • Auto-detect device max_mr_size at runtime via ibv_device_attr; also
    supports UMBP_MAX_MR_CHUNK_SIZE env var and max_mr_chunk_size config
    field for manual override
  • Hard-cap IONIC (Pensando) NICs to 4 GB since they report incorrect
    max_mr_size
  • Fix Heartbeat() silently discarding tier_capacities updates

Changes

IOEngine / RDMA layer:

  • Add Backend::GetMaxMemoryRegionSize() virtual method; RdmaBackend
    queries min(max_mr_size) across all devices with a 4 GB cap for Pensando
  • IOEngine exposes GetMaxMemoryRegionSize() delegating to backends
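
The min-with-cap rule described above can be sketched in isolation. This is a minimal illustration, not the PR's code: `DeviceInfo`, `IsIonic`, and `EffectiveMaxMrSize` are hypothetical names, the per-device value is assumed to come from `ibv_device_attr::max_mr_size`, and matching IONIC devices by an `ionic` name prefix is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-device data an RDMA backend would read
// from ibv_device_attr::max_mr_size after querying the device.
struct DeviceInfo {
  std::string name;
  std::uint64_t max_mr_size;
};

// 4 GB hard cap for IONIC (Pensando) NICs, as stated in the summary.
constexpr std::uint64_t kIonicMrCap = 4ULL << 30;

// Assumption: IONIC devices are recognised by an "ionic" name prefix.
inline bool IsIonic(const DeviceInfo& d) {
  return d.name.rfind("ionic", 0) == 0;
}

// Returns the smallest usable MR size across all devices, or the maximum
// uint64_t value when the device list is empty (no limit known).
inline std::uint64_t EffectiveMaxMrSize(const std::vector<DeviceInfo>& devices) {
  std::uint64_t result = std::numeric_limits<std::uint64_t>::max();
  for (const auto& d : devices) {
    assert(d.max_mr_size > 0);  // a zero limit would make registration impossible
    std::uint64_t limit = d.max_mr_size;
    if (IsIonic(d)) limit = std::min(limit, kIonicMrCap);
    result = std::min(result, limit);
  }
  return result;
}
```

Taking the minimum across devices keeps one chunk size valid for every NIC the backend may register against, at the cost of more chunks on NICs that could handle larger regions.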

PoolClient auto-chunking:

  • Init() detects chunk size, splits DRAM buffer registration into
    per-chunk export_dram_mems_ entries; normalizes to SIZE_MAX (no
    chunking, zero overhead) when effective_chunk >= max(buffer_size)
  • RegisterMemory() splits into chunks; RegisteredRegion gains a
    group_base field so DeregisterMemory() can batch-remove all chunks
    belonging to one original registration call
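
The splitting and grouped removal described above might look roughly like the following. It is a sketch under stated assumptions: `RegionChunk`, `NormalizeChunkSize`, `SplitRegion`, and `RemoveGroup` are hypothetical names, the actual `ibv_reg_mr` call per chunk is omitted, and only the `group_base` idea and the `SIZE_MAX` normalization are taken from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one registered chunk. group_base is the base
// address of the original registration the chunk was split from.
struct RegionChunk {
  std::uintptr_t base;
  std::size_t length;
  std::uintptr_t group_base;
};

// When the chunk size already covers the largest buffer, treat it as
// "no chunking" by normalizing to SIZE_MAX.
inline std::size_t NormalizeChunkSize(std::size_t chunk, std::size_t largest_buffer) {
  assert(chunk > 0);
  return chunk >= largest_buffer ? SIZE_MAX : chunk;
}

// Splits [base, base + length) into pieces of at most chunk_size bytes.
// A real implementation would register each piece as its own MR.
inline std::vector<RegionChunk> SplitRegion(std::uintptr_t base, std::size_t length,
                                            std::size_t chunk_size) {
  assert(chunk_size > 0);
  std::vector<RegionChunk> out;
  for (std::size_t off = 0; off < length;) {
    const std::size_t len = std::min(chunk_size, length - off);
    out.push_back({base + off, len, base});
    off += len;
  }
  return out;
}

// Removes every chunk that came from the registration starting at
// group_base and returns how many were removed.
inline std::size_t RemoveGroup(std::vector<RegionChunk>& regions,
                               std::uintptr_t group_base) {
  const std::size_t before = regions.size();
  regions.erase(std::remove_if(regions.begin(), regions.end(),
                               [group_base](const RegionChunk& r) {
                                 return r.group_base == group_base;
                               }),
                regions.end());
  return before - regions.size();
}
```

With the chunk size normalized to `SIZE_MAX`, `SplitRegion` yields a single chunk covering the whole buffer, so the unchunked path needs no special case.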

DRAMTier allocator:

  • Add SetAlignmentBoundary() to guarantee Allocate() never returns a
    slot that crosses a chunk boundary
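
One way to picture the boundary guarantee is a placement helper that bumps a candidate offset to the next boundary whenever the slot would straddle one. This is only an illustration of the invariant: `PlaceWithinChunk` is a hypothetical helper, and the PR does not describe how `DRAMTier`'s allocator enforces the boundary internally.

```cpp
#include <cassert>
#include <cstddef>

// Returns the first offset >= `offset` at which a slot of `size` bytes
// lies entirely inside one chunk of `boundary` bytes.
// Hypothetical helper; assumes 0 < size <= boundary.
inline std::size_t PlaceWithinChunk(std::size_t offset, std::size_t size,
                                    std::size_t boundary) {
  assert(boundary > 0);
  assert(size > 0 && size <= boundary);
  const std::size_t first = offset / boundary;             // chunk holding the first byte
  const std::size_t last = (offset + size - 1) / boundary; // chunk holding the last byte
  if (first == last) return offset;  // already inside one chunk
  return (first + 1) * boundary;     // skip ahead to the start of the next chunk
}
```

The bytes skipped before the boundary are wasted in this sketch, which is the usual price of keeping every slot addressable through a single MR.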

Location ID mapping:

  • LocalStorageManager::BuildTierLocationInfo() changed from static to
    instance method; maps global offset → chunk_index:chunk_offset when
    chunking is active
  • UMBPClient::MaybePublishLocal() adapted accordingly
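
The offset mapping itself is a division and a remainder. The sketch below assumes equally sized chunks; `ChunkLocation`, `ToChunkLocation`, and `FormatChunkLocation` are hypothetical names, and the `index:offset` string is only a rendering of the `chunk_index:chunk_offset` notation above, not necessarily the real location ID format.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical pair describing where a global DRAM offset lands once the
// buffer is registered as equally sized chunks.
struct ChunkLocation {
  std::size_t chunk_index;
  std::size_t chunk_offset;
};

// Maps a global offset to (chunk_index, chunk_offset).
inline ChunkLocation ToChunkLocation(std::size_t global_offset, std::size_t chunk_size) {
  assert(chunk_size > 0);
  return {global_offset / chunk_size, global_offset % chunk_size};
}

// Renders the pair as "index:offset" (assumed format, for illustration only).
inline std::string FormatChunkLocation(const ChunkLocation& loc) {
  return std::to_string(loc.chunk_index) + ":" + std::to_string(loc.chunk_offset);
}
```

For example, with a 1 MiB chunk size (1048576 bytes), global offset 1048581 maps to chunk 1 at offset 5. Because the allocator keeps slots from crossing chunk boundaries, a whole slot can be addressed through the chunk of its first byte.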

Config & bindings:

  • UMBPDistributedConfig / PoolClientConfig gain max_mr_chunk_size
  • pybind exposes the new field
  • UMBPClient constructor reads UMBP_MAX_MR_CHUNK_SIZE env var

Bug fix:

  • ClientRegistry::Heartbeat() had (void)tier_capacities — capacity
    updates from clients were silently dropped. Now writes them into
    ClientRecord.

Test plan

  • All 15 UMBP unit tests pass (including previously failing
    HeartbeatUpdatesCapacities)
  • Python binding verified: max_mr_chunk_size is readable/writable
  • BUILD_UMBP=1 BUILD_TESTS=1 pip install . builds cleanly
  • Integration test with UMBP_MAX_MR_CHUNK_SIZE=1048576 for
    distributed Put/Get across chunk boundaries
  • On-device validation with IONIC NIC
