merlion-node-exporter-cpp — Design

This document is the source of truth for the architecture of the C++ sibling of merlion-node-exporter-rs. It captures every decision that Codex (or any implementer) should not have to re-litigate.

If you find yourself making a non-trivial choice that this document doesn't cover, stop and update this document first so the choice shows up in the design history rather than in code archaeology.


1. Goals & non-goals

Goals

  • Wire compatibility with upstream node_exporter. A Prometheus scrape config / dashboard / alerting rule that works against upstream node_exporter must work against this binary, byte-for-byte, for the collectors we ship. Metric names, label sets, type lines, and HELP text all follow upstream conventions.
  • Wire compatibility with merlion-node-exporter-rs. The two Merlion implementations are interchangeable. A scrape diff between them should reduce to numeric value differences and label-ordering noise — never metric-name or shape differences.
  • Linux MVP scope. The 15 collectors listed in §7 Implementation Plan — same set as -rs.
  • Modern C++ idioms. C++23, std::expected, std::format, std::string_view at boundaries, RAII everywhere, constexpr where it pays for itself.
  • Header-only-where-practical dependencies. Keep FetchContent pulls minimal so the project builds cleanly on Linux and macOS without package-manager assistance beyond brew install llvm cmake.

Non-goals

  • BSD / Darwin / Solaris / AIX collectors. macOS is supported as a build target so contributors can compile on their laptops, but the only collector that actually returns data on macOS is uname. Everything else returns an error and degrades to node_scrape_collector_success{collector="..."} 0, matching -rs's behaviour.
  • Histograms, summaries, or OpenMetrics protobuf negotiation in the MVP. The exposition encoder targets the text 0.0.4 subset that node_exporter actually emits today.
  • A Prometheus client-library dependency. The Metric model is hand-rolled — see §3.2 for the rationale.

2. Tech stack

| Concern | Choice | Notes |
| --- | --- | --- |
| Language | C++23 | rust-version parallel: cmake_minimum_required(VERSION 3.28) |
| Compiler | Clang ≥ 18 (Homebrew LLVM on macOS) | Apple Clang is not supported: std::expected / std::format lag in libc++ |
| Build | CMake 3.28+ | FetchContent for deps, ctest for tests |
| HTTP | cpp-httplib v0.18+ | Blocking, header-only, perfect for the per-scrape model |
| CLI | CLI11 v2.4+ | Header-only |
| Logging | spdlog v1.14+ | Mirrors the tracing setup in -rs |
| Tests | Catch2 v3 | FetchContent'd |
| Format | clang-format | Style pinned in .clang-format (LLVM base + 4-space indent + 100-col line) |
| CI | GitHub Actions | Matrix: ubuntu-24.04 (clang-18, gcc-14) + macos-14 (brew llvm) |

Why these and not the alternatives

  • cpp-httplib over Boost.Beast / Crow / Drogon. Per-scrape blocking handlers are the right model for a node exporter — there is nothing to await. cpp-httplib is the smallest dependency that gives us a single-header blocking HTTP server with thread-pool handling. Boost is enormous; Crow / Drogon are async frameworks we don't need.
  • Hand-rolled Metric model. Identical reasoning to -rs's decision to skip prometheus-client: the typed pre-registration pattern popular client libraries optimise for is more verbose than helpful when every scrape re-reads /proc fresh.
  • CLI11 over Boost.Program_options / argparse. Header-only, no Boost, declarative builder API maps cleanly onto clap in -rs so the two CLIs stay in sync.
  • Homebrew LLVM clang, not Apple Clang. std::expected and std::format are still partial in Apple's libc++. This matches the project-wide toolchain decision in merlion-tsdb-cpp.

3. Architecture

3.1 Component map

The C++ tree mirrors -rs's module layout one-to-one. A reviewer who knows one project should be able to find the equivalent file in the other in five seconds.

merlion-node-exporter-cpp/
├── include/merlion_node_exporter/
│   ├── metric.hpp          # Metric / Sample / MetricType
│   ├── encoding.hpp        # Prometheus text-format encoder
│   ├── registry.hpp        # Collector interface + Registry
│   ├── config.hpp          # path.{procfs,sysfs,rootfs}
│   ├── server.hpp          # cpp-httplib /metrics server
│   ├── cli.hpp             # CLI11 argument struct
│   └── version.hpp         # generated by CMake from project version
├── src/
│   ├── encoding.cpp
│   ├── registry.cpp
│   ├── config.cpp
│   ├── server.cpp
│   ├── cli.cpp
│   ├── main.cpp
│   └── collectors/
│       ├── loadavg.cpp
│       ├── meminfo.cpp
│       └── uname.cpp
├── tests/
│   ├── encoding_test.cpp
│   ├── loadavg_test.cpp
│   ├── meminfo_test.cpp
│   └── uname_test.cpp
├── cmake/
│   └── ToolchainLLVM.cmake # optional: forces brew LLVM on macOS
├── docs/
│   └── DESIGN.md           # ← you are here
├── .clang-format
├── .github/workflows/ci.yml
├── CMakeLists.txt
├── README.md
├── LICENSE
└── NOTICE

The -rs equivalents are:

| C++ file | Rust file |
| --- | --- |
| metric.hpp | src/metric.rs |
| encoding.{hpp,cpp} | src/encoding.rs |
| registry.{hpp,cpp} | src/registry.rs |
| config.{hpp,cpp} | src/config.rs |
| server.{hpp,cpp} | src/server.rs |
| cli.{hpp,cpp} | src/cli.rs |
| collectors/<X>.cpp | src/collectors/<X>.rs |
| main.cpp | src/main.rs |

3.2 Metric model

Mirrors -rs exactly. Public surface:

namespace merlion::node_exporter {

enum class MetricType { Counter, Gauge, Untyped };

struct Sample {
    // Ordered key/value pairs; order is preserved as supplied by the
    // collector so output is deterministic.
    std::vector<std::pair<std::string, std::string>> labels;
    double value = 0.0;
};

struct Metric {
    std::string name;
    std::string help;
    MetricType  mtype = MetricType::Untyped;
    std::vector<Sample> samples;
};

} // namespace merlion::node_exporter

Construction is plain aggregate / brace-init; no builder pattern. The hand-rolled fluent style in -rs is unnecessary in C++ where designated initialisers exist.
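As an illustration of the brace-init style described above, here is a minimal sketch using C++20 designated initialisers. The helper names make_load1 and make_cpu_idle are hypothetical and exist only for this example; the metric names and HELP strings follow upstream node_exporter conventions, and the namespace is omitted for brevity.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Types as listed in the public surface above (namespace omitted for brevity).
enum class MetricType { Counter, Gauge, Untyped };

struct Sample {
    std::vector<std::pair<std::string, std::string>> labels;
    double value = 0.0;
};

struct Metric {
    std::string name;
    std::string help;
    MetricType  mtype = MetricType::Untyped;
    std::vector<Sample> samples;
};

// Hypothetical helper: an unlabelled gauge family with a single sample.
inline Metric make_load1(double load1) {
    return Metric{
        .name    = "node_load1",
        .help    = "1m load average.",
        .mtype   = MetricType::Gauge,
        .samples = {Sample{.labels = {}, .value = load1}},
    };
}

// Hypothetical helper: a labelled counter sample. Labels keep the order
// written here (cpu first, then mode).
inline Metric make_cpu_idle(std::string cpu, double seconds) {
    assert(!cpu.empty());
    return Metric{
        .name    = "node_cpu_seconds_total",
        .help    = "Seconds the CPUs spent in each mode.",
        .mtype   = MetricType::Counter,
        .samples = {Sample{
            .labels = {{"cpu", std::move(cpu)}, {"mode", "idle"}},
            .value  = seconds,
        }},
    };
}
```

Designated initialisers must appear in declaration order, which keeps every construction site visually aligned with the struct definition.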

3.3 Encoding

The encoder writes Prometheus text format 0.0.4 byte-for-byte compatible with -rs's src/encoding.rs. Public surface:

namespace merlion::node_exporter::encoding {

inline constexpr std::string_view content_type =
    "text/plain; version=0.0.4; charset=utf-8";

std::string encode(std::span<const Metric> metrics);

} // namespace merlion::node_exporter::encoding

Rules (must match -rs):

  1. Skip metric families with no samples.
  2. Emit # HELP <name> <escaped help>\n then # TYPE <name> <type>\n per family.
  3. One sample per line: <name>{<labels>} <value>\n.
  4. Label-value escaping: \ becomes \\, " becomes \", and a newline becomes \n (literal backslash-n).
  5. Integer-valued doubles with |v| < 1e15 print without a decimal point; everything else uses std::format("{}", v).
  6. NaN, +Inf, -Inf printed literally.

A round-trip test fixture in tests/encoding_test.cpp asserts that the canonical output exactly equals the byte string produced by -rs's encoder for the same input. This fixture is the contract that keeps the two implementations interchangeable.

3.4 Collector interface

namespace merlion::node_exporter {

struct Config {
    std::filesystem::path procfs = "/proc";
    std::filesystem::path sysfs  = "/sys";
    std::filesystem::path rootfs = "/";

    // /proc/foo and foo both resolve to <procfs>/foo.
    std::filesystem::path proc_path(std::string_view rel) const;
    std::filesystem::path sys_path(std::string_view rel) const;
};

class Collector {
public:
    virtual ~Collector() = default;
    virtual std::string_view name() const noexcept = 0;
    virtual std::expected<std::vector<Metric>, std::string>
        collect(const Config&) const = 0;
};

} // namespace merlion::node_exporter
  • std::expected (C++23) is the canonical error-reporting mechanism: no exceptions cross the collector boundary. Exceptions thrown inside a collector implementation are caught by the registry and converted to std::unexpected("…").
  • name() returns a string literal ("loadavg", "meminfo", …). It is a stable identifier used as the argument to --no-collector <NAME> (§3.7) and as the collector label on scrape-status metrics.
  • Collectors are stateless and reused across scrapes.
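
The comment on Config says that /proc/foo and foo both resolve to <procfs>/foo. One way this could work is sketched below. The free function resolve_under is a hypothetical helper, not part of the declared interface, and stripping the prefix only on a whole path component (so that "/process" is left alone) is an assumption about the intended behaviour.

```cpp
#include <cassert>
#include <filesystem>
#include <string_view>

namespace fs = std::filesystem;

// Hypothetical helper: maps `rel` under `root`, accepting both "foo" and
// "<canonical>/foo" spellings.
inline fs::path resolve_under(const fs::path& root, std::string_view canonical,
                              std::string_view rel) {
    assert(!root.empty());
    // Strip the canonical prefix only when it is a whole path component.
    const bool has_prefix =
        rel.substr(0, canonical.size()) == canonical &&
        (rel.size() == canonical.size() || rel[canonical.size()] == '/');
    if (has_prefix) rel.remove_prefix(canonical.size());
    // Drop leading slashes so operator/ appends instead of replacing root.
    while (!rel.empty() && rel.front() == '/') rel.remove_prefix(1);
    return rel.empty() ? root : root / fs::path(rel);
}

struct Config {
    fs::path procfs = "/proc";
    fs::path sysfs  = "/sys";
    fs::path rootfs = "/";

    fs::path proc_path(std::string_view rel) const {
        return resolve_under(procfs, "/proc", rel);
    }
    fs::path sys_path(std::string_view rel) const {
        return resolve_under(sysfs, "/sys", rel);
    }
};
```

Removing the leading slash matters because std::filesystem's operator/ discards the left-hand side when the right-hand side is an absolute path.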

3.5 Registry

class Registry {
public:
    void register_collector(std::unique_ptr<Collector>);
    std::vector<std::string_view> enabled_names() const;

    // Runs every collector, appends per-collector
    // node_scrape_collector_success and
    // node_scrape_collector_duration_seconds, returns the flat metric list.
    std::vector<Metric> gather(const Config&) const;
};

The two synthesised metric families (node_scrape_collector_success, node_scrape_collector_duration_seconds) match -rs byte-for-byte — same names, same labels, same HELP text. See src/registry.rs::Registry::gather for the reference behaviour.

3.6 HTTP server

cpp-httplib blocking server bound to --web.listen-address. Single GET <telemetry_path> route. Steps per request:

  1. Run registry.gather(config).
  2. encoding::encode(metrics) into a std::string.
  3. Respond with Content-Type: text/plain; version=0.0.4; charset=utf-8.

Errors are logged via spdlog and produce a 500 with body "internal error\n" — same wording as -rs.

Graceful shutdown: SIGINT / SIGTERM triggers server.stop().

3.7 CLI

Flag-for-flag parity with -rs (and therefore upstream node_exporter). Use CLI11:

--web.listen-address <ADDR>   default :9100 (env MNE_LISTEN_ADDRESS)
--web.telemetry-path <PATH>   default /metrics (env MNE_TELEMETRY_PATH)
--path.procfs <DIR>           default /proc (env MNE_PROCFS)
--path.sysfs <DIR>            default /sys  (env MNE_SYSFS)
--path.rootfs <DIR>           default /     (env MNE_ROOTFS)
--no-collector <NAME>         repeatable
--collector.only <NAME>       repeatable

:9100 resolves to 0.0.0.0:9100 in the server-bind step, matching -rs's Cli::resolved_listen_address.
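
The address resolution described above might look like the sketch below. The function name resolved_listen_address is borrowed from -rs's Cli::resolved_listen_address. split_host_port and HostPort are hypothetical additions for servers that bind with a separate host and port, and bracketed IPv6 literals are deliberately not handled.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// A bare ":port" means "all IPv4 interfaces"; anything else passes through.
inline std::string resolved_listen_address(std::string_view addr) {
    if (!addr.empty() && addr.front() == ':') {
        return "0.0.0.0" + std::string(addr);
    }
    return std::string(addr);
}

// Hypothetical value type for a host/port pair.
struct HostPort {
    std::string host;
    int port = 0;
};

// Hypothetical helper: splits "host:port" at the last colon and validates
// the port range. Returns std::nullopt for malformed input.
inline std::optional<HostPort> split_host_port(std::string_view addr) {
    const std::string full = resolved_listen_address(addr);
    const auto colon = full.rfind(':');
    if (colon == std::string::npos || colon + 1 >= full.size()) {
        return std::nullopt;
    }
    int port = 0;
    const char* first = full.data() + colon + 1;
    const char* last = full.data() + full.size();
    const auto [ptr, ec] = std::from_chars(first, last, port);
    if (ec != std::errc{} || ptr != last || port < 1 || port > 65535) {
        return std::nullopt;
    }
    return HostPort{full.substr(0, colon), port};
}
```

Returning std::optional keeps malformed addresses a startup-time configuration error instead of a failure inside the bind call.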

3.8 Logging

spdlog default logger; level controlled by MNE_LOG_LEVEL env (values: trace|debug|info|warn|error, default info). The Rust side honours RUST_LOG; the C++ side uses MNE_LOG_LEVEL to avoid ambiguity. Document this divergence in the binary's --help output.


4. Build

CMakeLists.txt at the project root drives everything. Highlights:

  • cmake_minimum_required(VERSION 3.28) and set(CMAKE_CXX_STANDARD 23).
  • Detects the toolchain. On macOS, if CMAKE_CXX_COMPILER is not set, emit a FATAL_ERROR pointing the user at the README's Homebrew LLVM instructions — silently falling back to Apple Clang is a footgun.
  • FetchContent_Declare blocks for cpp-httplib, CLI11, spdlog, Catch2, pinned to specific versions (no GIT_TAG main).
  • Builds two targets:
    • merlion_node_exporter_lib (static): everything in src/ except main.cpp.
    • merlion-node-exporter (executable): links the lib + main.cpp.
  • MNE_BUILD_TESTS option (default ON when the project is the top level, OFF when included as a subdir). When on, builds merlion_node_exporter_tests against Catch2 and registers it with ctest.
  • Generates include/merlion_node_exporter/version.hpp from PROJECT_VERSION so --version output stays in sync with CMakeLists.txt.

Performance-relevant flags for the release config:

-O3 -fno-plt -ffunction-sections -fdata-sections
-Wl,--gc-sections   # Linux
-Wl,-dead_strip     # macOS

LTO is enabled when supported (check_ipo_supported).


5. Errors & observability

  • Inside a collector: prefer std::expected and short-circuit helpers. If the kernel surface throws (std::filesystem), catch at the collector boundary and return std::unexpected.
  • Registry: catches exceptions from Collector::collect, logs them via spdlog::error, records node_scrape_collector_success=0, still records the duration sample.
  • Server: any uncaught exception in a handler becomes a 500 + log line. The process never exits because a single scrape failed.

6. Test strategy

  • Pure parsers (encoding, loadavg::parse, meminfo::parse) are tested with hard-coded fixtures stored inline in the test source. Tests must include the same fixtures used by -rs so we know we agree byte-for-byte on synthetic input.
  • Filesystem-bound collectors are tested by pointing Config::procfs at a temp directory the test populates. No /proc access in unit tests.
  • Encoder round-trip: tests/encoding_test.cpp includes a snapshot text file (tests/data/expected_scrape.txt) and asserts the encoder output equals it. The same snapshot is checked in to -rs and a CI job in both repos runs the snapshot through both implementations (TODO — track in [issue #2]).
  • Smoke test: ctest starts the binary on a random port, scrapes /metrics, asserts response code 200 and Content-Type correct.
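
For the filesystem-bound tests, a small RAII fixture along these lines could populate a throwaway procfs tree. TempProcfs and read_all are hypothetical names, and the random suffix is just one simple way to avoid collisions between concurrently running tests.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <random>
#include <sstream>
#include <string>
#include <string_view>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical RAII fixture: creates a private directory on construction and
// removes it, with everything inside, on destruction.
class TempProcfs {
public:
    TempProcfs()
        : root_(fs::temp_directory_path() /
                fs::path("mne_procfs_" +
                         std::to_string(std::random_device{}()))) {
        fs::create_directories(root_);
    }
    ~TempProcfs() {
        std::error_code ec;  // never throw from a destructor
        fs::remove_all(root_, ec);
    }
    TempProcfs(const TempProcfs&) = delete;
    TempProcfs& operator=(const TempProcfs&) = delete;

    // Writes `contents` to <root>/<rel>, creating parent directories.
    void write(std::string_view rel, std::string_view contents) const {
        const fs::path target = root_ / fs::path(rel);
        fs::create_directories(target.parent_path());
        std::ofstream out(target, std::ios::binary);
        out << contents;
    }

    const fs::path& root() const noexcept { return root_; }

private:
    fs::path root_;
};

// Hypothetical helper: reads a whole file back for comparison.
inline std::string read_all(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}
```

A test would then set Config::procfs to the fixture's root() before calling the collector under test, so the real /proc is never touched.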

7. Implementation plan

Ordered so the project is useful as soon as possible. One PR per checkbox. Land scaffold + first three collectors first to validate the architecture; everything after that is mechanical extension.

Scaffold

  • PR #1: CMakeLists.txt, .clang-format, cmake/ToolchainLLVM.cmake, stub main.cpp, CI workflow. The binary builds, runs --version, and exits 0 even though no collectors exist yet. No HTTP server is wired up.

Core

  • PR #2: include/.../metric.hpp + tests
  • PR #3: encoding.{hpp,cpp} + tests (must pass the cross-language snapshot fixture)
  • PR #4: config.{hpp,cpp} + tests
  • PR #5: registry.{hpp,cpp} + tests (includes synthesised scrape metrics)
  • PR #6: cli.{hpp,cpp} + server.{hpp,cpp}, wires up /metrics. main.cpp becomes the production entry point.

Seed collectors (mirror -rs scaffold PR)

  • PR #7: loadavg
  • PR #8: meminfo
  • PR #9: uname

Linux MVP (one PR each; same order as -rs)

  • PR #10: cpu - /proc/stat per-CPU jiffies
  • PR #11: diskstats - /proc/diskstats
  • PR #12: netdev - /proc/net/dev
  • PR #13: filesystem - getmntinfo + statvfs
  • PR #14: stat - /proc/stat (boot time, intr, ctxt, processes)
  • PR #15: vmstat - /proc/vmstat
  • PR #16: netstat - /proc/net/{netstat,snmp,snmp6}
  • PR #17: sockstat - /proc/net/sockstat{,6}
  • PR #18: pressure - /proc/pressure/{cpu,memory,io}
  • PR #19: hwmon - /sys/class/hwmon/
  • PR #20: thermal_zone - /sys/class/thermal/thermal_zone*
  • PR #21: time - system clock + NTP sync state
  • PR #22: textfile - *.prom files from a configured directory

Past MVP

  • Container image (Dockerfile)
  • Homebrew formula in the MerlionOS/homebrew-merlion tap
  • eBPF-backed collectors behind a CMake option (Linux only)

8. Cross-implementation contract

Whenever this document or -rs's public behaviour changes in a way that affects scrape output (new metric, new label, renamed metric, changed HELP text, …), update both repos in lock-step:

  1. PR against this repo with the design update.
  2. PR against -rs implementing the change.
  3. Cross-link the PRs in both descriptions.

Disagreement between the two implementations is a bug in whichever one diverged from this document.