From 21f972a58be2cbf8086e42dadff50b2bfe16eb78 Mon Sep 17 00:00:00 2001 From: Artem Shcherbatiuk Date: Thu, 19 Mar 2026 16:17:01 +0100 Subject: [PATCH 1/4] docs: config --- _docs/config/README.md | 131 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 131 insertions(+) create mode 100644 _docs/config/README.md diff --git a/_docs/config/README.md b/_docs/config/README.md new file mode 100644 index 00000000..2e671ba2 --- /dev/null +++ b/_docs/config/README.md @@ -0,0 +1,131 @@ +# Provider Config Proposal + +## Requirements + +- Provider should be configured by one config. +- Single source of truth that can be watched by provider; changes in the source must be propagated to all watching providers. +- Hot reload - provider should apply config changes on the fly. +- Config must not be public. +- Follow the inventory operator config style. + +## Questions to clarify + +- Do we want to share a single config between providers located in different K8s clusters? (If yes, ConfigMap is not suitable.) + +## Proposed solution + +### Config format + +YAML format with subsections per module (like inventory operator). + +
+Expand to see config example + +```yaml +version: v1 +cluster: + k8s: true + manifest_namespace: lease + public_hostname: "" + node_port_quantity: 1 + wait_ready_duration: 5s + overcommit: + cpu: 0 + memory: 0 + storage: 0 + deployment: + ingress_static_hosts: false + ingress_domain: "" + ingress_expose_lb_hosts: false + network_policies_enabled: true + runtime_class: gvisor + blocked_hostnames: [] + docker_image_pull_secrets: "" + +bidengine: + pricing_strategy: scale + deposit: "5000000uakt" + timeout: 5m + scale: + cpu: "0" + memory: "0" + storage: "0" + endpoint: "0" + ip: "0" + +gateway: + listen_address: "0.0.0.0:8443" + grpc_listen_address: "0.0.0.0:8444" + tls: { cert: "", key: "" } + +monitor: + max_retries: 40 + retry_period: 4s + retry_period_jitter: 15s + healthcheck_period: 10s + healthcheck_period_jitter: 5s + +balance_checker: + withdrawal_period: 24h + lease_funds_check_interval: 10m + +cert_issuer: + enabled: false + +# ... other sections +``` + +
+ +### Hot reload + +Some values (e.g. cluster.k8s, manifest_namespace) require a restart - "startup config". Others can be reloaded on the fly - "runtime config". + +Decisions: + +1. **Auto-restart on config change?** No - to avoid unexpected downtime. Implement a notification mechanism so the operator knows when a restart is required. + +2. **Mixed change (runtime + startup):** Apply runtime values only. Operator must receive a notification that restart is required. + +3. **Module re-init without full process restart?** Possibly yes, in a later iteration. Cluster and bidengine have shared state, so a clean restart is recommended for those modules. Some values (e.g. listen address) could be applied without restart by redesign - start new server, close old one. + +## Solution comparison + +### By scenario + +| Scenario | Best fit | +|----------|----------| +| **Single cluster** | ConfigMap + K8s watch | +| **Multi-cluster, minimal infra** | S3 + poll | +| **Multi-cluster, near-instant updates** | Redis, Consul or Vault | + + +### S3 vs Other Config Sources + +| Criteria | S3 + poll | ConfigMap + K8s watch | Redis | Consul | HTTP + poll | Vault | +|----------|-----------|------------------------|-------|--------|-------------|-------| +| **Single source of truth** | Yes | Per cluster | Yes | Yes | Yes | Yes | +| **Multi-cluster** | Yes | No | Yes* | Yes | Yes | Yes | +| **Watch / push** | No (poll) | Yes | Yes (pub/sub) | Yes | No (poll) | Yes (KV watch) | +| **Auth** | Access key or IAM | K8s SA + RBAC | Password | ACL token | Bearer, mTLS, OAuth2 | AppRole, K8s auth | +| **Auth: set once** | Yes (key) | Yes (SA) | Yes (password) | Yes (token) | Depends | Yes (AppRole) | +| **Auth: outside cloud** | Access key | N/A (in-cluster) | Password | Token | Bearer, mTLS | AppRole | +| **Infra to run** | None (managed) | None | Redis | Consul | HTTP server | Vault | +| **Provider deps** | AWS SDK | K8s client (existing) | redis client | consul client | net/http | vault client | +| **Complexity** | Low | Low | Medium | Medium | Low-Medium | High | +| **Max config delay** | Poll interval (e.g. 30s) | Seconds | Seconds | Seconds | Poll interval | Seconds | + +\* Redis must be reachable from all clusters (shared instance or replication). + + + +### Trade-offs + +| Solution | Pros | Cons | +|----------|------|------| +| **S3** | No extra infra, managed, multi-cloud, simple auth | Polling only, config delay up to poll interval | +| **ConfigMap** | Native K8s, real-time watch, no secrets | Single cluster only | +| **Redis** | Pub/sub, fast updates, simple auth | Run and operate Redis | +| **Consul** | KV + watch, multi-datacenter, ACL | Run and operate Consul | +| **HTTP** | Flexible, any backend | Need server + watch/poll strategy | +| **Vault** | Strong auth, KV watch | Heavy, more setup | From cd72d7f76254f9aa5dc297936b0fbc3de5ac2c74 Mon Sep 17 00:00:00 2001 From: Artem Shcherbatiuk Date: Thu, 19 Mar 2026 18:49:08 +0100 Subject: [PATCH 2/4] docs: config format --- _docs/config/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/_docs/config/README.md b/_docs/config/README.md index 2e671ba2..ed0cffca 100644 --- a/_docs/config/README.md +++ b/_docs/config/README.md @@ -56,7 +56,9 @@ bidengine: gateway: listen_address: "0.0.0.0:8443" grpc_listen_address: "0.0.0.0:8444" - tls: { cert: "", key: "" } + tls: + cert: "" + key: "" monitor: max_retries: 40 From b62f2dc1bf42c597e61ffb030caf0ff9e342c23b Mon Sep 17 00:00:00 2001 From: Artem Shcherbatiuk Date: Thu, 19 Mar 2026 20:28:18 +0100 Subject: [PATCH 3/4] docs: local override, go impl, migration plan --- _docs/config/README.md | 56 +++++++++++++++++++++++++++++++++++++++--- 1 file changed, 52 insertions(+), 4 deletions(-) diff --git a/_docs/config/README.md b/_docs/config/README.md index ed0cffca..033a5559 100644 --- a/_docs/config/README.md +++ b/_docs/config/README.md @@ -12,6 +12,14 @@ - Do we want to share a single config between providers located in different K8s clusters? (If yes, ConfigMap is not suitable.) +## Terminology + +- **Ops** (human operator): Person who runs and maintains the provider. Receives notifications, decides when to restart. +- **K8s operator**: Controller running in Kubernetes (e.g. inventory operator) that manages provider deployments and related resources. +- **Startup config**: Values (e.g. cluster.k8s, manifest_namespace) that require a restart to take effect. +- **Runtime config**: Values that can be reloaded on the fly without restart. + + ## Proposed solution ### Config format @@ -81,16 +89,33 @@ cert_issuer: ### Hot reload -Some values (e.g. cluster.k8s, manifest_namespace) require a restart - "startup config". Others can be reloaded on the fly - "runtime config". - Decisions: -1. **Auto-restart on config change?** No - to avoid unexpected downtime. Implement a notification mechanism so the operator knows when a restart is required. +1. **Auto-restart on config change?** No - to avoid unexpected downtime. Notify Ops; they restart when ready. -2. **Mixed change (runtime + startup):** Apply runtime values only. Operator must receive a notification that restart is required. +2. **Mixed change (runtime + startup):** Apply runtime values only. Ops must be notified that restart is required. 3. **Module re-init without full process restart?** Possibly yes, in a later iteration. Cluster and bidengine have shared state, so a clean restart is recommended for those modules. Some values (e.g. listen address) could be applied without restart by redesign - start new server, close old one. +**Restart notification** (when startup config changes, notify Ops; they restart when ready). Fully automated: Ops is pushed the notification, no manual checks. + +Flow: +1. Provider loads config from source (ConfigMap, S3, etc.) and watches or polls for changes. +2. Provider detects startup config change. Applies runtime changes only. +3. Provider emits notification (Prometheus metric, K8s Event, or pub/sub message). +4. Ops receives notification automatically (Prometheus alert, Slack, PagerDuty, etc.) - no active check required. +5. Ops restarts provider when ready (maintenance window, low traffic, etc.). + +| Config source | Notification channel | +|---------------|----------------------| +| **ConfigMap** | Provider creates K8s Event (alert on Event) or sets Prometheus metric; Ops gets paged. | +| **S3** | Provider sets Prometheus metric `provider_config_restart_required=1`; Ops alert fires. | +| **HTTP** | Same as S3. | +| **Redis** | Provider publishes to `restart_required` channel; consumer triggers alert (push to Ops). | +| **Consul** | Provider sets Prometheus metric; or consumer watches KV and triggers alert. | + +Provider keeps running normally. Notification (metric, Event, pub/sub) is a passive marker - no change to provider behavior, no traffic drain. Ops gets alerted and restarts when ready. + ## Solution comparison ### By scenario @@ -131,3 +156,26 @@ Decisions: | **Consul** | KV + watch, multi-datacenter, ACL | Run and operate Consul | | **HTTP** | Flexible, any backend | Need server + watch/poll strategy | | **Vault** | Strong auth, KV watch | Heavy, more setup | + +## Migration plan + +1. **Phase 1 - Struct + loader**: Define Go structs for config, implement YAML loader. Keep flags; map flags to struct fields during transition. +2. **Phase 2 - Remote source**: Add S3/ConfigMap backend as primary config source. Flags override remote values (backward compat). +3. **Phase 3 - Remove flags**: Deprecate individual flags; remote config becomes the only input. Env vars for secrets only (e.g. `AKASH_PROVIDER_KEY`). +4. **Phase 4(optional) - File fallback**: Add optional local file for dev; used when remote is not configured or unreachable. + +## Go implementation + +- **YAML parsing**: `gopkg.in/yaml.v3` (already in go.mod) +- **Config struct + merge**: Custom structs with `mapstructure` tags; `github.com/go-viper/mapstructure/v2` for YAML-to-struct +- **File watch**: `fsnotify` or `github.com/fsnotify/fsnotify` for local file; K8s watch for ConfigMap; S3 poll +- **No Viper for new config**: Current code uses `spf13/viper` with flags. New design: explicit load (YAML unmarshal + optional merge), no Viper. Simplifies precedence and avoids flag/config coupling. + +## Local override of global config + +Use case: global config (S3/ConfigMap) shared by providers; one provider needs different values (e.g. dev, debugging, cluster-specific). + +| Solution | How it works | Pros | Cons | +|----------|--------------|------|------| +| **Override file** | Load global first, then `config.local.yaml` (or path from `--config-override`). Deep-merge; local wins. | Simple, explicit, no extra infra | Two files to manage; override path must be passed | +| **Env per field** | `CLUSTER_DEPLOYMENT_INGRESS_DOMAIN=dev.example.com` overrides `cluster.deployment.ingress_domain`. | No extra files, 12-factor | Verbose for nested keys; env proliferation | From 83f56fb73c6dca5285c7810cc9b2fc156625f748 Mon Sep 17 00:00:00 2001 From: Artem Shcherbatiuk Date: Thu, 19 Mar 2026 20:37:12 +0100 Subject: [PATCH 4/4] docs: cleanup --- _docs/config/README.md | 25 +++++++------------------ 1 file changed, 7 insertions(+), 18 deletions(-) diff --git a/_docs/config/README.md b/_docs/config/README.md index 033a5559..6aea4cbe 100644 --- a/_docs/config/README.md +++ b/_docs/config/README.md @@ -15,7 +15,6 @@ ## Terminology - **Ops** (human operator): Person who runs and maintains the provider. Receives notifications, decides when to restart. -- **K8s operator**: Controller running in Kubernetes (e.g. inventory operator) that manages provider deployments and related resources. - **Startup config**: Values (e.g. cluster.k8s, manifest_namespace) that require a restart to take effect. - **Runtime config**: Values that can be reloaded on the fly without restart. @@ -97,24 +96,14 @@ Decisions: 3. **Module re-init without full process restart?** Possibly yes, in a later iteration. Cluster and bidengine have shared state, so a clean restart is recommended for those modules. Some values (e.g. listen address) could be applied without restart by redesign - start new server, close old one. -**Restart notification** (when startup config changes, notify Ops; they restart when ready). Fully automated: Ops is pushed the notification, no manual checks. +**Restart notification** (when startup config changes, notify Ops; they restart when ready) -Flow: -1. Provider loads config from source (ConfigMap, S3, etc.) and watches or polls for changes. -2. Provider detects startup config change. Applies runtime changes only. -3. Provider emits notification (Prometheus metric, K8s Event, or pub/sub message). -4. Ops receives notification automatically (Prometheus alert, Slack, PagerDuty, etc.) - no active check required. -5. Ops restarts provider when ready (maintenance window, low traffic, etc.). - -| Config source | Notification channel | -|---------------|----------------------| -| **ConfigMap** | Provider creates K8s Event (alert on Event) or sets Prometheus metric; Ops gets paged. | -| **S3** | Provider sets Prometheus metric `provider_config_restart_required=1`; Ops alert fires. | -| **HTTP** | Same as S3. | -| **Redis** | Provider publishes to `restart_required` channel; consumer triggers alert (push to Ops). | -| **Consul** | Provider sets Prometheus metric; or consumer watches KV and triggers alert. | - -Provider keeps running normally. Notification (metric, Event, pub/sub) is a passive marker - no change to provider behavior, no traffic drain. Ops gets alerted and restarts when ready. +- **Flow**: + - Provider loads config, watches or polls for changes + - Provider detects config change; runtime config is applied immediately, + - If there is a startup config change, Provider emits `provider_config_restart_required=1` metric (or K8s Event when in-cluster) - passive marker, no traffic drain + - Prometheus or other monitoring tool alerts Ops (Slack, PagerDuty, etc.) + - Ops restarts when ready ## Solution comparison