akash-network · vertex451 · Mar 19, 2026
@@ -0,0 +1,170 @@
+# Provider Config Proposal
+
+## Requirements
+
+- The provider should be configured from a single config.
+- The config has a single source of truth that providers can watch; changes in the source must be propagated to all watching providers.
+- Hot reload: the provider should apply config changes on the fly.
+- Config must not be public.
+- Follow the inventory operator config style.
+
+## Questions to clarify
+
+- Do we want to share a single config between providers located in different K8s clusters? (If yes, ConfigMap is not suitable.)
+
+## Terminology
+
+- **Ops** (human operator): Person who runs and maintains the provider. Receives notifications, decides when to restart.
+- **Startup config**: Values (e.g. `cluster.k8s`, `cluster.manifest_namespace`) that require a restart to take effect.
+- **Runtime config**: Values that can be reloaded on the fly without restart.
+
+
+## Proposed solution
+
+### Config format
+
+YAML format with subsections per module (like inventory operator).
+
+<details>
+<summary>Expand to see config example</summary>
+
+```yaml
+version: v1
+cluster:
+  k8s: true
+  manifest_namespace: lease
+  public_hostname: ""
+  node_port_quantity: 1
+  wait_ready_duration: 5s
+  overcommit:
+    cpu: 0
+    memory: 0
+    storage: 0
+  deployment:
+    ingress_static_hosts: false
+    ingress_domain: ""
+    ingress_expose_lb_hosts: false
+    network_policies_enabled: true
+    runtime_class: gvisor
+    blocked_hostnames: []
+    docker_image_pull_secrets: ""
+
+bidengine:
+  pricing_strategy: scale
+  deposit: "5000000uakt"
+  timeout: 5m
+  scale:
+    cpu: "0"
+    memory: "0"
+    storage: "0"
+    endpoint: "0"
+    ip: "0"
+
+gateway:
+  listen_address: "0.0.0.0:8443"
+  grpc_listen_address: "0.0.0.0:8444"
+  tls:
+    cert: ""
+    key: ""
+
+monitor:
+  max_retries: 40
+  retry_period: 4s
+  retry_period_jitter: 15s
+  healthcheck_period: 10s
+  healthcheck_period_jitter: 5s
+
+balance_checker:
+  withdrawal_period: 24h
+  lease_funds_check_interval: 10m
+
+cert_issuer:
+  enabled: false
+
+# ... other sections
+```
+
+</details>
+
+### Hot reload
+
+Decisions:
+
+1. **Auto-restart on config change?** No - to avoid unexpected downtime. Notify Ops; they restart when ready.
+
+2. **Mixed change (runtime + startup):** Apply runtime values only. Ops must be notified that restart is required.
+
+3. **Module re-init without full process restart?** Possibly yes, in a later iteration. Cluster and bidengine have shared state, so a clean restart is recommended for those modules. Some values (e.g. the listen address) could be applied without a restart after a redesign: start a new server, then close the old one.
+
+**Restart notification** (when startup config changes, notify Ops; they restart when ready)
+
+- **Flow**:
+  - Provider loads config, watches or polls for changes
+  - Provider detects a config change; runtime config is applied immediately
+  - If a startup config value changed, Provider emits the `provider_config_restart_required=1` metric (or a K8s Event when in-cluster); this is a passive marker, no traffic is drained
+  - Prometheus or other monitoring tool alerts Ops (Slack, PagerDuty, etc.)
+  - Ops restarts when ready
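
The flow above can be sketched as a small reloader that is fed every fetched or watched payload. Everything here is an assumption for illustration: the `Reloader` type and its two hooks are hypothetical, change detection uses a SHA-256 checksum of the raw payload, and a plain integer stands in for the `provider_config_restart_required` gauge (a real provider would export it through its metrics library).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sync/atomic"
)

// Reloader remembers the last config payload and exposes a passive restart marker.
// Observe is meant to be called from a single goroutine.
type Reloader struct {
	lastSum [sha256.Size]byte
	loaded  bool
	restart int32 // stand-in for the provider_config_restart_required gauge

	// Hypothetical hooks supplied by the caller.
	ApplyRuntime func(raw []byte)      // applies runtime values only
	NeedsRestart func(raw []byte) bool // true if a startup value changed
}

// Observe handles one payload and reports whether it differed from the
// previous one. It never restarts the process or drains traffic.
func (r *Reloader) Observe(raw []byte) bool {
	sum := sha256.Sum256(raw)
	if r.loaded && sum == r.lastSum {
		return false // unchanged, nothing to do
	}
	initial := !r.loaded
	r.lastSum, r.loaded = sum, true
	r.ApplyRuntime(raw)
	if !initial && r.NeedsRestart(raw) {
		atomic.StoreInt32(&r.restart, 1) // passive marker only
	}
	return true
}

// RestartRequired is the value a metrics endpoint would export (0 or 1).
func (r *Reloader) RestartRequired() int32 { return atomic.LoadInt32(&r.restart) }

func main() {
	r := &Reloader{
		ApplyRuntime: func(raw []byte) { fmt.Printf("applied runtime values from %d bytes\n", len(raw)) },
		NeedsRestart: func(raw []byte) bool {
			return !bytes.Contains(raw, []byte("manifest_namespace: lease\n"))
		},
	}
	r.Observe([]byte("cluster:\n  manifest_namespace: lease\n")) // initial load
	r.Observe([]byte("cluster:\n  manifest_namespace: other\n")) // startup value changed
	fmt.Println("restart required:", r.RestartRequired())
}
```

The marker is only ever set, never cleared; it resets naturally when Ops restarts the process.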
+
+## Solution comparison
+
+### By scenario
+
+| Scenario | Best fit |
+|----------|----------|
+| **Single cluster** | ConfigMap + K8s watch |
+| **Multi-cluster, minimal infra** | S3 + poll |
+| **Multi-cluster, near-instant updates** | Redis, Consul or Vault |
+
+
+### S3 vs Other Config Sources
+
+| Criteria | S3 + poll | ConfigMap + K8s watch | Redis | Consul | HTTP + poll | Vault |
+|----------|-----------|------------------------|-------|--------|-------------|-------|
+| **Single source of truth** | Yes | Per cluster | Yes | Yes | Yes | Yes |
+| **Multi-cluster** | Yes | No | Yes* | Yes | Yes | Yes |
+| **Watch / push** | No (poll) | Yes | Yes (pub/sub) | Yes | No (poll) | Yes (KV watch) |
+| **Auth** | Access key or IAM | K8s SA + RBAC | Password | ACL token | Bearer, mTLS, OAuth2 | AppRole, K8s auth |
+| **Auth: set once** | Yes (key) | Yes (SA) | Yes (password) | Yes (token) | Depends | Yes (AppRole) |
+| **Auth: outside cloud** | Access key | N/A (in-cluster) | Password | Token | Bearer, mTLS | AppRole |
+| **Infra to run** | None (managed) | None | Redis | Consul | HTTP server | Vault |
+| **Provider deps** | AWS SDK | K8s client (existing) | redis client | consul client | net/http | vault client |
+| **Complexity** | Low | Low | Medium | Medium | Low-Medium | High |
+| **Max config delay** | Poll interval (e.g. 30s) | Seconds | Seconds | Seconds | Poll interval | Seconds |
+
+\* Redis must be reachable from all clusters (shared instance or replication).
+
+
+
+### Trade-offs
+
+| Solution | Pros | Cons |
+|----------|------|------|
+| **S3** | No extra infra, managed, multi-cloud, simple auth | Polling only, config delay up to poll interval |
+| **ConfigMap** | Native K8s, real-time watch, no extra credentials (uses the existing service account) | Single cluster only |
+| **Redis** | Pub/sub, fast updates, simple auth | Run and operate Redis |
+| **Consul** | KV + watch, multi-datacenter, ACL | Run and operate Consul |
+| **HTTP** | Flexible, any backend | Need server + watch/poll strategy |
+| **Vault** | Strong auth, KV watch | Heavy, more setup |
+
+## Migration plan
+
+1. **Phase 1 - Struct + loader**: Define Go structs for config, implement YAML loader. Keep flags; map flags to struct fields during transition.
+2. **Phase 2 - Remote source**: Add S3/ConfigMap backend as primary config source. Flags override remote values (backward compat).
+3. **Phase 3 - Remove flags**: Deprecate individual flags; remote config becomes the only input. Env vars for secrets only (e.g. `AKASH_PROVIDER_KEY`).
+4. **Phase 4 (optional) - File fallback**: Add optional local file for dev; used when remote is not configured or unreachable.
+
+## Go implementation
+
+- **YAML parsing**: `gopkg.in/yaml.v3` (already in go.mod)
+- **Config struct + merge**: Custom structs with `mapstructure` tags; `github.com/go-viper/mapstructure/v2` for decoding the parsed (and merged) YAML map into structs
+- **File watch**: `github.com/fsnotify/fsnotify` for local file; K8s watch for ConfigMap; S3 poll
+- **No Viper for new config**: Current code uses `spf13/viper` with flags. New design: explicit load (YAML unmarshal + optional merge), no Viper. Simplifies precedence and avoids flag/config coupling.
+
+## Local override of global config
+
+Use case: global config (S3/ConfigMap) shared by providers; one provider needs different values (e.g. dev, debugging, cluster-specific).
+
+| Solution | How it works | Pros | Cons |
+|----------|--------------|------|------|
+| **Override file** | Load global first, then `config.local.yaml` (or path from `--config-override`). Deep-merge; local wins. | Simple, explicit, no extra infra | Two files to manage; override path must be passed |
+| **Env per field** | `CLUSTER_DEPLOYMENT_INGRESS_DOMAIN=dev.example.com` overrides `cluster.deployment.ingress_domain`. | No extra files, 12-factor | Verbose for nested keys; env proliferation |