Draft
170 changes: 170 additions & 0 deletions _docs/config/README.md
@@ -0,0 +1,170 @@
# Provider Config Proposal

## Requirements

- Provider should be configured by one config.
- Single source of truth that can be watched by provider; changes in the source must be propagated to all watching providers.
- Hot reload - provider should apply config changes on the fly.
- Config must not be public.
- Follow the inventory operator config style.

## Questions to clarify

- Do we want to share a single config between providers located in different K8s clusters? (If yes, ConfigMap is not suitable.)

## Terminology

- **Ops** (human operator): Person who runs and maintains the provider. Receives notifications, decides when to restart.
- **Startup config**: Values (e.g. `cluster.k8s`, `manifest_namespace`) that require a restart to take effect.
- **Runtime config**: Values that can be reloaded on the fly without restart.


## Proposed solution

### Config format

YAML format with subsections per module (like inventory operator).

<details>
<summary>Expand to see config example</summary>
**Contributor Author:** This config example is general rather than precise; it shows what the future config format may look like.


```yaml
version: v1
cluster:
  k8s: true
  manifest_namespace: lease
  public_hostname: ""
  node_port_quantity: 1
  wait_ready_duration: 5s
  overcommit:
    cpu: 0
    memory: 0
    storage: 0
  deployment:
    # Review comment (Member): What is this relating to specifically? Provider deployment?
    # Reply (Contributor Author): That section is related to those flags: link
    # As I understand it, these are settings for tenant workloads (how the provider
    # deploys leases to K8s), not the provider's own deployment.
    ingress_static_hosts: false
    ingress_domain: ""
    ingress_expose_lb_hosts: false
    network_policies_enabled: true
    runtime_class: gvisor
    blocked_hostnames: []
    docker_image_pull_secrets: ""

bidengine:
  pricing_strategy: scale
  deposit: "5000000uakt"
  timeout: 5m
  scale:
    cpu: "0"
    memory: "0"
    storage: "0"
    endpoint: "0"
    ip: "0"

gateway:
  listen_address: "0.0.0.0:8443"
  grpc_listen_address: "0.0.0.0:8444"
  tls:
    cert: ""
    key: ""

monitor:
  max_retries: 40
  retry_period: 4s
  retry_period_jitter: 15s
  healthcheck_period: 10s
  # Review comment (Member): Health check is performed by readiness/liveness probes.
  # What is this specifically setting?
  # Reply (Contributor Author): This corresponds to that flag: link
  # These are not K8s readiness/liveness probes, but the check conducted by the
  # deployment against tenant workloads.
  healthcheck_period_jitter: 5s

balance_checker:
  withdrawal_period: 24h
  lease_funds_check_interval: 10m

cert_issuer:
  enabled: false

# ... other sections
# Review comment (Member): Future Gateway API settings
```

</details>

### Hot reload

Decisions:

1. **Auto-restart on config change?** No - to avoid unexpected downtime. Notify Ops; they restart when ready.

2. **Mixed change (runtime + startup):** Apply runtime values only. Ops must be notified that restart is required.

3. **Module re-init without full process restart?** Possibly yes, in a later iteration. Cluster and bidengine have shared state, so a clean restart is recommended for those modules. Some values (e.g. listen address) could be applied without restart by redesign - start new server, close old one.

**Restart notification** (when startup config changes, notify Ops; they restart when ready)

- **Flow**:
  - Provider loads config, then watches or polls for changes.
  - Provider detects a config change; runtime config is applied immediately.
  - If startup config changed, Provider emits the `provider_config_restart_required=1` metric (or a K8s Event when in-cluster). This is a passive marker; no traffic is drained.
  - Prometheus or another monitoring tool alerts Ops (Slack, PagerDuty, etc.).
  - Ops restarts when ready.

## Solution comparison

### By scenario

| Scenario | Best fit |
|----------|----------|
| **Single cluster** | ConfigMap + K8s watch |
| **Multi-cluster, minimal infra** | S3 + poll |
| **Multi-cluster, near-instant updates** | Redis, Consul or Vault |

**Member:** I'm not sure I understand the multi-cluster setup in this context. Does it mean a shared configuration between providers?

**Contributor Author:** I can imagine an operator that has two providers in physically different clusters, or a single provider with worker nodes in different K8s clusters.

I am separating these cases because K8s ConfigMaps work only within a single cluster, and a ConfigMap is the simplest, K8s-native solution that I would consider if we go with the single-cluster approach.


### S3 vs Other Config Sources

| Criteria | S3 + poll | ConfigMap + K8s watch | Redis | Consul | HTTP + poll | Vault |
|----------|-----------|------------------------|-------|--------|-------------|-------|
| **Single source of truth** | Yes | Per cluster | Yes | Yes | Yes | Yes |
| **Multi-cluster** | Yes | No | Yes* | Yes | Yes | Yes |
| **Watch / push** | No (poll) | Yes | Yes (pub/sub) | Yes | No (poll) | Yes (KV watch) |
| **Auth** | Access key or IAM | K8s SA + RBAC | Password | ACL token | Bearer, mTLS, OAuth2 | AppRole, K8s auth |
| **Auth: set once** | Yes (key) | Yes (SA) | Yes (password) | Yes (token) | Depends | Yes (AppRole) |
| **Auth: outside cloud** | Access key | N/A (in-cluster) | Password | Token | Bearer, mTLS | AppRole |
| **Infra to run** | None (managed) | None | Redis | Consul | HTTP server | Vault |
| **Provider deps** | AWS SDK | K8s client (existing) | redis client | consul client | net/http | vault client |
| **Complexity** | Low | Low | Medium | Medium | Low-Medium | High |
| **Max config delay** | Poll interval (e.g. 30s) | Seconds | Seconds | Seconds | Poll interval | Seconds |

\* Redis must be reachable from all clusters (shared instance or replication).



### Trade-offs

| Solution | Pros | Cons |
|----------|------|------|
| **S3** | No extra infra, managed, multi-cloud, simple auth | Polling only, config delay up to poll interval |
| **ConfigMap** | Native K8s, real-time watch, no secrets | Single cluster only |
| **Redis** | Pub/sub, fast updates, simple auth | Run and operate Redis |
| **Consul** | KV + watch, multi-datacenter, ACL | Run and operate Consul |
| **HTTP** | Flexible, any backend | Need server + watch/poll strategy |
| **Vault** | Strong auth, KV watch | Heavy, more setup |

## Migration plan

1. **Phase 1 - Struct + loader**: Define Go structs for config, implement YAML loader. Keep flags; map flags to struct fields during transition.
2. **Phase 2 - Remote source**: Add S3/ConfigMap backend as primary config source. Flags override remote values (backward compat).
3. **Phase 3 - Remove flags**: Deprecate individual flags; remote config becomes the only input. Env vars for secrets only (e.g. `AKASH_PROVIDER_KEY`).
4. **Phase 4 (optional) - File fallback**: Add optional local file for dev; used when remote is not configured or unreachable.

## Go implementation

- **YAML parsing**: `gopkg.in/yaml.v3` (already in go.mod)
- **Config struct + merge**: Custom structs with `mapstructure` tags; `github.com/go-viper/mapstructure/v2` for YAML-to-struct
- **File watch**: `github.com/fsnotify/fsnotify` for local file; K8s watch for ConfigMap; S3 poll
- **No Viper for new config**: Current code uses `spf13/viper` with flags. New design: explicit load (YAML unmarshal + optional merge), no Viper. Simplifies precedence and avoids flag/config coupling.

## Local override of global config

Use case: global config (S3/ConfigMap) shared by providers; one provider needs different values (e.g. dev, debugging, cluster-specific).

| Solution | How it works | Pros | Cons |
|----------|--------------|------|------|
| **Override file** | Load global first, then `config.local.yaml` (or path from `--config-override`). Deep-merge; local wins. | Simple, explicit, no extra infra | Two files to manage; override path must be passed |
| **Env per field** | `CLUSTER_DEPLOYMENT_INGRESS_DOMAIN=dev.example.com` overrides `cluster.deployment.ingress_domain`. | No extra files, 12-factor | Verbose for nested keys; env proliferation |