Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions cmd/hyperfleet-api/migrate/cmd.go
Original file line number Diff line number Diff line change
Expand Up @@ -53,11 +53,11 @@ func runMigrateWithError() error {
}
}()

if err := db.Migrate(connection.New(ctx)); err != nil {
// Use MigrateWithLock to prevent concurrent migrations from multiple pods
if err := db.MigrateWithLock(ctx, connection); err != nil {
logger.WithError(ctx, err).Error("Migration failed")
return err
}

logger.Info(ctx, "Migration completed successfully")
return nil
}
42 changes: 42 additions & 0 deletions docs/database.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,48 @@ Uses GORM AutoMigrate:
- Additive (creates missing tables, columns, indexes)
- Run via `./bin/hyperfleet-api migrate`

### Migration Coordination

**Problem:** During rolling deployments, multiple pods attempt to run migrations simultaneously, causing race conditions and deployment failures.

**Solution:** PostgreSQL advisory locks ensure exclusive migration execution.

#### How It Works

```go
// Only one pod/process acquires the lock and runs migrations
// Others wait until the lock is released
db.MigrateWithLock(ctx, factory)
```

**Implementation:**
1. Pod sets statement timeout (5 minutes) to prevent indefinite blocking
2. Pod acquires advisory lock via `pg_advisory_xact_lock(hash("migrations"), hash("Migrations"))`
3. Lock holder runs migrations exclusively
4. Other pods block until lock is released or timeout is reached
5. Lock automatically released on transaction commit

**Key Features:**
- **Zero infrastructure overhead** - Uses native PostgreSQL locks
- **Automatic cleanup** - Locks released on transaction end or pod crash
- **Timeout protection** - 5-minute timeout prevents indefinite blocking if a pod hangs
- **Nested lock support** - Same lock can be acquired in nested contexts without deadlock
- **UUID-based ownership** - Only original acquirer can unlock

#### Testing Concurrent Migrations

Integration tests validate concurrent behavior:

```bash
make test-integration # Runs TestConcurrentMigrations
```

**Test coverage:**
- `TestConcurrentMigrations` - Multiple pods running migrations simultaneously
- `TestAdvisoryLocksConcurrently` - Lock serialization under race conditions
- `TestAdvisoryLocksWithTransactions` - Lock + transaction interaction
- `TestAdvisoryLockBlocking` - Lock blocking behavior

## Database Setup

```bash
Expand Down
32 changes: 24 additions & 8 deletions pkg/config/db.go
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,11 @@ import (
)

type DatabaseConfig struct {
Dialect string `json:"dialect"`
SSLMode string `json:"sslmode"`
Debug bool `json:"debug"`
MaxOpenConnections int `json:"max_connections"`
Dialect string `json:"dialect"`
SSLMode string `json:"sslmode"`
Debug bool `json:"debug"`
MaxOpenConnections int `json:"max_connections"`
AdvisoryLockTimeoutSeconds int `json:"advisory_lock_timeout_seconds"`

ConnMaxLifetime time.Duration `json:"conn_max_lifetime"`
ConnMaxIdleTime time.Duration `json:"conn_max_idle_time"`
Expand All @@ -40,10 +41,11 @@ type DatabaseConfig struct {

func NewDatabaseConfig() *DatabaseConfig {
return &DatabaseConfig{
Dialect: "postgres",
SSLMode: "disable",
Debug: false,
MaxOpenConnections: 50,
Dialect: "postgres",
SSLMode: "disable",
Debug: false,
MaxOpenConnections: 50,
AdvisoryLockTimeoutSeconds: 300, // 5 minutes - prevents indefinite blocking on migrations

ConnMaxLifetime: 5 * time.Minute,
ConnMaxIdleTime: 1 * time.Minute,
Expand Down Expand Up @@ -74,6 +76,11 @@ func (c *DatabaseConfig) AddFlags(fs *pflag.FlagSet) {
&c.MaxOpenConnections, "db-max-open-connections", c.MaxOpenConnections,
"Maximum open DB connections for this instance",
)

fs.IntVar(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Priority: Pattern

The new --db-advisory-lock-timeout flag (line 80) and DB_ADVISORY_LOCK_TIMEOUT
env var binding (line 107) have no unit test coverage. The env var logic
silently ignores invalid values (0, -1, "abc") which could surprise operators.

The repo already has thorough AddFlags/BindEnv tests for LoggingConfig in
logging_test.go. Consider adding equivalent tests for DatabaseConfig covering:
default value, flag override, env var override, flag-over-env precedence, and
invalid env var handling.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've added comprehensive test coverage for DatabaseConfig in pkg/config/db_test.go following the same patterns established in logging_test.go.

Tests added:

  1. ✅ Default value - TestNewDatabaseConfig_Defaults
    - Validates AdvisoryLockTimeoutSeconds = 300 (5 minutes)
    - Validates Debug = false
  2. ✅ Flag override - TestDatabaseConfig_AddFlags
    - Verifies --db-advisory-lock-timeout flag registration
    - Tests flag parsing with custom values (e.g., --db-advisory-lock-timeout=600)
  3. ✅ Env var override - TestDatabaseConfig_BindEnv + TestDatabaseConfig_EnvOverridesDefaults
    - Tests DB_ADVISORY_LOCK_TIMEOUT and DB_DEBUG env var binding
    - Validates env vars override defaults when no flag is set
  4. ✅ Flag-over-env precedence - TestDatabaseConfig_FlagsOverrideEnv + TestDatabaseConfig_PriorityMixed
    - Verifies flags take priority over env vars
    - Tests mixed scenarios (flag for timeout, env for debug)
  5. ✅ Invalid env var handling - TestDatabaseConfig_BindEnv + TestDatabaseConfig_InvalidEnvHandling
    - Tests exactly the cases you mentioned: 0, -1, "abc"
    - Documents silent fallback behavior with t.Logf()
    - Invalid values keep the default (300 seconds)

Coverage highlights:

  • 6 test functions covering all requested scenarios
  • 30+ sub-tests using table-driven approach
  • Follows the exact syntax/semantic patterns from logging_test.go
  • All tests passing ✅

&c.AdvisoryLockTimeoutSeconds, "db-advisory-lock-timeout", c.AdvisoryLockTimeoutSeconds,
"Advisory lock timeout in seconds (prevents indefinite blocking during migrations)",
)
fs.DurationVar(&c.ConnMaxLifetime, "db-conn-max-lifetime", c.ConnMaxLifetime, "Maximum lifetime of a DB connection")
fs.DurationVar(&c.ConnMaxIdleTime, "db-conn-max-idle-time", c.ConnMaxIdleTime, "Maximum idle time of a DB connection")
fs.IntVar(&c.MaxIdleConnections, "db-max-idle-connections", c.MaxIdleConnections, "Maximum idle DB connections")
Expand All @@ -96,6 +103,15 @@ func (c *DatabaseConfig) BindEnv(fs *pflag.FlagSet) {
}
}
}

if val := os.Getenv("DB_ADVISORY_LOCK_TIMEOUT"); val != "" {
if fs == nil || !fs.Changed("db-advisory-lock-timeout") {
timeout, err := strconv.Atoi(val)
if err == nil && timeout > 0 {
c.AdvisoryLockTimeoutSeconds = timeout
}
}
}
}

func (c *DatabaseConfig) ReadFiles() error {
Expand Down
Loading