e2e: add AWS infrastructure support for single/multi-region testing #611

Open
nameisbhaskar wants to merge 1 commit into master from bhaskar/aws-e2e-testing-infra
@nameisbhaskar (Contributor) commented Mar 24, 2026

Summary

Resolves: https://cockroachlabs.atlassian.net/browse/CRDB-53967

This PR adds AWS/EKS infrastructure support for single- and multi-region e2e testing of the CockroachDB operator, along with thread-safety improvements and security hardening.

AWS Infrastructure Support

  • EKS Cluster Provisioning: Full support for creating EKS clusters with eksctl, including automatic EBS CSI driver installation for EKS 1.23+
  • Multi-Region Support: VPC peering for cross-region connectivity with proper security group configuration and CIDR routing
  • Network Configuration: Support for 3 regions (us-east-1, us-east-2, us-west-2) with non-overlapping CIDR blocks for VPCs and Pod networks
  • Corporate Proxy & TLS: Handle corporate TLS inspection proxies with optional TLS verification bypass via KUBECTL_INSECURE_SKIP_TLS_VERIFY
  • Resource Cleanup: Comprehensive cleanup script with TestRunID-based tagging for concurrent test isolation and orphaned resource detection
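
The non-overlapping CIDR requirement for peered VPCs can be checked up front. The following is a minimal sketch, not the code in this PR: the three regions are taken from the description above, while the CIDR values and the `findOverlap` helper are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"net/netip"
)

// regionCIDRs maps each region to a VPC CIDR. The regions come from the PR
// description; the CIDR values are hypothetical placeholders.
var regionCIDRs = map[string]string{
	"us-east-1": "10.10.0.0/16",
	"us-east-2": "10.20.0.0/16",
	"us-west-2": "10.30.0.0/16",
}

// findOverlap returns a pair of regions whose CIDR blocks overlap, or
// found=false when every block is disjoint from the others.
func findOverlap(cidrs map[string]string) (string, string, bool, error) {
	parsed := make(map[string]netip.Prefix, len(cidrs))
	for region, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return "", "", false, fmt.Errorf("region %s: %w", region, err)
		}
		parsed[region] = p
	}
	for r1, p1 := range parsed {
		for r2, p2 := range parsed {
			// r1 < r2 visits each unordered pair once and skips self-comparison.
			if r1 < r2 && p1.Overlaps(p2) {
				return r1, r2, true, nil
			}
		}
	}
	return "", "", false, nil
}

func main() {
	a, b, found, err := findOverlap(regionCIDRs)
	switch {
	case err != nil:
		fmt.Println("invalid CIDR:", err)
	case found:
		fmt.Printf("CIDRs for %s and %s overlap; VPC peering would fail\n", a, b)
	default:
		fmt.Println("all region CIDRs are disjoint")
	}
}
```

VPC peering cannot be established between VPCs with overlapping address ranges, so a check like this fails fast before any cloud resources are created.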

Test Infrastructure Improvements

  • Provider Abstraction: CloudProvider interface with AWS, GCP, K3D, and Kind implementations for consistent multi-cloud testing
  • Cluster Naming: Centralized cluster and test-run naming utilities with GitHub PR context integration for better resource tracking
  • Retry Logic: Improved transient network error detection and retry handling for flaky test resilience
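
As an illustration of the retry bullet, here is a hedged sketch of transient-error detection combined with a bounded retry loop. The helper names (`isTransient`, `retryTransient`), the substring list, and the sample errors are assumptions made for this example; the actual logic lives in tests/testutil/require.go and may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
	"time"
)

// Sample errors for the demo below; the messages are illustrative.
var (
	errReset  = errors.New("read tcp: connection reset by peer")
	errDenied = errors.New("permission denied")
)

// isTransient reports whether err looks like a temporary network failure.
// The substring list is illustrative, not the list used in the PR.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	msg := strings.ToLower(err.Error())
	for _, s := range []string{"connection reset", "connection refused", "tls handshake timeout"} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}

// retryTransient calls fn up to attempts times (attempts >= 1), doubling the
// delay between tries. It returns immediately on success or on an error that
// is not transient.
func retryTransient(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !isTransient(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryTransient(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errReset // simulated flaky network
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
	fmt.Println("non-transient:", retryTransient(5, 10*time.Millisecond, func() error { return errDenied }))
}
```

Returning non-transient errors without retrying keeps genuine failures (bad credentials, missing resources) from being masked by the retry loop.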

Code Quality Improvements

  • Removed IsMultiRegion flag: Replaced redundant boolean with cluster count checks (len(r.Clusters) > 1) for cleaner architecture
  • Safer conditional logic: Changed early returns to conditional blocks for more maintainable code execution paths
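
A small sketch of both changes, under stated assumptions: only the `Clusters` field and the `len(r.Clusters) > 1` check come from the description above. The `Region` type shown here, the `multiRegion` and `setupSteps` helpers, and the step names are hypothetical.

```go
package main

import "fmt"

// Region is a hypothetical stand-in for the test harness type; only the
// Clusters field is taken from the PR description.
type Region struct {
	Clusters []string
}

// multiRegion derives the mode from the cluster list, so there is no
// separate boolean that can drift out of sync with it.
func (r *Region) multiRegion() bool {
	return len(r.Clusters) > 1
}

// setupSteps uses a conditional block rather than an early return, so steps
// added after the block still run in single-region mode. Step names are
// illustrative.
func (r *Region) setupSteps() []string {
	steps := []string{"install operator"}
	if r.multiRegion() {
		steps = append(steps, "configure cross-cluster DNS")
	}
	steps = append(steps, "install cockroachdb")
	return steps
}

func main() {
	single := &Region{Clusters: []string{"us-east-1"}}
	multi := &Region{Clusters: []string{"us-east-1", "us-east-2", "us-west-2"}}
	fmt.Println(single.multiRegion(), single.setupSteps())
	fmt.Println(multi.multiRegion(), multi.setupSteps())
}
```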

Files Changed

Core Infrastructure

  • tests/e2e/operator/infra/aws.go (new, ~2,900 lines) - AWS/EKS provisioning & teardown with thread-safety
  • tests/e2e/operator/infra/cleanup-aws-resources.sh (new, ~1,300 lines) - Standalone cleanup utility
  • tests/e2e/operator/infra/common.go - Added AWS constants & internal LB annotations
  • tests/e2e/operator/infra/provider.go - Wired AWS into provider factory

Test Improvements

  • tests/e2e/operator/utils/cluster_naming.go (new) - Naming utilities with PR context
  • tests/testutil/require.go - Transient error detection & retry logic
  • tests/e2e/operator/region.go - Removed IsMultiRegion, added retry logic
  • tests/e2e/operator/{singleRegion,multiRegion}/*_test.go - Provider selection utilities

Build & Dependencies

  • Makefile - Increased timeouts for cloud provisioning
  • go.mod, go.sum - Added AWS SDK dependencies

Testing

  • ✅ Single-region AWS tests validated with EKS provisioning
  • ✅ Multi-region AWS tests validated with VPC peering
  • ✅ Kubeconfig concurrency fix prevents file corruption
  • ✅ Instance-level configs prevent test contamination
  • ✅ Internal load balancers prevent public exposure
  • ✅ Cleanup script successfully removes all AWS resources

Usage

# Run single-region tests on AWS
PROVIDER=aws make test/e2e/single-region

# Run multi-region tests on AWS  
PROVIDER=aws make test/e2e/multi-region

# Cleanup AWS resources by TestRunID
./tests/e2e/operator/infra/cleanup-aws-resources.sh --test-run-id <id>

# Cleanup all AWS resources
./tests/e2e/operator/infra/cleanup-aws-resources.sh --all

Notes

  • The KUBECTL_INSECURE_SKIP_TLS_VERIFY environment variable is only needed in corporate environments with TLS inspection proxies
  • VPC peering is used instead of transit gateways for simplicity and cost
  • All resources are tagged with ManagedBy=helm-charts-e2e and TestRunID=<unique-id> for concurrent test isolation
  • Thread-safety fixes ensure reliable parallel cluster creation without kubeconfig corruption
  • Security hardening prevents accidental public exposure of test infrastructure
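
The tagging note can be made concrete with a short sketch. The tag keys and the `helm-charts-e2e` value come from the note above; the helper names (`runTags`, `ownedByRun`, `isOrphan`) and the run IDs are hypothetical, and the real implementation would apply these tags through the AWS APIs rather than plain maps.

```go
package main

import "fmt"

const (
	tagManagedBy   = "ManagedBy"
	tagTestRunID   = "TestRunID"
	managedByValue = "helm-charts-e2e"
)

// runTags builds the tag set attached to every resource a test run creates.
func runTags(runID string) map[string]string {
	return map[string]string{tagManagedBy: managedByValue, tagTestRunID: runID}
}

// ownedByRun reports whether a resource's tags mark it as belonging to the
// given test run, which is what keeps concurrent runs from deleting each
// other's resources.
func ownedByRun(tags map[string]string, runID string) bool {
	return tags[tagManagedBy] == managedByValue && tags[tagTestRunID] == runID
}

// isOrphan reports whether a resource was created by the e2e suite but does
// not belong to any run in the active set.
func isOrphan(tags map[string]string, activeRuns map[string]bool) bool {
	return tags[tagManagedBy] == managedByValue && !activeRuns[tags[tagTestRunID]]
}

func main() {
	mine := runTags("run-a")  // hypothetical run IDs
	other := runTags("run-b")
	active := map[string]bool{"run-a": true}

	fmt.Println(ownedByRun(mine, "run-a"))  // resource from this run
	fmt.Println(ownedByRun(other, "run-a")) // resource from a concurrent run
	fmt.Println(isOrphan(other, active))    // run-b is not active
}
```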

Copilot AI left a comment

Pull request overview

Adds AWS/EKS as a new infrastructure provider for the operator e2e suites, including multi-region provisioning support and a standalone AWS cleanup utility, while also improving e2e resiliency around transient network/proxy failures.

Changes:

  • Introduces AWS infrastructure provisioning/teardown for e2e tests (EKS, VPCs, peering, CSI driver, CoreDNS).
  • Adds a standalone multi-region AWS “zombie resource” cleanup script with dry-run and safety options.
  • Improves e2e robustness via transient-network error detection, retries, and test-run/cluster naming utilities.

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/testutil/require.go | Adds transient network error detection and DB connection retry logic. |
| tests/e2e/operator/utils/cluster_naming.go | Adds provider selection + cluster/test-run naming utilities. |
| tests/e2e/operator/singleRegion/cockroachdb_single_region_e2e_test.go | Switches provider selection/naming to shared utils; improves cleanup behavior. |
| tests/e2e/operator/multiRegion/cockroachdb_multi_region_e2e_test.go | Same as single-region suite; uses shared provider + naming utilities. |
| tests/e2e/operator/region.go | Adds retry logic around namespace/secret creation and makes Helm cleanup resilient to transient network errors. |
| tests/e2e/operator/infra/provider.go | Wires AWS into the provider factory. |
| tests/e2e/operator/infra/common.go | Adds AWS provider constants and LB annotations; extends shared infra helpers. |
| tests/e2e/operator/infra/aws.go | New AWS/EKS provisioning + teardown implementation for e2e tests. |
| tests/e2e/operator/infra/cleanup-aws-resources.sh | New script to clean up AWS e2e resources across regions safely. |
| go.mod | Adds AWS SDK dependency (direct). |
| go.sum | Updates dependency checksums accordingly. |
| Makefile | Increases e2e timeouts to accommodate slower cloud provisioning. |

@nameisbhaskar force-pushed the bhaskar/aws-e2e-testing-infra branch 2 times, most recently from dd0b262 to 34425eb on March 24, 2026 09:08
@nameisbhaskar requested a review from Copilot on March 24, 2026 09:11
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 7 comments.


@nameisbhaskar force-pushed the bhaskar/aws-e2e-testing-infra branch 2 times, most recently from 34cee86 to 3d27151 on March 24, 2026 09:35
@nameisbhaskar requested a review from Copilot on March 24, 2026 09:36
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.


@nameisbhaskar force-pushed the bhaskar/aws-e2e-testing-infra branch 2 times, most recently from 988df4c to 84a6ba6 on March 24, 2026 10:41
@nameisbhaskar requested a review from Copilot on March 24, 2026 10:43
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 4 comments.


@nameisbhaskar force-pushed the bhaskar/aws-e2e-testing-infra branch 8 times, most recently from 2f114da to eb52b41 on March 25, 2026 01:25
@nameisbhaskar requested a review from Copilot on March 25, 2026 01:26
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 4 comments.


@nameisbhaskar force-pushed the bhaskar/aws-e2e-testing-infra branch from eb52b41 to 2825d8f on March 25, 2026 08:32
@nameisbhaskar force-pushed the bhaskar/aws-e2e-testing-infra branch 7 times, most recently from 0e2824c to d147c0a on March 26, 2026 13:12
@nameisbhaskar marked this pull request as ready for review on March 26, 2026 13:12
@nameisbhaskar force-pushed the bhaskar/aws-e2e-testing-infra branch from d147c0a to 531606f on March 27, 2026 06:32
@nameisbhaskar requested a review from Copilot on March 27, 2026 07:37
Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.


This commit adds comprehensive AWS/EKS infrastructure provisioning for
multi-region e2e testing of CockroachDB operator with thread-safety
improvements and security hardening.

## AWS Infrastructure Support

- **EKS Cluster Provisioning**: Full support for creating EKS clusters with
  eksctl, including automatic EBS CSI driver installation for EKS 1.23+
- **Multi-Region Support**: VPC peering for cross-region connectivity with
  proper security group configuration and CIDR routing
- **Network Configuration**: Support for 3 regions (us-east-1, us-east-2,
  us-west-2) with non-overlapping CIDR blocks for VPCs and Pod networks
- **Corporate Proxy & TLS**: Handle corporate TLS inspection proxies with
  optional TLS verification bypass via KUBECTL_INSECURE_SKIP_TLS_VERIFY
- **Resource Cleanup**: Comprehensive cleanup script with TestRunID-based
  tagging for concurrent test isolation and orphaned resource detection
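
One way the optional TLS bypass could be wired into kubectl invocations is sketched below. The environment variable name comes from the commit message; parsing it as a boolean and the `kubectlArgs` helper are assumptions for illustration. `--insecure-skip-tls-verify` is a standard kubectl global flag.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// skipTLSEnv is the variable named in the commit message.
const skipTLSEnv = "KUBECTL_INSECURE_SKIP_TLS_VERIFY"

// kubectlArgs prepends kubectl's --insecure-skip-tls-verify flag when
// envValue parses as true. Unset, empty, or unparsable values leave the
// arguments untouched, so verification stays on by default.
// Treating the variable as a boolean is an assumption of this sketch.
func kubectlArgs(envValue string, args ...string) []string {
	skip, err := strconv.ParseBool(envValue)
	if err != nil || !skip {
		return args
	}
	return append([]string{"--insecure-skip-tls-verify=true"}, args...)
}

func main() {
	// With the variable unset this prints the arguments unchanged.
	fmt.Println(kubectlArgs(os.Getenv(skipTLSEnv), "get", "nodes"))
	fmt.Println(kubectlArgs("true", "get", "nodes"))
}
```

Keeping verification on unless the variable is explicitly set to a true value limits the bypass to environments that actually sit behind a TLS inspection proxy.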

## Thread-Safety & Concurrency Fixes

- **Kubeconfig Mutex**: Added kubeconfigMutex to serialize kubeconfig file
  updates, preventing race conditions when multiple goroutines write to
  ~/.kube/config concurrently during parallel cluster creation
- **Instance-Level Config**: Moved awsClusterConfigurations from package-level
  to instance-level (r.clusterConfigs) to prevent shared mutable state and
  test contamination when running tests in parallel or sequentially
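
The kubeconfig race and its fix can be sketched as follows. Only the `kubeconfigMutex` name is taken from the commit message; `appendContext`, `createClusters`, `demo`, and the simplified file format are hypothetical stand-ins for the real kubeconfig update performed during cluster creation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"sync"
)

// kubeconfigMutex serializes updates to the shared kubeconfig file. The name
// matches the commit message; everything else here is illustrative.
var kubeconfigMutex sync.Mutex

// appendContext performs a read-modify-write of the file while holding the
// mutex. Without the lock, two goroutines could read the same old contents
// and the later write would drop the other's entry.
func appendContext(path, name string) error {
	kubeconfigMutex.Lock()
	defer kubeconfigMutex.Unlock()

	data, err := os.ReadFile(path)
	if err != nil && !os.IsNotExist(err) {
		return err
	}
	data = append(data, []byte("context: "+name+"\n")...)
	return os.WriteFile(path, data, 0o600)
}

// createClusters simulates parallel cluster creation and returns how many
// contexts ended up in the file.
func createClusters(path string, names []string) (int, error) {
	var wg sync.WaitGroup
	errs := make([]error, len(names)) // one slot per goroutine, no sharing
	for i, n := range names {
		wg.Add(1)
		go func(i int, n string) {
			defer wg.Done()
			errs[i] = appendContext(path, n)
		}(i, n)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return 0, err
		}
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strings.Count(string(data), "context: "), nil
}

// demo runs the simulation against a throwaway file in a temp directory.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "kubeconfig-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	return createClusters(filepath.Join(dir, "config"),
		[]string{"us-east-1", "us-east-2", "us-west-2"})
}

func main() {
	n, err := demo()
	fmt.Println("contexts written:", n, "err:", err)
}
```

A process-local mutex like this only protects goroutines within one test binary; separate processes writing the same file would need file locking or per-run kubeconfig paths.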

## Security Hardening

- **Internal Load Balancers**: Set AWS NLB to internal-only
  (aws-load-balancer-internal: "true") to prevent unnecessary public exposure
  of test infrastructure. CoreDNS load balancers are only accessed by pods
  within the VPC for cross-cluster DNS resolution, not from external clients.
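
For reference, a sketch of what the internal-only annotation set might look like when built in Go. The commit message shows only the short form `aws-load-balancer-internal: "true"`; the fully qualified `service.beta.kubernetes.io/...` keys below are the standard Kubernetes Service annotations for AWS and are an assumption about what common.go defines. The helper names are hypothetical.

```go
package main

import "fmt"

// Fully qualified annotation keys. These are the standard AWS Service
// annotations; whether the PR uses exactly these constants is an assumption.
const (
	annLBType     = "service.beta.kubernetes.io/aws-load-balancer-type"
	annLBInternal = "service.beta.kubernetes.io/aws-load-balancer-internal"
)

// internalNLBAnnotations returns Service annotations requesting an NLB that
// is reachable only from inside the VPC (and peered VPCs).
func internalNLBAnnotations() map[string]string {
	return map[string]string{
		annLBType:     "nlb",
		annLBInternal: "true",
	}
}

// isInternal reports whether a Service's annotations request an internal
// load balancer; a test could use this to guard against public exposure.
func isInternal(annotations map[string]string) bool {
	return annotations[annLBInternal] == "true"
}

func main() {
	ann := internalNLBAnnotations()
	fmt.Println(ann[annLBType], isInternal(ann))
	fmt.Println(isInternal(map[string]string{annLBType: "nlb"})) // internet-facing by default
}
```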

## Test Infrastructure Improvements

- **Provider Abstraction**: CloudProvider interface with AWS, GCP, K3D, and
  Kind implementations for consistent multi-cloud testing
- **Cluster Naming**: Centralized cluster and test-run naming utilities with
  GitHub PR context integration for better resource tracking
- **Retry Logic**: Improved transient network error detection and retry
  handling for flaky test resilience
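
A minimal sketch of the provider abstraction, assuming a factory keyed by the `PROVIDER` value used when running the test targets. The `CloudProvider` name and the four provider names come from the commit message; the method set, `stubProvider`, and `newProvider` are hypothetical and do not reflect the real interface in provider.go.

```go
package main

import (
	"fmt"
	"strings"
)

// CloudProvider is named in the commit message; this method set is a
// hypothetical minimum for illustration.
type CloudProvider interface {
	Name() string
	SetUp() error
	TearDown() error
}

// stubProvider stands in for the real AWS, GCP, K3D, and Kind
// implementations.
type stubProvider struct{ name string }

func (p stubProvider) Name() string    { return p.name }
func (p stubProvider) SetUp() error    { return nil }
func (p stubProvider) TearDown() error { return nil }

// newProvider maps a provider name (for example the PROVIDER environment
// variable) to an implementation, rejecting unknown values early.
func newProvider(name string) (CloudProvider, error) {
	switch p := strings.ToLower(strings.TrimSpace(name)); p {
	case "aws", "gcp", "k3d", "kind":
		return stubProvider{name: p}, nil
	default:
		return nil, fmt.Errorf("unsupported provider %q", name)
	}
}

func main() {
	for _, name := range []string{"aws", "Kind", "azure"} {
		p, err := newProvider(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("selected provider:", p.Name())
	}
}
```

With an interface like this, the single-region and multi-region suites can call `SetUp` and `TearDown` without branching on the cloud in use.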

## Code Quality Improvements

- **Removed IsMultiRegion flag**: Replaced redundant boolean with cluster
  count checks (len(r.Clusters) > 1) for cleaner architecture
- **Safer conditional logic**: Changed early returns to conditional blocks
  for more maintainable code execution paths

## Files Changed

Core infrastructure:
- tests/e2e/operator/infra/aws.go (new) - AWS/EKS provisioning & teardown
- tests/e2e/operator/infra/cleanup-aws-resources.sh (new) - Cleanup utility
- tests/e2e/operator/infra/common.go - Added AWS constants & LB annotations
- tests/e2e/operator/infra/provider.go - Wired AWS into provider factory

Test improvements:
- tests/e2e/operator/utils/cluster_naming.go (new) - Naming utilities
- tests/testutil/require.go - Transient error detection & retry logic
- tests/e2e/operator/region.go - Removed IsMultiRegion, added retry logic
- tests/e2e/operator/{singleRegion,multiRegion}/*_test.go - Provider selection

Build & dependencies:
- Makefile - Increased timeouts for cloud provisioning
- go.mod, go.sum - Added AWS SDK dependencies

Resolves: https://cockroachlabs.atlassian.net/browse/CRDB-53967
Addresses: #611 (comment)
Addresses: #611 (comment)
Addresses: #611 (comment)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@nameisbhaskar force-pushed the bhaskar/aws-e2e-testing-infra branch from de05d09 to 25c2387 on March 27, 2026 09:09
@nameisbhaskar changed the title from "e2e: add AWS infrastructure support for multi-region testing" to "e2e: add AWS infrastructure support for single/multi-region testing" on Mar 29, 2026