e2e: add AWS infrastructure support for single/multi-region testing #611
nameisbhaskar wants to merge 1 commit into master
Pull request overview
Adds AWS/EKS as a new infrastructure provider for the operator e2e suites, including multi-region provisioning support and a standalone AWS cleanup utility, while also improving e2e resiliency around transient network/proxy failures.
Changes:
- Introduces AWS infrastructure provisioning/teardown for e2e tests (EKS, VPCs, peering, CSI driver, CoreDNS).
- Adds a standalone multi-region AWS “zombie resource” cleanup script with dry-run and safety options.
- Improves e2e robustness via transient-network error detection, retries, and test-run/cluster naming utilities.
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/testutil/require.go | Adds transient network error detection and DB connection retry logic. |
| tests/e2e/operator/utils/cluster_naming.go | Adds provider selection + cluster/test-run naming utilities. |
| tests/e2e/operator/singleRegion/cockroachdb_single_region_e2e_test.go | Switches provider selection/naming to shared utils; improves cleanup behavior. |
| tests/e2e/operator/multiRegion/cockroachdb_multi_region_e2e_test.go | Same as single-region suite; uses shared provider + naming utilities. |
| tests/e2e/operator/region.go | Adds retry logic around namespace/secret creation and makes Helm cleanup resilient to transient network errors. |
| tests/e2e/operator/infra/provider.go | Wires AWS into the provider factory. |
| tests/e2e/operator/infra/common.go | Adds AWS provider constants and LB annotations; extends shared infra helpers. |
| tests/e2e/operator/infra/aws.go | New AWS/EKS provisioning + teardown implementation for e2e tests. |
| tests/e2e/operator/infra/cleanup-aws-resources.sh | New script to clean up AWS e2e resources across regions safely. |
| go.mod | Adds AWS SDK dependency (direct). |
| go.sum | Updates dependency checksums accordingly. |
| Makefile | Increases e2e timeouts to accommodate slower cloud provisioning. |
Branch updated from dd0b262 to 34425eb.
Pull request overview
Copilot reviewed 10 out of 12 changed files in this pull request and generated 7 comments.
Review thread on tests/e2e/operator/singleRegion/cockroachdb_single_region_e2e_test.go (outdated, resolved)
Branch updated from 34cee86 to 3d27151.
Pull request overview
Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.
Branch updated from 988df4c to 84a6ba6.
Pull request overview
Copilot reviewed 10 out of 12 changed files in this pull request and generated 4 comments.
Branch updated from 2f114da to eb52b41.
Pull request overview
Copilot reviewed 10 out of 12 changed files in this pull request and generated 4 comments.
Branch updated from eb52b41 to 2825d8f.
Branch updated from 0e2824c to d147c0a.
Branch updated from d147c0a to 531606f.
Pull request overview
Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.
This commit adds comprehensive AWS/EKS infrastructure provisioning for
multi-region e2e testing of the CockroachDB operator, along with
thread-safety improvements and security hardening.
## AWS Infrastructure Support
- **EKS Cluster Provisioning**: Full support for creating EKS clusters with
eksctl, including automatic EBS CSI driver installation for EKS 1.23+
- **Multi-Region Support**: VPC peering for cross-region connectivity with
proper security group configuration and CIDR routing
- **Network Configuration**: Support for 3 regions (us-east-1, us-east-2,
us-west-2) with non-overlapping CIDR blocks for VPCs and Pod networks
- **Corporate Proxy & TLS**: Handle corporate TLS inspection proxies with
optional TLS verification bypass via KUBECTL_INSECURE_SKIP_TLS_VERIFY
- **Resource Cleanup**: Comprehensive cleanup script with TestRunID-based
tagging for concurrent test isolation and orphaned resource detection
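The multi-region networking above depends on per-region VPC CIDRs that never overlap, and on each VPC routing the other regions' CIDRs over its peering connections. The following sketch performs both checks with Go's `net/netip`. The helper names are illustrative only, and the CIDR values used in the test are hypothetical because the actual blocks are not listed here.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// validateNonOverlapping returns an error if any two regions' CIDRs overlap.
func validateNonOverlapping(cidrs map[string]string) error {
	regions := sortedKeys(cidrs)
	prefixes := make(map[string]netip.Prefix, len(cidrs))
	for _, r := range regions {
		p, err := netip.ParsePrefix(cidrs[r])
		if err != nil {
			return fmt.Errorf("region %s: %w", r, err)
		}
		prefixes[r] = p.Masked()
	}
	// Compare every unordered pair of regions exactly once.
	for i, a := range regions {
		for _, b := range regions[i+1:] {
			if prefixes[a].Overlaps(prefixes[b]) {
				return fmt.Errorf("CIDR %s (%s) overlaps %s (%s)", prefixes[a], a, prefixes[b], b)
			}
		}
	}
	return nil
}

// peerRoutes returns, for each region, the remote CIDRs that its route
// tables need entries for (one per peering connection).
func peerRoutes(cidrs map[string]string) map[string][]string {
	routes := make(map[string][]string, len(cidrs))
	regions := sortedKeys(cidrs)
	for _, local := range regions {
		for _, remote := range regions {
			if remote != local {
				routes[local] = append(routes[local], cidrs[remote])
			}
		}
	}
	return routes
}

// sortedKeys gives a deterministic iteration order over the map.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

With three regions this yields a full mesh: each VPC carries two peer routes, and the same non-overlap rule applies to the Pod network ranges.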
## Thread-Safety & Concurrency Fixes
- **Kubeconfig Mutex**: Added kubeconfigMutex to serialize kubeconfig file
updates, preventing race conditions when multiple goroutines write to
~/.kube/config concurrently during parallel cluster creation
- **Instance-Level Config**: Moved awsClusterConfigurations from package-level
to instance-level (r.clusterConfigs) to prevent shared mutable state and
test contamination when running tests in parallel or sequentially
## Security Hardening
- **Internal Load Balancers**: Set AWS NLB to internal-only
(aws-load-balancer-internal: "true") to prevent unnecessary public exposure
of test infrastructure. CoreDNS load balancers are only accessed by pods
within the VPC for cross-cluster DNS resolution, not from external clients.
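The following sketch builds the annotation set for an internal NLB. It assumes the Service is handled by the in-tree AWS cloud provider. The AWS Load Balancer Controller, if installed, uses its own annotations instead (such as `service.beta.kubernetes.io/aws-load-balancer-scheme`). The function names are hypothetical, because the PR's actual constants live in `common.go`.

```go
package main

// Annotation keys recognized by the in-tree AWS cloud provider for
// Services of type LoadBalancer.
const (
	annotationLBType     = "service.beta.kubernetes.io/aws-load-balancer-type"
	annotationLBInternal = "service.beta.kubernetes.io/aws-load-balancer-internal"
)

// coreDNSLBAnnotations returns annotations requesting an internal NLB, which
// receives private addresses only.
func coreDNSLBAnnotations() map[string]string {
	return map[string]string{
		annotationLBType:     "nlb",
		annotationLBInternal: "true",
	}
}

// withInternalNLB returns a copy of existing with the internal-NLB
// annotations added; the input map is not modified.
func withInternalNLB(existing map[string]string) map[string]string {
	out := make(map[string]string, len(existing)+2)
	for k, v := range existing {
		out[k] = v
	}
	for k, v := range coreDNSLBAnnotations() {
		out[k] = v
	}
	return out
}
```

Because the load balancer gets private addresses only, pods in the peered VPCs can reach it once the route tables and security groups allow it. Nothing outside the peered VPCs can reach it.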
## Test Infrastructure Improvements
- **Provider Abstraction**: CloudProvider interface with AWS, GCP, K3D, and
Kind implementations for consistent multi-cloud testing
- **Cluster Naming**: Centralized cluster and test-run naming utilities with
GitHub PR context integration for better resource tracking
- **Retry Logic**: Improved transient network error detection and retry
handling for flaky test resilience
## Code Quality Improvements
- **Removed IsMultiRegion flag**: Replaced redundant boolean with cluster
count checks (len(r.Clusters) > 1) for cleaner architecture
- **Safer conditional logic**: Changed early returns to conditional blocks
for more maintainable code execution paths
## Files Changed
Core infrastructure:
- tests/e2e/operator/infra/aws.go (new) - AWS/EKS provisioning & teardown
- tests/e2e/operator/infra/cleanup-aws-resources.sh (new) - Cleanup utility
- tests/e2e/operator/infra/common.go - Added AWS constants & LB annotations
- tests/e2e/operator/infra/provider.go - Wired AWS into provider factory
Test improvements:
- tests/e2e/operator/utils/cluster_naming.go (new) - Naming utilities
- tests/testutil/require.go - Transient error detection & retry logic
- tests/e2e/operator/region.go - Removed IsMultiRegion, added retry logic
- tests/e2e/operator/{singleRegion,multiRegion}/*_test.go - Provider selection
Build & dependencies:
- Makefile - Increased timeouts for cloud provisioning
- go.mod, go.sum - Added AWS SDK dependencies
Resolves: https://cockroachlabs.atlassian.net/browse/CRDB-53967
Addresses: #611 (comment)
Addresses: #611 (comment)
Addresses: #611 (comment)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
de05d09 to
25c2387
Compare
Summary
Resolves: https://cockroachlabs.atlassian.net/browse/CRDB-53967
This PR adds comprehensive AWS/EKS infrastructure support for multi-region e2e testing of the CockroachDB operator, along with thread-safety improvements and security hardening.
AWS Infrastructure Support
Test Infrastructure Improvements
Code Quality Improvements
- Removed IsMultiRegion flag: replaced the redundant boolean with cluster count checks (`len(r.Clusters) > 1`) for cleaner architecture
Files Changed
Core Infrastructure
- `tests/e2e/operator/infra/aws.go` (new, ~2,900 lines) - AWS/EKS provisioning & teardown with thread-safety
- `tests/e2e/operator/infra/cleanup-aws-resources.sh` (new, ~1,300 lines) - Standalone cleanup utility
- `tests/e2e/operator/infra/common.go` - Added AWS constants & internal LB annotations
- `tests/e2e/operator/infra/provider.go` - Wired AWS into provider factory
Test Improvements
- `tests/e2e/operator/utils/cluster_naming.go` (new) - Naming utilities with PR context
- `tests/testutil/require.go` - Transient error detection & retry logic
- `tests/e2e/operator/region.go` - Removed IsMultiRegion, added retry logic
- `tests/e2e/operator/{singleRegion,multiRegion}/*_test.go` - Provider selection utilities
Build & Dependencies
- `Makefile` - Increased timeouts for cloud provisioning
- `go.mod`, `go.sum` - Added AWS SDK dependencies
Testing
Usage
Notes
- The `KUBECTL_INSECURE_SKIP_TLS_VERIFY` environment variable is only needed in corporate environments with TLS inspection proxies
- AWS resources are tagged with `ManagedBy=helm-charts-e2e` and `TestRunID=<unique-id>` for concurrent test isolation
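The following sketch shows how the tags above could drive cleanup with a dry-run mode. It is written in Go for consistency with the suite, even though the PR's cleanup utility is a shell script. The `resource` type and the functions are hypothetical, and listing or deleting real resources would go through the AWS SDK or CLI. As an assumption about how orphan sweeps might work, an empty run ID selects every e2e-managed resource.

```go
package main

import (
	"fmt"
	"sort"
)

// resource is a hypothetical, provider-neutral view of a tagged AWS resource.
type resource struct {
	ID   string
	Tags map[string]string
}

const managedByValue = "helm-charts-e2e"

// selectForCleanup returns the sorted IDs of resources created by the e2e
// suite. A non-empty testRunID restricts selection to that run, leaving
// concurrent runs untouched.
func selectForCleanup(resources []resource, testRunID string) []string {
	var ids []string
	for _, r := range resources {
		if r.Tags["ManagedBy"] != managedByValue {
			continue
		}
		if testRunID != "" && r.Tags["TestRunID"] != testRunID {
			continue
		}
		ids = append(ids, r.ID)
	}
	sort.Strings(ids)
	return ids
}

// cleanup deletes the selected resources via del, or only reports them
// when dryRun is set. It returns the IDs that were selected.
func cleanup(resources []resource, testRunID string, dryRun bool, del func(id string) error) ([]string, error) {
	ids := selectForCleanup(resources, testRunID)
	for _, id := range ids {
		if dryRun {
			fmt.Println("would delete", id)
			continue
		}
		if err := del(id); err != nil {
			return nil, fmt.Errorf("delete %s: %w", id, err)
		}
	}
	return ids, nil
}
```

Requiring the `ManagedBy` tag before anything else is the main safety property: resources created outside the e2e suite are never candidates, even in an unscoped sweep.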