25 commits
- `9d3af40` refactor: Unify bin scripts and refactor core libs (cjac, Mar 13, 2026)
- `07e19dd` feat: Implement remaining logic and add CI/CD scripts (cjac, Mar 13, 2026)
- `c667991` docs: Update docs, add config, tests, and project files (cjac, Mar 13, 2026)
- `e10161f` feat: Overhaul audit system with SQLite and concurrent checks (cjac, Mar 17, 2026)
- `c979937` feat: Implement declarative creation and enhance destruction (cjac, Mar 18, 2026)
- `cce11e0` fix: Correct _check_exists calls and address code review (cjac, Mar 18, 2026)
- `8ce83a7` fix: Stabilize audit, network, and create scripts (cjac, Mar 19, 2026)
- `9a92c6d` feat: Refactor destroy-dpgce to be plan-based and audit-driven (cjac, Mar 19, 2026)
- `759d512` fix: Stabilize state, SA binding, and script execution (cjac, Mar 19, 2026)
- `de0649e` Fix: Stabilize teardown, implement SWP egress, and enforce GCE/GKE se… (cjac, Mar 23, 2026)
- `dc322b9` feat: Improve State Handling, Defaults, and Service Account Logic (cjac, Mar 24, 2026)
- `909fed3` switch from "null" to "" (cjac, Mar 24, 2026)
- `5d18e3a` cleanup of audit and plan (cjac, Mar 24, 2026)
- `b9d5179` Stabilize Dataproc lifecycle via persistent state and multi-stage boo… (cjac, Mar 25, 2026)
- `f8e235a` Refine egress control, audit robustness, and SWP resource discovery (cjac, Mar 25, 2026)
- `0272cfc` Fix DuplicateSectionError in boto.cfg and add universe-domain metadata (cjac, Mar 26, 2026)
- `ca39ca8` Always run boto repair via gce-proxy-setup.sh in cluster creation (cjac, Mar 26, 2026)
- `5068bac` provide a default ASN (cjac, Mar 26, 2026)
- `1a92587` fix: harden proxy/boto config and stabilize cluster boot logic (cjac, Mar 26, 2026)
- `19e844e` corrected rename and whitespace errors in connectivity-test (cjac, Mar 27, 2026)
- `83fd187` adding useful content to debug-init-action (cjac, Mar 27, 2026)
- `c237d16` applied fixes from initialization-actions (cjac, Mar 27, 2026)
- `6d7e184` hardened no_proxy list (cjac, Mar 27, 2026)
- `a61ecd3` fix(gcloud): resolve timestamp drift and consolidate cluster manageme… (cjac, Apr 2, 2026)
- `df62c27` fix(gcloud): stabilize timestamp synchronization and harden route man… (cjac, Apr 2, 2026)
20 changes: 16 additions & 4 deletions gcloud/.gitignore
@@ -1,5 +1,17 @@
init/*/
tls/
# Env files
env.json
*.env

# Logs
logs/
*.log

# Temp files
tmp/
.DS_Store
*~

# State
state.db
state.json
*.db-journal
141 changes: 98 additions & 43 deletions gcloud/README.md
@@ -1,6 +1,6 @@
<!--

Copyright 2021-2026 Google LLC and contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -20,69 +20,124 @@ limitations under the License.

This collection of bash scripts helps create and manage Google Cloud environments to reproduce and test Google Dataproc cluster setups. It is particularly useful for troubleshooting issues related to startup scripts, initialization actions, and network configurations.

**Core Principles:**

* **Declarative & Idempotent:** The `create-dpgce` script is designed to be declarative. It audits the current state of the cloud environment and only creates missing resources to reach the desired state defined by the flags and `env.json`.
* **Audit-Driven:** The `audit-dpgce` script is the foundation, performing a comprehensive, concurrent scan of the GCP environment to populate a local SQLite database (`state.db`).
* **Stateful Cache:** A local SQLite database (`${GCLOUD_DIR}/.state/state.db`) is used to cache the audit results and persist configuration flags between runs.
* **Modular:** Core logic is organized into functions within the `lib/` directory.
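The declarative/idempotent principle boils down to a check-then-create pattern. Below is an illustrative sketch only: the real scripts drive creation from the audited state in `state.db` rather than a live `describe` call, and `ensure_network` is a hypothetical helper, not a function in this repository.

```shell
# Illustrative only: an idempotent "ensure" helper. The actual scripts act on
# the audited state cache instead of describing resources directly.
ensure_network() {
  local name="$1"
  if gcloud compute networks describe "$name" >/dev/null 2>&1; then
    echo "network $name already exists; skipping"
  else
    gcloud compute networks create "$name" --subnet-mode=custom
  fi
}
```

Every resource type follows the same shape: observe the current state, then act only on the delta between it and the desired state.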

## Supported Scenarios

These scripts are designed to deploy and manage Dataproc clusters in various configurations:

* **Standard Dataproc on GCE:** A cluster with default network settings.
* **Egress Control:** Options for `--nat-egress` (Cloud NAT) or `--swp-egress` (Secure Web Proxy).
* **Custom Images:** Support for deploying clusters using pre-built custom images via the `--custom` flag.
* **GPU-Enabled Clusters:** Facilitates testing GPU-enabled clusters, often used with custom images containing pre-installed drivers.
* **Secure Boot Clusters:** Deployment of clusters using custom images built with Secure Boot enabled.

## Setup

1. **Prerequisites:** Ensure you have the following tools installed:
* `gcloud` CLI
* `gsutil` (usually part of `gcloud`)
* `jq`: Used to parse and manipulate JSON.
* `sqlite3`: Used to query the state cache.
* `perl`: Used in some utility scripts.

2. **Clone the repository:**
```bash
git clone https://github.com/cjac/dataproc-evolution
cd dataproc-evolution/cloud-dataproc/gcloud
```
(Note: adjust the clone URL if you are using a different fork.)

3. **Configure Environment:**
* Copy the sample configuration: `cp env.json.sample env.json`
* Edit `env.json` with your specific Google Cloud project details, region, network ranges, custom image URI, etc.
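After completing these steps, a quick sanity check can catch missing tools or an unedited `env.json` before any cloud calls are made. This is a sketch, not part of the repository; the `PROJECT_ID`/`REGION` field names follow `env.json.sample` and may differ in your copy:

```shell
# Post-setup sanity check: prerequisite tools on PATH, env.json populated.
missing=0
for tool in gcloud gsutil jq sqlite3 perl; do
  command -v "$tool" >/dev/null 2>&1 || { echo "missing tool: $tool" >&2; missing=1; }
done

# Field names assumed from env.json.sample; adjust to match your copy.
if [ -f env.json ] && jq -e '.PROJECT_ID and .REGION' env.json >/dev/null 2>&1; then
  echo "env.json looks populated"
else
  echo "env.json is missing or incomplete (need at least PROJECT_ID and REGION)" >&2
fi
```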

## Main Scripts (`bin/`)

The workflow centers around these main scripts:

* **`bin/audit-dpgce`**: Queries the live cloud environment to discover deployed resources and updates the local SQLite state cache (`${GCLOUD_DIR}/.state/state.db`). Typically called automatically by other scripts, but can be run manually to inspect the current state.

* **`bin/create-dpgce`**: The idempotent creation script. It runs an audit, stores the provided flags (e.g., `--custom`, `--nat-egress`) in the state cache, then generates and executes a plan that creates only the resources missing from the desired state. It does not delete or modify resources that already exist.

* **`bin/destroy-dpgce`**: The teardown script. It audits the environment, then de-provisions all discovered resources in a safe dependency order, using the state cache to determine what to delete. Add `--force` to also delete GCS buckets and SWP policies/certificate authorities.

* **`bin/recreate-cluster.sh`**: Utility script to quickly delete and recreate just the Dataproc cluster VMs. It loads the *last used* flags (`--custom`, `--nat-egress`, etc.) from `${GCLOUD_DIR}/.state/state.db` so the cluster is recreated with the same configuration. Useful for testing changes to init actions or cluster properties without tearing down the whole network.

* **`bin/ssh-m [node-index] [command...]`**: SSHes into the master node. Without arguments, it opens a shell. With arguments, it runs the command on the master node. Defaults to the first master (`-m`). If a number is provided as the first argument, it targets that index in an HA cluster (e.g., `bash bin/ssh-m 1` for `-m-1`).

* **`bin/scp-m [node-index] <local_path(s)>`**: Copies files or directories from your local machine to the `/tmp` directory on the master node. Similar to `ssh-m`, the first argument can be a node index for HA clusters.
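Because the state cache is plain SQLite, it can also be inspected directly between runs. The schema is internal to the scripts and may change between versions, so list the tables rather than assuming their names:

```shell
# Peek at the cached audit state without modifying it.
db="${GCLOUD_DIR:-.}/.state/state.db"
if [ -f "$db" ]; then
  sqlite3 -readonly "$db" '.tables'
else
  echo "no state cache yet; run bin/audit-dpgce first"
fi
```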

### Example Usage

* **Create Environment & Cluster with NAT:**
```bash
bash bin/create-dpgce --nat-egress
```

* **Create Environment & Cluster with Custom Image and NAT:**
```bash
bash bin/create-dpgce --nat-egress --custom
```

* **Recreate the Cluster (using last saved flags):**
```bash
bash bin/recreate-cluster.sh
```

* **Tear Down All Environment Infrastructure:**
```bash
bash bin/destroy-dpgce
```

* **Tear Down Everything, Including Persistent Resources (DANGEROUS):**
```bash
bash bin/destroy-dpgce --force
```

* **SSH to the master node:**
```bash
bash bin/ssh-m
```

* **Run a command on the master node:**
```bash
bash bin/ssh-m nvidia-smi
```

* **Copy a file to the master node's /tmp:**
```bash
bash bin/scp-m my_script.sh
```
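The flag persistence that `bin/recreate-cluster.sh` relies on can be pictured as a small key/value table in SQLite. This is a hypothetical sketch of the pattern only; the actual table and key names inside `state.db` are internal to the scripts:

```shell
# Hypothetical sketch: persist the last-used create flags, then read them back.
# Requires sqlite3; skips gracefully if it is not installed.
if command -v sqlite3 >/dev/null 2>&1; then
  db="$(mktemp -d)/state.db"
  sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS flags(key TEXT PRIMARY KEY, value TEXT);'
  sqlite3 "$db" "INSERT OR REPLACE INTO flags(key, value) VALUES ('create_flags', '--nat-egress --custom');"
  saved="$(sqlite3 "$db" "SELECT value FROM flags WHERE key = 'create_flags';")"
  echo "recreating with: $saved"
fi
```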

### Default Behavior

If `bin/create-dpgce` is run without any flags, it defaults to the following settings:

* `--no-custom`: Uses the standard image version.
* `--nat-egress`: Enables Cloud NAT for internet access.
* `--no-swp-egress`: Secure Web Proxy is disabled.
* A Dataproc cluster *will* be created.

### Common Flags for `create-dpgce`

* `--custom`: Use the `CUSTOM_IMAGE_URI` from `env.json` for the cluster.
* `--no-custom`: Use the standard `IMAGE_VERSION` from `env.json`.
* `--nat-egress`: Ensure Cloud NAT is configured for internet egress from the standard subnet.
* `--no-nat-egress`: Do not configure Cloud NAT.
* `--swp-egress`: Ensure Secure Web Proxy (SWP) is configured for internet egress.
* `--no-swp-egress`: Do not configure Secure Web Proxy.
* `--no-create-cluster`: Set up all networking and dependencies but skip the `gcloud dataproc clusters create` command.
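These paired `--flag`/`--no-flag` options follow a common bash convention. The sketch below shows how such flags might be parsed on top of the documented defaults; it is hypothetical and not the repository's actual parser:

```shell
# Hypothetical flag parser illustrating the --flag / --no-flag convention.
parse_flags() {
  CUSTOM=0; NAT_EGRESS=1; SWP_EGRESS=0; CREATE_CLUSTER=1   # documented defaults
  for arg in "$@"; do
    case "$arg" in
      --custom)            CUSTOM=1 ;;
      --no-custom)         CUSTOM=0 ;;
      --nat-egress)        NAT_EGRESS=1 ;;
      --no-nat-egress)     NAT_EGRESS=0 ;;
      --swp-egress)        SWP_EGRESS=1 ;;
      --no-swp-egress)     SWP_EGRESS=0 ;;
      --no-create-cluster) CREATE_CLUSTER=0 ;;
      *) echo "unknown flag: $arg" >&2 ;;
    esac
  done
}
parse_flags --custom --no-nat-egress
echo "custom=$CUSTOM nat=$NAT_EGRESS swp=$SWP_EGRESS create=$CREATE_CLUSTER"
```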

### Debugging

* `DEBUG=1`: Set this environment variable before running any script to enable verbose debug output (`set -x`).
```bash
DEBUG=1 bash bin/create-dpgce --nat-egress
```
* Logs for each script run are stored in timestamped directories under `/tmp/dataproc-repro/`.
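A small sketch for locating the most recent run's logs, assuming the timestamped-directory layout described above:

```shell
# Pick the newest run directory under /tmp/dataproc-repro and list its logs.
latest="$(ls -1dt /tmp/dataproc-repro/*/ 2>/dev/null | head -n1)"
if [ -n "$latest" ]; then
  ls "$latest"
else
  echo "no run directories found under /tmp/dataproc-repro"
fi
```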