PKB Bug Report: Incomplete Teardown When Provision Phase Fails
Issue Type: Bug
Severity: Critical
Component: Resource Cleanup / Teardown
Related Issue: #1239
Status: Proposed fix implemented and tested successfully in my environment
Fix Summary
The Delete() method in benchmark_spec.py fails to clean up network and firewall resources when:
The deleted flag is prematurely set to True during provision failures
Pickle data corruption causes network/firewall objects to be invalid
Teardown runs with --run_stage=teardown but skips cleanup due to the deleted flag
Implemented Solution: Three-layer defense strategy in perfkitbenchmarker/benchmark_spec.py:
Ignore deleted flag during teardown-only runs: Don't trust the deleted flag when stages.TEARDOWN is in FLAGS.run_stage
Validate objects before use: Check that network/firewall objects have valid Delete() methods
Fallback to direct cleanup: When objects are invalid, use gcloud commands to delete resources by name pattern
Test Results: Successfully cleaned up orphaned resources from run URI e848137d: default-internal-10-0-0-0-8-e848137d, perfkit-firewall-e848137d-22-22, and pkb-network-e848137d.
Original Bug Report
Summary
PerfKitBenchmarker fails to delete network and firewall resources when a benchmark fails during the provision phase, leaving orphaned GCP resources that:
Cost money
Block future runs with the same run_uri
Require manual cleanup
Root Cause Analysis
The primary issue is a fatal ControlPath too long error during the SSH connection setup within the WaitForBootCompletion method. This error is triggered because the temporary directory path for the SSH control socket exceeds the maximum allowed length on the user's macOS system. This initial error triggers a KeyboardInterrupt, which then causes the incomplete teardown.
The Delete() method in benchmark_spec.py has a critical flaw: firewall and network deletion is skipped when self.vms is empty, which occurs when a benchmark fails during VM provisioning.
Code Analysis
In perfkitbenchmarker/benchmark_spec.py (lines 958-987):
def Delete(self):
  if self.deleted:
    return
  # ... other resource deletions ...
  if self.vms:  # ← PROBLEM: This condition gates VM deletion
    try:
      background_tasks.RunThreaded(self.DeleteVm, self.vms)
      background_tasks.RunThreaded(
          lambda vm: vm.DeleteScratchDisks(), self.vms
      )
    except Exception:
      logging.exception(...)
  # Placement groups deletion (outside if block)
  for firewall in self.firewalls.values():  # ← This code runs
    try:
      firewall.DisallowAllPorts()  # ← But only DISABLES ports, doesn't DELETE
    except Exception:
      logging.exception(...)
  # ... container cluster deletion ...
  for net in self.networks.values():  # ← Network deletion fails
    try:
      net.Delete()  # ← Fails because firewall rules still exist
    except Exception:
      logging.exception(...)
The Bug:
firewall.DisallowAllPorts() only disables firewall rules; it doesn't delete them
Network deletion fails because firewall rules are dependencies
When self.vms is empty (provision failure), the firewall deletion logic is effectively bypassed
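The three points above can be illustrated with a toy model in which a network refuses deletion while firewall rules still exist, mimicking GCP's server-side dependency check (all classes here are illustrative stand-ins, not PKB or GCP APIs):

```python
# Toy model of the dependency between firewall rules and their network.
class FakeNetwork:
    def __init__(self):
        self.rules = []

    def Delete(self):
        if self.rules:
            raise RuntimeError('network has dependent firewall rules')

class FakeFirewallRule:
    def __init__(self, network):
        self.network = network
        self.allows_traffic = True
        network.rules.append(self)

    def DisallowAllPorts(self):
        self.allows_traffic = False   # rule disabled, but it still exists

    def Delete(self):
        self.network.rules.remove(self)

net = FakeNetwork()
rule = FakeFirewallRule(net)

rule.DisallowAllPorts()
try:
    net.Delete()
except RuntimeError as err:
    print('after DisallowAllPorts only:', err)   # deletion fails

rule.Delete()
net.Delete()   # succeeds once the rule is truly gone
print('after Delete(): network removable')
```

Disabling traffic leaves the rule object in place, so the network's dependency check still fails; only a real delete clears the way.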
Reproduction Steps
Run any PKB benchmark that creates networks and VMs
Kill the process during VM provisioning (after network/firewall creation but before VMs are fully created)
Run teardown with --run_stage=teardown --run_uri=<failed_run_uri>
Observe that network and firewall rules are NOT deleted
Result: Teardown reports "SUCCEEDED" but resources remain:
2025-11-17 20:23:56,540 e848137d MainThread nginx(1/0) INFO Tearing down resources for benchmark nginx
2025-11-17 20:23:56,546 e848137d MainThread INFO Benchmark run statuses:
Name UID Status Failed Substatus
nginx nginx0 SUCCEEDED UNCATEGORIZED
Success rate: 100.00% (1/1)
Note: Teardown completed in 0.006 seconds - no actual deletion occurred!
My Failed Attempts
My previous attempts to fix the issue were unsuccessful because I was focused on the wrong problem. I initially believed the issue was with the teardown logic itself, but the real problem is the ControlPath too long error that triggers the entire failure cascade. My fixes to the teardown logic were ineffective because the script was never reaching that point in the code.
The following run_uri was used for testing: a25b5b22
When the pickle file is missing, the following error is thrown:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/perfkitbenchmarker/runs/e848137d/cluster_boot0'
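A defensive load could turn this crash into a signal to fall back to discovery-based cleanup (sketch only; `load_spec` is a name I made up, not a PKB function):

```python
import pickle

def load_spec(path):
    """Load a pickled BenchmarkSpec, or return None if the file is gone."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None   # caller should fall back to name-pattern cleanup

# With the path from the error above absent, we get None instead of a crash.
spec = load_spec('/tmp/perfkitbenchmarker/runs/e848137d/cluster_boot0')
print(spec is None)
```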
Related Issues
This bug is related to Issue #1239 which describes similar cleanup problems in the object_storage_service benchmark. That issue has been open since December 2016 (8+ years) with no resolution.
Both issues stem from the same root cause: incomplete resource tracking and cleanup logic in PKB's teardown phase.
Proposed Fix
Option 1: Always Delete Networks and Firewalls
Modify benchmark_spec.py to delete networks and firewalls regardless of VM state:
def Delete(self):
  if self.deleted:
    return
  # ... other resource deletions ...
  if self.vms:
    try:
      background_tasks.RunThreaded(self.DeleteVm, self.vms)
      background_tasks.RunThreaded(
          lambda vm: vm.DeleteScratchDisks(), self.vms
      )
    except Exception:
      logging.exception(...)
  # ALWAYS delete firewalls, even if no VMs exist
  for firewall in self.firewalls.values():
    try:
      firewall.Delete()  # ← Explicit deletion, not just DisallowAllPorts()
    except Exception:
      logging.exception(...)
  # ... container cluster deletion ...
  # ALWAYS delete networks, even if no VMs exist
  for net in self.networks.values():
    try:
      net.Delete()
    except Exception:
      logging.exception(...)
Option 2: Comprehensive Resource Tracking
Track all created resources in the pickled spec, not just VMs:
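One possible shape for such tracking (class and method names are hypothetical, not existing PKB APIs):

```python
import logging

class ResourceTracker:
    """Records every created resource so teardown never depends on self.vms."""

    def __init__(self):
        self._created = []

    def Register(self, resource):
        self._created.append(resource)

    def DeleteAll(self):
        deleted = []
        # Delete in reverse creation order so dependents (firewall rules)
        # go before the resources they depend on (the network).
        for resource in reversed(self._created):
            try:
                resource.Delete()
                deleted.append(resource.name)
            except Exception:
                logging.exception('Failed to delete %s',
                                  getattr(resource, 'name', resource))
        return deleted

class Stub:
    def __init__(self, name):
        self.name = name
    def Delete(self):
        pass

tracker = ResourceTracker()
tracker.Register(Stub('pkb-network-e848137d'))
tracker.Register(Stub('perfkit-firewall-e848137d-22-22'))
print(tracker.DeleteAll())   # firewall first, then the network
```

Because the tracker is populated at creation time, the teardown list survives provision failures and can be pickled alongside the spec.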
Option 3: Validate Cleanup
Add validation to ensure resources were actually deleted:
Workaround
Until this is fixed, users must manually clean up orphaned resources:
Environment
Additional Context
This bug has significant cost implications for organizations running PKB at scale. Each failed benchmark run leaves orphaned resources that accumulate over time. Without automated cleanup or proper error reporting, these resources can go unnoticed until billing alerts trigger.
The bug is particularly insidious because:
PKB reports teardown as "SUCCEEDED" even when it fails
No warning messages are generated
The pickled spec doesn't track all created resources
Manual intervention is required for every failed run
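For illustration, a hypothetical helper that derives the manual gcloud cleanup commands from the name patterns observed in this report (pkb-network-<run_uri>, firewall names containing the run URI); these patterns are observations from one run, not a guaranteed naming contract:

```python
def cleanup_commands(run_uri):
    """Build gcloud commands to find and remove resources for a failed run."""
    return [
        # List firewall rules whose names contain the run URI (dependencies
        # must go before the network itself).
        f'gcloud compute firewall-rules list --filter="name~{run_uri}" '
        '--format="value(name)"',
        # Then the network created for the run.
        f'gcloud compute networks delete pkb-network-{run_uri} --quiet',
    ]

for cmd in cleanup_commands('e848137d'):
    print(cmd)
```

Each listed firewall rule would then be removed with gcloud compute firewall-rules delete before the network delete can succeed.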
Deep Dive Analysis (November 17, 2025 - 21:36)
Pickle File Investigation
Examined the pickled BenchmarkSpec for run_uri=e848137d:
Critical Discovery: Corrupted Pickle Data
The spec.networks dictionary contains raw JSON strings as keys instead of GceNetwork objects. This corruption explains the silent failure:
if self.deleted: return - the deleted flag is already True, so Delete() returns immediately
if self.networks: - evaluates to True (the dict has JSON string keys)
for net in self.networks.values(): - iterates over JSON strings, not network objects
net.Delete() - calling .Delete() on a string does nothing (no error, no deletion)
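A minimal repro of this silent no-op in plain Python (the dict contents simulate the corrupted pickle; nothing here is PKB code):

```python
import logging

# Simulated corrupted spec.networks: JSON strings where GceNetwork objects
# should be, as observed in the e848137d pickle.
networks = {
    '{"name": "pkb-network-e848137d"}': '{"name": "pkb-network-e848137d"}',
}

deleted_count = 0
for net in networks.values():
    try:
        net.Delete()            # AttributeError: str has no Delete()
        deleted_count += 1
    except Exception:
        logging.exception('Deleting network failed')  # logged, then swallowed

print('networks deleted:', deleted_count)   # → networks deleted: 0
```

The AttributeError is caught and logged, the loop moves on, and the run still reports success.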
Root Cause: Three-Layer Failure
The networks dict was pickled with malformed data
The deleted flag prevents re-execution
Invalid objects cause silent no-ops
Why Previous Fix Failed
The previous attempt added resource discovery, but only for FileNotFoundError. In this case the pickle file exists at /tmp/perfkitbenchmarker/runs/e848137d/nginx0, so that path is never taken, and the deleted=True flag causes an immediate return from Delete().
Revised Proposed Fix
Strategy: Robust Teardown with Validation and Recovery
The fix must handle three scenarios:
Scenario 1: Normal Teardown (provision succeeded)
self.networks contains valid GceNetwork objects
self.firewalls contains valid GceFirewall objects
Scenario 2: Corrupted Pickle (THIS CASE)
self.networks contains invalid data (JSON strings, empty, etc.)
self.firewalls may be empty or invalid
The deleted flag may be True
Scenario 3: Missing Pickle
Implementation Plan
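A self-contained sketch of how the per-object decision could work across the scenarios above (all names here are illustrative, not PKB's real API):

```python
def teardown_networks(networks, run_uri, fallback_cleanup):
    """networks: dict as unpickled (possibly corrupted); fallback_cleanup: callable.

    Valid objects get their own Delete() (scenario 1); corrupted entries
    trigger name-pattern cleanup instead (scenario 2).
    """
    actions = []
    for net in networks.values():
        delete = getattr(net, 'Delete', None)
        if callable(delete):
            delete()
            actions.append('object')
        else:
            fallback_cleanup(run_uri)   # e.g. gcloud delete by name pattern
            actions.append('fallback')
    return actions

class GoodNet:
    def Delete(self):
        pass

actions = teardown_networks(
    {'a': GoodNet(), 'b': '{"name": "pkb-network-e848137d"}'},
    'e848137d',
    lambda uri: None,
)
print(actions)   # → ['object', 'fallback']
```

Scenario 3 (missing pickle) would bypass this function entirely and go straight to the fallback path.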
Key Changes
Ignore the deleted flag during teardown-only runs
Change DisallowAllPorts() to Delete() in the GceFirewall class
Recommendation
This should be treated as a critical priority bug because: