Skip to content

Implement Script Isolation#1803

Open
rhysparry wants to merge 14 commits intomainfrom
rhys/eft-251/script-isolation
Open

Implement Script Isolation#1803
rhysparry wants to merge 14 commits intomainfrom
rhys/eft-251/script-isolation

Conversation

@rhysparry
Copy link
Contributor

@rhysparry rhysparry commented Feb 25, 2026

Overview

This change provides an implementation to enforce script isolation within Calamari.

Currently, Tentacle provides script isolation functionality, by maintaining a series of in-process locks.

For SSH, this is achieved by performing the locking on the Octopus Instance running the script. This introduces issues for multi-node clusters as the lock is limited to a single node. Our execution scaling strategies rely on being able to create additional nodes to process service bus messages, so this needs to be addressed to support SSH targets and workers.

This implementation uses a file locking approach to share the lock across multiple processes.

Result

When the appropriate common arguments are supplied, Calamari will acquire an appropriate lock before running the relevant command.

The relevant common arguments are:

  • scriptIsolationLevel which defines the script isolation level Calamari should enforce. This must be one of two values (case insensitive): FullIsolation to run the command independent of all other commands and NoIsolation to allow the command to run concurrently with other NoIsolation commands. This argument is required for isolation to be enforced.
  • scriptIsolationMutexName defines a mutex name to use. Only other commands running with the same mutex name will be isolated. This name should be valid to use in a file name. This argument is required for isolation to be enforced.
  • scriptIsolationTimeout (optional) defines a timeout to use when waiting for the mutex. A timeout value should be greater than 10ms and less than 1 day. Providing a value outside of this range will remove the timeout and either reduce the number of retries or retry indefinitely. If not provided, the script will wait indefinitely for the mutex.

Additionally, TentacleHome is used as a location to store the files used for the lock. This directory resolves to something like /home/username/.octopus/machine-slug. This means that if the same machine is registered with a different slug it will use independent isolation locks.

Testing

  • This PR includes unit and integration tests validating lock behaviour
  • CI tests are configured to cover Windows and Linux platforms
  • MacOS has been tested locally

Current Caveats

  • This approach only isolates the Calamari component of the execution, not additional tasks, such as extracting and installing Calamari in the first place.
  • The underlying filesystem in use may not properly support locking.

Miscellaneous

@rhysparry rhysparry force-pushed the rhys/eft-251/script-isolation branch from ca2cf06 to e49c2f6 Compare February 25, 2026 03:58
Comment on lines +27 to +28
LockType.Exclusive => FileShare.None,
LockType.Shared => FileShare.ReadWrite,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted to use a custom enum to map to the appropriate FileShare value used in the underlying lock. This keeps the intended usage clearer.

Comment on lines +12 to +13
static readonly TimeSpan RetryInitialDelay = TimeSpan.FromMilliseconds(10);
static readonly TimeSpan RetryMaxDelay = TimeSpan.FromMilliseconds(500);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to use a polling strategy in Tentacle's ScriptIsolationMutex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should change Tentacle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this comment was supposed to show that this lock mechanism has to poll, whereas the Tentacle based one does not. Therefore this is new and these values are not drawn from anywhere else beyond being what I would consider to be reasonabel starting points.

@rhysparry rhysparry force-pushed the rhys/eft-251/script-isolation branch 4 times, most recently from a181073 to ccf545f Compare March 5, 2026 01:29
@rhysparry rhysparry marked this pull request as ready for review March 5, 2026 01:29
@rhysparry rhysparry self-assigned this Mar 5, 2026
@rhysparry rhysparry force-pushed the rhys/eft-251/script-isolation branch from ccf545f to d50b0b5 Compare March 5, 2026 04:35
Pass isolation options as parameters

Add unit tests for script isolation

Update test to pass options
@rhysparry rhysparry force-pushed the rhys/eft-251/script-isolation branch from d50b0b5 to a5c54e5 Compare March 6, 2026 01:00
Copy link
Contributor

@gb-8 gb-8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
I think we need to tests to confirm that a lock can be acquired when another one is released, and that we will wait for a period for this to happen.

Comment on lines +12 to +13
static readonly TimeSpan RetryInitialDelay = TimeSpan.FromMilliseconds(10);
static readonly TimeSpan RetryMaxDelay = TimeSpan.FromMilliseconds(500);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should change Tentacle?

@rhysparry rhysparry requested a review from gb-8 March 9, 2026 03:26
Copy link
Contributor

@gb-8 gb-8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments that we need to discuss.

}

var lockFileInfo = GetLockFileInfo(options.TentacleHome, options.MutexName);
_ = lockFileInfo.Exists;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's going on here?
If this is required to force caching or some other subtle behaviour, then we need a comment to explain that.
Otherwise we can remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This is a remnant to try and force FileInfo to actually throw exceptions due to a bad path.

static ResiliencePipeline<ILockHandle> BuildLockAcquisitionPipeline(LockOptions lockOptions)
{
var builder = new ResiliencePipelineBuilder<ILockHandle>();
// Timeout must be between 10ms and 1 day. (Polly)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does (Polly) mean here? Are you saying those timeout limits are imposed by Polly? If so, can you link to the docs?

There is a big difference between timing out after 1 day and never timing out. If we really cannot support longer timeouts, then probably we should cap the timeout rather than removing it.

Let's discuss.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a limit imposed by Polly: https://www.pollydocs.org/api/Polly.Timeout.TimeoutStrategyOptions.html#Polly_Timeout_TimeoutStrategyOptions_Timeout

Happy to switch this to cap the timeout. Will discuss.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since timeout can be user-defined, I think we need to find a solution that will support timeouts >24h.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently the constraint on the timeout only applies when setting the timeout directly. By using the timeout provider, you can specify higher values, although it is still limited by CancellationTokenSource which is ~25 days. The same limitation applies to existing script isolation implementations (Tentacle and Server-based SSH implementation)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I reckon that is plenty high enough.

var methodBody = runMethod!.GetMethodBody();
methodBody.Should().NotBeNull();

var localsContainILockHandle = methodBody!.LocalVariables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not test this by executing two Calamari processes and verifying isolation behaviour?
That would be a harder test to write and slower, but would actually verify behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added integration tests now

results[1].ExitCode.Should().Be(0,
"Process2 should succeed. Output: " + results[1].CapturedOutput);

// With NoIsolation (shared locks), both processes should have run.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how asserting on overlap could be flaky.
Have you manually verified that they do overlap most of the time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick way to eyeball this is to compare the execution times. Each script has a 2 second sleep.

image

The timings indicate concurrency as sequentially would take 4 seconds plus overhead and this was less than that.

Copy link
Contributor

@gb-8 gb-8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all issues have been addressed, but the PR needs updating with an account of what testing has been done, in particular with respect to exercising platform specific Hresult values for checking on lock failures.

@rhysparry
Copy link
Contributor Author

rhysparry commented Mar 18, 2026

I think all issues have been addressed, but the PR needs updating with an account of what testing has been done, in particular with respect to exercising platform specific Hresult values for checking on lock failures.

I've added the following to the PR description:

Testing

  • This PR includes unit and integration tests validating lock behaviour
  • CI tests are configured to cover Windows and Linux platforms
  • MacOS has been tested locally

These unit tests cover the HResult values.

Copy link
Contributor

@gb-8 gb-8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Pre-approved (but we're still waiting for confirmation of CI builds covering Windows).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants