Skip to content

Support graceful script isolation fallback when filesystem does not support locking#1838

Draft
rhysparry wants to merge 29 commits intorhys/eft-251/script-isolationfrom
rhys/eft-383/graceful-lock-fallback
Draft

Support graceful script isolation fallback when filesystem does not support locking#1838
rhysparry wants to merge 29 commits intorhys/eft-251/script-isolationfrom
rhys/eft-383/graceful-lock-fallback

Conversation

@rhysparry
Copy link
Contributor

@rhysparry rhysparry commented Mar 19, 2026

Script isolation lock directory fallback for filesystems without full locking support

Motivation

Calamari's script isolation mechanism uses filesystem file locks to serialise or permit concurrent script execution. The lock file has historically been written into the Tentacle working directory (TentacleHome).

On some deployment targets the working directory resides on a filesystem that does not fully support file locking semantics — NFS mounts, certain FUSE-based filesystems, and some SMB shares being common examples. On these filesystems, lock acquisition either silently succeeds when it should block, or throws errors that are not meaningful to the caller. The result is that script isolation is silently broken: scripts that should be mutually exclusive may run concurrently, or scripts that should run in parallel are incorrectly serialised (or errored).

This PR introduces runtime filesystem lock capability detection and an automatic fallback to a temporary directory that does support locking, so that script isolation remains as strong as the host OS allows — without requiring any manual configuration from operators.


What changed

New abstractions

File Purpose
ScriptIsolation/LockCapability.cs Enum: Unknown, Unsupported, ExclusiveOnly, Supported
ScriptIsolation/LockFile.cs Value type pairing a LockDirectory with a specific file path
ScriptIsolation/LockDirectory.cs Value type capturing a directory path + its detected LockCapability; hosts the GetLockDirectory fallback algorithm
ScriptIsolation/CachedDriveInfo.cs Snapshot of a drive/mount's format, type, and detected lock capability; performs live 4-probe detection
ScriptIsolation/MountedDrives.cs Enumerates OS mounts; maps a path to its best-matching drive via longest-prefix matching with symlink resolution
ScriptIsolation/TemporaryDirectoryFallback.cs Platform-aware enumeration of candidate temp directories to fall back to
ScriptIsolation/IFileLockService.cs + FileLockService.cs Injectable abstraction over CreateDirectory, AcquireLock, and GetFallbackTemporaryDirectories — enables hermetic unit tests
ScriptIsolation/IPathResolutionService.cs + DefaultPathResolutionService.cs + PathResolutionServiceExtensions.cs Injectable abstraction and extension over Path.GetFullPath and symlink resolution — necessary to correctly match paths on macOS where /tmp → /private/tmp

Modified files

File Change
ScriptIsolation/Isolation.cs Adds PrepareLockOptions and ResolveLockOptions — the degradation policy when a directory with full lock support cannot be found
ScriptIsolation/LockOptions.cs Delegates IsFullySupported / IsSupported / Supports(LockType) to LockFile; adds FromScriptIsolationOptionsOrNull path through LockDirectory.GetLockDirectory
ScriptIsolation/FileLock.cs Minor: adapts to LockFile record shape
Plumbing/Commands/CommonOptions.cs Adds scriptIsolationSharedFallback CLI option and PromoteToExclusiveLockWhenSharedLockUnavailable property

Decision trees

Phase 1 — Lock directory selection (LockDirectory.GetLockDirectory)

flowchart TD
    Start([Candidate path]) --> MapDrive[Map candidate to its drive]
    MapDrive --> CandidateKnown{Candidate LockSupport?}
    CandidateKnown -- Supported --> ReturnCandidate([Return candidate\nLockCapability.Supported])
    CandidateKnown -- Unknown or Unsupported --> ProbeCandidateDrive

    ProbeCandidateDrive[DetectLockSupport on candidate drive]
    ProbeCandidateDrive --> CandidateProbeResult{Candidate\ndetected capability?}
    CandidateProbeResult -- Supported --> ReturnCandidate2([Return candidate\nLockCapability.Supported])
    CandidateProbeResult -- ExclusiveOnly or Unsupported or Unknown --> IterateFallbacks

    IterateFallbacks[Iterate temp fallback directories]
    IterateFallbacks --> HasTemp{More temp\ncandidates?}
    HasTemp -- No --> AnyExclusiveTemp
    HasTemp -- Yes --> MapTempDrive[Map temp dir to its drive]
    MapTempDrive --> ProbeTemp[DetectLockSupport on temp drive]
    ProbeTemp --> TempResult{Temp capability?}
    TempResult -- Supported --> ReturnTemp([Return temp dir\nLockCapability.Supported])
    TempResult -- ExclusiveOnly --> RecordFirstExclusive[Record as best-so-far\nif none recorded]
    TempResult -- Unsupported or Unknown --> HasTemp

    AnyExclusiveTemp{Any ExclusiveOnly\ntemp found?}
    AnyExclusiveTemp -- Yes, and candidate is ExclusiveOnly --> ReturnCandidate3([Return candidate\nLockCapability.ExclusiveOnly])
    AnyExclusiveTemp -- Yes, and candidate is Unsupported/Unknown --> ReturnBestTemp([Return best temp dir\nLockCapability.ExclusiveOnly])
    AnyExclusiveTemp -- No --> ReturnUnsupported([Return candidate\nLockCapability.Unsupported])
Loading

Key design choice: The candidate drive is probed before iterating temp directories. If the candidate detects as Supported it is returned immediately without ever consulting temp paths. When both the candidate and the best temp are ExclusiveOnly, the candidate is still preferred — using a temp directory offers no additional isolation benefit and introduces unnecessary path indirection. The temp is only used when it genuinely improves on the candidate.


Phase 2 — Lock options resolution (Isolation.ResolveLockOptions)

flowchart TD
    Start([LockOptions\nType + LockDirectory capability]) --> FullySupported{IsFullySupported?\nShared + Exclusive work\ncorrectly}

    FullySupported -- Yes --> ReturnOriginal([Use requested lock\nNo warning])
    FullySupported -- No --> IsSupported{IsSupported?\nRequested LockType\nis supported}

    IsSupported -- Yes, Exclusive requested --> PromoteFlag1{promoteToExclusiveLock?}
    PromoteFlag1 -- false --> ReturnExclusiveWarn([Use Exclusive lock\nWarn: shared scripts\nmay run concurrently])
    PromoteFlag1 -- true --> ReturnExclusiveNoWarn([Use Exclusive lock\nNo warning])

    IsSupported -- No, Shared requested\non ExclusiveOnly dir --> ExclusiveAvail{Supports\nExclusive?}
    ExclusiveAvail -- Yes --> PromoteFlag2{promoteToExclusiveLock?}
    PromoteFlag2 -- true --> PromoteToExclusive([Promote to Exclusive lock\nWarn: shared unavailable])
    PromoteFlag2 -- false --> NoLockShared([No lock acquired\nWarn: shared unavailable])

    ExclusiveAvail -- No --> NoLockUnsupported([No lock acquired\nWarn: no isolation available])
Loading

The promoteToExclusiveLock flag is controlled by the new scriptIsolationSharedFallback CLI option. It defaults to Exclusive (promote), which is the conservative choice — a script running with NoIsolation that cannot get a shared lock is promoted to exclusive rather than running with zero isolation. Pass Skip to disable promotion.


Areas especially relevant for review

CachedDriveInfo.DetectLockSupport — live filesystem probing

Four sequential probe operations are run against a temporary .tmp file in the lock directory to empirically verify lock semantics. The probes use TimeSpan.Zero timeout so they fail fast if the filesystem rejects the operation. The test file is always cleaned up in a finally block. Any exception during probing is treated conservatively as Unsupported. Reviewers should check whether the probe-file cleanup is sufficient and whether the TimeSpan.Zero strategy is reliable across all target filesystems.

PathResolutionServiceExtensions.ResolvePath — ancestor-walk symlink resolution

The ancestor-walk is necessary because a lock directory might not exist yet at the time of drive matching (e.g. first ever run). The algorithm peels off non-existent path segments from the end, resolves the nearest existing ancestor's symlink, then re-attaches the original tail. This is especially important on macOS where Path.GetFullPath("/tmp/x") returns /tmp/x but the drive is mounted at /private/tmp. All five GetFullPath exception types are caught and cause the input to be returned unchanged.

MountedDrives.GetAssociatedDrive — longest-prefix mount matching

The match is done on the symlink-resolved path, with platform-appropriate case sensitivity (OrdinalIgnoreCase on Windows/macOS, Ordinal on Linux). Reviewers should verify the StartsWith prefix check correctly handles the case where the drive root IS the path (no trailing separator needed since RootDirectory.FullName ends with a separator on all .NET platforms).

TemporaryDirectoryFallback.GetCandidates — platform-specific candidate order

On Linux, /dev/shm is included as a candidate after /tmp. tmpfs-backed /dev/shm is one of the most reliably lock-capable paths on systems where the working directory is NFS-mounted. The candidates are namespace-scoped using the last segment of the candidate path to prevent cross-deployment lock file collisions.

CommonOptions.ScriptIsolationOptions — new CLI parameter

The scriptIsolationSharedFallback option is additive and defaults to the safe value (Exclusive). Callers that do not pass the option get PromoteToExclusiveLockWhenSharedLockUnavailable = true.


Test coverage

LockDirectoryFixture — 1,063 lines, all groups tagged RunOnceOnWindowsAndLinux

All tests in this fixture use injected fakes (FakeLockService, FakePathResolutionService) and run entirely in-memory — no real filesystem access.

Group What is tested
ALockDirectory.Supports All LockCapability × LockType combinations (6 cases)
BCachedDriveInfo.LockSupport static Known-good formats (apfs, ntfs, ext4, etc.) → Supported; network drives → Unknown; DetectedLockSupport overrides (13 cases)
CCachedDriveInfo.DetectLockSupport All four live-probe outcomes: Unsupported, three variants of ExclusiveOnly (broken shared, broken exclusive-blocks-shared, broken shared-blocks-exclusive), and Supported (4 cases)
DLockDirectory.GetLockDirectory 9 scenario tests: candidate already Supported; candidate Unknown → detected Supported before temps inspected (candidate preferred over a Supported temp); candidate UnsupportedSupported temp used; candidate Unknown → detected Supported before ExclusiveOnly temps inspected; both candidate and temp are ExclusiveOnly → stay on candidate; temp ExclusiveOnly + candidate Unsupported → use temp; nothing works → Unsupported; empty drive table; temp with no associated drive skipped; first Supported temp wins when multiple candidates
EMountedDrives.GetAssociatedDrive 7 tests: symlink resolution before matching; case-sensitive rejection; case-insensitive match; normalisation; longest-prefix wins (POSIX); DirectoryNotFoundException when no match; ancestor symlink resolved when child doesn't exist
FDefaultPathResolutionService 3 real-filesystem integration tests (2 Unix-only): existing symlink, non-existent child under symlink ancestor, fully non-existent path
GPathResolutionServiceExtensions.ResolvePath 9 tests: all GetFullPath exception types caught (5 parameterised), non-existent path, existing non-symlink, symlink followed, ancestor symlink + tail re-attached

IsolationFixture — extended with ResolveLockOptions unit tests

Tests added What is tested
6 new unit tests All branches of Isolation.ResolveLockOptions: Supported → original returned; ExclusiveOnly + Exclusive + promote=false → warning; ExclusiveOnly + Exclusive + promote=true → no warning; ExclusiveOnly + Shared + promote=true → promoted to Exclusive + warning; ExclusiveOnly + Shared + promote=false → null + warning; Unsupported + any request → null + warning

LockOptionsFixture

FromScriptIsolationOptionsOrNull_BuildsCorrectLockFileName updated to assert on LockFile.File.Name rather than the full path (since the lock directory is now resolved dynamically rather than being a fixed subdirectory of TentacleHome).

What is not covered by unit tests

  • The end-to-end path through LockDirectory.GetLockDirectory against a real NFS or FUSE mount. The live probing logic is verified against a real (local) filesystem via IsolationFixture and Group F in LockDirectoryFixture, but the specific scenario of a network filesystem detecting as Unsupported and falling back to /tmp is exercised only through the fake in Group D.
  • Windows-specific temp directory candidates (%LOCALAPPDATA%\Calamari, %TEMP%) are not covered by integration tests.

⚠️ Does this change require a corresponding Server Change?
⚠️ If so - please add a "Requires Server Change" label to this PR!

…quire delegate

CachedDriveInfo is extracted from its nested position inside LockDirectory
into its own file. DetectLockSupport gains an overload that accepts a
Func<LockOptions, ILockHandle> delegate in place of the direct FileLock.Acquire
call, making the four lock-probe tests (exclusive, shared, exclusive-blocks-shared,
shared-blocks-exclusive) unit-testable without real filesystem operations.
The existing public overload forwards to FileLock.Acquire.
MountedDrives is extracted from its nested position inside LockDirectory
into its own file with no logic changes.
The public GetLockDirectory(string) now delegates to an internal overload
that accepts a MountedDrives instance and an optional acquire delegate.
This allows tests to supply controlled drive lists and lock-acquisition
behaviour without real filesystem probing.

The refactor also closes a design gap: GetAssociatedDrive calls that
throw DirectoryNotFoundException (e.g. when MountedDrives is empty or a
path does not sit under any known drive root) are now caught, and the
affected path is treated as having no known capability rather than
propagating the exception.
The fallback decision logic (routing between IsFullySupported, IsSupported,
PromoteToExclusiveLock, and returning null) is separated into an internal
static ResolveLockOptions(LockOptions, bool) method. PrepareLockOptions
becomes a thin wrapper that builds the options, logs, then delegates.
This makes all fallback branches unit-testable by constructing a LockOptions
with a controlled LockCapability, without any filesystem access.
…metic tests

GetLockDirectory tests against non-existent paths (e.g. /home/octopus/tentacle)
were failing because DetectLockSupport called Directory.CreateDirectory
unconditionally before probing, causing the outer catch to return Unsupported
instead of running the fake lock service. Thread a createDirectory Action through
GetLockDirectory and DetectLockSupport (defaulting to the real implementation),
and pass a no-op in the two affected Group D fixture tests.
The two delegates (acquireDelegate and createDirectory) passed through
GetLockDirectory and DetectLockSupport were always used together and
represented a single concern: interacting with the filesystem during
lock-support probing. Introducing IFileLockService makes that coupling
explicit and gives tests a single seam to inject.

- Add IFileLockService with AcquireLock and CreateDirectory
- Add FileLockService (singleton) as the real implementation
- Collapse the multi-overload chain in CachedDriveInfo.DetectLockSupport
  and LockDirectory.GetLockDirectory to a single IFileLockService parameter
- FakeLockService in LockDirectoryFixture now implements IFileLockService
  (CreateDirectory is a no-op; Acquire renamed to AcquireLock)
…oryFallback

Move GetTemporaryCandidates logic into a new ITemporaryDirectoryFallback interface
with real implementation TemporaryDirectoryFallback (singleton). This allows the
internal GetLockDirectory overload to accept an injectable fallback, making tests
hermetic and independent of environment variables ($TMPDIR) and filesystem
existence checks (/tmp, /dev/shm).

Also fix bug: the /dev/shm candidate was incorrectly joined under /tmp instead of
/dev/shm.

Add FakeTemporaryDirectoryFallback to tests returning a fixed list of candidates.
All Group D tests now inject this fake, removing any dependency on $TMPDIR, /tmp,
and /dev/shm.

Add 2 new Group D tests:
- Verify temp candidates with no matching drive are silently skipped
- Verify multiple temp candidates: the first Supported one is chosen

All 89 tests pass.
Consolidate ITemporaryDirectoryFallback into IFileLockService by adding
GetFallbackTemporaryDirectories. This removes the separate interface and its
injectable seam, simplifying the design: callers only need to inject one
service. TemporaryDirectoryFallback becomes a static helper used exclusively
by FileLockService. Tests updated to use FakeLockService with temp directory
params instead of the now-deleted FakeTemporaryDirectoryFallback.
…d platform-aware path comparison

Introduce IPathResolutionService (+ DefaultPathResolutionService) to address
three gaps in the original prefix-matching logic:

1. Symlink resolution: input paths are now resolved (via BCL
   FileInfo.ResolveLinkTarget) before prefix-matching against drive roots.
   This fixes the macOS /tmp → /private/tmp case where DriveInfo returns
   /private/tmp as the mount root but callers may pass /tmp/foo.

2. Platform-aware case comparison: StringComparison is now sourced from
   IPathResolutionService.PathComparison (OrdinalIgnoreCase on Windows/macOS,
   Ordinal on Linux) rather than hardcoded OrdinalIgnoreCase.

3. Path normalisation: Path.GetFullPath is applied before any symlink
   resolution, eliminating relative paths and .. components.

The IPathResolutionService seam keeps the production change entirely in
DefaultPathResolutionService (a thin BCL wrapper), while a FakePathResolutionService
in the test fixture allows all new Group E tests to run hermeticly without
real symlinks or filesystem state.
All Group D tests in LockDirectoryFixture used fake paths (e.g.
/home/octopus/tentacle, /tmp/tentacle) that don't exist on macOS.
DefaultPathResolutionService would walk up ancestors and resolve /tmp
to /private/tmp, causing IsAncestor to fail against the /tmp drive root.

Added pathResolver: FakePathResolutionService.PassThrough to all Group D
test calls so they bypass real filesystem resolution, matching the intent
of hermetic unit tests.
…gic to extension method

IPathResolutionService now exposes only two thin BCL delegates — GetFullPath
(Path.GetFullPath) and ResolveLinkTarget (FileInfo.ResolveLinkTarget) — plus
PathComparison.  The ancestor-walk resolution logic is extracted into
PathResolutionServiceExtensions.ResolvePath so it can be unit-tested in
isolation via FakePathResolutionService.

DefaultPathResolutionService becomes a trivial wrapper with no logic of its
own.  FakePathResolutionService is updated to implement the new interface
with separate fullPathMap (normalisation) and symlinkMap (symlink simulation)
dictionaries, and the PassThrough preset throws FileNotFoundException from
ResolveLinkTarget so the extension method falls back to returning the
GetFullPath result unchanged.

Added Group G tests (6 new) that directly exercise ResolvePath extension
method logic: non-existent path fallback, existing non-symlink, direct
symlink, ancestor-symlink walk, GetFullPath pre-normalisation, and
multi-segment tail re-attachment.
…lock others

Path.GetFullPath(string) documents five exception types that can be thrown
for invalid, inaccessible, or oversized paths: ArgumentException,
ArgumentNullException, SecurityException, NotSupportedException, and
PathTooLongException.  Previously these propagated out of ResolvePath,
which would have aborted the entire drive-selection loop in LockDirectory
for all remaining candidate and temp paths.

ResolvePath now catches all five and returns the original raw path as a
safe fallback, preserving the invariant that one malformed path cannot
prevent other options from being evaluated.

Changes:
- IPathResolutionService.GetFullPath: added <exception> docs for all five
  exception types
- PathResolutionServiceExtensions.ResolvePath: added try/catch around the
  GetFullPath call; updated XML docs to describe the fallback behaviour
- FakePathResolutionService: added getFullPathException constructor
  parameter to inject any of the five exception types for testing
- Added 5 parameterised tests (one per exception type) in LockDirectoryFixture
@rhysparry rhysparry force-pushed the rhys/eft-383/graceful-lock-fallback branch from dea3c2a to 9ce1b18 Compare March 20, 2026 00:32
On Linux, FileInfo.ResolveLinkTarget throws UnauthorizedAccessException
(wrapping IOException: Permission denied) for paths whose parent
directories exist but are not accessible to the current user. The
ancestor-walk loop in ResolvePath only caught FileNotFoundException,
DirectoryNotFoundException, and IOException; UnauthorizedAccessException
is not a subclass of IOException so it escaped the handler and
propagated to callers.

Add UnauthorizedAccessException to the catch filter so that inaccessible
path components are treated the same as non-existent ones — the walk
peels off the last segment and retries the parent.
The previous hardcoded paths (C:\Octopus\Tentacle / /home/octopus/tentacle)
do not exist on developer or CI machines and cause
UnauthorizedAccessException on Linux when ResolveLinkTarget tries to
inspect an inaccessible parent directory.

Switching to Path.GetTempPath() guarantees the path exists and is
accessible on all platforms, making the fixture hermetic without
requiring any mock infrastructure.
The LockDirectory used to probe filesystem lock support was constructed
with LockCapability.Unknown, causing LockFile.Open to throw
NotSupportedException before any flock() attempt was made. This meant
every filesystem requiring runtime detection was classified as
Unsupported rather than being probed.

Fix: construct the probe LockDirectory with LockCapability.Supported so
the probe calls can actually reach the kernel.
Copy link
Contributor

@gb-8 gb-8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a scan of the PR and while the approach seems good, and the test coverage seems good, I did find the code hard to follow.
I had to really work to follow the logic through, and see exactly where the determination of locking support was performed.
The logic is all hidden inside things that don't make it clear what they are doing.
Once you are happy that the functionality is all correct, then I'd like to pair on seeing if we can refactor to make it more readable and understandable.


public enum LockCapability
{
Unknown,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude reckons we are using these values wrongly, since when we are unable to determine whether locking is supported, we fallback on Unsupported when it thinks it technically should be Unknown.

I think we are actually using Unknown to mean Untested, so could rename it as such.


public enum LockCapability
{
Unknown,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is good practice to set the default value explicitly to 0.

yield return ApplyNamespace(tmp);
}

const string devShm = "/dev/shm";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should actually preference /dev/shm over other locations, as it has the nice behaviour that locks will automatically get cleaned up on reboot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants