fix: prevent mv lock timeout causing missing L0/L1 files #1064

Open
r266-tech wants to merge 1 commit into volcengine:main from r266-tech:fix/lock-timeout-mv-retry

@r266-tech
Contributor

Closes #1047

Problem

The SemanticProcessor fails to move generated .abstract.md (L0) and .overview.md (L1) files from the temp directory to the target resource directory. Server logs show "Failed to acquire mv lock" errors, resulting in missing layer files even though the VLM successfully generated them.

Root cause: Two compounding issues:

  1. TransactionConfig.lock_timeout and LockManager both default to 0.0, meaning any lock contention causes immediate failure — no waiting, no retry.
  2. SemanticProcessor._sync_topdown_recursive() has no retry logic around viking_fs.mv() calls. When concurrent operations compete for subtree locks, the first contention kills the move.
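The first issue can be illustrated with a minimal sketch. It uses Python's standard `threading.Lock` as a stand-in and does not reproduce OpenViking's `LockManager`. The `try_acquire` helper and its treatment of `timeout <= 0` as "do not wait" are assumptions made for illustration only.

```python
import threading

def try_acquire(lock: threading.Lock, timeout: float) -> bool:
    """Hypothetical stand-in: a timeout of 0 means 'do not wait at all'."""
    if timeout <= 0:
        return lock.acquire(blocking=False)
    return lock.acquire(timeout=timeout)

lock = threading.Lock()
acquired = threading.Event()
release = threading.Event()

def holder() -> None:
    # Simulates a concurrent operation that holds the lock for a while.
    with lock:
        acquired.set()
        release.wait()

worker = threading.Thread(target=holder)
worker.start()
acquired.wait()  # the lock is now definitely held by the other thread

# With a timeout of 0.0 the contended acquire gives up immediately.
fail_fast = try_acquire(lock, 0.0)

# With a 5.0 s budget the caller waits until the holder lets go.
threading.Timer(0.1, release.set).start()
waited = try_acquire(lock, 5.0)
if waited:
    lock.release()
worker.join()

print(fail_fast, waited)
```

The same contention that fails outright with a zero timeout succeeds once the caller is allowed to wait, which is the behavior the new default aims for.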

Fix

1. Change default lock_timeout from 0.0 to 5.0 (seconds)

Updated in both TransactionConfig and LockManager/init_lock_manager. Five seconds is enough for transient contention to clear without causing indefinite blocking on real deadlocks. Users who want fail-fast behavior can still set lock_timeout=0.

2. Add retry logic with backoff for mv operations

New _mv_with_retry() helper in semantic_processor.py:

  • Retries up to 3 times with increasing delay (0.3s → 0.6s → 0.9s)
  • Only retries on lock-related errors (checks if "lock" is in the error message)
  • Logs a warning on each retry for debugging
  • Raises the original exception if retries are exhausted

Applied to all 4 viking_fs.mv() call sites in _sync_topdown_recursive().
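The retry behavior described above can be sketched as follows. This is not the code from the PR: it is a synchronous approximation, and the `mv_with_retry` signature, the injectable `sleep` parameter, the case-insensitive "lock" check, and the example paths are all assumptions. It reads "retries up to 3 times" as one initial attempt plus at most three retries.

```python
import logging
import time
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")

def mv_with_retry(
    mv: Callable[[str, str], T],
    src: str,
    dst: str,
    max_retries: int = 3,
    base_delay: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call mv(src, dst), retrying lock-related failures with growing delay."""
    retries = 0
    while True:
        try:
            return mv(src, dst)
        except Exception as exc:
            # Only lock-related errors are retried (assumed case-insensitive).
            is_lock_error = "lock" in str(exc).lower()
            if not is_lock_error or retries >= max_retries:
                raise  # re-raise the original exception unchanged
            retries += 1
            delay = base_delay * retries  # roughly 0.3 s, 0.6 s, 0.9 s
            logger.warning(
                "mv %s -> %s failed on a lock (%s); retry %d/%d in %.1fs",
                src, dst, exc, retries, max_retries, delay,
            )
            sleep(delay)

# Usage with a fake mv that fails twice on a lock error, then succeeds.
calls = []

def flaky_mv(src: str, dst: str) -> str:
    calls.append((src, dst))
    if len(calls) < 3:
        raise RuntimeError("Failed to acquire mv lock")
    return "moved"

delays = []  # record delays instead of actually sleeping
result = mv_with_retry(
    flaky_mv, "/tmp/example/.abstract.md", "/resources/example/.abstract.md",
    sleep=delays.append,
)
```

Passing `sleep` as a parameter is a design choice of this sketch so the backoff can be observed without real waiting. If `viking_fs.mv()` is a coroutine, the equivalent helper would await the call and use `asyncio.sleep` instead.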

Changes

| File | Change |
| --- | --- |
| openviking_cli/utils/config/transaction_config.py | Default lock_timeout: 0.0 → 5.0 |
| openviking/storage/transaction/lock_manager.py | Default lock_timeout: 0.0 → 5.0 (2 locations) |
| openviking/storage/queuefs/semantic_processor.py | New _mv_with_retry() helper, 4 call sites updated |

3 files changed, +37/-13

Two-part fix for SemanticProcessor failing to move generated layer
files (.abstract.md, .overview.md) from temp to target directory.

1. Change default lock_timeout from 0.0 to 5.0 (seconds)
   - LockManager and TransactionConfig both defaulted to 0.0,
     meaning any lock contention caused immediate failure
   - Updated to 5.0s: enough for transient contention, not so
     long that a real deadlock blocks indefinitely
   - Users can still set lock_timeout=0 for fail-fast behavior

2. Add retry logic with backoff for mv operations in
   SemanticProcessor._sync_topdown_recursive()
   - New _mv_with_retry() helper: retries up to 3 times with
     increasing delay (0.3s, 0.6s, 0.9s) on lock errors
   - Applied to all 4 viking_fs.mv() call sites
   - Logs warning on each retry attempt for debugging

Closes volcengine#1047
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions

Failed to generate code suggestions for PR


Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug]: SemanticProcessor fails to move L0/L1 files from temp to target directory due to lock timeout

3 participants