Skip to content

fix: preserve full path in repo cache key for GitLab subgroup URLs (#438, #439)#522

Open
mvanhorn wants to merge 1 commit into
AsyncFuncAI:mainfrom
mvanhorn:fix/438-deepwiki-438-gitlab-subgroup
Open

fix: preserve full path in repo cache key for GitLab subgroup URLs (#438, #439)#522
mvanhorn wants to merge 1 commit into
AsyncFuncAI:mainfrom
mvanhorn:fix/438-deepwiki-438-gitlab-subgroup

Conversation

@mvanhorn
Copy link
Copy Markdown

Addresses the cache-key half of #438 (and related #439).

The previous _extract_repo_name_from_url (api/data_pipeline.py:773) collapsed https://gitlab.com/group/subgroup/repo down to a cache key of subgroup_repo. Two different parent groups with the same project/subproject last-two-segments would collide on ~/.adalflow/databases/{key}.pkl even though the clone URL itself was passed through intact.

This PR switches to urlparse and joins every path segment with underscores:

URL Old key New key
github.com/owner/repo owner_repo owner_repo (unchanged)
bitbucket.org/owner/repo owner_repo owner_repo (unchanged)
gitlab.com/owner/repo owner_repo owner_repo (unchanged)
gitlab.com/group/subgroup/repo subgroup_repo group_subgroup_repo
gitlab.com/gitlab-org/ruby/gems/prometheus-client-mmap gems_prometheus-client-mmap gitlab-org_ruby_gems_prometheus-client-mmap

The existing GitLab subgroup test case is updated to reflect the new shape. Other test cases (2-segment URLs) keep their existing expectations.

Scope note. The reporter on #438/#439 says the URL "fails" even after being entered correctly. Running a GitLab subgroup URL through the frontend regex shows it's accepted (the lazy mid-group capture handles arbitrary depth), so the user-visible failure is downstream of input parsing. The cache-key fix here is the most localized, demonstrably-correct piece of that puzzle. If the reporter's failure persists after this lands, the next step is to repro https://gitlab.com/gitlab-org/ruby/gems/prometheus-client-mmap end-to-end and trace where wiki generation actually fails (likely in the clone or chat pipeline, not the cache).

Existing users with cached subgroup_repo databases will need to regenerate the wiki once after this lands; the new cache key won't match the old file. There are no schema migrations needed.

Relates to #438, #439

The previous _extract_repo_name_from_url collapsed
https://gitlab.com/group/subgroup/repo down to a key of subgroup_repo,
which collides if two different parent groups have a project/subproject
pair with the same last two segments. The clone URL was passed through
intact, but the cache lookup at ~/.adalflow/databases/{key}.pkl could
return the wrong project.

Switch to urlparse and join every path segment with underscores, so
https://gitlab.com/group/subgroup/repo becomes group_subgroup_repo.
Plain 2-segment URLs (github.com/owner/repo, bitbucket.org/owner/repo,
gitlab.com/owner/repo) keep the existing owner_repo shape.

The .git suffix on the final segment is still stripped.

Update the existing GitLab subgroup test case to reflect the new shape.

Relates to AsyncFuncAI#438 and AsyncFuncAI#439. The reporter's specific failure mode may be
downstream of this (the clone itself, not the cache key); this commit
fixes the cache-key half. The remaining downstream investigation is
noted in the PR body.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the _extract_repo_name_from_url method to more robustly handle various repository URL formats, including those with multiple subgroups like GitLab. The implementation now utilizes urlparse and joins path components with underscores to create unique identifiers. Feedback was provided to simplify the logic for processing path parts by consolidating conditional branches into a more concise block.

Comment thread api/data_pipeline.py
Comment on lines +789 to +793
if len(path_parts) >= 2:
path_parts[-1] = strip_git_suffix(path_parts[-1])
repo_name = "_".join(path_parts)
else:
repo_name = strip_git_suffix(path_parts[-1]) if path_parts else ""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for handling path_parts can be simplified. Since "_".join(path_parts) on a list with a single element returns that element itself, you can combine the two branches into one, checking only if the list is non-empty.

Suggested change
if len(path_parts) >= 2:
path_parts[-1] = strip_git_suffix(path_parts[-1])
repo_name = "_".join(path_parts)
else:
repo_name = strip_git_suffix(path_parts[-1]) if path_parts else ""
if path_parts:
path_parts[-1] = strip_git_suffix(path_parts[-1])
repo_name = "_".join(path_parts)
else:
repo_name = ""

@sashimikun
Copy link
Copy Markdown

rlm-wiki review

PR Review: fix: preserve full path in repo cache key for GitLab subgroup URLs

Summary

This PR refactors _extract_repo_name_from_url to handle multiple path components in repository URLs, specifically targeting GitLab subgroups. It replaces the logic that only considered the last two URL segments with a more robust approach using urlparse and joining all path parts with underscores.


Findings

1. [LOW] Code Quality: Redundant Branching in Path Processing

  • File: api/data_pipeline.py:793
  • Severity: Low
  • Description: The logic for handling path_parts can be simplified. The if len(path_parts) >= 2 check is redundant because "_".join() on a single-element list returns that element, and path_parts[-1] is valid for any non-empty list.
  • Recommendation: Adopt the suggested simplification:
    if path_parts:
        path_parts[-1] = strip_git_suffix(path_parts[-1])
        repo_name = "_".join(path_parts)
    else:
        repo_name = ""

2. [LOW] Consistency: Inconsistent .git Stripping for Local Paths

  • File: api/data_pipeline.py:835 vs api/data_pipeline.py:796
  • Severity: Low
  • Description: When a local path is passed to _create_repo, it uses os.path.basename() which does not strip the .git suffix. However, _extract_repo_name_from_url (used for URLs) explicitly strips it. This results in inconsistent naming (e.g., /path/repo.git becomes repo.git while https://.../repo.git becomes repo).
  • Recommendation: Consider using _extract_repo_name_from_url even for local paths in _create_repo to ensure consistent identifier generation.

3. [INFO] Test Confusion: test_extract_repo_name_current_implementation_bug

  • File: test/test_extract_repo_name.py:80
  • Severity: Informational
  • Description: The test named test_extract_repo_name_current_implementation_bug claims the implementation "references 'type' which is not in scope". However, the current code correctly uses repo_type and does not appear to have this scope issue. The test seems to be verifying that the method fails when called with missing arguments (which raises a TypeError), but the description is slightly misleading.
  • Recommendation: Rename or update the test description to clarify that it's testing error handling for incorrect method signatures if that was the intent.

Positive Highlights

  • Robust Parsing: The switch to urlparse significantly improves the reliability of URL component extraction compared to manual string splitting.
  • Subgroup Support: Correctly handles arbitrary nesting in GitLab, ensuring group/sub1/sub2/repo becomes group_sub1_sub2_repo.
  • Improved Uniqueness: By including the full path, the risk of collisions between different repositories with the same name (but different owners/groups) is greatly reduced.

Residual Test Risk

  • Cache Invalidation: Users with existing cached repositories under GitLab subgroups will find their cache keys have changed (e.g., from subgroup_repo to group_subgroup_repo). This will trigger a re-download and re-indexing of those repositories.
  • Collision Potential: While reduced, collisions can still occur if an owner name contains an underscore that matches a subgroup boundary (e.g., owner/repo vs owner_repo). This is a pre-existing risk inherited from the underscore-joining strategy.

Posted from rlm-wiki.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants