fix: preserve full path in repo cache key for GitLab subgroup URLs (#438, #439)#522
fix: preserve full path in repo cache key for GitLab subgroup URLs (#438, #439)#522mvanhorn wants to merge 1 commit into
Conversation
The previous _extract_repo_name_from_url collapsed https://gitlab.com/group/subgroup/repo down to a key of subgroup_repo, which collides if two different parent groups have a project/subproject pair with the same last two segments. The clone URL was passed through intact, but the cache lookup at ~/.adalflow/databases/{key}.pkl could return the wrong project. Switch to urlparse and join every path segment with underscores, so https://gitlab.com/group/subgroup/repo becomes group_subgroup_repo. Plain 2-segment URLs (github.com/owner/repo, bitbucket.org/owner/repo, gitlab.com/owner/repo) keep the existing owner_repo shape. The .git suffix on the final segment is still stripped. Update the existing GitLab subgroup test case to reflect the new shape. Relates to AsyncFuncAI#438 and AsyncFuncAI#439. The reporter's specific failure mode may be downstream of this (the clone itself, not the cache key); this commit fixes the cache-key half. The remaining downstream investigation is noted in the PR body.
There was a problem hiding this comment.
Code Review
This pull request refactors the _extract_repo_name_from_url method to more robustly handle various repository URL formats, including those with multiple subgroups like GitLab. The implementation now utilizes urlparse and joins path components with underscores to create unique identifiers. Feedback was provided to simplify the logic for processing path parts by consolidating conditional branches into a more concise block.
| if len(path_parts) >= 2: | ||
| path_parts[-1] = strip_git_suffix(path_parts[-1]) | ||
| repo_name = "_".join(path_parts) | ||
| else: | ||
| repo_name = strip_git_suffix(path_parts[-1]) if path_parts else "" |
There was a problem hiding this comment.
The logic for handling path_parts can be simplified. Since "_".join(path_parts) on a list with a single element returns that element itself, you can combine the two branches into one, checking only if the list is non-empty.
| if len(path_parts) >= 2: | |
| path_parts[-1] = strip_git_suffix(path_parts[-1]) | |
| repo_name = "_".join(path_parts) | |
| else: | |
| repo_name = strip_git_suffix(path_parts[-1]) if path_parts else "" | |
| if path_parts: | |
| path_parts[-1] = strip_git_suffix(path_parts[-1]) | |
| repo_name = "_".join(path_parts) | |
| else: | |
| repo_name = "" |
rlm-wiki reviewPR Review: fix: preserve full path in repo cache key for GitLab subgroup URLsSummaryThis PR refactors Findings1. [LOW] Code Quality: Redundant Branching in Path Processing
2. [LOW] Consistency: Inconsistent
|
Addresses the cache-key half of #438 (and related #439).
The previous
_extract_repo_name_from_url(api/data_pipeline.py:773) collapsedhttps://gitlab.com/group/subgroup/repodown to a cache key ofsubgroup_repo. Two different parent groups with the sameproject/subprojectlast-two-segments would collide on~/.adalflow/databases/{key}.pkleven though the clone URL itself was passed through intact.This PR switches to
urlparseand joins every path segment with underscores:github.com/owner/repoowner_repoowner_repo(unchanged)bitbucket.org/owner/repoowner_repoowner_repo(unchanged)gitlab.com/owner/repoowner_repoowner_repo(unchanged)gitlab.com/group/subgroup/reposubgroup_repogroup_subgroup_repogitlab.com/gitlab-org/ruby/gems/prometheus-client-mmapgems_prometheus-client-mmapgitlab-org_ruby_gems_prometheus-client-mmapThe existing GitLab subgroup test case is updated to reflect the new shape. Other test cases (2-segment URLs) keep their existing expectations.
Scope note. The reporter on #438/#439 says the URL "fails" even after being entered correctly. Running a GitLab subgroup URL through the frontend regex shows it's accepted (the lazy mid-group capture handles arbitrary depth), so the user-visible failure is downstream of input parsing. The cache-key fix here is the most localized, demonstrably-correct piece of that puzzle. If the reporter's failure persists after this lands, the next step is to repro
https://gitlab.com/gitlab-org/ruby/gems/prometheus-client-mmapend-to-end and trace where wiki generation actually fails (likely in the clone or chat pipeline, not the cache).Existing users with cached
subgroup_repodatabases will need to regenerate the wiki once after this lands; the new cache key won't match the old file. There are no schema migrations needed.Relates to #438, #439