
Feature/dataset creation#2

Open
Allen-J0421 wants to merge 10 commits into dev from feature/dataset_creation
Conversation

@Allen-J0421
Owner

This repo is a pipeline for building a JSONL dataset. It uses the GitHub GraphQL API to discover merged PRs and the issues they close, producing (issue_id, pr_id) pairs without changelog parsing. For each pair it clones the repo (if needed), creates Git branches (base and human, from the PR merge commit), and optionally runs the Cursor agent to add cursor and cursor-creative branches. The pipeline is driven by main.py subcommands: resolve (fetch PRs and issues and pair them), extract (per-pair branch setup, written to the extract cache), apply-cursor (agent runs and branch updates in the cache), and build (read the extract cache and write dataset.jsonl with project, issue/PR text, hashes, and base-vs-human and base-vs-cursor diffs). Caches (resolve, pairs, extract) live under a configurable _cache directory.

For workflow logic and specific functionality and usage, check out README.md.
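
The pair-discovery step described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the query text, the name `MERGED_PRS_QUERY`, and the helper `extract_pairs` are hypothetical, and the sketch only shows how a GraphQL response for merged PRs and their closing issues might be flattened into (issue_id, pr_id) pairs. The sample response is made up for the example.

```python
from typing import Any

# Hypothetical query: merged PRs plus the issues each one closes.
# The actual query used by the pipeline is not shown in this PR thread.
MERGED_PRS_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(states: [MERGED], first: 50, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        number
        closingIssuesReferences(first: 10) { nodes { number } }
      }
    }
  }
}
"""


def extract_pairs(response: dict[str, Any]) -> list[tuple[int, int]]:
    """Flatten one page of results into (issue_id, pr_id) pairs."""
    prs = response["data"]["repository"]["pullRequests"]["nodes"]
    pairs = []
    for pr in prs:
        # A PR that closes no issue contributes no pairs.
        for issue in pr["closingIssuesReferences"]["nodes"]:
            pairs.append((issue["number"], pr["number"]))
    return pairs


# Made-up response in the shape the query above would return.
sample_response = {
    "data": {"repository": {"pullRequests": {
        "pageInfo": {"hasNextPage": False, "endCursor": None},
        "nodes": [
            {"number": 12, "closingIssuesReferences": {"nodes": [{"number": 7}, {"number": 8}]}},
            {"number": 15, "closingIssuesReferences": {"nodes": []}},
        ],
    }}}
}

pairs = extract_pairs(sample_response)
```

Because the pairing comes straight from the API's closing-issue links, no changelog parsing is involved; a PR without a linked issue simply drops out.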

Collaborator

@sadrasabouri sadrasabouri left a comment

Can you send me the created dataset in Slack so I can start from the byproduct?

@Allen-J0421
Owner Author

Just pushed some new commits to this PR addressing all the issues we discussed in our last meeting.

I'm sending the finished dataset in Slack.

Collaborator

@sadrasabouri sadrasabouri left a comment

I requested changes for some nitpicks. Since I still can't run the project, I can't test the functionality, but we can save that for later since it's more about reproducibility.

Comment on lines +152 to +154
CACHE_DIR_DEFAULT = f"{REPO}_cache"
RESOLVE_CACHE_FILENAME = "resolve_cache.json"
PAIRS_FILENAME = "pairs.json"
Collaborator

Move these to params.
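
One way the suggested move might look, as a minimal sketch: the three constants come from the diff above, but the module name `params.py`, the placeholder value of `REPO`, and the helper `cache_paths` are assumptions made for illustration and do not appear in the PR.

```python
# params.py (hypothetical module name; the review comment only says "params")
from pathlib import Path

# Placeholder: the real value of REPO is not shown in the diff.
REPO = "example-repo"

# Constants taken from the commented lines of the diff.
CACHE_DIR_DEFAULT = f"{REPO}_cache"
RESOLVE_CACHE_FILENAME = "resolve_cache.json"
PAIRS_FILENAME = "pairs.json"


def cache_paths(cache_dir: str = CACHE_DIR_DEFAULT) -> dict[str, Path]:
    """Hypothetical helper: build the cache file paths under one directory."""
    root = Path(cache_dir)
    return {
        "resolve": root / RESOLVE_CACHE_FILENAME,
        "pairs": root / PAIRS_FILENAME,
    }
```

Other modules would then import the names from one place (for example `from params import CACHE_DIR_DEFAULT`) instead of redefining them.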

2 participants