Feature/dataset creation by Allen-J0421 · Pull Request #2 · Allen-J0421/ACE-PR-Extraction-Automation

Allen-J0421 · 2026-02-16T22:27:18Z

This repo is a pipeline for building a JSONL dataset. It uses the GitHub GraphQL API to discover merged PRs and their closing issues, producing (issue_id, pr_id) pairs without changelog parsing. For each pair it clones the repo (if needed), creates Git branches (base and human, both derived from the PR merge commit), and optionally runs the Cursor agent to add cursor and cursor-creative branches. The pipeline is driven by main.py subcommands: resolve (fetch PRs and issues and pair them), extract (per-pair branch setup and extract cache), apply-cursor (agent runs and branch updates in the cache), and build (read the extract cache and write dataset.jsonl with project, issue/PR text, hashes, and base-vs-human and base-vs-cursor diffs). Caches (resolve, pairs, extract) live under a configurable _cache directory.
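As a rough illustration of the resolve step described above, the sketch below pairs merged PRs with the issues they close. It is not the code from this PR: the query shape, the page sizes, and the helper name `pairs_from_response` are assumptions made for the example. It relies on the `closingIssuesReferences` field that the GitHub GraphQL API exposes on pull requests, and it skips PRs that close no issue, which may differ from what the pipeline does.

```python
from typing import Any

# Hypothetical query: merged PRs plus the issues each one closes.
# Page sizes (50 and 10) are arbitrary choices for this sketch.
MERGED_PRS_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(states: MERGED, first: 50, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        number
        closingIssuesReferences(first: 10) { nodes { number } }
      }
    }
  }
}
"""


def pairs_from_response(payload: dict[str, Any]) -> list[tuple[int, int]]:
    """Flatten one page of the GraphQL response into (issue_id, pr_id) pairs."""
    prs = payload["data"]["repository"]["pullRequests"]["nodes"]
    pairs: list[tuple[int, int]] = []
    for pr in prs:
        # A PR with no closing issues contributes no pairs in this sketch.
        for issue in pr["closingIssuesReferences"]["nodes"]:
            pairs.append((issue["number"], pr["number"]))
    return pairs
```

Sending the query (with an auth token and cursor-based pagination via `pageInfo`) is left out so the example stays runnable offline.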

For workflow logic and specific functionality and usage, check out README.md.
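For readers who have not opened the README, the four subcommands named in the description could be wired up along these lines. This is only a sketch using the standard library's `argparse`; the actual main.py may be organized differently, and the help strings and the `build_parser` name are paraphrases and assumptions, not taken from the code.

```python
import argparse

# Subcommand names come from the PR description; help text is paraphrased.
SUBCOMMANDS = {
    "resolve": "fetch merged PRs and issues and pair them",
    "extract": "set up per-pair branches and fill the extract cache",
    "apply-cursor": "run the agent and update branches in the cache",
    "build": "read the extract cache and write dataset.jsonl",
}


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with one sub-parser per pipeline stage."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, help_text in SUBCOMMANDS.items():
        sub.add_parser(name, help=help_text)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"would run stage: {args.command}")
```

Under this layout, running `python main.py build` would select the build stage.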

sadrasabouri

Can you send me the created dataset in Slack so I can start from the byproduct?

Allen-J0421 · 2026-02-24T21:44:12Z

Just pushed some new commits to this PR addressing all the issues we discussed in our last meeting.

Sending the finished dataset in Slack.

sadrasabouri

I requested changes for some nitpicks. Since I still can't run the project, I can't test the functionality, but we can save that for later since it's more about reproducibility.

dataset.jsonl

main.py

params.py

features/agent_change.py

features/build_dataset.py

features/resolve_pairs.py

sadrasabouri · 2026-02-27T00:13:17Z

features/resolve_pairs.py

+CACHE_DIR_DEFAULT = f"{REPO}_cache"
+RESOLVE_CACHE_FILENAME = "resolve_cache.json"
+PAIRS_FILENAME = "pairs.json"


Move to params.
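One way to act on this suggestion is sketched below, assuming "params" refers to the params.py file added in this PR. The two files are shown in a single block so the example runs on its own. The `REPO` value is a placeholder (the real value is not visible in the diff), and `cache_paths` is a hypothetical helper that only shows how resolve_pairs.py might consume the moved constants.

```python
import os

# --- params.py (sketch) ---
REPO = "example-repo"  # placeholder; the real value is not shown in this PR
CACHE_DIR_DEFAULT = f"{REPO}_cache"
RESOLVE_CACHE_FILENAME = "resolve_cache.json"
PAIRS_FILENAME = "pairs.json"

# --- features/resolve_pairs.py (sketch) ---
# In the real layout this file would import the names instead of defining them:
# from params import CACHE_DIR_DEFAULT, RESOLVE_CACHE_FILENAME, PAIRS_FILENAME


def cache_paths(cache_dir: str = CACHE_DIR_DEFAULT) -> dict[str, str]:
    """Hypothetical helper: locations of the resolve and pairs caches."""
    return {
        "resolve": os.path.join(cache_dir, RESOLVE_CACHE_FILENAME),
        "pairs": os.path.join(cache_dir, PAIRS_FILENAME),
    }
```

Keeping the filenames in one module means the other stages can read the same caches without redefining the names.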

Allen-J0421 added 6 commits February 12, 2026 22:17

added params.py for easy access

c56b330

supports dataset creation and re-structured workflow

11ca012

fixed integration problems in resolve_pairs.py and extract.py

e523682

dataset organization change

b219e87

added cursor folder premission

c010de2

seperated build functionality with extract, read from extract_cache

48860de

Allen-J0421 requested a review from sadrasabouri February 16, 2026 22:27

sadrasabouri reviewed Feb 18, 2026

Allen-J0421 added 4 commits February 23, 2026 21:09

find issue from pr, support rebase merge

dbd62bf

fixed extract.py to handle issue_num = none

3d0a6c4

support build update dataset

a447b26

complete flask dataset, first 5 pairs with cursor implementation

b9a1e74

sadrasabouri requested changes Feb 27, 2026

Allen-J0421 wants to merge 10 commits into dev from feature/dataset_creation
