Feature/dataset creation #2
Open: Allen-J0421 wants to merge 10 commits into `dev` from `feature/dataset_creation`
10 commits (all by Allen-J0421):
- `c56b330` added params.py for easy access
- `11ca012` supports dataset creation and re-structured workflow
- `e523682` fixed integration problems in resolve_pairs.py and extract.py
- `b219e87` dataset organization change
- `c010de2` added cursor folder premission
- `48860de` seperated build functionality with extract, read from extract_cache
- `dbd62bf` find issue from pr, support rebase merge
- `3d0a6c4` fixed extract.py to handle issue_num = none
- `a447b26` support build update dataset
- `b9a1e74` complete flask dataset, first 5 pairs with cursor implementation
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,9 @@ | ||
| __pycache__/ | ||
| *.pyc | ||
| .DS_Store | ||
| # Pipeline output | ||
| pairs.json | ||
| data/ | ||
| # Cache directory (resolve_cache.json + extract_cache.json; default: <reponame>_cache/) | ||
| *_cache/ | ||
| flask/ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,126 @@ | ||
| ## Environment and packages | ||
|
|
||
| - **Python:** 3.7+ (stdlib only for parsing/resolve; no pip packages) | ||
| - **Git:** on `PATH` | ||
| - **GitHub CLI (`gh`):** [install](https://cli.github.com/), authenticated (`gh auth login`) | ||
| - **Cursor Agent CLI (`agent`):** needed only for `apply-cursor`; [Cursor](https://cursor.com/) v2.4.7+. Set `AGENT_PATH` if not on `PATH`. | ||
|
|
||
| ## Concepts | ||
|
|
||
| - **Refs** = Issue and PR IDs derived from the repo in `params.py` (GITHUB_OWNER, GITHUB_REPO), via a single batched GraphQL query for merged PRs and their closing-issue references. | ||
| - **Pairs** = `(issue_id, pr_id)` built from merged PRs' `closingIssuesReferences` in `features/resolve_pairs.py` (no changelog parsing). | ||
| - **params.py** = config: `CREATIVE_PROMPT_SUFFIX`, `GITHUB_OWNER`, `GITHUB_REPO`, `REPO_URL`. | ||
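A minimal sketch of the pairing idea described above, assuming the GraphQL response has already been fetched and reduced to a list of merged-PR nodes. The function name `build_pairs`, the sorting and de-duplication, and PR number 2700 are illustrative assumptions and are not taken from `features/resolve_pairs.py`. The pair 348 / 2686 reuses the IDs from the cache example in this README.

```python
from typing import Any, Dict, List, Tuple


def build_pairs(pr_nodes: List[Dict[str, Any]]) -> List[Tuple[int, int]]:
    """Return sorted, de-duplicated (issue_id, pr_id) pairs from merged-PR nodes."""
    pairs = set()
    for pr in pr_nodes:
        # A PR with no closing-issue references contributes no pairs.
        closing = pr.get("closingIssuesReferences") or {}
        for issue in closing.get("nodes") or []:
            pairs.add((issue["number"], pr["number"]))
    return sorted(pairs)


# Hypothetical response fragment shaped like GitHub's PullRequest nodes.
sample_nodes = [
    {"number": 2686, "closingIssuesReferences": {"nodes": [{"number": 348}]}},
    {"number": 2700, "closingIssuesReferences": {"nodes": []}},
]
print(build_pairs(sample_nodes))
```

A PR that closes several issues yields one pair per issue in this sketch; whether the real script does the same is not stated in the README.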
|
|
||
| # Single entry — main.py | ||
|
|
||
| Use **main.py** as the single entry point; it is the high-level controller. Run it from the **project root**. | ||
|
|
||
| 1. **resolve** — calls `features/resolve_pairs.py` to fetch merged PRs via GraphQL and build (issue_id, pr_id) pairs from closing-issue references. Writes **resolve_cache.json** and **pairs.json** in the cache directory (default: **\<reponame\>_cache/**). Use **--refresh** to force re-fetch. | ||
| 2. **extract** — runs extract for each pair. Pairs are loaded from the resolve cache; if missing, resolve runs first. Writes **extract_cache.json** with `base_hash`, `merge_hash`, and `branches` per entry. No agent_change, no build. | ||
| 3. **apply-cursor** — reads **extract_cache.json**. For each entry whose `branches` does not yet have cursor refs, runs `agent_change` to create cursor branches, then adds `{h}-cursor` and `{h}-cursor-creative` to the entry's `branches` dict. Does NOT touch `dataset.jsonl`. | ||
| 4. **build** — strictly reads **extract_cache.json** and calls `build_one_row` per entry to produce **dataset.jsonl**. No extraction, no agent_change. If extract cache is missing, errors out. | ||
| 5. **all** — resolve (if needed) → extract → build. No cursor step; run `apply-cursor` separately if needed. | ||
|
|
||
| ## Pipeline flow | ||
|
|
||
| ``` | ||
| resolve → pairs.json → resolve_cache.json | ||
| ↓ | ||
| extract → run extract per pair, create {h}-base / {h}-human branches → extract_cache.json | ||
| ↓ ↓ | ||
| apply-cursor → run agent_change, create {h}-cursor / {h}-cursor-creative → extract_cache.json (branches updated with cursor refs) | ||
| ↓ ↓ | ||
| build → read extract_cache.json → build_one_row per entry → dataset.jsonl | ||
| ``` | ||
|
|
||
| **Branches:** `features/extract.py` creates only `{h}-base` and `{h}-human`. `features/agent_change.py` creates `{h}-cursor` and `{h}-cursor-creative` from base, then runs the Cursor agent on each. | ||
|
|
||
| **Cache directory:** By default, cache files are stored in **\<reponame\>_cache/** (e.g. `flask_cache/`) in the project root: `resolve_cache.json`, `pairs.json`, and `extract_cache.json`. Override with **--cache-dir DIR** (main.py) or **--cache DIR** (resolve_pairs.py). | ||
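The default location rule can be sketched as follows. This is an illustration only: `cache_paths` and its parameters are hypothetical names, and the actual resolution logic in `main.py` may differ.

```python
from pathlib import Path
from typing import Dict, Optional

CACHE_FILES = ("resolve_cache.json", "pairs.json", "extract_cache.json")


def cache_paths(project_root: str, repo_name: str,
                cache_dir: Optional[str] = None) -> Dict[str, Path]:
    """Map each cache file name to its path, honoring an explicit override."""
    # Default: <project_root>/<reponame>_cache/ ; an explicit directory wins.
    base = Path(cache_dir) if cache_dir else Path(project_root) / f"{repo_name}_cache"
    return {name: base / name for name in CACHE_FILES}


for name, path in cache_paths(".", "flask").items():
    print(name, "->", path)
```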
|
|
||
| **Extract cache format:** | ||
|
|
||
| ```json | ||
| {"issue_id": 348, "pr_id": 2686, "base_hash": "16d83d6b...", "merge_hash": "abba4b2a...", "branches": {"16d83d6b-base": "16d83d6b...", "16d83d6b-human": "abba4b2a..."}} | ||
| ``` | ||
|
|
||
| After `apply-cursor`: | ||
|
|
||
| ```json | ||
| {"issue_id": 348, "pr_id": 2686, "base_hash": "16d83d6b...", "merge_hash": "abba4b2a...", "branches": {"16d83d6b-base": "16d83d6b...", "16d83d6b-human": "abba4b2a...", "16d83d6b-cursor": "abc123...", "16d83d6b-cursor-creative": "def456..."}} | ||
| ``` | ||
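To make the two cache states concrete, here is a small sketch of how an entry could be checked for missing cursor refs. It assumes `h` is `base_hash[:8]`, as stated for `features/agent_change.py`; the helper name `missing_cursor_refs` is hypothetical, and the truncated hash strings are copied as-is from the examples above.

```python
from typing import Any, Dict, List


def missing_cursor_refs(entry: Dict[str, Any]) -> List[str]:
    """List the cursor branch names that are absent from entry['branches']."""
    h = entry["base_hash"][:8]  # 8-char branch prefix
    branches = entry.get("branches") or {}
    wanted = (f"{h}-cursor", f"{h}-cursor-creative")
    return [name for name in wanted if name not in branches]


# Entry as written by extract (hash values are the truncated placeholders above).
entry = {
    "issue_id": 348,
    "pr_id": 2686,
    "base_hash": "16d83d6b...",
    "merge_hash": "abba4b2a...",
    "branches": {"16d83d6b-base": "16d83d6b...", "16d83d6b-human": "abba4b2a..."},
}
print(missing_cursor_refs(entry))

# After apply-cursor adds both refs, nothing is missing.
entry["branches"]["16d83d6b-cursor"] = "abc123..."
entry["branches"]["16d83d6b-cursor-creative"] = "def456..."
print(missing_cursor_refs(entry))
```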
|
|
||
| ## Subcommands | ||
|
|
||
| | Subcommand | What it does | Options | Output | | ||
| |------------|--------------|---------|--------| | ||
| | **resolve** | Fetch merged PRs via GraphQL, build (issue_id, pr_id) pairs | `--cache-dir DIR`, `--refresh` | resolve_cache.json, pairs.json | | ||
| | **extract** | Run extract per pair, create branches, write extract cache | `--limit N`, `--cache-dir DIR` | extract_cache.json | | ||
| | **apply-cursor** | Run agent_change for pairs missing cursor hashes, update extract cache | `--limit N`, `--cache-dir DIR` | extract_cache.json (updated) | | ||
| | **build** | Build dataset.jsonl strictly from extract_cache.json | `--limit N`, `--cache-dir DIR` | dataset.jsonl | | ||
| | **all** | Resolve + extract + build (no cursor) | `--limit N`, `--cache-dir DIR` | extract_cache.json, dataset.jsonl | | ||
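The option layout in the table above could be wired up with `argparse` subparsers roughly like this. It is a sketch under assumptions, not the actual `main.py`: `make_parser` is a hypothetical name, and the defaults (`None` for both `--limit` and `--cache-dir`) are guesses.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per pipeline step."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("resolve", "extract", "apply-cursor", "build", "all"):
        p = sub.add_parser(name)
        p.add_argument("--cache-dir", metavar="DIR", default=None)
        if name == "resolve":
            # Only resolve takes --refresh (force re-fetch).
            p.add_argument("--refresh", action="store_true")
        else:
            # Every other subcommand takes --limit N.
            p.add_argument("--limit", metavar="N", type=int, default=None)
    return parser


args = make_parser().parse_args(["extract", "--limit", "5"])
print(args.command, args.limit, args.cache_dir)
```

`required=True` on `add_subparsers` needs Python 3.7 or newer, which matches the stated minimum version.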
|
|
||
| ## Examples | ||
|
|
||
| ```bash | ||
| python main.py resolve # fetch pairs via GraphQL | ||
| python main.py extract --limit 5 # extract first 5 pairs | ||
| python main.py apply-cursor --limit 5 # apply cursor agent to first 5 pairs | ||
| python main.py build # build dataset from extract cache | ||
| python main.py all --limit 5 # resolve + extract + build (first 5) | ||
| ``` | ||
|
|
||
| **Dataset output:** Each line in `dataset.jsonl` has `project`, `issue_text`, `issue_id`, `pr_text`, `pr_id`, `root_hash`, `base_hash`, `merge_hash`, `pr_diff`, `cursor_diff`, `cursor_creative_diff`. The cursor diffs are populated only if `apply-cursor` was run before `build`. Use `--limit` for testing. Failed pairs are skipped and reported. | ||
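A quick way to sanity-check rows is sketched below. The field list comes from the paragraph above; the helper names are hypothetical, and treating an empty or absent cursor diff as "not populated" is an assumption about how `build` represents rows created without `apply-cursor`.

```python
import json
from typing import Any, Dict, List

EXPECTED_FIELDS = (
    "project", "issue_text", "issue_id", "pr_text", "pr_id", "root_hash",
    "base_hash", "merge_hash", "pr_diff", "cursor_diff", "cursor_creative_diff",
)


def missing_fields(row: Dict[str, Any]) -> List[str]:
    """Return the expected field names that are absent from a dataset row."""
    return [name for name in EXPECTED_FIELDS if name not in row]


def has_cursor_diffs(row: Dict[str, Any]) -> bool:
    """True only if both cursor diffs are present and non-empty."""
    return bool(row.get("cursor_diff")) and bool(row.get("cursor_creative_diff"))


# Synthetic row with every field empty, standing in for one line of dataset.jsonl.
line = json.dumps({name: "" for name in EXPECTED_FIELDS})
row = json.loads(line)
print(missing_fields(row), has_cursor_diffs(row))
```

To check a real file, the same two functions can be applied to `json.loads(line)` for each line read from `dataset.jsonl`.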
|
|
||
| --- | ||
|
|
||
| # Standalone use of each file | ||
|
|
||
| Standalone usage for each script in **features/** (run from **project root** unless noted). | ||
|
|
||
| ## features/resolve_pairs.py | ||
|
|
||
| Fetch merged PRs via GraphQL, build (issue_id, pr_id) pairs. | ||
|
|
||
| ```bash | ||
| python features/resolve_pairs.py [--json] [--refs-only] [--cache DIR] | ||
| ``` | ||
|
|
||
| - No flags — print pairs, one `issue_id pr_id` per line. | ||
| - `--json` — pairs as JSON array, or (with `--refs-only`) refs as JSON. | ||
| - `--refs-only` — output refs only (issue_ids, pr_ids); no pairing. | ||
| - `--cache DIR` — directory for cache files (default: cwd/\<reponame\>_cache). Resolve writes `resolve_cache.json` there. | ||
|
|
||
| ## features/extract.py | ||
|
|
||
| Creates **two** branches only: `{h}-base` and `{h}-human`. Does not run the Cursor agent. Cursor branches are created in agent_change.py. | ||
|
|
||
| ```bash | ||
| python features/extract.py <repo_url> <issue_number> <pr_number> [--json] [--autoc] | ||
| ``` | ||
|
|
||
| - `--json` — machine-readable output: `base_hash`, `merge_hash`, `branches`. | ||
| - `--autoc` — clone repo if missing (non-interactive). | ||
|
|
||
| ## features/agent_change.py | ||
|
|
||
| Creates `{h}-cursor` and `{h}-cursor-creative` from `{h}-base`, then runs the Cursor agent on each. Call after **extract.py** (which creates base and human only). | ||
|
|
||
| ```bash | ||
| python features/agent_change.py <repo_url> <issue_number> <pr_number> <h> [--project-root P] | ||
| ``` | ||
|
|
||
| - `repo_url` — same as extract.py (e.g. `https://github.com/owner/repo`). | ||
| - `h` — 8-char branch prefix from extract output (`base_hash[:8]`). | ||
| - Requires agent CLI; set `AGENT_PATH` if not on PATH. | ||
|
|
||
| ## features/build_dataset.py | ||
|
|
||
| Build JSONL from cached state only (no extract/agent calls). Expects entries with `base_hash` and `merge_hash` (e.g. from `extract_cache.json`). Use **main.py build** for the standard flow. | ||
|
|
||
| ```bash | ||
| python features/build_dataset.py --pairs state.json [--output dataset.jsonl] [--limit N] [--project-root P] | ||
| ``` | ||
|
|
||
| - `--pairs` — **required**. JSON array of `{issue_id, pr_id, base_hash, merge_hash}` (branches must already exist). | ||
| - `--output` — default `dataset.jsonl`. | ||
| - `--limit` — process only first N pairs. |
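One way to prepare a `--pairs` file from cached entries is sketched here. It assumes the cached entries are available as a list of dicts shaped like the extract cache example; the on-disk layout of `extract_cache.json` is not specified in this README, and `to_pairs_state` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Iterable, List

REQUIRED_KEYS = ("issue_id", "pr_id", "base_hash", "merge_hash")


def to_pairs_state(entries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the keys --pairs expects, skipping entries without both hashes."""
    state = []
    for entry in entries:
        if not entry.get("base_hash") or not entry.get("merge_hash"):
            continue  # cannot build a row without both hashes
        state.append({key: entry.get(key) for key in REQUIRED_KEYS})
    return state


# Entries shaped like the extract cache example (truncated hashes kept as-is).
entries = [
    {"issue_id": 348, "pr_id": 2686, "base_hash": "16d83d6b...",
     "merge_hash": "abba4b2a...", "branches": {}},
    {"issue_id": None, "pr_id": 1, "base_hash": "", "merge_hash": ""},
]
print(json.dumps(to_pairs_state(entries)))
```

The resulting array could be written to `state.json` and passed via `--pairs state.json`; the branches named in the cache must already exist, as noted above.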
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| # Feature scripts: resolve_pairs, extract, agent_change, build_dataset |