fix(15): make the fine-tuning pipeline runnable end-to-end by corticalstack · Pull Request #17 · corticalstack/awesome-foundry-nextgen

corticalstack · 2026-05-27T15:40:03Z

Summary

Section 15 had multiple unrelated failures hidden behind each other. Running through 15-01 → 15-04 in order surfaced six distinct issues; this PR fixes them all so the pipeline runs end-to-end from a fresh clone.

Fixes (in the order you'd hit them)

1. 15-01 cell 3: `ModuleNotFoundError: No module named 'torch'`

torch / transformers / peft / matplotlib / azure-storage-blob / azure-ai-inference were never declared in pyproject.toml. Cell 2's inline %pip install also fails silently in this uv-managed .venv: the venv ships without pip, so %pip errors with "No module named pip", yet the cell still prints "Dependencies installed." afterwards.

Fix: new [dependency-groups] finetune entry. Users run uv sync --group finetune once. Cell 2 converted from broken code to a markdown note. Same pattern as azure-ai-evaluation[redteam] in section 14.
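A preflight check along these lines could make a missing group obvious before any heavy import runs. This is a minimal sketch, not code from the PR: the helper name `missing_modules` and the `FINETUNE_MODULES` list are hypothetical, and the import names for the two Azure packages (`azure.storage.blob`, `azure.ai.inference`) are assumed from their usual distributions.

```python
import importlib.util

# Import names assumed to correspond to the packages in the finetune group.
FINETUNE_MODULES = [
    "torch",
    "transformers",
    "peft",
    "matplotlib",
    "azure.storage.blob",
    "azure.ai.inference",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            # find_spec locates a module without importing it; for dotted names
            # it imports the parent package and raises if the parent is absent.
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing
```

In a notebook's first cell, a non-empty result from `missing_modules(FINETUNE_MODULES)` could be turned into an error that tells the reader to run `uv sync --group finetune`, instead of printing a success message unconditionally.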

2. 15-01 cell 8: `OpenAIError: Missing credentials`

FINETUNE_GATEWAY_KEY was never set in .env because 15-fine-tune/main.bicep doesn't create a dedicated foundry-gateway-finetune APIM subscription (10-01 / 11-01 do this in a separate step but 15-* never added it).

Fix: the teacher-model call only needs any valid APIM key. Cell 8 now reads ALPHA_GATEWAY_KEY (already in .env). 15-00 prereqs + Bicep deploy command updated to match. Removes a phantom env variable and one redundant Azure resource.
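The "Missing credentials" error only appears once the client is constructed, which is a step removed from the real cause. A small guard like the one below would fail earlier and name the variable. It is a sketch under assumptions: `require_env` is a hypothetical helper, not something the notebook defines.

```python
import os


def require_env(name, environ=None):
    """Return the value of environment variable `name`, or raise a clear error."""
    # `environ` is injectable so the helper can be exercised without touching
    # the real process environment.
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env before running this cell")
    return value
```

Cell 8 could then read the key with `require_env("ALPHA_GATEWAY_KEY")` after the .env file has been loaded.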

3. 15-02 ACA job: `The specified resource name length is not within the permissible limits`

Container name "ft" is two characters, below the 3-character minimum of Azure Storage's 3-63 character container-name rule. Container creation failed silently (the helper used check=False), and the in-job download from "ft" then failed visibly.

Fix: renamed to "finetune" in 15-02 / 15-03 / 15-04 cell-3 constants and in main.bicep. azure_infra.py takes the name as a parameter so no source change there.
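Validating the name locally would have caught this before any Azure call. The sketch below encodes the documented container-name rules (3-63 characters; lowercase letters, digits, and hyphens; starts and ends with a letter or digit; no consecutive hyphens). The function name `is_valid_container_name` is hypothetical and is not part of azure_infra.py.

```python
import re

# One alphanumeric character, then any number of alphanumerics each optionally
# preceded by a single hyphen. This rules out leading, trailing, and
# consecutive hyphens in one pattern.
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_container_name(name):
    """Check a blob container name against Azure Storage naming rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME.fullmatch(name) is not None
```

With this check, `is_valid_container_name("ft")` is false and `is_valid_container_name("finetune")` is true. Separately, running the creation helper with check=True would surface a failed create at the point it happens.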

4. 15-03: `NameError: name 'eval_dataset' is not defined`

15-03 implicitly relied on eval_dataset / reports_data / env_id / accuracy being in memory from 15-01 + 15-02's kernel.

Fix: Cell 3 now re-derives all four. Also added missing import pandas as pd + from IPython.display import display (caused a separate NameError: pd not defined in the comparison cell), and fixed an outdated "DeepSeek-V3.2 (Teacher)" label in the summary table to match the actual gpt-4.1-mini teacher.

5. 15-04 cell 5: `ImportError: cannot import name 'LossKwargs' from transformers.utils`

Phi-4-mini's custom remote modeling code imports symbols that were removed in transformers 5.x. The [finetune] group's transformers>=4.40.0 allowed 5.x.

Fix: tightened to >=4.46.0,<5.0.0 with a comment. Matches the transformers==4.53.3 pin the ACA fine-tune job uses inline.
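To illustrate what the tightened range admits, here is a simplified, stdlib-only range check. It is a sketch, not a PEP 440 implementation: `release_tuple` and `in_supported_range` are hypothetical names, and pre-release suffixes such as `rc1` are ignored rather than ordered correctly.

```python
import re


def release_tuple(version):
    """Parse the leading numeric parts of a version string into a 3-tuple."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # Pad short versions such as "5" or "4.46" with zeros.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def in_supported_range(version, lower=(4, 46, 0), upper=(5, 0, 0)):
    """True when lower <= version < upper, mirroring '>=4.46.0,<5.0.0'."""
    return lower <= release_tuple(version) < upper
```

The job's inline pin 4.53.3 falls inside the range, while 4.40.0 (allowed by the old lower bound) and any 5.x release fall outside it.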

6. 15-04 `generate()`: pad/eos attention-mask warning

apply_chat_template returned just the input_ids tensor and Phi-4-mini has pad_token == eos_token, so transformers couldn't auto-infer the mask.

Fix: apply_chat_template(..., return_dict=True) then generate(**inputs, ...). Removes the warning, makes generation reliable rather than best-effort.
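The reworked call pattern can be sketched as follows. This is an illustration under assumptions rather than the notebook's actual cell: `generate_reply` is a hypothetical wrapper, `model` and `tokenizer` are assumed to be an already-loaded transformers model and tokenizer on the same device, and `max_new_tokens=64` is an arbitrary default.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=64):
    """Generate a reply while passing an explicit attention mask to generate()."""
    # return_dict=True makes apply_chat_template return a mapping that holds
    # both input_ids and attention_mask, not a bare input_ids tensor.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    )
    # Unpacking the mapping forwards attention_mask alongside input_ids, so
    # nothing has to be inferred from pad/eos tokens.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and decode only the newly generated ones.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

If the model sits on a GPU, the returned inputs would also need to be moved to the model's device before the `generate()` call.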

Other changes

- TARGET_TOTAL = 500 → 100 in 15-01 (demo-sized, ~200 teacher calls). Expanded markdown header + inline comment explain when to bump back up.
- Committed 15-fine-tune/data/train.jsonl (100-example training set, scanned and clean) so 15-02 can be re-run without regenerating data.
- Gitignored 15-fine-tune/data/, 15-fine-tune/models/, 15-fine-tune/eval_job.json, 15-fine-tune/job.json (regenerable artifacts; models/ is ~350MB; the JSONs contain real subscription IDs when written locally). data/train.jsonl is the only file in data/ that's tracked, via git add -f.

Patch release 0.8.10.

Test plan

- Fresh clone + uv sync --group finetune succeeds
- 15-01 runs end-to-end: cell 3 imports succeed, cell 8 creates the AzureOpenAI client using ALPHA_GATEWAY_KEY, training data prep completes with ~100 examples
- 15-02 runs end-to-end: provision_infrastructure creates the finetune container; ACA fine-tune job submits and reaches Succeeded status
- 15-03 runs end-to-end in a fresh kernel: cell 3 re-derives eval_dataset / reports_data / env_id; eval job completes; comparison chart + summary table render
- 15-04 runs end-to-end: model loads without ImportError, generate() runs without the pad/eos warning

…nstall

15-01-data-preparation.ipynb failed at cell 3 with "ModuleNotFoundError: No module named 'torch'". Two underlying issues:

1. torch / transformers / peft / matplotlib / azure-storage-blob / azure-ai-inference were never declared in pyproject.toml.
2. Cell 2's inline `%pip install` silently fails in the uv-managed .venv because the venv doesn't ship pip. The cached output shows "<repo-root>/.venv/bin/python: No module named pip" yet the cell prints "Dependencies installed." regardless, so the failure is easy to miss.

15-04-local-inference.ipynb has the same dependency requirements (torch, transformers, peft) but no install cell at all; it was silently relying on 15-01 having run first. This affects it too.

Fix: declare the heavy ML deps in a new [dependency-groups] finetune entry in pyproject.toml. Users run `uv sync --group finetune` once before opening section 15. Base install stays lean for everyone else (~3 GB saved when section 15 isn't needed). Matches the pattern used for azure-ai-evaluation[redteam] in section 14.

Cell 2 of 15-01 converted from broken %pip install code to a markdown cell pointing at the new group.

15-00 prerequisites step 4 updated to specify `uv sync --group finetune` with explanation of what the group includes.

…m FINETUNE_GATEWAY_KEY

After the dependency-group fix unblocked imports, 15-01 cell 8 failed next with "OpenAIError: Missing credentials" because the env var FINETUNE_GATEWAY_KEY was never set in .env.

Tracing it: 15-fine-tune/main.bicep does not create a dedicated foundry-gateway-finetune APIM subscription (10-01 / 11-01 do this in a separate step but 15-* never added it). The variable was always going to be missing for anyone following the docs.

The teacher-model call (and the apimSubscriptionKey parameter the Bicep takes) only need any valid APIM subscription key for the shared gateway. ALPHA_GATEWAY_KEY is already in .env from the project-spoke deployment, so reuse it. Removes one phantom env variable and avoids provisioning a redundant APIM subscription.

Changes:
- 15-01-data-preparation.ipynb cell 8: GATEWAY_KEY now reads ALPHA_GATEWAY_KEY, with an inline comment explaining the choice.
- 15-00-fine-tune.md prerequisites: .env block lists ALPHA_GATEWAY_KEY instead of FINETUNE_GATEWAY_KEY, with a note explaining the rationale. Bicep deploy command uses $ALPHA_GATEWAY_KEY as apimSubscriptionKey.

…ttention mask

Six additional fixes after the [finetune] dep group + ALPHA_GATEWAY_KEY swap (earlier commits in this PR). Together these make the section 15 pipeline runnable end-to-end.

1. Container name "ft" was too short (Azure Storage requires 3-63 chars) and silently failed to create, causing the ACA fine-tune job to error on download. Renamed to "finetune" in:
   - 15-02 / 15-03 / 15-04 cell-3 constants
   - 15-fine-tune/main.bicep

   Note: azure_infra.py uses the parameter so no source change needed.

2. 15-03 was not self-contained. It relied on eval_dataset / reports_data / env_id / accuracy being in memory from 15-01 + 15-02's kernel. Running 15-03 in a fresh kernel produced NameError. Cell 3 now re-derives eval_dataset (via iss_utils.get_evaluation_dataset), reports_data (NASA fetch), env_id (az containerapp env show), and sets fallback teacher/base accuracies with comments. Also added pandas + IPython.display imports to cell 2 (previously caused NameError in the comparison cell), and fixed an outdated "DeepSeek-V3.2 (Teacher)" label in the summary table to match the actual teacher model gpt-4.1-mini used throughout the chain.

3. The [finetune] dep group's transformers constraint was too loose (">=4.40.0" allowed 5.x). Phi-4-mini's custom remote modeling code imports symbols (LossKwargs, etc.) that were removed in transformers 5.x, causing 15-04 to fail with ImportError at model load. Tightened to ">=4.46.0,<5.0.0", with a comment explaining why; matches the transformers==4.53.3 pin the ACA fine-tune job uses inline.

4. 15-04's generate() call passed just the input_ids tensor, triggering a warning that the attention mask couldn't be inferred (pad_token == eos_token for Phi-4-mini). Reworked to call apply_chat_template with return_dict=True and pass **inputs to generate(), so both input_ids and attention_mask are present. Removes the warning and makes generation behaviour reliable rather than best-effort.

5. 15-01 TARGET_TOTAL reduced from 500 to 100 with a rationale block ("demo-sized, ~200 teacher calls; bump to 500-1000+ for richer distillation, costs/runtime scale linearly"). The markdown header above the cell ("Generate Training Data") was rewritten with the same explanation so readers see it before hitting the code.

6. Gitignored section 15 artifacts that should not be committed: 15-fine-tune/data/ (regenerable training JSONL), 15-fine-tune/models/ (~350MB of LoRA adapter weights), 15-fine-tune/eval_job.json + job.json (contain real subscription IDs when run locally).

100-example synthetic + real-data training set generated by 15-01 data-preparation against the gpt-4.1-mini teacher. Useful as a fixed artifact so:
- 15-02 can be re-run for the fine-tune ACA job without regenerating data from scratch (saves ~200 teacher calls + a few minutes).
- The PR review / future reader can inspect what the LoRA actually trains on without provisioning anything.

File contains only synthetic ISS incident classifications (system prompt + NASA report text + assistant-generated severity/category label). Scanned for tenant identifiers, /home paths, UUIDs, and APIM-key-shaped hex strings; all clean.

The 15-fine-tune/data/ directory itself stays gitignored so future notebook runs that produce additional intermediates don't accidentally get committed; train.jsonl is tracked explicitly via `git add -f`.

corticalstack added 3 commits May 27, 2026 17:39

chore: bump version to 0.8.10 and add release notes

58cb73a

corticalstack changed the title ~~fix(15): add [finetune] dependency group; remove broken inline %pip install~~ fix(15): unblock 15-01 (add [finetune] dep group + reuse ALPHA_GATEWAY_KEY) May 27, 2026

corticalstack added 3 commits May 27, 2026 23:29

chore: expand 0.8.10 changelog with the bundled section-15 fixes

2e0a679

corticalstack changed the title ~~fix(15): unblock 15-01 (add [finetune] dep group + reuse ALPHA_GATEWAY_KEY)~~ fix(15): make the fine-tuning pipeline runnable end-to-end May 27, 2026

corticalstack merged commit 403ee0d into main May 27, 2026
1 of 2 checks passed

corticalstack deleted the fix/section-15-finetune-dependency-group branch May 27, 2026 21:32

corticalstack merged 6 commits into main from fix/section-15-finetune-dependency-group
