fix(15): make the fine-tuning pipeline runnable end-to-end #17
Merged
…nstall

15-01-data-preparation.ipynb failed at cell 3 with "ModuleNotFoundError: No module named 'torch'". Two underlying issues:

1. torch / transformers / peft / matplotlib / azure-storage-blob / azure-ai-inference were never declared in pyproject.toml.
2. Cell 2's inline `%pip install` silently fails in the uv-managed .venv because the venv doesn't ship pip. The cached output shows "<repo-root>/.venv/bin/python: No module named pip" yet the cell prints "Dependencies installed." regardless, so the failure is easy to miss.

15-04-local-inference.ipynb has the same dependency requirements (torch, transformers, peft) but no install cell at all; it was silently relying on 15-01 having run first. This fix covers it too.

Fix: declare the heavy ML deps in a new [dependency-groups] finetune entry in pyproject.toml. Users run `uv sync --group finetune` once before opening section 15. Base install stays lean for everyone else (~3 GB saved when section 15 isn't needed). Matches the pattern used for azure-ai-evaluation[redteam] in section 14.

Cell 2 of 15-01 converted from broken %pip install code to a markdown cell pointing at the new group. 15-00 prerequisites step 4 updated to specify `uv sync --group finetune` with an explanation of what the group includes.
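The silent "Dependencies installed." message described above could be avoided with an explicit preflight check. The following is a minimal sketch, not code from this PR: the function names `missing_modules` and `check_finetune_deps` are hypothetical, and the module list assumes the usual import names for the packages in the finetune group.

```python
import importlib.util

# Import names assumed for the packages in the finetune group.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "peft",
    "matplotlib",
    "azure.storage.blob",
    "azure.ai.inference",
]


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ModuleNotFoundError, ValueError):
            # find_spec raises when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


def check_finetune_deps(names=REQUIRED_MODULES):
    """Raise with an actionable message instead of printing success regardless."""
    missing = missing_modules(names)
    if missing:
        raise ModuleNotFoundError(
            "Missing modules: " + ", ".join(missing)
            + ". Run `uv sync --group finetune` before opening section 15."
        )
    return True
```

Calling `check_finetune_deps()` in the first code cell of a notebook would fail loudly at the point where the environment is wrong, rather than several cells later at the first `import torch`.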
…m FINETUNE_GATEWAY_KEY

After the dependency-group fix unblocked imports, 15-01 cell 8 failed next with "OpenAIError: Missing credentials" because the env var FINETUNE_GATEWAY_KEY was never set in .env.

Tracing it: 15-fine-tune/main.bicep does not create a dedicated foundry-gateway-finetune APIM subscription (10-01 / 11-01 do this in a separate step but 15-* never added it). The variable was always going to be missing for anyone following the docs.

The teacher-model call (and the apimSubscriptionKey parameter the Bicep takes) only needs a valid APIM subscription key for the shared gateway. ALPHA_GATEWAY_KEY is already in .env from the project-spoke deployment, so reuse it. Removes one phantom env variable and avoids provisioning a redundant APIM subscription.

Changes:
- 15-01-data-preparation.ipynb cell 8: GATEWAY_KEY now reads ALPHA_GATEWAY_KEY, with an inline comment explaining the choice.
- 15-00-fine-tune.md prerequisites: .env block lists ALPHA_GATEWAY_KEY instead of FINETUNE_GATEWAY_KEY, with a note explaining the rationale. Bicep deploy command uses $ALPHA_GATEWAY_KEY as apimSubscriptionKey.
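Reading the key with an early, explicit check would surface a missing variable before the client raises "Missing credentials". A minimal sketch under that assumption follows; `read_gateway_key` is a hypothetical helper, not the code in cell 8.

```python
import os


def read_gateway_key(env=None, name="ALPHA_GATEWAY_KEY"):
    """Return the gateway key from the environment, failing early if it is absent.

    `env` defaults to os.environ; passing a plain dict makes the helper testable.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. It is expected in .env from the "
            "project-spoke deployment."
        )
    return key
```

In a notebook this would typically run after the `.env` file has been loaded into the process environment, for example `GATEWAY_KEY = read_gateway_key()`.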
…ttention mask
Six additional fixes after the [finetune] dep group + ALPHA_GATEWAY_KEY
swap (earlier commits in this PR). Together these make the section 15
pipeline runnable end-to-end.
1. Container name "ft" was too short (Azure Storage requires 3-63 chars)
and silently failed to create, causing the ACA fine-tune job to error
on download. Renamed to "finetune" in:
- 15-02 / 15-03 / 15-04 cell-3 constants
- 15-fine-tune/main.bicep
Note: azure_infra.py uses the parameter so no source change needed.
2. 15-03 was not self-contained. It relied on eval_dataset / reports_data
/ env_id / accuracy being in memory from 15-01 + 15-02's kernel.
Running 15-03 in a fresh kernel produced NameError. Cell 3 now
re-derives eval_dataset (via iss_utils.get_evaluation_dataset),
reports_data (NASA fetch), env_id (az containerapp env show), and
sets fallback teacher/base accuracies with comments. Also added
pandas + IPython.display imports to cell 2 (previously caused
NameError in the comparison cell), and fixed an outdated "DeepSeek-V3.2
(Teacher)" label in the summary table to match the actual teacher
model gpt-4.1-mini used throughout the chain.
3. The [finetune] dep group's transformers constraint was too loose
(">=4.40.0" allowed 5.x). Phi-4-mini's custom remote modeling code
imports symbols (LossKwargs, etc.) that were removed in transformers
5.x, causing 15-04 to fail with ImportError at model load. Tightened
to ">=4.46.0,<5.0.0", with a comment explaining why; matches the
transformers==4.53.3 pin the ACA fine-tune job uses inline.
4. 15-04's generate() call passed just the input_ids tensor, triggering a
warning that the attention mask couldn't be inferred (pad_token ==
eos_token for Phi-4-mini). Reworked to call apply_chat_template with
return_dict=True and pass **inputs to generate(), so both input_ids
and attention_mask are present. Removes the warning and makes
generation behaviour reliable rather than best-effort.
5. 15-01 TARGET_TOTAL reduced from 500 to 100 with a rationale block
("demo-sized, ~200 teacher calls; bump to 500-1000+ for richer
distillation, costs/runtime scale linearly"). The markdown header
above the cell ("Generate Training Data") was rewritten with the
same explanation so readers see it before hitting the code.
6. Gitignored section 15 artifacts that should not be committed:
15-fine-tune/data/ (regenerable training JSONL),
15-fine-tune/models/ (~350MB of LoRA adapter weights),
15-fine-tune/eval_job.json + job.json (contain real subscription IDs
when run locally).
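The container-name failure in item 1 could be caught before any Azure call is made. The sketch below validates a name against the blob container naming rules (3-63 characters; lowercase letters, digits, and hyphens; starts and ends with a letter or digit; no consecutive hyphens). The helper name `is_valid_container_name` is hypothetical, and special containers such as `$root` are deliberately out of scope.

```python
import re

# One alphanumeric, then any number of (optional hyphen + alphanumeric) groups.
# This forbids leading, trailing, and consecutive hyphens.
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_container_name(name):
    """Check a blob container name against the length and character rules."""
    if not 3 <= len(name) <= 63:
        return False
    return _CONTAINER_NAME.fullmatch(name) is not None
```

With this check, `is_valid_container_name("ft")` is false and `is_valid_container_name("finetune")` is true, so asserting it next to the cell-3 constants would turn the silent creation failure into an immediate error.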
100-example synthetic + real-data training set generated by 15-01 data-preparation against the gpt-4.1-mini teacher. Useful as a fixed artifact so:
- 15-02 can be re-run for the fine-tune ACA job without regenerating data from scratch (saves ~200 teacher calls + a few minutes).
- The PR reviewer / future reader can inspect what the LoRA actually trains on without provisioning anything.

File contains only synthetic ISS incident classifications (system prompt + NASA report text + assistant-generated severity/category label). Scanned for tenant identifiers, /home paths, UUIDs, and APIM-key-shaped hex strings; all clean.

The 15-fine-tune/data/ directory itself stays gitignored so future notebook runs that produce additional intermediates don't accidentally get committed; train.jsonl is tracked explicitly via `git add -f`.
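The commit message does not say how the scan was performed. As one possible approach, the sketch below checks each line against a few regular expressions; the patterns, the labels, and the assumption that an APIM-key-shaped string is a run of 32 hex characters are illustrative choices, not the method actually used.

```python
import re

# Illustrative patterns only; tune them for the identifiers that matter locally.
PATTERNS = {
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "home_path": re.compile(r"/home/[^\s\"']+"),
    "hex_key": re.compile(r"\b[0-9a-fA-F]{32}\b"),  # assumed key shape
}


def scan_lines(lines):
    """Return (line_number, label) pairs for every pattern that matches a line."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

Passing an open file handle for `train.jsonl` to `scan_lines` works because file objects iterate line by line; an empty result means none of these patterns matched.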
Summary
Section 15 had multiple unrelated failures hidden behind each other. Running through 15-01 → 15-04 in order surfaced six distinct issues; this PR fixes them all so the pipeline runs end-to-end from a fresh clone.
Fixes (in the order you'd hit them)
1. 15-01 cell 3: `ModuleNotFoundError: No module named 'torch'`
`torch` / `transformers` / `peft` / `matplotlib` / `azure-storage-blob` / `azure-ai-inference` were never in `pyproject.toml`. Cell 2's inline `%pip install` silently fails in this uv-managed `.venv` (no pip shipped, so `%pip` errors with `No module named pip` and then unhelpfully follows with `Dependencies installed.`).
Fix: new `[dependency-groups] finetune` entry. Users run `uv sync --group finetune` once. Cell 2 converted from broken code to a markdown note. Same pattern as `azure-ai-evaluation[redteam]` in section 14.

2. 15-01 cell 8: `OpenAIError: Missing credentials`
`FINETUNE_GATEWAY_KEY` was never set in `.env` because `15-fine-tune/main.bicep` doesn't create a dedicated `foundry-gateway-finetune` APIM subscription (10-01 / 11-01 do this in a separate step but 15-* never added it).
Fix: the teacher-model call only needs a valid APIM key. Cell 8 now reads `ALPHA_GATEWAY_KEY` (already in `.env`). 15-00 prereqs + Bicep deploy command updated to match. Removes a phantom env variable and one redundant Azure resource.

3. 15-02 ACA job: `The specified resource name length is not within the permissible limits`
Container name `"ft"` violated Azure Storage's 3-63 character length limits. The container creation silently failed (helper used `check=False`), then the in-job download from `"ft"` errored visibly.
Fix: renamed to `"finetune"` in 15-02 / 15-03 / 15-04 cell-3 constants and in `main.bicep`. `azure_infra.py` takes the name as a parameter so no source change there.

4. 15-03: `NameError: name 'eval_dataset' is not defined`
15-03 implicitly relied on `eval_dataset` / `reports_data` / `env_id` / `accuracy` being in memory from 15-01 + 15-02's kernel.
Fix: Cell 3 now re-derives all four. Also added missing `import pandas as pd` + `from IPython.display import display` (caused a separate `NameError: pd not defined` in the comparison cell), and fixed an outdated `"DeepSeek-V3.2 (Teacher)"` label in the summary table to match the actual `gpt-4.1-mini` teacher.

5. 15-04 cell 5: `ImportError: cannot import name 'LossKwargs' from transformers.utils`
Phi-4-mini's custom remote modeling code imports symbols that were removed in transformers 5.x. The `[finetune]` group's `transformers>=4.40.0` allowed 5.x.
Fix: tightened to `>=4.46.0,<5.0.0` with a comment. Matches the `transformers==4.53.3` pin the ACA fine-tune job uses inline.

6. 15-04 `generate()`: pad/eos attention-mask warning
`apply_chat_template` returned just the input_ids tensor and Phi-4-mini has `pad_token == eos_token`, so transformers couldn't auto-infer the mask.
Fix: `apply_chat_template(..., return_dict=True)` then `generate(**inputs, ...)`. Removes the warning, makes generation reliable rather than best-effort.

Other changes
- `TARGET_TOTAL = 500 → 100` in 15-01 (demo-sized, ~200 teacher calls). Expanded markdown header + inline comment explain when to bump back up.
- Committed `15-fine-tune/data/train.jsonl` (100-example training set, scanned and clean) so 15-02 can be re-run without regenerating data.
- Gitignored `15-fine-tune/data/`, `15-fine-tune/models/`, `15-fine-tune/eval_job.json`, `15-fine-tune/job.json` (regenerable artifacts; models/ is ~350MB; the JSONs contain real subscription IDs when written locally). `data/train.jsonl` is the only file in `data/` that's tracked, via `git add -f`.

Patch release 0.8.10.
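The version window from fix 5 could also be asserted at notebook start, so a mismatched environment fails with a clear message rather than an `ImportError` at model load. The sketch below is a simplified check, not a full PEP 440 parser: it compares only the first three numeric components, and the names `parse_release` and `transformers_supported` are hypothetical.

```python
def parse_release(version):
    """Extract up to three leading numeric components, e.g. '4.53.3' -> (4, 53, 3).

    Simplification: suffixes such as 'rc1' or '.dev0' are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def transformers_supported(version, minimum=(4, 46, 0), ceiling=(5, 0, 0)):
    """Return True when `version` falls inside the >=4.46.0,<5.0.0 window."""
    return minimum <= parse_release(version) < ceiling
```

In practice the installed version string can be obtained with `importlib.metadata.version("transformers")` and passed to `transformers_supported`.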
Test plan
- `uv sync --group finetune` succeeds
- … `ALPHA_GATEWAY_KEY`, training data prep completes with ~100 examples
- … `provision_infrastructure` creates the `finetune` container; ACA fine-tune job submits and reaches `Succeeded` status
- … `eval_dataset` / `reports_data` / `env_id`; eval job completes; comparison chart + summary table render
- … `ImportError`, `generate()` runs without the pad/eos warning
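For the last check, the pattern from fix 6 can be sketched as a small helper. This is an illustration of the `return_dict=True` plus `generate(**inputs)` approach, not the notebook's actual cell: `generate_reply` and its default `max_new_tokens` are hypothetical, and `model` / `tokenizer` are assumed to be already-loaded Hugging Face objects.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=64):
    """Generate a reply while passing both input_ids and attention_mask."""
    # return_dict=True yields a mapping with input_ids AND attention_mask,
    # instead of a bare input_ids tensor.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    )
    inputs = inputs.to(model.device)

    # Unpacking the mapping forwards the attention mask explicitly, so it
    # does not have to be inferred when pad_token == eos_token.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the echoed prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

The helper takes the model and tokenizer as arguments rather than loading them, which keeps the sketch independent of any particular checkpoint or adapter path.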