fix(15): make the fine-tuning pipeline runnable end-to-end#17

Merged
corticalstack merged 6 commits into main from fix/section-15-finetune-dependency-group on May 27, 2026
@corticalstack corticalstack commented May 27, 2026

Summary

Section 15 had multiple unrelated failures hidden behind each other. Running through 15-01 → 15-04 in order surfaced six distinct issues; this PR fixes them all so the pipeline runs end-to-end from a fresh clone.

Fixes (in the order you'd hit them)

1. 15-01 cell 3: ModuleNotFoundError: No module named 'torch'

torch / transformers / peft / matplotlib / azure-storage-blob / azure-ai-inference were never declared in pyproject.toml. Cell 2's inline %pip install fails silently in this uv-managed .venv: the venv ships without pip, so %pip errors with "No module named pip", yet the cell still prints "Dependencies installed." afterwards.

Fix: new [dependency-groups] finetune entry. Users run uv sync --group finetune once. Cell 2 converted from broken code to a markdown note. Same pattern as azure-ai-evaluation[redteam] in section 14.
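
A preflight check along these lines could make a missing group obvious before cell 3 runs. This is a minimal sketch, not code from the notebooks: the `find_missing` helper is hypothetical, and the dotted import names for the two Azure packages (`azure.storage.blob`, `azure.ai.inference`) are assumed from the distribution names listed above.

```python
from importlib.util import find_spec

# Import names assumed from the packages named in this PR; the two Azure
# distributions are imported under dotted module paths.
FINETUNE_MODULES = [
    "torch",
    "transformers",
    "peft",
    "matplotlib",
    "azure.storage.blob",
    "azure.ai.inference",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            if find_spec(name) is None:
                missing.append(name)
        except ImportError:
            # find_spec raises when a parent package (e.g. "azure") is absent.
            missing.append(name)
    return missing


missing = find_missing(FINETUNE_MODULES)
if missing:
    print(f"Missing: {', '.join(missing)}. Run `uv sync --group finetune` first.")
else:
    print("All fine-tune dependencies are importable.")
```

Unlike the old install cell, the message printed here depends on what was actually found, so it cannot report success when nothing was installed.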

2. 15-01 cell 8: OpenAIError: Missing credentials

FINETUNE_GATEWAY_KEY was never set in .env because 15-fine-tune/main.bicep doesn't create a dedicated foundry-gateway-finetune APIM subscription (10-01 / 11-01 do this in a separate step but 15-* never added it).

Fix: the teacher-model call only needs any valid APIM key. Cell 8 now reads ALPHA_GATEWAY_KEY (already in .env). 15-00 prereqs + Bicep deploy command updated to match. Removes a phantom env variable and one redundant Azure resource.
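
Reading the key through a small guard would surface a missing variable with a clearer message than the client's "Missing credentials" error. The sketch below is illustrative only: `require_env` is a hypothetical helper, not something in the notebooks, and the demo uses an in-memory mapping with a placeholder value rather than a real .env.

```python
import os


def require_env(name, environ=None):
    """Return the value of `name`, or raise a readable error if it is unset or blank."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env before running this cell.")
    return value


# Demo with an in-memory mapping so the sketch runs without a real .env.
demo_env = {"ALPHA_GATEWAY_KEY": "example-key"}
GATEWAY_KEY = require_env("ALPHA_GATEWAY_KEY", demo_env)
```

In a notebook the second argument would be omitted so the helper reads from `os.environ`.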

3. 15-02 ACA job: The specified resource name length is not within the permissible limits

Container name "ft" is two characters long, below the minimum of Azure Storage's 3-63 character limit for container names. The container creation silently failed (helper used check=False), then the in-job download from "ft" errored visibly.

Fix: renamed to "finetune" in 15-02 / 15-03 / 15-04 cell-3 constants and in main.bicep. azure_infra.py takes the name as a parameter so no source change there.
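
A name check before any Azure call would have caught this locally. The sketch below encodes the documented container naming rules (3-63 characters; lowercase letters, digits, and hyphens; starts and ends with a letter or digit; no consecutive hyphens). The function name is hypothetical, and reserved names such as `$root` are deliberately out of scope.

```python
import re

# 1 leading + 1..61 middle + 1 trailing character gives a length of 3..63.
# The negative lookahead rejects consecutive hyphens anywhere in the name.
_CONTAINER_NAME = re.compile(r"(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_container_name(name):
    """Return True if `name` satisfies the Azure Storage container naming rules."""
    return _CONTAINER_NAME.fullmatch(name) is not None


for candidate in ("ft", "finetune"):
    verdict = "ok" if is_valid_container_name(candidate) else "rejected"
    print(f"{candidate!r}: {verdict}")
```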

4. 15-03: NameError: name 'eval_dataset' is not defined

15-03 implicitly relied on eval_dataset / reports_data / env_id / accuracy being in memory from 15-01 + 15-02's kernel.

Fix: Cell 3 now re-derives all four. Also added missing import pandas as pd + from IPython.display import display (caused a separate NameError: pd not defined in the comparison cell), and fixed an outdated "DeepSeek-V3.2 (Teacher)" label in the summary table to match the actual gpt-4.1-mini teacher.
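
The fix itself re-derives the values in cell 3. As a complementary sketch (not what the notebook does), a guard like the one below could report which kernel-state names are absent before a later cell trips over them. `missing_names` is a hypothetical helper; in a notebook the namespace argument would be `globals()`.

```python
REQUIRED_NAMES = ["eval_dataset", "reports_data", "env_id", "accuracy"]


def missing_names(required, namespace):
    """Return the entries of `required` that are not defined in `namespace`."""
    return [name for name in required if name not in namespace]


# Demo against a fake namespace that only has env_id defined.
fake_kernel_state = {"env_id": "placeholder"}
print(missing_names(REQUIRED_NAMES, fake_kernel_state))
```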

5. 15-04 cell 5: ImportError: cannot import name 'LossKwargs' from transformers.utils

Phi-4-mini's custom remote modeling code imports symbols that were removed in transformers 5.x. The [finetune] group's transformers>=4.40.0 allowed 5.x.

Fix: tightened to >=4.46.0,<5.0.0 with a comment. Matches the transformers==4.53.3 pin the ACA fine-tune job uses inline.
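
To make the range concrete, here is a simplified check of which transformers versions the new constraint admits. It is a sketch under stated assumptions: it compares only the numeric release part of the version string and ignores pre-release and dev suffixes, so it is not a full PEP 440 implementation, and both function names are hypothetical.

```python
import re


def parse_release(version):
    """Parse the leading 'X.Y' or 'X.Y.Z' of a version string into a 3-tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def in_finetune_range(version):
    """Return True if `version` falls inside >=4.46.0,<5.0.0 (release part only)."""
    return (4, 46, 0) <= parse_release(version) < (5, 0, 0)


for v in ("4.40.0", "4.53.3", "5.0.0"):
    print(v, in_finetune_range(v))
```

The ACA job's inline pin, 4.53.3, sits inside the range, while any 5.x release is excluded.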

6. 15-04 generate(): pad/eos attention-mask warning

apply_chat_template returned just the input_ids tensor and Phi-4-mini has pad_token == eos_token, so transformers couldn't auto-infer the mask.

Fix: apply_chat_template(..., return_dict=True) then generate(**inputs, ...). Removes the warning, makes generation reliable rather than best-effort.
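
The reworked call pattern might look roughly like the sketch below. The `generate_reply` wrapper, its parameter names, and the `max_new_tokens` default are illustrative assumptions rather than the notebook's actual code; `model` and `tokenizer` are expected to be an already-loaded Hugging Face causal LM and its tokenizer.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=64):
    """Generate a reply, passing input_ids and attention_mask together."""
    # return_dict=True yields a mapping with both input_ids and attention_mask,
    # instead of a bare input_ids tensor.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_tensors="pt",
        return_dict=True,
    )
    # On a GPU, the inputs would first need moving to the model's device,
    # e.g. inputs = inputs.to(model.device); omitted here for brevity.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Strip the prompt tokens so only the newly generated text is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Because the mask is passed explicitly, generation no longer depends on transformers guessing padding positions from a pad token that is identical to the eos token.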

Other changes

  • TARGET_TOTAL = 500 → 100 in 15-01 (demo-sized, ~200 teacher calls). Expanded markdown header + inline comment explain when to bump back up.
  • Committed 15-fine-tune/data/train.jsonl (100-example training set, scanned and clean) so 15-02 can be re-run without regenerating data.
  • Gitignored 15-fine-tune/data/, 15-fine-tune/models/, 15-fine-tune/eval_job.json, 15-fine-tune/job.json (regenerable artifacts; models/ is ~350MB; the JSONs contain real subscription IDs when written locally). data/train.jsonl is the only file in data/ that's tracked, via git add -f.
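
For a rough sense of what bumping TARGET_TOTAL back up implies, the figures above (about 200 teacher calls for 100 examples) can be extrapolated linearly. The two-calls-per-example ratio is an assumption inferred from those numbers, not a measured constant, and `estimated_teacher_calls` is a hypothetical helper.

```python
# Assumption: ~200 teacher calls for 100 examples implies ~2 calls per example.
CALLS_PER_EXAMPLE = 200 / 100


def estimated_teacher_calls(target_total):
    """Estimate teacher calls for a given TARGET_TOTAL, assuming linear scaling."""
    return round(target_total * CALLS_PER_EXAMPLE)


for target in (100, 500, 1000):
    print(f"TARGET_TOTAL={target}: ~{estimated_teacher_calls(target)} teacher calls")
```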

Patch release 0.8.10.

Test plan

  • Fresh clone + uv sync --group finetune succeeds
  • 15-01 runs end-to-end: cell 3 imports succeed, cell 8 creates the AzureOpenAI client using ALPHA_GATEWAY_KEY, training data prep completes with ~100 examples
  • 15-02 runs end-to-end: provision_infrastructure creates the finetune container; ACA fine-tune job submits and reaches Succeeded status
  • 15-03 runs end-to-end in a fresh kernel: cell 3 re-derives eval_dataset / reports_data / env_id; eval job completes; comparison chart + summary table render
  • 15-04 runs end-to-end: model loads without ImportError, generate() runs without the pad/eos warning

…nstall

15-01-data-preparation.ipynb failed at cell 3 with
"ModuleNotFoundError: No module named 'torch'". Two underlying issues:

1. torch / transformers / peft / matplotlib / azure-storage-blob /
   azure-ai-inference were never declared in pyproject.toml.
2. Cell 2's inline `%pip install` silently fails in the uv-managed
   .venv because the venv doesn't ship pip. The cached output shows
   "<repo-root>/.venv/bin/python: No module named pip" yet the cell
   prints "Dependencies installed." regardless, so the failure is
   easy to miss.
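
The second issue is an instance of a general pattern: a step that reports success without looking at the command's outcome. The notebook used the `%pip` magic rather than `subprocess`, so the sketch below is only an analogous illustration; the helper names and the deliberately failing command are hypothetical.

```python
import subprocess
import sys

# A stand-in for any command that fails, such as invoking pip where none exists.
FAILING_CMD = [sys.executable, "-c", "import sys; sys.exit(1)"]


def run_unchecked(cmd):
    """Run `cmd` and ignore failure: the caller only sees a return code."""
    result = subprocess.run(cmd, check=False)
    return result.returncode


def run_checked(cmd):
    """Run `cmd` and turn a non-zero exit into an explicit failure message."""
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as exc:
        return f"failed with exit code {exc.returncode}"
    return "ok"


print(run_unchecked(FAILING_CMD))  # non-zero, but nothing stops the next line
print(run_checked(FAILING_CMD))
```

With the unchecked variant, any success message printed afterwards appears regardless of the exit code, which is how a failed install can look like a successful one.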

15-04-local-inference.ipynb has the same dependency requirements
(torch, transformers, peft) but no install cell at all - it was
silently relying on 15-01 having run first. This affects it too.

Fix: declare the heavy ML deps in a new [dependency-groups] finetune
entry in pyproject.toml. Users run `uv sync --group finetune` once
before opening section 15. Base install stays lean for everyone else
(~3 GB saved when section 15 isn't needed). Matches the pattern used
for azure-ai-evaluation[redteam] in section 14.

Cell 2 of 15-01 converted from broken %pip install code to a markdown
cell pointing at the new group. 15-00 prerequisites step 4 updated to
specify `uv sync --group finetune` with explanation of what the group
includes.
…m FINETUNE_GATEWAY_KEY

After the dependency-group fix unblocked imports, 15-01 cell 8 failed
next with "OpenAIError: Missing credentials" because the env var
FINETUNE_GATEWAY_KEY was never set in .env.

Tracing it: 15-fine-tune/main.bicep does not create a dedicated
foundry-gateway-finetune APIM subscription (10-01 / 11-01 do this in a
separate step but 15-* never added it). The variable was always going
to be missing for anyone following the docs.

The teacher-model call (and the apimSubscriptionKey parameter the
Bicep takes) only need any valid APIM subscription key for the shared
gateway. ALPHA_GATEWAY_KEY is already in .env from the project-spoke
deployment, so reuse it. Removes one phantom env variable and avoids
provisioning a redundant APIM subscription.

Changes:
- 15-01-data-preparation.ipynb cell 8: GATEWAY_KEY now reads
  ALPHA_GATEWAY_KEY, with an inline comment explaining the choice.
- 15-00-fine-tune.md prerequisites: .env block lists ALPHA_GATEWAY_KEY
  instead of FINETUNE_GATEWAY_KEY, with a note explaining the
  rationale. Bicep deploy command uses $ALPHA_GATEWAY_KEY as
  apimSubscriptionKey.
@corticalstack corticalstack changed the title from "fix(15): add [finetune] dependency group; remove broken inline %pip install" to "fix(15): unblock 15-01 (add [finetune] dep group + reuse ALPHA_GATEWAY_KEY)" on May 27, 2026
…ttention mask

Six additional fixes after the [finetune] dep group + ALPHA_GATEWAY_KEY
swap (earlier commits in this PR). Together these make the section 15
pipeline runnable end-to-end.

1. Container name "ft" was too short (Azure Storage requires 3-63 chars)
   and silently failed to create, causing the ACA fine-tune job to error
   on download. Renamed to "finetune" in:
   - 15-02 / 15-03 / 15-04 cell-3 constants
   - 15-fine-tune/main.bicep
   Note: azure_infra.py uses the parameter so no source change needed.

2. 15-03 was not self-contained. It relied on eval_dataset / reports_data
   / env_id / accuracy being in memory from 15-01 + 15-02's kernel.
   Running 15-03 in a fresh kernel produced NameError. Cell 3 now
   re-derives eval_dataset (via iss_utils.get_evaluation_dataset),
   reports_data (NASA fetch), env_id (az containerapp env show), and
   sets fallback teacher/base accuracies with comments. Also added
   pandas + IPython.display imports to cell 2 (previously caused
   NameError in the comparison cell), and fixed an outdated "DeepSeek-V3.2
   (Teacher)" label in the summary table to match the actual teacher
   model gpt-4.1-mini used throughout the chain.

3. The [finetune] dep group's transformers constraint was too loose
   (">=4.40.0" allowed 5.x). Phi-4-mini's custom remote modeling code
   imports symbols (LossKwargs, etc.) that were removed in transformers
   5.x, causing 15-04 to fail with ImportError at model load. Tightened
   to ">=4.46.0,<5.0.0", with a comment explaining why; matches the
   transformers==4.53.3 pin the ACA fine-tune job uses inline.

4. 15-04's generate() call passed just the input_ids tensor, triggering a
   warning that the attention mask couldn't be inferred (pad_token ==
   eos_token for Phi-4-mini). Reworked to call apply_chat_template with
   return_dict=True and pass **inputs to generate(), so both input_ids
   and attention_mask are present. Removes the warning and makes
   generation behaviour reliable rather than best-effort.

5. 15-01 TARGET_TOTAL reduced from 500 to 100 with a rationale block
   ("demo-sized, ~200 teacher calls; bump to 500-1000+ for richer
   distillation, costs/runtime scale linearly"). The markdown header
   above the cell ("Generate Training Data") was rewritten with the
   same explanation so readers see it before hitting the code.

6. Gitignored section 15 artifacts that should not be committed:
   15-fine-tune/data/ (regenerable training JSONL),
   15-fine-tune/models/ (~350MB of LoRA adapter weights),
   15-fine-tune/eval_job.json + job.json (contain real subscription IDs
   when run locally).
100-example synthetic + real-data training set generated by 15-01
data-preparation against the gpt-4.1-mini teacher. Useful as a fixed
artifact so:

- 15-02 can be re-run for the fine-tune ACA job without regenerating
  data from scratch (saves ~200 teacher calls + a few minutes).
- A PR reviewer or future reader can inspect what the LoRA actually
  trains on without provisioning anything.

File contains only synthetic ISS incident classifications (system
prompt + NASA report text + assistant-generated severity/category
label). Scanned for tenant identifiers, /home paths, UUIDs, and
APIM-key-shaped hex strings - all clean.

The 15-fine-tune/data/ directory itself stays gitignored so future
notebook runs that produce additional intermediates don't accidentally
get committed; train.jsonl is tracked explicitly via `git add -f`.
@corticalstack corticalstack changed the title from "fix(15): unblock 15-01 (add [finetune] dep group + reuse ALPHA_GATEWAY_KEY)" to "fix(15): make the fine-tuning pipeline runnable end-to-end" on May 27, 2026
@corticalstack corticalstack merged commit 403ee0d into main May 27, 2026
1 of 2 checks passed
@corticalstack corticalstack deleted the fix/section-15-finetune-dependency-group branch May 27, 2026 21:32