Merged
7 changes: 6 additions & 1 deletion .gitignore
@@ -42,4 +42,9 @@ docs/05-custom-mcp-server/maestro-mcp/data/
 # Ralph / tmux tooling
 .ralph/
 .tmux/
-.tmux.ralph.conf
+.tmux.ralph.conf
+# Section 15 fine-tuning artifacts (regenerable / too large to commit)
+15-fine-tune/data/
+15-fine-tune/models/
+15-fine-tune/eval_job.json
+15-fine-tune/job.json
12 changes: 9 additions & 3 deletions 15-fine-tune/15-00-fine-tune.md
@@ -65,29 +65,35 @@ APIM Gateway
 ```
 GATEWAY_URL=https://<apim-name>.azure-api.net/openai
 CHAT_MODEL=gpt-4.1-mini
+ALPHA_GATEWAY_KEY=<existing apim subscription key for alpha>
 FINETUNE_FOUNDRY_PROJECT_ENDPOINT=https://aif-spoke-multi-{suffix}.services.ai.azure.com/api/projects/finetune-project
-FINETUNE_GATEWAY_KEY=<apim subscription key for finetune>
 FINETUNE_APIM_CONNECTION=finetune-apim-connection
 FINETUNE_RESOURCE_GROUP=rg-foundry-multi-{suffix}
 FINETUNE_STORAGE_ACCOUNT=issft{suffix}
 FINETUNE_ACA_ENVIRONMENT=acae-finetune-{suffix}
 ```

+> **No dedicated finetune APIM key.** The teacher-model calls only need a valid APIM subscription key for the shared gateway, so the notebooks reuse `ALPHA_GATEWAY_KEY` (already in `.env` from the project-spoke deployment) rather than provisioning a separate `foundry-gateway-finetune` subscription. This removes one resource and one extra env variable. If you want isolated quotas later, you can create a dedicated APIM subscription and wire its key in - the call site is straightforward to swap.
+
 3. **Azure CLI logged in** - run `az login` and ensure you have Contributor access to the resource group.

 4. **`uv` installed** - install dependencies with:
 ```bash
-uv sync
+uv sync --group finetune
 ```

+This section needs heavy ML dependencies (PyTorch, Hugging Face Transformers, PEFT) that are not in the base install. They live in the `finetune` dependency group in `pyproject.toml`. Plain `uv sync` will leave `torch` / `transformers` / `peft` / `matplotlib` / `azure-storage-blob` / `azure-ai-inference` missing and the notebooks will fail with `ModuleNotFoundError`. The `--group finetune` flag pulls them in.
+
+> The base install was kept lean because the fine-tuning group adds ~3 GB (PyTorch + CUDA libs). Once you've synced the group it persists in `.venv`; you don't need the flag on subsequent `uv sync` calls.
+
 5. **Bicep deployed** - deploy this lab's infrastructure into the existing multi-spoke resource group:
 ```bash
 az deployment group create \
 --resource-group rg-foundry-multi-{suffix} \
 --template-file 15-fine-tune/main.bicep \
 --parameters deployerPrincipalId=$(az ad signed-in-user show --query id -o tsv) \
 apimUrl=$GATEWAY_URL \
-apimSubscriptionKey=$FINETUNE_GATEWAY_KEY \
+apimSubscriptionKey=$ALPHA_GATEWAY_KEY \
 existingAccountName=aif-spoke-multi-{suffix}
 ```
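The prerequisites in this hunk (the `.env` keys and the `finetune` dependency group) could be checked before opening the notebooks. The following is a minimal sketch, not part of this PR: `missing_modules`, `missing_env`, `FINETUNE_MODULES`, and `REQUIRED_ENV` are hypothetical names, the import names are assumed from the packages listed in the README text, and the choice of which variables count as required is also an assumption.

```python
import importlib.util
import os

# Import names assumed from the packages named in the README text
# (torch, transformers, peft, matplotlib, azure-storage-blob, azure-ai-inference).
FINETUNE_MODULES = [
    "torch",
    "transformers",
    "peft",
    "matplotlib",
    "azure.storage.blob",
    "azure.ai.inference",
]

# Assumption: the subset of .env keys the notebooks need at minimum.
REQUIRED_ENV = [
    "GATEWAY_URL",
    "ALPHA_GATEWAY_KEY",
    "FINETUNE_RESOURCE_GROUP",
    "FINETUNE_STORAGE_ACCOUNT",
    "FINETUNE_ACA_ENVIRONMENT",
]


def missing_modules(names):
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises if a parent package (e.g. "azure") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


def missing_env(names, env=None):
    """Return the variable names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    print("Missing modules:", missing_modules(FINETUNE_MODULES))
    print("Missing env vars:", missing_env(REQUIRED_ENV))
```

If the first list is non-empty, rerunning `uv sync --group finetune` is the likely fix; if the second is non-empty, the `.env` block above is the place to look.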

466 changes: 406 additions & 60 deletions 15-fine-tune/15-01-data-preparation.ipynb

Large diffs are not rendered by default.

114 changes: 82 additions & 32 deletions 15-fine-tune/15-02-fine-tune.ipynb
@@ -1,40 +1,43 @@
 {
-"nbformat": 4,
-"nbformat_minor": 5,
-"metadata": {
-"kernelspec": {
-"display_name": "Python 3",
-"language": "python",
-"name": "python3"
-},
-"language_info": {
-"name": "python",
-"version": "3.11.0"
-}
-},
 "cells": [
 {
 "cell_type": "markdown",
-"id": "",
 "metadata": {},
-"source": "# ACA fine-tuning job\n\nThis notebook:\n1. Reads infra configuration from `FINETUNE_*` environment variables\n2. Provisions ACA environment + storage (idempotent via `azure_infra.py`)\n3. Submits the Olive LoRA fine-tuning job to an A100 GPU on Azure Container Apps\n4. Monitors job progress until completion",
-"outputs": []
+"source": [
+"# ACA fine-tuning job\n",
+"\n",
+"This notebook:\n",
+"1. Reads infra configuration from `FINETUNE_*` environment variables\n",
+"2. Provisions ACA environment + storage (idempotent via `azure_infra.py`)\n",
+"3. Submits the Olive LoRA fine-tuning job to an A100 GPU on Azure Container Apps\n",
+"4. Monitors job progress until completion"
+]
 },
 {
 "cell_type": "code",
-"id": "",
+"execution_count": null,
+"id": "58ea672f",
 "metadata": {},
-"source": "from pathlib import Path\nfrom dotenv import load_dotenv\nrepo_root = Path.cwd().parents[0]\nload_dotenv(repo_root / '.env', override=True)",
 "outputs": [],
-"execution_count": null
+"source": [
+"from pathlib import Path\n",
+"from dotenv import load_dotenv\n",
+"repo_root = Path.cwd().parents[0]\n",
+"load_dotenv(repo_root / '.env', override=True)"
+]
 },
 {
 "cell_type": "code",
-"id": "",
+"execution_count": null,
+"id": "cd5fc620",
 "metadata": {},
-"source": "import os\nfrom azure.identity import DefaultAzureCredential\nfrom azure.storage.blob import BlobServiceClient\nfrom azure_infra import provision_infrastructure, submit_finetune_job, monitor_job",
 "outputs": [],
-"execution_count": null
+"source": [
+"import os\n",
+"from azure.identity import DefaultAzureCredential\n",
+"from azure.storage.blob import BlobServiceClient\n",
+"from azure_infra import provision_infrastructure, submit_finetune_job, monitor_job"
+]
 },
 {
 "cell_type": "markdown",
@@ -48,27 +51,74 @@
 },
 {
 "cell_type": "code",
-"id": "",
+"execution_count": null,
+"id": "06f37c66",
 "metadata": {},
-"source": "# Configuration from environment variables\nRESOURCE_GROUP = os.getenv(\"FINETUNE_RESOURCE_GROUP\")\nSTORAGE_ACCOUNT = os.getenv(\"FINETUNE_STORAGE_ACCOUNT\")\nACA_ENV = os.getenv(\"FINETUNE_ACA_ENVIRONMENT\")\nBASE_MODEL_ID = \"microsoft/Phi-4-mini-instruct\"\nLOCATION = \"swedencentral\" # ACA GPU NC24-A100 only available in Sweden Central\nCONTAINER_NAME = \"ft\"\nJOB_NAME = \"iss-ft-job\"\n\nprint(f\"Resource Group: {RESOURCE_GROUP}\")\nprint(f\"Storage Account: {STORAGE_ACCOUNT}\")\nprint(f\"ACA Environment: {ACA_ENV}\")\nprint(f\"Base Model: {BASE_MODEL_ID}\")",
 "outputs": [],
-"execution_count": null
+"source": [
+"# Configuration from environment variables\n",
+"RESOURCE_GROUP = os.getenv(\"FINETUNE_RESOURCE_GROUP\")\n",
+"STORAGE_ACCOUNT = os.getenv(\"FINETUNE_STORAGE_ACCOUNT\")\n",
+"ACA_ENV = os.getenv(\"FINETUNE_ACA_ENVIRONMENT\")\n",
+"BASE_MODEL_ID = \"microsoft/Phi-4-mini-instruct\"\n",
+"LOCATION = \"swedencentral\" # ACA GPU NC24-A100 only available in Sweden Central\n",
+"CONTAINER_NAME = \"finetune\" # 3-63 chars required; \"ft\" was too short and failed silently\n",
+"JOB_NAME = \"iss-ft-job\"\n",
+"\n",
+"print(f\"Resource Group: {RESOURCE_GROUP}\")\n",
+"print(f\"Storage Account: {STORAGE_ACCOUNT}\")\n",
+"print(f\"ACA Environment: {ACA_ENV}\")\n",
+"print(f\"Base Model: {BASE_MODEL_ID}\")"
+]
 },
 {
 "cell_type": "code",
-"id": "",
+"execution_count": null,
+"id": "4172766e",
 "metadata": {},
-"source": "print(f\"Training Config: {RESOURCE_GROUP} | {LOCATION}\")\n\n# 1. Provision Infrastructure\nenv_id = provision_infrastructure(RESOURCE_GROUP, LOCATION, STORAGE_ACCOUNT, CONTAINER_NAME, ACA_ENV, \"data/train.jsonl\")\n\n# 2. Submit Job\nsubmit_finetune_job(\n JOB_NAME, RESOURCE_GROUP, env_id, STORAGE_ACCOUNT, CONTAINER_NAME, \n base_model=BASE_MODEL_ID, location=LOCATION\n)\n\n# 3. Monitor (this takes ~15-20 mins)\nsuccess = monitor_job(JOB_NAME, RESOURCE_GROUP)\n",
 "outputs": [],
-"execution_count": null
+"source": [
+"print(f\"Training Config: {RESOURCE_GROUP} | {LOCATION}\")\n",
+"\n",
+"# 1. Provision Infrastructure\n",
+"env_id = provision_infrastructure(RESOURCE_GROUP, LOCATION, STORAGE_ACCOUNT, CONTAINER_NAME, ACA_ENV, \"data/train.jsonl\")\n",
+"\n",
+"# 2. Submit Job\n",
+"submit_finetune_job(\n",
+" JOB_NAME, RESOURCE_GROUP, env_id, STORAGE_ACCOUNT, CONTAINER_NAME, \n",
+" base_model=BASE_MODEL_ID, location=LOCATION\n",
+")\n",
+"\n",
+"# 3. Monitor (this takes ~15-20 mins)\n",
+"success = monitor_job(JOB_NAME, RESOURCE_GROUP)\n"
+]
 },
 {
 "cell_type": "code",
-"id": "",
+"execution_count": null,
+"id": "6057ae8d",
 "metadata": {},
-"source": "if success:\n print(\"✅ Fine-tuning completed successfully!\")\n print(\"Adapter saved to blob storage. Will evaluate on ACA next.\")\nelse:\n print(\"❌ Training failed. Check Azure Portal for logs.\")",
 "outputs": [],
-"execution_count": null
+"source": [
+"if success:\n",
+" print(\"✅ Fine-tuning completed successfully!\")\n",
+" print(\"Adapter saved to blob storage. Will evaluate on ACA next.\")\n",
+"else:\n",
+" print(\"❌ Training failed. Check Azure Portal for logs.\")"
+]
+}
+],
+"metadata": {
+"kernelspec": {
+"display_name": "Python 3",
+"language": "python",
+"name": "python3"
+},
+"language_info": {
+"name": "python",
+"version": "3.11.0"
 }
-]
+},
+"nbformat": 4,
+"nbformat_minor": 5
 }
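The `CONTAINER_NAME` change above fixes a name that was too short for Azure Blob Storage and, per the inline comment, failed silently. A small guard in the configuration cell could surface that earlier. This is a minimal sketch, not part of this PR: `validate_container_name` is a hypothetical helper, and it encodes the documented Azure container naming rules (3 to 63 characters; lowercase letters, digits, and single hyphens; starting and ending with a letter or digit).

```python
import re

# Lowercase alphanumeric groups separated by single hyphens, so the name
# cannot start or end with a hyphen or contain consecutive hyphens.
_CONTAINER_PATTERN = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def validate_container_name(name: str) -> str:
    """Return the name unchanged if it is a valid blob container name, else raise ValueError."""
    if not 3 <= len(name) <= 63:
        raise ValueError(
            f"Container name {name!r} must be 3-63 characters long, got {len(name)}"
        )
    if not _CONTAINER_PATTERN.fullmatch(name):
        raise ValueError(
            f"Container name {name!r} may only contain lowercase letters, digits, "
            "and single hyphens, and must start and end with a letter or digit"
        )
    return name
```

In the notebook this could wrap the assignment, for example `CONTAINER_NAME = validate_container_name("finetune")`, so an invalid value raises in the configuration cell rather than during provisioning.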