corticalstack · May 27, 2026
diff --git a/.gitignore b/.gitignore
@@ -42,4 +42,9 @@ docs/05-custom-mcp-server/maestro-mcp/data/
 # Ralph / tmux tooling
 .ralph/
 .tmux/
-.tmux.ralph.conf
+.tmux.ralph.conf
+# Section 15 fine-tuning artifacts (regenerable / too large to commit)
+15-fine-tune/data/
+15-fine-tune/models/
+15-fine-tune/eval_job.json
+15-fine-tune/job.json
diff --git a/15-fine-tune/15-00-fine-tune.md b/15-fine-tune/15-00-fine-tune.md
@@ -65,29 +65,35 @@ APIM Gateway
    ```
    GATEWAY_URL=https://<apim-name>.azure-api.net/openai
    CHAT_MODEL=gpt-4.1-mini
+   ALPHA_GATEWAY_KEY=<existing apim subscription key for alpha>
    FINETUNE_FOUNDRY_PROJECT_ENDPOINT=https://aif-spoke-multi-{suffix}.services.ai.azure.com/api/projects/finetune-project
-   FINETUNE_GATEWAY_KEY=<apim subscription key for finetune>
    FINETUNE_APIM_CONNECTION=finetune-apim-connection
    FINETUNE_RESOURCE_GROUP=rg-foundry-multi-{suffix}
    FINETUNE_STORAGE_ACCOUNT=issft{suffix}
    FINETUNE_ACA_ENVIRONMENT=acae-finetune-{suffix}
    ```
 
+   > **No dedicated finetune APIM key.** The teacher-model calls only need a valid APIM subscription key for the shared gateway, so the notebooks reuse `ALPHA_GATEWAY_KEY` (already in `.env` from the project-spoke deployment) instead of provisioning a separate `foundry-gateway-finetune` subscription. That removes one resource and one environment variable. If you need isolated quotas later, create a dedicated APIM subscription and use its key in place of `ALPHA_GATEWAY_KEY`; the call site is straightforward to swap.
+
 3. **Azure CLI logged in** - run `az login` and ensure you have Contributor access to the resource group.
 
 4. **`uv` installed** - install dependencies with:
    ```bash
-   uv sync
+   uv sync --group finetune
    ```
 
+   This section needs heavy ML dependencies (PyTorch, Hugging Face Transformers, PEFT) that are not in the base install. They live in the `finetune` dependency group in `pyproject.toml`. A plain `uv sync` leaves `torch`, `transformers`, `peft`, `matplotlib`, `azure-storage-blob`, and `azure-ai-inference` uninstalled, and the notebooks then fail with `ModuleNotFoundError`. The `--group finetune` flag pulls them in.
+
+   > The base install was kept lean because the fine-tuning group adds ~3 GB (PyTorch + CUDA libs). `uv sync` is exact by default, so a later plain `uv sync` removes the group's packages from `.venv` again; keep passing `--group finetune` whenever you re-sync while working on this section.
+
 5. **Bicep deployed** - deploy this lab's infrastructure into the existing multi-spoke resource group:
    ```bash
    az deployment group create \
      --resource-group rg-foundry-multi-{suffix} \
      --template-file 15-fine-tune/main.bicep \
      --parameters deployerPrincipalId=$(az ad signed-in-user show --query id -o tsv) \
                   apimUrl=$GATEWAY_URL \
-                  apimSubscriptionKey=$FINETUNE_GATEWAY_KEY \
+                  apimSubscriptionKey=$ALPHA_GATEWAY_KEY \
                   existingAccountName=aif-spoke-multi-{suffix}
    ```
 

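The README change above says the notebooks fail with `ModuleNotFoundError` when the `finetune` group has not been synced. A minimal sketch of a pre-flight check that could run before opening the notebooks is shown here. The helper name `missing_modules` and the `FINETUNE_MODULES` list are illustrative assumptions and not part of this commit; the import names are inferred from the packages the README lists.

```python
import importlib.util

# Import names for the packages the README lists in the `finetune` group.
# This list is an assumption for illustration; pyproject.toml is the source of truth.
FINETUNE_MODULES = [
    "torch",
    "transformers",
    "peft",
    "matplotlib",
    "azure.storage.blob",
    "azure.ai.inference",
]


def missing_modules(names):
    """Return the dotted module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec raises ModuleNotFoundError when a parent package
            # (for example `azure`) is itself not installed.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(FINETUNE_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Run: uv sync --group finetune")
    else:
        print("All fine-tuning dependencies are importable.")
```

Because `find_spec` only locates modules without executing them (apart from importing parent packages), the check stays fast even though `torch` itself is slow to import.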
diff --git a/15-fine-tune/15-01-data-preparation.ipynb b/15-fine-tune/15-01-data-preparation.ipynb
diff --git a/15-fine-tune/15-02-fine-tune.ipynb b/15-fine-tune/15-02-fine-tune.ipynb
@@ -1,40 +1,43 @@
 {
- "nbformat": 4,
- "nbformat_minor": 5,
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "name": "python",
-   "version": "3.11.0"
-  }
- },
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "",
    "metadata": {},
-   "source": "# ACA fine-tuning job\n\nThis notebook:\n1. Reads infra configuration from `FINETUNE_*` environment variables\n2. Provisions ACA environment + storage (idempotent via `azure_infra.py`)\n3. Submits the Olive LoRA fine-tuning job to an A100 GPU on Azure Container Apps\n4. Monitors job progress until completion",
-   "outputs": []
+   "source": [
+    "# ACA fine-tuning job\n",
+    "\n",
+    "This notebook:\n",
+    "1. Reads infra configuration from `FINETUNE_*` environment variables\n",
+    "2. Provisions ACA environment + storage (idempotent via `azure_infra.py`)\n",
+    "3. Submits the Olive LoRA fine-tuning job to an A100 GPU on Azure Container Apps\n",
+    "4. Monitors job progress until completion"
+   ]
   },
   {
    "cell_type": "code",
-   "id": "",
+   "execution_count": null,
+   "id": "58ea672f",
    "metadata": {},
-   "source": "from pathlib import Path\nfrom dotenv import load_dotenv\nrepo_root = Path.cwd().parents[0]\nload_dotenv(repo_root / '.env', override=True)",
    "outputs": [],
-   "execution_count": null
+   "source": [
+    "from pathlib import Path\n",
+    "from dotenv import load_dotenv\n",
+    "repo_root = Path.cwd().parents[0]\n",
+    "load_dotenv(repo_root / '.env', override=True)"
+   ]
   },
   {
    "cell_type": "code",
-   "id": "",
+   "execution_count": null,
+   "id": "cd5fc620",
    "metadata": {},
-   "source": "import os\nfrom azure.identity import DefaultAzureCredential\nfrom azure.storage.blob import BlobServiceClient\nfrom azure_infra import provision_infrastructure, submit_finetune_job, monitor_job",
    "outputs": [],
-   "execution_count": null
+   "source": [
+    "import os\n",
+    "from azure.identity import DefaultAzureCredential\n",
+    "from azure.storage.blob import BlobServiceClient\n",
+    "from azure_infra import provision_infrastructure, submit_finetune_job, monitor_job"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -48,27 +51,74 @@
   },
   {
    "cell_type": "code",
-   "id": "",
+   "execution_count": null,
+   "id": "06f37c66",
    "metadata": {},
-   "source": "# Configuration from environment variables\nRESOURCE_GROUP = os.getenv(\"FINETUNE_RESOURCE_GROUP\")\nSTORAGE_ACCOUNT = os.getenv(\"FINETUNE_STORAGE_ACCOUNT\")\nACA_ENV = os.getenv(\"FINETUNE_ACA_ENVIRONMENT\")\nBASE_MODEL_ID = \"microsoft/Phi-4-mini-instruct\"\nLOCATION = \"swedencentral\"  # ACA GPU NC24-A100 only available in Sweden Central\nCONTAINER_NAME = \"ft\"\nJOB_NAME = \"iss-ft-job\"\n\nprint(f\"Resource Group:  {RESOURCE_GROUP}\")\nprint(f\"Storage Account: {STORAGE_ACCOUNT}\")\nprint(f\"ACA Environment: {ACA_ENV}\")\nprint(f\"Base Model:      {BASE_MODEL_ID}\")",
    "outputs": [],
-   "execution_count": null
+   "source": [
+    "# Configuration from environment variables\n",
+    "RESOURCE_GROUP = os.getenv(\"FINETUNE_RESOURCE_GROUP\")\n",
+    "STORAGE_ACCOUNT = os.getenv(\"FINETUNE_STORAGE_ACCOUNT\")\n",
+    "ACA_ENV = os.getenv(\"FINETUNE_ACA_ENVIRONMENT\")\n",
+    "BASE_MODEL_ID = \"microsoft/Phi-4-mini-instruct\"\n",
+    "LOCATION = \"swedencentral\"  # ACA GPU NC24-A100 only available in Sweden Central\n",
+    "CONTAINER_NAME = \"finetune\"  # 3-63 chars required; \"ft\" was too short and failed silently\n",
+    "JOB_NAME = \"iss-ft-job\"\n",
+    "\n",
+    "print(f\"Resource Group:  {RESOURCE_GROUP}\")\n",
+    "print(f\"Storage Account: {STORAGE_ACCOUNT}\")\n",
+    "print(f\"ACA Environment: {ACA_ENV}\")\n",
+    "print(f\"Base Model:      {BASE_MODEL_ID}\")"
+   ]
   },
   {
    "cell_type": "code",
-   "id": "",
+   "execution_count": null,
+   "id": "4172766e",
    "metadata": {},
-   "source": "print(f\"Training Config: {RESOURCE_GROUP} | {LOCATION}\")\n\n# 1. Provision Infrastructure\nenv_id = provision_infrastructure(RESOURCE_GROUP, LOCATION, STORAGE_ACCOUNT, CONTAINER_NAME, ACA_ENV, \"data/train.jsonl\")\n\n# 2. Submit Job\nsubmit_finetune_job(\n    JOB_NAME, RESOURCE_GROUP, env_id, STORAGE_ACCOUNT, CONTAINER_NAME, \n    base_model=BASE_MODEL_ID, location=LOCATION\n)\n\n# 3. Monitor (this takes ~15-20 mins)\nsuccess = monitor_job(JOB_NAME, RESOURCE_GROUP)\n",
    "outputs": [],
-   "execution_count": null
+   "source": [
+    "print(f\"Training Config: {RESOURCE_GROUP} | {LOCATION}\")\n",
+    "\n",
+    "# 1. Provision Infrastructure\n",
+    "env_id = provision_infrastructure(RESOURCE_GROUP, LOCATION, STORAGE_ACCOUNT, CONTAINER_NAME, ACA_ENV, \"data/train.jsonl\")\n",
+    "\n",
+    "# 2. Submit Job\n",
+    "submit_finetune_job(\n",
+    "    JOB_NAME, RESOURCE_GROUP, env_id, STORAGE_ACCOUNT, CONTAINER_NAME, \n",
+    "    base_model=BASE_MODEL_ID, location=LOCATION\n",
+    ")\n",
+    "\n",
+    "# 3. Monitor (this takes ~15-20 mins)\n",
+    "success = monitor_job(JOB_NAME, RESOURCE_GROUP)\n"
+   ]
   },
   {
    "cell_type": "code",
-   "id": "",
+   "execution_count": null,
+   "id": "6057ae8d",
    "metadata": {},
-   "source": "if success:\n    print(\"✅ Fine-tuning completed successfully!\")\n    print(\"Adapter saved to blob storage. Will evaluate on ACA next.\")\nelse:\n    print(\"❌ Training failed. Check Azure Portal for logs.\")",
    "outputs": [],
-   "execution_count": null
+   "source": [
+    "if success:\n",
+    "    print(\"✅ Fine-tuning completed successfully!\")\n",
+    "    print(\"Adapter saved to blob storage. Will evaluate on ACA next.\")\n",
+    "else:\n",
+    "    print(\"❌ Training failed. Check Azure Portal for logs.\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11.0"
   }
- ]
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
 }
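The notebook's configuration cell reads three `FINETUNE_*` variables with `os.getenv` (which returns `None` when a variable is unset), and the commit notes that the container name `"ft"` was too short and failed silently. A minimal sketch of a check that could run before `provision_infrastructure` is given here, assuming the standard Azure Blob container naming rules (3-63 characters, lowercase letters, digits and single hyphens, starting and ending with a letter or digit). The helpers `is_valid_container_name` and `check_config` are hypothetical and do not exist in `azure_infra.py`.

```python
import os
import re

# Variables the notebook's configuration cell reads.
REQUIRED_ENV_VARS = (
    "FINETUNE_RESOURCE_GROUP",
    "FINETUNE_STORAGE_ACCOUNT",
    "FINETUNE_ACA_ENVIRONMENT",
)

# Lowercase letters and digits, optionally separated by single hyphens.
_CONTAINER_NAME_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_container_name(name):
    """Check a name against the Azure Blob container naming rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME_RE.fullmatch(name) is not None


def check_config(container_name, env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = [f"{var} is not set" for var in REQUIRED_ENV_VARS if not env.get(var)]
    if not is_valid_container_name(container_name):
        problems.append(f"invalid blob container name: {container_name!r}")
    return problems
```

In the notebook this could be called as `problems = check_config(CONTAINER_NAME)` right after the configuration cell, raising a `ValueError` if the list is non-empty, so that a bad name or a missing `.env` entry surfaces before any Azure call is made.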