diff --git a/notebooks/azure-ai-search-custom-schema-citations.ipynb b/notebooks/azure-ai-search-custom-schema-citations.ipynb new file mode 100644 index 0000000..c848f3e --- /dev/null +++ b/notebooks/azure-ai-search-custom-schema-citations.ipynb @@ -0,0 +1,356 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6907b88f", + "metadata": {}, + "source": [ + "You have an Azure AI Search index whose URL, title, and content fields are **not** named\n", + "`url` / `title` / `content` — they're `blob_url`, `uid`, `snippet`, or whatever your blob or\n", + "SharePoint integrated-vectorization pipeline produced. You wire it into a Foundry Agent with the\n", + "**Azure AI Search tool**, the answers are great, but the `url_citation` annotations come back as\n", + "useless placeholders:\n", + "\n", + "```text\n", + "title='doc_0' url='https://.search.windows.net/'\n", + "```\n", + "\n", + "**The pattern this recipe teaches:** register the index as a *project asset* with a `FieldMapping`,\n", + "then point the `AzureAISearchTool` at it via `index_asset_id`. The agent's citations then resolve to\n", + "your real fields. No re-indexing, no schema change, no touching the index.\n", + "\n", + "### What you'll do\n", + "\n", + "1. Register your existing index as a project asset with a `FieldMapping`\n", + "2. Create an agent that references the asset by `index_asset_id`\n", + "3. Ask a question and read citations that resolve to your real `url` / `title` fields\n", + "4. Learn the failure modes that produce `doc_0` placeholders and how to avoid each one\n", + "\n", + "By the end you have a copyable two-step (`create_or_update` + `index_asset_id`) you can drop into any\n", + "agent that grounds on a custom-schema Azure AI Search index." + ] + }, + { + "cell_type": "markdown", + "id": "a01e6271", + "metadata": {}, + "source": [ + "## 1 · Prerequisites\n", + "\n", + "| | |\n", + "|---|---|\n", + "| Microsoft Foundry project | A project endpoint and one chat deployment (e.g. 
`gpt-4.1`) |\n", + "| Azure AI Search | An existing index, connected to the project as a `CognitiveSearch` connection |\n", + "| Identity | `az login` — the notebook uses `DefaultAzureCredential` |\n", + "\n", + "You do **not** need to re-index or rename any fields. This recipe works against the schema you\n", + "already have.\n", + "\n", + "### Install dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4efa992a", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --quiet \"azure-ai-projects>=2.0.0\" \"azure-identity>=1.19.0\"" + ] + }, + { + "cell_type": "markdown", + "id": "6a1981e5", + "metadata": {}, + "source": [ + "## 2 · Configure endpoints and your index's real field names\n", + "\n", + "Set these in your shell (or a local `.env`) and the cell below reads them. The three field names at\n", + "the bottom are the whole point — they are the part of your index that differs from the docs.\n", + "\n", + "```bash\n", + "PROJECT_ENDPOINT=https://.services.ai.azure.com/api/projects/\n", + "SEARCH_CONNECTION_NAME=my-search-connection # the CognitiveSearch connection NAME, not its id\n", + "INDEX_NAME=my-custom-index # your existing index\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f94d2fe4", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "PROJECT_ENDPOINT = os.getenv(\"PROJECT_ENDPOINT\", \"https://.services.ai.azure.com/api/projects/\")\n", + "SEARCH_CONNECTION_NAME = os.getenv(\"SEARCH_CONNECTION_NAME\", \"my-search-connection\") # connection NAME, not id\n", + "INDEX_NAME = os.getenv(\"INDEX_NAME\", \"my-custom-index\")\n", + "MODEL = os.getenv(\"MODEL\", \"gpt-4.1\")\n", + "\n", + "# Your index's real field names -- the part that differs from the docs.\n", + "URL_FIELD = os.getenv(\"URL_FIELD\", \"blob_url\") # your URL field -> annotation.url\n", + "TITLE_FIELD = os.getenv(\"TITLE_FIELD\", \"uid\") # your title field -> annotation.title\n", + 
"CONTENT_FIELD = os.getenv(\"CONTENT_FIELD\", \"snippet\") # your content field\n", + "\n", + "print(f\"project : {PROJECT_ENDPOINT}\")\n", + "print(f\"connection : {SEARCH_CONNECTION_NAME}\")\n", + "print(f\"index : {INDEX_NAME}\")\n", + "print(f\"fields : url={URL_FIELD!r} title={TITLE_FIELD!r} content={CONTENT_FIELD!r}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ceec87b3", + "metadata": {}, + "source": [ + "## 3 · Create the client" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41e1f5b1", + "metadata": {}, + "outputs": [], + "source": [ + "from azure.identity import DefaultAzureCredential\n", + "from azure.ai.projects import AIProjectClient\n", + "\n", + "project = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=DefaultAzureCredential())\n", + "openai = project.get_openai_client()\n", + "print(\"client created\")" + ] + }, + { + "cell_type": "markdown", + "id": "1e442bfa", + "metadata": {}, + "source": [ + "## 4 · Register the index as an asset **with a field mapping**\n", + "\n", + "This is the step that makes citations work. `FieldMapping` maps your custom fields onto the citation\n", + "slots the tool understands. The mapping lives on the **registered asset** — not on the tool (see the\n", + "Gotchas table for why that distinction matters)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cbbb4f13", + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.projects.models import AzureAISearchIndex, FieldMapping\n", + "\n", + "ASSET_NAME, ASSET_VERSION = \"my-custom-index-mapped\", \"1\"\n", + "\n", + "asset = project.indexes.create_or_update(\n", + " name=ASSET_NAME, version=ASSET_VERSION,\n", + " index=AzureAISearchIndex(\n", + " name=ASSET_NAME, version=ASSET_VERSION,\n", + " connection_name=SEARCH_CONNECTION_NAME, # connection NAME\n", + " index_name=INDEX_NAME,\n", + " field_mapping=FieldMapping(\n", + " content_fields=[CONTENT_FIELD], # required\n", + " url_field=URL_FIELD, # -> annotation.url\n", + " title_field=TITLE_FIELD, # -> annotation.title\n", + " # filepath_field=\"...\", # optional\n", + " ),\n", + " ),\n", + ")\n", + "print(f\"registered asset {ASSET_NAME}/versions/{ASSET_VERSION}\")" + ] + }, + { + "cell_type": "markdown", + "id": "78fa244a", + "metadata": {}, + "source": [ + "## 5 · Create the agent, referencing the asset by `index_asset_id`\n", + "\n", + "> ⚠️ `index_asset_id` **must** be `\"/versions/\"`, and it is **mutually exclusive**\n", + "> with `project_connection_id` + `index_name`. Set **only** `index_asset_id`, or the service rejects\n", + "> the request with `Multiple values specified for oneof knowledge_index`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e6d55723", + "metadata": {}, + "outputs": [], + "source": [ + "from azure.ai.projects.models import (\n", + " AzureAISearchTool, AzureAISearchToolResource, AISearchIndexResource,\n", + " AzureAISearchQueryType, PromptAgentDefinition,\n", + ")\n", + "\n", + "agent = project.agents.create_version(\n", + " agent_name=\"search-custom-schema\",\n", + " definition=PromptAgentDefinition(\n", + " model=MODEL,\n", + " instructions=(\n", + " \"Answer only from the Azure AI Search tool. 
Always cite sources, \"\n", + " \"rendered as [message_idx:search_idx†source].\"\n", + " ),\n", + " tools=[AzureAISearchTool(azure_ai_search=AzureAISearchToolResource(indexes=[\n", + " AISearchIndexResource(\n", + " index_asset_id=f\"{ASSET_NAME}/versions/{ASSET_VERSION}\",\n", + " query_type=AzureAISearchQueryType.SEMANTIC, # or VECTOR_SEMANTIC_HYBRID\n", + " top_k=5,\n", + " )\n", + " ]))],\n", + " ),\n", + ")\n", + "print(f\"agent {agent.name} v{agent.version}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e5f68d84", + "metadata": {}, + "source": [ + "## 6 · Ask a question and read the citations\n", + "\n", + "Stream a response and pull the `url_citation` annotations off the final message. With the mapping in\n", + "place, `title` and `url` now carry your real field values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d087fd4b", + "metadata": {}, + "outputs": [], + "source": [ + "stream = openai.responses.create(\n", + " stream=True, tool_choice=\"required\",\n", + " input=\"What does the P4324 do?\",\n", + " extra_body={\"agent_reference\": {\"name\": agent.name, \"type\": \"agent_reference\"}},\n", + ")\n", + "\n", + "for event in stream:\n", + " if event.type == \"response.output_text.delta\":\n", + " print(event.delta, end=\"\")\n", + " elif event.type == \"response.output_item.done\":\n", + " item = event.item\n", + " if item.type == \"message\" and item.content:\n", + " last = item.content[-1]\n", + " if getattr(last, \"type\", None) == \"output_text\":\n", + " for a in (last.annotations or []):\n", + " if a.type == \"url_citation\":\n", + " print(f\"\\nCITATION title={a.title!r} url={a.url!r}\")" + ] + }, + { + "cell_type": "markdown", + "id": "711a6c33", + "metadata": {}, + "source": [ + "**Expected output** — your real fields now surface instead of `doc_0` placeholders:\n", + "\n", + "```text\n", + "CITATION title='P4324 Programmable Flow Controller — Overview'\n", + " 
url='https://contoso-docs.example.com/p4324/overview'\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "7ddaee5b", + "metadata": {}, + "source": [ + "## 7 · Clean up (optional)\n", + "\n", + "Delete the agent version. Keep the asset if you want to reuse the mapping for other agents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60909bf8", + "metadata": {}, + "outputs": [], + "source": [ + "project.agents.delete_version(agent_name=agent.name, agent_version=agent.version)\n", + "# project.indexes.delete(name=ASSET_NAME, version=ASSET_VERSION) # keep it to reuse the mapping\n", + "print(\"cleaned up agent\")" + ] + }, + { + "cell_type": "markdown", + "id": "874a873c", + "metadata": {}, + "source": [ + "## Gotchas\n", + "\n", + "Every row here is a failure mode that produces broken or placeholder citations, or an error that rejects the request outright.\n", + "\n", + "| Symptom | Cause | Fix |\n", + "|---|---|---|\n", + "| `title=\"doc_0\"`, `url=https://.search.windows.net/` | Direct `project_connection_id` + `index_name` path: citations only read literal `url` / `title` fields | Use the `index_asset_id` + `FieldMapping` path above |\n", + "| `Invalid IndexId format` | `index_asset_id` was a bare name or `name:1` | Must be `\"/versions/\"` (e.g.
`.../versions/1` or `.../versions/latest`) |\n", + "| `Multiple values specified for oneof knowledge_index` | Set both `index_asset_id` and `project_connection_id` / `index_name` | Set **only** `index_asset_id` |\n", + "| Field mapping ignored | Passed `parameters.field_mapping` as a dict on the tool | That key is silently dropped; the mapping must live on the **registered asset**, not the tool |\n", + "| Answer is right but citations wrong | The tool concatenates content regardless of field names, so answers work even when citations don't | The mapping fixes citations specifically |\n", + "\n", + "**Alternative (no asset registration):** rename or alias your URL and title fields to literally `url`\n", + "and `title` in the index (indexer output field mappings, or write both on push). The direct path then\n", + "works too. Prefer the asset + `FieldMapping` route when you can't touch the index." + ] + }, + { + "cell_type": "markdown", + "id": "f905039e", + "metadata": {}, + "source": [ + "## Verified run (real output)\n", + "\n", + "Ran these steps **verbatim** on 2026-06-05 against a real index `azstool-e2e-custom`\n", + "(fields `id`, `uid`, `blob_url`, `snippet`) on a live Foundry project, starting from a fresh asset\n", + "registration. Actual console output:\n", + "\n", + "```text\n", + "[step 1] client created\n", + "[step 2] registered asset cookbook-verify-mapped/versions/1:\n", + " {'type': 'AzureSearch', 'connectionName': 'fsunavala-srch-demos-prod',\n", + " 'indexName': 'azstool-e2e-custom',\n", + " 'fieldMapping': {'contentFields': ['snippet'], 'titleField': 'uid', 'urlField': 'blob_url'},\n", + " 'name': 'cookbook-verify-mapped', 'version': '1'}\n", + "[step 3] agent search-custom-schema v1\n", + "[step 4] streamed answer + citations:\n", + "\n", + "The P4324 is a programmable industrial flow controller designed to regulate the flow rate of\n", + "liquids and gases in process pipelines. 
It does this by modulating a built-in proportional valve.\n", + "The device takes 4-20mA and Modbus RTU setpoints and can maintain flow to within +/- 0.5 percent\n", + "of the target value【4:0†source】.\n", + "\n", + "CITATION title='P4324 Programmable Flow Controller — Overview' url='https://contoso-docs.example.com/p4324/overview'\n", + "\n", + "[step 5] cleaned up agent + asset\n", + "```\n", + "\n", + "**Confirmed:** `title` resolved from the index's `uid` field and `url` from its `blob_url` field —\n", + "no `doc_0` placeholder, no `https://.search.windows.net/` fallback. The same index referenced\n", + "*directly* (without the asset + `FieldMapping`) returns `title='doc_0'`,\n", + "`url='https://.search.windows.net/'` — the broken baseline this recipe fixes." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/registry.yaml b/registry.yaml index 5460900..d9136d7 100644 --- a/registry.yaml +++ b/registry.yaml @@ -189,3 +189,18 @@ - mcp - agents - agent-service + +- slug: azure-ai-search-custom-schema-citations + path: notebooks/azure-ai-search-custom-schema-citations.ipynb + title: "Fix Agent Citations for Custom Azure AI Search Schemas" + description: "Get real url/title citations from the Azure AI Search tool when your index fields aren't named url/title/content — register the index as an asset with a FieldMapping and use index_asset_id." + date: "2026-06-05" + authors: + - github: farzad528 + tags: + - azure-ai-search + - agents + - agent-service + - tools + - grounding + - retrieval