Revise Docupipe plan with concrete endpoint workflow #2
matthewdavidian wants to merge 3 commits into master from codex/plan-integration-for-docupipe-api
# Docupipe Integration Plan for Datatech AI Library
## Objective
Implement Docupipe support in DTAI as a **parallel document pipeline** (not a chat provider), focused on:
1. Submitting documents for processing.
2. Launching schema-based standardization jobs.
3. Polling asynchronous status.
4. Returning **raw Docupipe JSON + minimal metadata**.

This aligns with the current requirement that the consuming application owns schema sourcing/selection, queue persistence, and UI workflow.
## Confirmed Product Constraints
- Parallel document-pipeline interface (not `cAiInterface` sibling behavior).
- Async-only helper methods (submit and poll later; no blocking wait helper).
- Schema source is application-owned (DB-backed in your implementation); the library is schema-source agnostic.
- The application chooses the schema (with optional user override).
- The library stops at extraction/standardization retrieval.
- Single retry on transient failure, then bubble the error.
- Polling-first architecture (no webhook dependency initially).
## Docupipe Endpoints to Implement
Use `https://app.docupipe.ai` as the default base URL.
1. `POST /document`
   - Submit a document for processing.
   - Returns `documentId`.
2. `GET /document/{document_id}`
   - Retrieve processed document status/details.
3. `POST /v2/standardize/batch`
   - Start async standardization for one or more `documentIds`.
   - Supports optional `schemaId`, `guidelines`, `displayMode`, `splitMode`, `effortLevel`.
   - Returns `jobId`.
4. `GET /job/{job_id}`
   - Poll asynchronous job status.
5. `GET /standardization/{standardization_id}`
   - Retrieve the standardized output JSON (the final extraction payload).
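The five calls above can be sketched as request builders (Python for illustration only; the actual DataFlex client would mirror this shape). All request-body field names here (`dataset`, `documentIds`, `schemaId`, the nested `document.file.url`) are assumptions that must be verified against the DocuPipe API reference before implementation:

```python
# Hypothetical request builders for the v1 endpoints. Body field names are
# assumptions to verify against the DocuPipe API reference pages.
BASE_URL = "https://app.docupipe.ai"

def submit_document_request(file_url, dataset=None):
    # POST /document -- URL-based submission (assumed payload shape)
    body = {"document": {"file": {"url": file_url}}}
    if dataset:
        body["dataset"] = dataset
    return {"method": "POST", "url": BASE_URL + "/document", "json": body}

def standardize_batch_request(document_ids, schema_id=None):
    # POST /v2/standardize/batch -- starts an async job, returns a jobId
    body = {"documentIds": list(document_ids)}
    if schema_id:
        body["schemaId"] = schema_id
    return {"method": "POST", "url": BASE_URL + "/v2/standardize/batch", "json": body}

def get_job_request(job_id):
    # GET /job/{job_id} -- poll async job status
    return {"method": "GET", "url": BASE_URL + "/job/" + job_id}

def get_standardization_request(standardization_id):
    # GET /standardization/{standardization_id} -- final extraction payload
    return {"method": "GET", "url": BASE_URL + "/standardization/" + standardization_id}
```

Keeping the builders separate from transport makes them trivially unit-testable before any network code exists.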
## Documentation-Driven Endpoint Strategy
Given the broader DocuPipe API surface, the initial implementation remains intentionally narrow, but should reserve extension points for:

- **Workflow-driven submission**
  - `POST /workflow/on-submit-document` to create/update workflows.
  - `POST /document` with `workflowId` for one-call classify/split/standardize orchestration.
- **Bulk upload/standardize convenience endpoints**
  - `upload-multiple` and `upload-and-standardize-multiple` style endpoints for future throughput improvements.
- **Operational endpoints**
  - Health checks, rate-limit guidance, and list/retrieve endpoints (`list_documents`, `list_jobs`, `list_standardizations`) to support observability and admin tooling.
- **Review and quality tooling**
  - Visual review generation/retrieval endpoints for extraction QA workflows, if customer support requires citeable evidence.

These are **not required for v1** but should influence class boundaries so they can be added without breaking the public API.
## Authentication and Headers
- `X-API-Key: <api key>` on all requests.
- `Accept: application/json` on all requests.
- `Content-Type: application/json` for JSON request bodies.
- Support multipart or JSON (base64) document submission; if multipart proves complex, prioritize the URL/JSON flow first and add multipart next.
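A minimal header-construction sketch (Python for illustration; the DataFlex `cDocuPipeClient` would inject the same three headers):

```python
def build_headers(api_key, has_json_body=False):
    # X-API-Key and Accept go on every request; Content-Type is added
    # only when a JSON body accompanies the request.
    headers = {"X-API-Key": api_key, "Accept": "application/json"}
    if has_json_body:
        headers["Content-Type"] = "application/json"
    return headers
```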
## Proposed Package Layout
Add a dedicated module set under `AppSrc`:
- `cDocuPipeClient.pkg`
  - Low-level HTTP calls, headers, retry wrapper, diagnostics.
- `cDocuPipeDocumentService.pkg`
  - Submit document, get document status/details.
- `cDocuPipeStandardizationService.pkg`
  - Start standardization batch, get standardization output.
- `cDocuPipeJobService.pkg`
  - Poll job status.
- `docupipeai.h`
  - Minimal structs for orchestration metadata + raw JSON.

This keeps Docupipe concerns isolated from chat-model abstractions.
## Data Contracts (Minimal + Raw JSON)
Define orchestration-first structs in `docupipeai.h`:

- `tDocuPipeConfig`
  - `String sApiKey`
  - `String sBaseUrl`
  - timeout/retry properties
- `tDocuPipeResult`
  - `Boolean bOk`
  - `Integer iHttpStatus`
  - `String sError`
  - `String sRawJson`
  - `String sRequestId` (optional)
- `tDocuPipeDocumentSubmitResponse`
  - `String sDocumentId`
  - `String sRawJson`
- `tDocuPipeStandardizeResponse`
  - `String sJobId`
  - `String sRawJson`
- `tDocuPipeJobResponse`
  - `String sJobId`
  - `String sStatus`
  - `String sRawJson`
- `tDocuPipeStandardizationGetResponse`
  - `String sStandardizationId`
  - `String sRawJson`
  - `String sDataJson` (optional convenience if the payload nests under `data`)

Guiding rule: parse only the IDs/status required for orchestration; pass everything else through.
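The guiding rule, applied to `tDocuPipeDocumentSubmitResponse`, might look like this (Python illustration; the `documentId` response field name is an assumption to verify against the reference docs):

```python
import json

def parse_submit_response(raw_json):
    # Parse only the orchestration ID; preserve the complete payload
    # untouched so the consuming app owns all downstream interpretation.
    payload = json.loads(raw_json)
    return {
        "sDocumentId": str(payload.get("documentId", "")),
        "sRawJson": raw_json,  # full passthrough
    }
```

Using `.get(...)` with a default keeps the parser tolerant if the response shape drifts, which matches the risk-mitigation stance later in the plan.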
## Public API Surface
Expose async-first methods only:

- `Procedure Configure tDocuPipeConfig cfg`
- `Function SubmitDocumentFromFile String sFilePath String sDataset String sWorkflowId Returns tDocuPipeDocumentSubmitResponse`
- `Function SubmitDocumentFromUrl String sUrl String sDataset String sWorkflowId Returns tDocuPipeDocumentSubmitResponse`
- `Function GetDocument String sDocumentId Returns tDocuPipeResult`
- `Function Standardize String[] aDocumentIds String sSchemaId String sGuidelines String sDisplayMode String sSplitMode String sEffortLevel Returns tDocuPipeStandardizeResponse`
- `Function GetJob String sJobId Returns tDocuPipeJobResponse`
- `Function GetStandardization String sStandardizationId Returns tDocuPipeStandardizationGetResponse`

Optional utility:
- `Function IsTerminalJobStatus String sStatus Returns Boolean`
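`IsTerminalJobStatus` could be as simple as a set lookup (Python sketch; the actual status enum values are assumptions until verified per the reconciliation checklist):

```python
# Assumed terminal states -- verify the real enum against the job endpoint docs.
TERMINAL_STATUSES = {"completed", "failed", "error", "cancelled"}

def is_terminal_job_status(status):
    # Case-insensitive, whitespace-tolerant membership check.
    return status.strip().lower() in TERMINAL_STATUSES
```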
## Retry + Error Policy
Implement once in `cDocuPipeClient` and reuse everywhere:
- Retry exactly once on:
  - transport/network failure
  - HTTP 429
  - HTTP 5xx
- Do not retry on 4xx except 429.
- Return structured errors; no UI popups inside library classes.
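A retry-once wrapper sketch (Python for illustration; `do_request` is a hypothetical callable returning `(http_status, body)` or raising `OSError` on transport failure). Note that blind retries of the POST endpoints carry a duplicate-submission risk, so writes may need an idempotency safeguard:

```python
def call_with_single_retry(do_request):
    # Retry exactly once on network error, HTTP 429, or HTTP 5xx.
    # Any second failure bubbles up (structured result or raised error).
    def attempt():
        try:
            return do_request(), None
        except OSError as exc:  # transport/network failure
            return None, exc

    result, exc = attempt()
    retryable = exc is not None or result[0] == 429 or result[0] >= 500
    if retryable:
        result, exc = attempt()
    if exc is not None:
        raise exc  # bubble the error after the single retry
    return result
```

Centralizing this in one wrapper keeps the per-endpoint services free of retry logic, matching the "implement once, reuse everywhere" rule.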
## Logging and Diagnostics
Add low-friction diagnostics hooks:
- request method + endpoint
- elapsed time
- HTTP status
- core IDs (`documentId`, `jobId`, `standardizationId`)
- optional raw JSON logging toggle

Return raw error payloads whenever available to aid customer troubleshooting.
## End-to-End Workflow (Application + Library)
1. App selects schema and submission options.
2. Library `SubmitDocument...` -> returns `documentId`.
3. App stores `documentId`/URI in its existing PDF tracking table.
4. App calls `Standardize(...)` -> receives `jobId`.
5. App polls `GetJob(jobId)` on its own schedule.
6. On completion, app calls `GetStandardization(standardizationId)` (ID taken from the job payload).
7. App consumes the raw JSON for downstream AP/payroll posting logic.
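The app-side polling step (step 5) might look like this Python sketch. The cadence, give-up threshold, and terminal status values are all placeholders the application controls; the library only supplies the `get_job` call:

```python
import time

def poll_until_terminal(get_job, job_id, interval_s=2.0, max_polls=30):
    # Application-owned polling loop: cadence lives in the consuming app,
    # per the polling-first architecture constraint.
    for _ in range(max_polls):
        job = get_job(job_id)
        if job["status"] in ("completed", "failed"):  # assumed terminal set
            return job
        time.sleep(interval_s)
    raise TimeoutError("job %s not terminal after %d polls" % (job_id, max_polls))
```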
## Implementation Phases

### Phase 1: Contracts + Client Shell
1. Add `docupipeai.h` structs.
2. Add `cDocuPipeClient.pkg` with header injection + retry wrapper.
3. Add document/standardization/job service class shells.

### Phase 2: Endpoint Wiring
1. Implement `POST /document` (URL-based and/or file-based).
2. Implement `GET /document/{id}`.
3. Implement `POST /v2/standardize/batch`.
4. Implement `GET /job/{id}`.
5. Implement `GET /standardization/{id}`.

### Phase 3: Parsing + Ergonomics
1. Parse orchestration fields (`documentId`, `jobId`, status, standardization IDs).
2. Keep the complete response as raw JSON.
3. Add a README usage snippet for the async polling pattern.

### Phase 4: Verification
1. Mocked-response tests:
   - submit -> documentId
   - standardize -> jobId
   - job status transitions
2. Manual integration tests with sample invoices/timesheets:
   - submit -> standardize -> poll -> fetch output
3. Retry-path tests for 429/5xx/network failures.
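The job-status-transition case from Phase 4 can be mocked with no network dependency, for example (Python sketch; the status values are assumptions):

```python
class FakeJobEndpoint:
    # Mocked GET /job/{id}: replays a fixed status sequence so transition
    # handling can be tested deterministically.
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_job(self, job_id):
        return {"jobId": job_id, "status": next(self._statuses)}

endpoint = FakeJobEndpoint(["queued", "processing", "completed"])
seen = [endpoint.get_job("job_1")["status"] for _ in range(3)]
```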
## Risks and Mitigations
- **Potential endpoint response shape drift:** Keep parsers tolerant and preserve the full raw JSON.
- **File upload transport complexity in DataFlex:** Start with the URL submission path if needed, then add multipart support.
- **Polling strategy variance by customer volume:** Keep cadence entirely application-controlled.
## Source-of-Truth Reconciliation Note
This plan should be reconciled against the supplied DocuPipe docs set, especially:
- API reference pages for `post_document`, `get_document`, `post_standardize_batch_v2`, `get_job`, and `get_standardization`.
- Workflow pages (`workflow-upload-classify-and-standardize`, `upload-and-standardize-using-workflow`) for future one-call orchestration.
- Rate-limit and health-check pages for operational hardening.

Pre-implementation verification checklist:
- request payload fields (`/document`, `/v2/standardize/batch`)
- response ID fields (`documentId`, `jobId`, `standardizationId`)
- status enums and terminal-state handling
- header/auth naming (`X-API-Key`)
- documented limits (file size/page count, endpoint-level limits, rate limits)

If machine-readable docs, guide docs, and reference pages disagree, treat the endpoint reference pages as authoritative and preserve raw JSON passthrough to keep adapters resilient.
**Review comment:** The retry policy currently applies a blind retry to transport failures/429/5xx across the board, which will also hit `POST /document` and `POST /v2/standardize/batch`. If the first POST succeeded server-side but the response was lost (timeout/network drop), retrying can create duplicate documents/jobs and corrupt downstream queue state or billing counts. Please scope automatic retries to idempotent reads, or require an idempotency key/deduplication strategy for submit/standardize writes.