invoice-master-poc-v2/docs/product-plan-v2.md
Yaojia Wang 58bf75db68 WIP
2026-01-27 00:47:10 +01:00

Document Annotation Tool - Product Plan v2

Table of Contents

  1. Product Requirements Document (PRD)
  2. CSV Format Specification
  3. Database Schema Changes
  4. API Specification
  5. UI Wireframes (Text-Based)
  6. Implementation Phases
  7. State Machine Diagrams

1. Product Requirements Document (PRD)

1.1 Overview

This enhancement adds batch upload, document lifecycle management, a manual annotation workflow that is gated on auto-label completion, training management, and enhanced document detail views to the Invoice Master Document Annotation Tool.

1.2 User Stories

Epic 1: Batch Upload (ZIP Support)

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-1.1 | As a user, I want to upload a ZIP file containing multiple PDFs so that I can process many documents at once | ZIP file is extracted; each PDF is registered as a separate document; document IDs are returned for all files; invalid files are skipped with error message | P0 |
| US-1.2 | As a user, I want to include a CSV file in my ZIP for auto-labeling so that annotations are created automatically | CSV is parsed and validated; DocumentId column maps to PDF filenames; field values are stored for auto-labeling; invalid CSV rows are logged | P0 |
| US-1.3 | As a user, I want to upload a single PDF with auto-label values via API so that I can integrate with my workflow | PDF is uploaded; auto-label values provided in JSON body; auto-labeling runs automatically; document ID returned | P0 |
| US-1.4 | As a user, I want clear feedback on batch upload progress so that I know which files succeeded or failed | Upload progress indicator; per-file status (success/failed); error messages for failed files; summary count displayed | P1 |

Epic 2: Document List and Status

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-2.1 | As a user, I want to see a list of all uploaded documents so that I can manage my annotations | Paginated document list; shows filename, status, date; sortable columns; search/filter capability | P0 |
| US-2.2 | As a user, I want to see auto-label status for each document so that I know processing progress | Status badge: pending, processing, completed, failed; progress indicator for processing; error message for failed | P0 |
| US-2.3 | As a user, I want to see the upload source (API vs UI) so that I can track document origin | Source column in list; filter by source; source shown in detail view | P1 |
| US-2.4 | As a user, I want to see annotation preview for completed documents so that I can quickly review | Thumbnail with overlaid bounding boxes; annotation count badge; click to view full detail | P1 |

Epic 3: Manual Annotation with Auto-Label Dependency

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-3.1 | As a user, I want to be blocked from manual annotation if auto-label is pending so that I don't lose work | Clear message: "Auto-labeling in progress, please wait"; refresh button to check status; automatic unlock when complete | P0 |
| US-3.2 | As a user, I want to override auto-generated annotations so that I can correct errors | Can edit any annotation; source changes from "auto" to "manual"; original auto value preserved in history; override timestamp recorded | P0 |
| US-3.3 | As a user, I want to see which annotations are manual vs auto so that I can review confidence | Color-coded annotation badges; Manual: solid border; Auto: dashed border with confidence %; filter by source | P0 |
| US-3.4 | As a user, I want to accept or reject individual auto-annotations so that I can curate training data | Accept button marks as verified; reject button removes annotation; bulk accept/reject actions | P1 |

Epic 4: Training Page Features

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-4.1 | As a user, I want to see all documents available for training so that I can select training data | Filtered list (only labeled documents); shows annotation count per document; checkbox selection; select all/none options | P0 |
| US-4.2 | As a user, I want to select specific documents for training so that I can control data quality | Multi-select with checkboxes; selection count displayed; clear selection button; persisted selection state | P0 |
| US-4.3 | As a user, I want to see all trained models so that I can track model history | Model list with name, date, status; document count used; mAP/accuracy metrics; download model link | P0 |
| US-4.4 | As a user, I want to see which documents were used in training so that I can track data lineage | "Used in training" badge on documents; click to see model list; filter documents by training status | P1 |
| US-4.5 | As a user, I want to start a training job with selected documents so that I can create new models | Start training button; training config options; progress monitoring; email notification on completion | P0 |

Epic 5: Document Detail View (Enhanced)

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-5.1 | As a user, I want to see all annotations with their source so that I can review data quality | Annotation list with source column; confidence score for auto; edit/delete buttons; group by field type | P0 |
| US-5.2 | As a user, I want to see training history for a document so that I can understand model lineage | List of models using this document; training date and model name; link to model detail page | P1 |
| US-5.3 | As a user, I want to edit annotations inline so that I can quickly make corrections | Click to edit bounding box; drag to resize; double-click to edit text value; save/cancel buttons | P0 |
| US-5.4 | As a user, I want to see auto vs manual annotation comparison so that I can evaluate auto-label quality | Side-by-side comparison view; highlight differences; override history timeline | P2 |

Epic 6: API Endpoints

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-6.1 | As a developer, I want to upload ZIP/PDF via API so that I can automate document ingestion | POST endpoint accepts multipart; returns document IDs array; async processing option; webhook callback support | P0 |
| US-6.2 | As a developer, I want to upload PDF with auto-label values so that I can pre-annotate documents | JSON body with field values; auto-label runs synchronously or async; returns annotation IDs | P0 |
| US-6.3 | As a developer, I want to query document status so that I can poll for completion | GET endpoint with document ID; returns full status object; includes annotation summary | P0 |
| US-6.4 | As a developer, I want API-uploaded documents visible in UI so that I can manage all documents centrally | Same data model for API/UI uploads; source field distinguishes origin; full UI functionality available | P0 |

2. CSV Format Specification

2.1 Required Headers

customer_number,supplier_name,supplier_organisation_number,supplier_accounts,DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,Bankgiro,Plusgiro

2.2 Column Definitions

| Column | Type | Required | Maps to Class | Description | Validation Rules |
|--------|------|----------|---------------|-------------|------------------|
| DocumentId | string | YES | N/A | PDF filename (without .pdf extension) | Non-empty, alphanumeric + underscore/hyphen |
| customer_number | string | NO | customer_number (9) | Customer reference number | Max 50 chars |
| supplier_name | string | NO | N/A (metadata only) | Supplier company name | Max 255 chars |
| supplier_organisation_number | string | NO | supplier_organisation_number (7) | Swedish org number (XXXXXX-XXXX) | Format: 6 digits, hyphen, 4 digits |
| supplier_accounts | string | NO | N/A (metadata) | Pipe-separated account numbers | Max 500 chars |
| InvoiceNumber | string | NO | invoice_number (0) | Invoice reference | Max 50 chars |
| InvoiceDate | date | NO | invoice_date (1) | Invoice issue date | ISO 8601 or YYYY-MM-DD |
| InvoiceDueDate | date | NO | invoice_due_date (2) | Payment due date | ISO 8601 or YYYY-MM-DD |
| Amount | decimal | NO | amount (6) | Invoice total amount | Numeric, max 2 decimal places |
| OCR | string | NO | ocr_number (3) | Swedish OCR payment reference | Numeric string, max 25 chars |
| Message | string | NO | N/A (metadata only) | Free-text payment message | Max 140 chars |
| Bankgiro | string | NO | bankgiro (4) | Bankgiro account number | Format: XXX-XXXX or 7-8 digits |
| Plusgiro | string | NO | plusgiro (5) | Plusgiro account number | Format: XXXXXX-X or 6-8 digits |

2.3 Field to Class Mapping

CSV_TO_CLASS_MAPPING = {
    'InvoiceNumber': 0,      # invoice_number
    'InvoiceDate': 1,        # invoice_date
    'InvoiceDueDate': 2,     # invoice_due_date
    'OCR': 3,                # ocr_number
    'Bankgiro': 4,           # bankgiro
    'Plusgiro': 5,           # plusgiro
    'Amount': 6,             # amount
    'supplier_organisation_number': 7,  # supplier_organisation_number
    # 8: payment_line (derived from OCR/Bankgiro/Amount)
    'customer_number': 9,    # customer_number
}

2.4 Example CSV

customer_number,supplier_name,supplier_organisation_number,supplier_accounts,DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,Bankgiro,Plusgiro
C12345,ACME Corp,556677-8899,123-4567|987-6543,INV001,F2024-001,2024-01-15,2024-02-15,1250.00,7350012345678,,123-4567,
C12346,Widget AB,112233-4455,,INV002,F2024-002,2024-01-16,2024-02-16,3450.50,,,987-6543,
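
As a rough illustration of how a row like the ones above could be turned into auto-label targets, the sketch below parses the first example row with Python's csv module and applies the mapping from section 2.3. The helper name `row_to_targets` is hypothetical and not taken from the codebase.

```python
import csv
import io

# Mapping copied from section 2.3 (class 8, payment_line, is derived and omitted).
CSV_TO_CLASS_MAPPING = {
    "InvoiceNumber": 0,
    "InvoiceDate": 1,
    "InvoiceDueDate": 2,
    "OCR": 3,
    "Bankgiro": 4,
    "Plusgiro": 5,
    "Amount": 6,
    "supplier_organisation_number": 7,
    "customer_number": 9,
}

EXAMPLE_CSV = (
    "customer_number,supplier_name,supplier_organisation_number,supplier_accounts,"
    "DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,"
    "Bankgiro,Plusgiro\n"
    "C12345,ACME Corp,556677-8899,123-4567|987-6543,INV001,F2024-001,2024-01-15,"
    "2024-02-15,1250.00,7350012345678,,123-4567,\n"
)


def row_to_targets(row: dict) -> dict:
    """Hypothetical helper: return {class_id: value} for each non-empty mapped column."""
    targets = {}
    for column, class_id in CSV_TO_CLASS_MAPPING.items():
        value = (row.get(column) or "").strip()
        if value:
            targets[class_id] = value
    return targets


rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
targets = row_to_targets(rows[0])
```

Metadata-only columns such as supplier_name and Message are ignored here because they have no class ID in the mapping.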

2.5 Validation Rules

  1. DocumentId: Required, must match a PDF filename in the ZIP
  2. At least one matchable field: One of InvoiceNumber, OCR, Bankgiro, Plusgiro, Amount, supplier_organisation_number must be non-empty
  3. Date formats: YYYY-MM-DD, DD/MM/YYYY, DD.MM.YYYY
  4. Amount formats: 1234.56, 1 234,56, 1234,56 SEK
  5. Swedish org number: XXXXXX-XXXX pattern
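
The rules above could be checked per row along the following lines. This is a minimal sketch, not the project's validator: the function names, the error wording, and the choice to return a list of messages are all assumptions made for illustration.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

DOCUMENT_ID_RE = re.compile(r"[A-Za-z0-9_-]+")
ORG_NUMBER_RE = re.compile(r"\d{6}-\d{4}")
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")
MATCHABLE_FIELDS = (
    "InvoiceNumber", "OCR", "Bankgiro", "Plusgiro", "Amount",
    "supplier_organisation_number",
)


def parse_date(text: str) -> date:
    """Try each accepted date format in turn (rule 3)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date {text!r}")


def parse_amount(text: str) -> Decimal:
    """Accept '1234.56', '1 234,56' and '1234,56 SEK' (rule 4)."""
    cleaned = text.strip()
    if cleaned.upper().endswith("SEK"):
        cleaned = cleaned[:-3]
    cleaned = cleaned.replace(" ", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"unrecognised amount {text!r}") from None


def validate_row(row: dict, pdf_stems: set) -> list:
    """Return a list of error messages; an empty list means the row is valid."""
    def value(key: str) -> str:
        return (row.get(key) or "").strip()

    errors = []
    doc_id = value("DocumentId")
    if not DOCUMENT_ID_RE.fullmatch(doc_id):
        errors.append("DocumentId is missing or malformed")
    elif doc_id not in pdf_stems:
        errors.append(f"no PDF named {doc_id}.pdf in the ZIP")
    if not any(value(field) for field in MATCHABLE_FIELDS):
        errors.append("at least one matchable field is required")
    for field in ("InvoiceDate", "InvoiceDueDate"):
        if value(field):
            try:
                parse_date(value(field))
            except ValueError as exc:
                errors.append(f"{field}: {exc}")
    if value("Amount"):
        try:
            parse_amount(value("Amount"))
        except ValueError as exc:
            errors.append(f"Amount: {exc}")
    org = value("supplier_organisation_number")
    if org and not ORG_NUMBER_RE.fullmatch(org):
        errors.append("supplier_organisation_number must match XXXXXX-XXXX")
    return errors
```

Here `pdf_stems` is assumed to be the set of PDF filenames found in the ZIP with the .pdf extension removed, which is what DocumentId is matched against.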

3. Database Schema Changes

3.1 New Tables

3.1.1 BatchUpload Table

CREATE TABLE batch_uploads (
    batch_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    admin_token VARCHAR(255) NOT NULL REFERENCES admin_tokens(token),
    filename VARCHAR(255) NOT NULL,
    file_size INTEGER NOT NULL,
    upload_source VARCHAR(20) NOT NULL DEFAULT 'ui',  -- 'ui' or 'api'
    status VARCHAR(20) NOT NULL DEFAULT 'processing',
    -- Status: processing, completed, partial, failed
    total_files INTEGER DEFAULT 0,
    processed_files INTEGER DEFAULT 0,
    successful_files INTEGER DEFAULT 0,
    failed_files INTEGER DEFAULT 0,
    error_message TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMP
);

CREATE INDEX idx_batch_uploads_admin_token ON batch_uploads(admin_token);
CREATE INDEX idx_batch_uploads_status ON batch_uploads(status);

3.1.2 BatchUploadFile Table

CREATE TABLE batch_upload_files (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    batch_id UUID NOT NULL REFERENCES batch_uploads(batch_id) ON DELETE CASCADE,
    document_id UUID REFERENCES admin_documents(document_id),
    filename VARCHAR(255) NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- Status: pending, processing, completed, failed, skipped
    error_message TEXT,
    csv_row_data JSONB,  -- Parsed CSV row for this file
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    processed_at TIMESTAMP
);

CREATE INDEX idx_batch_upload_files_batch_id ON batch_upload_files(batch_id);
CREATE INDEX idx_batch_upload_files_document_id ON batch_upload_files(document_id);

3.1.3 TrainingDocumentLink Table

CREATE TABLE training_document_links (
    link_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    task_id UUID NOT NULL REFERENCES training_tasks(task_id) ON DELETE CASCADE,
    document_id UUID NOT NULL REFERENCES admin_documents(document_id) ON DELETE CASCADE,
    annotation_snapshot JSONB,  -- Snapshot of annotations at training time
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),

    UNIQUE(task_id, document_id)
);

CREATE INDEX idx_training_doc_links_task_id ON training_document_links(task_id);
CREATE INDEX idx_training_doc_links_document_id ON training_document_links(document_id);

3.1.4 AnnotationHistory Table

CREATE TABLE annotation_history (
    history_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    annotation_id UUID NOT NULL REFERENCES admin_annotations(annotation_id) ON DELETE CASCADE,
    action VARCHAR(20) NOT NULL,  -- 'created', 'updated', 'deleted', 'override'
    previous_value JSONB,  -- Full annotation state before change
    new_value JSONB,       -- Full annotation state after change
    changed_by VARCHAR(255),  -- admin_token
    change_reason TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_annotation_history_annotation_id ON annotation_history(annotation_id);
CREATE INDEX idx_annotation_history_created_at ON annotation_history(created_at);

3.2 Modified Tables

3.2.1 AdminDocument Modifications

ALTER TABLE admin_documents ADD COLUMN upload_source VARCHAR(20) DEFAULT 'ui';
-- Values: 'ui', 'api'

ALTER TABLE admin_documents ADD COLUMN batch_id UUID REFERENCES batch_uploads(batch_id);

ALTER TABLE admin_documents ADD COLUMN csv_field_values JSONB;
-- Stores original CSV values for reference

ALTER TABLE admin_documents ADD COLUMN auto_label_queued_at TIMESTAMP;
-- When auto-label was queued (for dependency checking)

ALTER TABLE admin_documents ADD COLUMN annotation_lock_until TIMESTAMP;
-- Lock for manual annotation while auto-label runs

CREATE INDEX idx_admin_documents_upload_source ON admin_documents(upload_source);
CREATE INDEX idx_admin_documents_batch_id ON admin_documents(batch_id);
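
One way the lock columns could feed the `can_annotate` and `annotation_lock_reason` fields returned by the API is sketched below. The exact rule is an assumption: the document is treated as locked while auto-labeling is pending or running, and `annotation_lock_until` acts as an expiry so that a stalled job does not block annotation forever. The function name is hypothetical, and the sketch uses timezone-aware datetimes for clarity.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

LOCK_MESSAGE = "Auto-labeling in progress, please wait"


def annotation_lock_state(
    auto_label_status: str,
    annotation_lock_until: Optional[datetime],
    now: Optional[datetime] = None,
) -> Tuple[bool, Optional[str]]:
    """Return (can_annotate, annotation_lock_reason) for one document."""
    now = now or datetime.now(timezone.utc)
    if auto_label_status in ("pending", "running"):
        # Assumed rule: a missing expiry means "locked until the job finishes".
        if annotation_lock_until is None or annotation_lock_until > now:
            return False, LOCK_MESSAGE
    return True, None


# Usage: a running job with five minutes left on its lock blocks annotation.
now = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
locked = annotation_lock_state("running", now + timedelta(minutes=5), now)
```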

3.2.2 AdminAnnotation Modifications

ALTER TABLE admin_annotations ADD COLUMN is_verified BOOLEAN DEFAULT FALSE;
-- User-verified annotation

ALTER TABLE admin_annotations ADD COLUMN verified_at TIMESTAMP;
ALTER TABLE admin_annotations ADD COLUMN verified_by VARCHAR(255);

ALTER TABLE admin_annotations ADD COLUMN override_source VARCHAR(20);
-- If this annotation overrides another: 'auto' or 'imported'

ALTER TABLE admin_annotations ADD COLUMN original_annotation_id UUID;
-- Reference to the annotation this overrides

CREATE INDEX idx_admin_annotations_source ON admin_annotations(source);
CREATE INDEX idx_admin_annotations_is_verified ON admin_annotations(is_verified);

3.2.3 TrainingTask Modifications

ALTER TABLE training_tasks ADD COLUMN document_count INTEGER DEFAULT 0;
-- Count of documents used in training

ALTER TABLE training_tasks ADD COLUMN document_ids UUID[];
-- Array of document IDs used (for quick reference)

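-- Note: PostgreSQL folds unquoted identifiers to lowercase, so this column is stored as metrics_map.
ALTER TABLE training_tasks ADD COLUMN metrics_mAP FLOAT;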
ALTER TABLE training_tasks ADD COLUMN metrics_precision FLOAT;
ALTER TABLE training_tasks ADD COLUMN metrics_recall FLOAT;
-- Extracted metrics for easy querying

CREATE INDEX idx_training_tasks_metrics ON training_tasks(metrics_mAP);

3.3 SQLModel Definitions

# File: src/data/admin_models.py

from datetime import datetime
from typing import Any
from uuid import UUID, uuid4
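from sqlmodel import Column, Field, JSON, SQLModel

# Note: the DDL in section 3.1 declares JSONB columns. To match it exactly on
# PostgreSQL, swap JSON for sqlalchemy.dialects.postgresql.JSONB in sa_column.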


class BatchUpload(SQLModel, table=True):
    """Batch upload record for ZIP uploads."""

    __tablename__ = "batch_uploads"

    batch_id: UUID = Field(default_factory=uuid4, primary_key=True)
    admin_token: str = Field(foreign_key="admin_tokens.token", max_length=255, index=True)
    filename: str = Field(max_length=255)
    file_size: int
    upload_source: str = Field(default="ui", max_length=20)
    status: str = Field(default="processing", max_length=20, index=True)
    total_files: int = Field(default=0)
    processed_files: int = Field(default=0)
    successful_files: int = Field(default=0)
    failed_files: int = Field(default=0)
    error_message: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    completed_at: datetime | None = Field(default=None)


class BatchUploadFile(SQLModel, table=True):
    """Individual file within a batch upload."""

    __tablename__ = "batch_upload_files"

    file_id: UUID = Field(default_factory=uuid4, primary_key=True)
    batch_id: UUID = Field(foreign_key="batch_uploads.batch_id", index=True)
    document_id: UUID | None = Field(default=None, foreign_key="admin_documents.document_id", index=True)
    filename: str = Field(max_length=255)
    status: str = Field(default="pending", max_length=20)
    error_message: str | None = Field(default=None)
    csv_row_data: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    created_at: datetime = Field(default_factory=datetime.utcnow)
    processed_at: datetime | None = Field(default=None)


class TrainingDocumentLink(SQLModel, table=True):
    """Link between training tasks and documents used."""

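    __tablename__ = "training_document_links"
    # Note: the DDL also declares UNIQUE(task_id, document_id); mirror it with a
    # UniqueConstraint in __table_args__ if the schema is created from these models.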

    link_id: UUID = Field(default_factory=uuid4, primary_key=True)
    task_id: UUID = Field(foreign_key="training_tasks.task_id", index=True)
    document_id: UUID = Field(foreign_key="admin_documents.document_id", index=True)
    annotation_snapshot: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    created_at: datetime = Field(default_factory=datetime.utcnow)


class AnnotationHistory(SQLModel, table=True):
    """History of annotation changes."""

    __tablename__ = "annotation_history"

    history_id: UUID = Field(default_factory=uuid4, primary_key=True)
    annotation_id: UUID = Field(foreign_key="admin_annotations.annotation_id", index=True)
    action: str = Field(max_length=20)
    previous_value: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    new_value: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    changed_by: str | None = Field(default=None, max_length=255)
    change_reason: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=datetime.utcnow, index=True)

4. API Specification

4.1 New Endpoints

4.1.1 Batch Upload (ZIP)

POST /api/v1/admin/batch/upload
Content-Type: multipart/form-data

Request:
  file: binary (ZIP file)
  async: boolean (default: true)
  auto_label: boolean (default: true)

Response (202 Accepted):
{
  "batch_id": "uuid",
  "status": "processing",
  "total_files": 25,
  "message": "Batch upload started. Use batch_id to check progress.",
  "status_url": "/api/v1/admin/batch/{batch_id}"
}

Response (200 OK - sync mode):
{
  "batch_id": "uuid",
  "status": "completed",
  "total_files": 25,
  "successful_files": 23,
  "failed_files": 2,
  "documents": [
    {
      "document_id": "uuid",
      "filename": "INV001.pdf",
      "status": "completed",
      "auto_label_status": "completed",
      "annotations_created": 8
    }
  ],
  "errors": [
    {
      "filename": "invalid.pdf",
      "error": "Corrupted PDF file"
    }
  ]
}
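
On the server side, the first step of this endpoint is to sort the archive contents into PDFs, an optional CSV, and entries to skip. The sketch below shows one way to do that with the standard zipfile module. The function name, the `%PDF-` header check used to detect corrupt files, and the error strings are illustrative assumptions rather than the actual implementation.

```python
import io
import zipfile
from pathlib import PurePosixPath


def scan_batch_zip(data: bytes):
    """Split a ZIP payload into ({stem: pdf_bytes}, csv_text or None, skipped)."""
    pdfs = {}
    csv_text = None
    skipped = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Ignore folder structure; only the base filename is kept.
            name = PurePosixPath(info.filename).name
            suffix = PurePosixPath(name).suffix.lower()
            content = archive.read(info)
            if suffix == ".pdf":
                if content.startswith(b"%PDF-"):
                    pdfs[PurePosixPath(name).stem] = content
                else:
                    skipped.append({"filename": name, "error": "Corrupted PDF file"})
            elif suffix == ".csv":
                if csv_text is None:
                    # utf-8-sig strips a byte-order mark if one is present.
                    csv_text = content.decode("utf-8-sig")
                else:
                    skipped.append({"filename": name, "error": "Multiple CSV files"})
            else:
                skipped.append({"filename": name, "error": "Unsupported file type"})
    return pdfs, csv_text, skipped
```

The keys of the returned PDF mapping are filenames without the .pdf extension, so they can be compared directly with the DocumentId column of the CSV.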

4.1.2 Batch Status

GET /api/v1/admin/batch/{batch_id}

Response:
{
  "batch_id": "uuid",
  "status": "processing",
  "progress": {
    "total": 25,
    "processed": 15,
    "successful": 14,
    "failed": 1,
    "percentage": 60
  },
  "files": [
    {
      "file_id": "uuid",
      "filename": "INV001.pdf",
      "document_id": "uuid",
      "status": "completed"
    }
  ],
  "created_at": "2024-01-15T10:00:00Z",
  "estimated_completion": "2024-01-15T10:05:00Z"
}
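
The `progress` object can be derived from the counters stored on batch_uploads. The sketch below assumes that processed files are exactly the successful plus the failed ones; how skipped files are counted is not specified here, so that part is a guess. The function name is hypothetical.

```python
def batch_progress(total: int, successful: int, failed: int) -> dict:
    """Build the `progress` object from the batch counters."""
    processed = successful + failed
    # Guard against an empty batch to avoid dividing by zero.
    percentage = round(100 * processed / total) if total else 0
    return {
        "total": total,
        "processed": processed,
        "successful": successful,
        "failed": failed,
        "percentage": percentage,
    }


# Usage with the numbers from the example response above.
progress = batch_progress(total=25, successful=14, failed=1)
```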

4.1.3 Upload PDF with Auto-Label Values

POST /api/v1/admin/documents/upload-with-labels
Content-Type: multipart/form-data

Request:
  file: binary (PDF file)
  field_values: JSON string
  {
    "InvoiceNumber": "F2024-001",
    "InvoiceDate": "2024-01-15",
    "Amount": "1250.00",
    "OCR": "7350012345678",
    "Bankgiro": "123-4567"
  }
  auto_label: boolean (default: true)
  wait_for_completion: boolean (default: false)

Response (202 Accepted):
{
  "document_id": "uuid",
  "filename": "invoice.pdf",
  "status": "auto_labeling",
  "auto_label_status": "running",
  "message": "Document uploaded. Auto-labeling in progress."
}

Response (200 OK - wait_for_completion=true):
{
  "document_id": "uuid",
  "filename": "invoice.pdf",
  "status": "labeled",
  "auto_label_status": "completed",
  "annotations": [
    {
      "annotation_id": "uuid",
      "class_id": 0,
      "class_name": "invoice_number",
      "text_value": "F2024-001",
      "confidence": 0.95,
      "bbox": { "x": 100, "y": 200, "width": 150, "height": 30 }
    }
  ]
}
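
Because `field_values` arrives as a JSON string inside a multipart form, the server has to decode and check it before auto-labeling. A minimal sketch follows, assuming the accepted keys are the CSV columns from section 2 other than DocumentId; the function name and error messages are hypothetical.

```python
import json

# Assumption: the same field names as the CSV columns, minus DocumentId.
ALLOWED_FIELDS = {
    "customer_number", "supplier_name", "supplier_organisation_number",
    "supplier_accounts", "InvoiceNumber", "InvoiceDate", "InvoiceDueDate",
    "Amount", "OCR", "Message", "Bankgiro", "Plusgiro",
}


def parse_field_values(raw: str) -> dict:
    """Decode the field_values form field and drop empty values."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"field_values is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("field_values must be a JSON object")
    unknown = sorted(set(data) - ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {', '.join(unknown)}")
    return {
        key: str(value).strip()
        for key, value in data.items()
        if value is not None and str(value).strip()
    }


values = parse_field_values('{"InvoiceNumber": "F2024-001", "Amount": "1250.00", "OCR": ""}')
```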

4.1.4 Query Document Status

GET /api/v1/admin/documents/{document_id}/status

Response:
{
  "document_id": "uuid",
  "filename": "invoice.pdf",
  "status": "labeled",
  "auto_label_status": "completed",
  "upload_source": "api",
  "annotation_summary": {
    "total": 8,
    "manual": 2,
    "auto": 6,
    "verified": 3
  },
  "can_annotate": true,
  "annotation_lock_reason": null,
  "training_history": [
    {
      "task_id": "uuid",
      "task_name": "Training Run 2024-01",
      "trained_at": "2024-01-20T15:00:00Z"
    }
  ]
}
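
A client that uploads asynchronously would typically poll this endpoint until auto-labeling reaches a terminal state. The sketch below keeps the HTTP call abstract: `fetch_status` is a hypothetical callable that performs the GET request and returns the decoded JSON body, and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def wait_for_auto_label(
    fetch_status: Callable[[str], dict],
    document_id: str,
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until auto_label_status is terminal, then return the status object."""
    deadline = clock() + timeout
    while True:
        status = fetch_status(document_id)
        if status["auto_label_status"] in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"auto-labeling of {document_id} did not finish in time")
        sleep(interval)


# Usage with a fake fetcher that finishes on the third call.
responses = iter([
    {"auto_label_status": "running"},
    {"auto_label_status": "running"},
    {"auto_label_status": "completed", "can_annotate": True},
])
result = wait_for_auto_label(lambda _id: next(responses), "doc-1", sleep=lambda s: None)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.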

4.1.5 Training with Document Selection

POST /api/v1/admin/training/tasks
Content-Type: application/json

Request:
{
  "name": "Training Run 2024-01",
  "description": "First training run with 500 documents",
  "document_ids": ["uuid1", "uuid2", "uuid3"],
  "config": {
    "model_name": "yolo11n.pt",
    "epochs": 100,
    "batch_size": 16,
    "image_size": 640
  },
  "scheduled_at": "2024-01-20T22:00:00Z"
}

Response:
{
  "task_id": "uuid",
  "name": "Training Run 2024-01",
  "status": "scheduled",
  "document_count": 500,
  "message": "Training task scheduled for 2024-01-20T22:00:00Z"
}

4.1.6 Get Documents for Training

GET /api/v1/admin/training/documents

Query Parameters:
  - status: labeled (required)
  - has_annotations: true
  - min_annotation_count: 3
  - exclude_used_in_training: boolean
  - limit: 100
  - offset: 0

Response:
{
  "total": 1500,
  "documents": [
    {
      "document_id": "uuid",
      "filename": "INV001.pdf",
      "annotation_count": 8,
      "annotation_sources": { "manual": 3, "auto": 5 },
      "used_in_training": ["task_id_1", "task_id_2"],
      "last_modified": "2024-01-15T10:00:00Z"
    }
  ]
}
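
The query parameters above amount to a filter followed by pagination. The in-memory sketch below illustrates the intended semantics under two assumptions: `total` counts all matching documents before pagination, and a document counts as used when its `used_in_training` list is non-empty. A real implementation would express this as a database query; the function name is hypothetical.

```python
def select_training_documents(
    documents: list,
    min_annotation_count: int = 3,
    exclude_used_in_training: bool = False,
    limit: int = 100,
    offset: int = 0,
) -> dict:
    """Filter labeled documents, then apply limit/offset pagination."""
    eligible = [
        doc
        for doc in documents
        if doc["annotation_count"] >= min_annotation_count
        and not (exclude_used_in_training and doc["used_in_training"])
    ]
    return {"total": len(eligible), "documents": eligible[offset:offset + limit]}


docs = [
    {"document_id": "a", "annotation_count": 8, "used_in_training": ["task_id_1"]},
    {"document_id": "b", "annotation_count": 2, "used_in_training": []},
    {"document_id": "c", "annotation_count": 5, "used_in_training": []},
]
fresh = select_training_documents(docs, exclude_used_in_training=True)
```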

4.1.7 Get Model List

GET /api/v1/admin/training/models

Query Parameters:
  - status: completed
  - limit: 20
  - offset: 0

Response:
{
  "total": 15,
  "models": [
    {
      "task_id": "uuid",
      "name": "Training Run 2024-01",
      "status": "completed",
      "document_count": 500,
      "created_at": "2024-01-20T15:00:00Z",
      "completed_at": "2024-01-20T18:30:00Z",
      "metrics": {
        "mAP": 0.935,
        "precision": 0.92,
        "recall": 0.88
      },
      "model_path": "runs/train/invoice_fields_20240120/weights/best.pt",
      "download_url": "/api/v1/admin/training/models/{task_id}/download"
    }
  ]
}

4.1.8 Override Annotation

PATCH /api/v1/admin/documents/{document_id}/annotations/{annotation_id}/override
Content-Type: application/json

Request:
{
  "bbox": { "x": 110, "y": 205, "width": 145, "height": 28 },
  "text_value": "F2024-001-A",
  "reason": "Corrected OCR error"
}

Response:
{
  "annotation_id": "uuid",
  "source": "manual",
  "override_source": "auto",
  "original_annotation_id": "uuid",
  "message": "Annotation overridden successfully",
  "history_id": "uuid"
}
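
The override flow can be sketched as a pure function that produces the updated annotation together with an annotation_history record. This sketch assumes an in-place update that keeps the same annotation ID; the response above also carries `original_annotation_id`, which suggests the real design may create a new row instead, and that variant is not modelled here. The function name and the dictionary shapes are illustrative.

```python
import copy
from datetime import datetime, timezone
from typing import Optional, Tuple
from uuid import uuid4


def override_annotation(
    annotation: dict,
    changes: dict,
    changed_by: str,
    reason: Optional[str] = None,
) -> Tuple[dict, dict]:
    """Return (updated_annotation, history_record) without mutating the input."""
    previous = copy.deepcopy(annotation)
    updated = copy.deepcopy(annotation)
    # Only bbox and text_value are editable in the request body shown above.
    for key in ("bbox", "text_value"):
        if key in changes:
            updated[key] = changes[key]
    if previous["source"] != "manual":
        updated["override_source"] = previous["source"]
    updated["source"] = "manual"
    history = {
        "history_id": str(uuid4()),
        "annotation_id": annotation["annotation_id"],
        "action": "override",
        "previous_value": previous,
        "new_value": updated,
        "changed_by": changed_by,
        "change_reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    return updated, history


auto = {
    "annotation_id": "a1",
    "source": "auto",
    "text_value": "F2024-001",
    "bbox": {"x": 100, "y": 200, "width": 150, "height": 30},
}
updated, history = override_annotation(
    auto, {"text_value": "F2024-001-A"}, "admin-token", "Corrected OCR error"
)
```

Storing the full before and after state in the history record is what lets the original auto value be recovered later, as US-3.2 requires.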

4.2 Modified Endpoints

4.2.1 Document List (Enhanced)

GET /api/v1/admin/documents

Query Parameters (additions):
  - upload_source: 'ui' | 'api' | null
  - has_annotations: boolean
  - auto_label_status: 'pending' | 'running' | 'completed' | 'failed'
  - used_in_training: boolean
  - batch_id: uuid

Response (additions to DocumentItem):
{
  "documents": [
    {
      // ... existing fields ...
      "upload_source": "api",
      "batch_id": "uuid",
      "can_annotate": true,
      "training_count": 2
    }
  ]
}

4.2.2 Document Detail (Enhanced)

GET /api/v1/admin/documents/{document_id}

Response (additions):
{
  // ... existing fields ...
  "upload_source": "api",
  "csv_field_values": {
    "InvoiceNumber": "F2024-001",
    "Amount": "1250.00"
  },
  "can_annotate": true,
  "annotation_lock_reason": null,
  "annotations": [
    {
      // ... existing fields ...
      "is_verified": true,
      "verified_at": "2024-01-16T09:00:00Z",
      "override_source": null
    }
  ],
  "training_history": [
    {
      "task_id": "uuid",
      "name": "Training Run 2024-01",
      "trained_at": "2024-01-20T15:00:00Z",
      "model_metrics": { "mAP": 0.935 }
    }
  ]
}

5. UI Wireframes (Text-Based)

5.1 Document List View

+------------------------------------------------------------------+
|  DOCUMENT ANNOTATION TOOL                    [User: Admin] [Logout]|
+------------------------------------------------------------------+
|  [Documents] [Training] [Models] [Settings]                       |
+------------------------------------------------------------------+
|                                                                    |
|  DOCUMENTS                                                         |
|  +-----------------+  +-----------------------------------------+ |
|  | UPLOAD          |  | FILTERS                                 | |
|  | [Single PDF]    |  | Status: [All v] Source: [All v]        | |
|  | [ZIP Batch]     |  | Auto-Label: [All v] Search: [________] | |
|  +-----------------+  +-----------------------------------------+ |
|                                                                    |
|  +--------------------------------------------------------------+ |
|  | []  Filename       | Status    | Auto-Label | Source | Date  | |
|  +--------------------------------------------------------------+ |
|  | [] INV001.pdf      | Labeled   | Completed  | API    | 01/15 | |
|  |    [8 annotations] | [Preview] |    [95%]   |        |       | |
|  +--------------------------------------------------------------+ |
|  | [] INV002.pdf      | Pending   | Running    | UI     | 01/16 | |
|  |    [0 annotations] | [Locked]  |  [====  ]  |        |       | |
|  +--------------------------------------------------------------+ |
|  | [] INV003.pdf      | Labeled   | Failed     | API    | 01/16 | |
|  |    [5 annotations] | [Preview] | [Retry]    |        |       | |
|  +--------------------------------------------------------------+ |
|  | [] INV004.pdf      | Labeled   | Completed  | UI     | 01/17 | |
|  |    [10 annotations]| [Preview] |    [98%]   | [Used] |       | |
|  +--------------------------------------------------------------+ |
|                                                                    |
|  Showing 1-20 of 1,543 documents    [<] [1] [2] [3] ... [78] [>]  |
|                                                                    |
|  [Delete Selected]  [Start Training with Selected]                 |
+------------------------------------------------------------------+

5.2 Document Detail View

+------------------------------------------------------------------+
|  < Back to Documents                           INV001.pdf         |
+------------------------------------------------------------------+
|                                                                    |
|  +---------------------------+  +-------------------------------+ |
|  |                           |  |  DOCUMENT INFO               | |
|  |                           |  |  Status: Labeled             | |
|  |   [Page 1 Image with      |  |  Source: API Upload          | |
|  |    Annotation Overlays]   |  |  Auto-Label: Completed (95%) | |
|  |                           |  |  Pages: 1                     | |
|  |   [Manual: Solid border]  |  |  Uploaded: 2024-01-15        | |
|  |   [Auto: Dashed border]   |  |                              | |
|  |                           |  |  TRAINING HISTORY            | |
|  |                           |  |  - Run 2024-01 (mAP: 93.5%)  | |
|  |                           |  |  - Run 2024-02 (mAP: 95.1%)  | |
|  |                           |  |                              | |
|  +---------------------------+  +-------------------------------+ |
|                                                                    |
|  ANNOTATIONS                        [Add Annotation] [Run OCR]    |
|  +--------------------------------------------------------------+ |
|  | Field              | Value       | Source | Conf | Actions   | |
|  +--------------------------------------------------------------+ |
|  | invoice_number     | F2024-001   | Manual | -    | [E] [D]   | |
|  +--------------------------------------------------------------+ |
|  | invoice_date       | 2024-01-15  | Auto   | 95%  | [V] [E][D]| |
|  +--------------------------------------------------------------+ |
|  | amount             | 1,250.00    | Auto   | 98%  | [V] [E][D]| |
|  +--------------------------------------------------------------+ |
|  | ocr_number         | 7350012345  | Auto   | 87%  | [V] [E][D]| |
|  +--------------------------------------------------------------+ |
|  | bankgiro           | 123-4567    | Manual | -    | [E] [D]   | |
|  +--------------------------------------------------------------+ |
|                                                                    |
|  [V] = Verify  [E] = Edit  [D] = Delete                          |
|                                                                    |
|  CSV FIELD VALUES (Reference)                                     |
|  +--------------------------------------------------------------+ |
|  | InvoiceNumber: F2024-001  | InvoiceDate: 2024-01-15          | |
|  | Amount: 1250.00           | OCR: 7350012345678               | |
|  | Bankgiro: 123-4567        |                                   | |
|  +--------------------------------------------------------------+ |
+------------------------------------------------------------------+

5.3 Training Page

+------------------------------------------------------------------+
|  DOCUMENT ANNOTATION TOOL                    [User: Admin] [Logout]|
+------------------------------------------------------------------+
|  [Documents] [Training] [Models] [Settings]                       |
+------------------------------------------------------------------+
|                                                                    |
|  TRAINING                                                          |
|                                                                    |
|  DOCUMENT SELECTION                            Selected: 500 docs  |
|  +--------------------------------------------------------------+ |
|  | []  Filename       | Annotations | Source | Last Modified    | |
|  +--------------------------------------------------------------+ |
|  | [x] INV001.pdf     | 8 (M:3 A:5) | API    | 2024-01-15       | |
|  +--------------------------------------------------------------+ |
|  | [x] INV002.pdf     | 10 (M:2 A:8)| UI     | 2024-01-16       | |
|  +--------------------------------------------------------------+ |
|  | [ ] INV003.pdf     | 5 (M:5 A:0) | UI     | 2024-01-16       | |
|  +--------------------------------------------------------------+ |
|  | [x] INV004.pdf     | 12 (M:4 A:8)| API    | 2024-01-17       | |
|  +--------------------------------------------------------------+ |
|                                                                    |
|  [Select All] [Select None] [Select Not Used in Training]         |
|                                                                    |
|  Showing labeled documents only    [<] [1] [2] [3] ... [50] [>]   |
|                                                                    |
|  TRAINING CONFIGURATION                                           |
|  +--------------------------------------------------------------+ |
|  | Name: [Training Run 2024-01____________]                      | |
|  | Description: [First training with 500 documents_________]     | |
|  |                                                               | |
|  | Base Model: [yolo11n.pt v]   Epochs: [100]   Batch: [16]     | |
|  | Image Size: [640]            Device: [GPU 0 v]               | |
|  |                                                               | |
|  | [ ] Schedule for later: [2024-01-20] [22:00]                 | |
|  +--------------------------------------------------------------+ |
|                                                                    |
|  [Start Training]                                                  |
+------------------------------------------------------------------+

5.4 Model History View

+------------------------------------------------------------------+
|  DOCUMENT ANNOTATION TOOL                    [User: Admin] [Logout]|
+------------------------------------------------------------------+
|  [Documents] [Training] [Models] [Settings]                       |
+------------------------------------------------------------------+
|                                                                    |
|  TRAINED MODELS                                                    |
|                                                                    |
|  +--------------------------------------------------------------+ |
|  | Name                    | Status     | Docs  | mAP   | Date  | |
|  +--------------------------------------------------------------+ |
|  | Training Run 2024-03    | Running    | 750   | -     | 01/25 | |
|  |                         | [====    ] |       |       |       | |
|  |                         | [View Logs] [Cancel]                | |
|  +--------------------------------------------------------------+ |
|  | Training Run 2024-02    | Completed  | 600   | 95.1% | 01/20 | |
|  |                         | P: 94% R: 92%                       | |
|  |                         | [View] [Download] [Use as Base]     | |
|  +--------------------------------------------------------------+ |
|  | Training Run 2024-01    | Completed  | 500   | 93.5% | 01/15 | |
|  |                         | P: 92% R: 88%                       | |
|  |                         | [View] [Download] [Use as Base]     | |
|  +--------------------------------------------------------------+ |
|  | Initial Training        | Completed  | 200   | 85.2% | 01/10 | |
|  |                         | P: 84% R: 80%                       | |
|  |                         | [View] [Download] [Use as Base]     | |
|  +--------------------------------------------------------------+ |
|                                                                    |
|  MODEL DETAIL: Training Run 2024-02                               |
|  +--------------------------------------------------------------+ |
|  | Created: 2024-01-20 15:00  | Completed: 2024-01-20 18:30     | |
|  | Duration: 3h 30m           | Documents: 600                   | |
|  |                                                               | |
|  | Metrics:                                                      | |
|  | - mAP@0.5: 95.1%                                             | |
|  | - Precision: 94%                                             | |
|  | - Recall: 92%                                                | |
|  |                                                               | |
|  | Configuration:                                                | |
|  | - Base: yolo11n.pt   Epochs: 100   Batch: 16   Size: 640    | |
|  |                                                               | |
|  | Documents Used: [View 600 documents]                          | |
|  +--------------------------------------------------------------+ |
+------------------------------------------------------------------+

5.5 Batch Upload Modal

+------------------------------------------------------------------+
|                      BATCH UPLOAD                           [X]   |
+------------------------------------------------------------------+
|                                                                    |
|  Upload a ZIP file containing:                                    |
|  - Multiple PDF files                                             |
|  - (Optional) CSV file for auto-labeling                         |
|                                                                    |
|  +--------------------------------------------------------------+ |
|  |                                                               | |
|  |     [Drag and drop ZIP file here]                            | |
|  |              or                                               | |
|  |         [Browse Files]                                        | |
|  |                                                               | |
|  +--------------------------------------------------------------+ |
|                                                                    |
|  [x] Auto-label documents (requires CSV)                          |
|  [ ] Process asynchronously                                       |
|                                                                    |
|  CSV FORMAT REQUIREMENTS:                                         |
|  Required columns: DocumentId                                     |
|  Optional: InvoiceNumber, InvoiceDate, Amount, OCR, Bankgiro...   |
|  [View full CSV specification]                                    |
|                                                                    |
|                                          [Cancel] [Upload]        |
+------------------------------------------------------------------+

+------------------------------------------------------------------+
|                    UPLOAD PROGRESS                          [X]   |
+------------------------------------------------------------------+
|                                                                    |
|  Processing batch upload...                                       |
|                                                                    |
|  [========================================          ] 80%          |
|                                                                    |
|  Files: 20 / 25                                                   |
|  Successful: 18                                                   |
|  Failed: 2                                                        |
|                                                                    |
|  +--------------------------------------------------------------+ |
|  | [OK] INV001.pdf - Completed (8 annotations)                   | |
|  | [OK] INV002.pdf - Completed (10 annotations)                  | |
|  | [!!] INV003.pdf - Failed: Corrupted PDF                       | |
|  | [OK] INV004.pdf - Completed (6 annotations)                   | |
|  | [...] Processing INV005.pdf...                                | |
|  +--------------------------------------------------------------+ |
|                                                                    |
|                                          [Cancel] [Close]         |
+------------------------------------------------------------------+
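
The counters in the progress wireframe can be derived from two per-file tallies. A minimal sketch, assuming a hypothetical `BatchProgress` holder (the class and its field names are illustrative, not part of the plan's API):

```python
from dataclasses import dataclass


@dataclass
class BatchProgress:
    """Hypothetical progress holder mirroring the wireframe's counters."""
    total: int       # number of files found in the ZIP
    succeeded: int   # files processed without error
    failed: int      # files skipped or rejected

    @property
    def processed(self) -> int:
        # "Files: 20 / 25" counts successes and failures together.
        return self.succeeded + self.failed

    @property
    def percent(self) -> int:
        # Guard against an empty archive to avoid division by zero.
        if self.total == 0:
            return 0
        return round(100 * self.processed / self.total)


progress = BatchProgress(total=25, succeeded=18, failed=2)
print(f"Files: {progress.processed} / {progress.total} ({progress.percent}%)")
```

With the wireframe's numbers (18 successful, 2 failed, 25 total) this yields 20 processed files and 80%.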

6. Implementation Phases

Phase 1: Database and Core Models (Week 1)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 1.1 | Create database migration script | src/data/migrations/ | Low |
| 1.2 | Add new SQLModel classes | src/data/admin_models.py | Low |
| 1.3 | Update AdminDB with new methods | src/data/admin_db.py | Medium |
| 1.4 | Add unit tests for new models | tests/data/test_admin_models.py | Low |

Dependencies: None

Risk Assessment: Low - mostly additive changes to existing structure

Phase 2: Batch Upload Backend (Week 2)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 2.1 | Create ZIP extraction service | src/web/batch_upload_service.py | Medium |
| 2.2 | Add CSV parsing with new format | src/data/csv_loader.py | Low |
| 2.3 | Create batch upload routes | src/web/admin_batch_routes.py | Medium |
| 2.4 | Add async processing queue | src/web/batch_queue.py | High |
| 2.5 | Integration tests | tests/web/test_batch_upload.py | Medium |

Dependencies: Phase 1

Risk Assessment: Medium - ZIP handling and async processing add complexity

Phase 3: Enhanced Document Management (Week 3)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 3.1 | Add upload source tracking | src/data/admin_models.py | Low |
| 3.2 | Update document list endpoint | src/web/admin_routes.py | Low |
| 3.3 | Add annotation lock mechanism | src/web/admin_annotation_routes.py | Medium |
| 3.4 | Add document status endpoint | src/web/admin_routes.py | Low |
| 3.5 | Update auto-label service | src/web/admin_autolabel.py | Medium |

Dependencies: Phase 1, Phase 2

Risk Assessment: Medium - locking mechanism needs careful implementation

Phase 4: Manual Annotation Enhancement (Week 4)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 4.1 | Add override mechanism | src/web/admin_annotation_routes.py | Medium |
| 4.2 | Add annotation history | src/data/admin_db.py | Low |
| 4.3 | Add verification endpoint | src/web/admin_annotation_routes.py | Low |
| 4.4 | Update schemas with new fields | src/web/admin_schemas.py | Low |

Dependencies: Phase 3

Risk Assessment: Low - extending existing annotation system

Phase 5: Training Integration (Week 5)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 5.1 | Add document selection for training | src/web/admin_training_routes.py | Medium |
| 5.2 | Add training document link table | src/data/admin_db.py | Low |
| 5.3 | Add model list endpoint | src/web/admin_training_routes.py | Low |
| 5.4 | Update export with selection | src/web/admin_training_routes.py | Medium |
| 5.5 | Add metrics extraction | src/cli/train.py | Medium |

Dependencies: Phase 1, Phase 4

Risk Assessment: Medium - integration with training pipeline

Phase 6: Frontend Implementation (Weeks 6-7)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 6.1 | Create React component structure | frontend/ | High |
| 6.2 | Implement document list view | frontend/src/components/ | Medium |
| 6.3 | Implement document detail view | frontend/src/components/ | High |
| 6.4 | Implement training page | frontend/src/components/ | Medium |
| 6.5 | Implement batch upload modal | frontend/src/components/ | Medium |
| 6.6 | Add annotation editor | frontend/src/components/ | High |

Dependencies: Phase 2-5

Risk Assessment: High - frontend development is a new component

Phase 7: Testing and Documentation (Week 8)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 7.1 | Integration tests | tests/integration/ | Medium |
| 7.2 | E2E tests | tests/e2e/ | High |
| 7.3 | API documentation | docs/api/ | Low |
| 7.4 | User guide | docs/user-guide/ | Low |
| 7.5 | Performance testing | tests/performance/ | Medium |

Dependencies: All phases

Risk Assessment: Medium

Risk Mitigation Strategies

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| ZIP bomb attack | High | Low | Limit max file count, max total size, scan before extraction |
| Async queue failures | Medium | Medium | Implement retry logic, dead letter queue, manual retry endpoint |
| Annotation lock deadlock | Medium | Low | Timeout-based locks, admin override capability |
| Large batch performance | Medium | High | Chunked processing, progress tracking, background workers |
| Database migration issues | High | Low | Backward compatible changes, rollback scripts |
| Frontend complexity | Medium | Medium | Use established UI framework, incremental delivery |
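
The ZIP bomb mitigation (limit file count and total size, scan before extraction) could be sketched with the standard `zipfile` module, which reads the archive's central directory without decompressing any member. The function name and both limits below are assumptions chosen for illustration. The declared sizes come from the archive itself and can be forged, so a real extraction step would also need to count bytes as it writes them.

```python
import io
import zipfile

# Hypothetical limits; the plan does not specify concrete values.
MAX_FILES = 1000
MAX_TOTAL_BYTES = 500 * 1024 * 1024


def check_zip_limits(data: bytes,
                     max_files: int = MAX_FILES,
                     max_total_bytes: int = MAX_TOTAL_BYTES) -> list[str]:
    """Return member file names if the archive is within limits.

    Only the central directory is inspected; nothing is extracted.
    Raises ValueError when a limit is exceeded.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = [info for info in archive.infolist() if not info.is_dir()]
        if len(members) > max_files:
            raise ValueError(f"too many files: {len(members)} > {max_files}")
        # file_size is the declared uncompressed size of each member.
        total = sum(info.file_size for info in members)
        if total > max_total_bytes:
            raise ValueError(f"archive too large: {total} > {max_total_bytes}")
        return [info.filename for info in members]
```

Running the check before any extraction lets the upload route reject an oversized archive early, before any per-file work is queued.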

7. State Machine Diagrams

7.1 Document Lifecycle States

                                    +-------------+
                                    |   DELETED   |
                                    +------^------+
                                           |
                                           | delete
                                           |
+----------+     upload      +----------+  |
|          | --------------> |          |--+
|  (none)  |                 | PENDING  |
|          |                 |          |
+----------+                 +----+-----+
                                  |
                 +----------------+-----------------+
                 |                                  |
                 | trigger auto-label               | create manual annotation
                 v                                  |
          +-------------+                          |
          |             |                          |
          | AUTO_LABEL- |                          |
          |    ING      |                          |
          |             |                          |
          +------+------+                          |
                 |                                  |
       +---------+---------+                       |
       |                   |                       |
       | complete          | fail                  |
       v                   v                       |
+-------------+     +-------------+               |
|             |     |             |               |
|   LABELED   |<----+  PENDING   +<--------------+
|             |  retry|  (failed) |
+------+------+     +-------------+
       |
       | export
       v
+-------------+
|             |
|  EXPORTED   |
|             |
+-------------+
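
One way to enforce this lifecycle in code is a lookup table of allowed transitions. The sketch below is one reading of the diagram, with two assumptions: the "PENDING (failed)" box is modelled as plain `PENDING`, so a retry is simply a second `trigger_auto_label` event, and a manual annotation moves the document straight to `LABELED`. The enum values and event names are hypothetical.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    PENDING = "pending"
    AUTO_LABELING = "auto_labeling"
    LABELED = "labeled"
    EXPORTED = "exported"
    DELETED = "deleted"


# (current status, event) -> next status
TRANSITIONS = {
    (DocumentStatus.PENDING, "trigger_auto_label"): DocumentStatus.AUTO_LABELING,
    (DocumentStatus.PENDING, "create_manual_annotation"): DocumentStatus.LABELED,
    (DocumentStatus.PENDING, "delete"): DocumentStatus.DELETED,
    (DocumentStatus.AUTO_LABELING, "complete"): DocumentStatus.LABELED,
    (DocumentStatus.AUTO_LABELING, "fail"): DocumentStatus.PENDING,
    (DocumentStatus.LABELED, "export"): DocumentStatus.EXPORTED,
}


def apply_event(status: DocumentStatus, event: str) -> DocumentStatus:
    """Return the next status, or raise ValueError for a disallowed event."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {status.value}") from None
```

Keeping the transitions in one table makes it easy to reject, for example, an export request for a document that is still being auto-labeled.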

7.2 Auto-Label Workflow States

                                    +-------------+
                                    |   MANUAL    |
                                    |  OVERRIDE   |
                                    +------^------+
                                           |
                                           | user edit
                                           |
+----------+     queue       +----------+  |    +-----------+
|          | --------------> |          |  |    |           |
|  (none)  |                 | QUEUED   |--+--->| COMPLETED |
|          |                 |          |       |           |
+----------+                 +----+-----+       +-----^-----+
                                  |                   |
                                  | start             |
                                  v                   |
                           +-------------+           |
                           |             |           |
                           |  RUNNING    +-----------+
                           |             |  success
                           +------+------+
                                  |
                                  | error
                                  v
                           +-------------+
                           |             |
                           |   FAILED    |
                           |             |
                           +------+------+
                                  |
                                  | retry
                                  v
                           +-------------+
                           |             |
                           |   QUEUED    |
                           |             |
                           +-------------+

7.3 Batch Upload States

+----------+     upload      +-------------+
|          | --------------> |             |
|  (none)  |                 | PROCESSING  |
|          |                 |             |
+----------+                 +------+------+
                                    |
                    +---------------+---------------+
                    |               |               |
                    | all success   | some fail     | all fail
                    v               v               v
             +-------------+ +-------------+ +-------------+
             |             | |             | |             |
             | COMPLETED   | |   PARTIAL   | |   FAILED    |
             |             | |             | |             |
             +-------------+ +-------------+ +-------------+
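
The three terminal states follow directly from the per-file counts. A minimal sketch, assuming a hypothetical helper `final_batch_status` that is called once every file in the batch has been processed:

```python
def final_batch_status(succeeded: int, failed: int) -> str:
    """Map per-file results to the terminal batch state from the diagram."""
    if succeeded + failed == 0:
        # An empty batch has no meaningful terminal state in the diagram.
        raise ValueError("batch contains no processed files")
    if failed == 0:
        return "COMPLETED"   # all success
    if succeeded == 0:
        return "FAILED"      # all fail
    return "PARTIAL"         # some fail


print(final_batch_status(succeeded=18, failed=2))
```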

7.4 Training Task States

+----------+     create      +-------------+
|          | --------------> |             |
|  (none)  |                 |   PENDING   |
|          |                 |             |
+----------+                 +------+------+
                                    |
                      +-------------+-------------+
                      |                           |
                      | immediate                 | scheduled
                      v                           v
               +-------------+             +-------------+
               |             |             |             |
               |   RUNNING   |<------------+ SCHEDULED   |
               |             |   trigger   |             |
               +------+------+             +------+------+
                      |                           |
            +---------+---------+                 | cancel
            |                   |                 v
            | success           | error    +-------------+
            v                   v          |             |
     +-------------+     +-------------+   | CANCELLED   |
     |             |     |             |   |             |
     | COMPLETED   |     |   FAILED    |   +-------------+
     |             |     |             |
     +-------------+     +------+------+
                                |
                                | retry
                                v
                         +-------------+
                         |             |
                         |   PENDING   |
                         |             |
                         +-------------+
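
The branch out of PENDING depends on whether the "Schedule for later" option was set. The sketch below follows the diagram as drawn, so cancellation is only allowed from SCHEDULED, even though the Model History wireframe also shows a [Cancel] button on a running task. The function and table names are hypothetical.

```python
from datetime import datetime
from typing import Optional

# Allowed moves, read off the diagram: state -> set of next states.
TRAINING_TRANSITIONS = {
    "PENDING": {"RUNNING", "SCHEDULED"},
    "SCHEDULED": {"RUNNING", "CANCELLED"},
    "RUNNING": {"COMPLETED", "FAILED"},
    "FAILED": {"PENDING"},          # retry
    "COMPLETED": set(),
    "CANCELLED": set(),
}


def state_after_create(scheduled_at: Optional[datetime], now: datetime) -> str:
    """Pick the state a new PENDING task moves to."""
    if scheduled_at is not None and scheduled_at > now:
        return "SCHEDULED"
    return "RUNNING"                # immediate start


def can_transition(current: str, target: str) -> bool:
    return target in TRAINING_TRANSITIONS.get(current, set())
```

For example, a task created on 2024-01-17 with a schedule of 2024-01-20 22:00 would move to SCHEDULED, and only the later trigger would move it to RUNNING.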

7.5 Annotation Lock States

                                    +-------------+
                                    |   LOCKED    |
                                    | (auto-label |
                                    |  running)   |
                                    +------^------+
                                           |
                                           | auto-label starts
                                           |
+----------+     upload      +----------+  |
|          | --------------> |          |--+
|  (none)  |                 | UNLOCKED |<---------+
|          |                 |          |          |
+----------+                 +----+-----+          |
                                  |                |
                                  | auto-label     | auto-label
                                  | starts         | completes/fails
                                  |                |
                                  v                |
                           +-------------+         |
                           |             |         |
                           |   LOCKED    +---------+
                           | (timeout:   |
                           |  5 minutes) |
                           +-------------+
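
The 5-minute timeout is what keeps a crashed auto-label job from blocking a document indefinitely. A minimal in-memory sketch, assuming a hypothetical `AnnotationLock` class; a real implementation would presumably persist the timestamp in the database rather than in process memory:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

LOCK_TIMEOUT = timedelta(minutes=5)  # timeout shown in the diagram


@dataclass
class AnnotationLock:
    """Hypothetical per-document lock with expiry."""
    locked_at: Optional[datetime] = None

    def is_locked(self, now: datetime) -> bool:
        # A lock older than the timeout is treated as released.
        return self.locked_at is not None and now - self.locked_at < LOCK_TIMEOUT

    def acquire(self, now: datetime) -> bool:
        """Take the lock when auto-label starts; False if already held."""
        if self.is_locked(now):
            return False
        self.locked_at = now
        return True

    def release(self) -> None:
        """Called when auto-label completes or fails."""
        self.locked_at = None
```

Passing `now` explicitly keeps the expiry logic deterministic and easy to unit test.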

Summary

This comprehensive plan provides:

  1. PRD: 24 user stories across 6 epics with clear acceptance criteria and priorities
  2. CSV Specification: 13 columns with detailed validation rules and field mappings
  3. Database Schema: 4 new tables + modifications to 3 existing tables with full SQLModel definitions
  4. API Specification: 8 new endpoints + 2 modified endpoints with complete request/response schemas
  5. UI Wireframes: 5 detailed text-based wireframes covering all major views
  6. Implementation Phases: 7 phases over 8 weeks with 30+ tasks, dependencies, and risk assessments
  7. State Machines: 5 state diagrams covering document, auto-label, batch, training, and locking workflows

The implementation follows an incremental approach starting with database/backend changes before frontend development, minimizing risk and enabling continuous testing throughout the development cycle.