# Document Annotation Tool - Product Plan v2

## Table of Contents

1. [Product Requirements Document (PRD)](#1-product-requirements-document-prd)
   - Epic 1-6: Original features
   - **Epic 7: Dashboard Enhancement** (NEW)
2. [CSV Format Specification](#2-csv-format-specification)
3. [Database Schema Changes](#3-database-schema-changes)
4. [API Specification](#4-api-specification)
   - 4.1-4.2: Original endpoints
   - **4.3: Dashboard API Endpoints** (NEW)
5. [UI Wireframes (Text-Based)](#5-ui-wireframes-text-based)
   - **5.0: Dashboard Overview** (NEW)
   - 5.1-5.5: Original wireframes
6. [Implementation Phases](#6-implementation-phases)
7. [State Machine Diagrams](#7-state-machine-diagrams)

---

## 1. Product Requirements Document (PRD)

### 1.1 Overview

This enhancement adds batch upload capabilities, document lifecycle management, a manual annotation workflow with auto-label dependency, comprehensive training management, and enhanced document detail views to the Invoice Master Document Annotation Tool.

### 1.2 User Stories

#### Epic 1: Batch Upload (ZIP Support)

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-1.1 | As a user, I want to upload a ZIP file containing multiple PDFs so that I can process many documents at once | - ZIP file is extracted<br>- Each PDF is registered as a separate document<br>- Document IDs are returned for all files<br>- Invalid files are skipped with error message | P0 |
| US-1.2 | As a user, I want to include a CSV file in my ZIP for auto-labeling so that annotations are created automatically | - CSV is parsed and validated<br>- DocumentId column maps to PDF filenames<br>- Field values are stored for auto-labeling<br>- Invalid CSV rows are logged | P0 |
| US-1.3 | As a user, I want to upload a single PDF with auto-label values via API so that I can integrate with my workflow | - PDF is uploaded<br>- Auto-label values provided in JSON body<br>- Auto-labeling runs automatically<br>- Document ID returned | P0 |
| US-1.4 | As a user, I want clear feedback on batch upload progress so that I know which files succeeded or failed | - Upload progress indicator<br>- Per-file status (success/failed)<br>- Error messages for failed files<br>- Summary count displayed | P1 |

#### Epic 2: Document List and Status

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-2.1 | As a user, I want to see a list of all uploaded documents so that I can manage my annotations | - Paginated document list<br>- Shows filename, status, date<br>- Sortable columns<br>- Search/filter capability | P0 |
| US-2.2 | As a user, I want to see auto-label status for each document so that I know processing progress | - Status badge: pending, processing, completed, failed<br>- Progress indicator for processing<br>- Error message for failed | P0 |
| US-2.3 | As a user, I want to see the upload source (API vs UI) so that I can track document origin | - Source column in list<br>- Filter by source<br>- Source shown in detail view | P1 |
| US-2.4 | As a user, I want to see annotation preview for completed documents so that I can quickly review | - Thumbnail with overlaid bounding boxes<br>- Annotation count badge<br>- Click to view full detail | P1 |

#### Epic 3: Manual Annotation with Auto-Label Dependency

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-3.1 | As a user, I want to be blocked from manual annotation if auto-label is pending so that I don't lose work | - Clear message: "Auto-labeling in progress, please wait"<br>- Refresh button to check status<br>- Automatic unlock when complete | P0 |
| US-3.2 | As a user, I want to override auto-generated annotations so that I can correct errors | - Can edit any annotation<br>- Source changes from "auto" to "manual"<br>- Original auto value preserved in history<br>- Override timestamp recorded | P0 |
| US-3.3 | As a user, I want to see which annotations are manual vs auto so that I can review confidence | - Color-coded annotation badges<br>- Manual: solid border<br>- Auto: dashed border with confidence %<br>- Filter by source | P0 |
| US-3.4 | As a user, I want to accept or reject individual auto-annotations so that I can curate training data | - Accept button marks as verified<br>- Reject button removes annotation<br>- Bulk accept/reject actions | P1 |

#### Epic 4: Training Page Features

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-4.1 | As a user, I want to see all documents available for training so that I can select training data | - Filtered list (only labeled documents)<br>- Shows annotation count per document<br>- Checkbox selection<br>- Select all/none options | P0 |
| US-4.2 | As a user, I want to select specific documents for training so that I can control data quality | - Multi-select with checkboxes<br>- Selection count displayed<br>- Clear selection button<br>- Persisted selection state | P0 |
| US-4.3 | As a user, I want to see all trained models so that I can track model history | - Model list with name, date, status<br>- Document count used<br>- mAP/accuracy metrics<br>- Download model link | P0 |
| US-4.4 | As a user, I want to see which documents were used in training so that I can track data lineage | - "Used in training" badge on documents<br>- Click to see model list<br>- Filter documents by training status | P1 |
| US-4.5 | As a user, I want to start a training job with selected documents so that I can create new models | - Start training button<br>- Training config options<br>- Progress monitoring<br>- Email notification on completion | P0 |

#### Epic 5: Document Detail View (Enhanced)

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-5.1 | As a user, I want to see all annotations with their source so that I can review data quality | - Annotation list with source column<br>- Confidence score for auto<br>- Edit/delete buttons<br>- Group by field type | P0 |
| US-5.2 | As a user, I want to see training history for a document so that I can understand model lineage | - List of models using this document<br>- Training date and model name<br>- Link to model detail page | P1 |
| US-5.3 | As a user, I want to edit annotations inline so that I can quickly make corrections | - Click to edit bounding box<br>- Drag to resize<br>- Double-click to edit text value<br>- Save/cancel buttons | P0 |
| US-5.4 | As a user, I want to see auto vs manual annotation comparison so that I can evaluate auto-label quality | - Side-by-side comparison view<br>- Highlight differences<br>- Override history timeline | P2 |

#### Epic 6: API Endpoints

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-6.1 | As a developer, I want to upload ZIP/PDF via API so that I can automate document ingestion | - POST endpoint accepts multipart<br>- Returns document IDs array<br>- Async processing option<br>- Webhook callback support | P0 |
| US-6.2 | As a developer, I want to upload PDF with auto-label values so that I can pre-annotate documents | - JSON body with field values<br>- Auto-label runs synchronously or async<br>- Returns annotation IDs | P0 |
| US-6.3 | As a developer, I want to query document status so that I can poll for completion | - GET endpoint with document ID<br>- Returns full status object<br>- Includes annotation summary | P0 |
| US-6.4 | As a developer, I want API-uploaded documents visible in UI so that I can manage all documents centrally | - Same data model for API/UI uploads<br>- Source field distinguishes origin<br>- Full UI functionality available | P0 |

#### Epic 7: Dashboard Enhancement

| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-7.1 | As a user, I want to see data quality metrics on the dashboard so that I can monitor annotation completeness | - Annotation completeness rate displayed as percentage ring<br>- Complete/incomplete/pending document counts<br>- Click incomplete count to jump to filtered document list | P0 |
| US-7.2 | As a user, I want to see the active model status on the dashboard so that I can monitor model performance | - Current model version and name displayed<br>- mAP/precision/recall metrics shown<br>- Activation date and training document count displayed<br>- Running training task shown if any | P0 |
| US-7.3 | As a user, I want to see recent activity on the dashboard so that I can track system changes | - Last 10 activities displayed with relative timestamps<br>- Activity types: document upload, annotation change, training complete/failed, model activation<br>- Each activity shows icon, description, and time | P1 |
| US-7.4 | As a user, I want the dashboard stats cards to show meaningful data so that I can quickly assess system state | - Total Documents count<br>- Annotation Complete count (documents with core fields)<br>- Incomplete count (labeled but missing core fields)<br>- Pending count (pending + auto_labeling status) | P0 |

**Annotation Completeness Definition:**

A document is considered "annotation complete" when it has:

- `invoice_number` OR `ocr_number` (at least one identifier)
- `bankgiro` OR `plusgiro` (at least one payment account)

Documents with status=labeled but missing these core fields are considered "incomplete".

---

## 2. CSV Format Specification

### 2.1 Required Headers

```csv
customer_number,supplier_name,supplier_organisation_number,supplier_accounts,DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,Bankgiro,Plusgiro
```

### 2.2 Column Definitions

| Column | Type | Required | Maps to Class | Description | Validation Rules |
|--------|------|----------|---------------|-------------|------------------|
| `DocumentId` | string | YES | N/A | PDF filename (without .pdf extension) | Non-empty, alphanumeric + underscore/hyphen |
| `customer_number` | string | NO | customer_number (9) | Customer reference number | Max 50 chars |
| `supplier_name` | string | NO | N/A (metadata only) | Supplier company name | Max 255 chars |
| `supplier_organisation_number` | string | NO | supplier_organisation_number (7) | Swedish org number (XXXXXX-XXXX) | Format: 6 digits, hyphen, 4 digits |
| `supplier_accounts` | string | NO | N/A (metadata) | Pipe-separated account numbers | Max 500 chars |
| `InvoiceNumber` | string | NO | invoice_number (0) | Invoice reference | Max 50 chars |
| `InvoiceDate` | date | NO | invoice_date (1) | Invoice issue date | YYYY-MM-DD (ISO 8601); other accepted formats are listed in 2.5 |
| `InvoiceDueDate` | date | NO | invoice_due_date (2) | Payment due date | YYYY-MM-DD (ISO 8601); other accepted formats are listed in 2.5 |
| `Amount` | decimal | NO | amount (6) | Invoice total amount | Numeric, max 2 decimal places |
| `OCR` | string | NO | ocr_number (3) | Swedish OCR payment reference | Numeric string, max 25 chars |
| `Message` | string | NO | N/A (metadata only) | Free-text payment message | Max 140 chars |
| `Bankgiro` | string | NO | bankgiro (4) | Bankgiro account number | Format: XXX-XXXX or 7-8 digits |
| `Plusgiro` | string | NO | plusgiro (5) | Plusgiro account number | Format: XXXXXX-X or 6-8 digits |

### 2.3 Field to Class Mapping

```python
CSV_TO_CLASS_MAPPING = {
    'InvoiceNumber': 0,                 # invoice_number
    'InvoiceDate': 1,                   # invoice_date
    'InvoiceDueDate': 2,                # invoice_due_date
    'OCR': 3,                           # ocr_number
    'Bankgiro': 4,                      # bankgiro
    'Plusgiro': 5,                      # plusgiro
    'Amount': 6,                        # amount
    'supplier_organisation_number': 7,  # supplier_organisation_number
    # 8: payment_line (derived from OCR/Bankgiro/Amount)
    'customer_number': 9,               # customer_number
}
```

### 2.4 Example CSV

```csv
customer_number,supplier_name,supplier_organisation_number,supplier_accounts,DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,Bankgiro,Plusgiro
C12345,ACME Corp,556677-8899,123-4567|987-6543,INV001,F2024-001,2024-01-15,2024-02-15,1250.00,7350012345678,,123-4567,
C12346,Widget AB,112233-4455,,INV002,F2024-002,2024-01-16,2024-02-16,3450.50,,,987-6543,
```

### 2.5 Validation Rules

1. **DocumentId**: Required, must match a PDF filename in the ZIP
2. **At least one matchable field**: One of InvoiceNumber, OCR, Bankgiro, Plusgiro, Amount, supplier_organisation_number must be non-empty
3. **Date formats**: YYYY-MM-DD, DD/MM/YYYY, DD.MM.YYYY
4. **Amount formats**: 1234.56, 1 234,56, 1234,56 SEK
5. **Swedish org number**: XXXXXX-XXXX pattern

---

## 3. Database Schema Changes

### 3.1 New Tables

#### 3.1.1 BatchUpload Table

```sql
CREATE TABLE batch_uploads (
    batch_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    admin_token VARCHAR(255) NOT NULL REFERENCES admin_tokens(token),
    filename VARCHAR(255) NOT NULL,
    file_size INTEGER NOT NULL,
    upload_source VARCHAR(20) NOT NULL DEFAULT 'ui',  -- 'ui' or 'api'
    status VARCHAR(20) NOT NULL DEFAULT 'processing',
    -- Status: processing, completed, partial, failed
    total_files INTEGER DEFAULT 0,
    processed_files INTEGER DEFAULT 0,
    successful_files INTEGER DEFAULT 0,
    failed_files INTEGER DEFAULT 0,
    error_message TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMP
);

CREATE INDEX idx_batch_uploads_admin_token ON batch_uploads(admin_token);
CREATE INDEX idx_batch_uploads_status ON batch_uploads(status);
```

#### 3.1.2 BatchUploadFile Table

```sql
CREATE TABLE batch_upload_files (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    batch_id UUID NOT NULL REFERENCES batch_uploads(batch_id) ON DELETE CASCADE,
    document_id UUID REFERENCES admin_documents(document_id),
    filename VARCHAR(255) NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- Status: pending, processing, completed, failed, skipped
    error_message TEXT,
    csv_row_data JSONB,  -- Parsed CSV row for this file
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    processed_at TIMESTAMP
);

CREATE INDEX idx_batch_upload_files_batch_id ON batch_upload_files(batch_id);
CREATE INDEX idx_batch_upload_files_document_id ON batch_upload_files(document_id);
```

#### 3.1.3 TrainingDocumentLink Table (Junction Table)

```sql
CREATE TABLE training_document_links (
    link_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    task_id UUID NOT NULL REFERENCES training_tasks(task_id) ON DELETE CASCADE,
    document_id UUID NOT NULL REFERENCES admin_documents(document_id) ON DELETE CASCADE,
    annotation_snapshot JSONB,  -- Snapshot of annotations at training time
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    UNIQUE(task_id, document_id)
);

CREATE INDEX idx_training_doc_links_task_id ON training_document_links(task_id);
CREATE INDEX idx_training_doc_links_document_id ON training_document_links(document_id);
```

#### 3.1.4 AnnotationHistory Table

```sql
CREATE TABLE annotation_history (
    history_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    annotation_id UUID NOT NULL REFERENCES admin_annotations(annotation_id) ON DELETE CASCADE,
    action VARCHAR(20) NOT NULL,  -- 'created', 'updated', 'deleted', 'override'
    previous_value JSONB,         -- Full annotation state before change
    new_value JSONB,              -- Full annotation state after change
    changed_by VARCHAR(255),      -- admin_token
    change_reason TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_annotation_history_annotation_id ON annotation_history(annotation_id);
CREATE INDEX idx_annotation_history_created_at ON annotation_history(created_at);
```

### 3.2 Modified Tables

#### 3.2.1 AdminDocument Modifications

```sql
ALTER TABLE admin_documents ADD COLUMN upload_source VARCHAR(20) DEFAULT 'ui';
-- Values: 'ui', 'api'

ALTER TABLE admin_documents ADD COLUMN batch_id UUID REFERENCES batch_uploads(batch_id);

ALTER TABLE admin_documents ADD COLUMN csv_field_values JSONB;
-- Stores original CSV values for reference

ALTER TABLE admin_documents ADD COLUMN auto_label_queued_at TIMESTAMP;
-- When auto-label was queued (for dependency checking)

ALTER TABLE admin_documents ADD COLUMN annotation_lock_until TIMESTAMP;
-- Lock for manual annotation while auto-label runs

CREATE INDEX idx_admin_documents_upload_source ON admin_documents(upload_source);
CREATE INDEX idx_admin_documents_batch_id ON admin_documents(batch_id);
```

#### 3.2.2 AdminAnnotation Modifications

```sql
ALTER TABLE admin_annotations ADD COLUMN is_verified BOOLEAN DEFAULT FALSE;
-- User-verified annotation

ALTER TABLE admin_annotations ADD COLUMN verified_at TIMESTAMP;
ALTER TABLE admin_annotations ADD COLUMN verified_by VARCHAR(255);

ALTER TABLE admin_annotations ADD COLUMN override_source VARCHAR(20);
-- If this annotation overrides another: 'auto' or 'imported'

ALTER TABLE admin_annotations ADD COLUMN original_annotation_id UUID;
-- Reference to the annotation this overrides

CREATE INDEX idx_admin_annotations_source ON admin_annotations(source);
CREATE INDEX idx_admin_annotations_is_verified ON admin_annotations(is_verified);
```

#### 3.2.3 TrainingTask Modifications

```sql
ALTER TABLE training_tasks ADD COLUMN document_count INTEGER DEFAULT 0;
-- Count of documents used in training

ALTER TABLE training_tasks ADD COLUMN document_ids UUID[];
-- Array of document IDs used (for quick reference)

ALTER TABLE training_tasks ADD COLUMN metrics_mAP FLOAT;
ALTER TABLE training_tasks ADD COLUMN metrics_precision FLOAT;
ALTER TABLE training_tasks ADD COLUMN metrics_recall FLOAT;
-- Extracted metrics for easy querying

CREATE INDEX idx_training_tasks_metrics ON training_tasks(metrics_mAP);
```

### 3.3 SQLModel Definitions

```python
# File: src/data/admin_models.py

from datetime import datetime
from typing import Any
from uuid import UUID, uuid4

from sqlmodel import Field, SQLModel, Column, JSON, ARRAY
from sqlalchemy import String


class BatchUpload(SQLModel, table=True):
    """Batch upload record for ZIP uploads."""

    __tablename__ = "batch_uploads"

    batch_id: UUID = Field(default_factory=uuid4, primary_key=True)
    admin_token: str = Field(foreign_key="admin_tokens.token", max_length=255, index=True)
    filename: str = Field(max_length=255)
    file_size: int
    upload_source: str = Field(default="ui", max_length=20)
    status: str = Field(default="processing", max_length=20, index=True)
    total_files: int = Field(default=0)
    processed_files: int = Field(default=0)
    successful_files: int = Field(default=0)
    failed_files: int = Field(default=0)
    error_message: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    completed_at: datetime | None = Field(default=None)


class BatchUploadFile(SQLModel, table=True):
    """Individual file within a batch upload."""

    __tablename__ = "batch_upload_files"

    file_id: UUID = Field(default_factory=uuid4, primary_key=True)
    batch_id: UUID = Field(foreign_key="batch_uploads.batch_id", index=True)
    document_id: UUID | None = Field(default=None, foreign_key="admin_documents.document_id")
    filename: str = Field(max_length=255)
    status: str = Field(default="pending", max_length=20)
    error_message: str | None = Field(default=None)
    csv_row_data: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    created_at: datetime = Field(default_factory=datetime.utcnow)
    processed_at: datetime | None = Field(default=None)


class TrainingDocumentLink(SQLModel, table=True):
    """Link between training tasks and documents used."""

    __tablename__ = "training_document_links"

    link_id: UUID = Field(default_factory=uuid4, primary_key=True)
    task_id: UUID = Field(foreign_key="training_tasks.task_id", index=True)
    document_id: UUID = Field(foreign_key="admin_documents.document_id", index=True)
    annotation_snapshot: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    created_at: datetime = Field(default_factory=datetime.utcnow)


class AnnotationHistory(SQLModel, table=True):
    """History of annotation changes."""

    __tablename__ = "annotation_history"

    history_id: UUID = Field(default_factory=uuid4, primary_key=True)
    annotation_id: UUID = Field(foreign_key="admin_annotations.annotation_id", index=True)
    action: str = Field(max_length=20)
    previous_value: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    new_value: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    changed_by: str | None = Field(default=None, max_length=255)
    change_reason: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=datetime.utcnow, index=True)
```

---

## 4. API Specification

### 4.1 New Endpoints

#### 4.1.1 Batch Upload (ZIP)

```yaml
POST /api/v1/admin/batch/upload
Content-Type: multipart/form-data

Request:
  file: binary (ZIP file)
  async: boolean (default: true)
  auto_label: boolean (default: true)

Response (202 Accepted):
{
  "batch_id": "uuid",
  "status": "processing",
  "total_files": 25,
  "message": "Batch upload started. Use batch_id to check progress.",
  "status_url": "/api/v1/admin/batch/{batch_id}"
}

Response (200 OK - sync mode):
{
  "batch_id": "uuid",
  "status": "completed",
  "total_files": 25,
  "successful_files": 23,
  "failed_files": 2,
  "documents": [
    {
      "document_id": "uuid",
      "filename": "INV001.pdf",
      "status": "completed",
      "auto_label_status": "completed",
      "annotations_created": 8
    }
  ],
  "errors": [
    {
      "filename": "invalid.pdf",
      "error": "Corrupted PDF file"
    }
  ]
}
```

#### 4.1.2 Batch Status

```yaml
GET /api/v1/admin/batch/{batch_id}

Response:
{
  "batch_id": "uuid",
  "status": "processing",
  "progress": {
    "total": 25,
    "processed": 15,
    "successful": 14,
    "failed": 1,
    "percentage": 60
  },
  "files": [
    {
      "file_id": "uuid",
      "filename": "INV001.pdf",
      "document_id": "uuid",
      "status": "completed"
    }
  ],
  "created_at": "2024-01-15T10:00:00Z",
  "estimated_completion": "2024-01-15T10:05:00Z"
}
```

#### 4.1.3 Upload PDF with Auto-Label Values

```yaml
POST /api/v1/admin/documents/upload-with-labels
Content-Type: multipart/form-data

Request:
  file: binary (PDF file)
  field_values: JSON string
  {
    "InvoiceNumber": "F2024-001",
    "InvoiceDate": "2024-01-15",
    "Amount": "1250.00",
    "OCR": "7350012345678",
    "Bankgiro": "123-4567"
  }
  auto_label: boolean (default: true)
  wait_for_completion: boolean (default: false)

Response (202 Accepted):
{
  "document_id": "uuid",
  "filename": "invoice.pdf",
  "status": "auto_labeling",
  "auto_label_status": "running",
  "message": "Document uploaded. Auto-labeling in progress."
}

Response (200 OK - wait_for_completion=true):
{
  "document_id": "uuid",
  "filename": "invoice.pdf",
  "status": "labeled",
  "auto_label_status": "completed",
  "annotations": [
    {
      "annotation_id": "uuid",
      "class_id": 0,
      "class_name": "invoice_number",
      "text_value": "F2024-001",
      "confidence": 0.95,
      "bbox": { "x": 100, "y": 200, "width": 150, "height": 30 }
    }
  ]
}
```

#### 4.1.4 Query Document Status

```yaml
GET /api/v1/admin/documents/{document_id}/status

Response:
{
  "document_id": "uuid",
  "filename": "invoice.pdf",
  "status": "labeled",
  "auto_label_status": "completed",
  "upload_source": "api",
  "annotation_summary": {
    "total": 8,
    "manual": 2,
    "auto": 6,
    "verified": 3
  },
  "can_annotate": true,
  "annotation_lock_reason": null,
  "training_history": [
    {
      "task_id": "uuid",
      "task_name": "Training Run 2024-01",
      "trained_at": "2024-01-20T15:00:00Z"
    }
  ]
}
```

#### 4.1.5 Training with Document Selection

```yaml
POST /api/v1/admin/training/tasks
Content-Type: application/json

Request:
{
  "name": "Training Run 2024-01",
  "description": "First training run with 500 documents",
  "document_ids": ["uuid1", "uuid2", "uuid3"],
  "config": {
    "model_name": "yolo11n.pt",
    "epochs": 100,
    "batch_size": 16,
    "image_size": 640
  },
  "scheduled_at": "2024-01-20T22:00:00Z"
}

Response:
{
  "task_id": "uuid",
  "name": "Training Run 2024-01",
  "status": "scheduled",
  "document_count": 500,
  "message": "Training task scheduled for 2024-01-20T22:00:00Z"
}
```

#### 4.1.6 Get Documents for Training

```yaml
GET /api/v1/admin/training/documents

Query Parameters:
  - status: labeled (required)
  - has_annotations: true
  - min_annotation_count: 3
  - exclude_used_in_training: boolean
  - limit: 100
  - offset: 0

Response:
{
  "total": 1500,
  "documents": [
    {
      "document_id": "uuid",
      "filename": "INV001.pdf",
      "annotation_count": 8,
      "annotation_sources": {
        "manual": 3,
        "auto": 5
      },
      "used_in_training": ["task_id_1", "task_id_2"],
      "last_modified": "2024-01-15T10:00:00Z"
    }
  ]
}
```

#### 4.1.7 Get Model List

```yaml
GET /api/v1/admin/training/models

Query Parameters:
  - status: completed
  - limit: 20
  - offset: 0

Response:
{
  "total": 15,
  "models": [
    {
      "task_id": "uuid",
      "name": "Training Run 2024-01",
      "status": "completed",
      "document_count": 500,
      "created_at": "2024-01-20T15:00:00Z",
      "completed_at": "2024-01-20T18:30:00Z",
      "metrics": {
        "mAP": 0.935,
        "precision": 0.92,
        "recall": 0.88
      },
      "model_path": "runs/train/invoice_fields_20240120/weights/best.pt",
      "download_url": "/api/v1/admin/training/models/{task_id}/download"
    }
  ]
}
```

#### 4.1.8 Override Annotation

```yaml
PATCH /api/v1/admin/documents/{document_id}/annotations/{annotation_id}/override
Content-Type: application/json

Request:
{
  "bbox": { "x": 110, "y": 205, "width": 145, "height": 28 },
  "text_value": "F2024-001-A",
  "reason": "Corrected OCR error"
}

Response:
{
  "annotation_id": "uuid",
  "source": "manual",
  "override_source": "auto",
  "original_annotation_id": "uuid",
  "message": "Annotation overridden successfully",
  "history_id": "uuid"
}
```

### 4.2 Modified Endpoints

#### 4.2.1 Document List (Enhanced)

```yaml
GET /api/v1/admin/documents

Query Parameters (additions):
  - upload_source: 'ui' | 'api' | null
  - has_annotations: boolean
  - auto_label_status: 'pending' | 'running' | 'completed' | 'failed'
  - used_in_training: boolean
  - batch_id: uuid

Response (additions to DocumentItem):
{
  "documents": [
    {
      // ... existing fields ...
      "upload_source": "api",
      "batch_id": "uuid",
      "can_annotate": true,
      "training_count": 2
    }
  ]
}
```

#### 4.2.2 Document Detail (Enhanced)

```yaml
GET /api/v1/admin/documents/{document_id}

Response (additions):
{
  // ... existing fields ...
  "upload_source": "api",
  "csv_field_values": {
    "InvoiceNumber": "F2024-001",
    "Amount": "1250.00"
  },
  "can_annotate": true,
  "annotation_lock_reason": null,
  "annotations": [
    {
      // ... existing fields ...
      "is_verified": true,
      "verified_at": "2024-01-16T09:00:00Z",
      "override_source": null
    }
  ],
  "training_history": [
    {
      "task_id": "uuid",
      "name": "Training Run 2024-01",
      "trained_at": "2024-01-20T15:00:00Z",
      "model_metrics": { "mAP": 0.935 }
    }
  ]
}
```

### 4.3 Dashboard API Endpoints

#### 4.3.1 Dashboard Statistics

```yaml
GET /api/v1/admin/dashboard/stats

Response:
{
  "total_documents": 38,
  "annotation_complete": 25,
  "annotation_incomplete": 8,
  "pending": 5,
  "completeness_rate": 75.76
}
```

**Completeness Calculation Logic:**

- `annotation_complete`: Documents where status=labeled AND has (invoice_number OR ocr_number) AND has (bankgiro OR plusgiro)
- `annotation_incomplete`: Documents where status=labeled BUT missing core fields
- `pending`: Documents where status IN (pending, auto_labeling)
- `completeness_rate`: annotation_complete / (annotation_complete + annotation_incomplete) * 100

#### 4.3.2 Active Model Info

```yaml
GET /api/v1/admin/dashboard/active-model

Response:
{
  "model": {
    "version_id": "uuid",
    "version": "1.2.0",
    "name": "Invoice Model",
    "metrics_mAP": 0.951,
    "metrics_precision": 0.94,
    "metrics_recall": 0.92,
    "document_count": 500,
    "activated_at": "2024-01-20T15:00:00Z"
  },
  "running_training": {
    "task_id": "uuid",
    "name": "Run-2024-02",
    "status": "running",
    "started_at": "2024-01-25T10:00:00Z",
    "progress": 45
  }
}

Response (no active model):
{
  "model": null,
  "running_training": null
}
```

#### 4.3.3 Recent Activity

```yaml
GET /api/v1/admin/dashboard/activity

Query Parameters:
  - limit: 10 (default)

Response:
{
  "activities": [
    {
      "type": "model_activated",
      "description": "Activated model v1.2.0",
      "timestamp": "2024-01-25T12:00:00Z",
      "metadata": {
        "version_id": "uuid",
        "version": "1.2.0"
      }
    },
    {
      "type": "training_completed",
      "description": "Training complete: Run-2024-01, mAP 95.1%",
      "timestamp": "2024-01-24T18:30:00Z",
      "metadata": {
        "task_id": "uuid",
        "task_name": "Run-2024-01",
        "mAP": 0.951
      }
    },
    {
      "type": "annotation_modified",
      "description": "Modified INV-001.pdf invoice_number",
      "timestamp": "2024-01-24T14:20:00Z",
      "metadata": {
        "document_id": "uuid",
        "filename": "INV-001.pdf",
        "field_name": "invoice_number"
      }
    },
    {
      "type": "document_uploaded",
      "description": "Uploaded INV-005.pdf",
      "timestamp": "2024-01-23T09:15:00Z",
      "metadata": {
        "document_id": "uuid",
        "filename": "INV-005.pdf"
      }
    },
    {
      "type": "training_failed",
      "description": "Training failed: Run-2024-00",
      "timestamp": "2024-01-22T16:45:00Z",
      "metadata": {
        "task_id": "uuid",
        "task_name": "Run-2024-00",
        "error": "GPU memory exceeded"
      }
    }
  ]
}
```

**Activity Types:**

| Type | Description Template | Source |
|------|---------------------|--------|
| `document_uploaded` | "Uploaded {filename}" | `admin_documents.created_at` |
| `annotation_modified` | "Modified {filename} {field_name}" | `annotation_history` |
| `training_completed` | "Training complete: {task_name}, mAP {mAP}%" | `training_tasks` (status=completed) |
| `training_failed` | "Training failed: {task_name}" | `training_tasks` (status=failed) |
| `model_activated` | "Activated model {version}" | `model_versions.activated_at` |

---

## 5. UI Wireframes (Text-Based)

### 5.0 Dashboard Overview

```
+------------------------------------------------------------------+
| DOCUMENT ANNOTATION TOOL                   [User: Admin] [Logout]|
+------------------------------------------------------------------+
| [Dashboard] [Documents] [Training] [Models] [Settings]           |
+------------------------------------------------------------------+
|                                                                  |
|  DASHBOARD                                                       |
|                                                                  |
|  +-------------+ +-------------+ +-------------+ +-------------+ |
|  | Total       | | Complete    | | Incomplete  | | Pending     | |
|  | Documents   | | Annotations | |             | |             | |
|  |     38      | |     25      | |      8      | |      5      | |
|  +-------------+ +-------------+ +-------------+ +-------------+ |
|                                    [View List]                   |
|                                                                  |
|  +---------------------------+ +-------------------------------+ |
|  | DATA QUALITY              | | ACTIVE MODEL                  | |
|  | +-----------+             | |                               | |
|  | |           |             | | v1.2.0 - Invoice Model        | |
|  | |    76%    | Annotation  | | ----------------------------- | |
|  | |           | Complete    | |                               | |
|  | +-----------+             | | mAP      Precision   Recall   | |
|  |                           | | 95.1%    94%         92%      | |
|  | Complete:    25           | |                               | |
|  | Incomplete:   8           | | Activated: 2024-01-20         | |
|  | Pending:      5           | | Documents: 500                | |
|  |                           | |                               | |
|  | [View Incomplete Docs]    | | Training: Run-2024-02 [====]  | |
|  +---------------------------+ +-------------------------------+ |
|                                                                  |
| +--------------------------------------------------------------+ |
| | RECENT ACTIVITY                                              | |
| +--------------------------------------------------------------+ |
| | [rocket] Activated model v1.2.0                   2 hours ago| |
| | [check]  Training complete: Run-2024-01, mAP 95.1%  yesterday| |
| | [edit]   Modified INV-001.pdf invoice_number        yesterday| |
| | [doc]    Uploaded INV-005.pdf                      2 days ago| |
| | [doc]    Uploaded INV-004.pdf                      2 days ago| |
| | [x]      Training failed: Run-2024-00              3 days ago| |
| +--------------------------------------------------------------+ |
|                                                                  |
| +--------------------------------------------------------------+ |
| | SYSTEM STATUS                                                | |
| | Backend API: Online    Database: Connected    GPU: Available | |
| +--------------------------------------------------------------+ |
+------------------------------------------------------------------+
```

**Dashboard Components:**

| Component | Data Source | Update Frequency |
|-----------|-------------|------------------|
| Total Documents | `admin_documents` count | Real-time |
| Complete Annotations | Documents with (invoice_number OR ocr_number) AND (bankgiro OR plusgiro) | Real-time |
| Incomplete | Labeled documents missing core fields | Real-time |
| Pending | Documents with status pending or auto_labeling | Real-time |
| Data Quality Ring | Complete / (Complete + Incomplete) * 100% | Real-time |
| Active Model | `model_versions` where is_active=true | On model activation |
| Recent Activity | Aggregated from multiple tables (see below) | Real-time |

**Recent Activity Sources:**

| Activity Type | Icon | Source Table | Query |
|--------------|------|--------------|-------|
| Document Upload | doc | `admin_documents` | `created_at DESC` |
| Annotation Change | edit | `annotation_history` | `created_at DESC` |
| Training Complete | check | `training_tasks` | `status=completed, completed_at DESC` |
| Training Failed | x | `training_tasks` | `status=failed, completed_at DESC` |
| Model Activated | rocket | `model_versions` | `activated_at DESC` |

### 5.1 Document List View

```
+------------------------------------------------------------------+
| DOCUMENT ANNOTATION TOOL                   [User: Admin] [Logout]|
+------------------------------------------------------------------+
| [Documents] [Training] [Models] [Settings]                       |
+------------------------------------------------------------------+
|                                                                  |
|  DOCUMENTS                                                       |
|  +-----------------+ +-----------------------------------------+ |
|  | UPLOAD          | | FILTERS                                 | |
|  | [Single PDF]    | | Status: [All v]     Source: [All v]     | |
|  | [ZIP Batch]     | | Auto-Label: [All v] Search: [________]  | |
|  +-----------------+ +-----------------------------------------+ |
|                                                                  |
| +--------------------------------------------------------------+ |
| | [] Filename        | Status    | Auto-Label | Source | Date  | |
| +--------------------------------------------------------------+ |
| | [] INV001.pdf      | Labeled   | Completed  | API    | 01/15 | |
| |    [8 annotations] | [Preview] | [95%]      |        |       | |
| +--------------------------------------------------------------+ |
| | [] INV002.pdf      | Pending   | Running    | UI     | 01/16 | |
| |    [0 annotations] | [Locked]  | [====  ]   |        |       | |
| +--------------------------------------------------------------+ |
| | [] INV003.pdf      | Labeled   | Failed     | API    | 01/16 | |
| |    [5 annotations] | [Preview] | [Retry]    |        |       | |
| +--------------------------------------------------------------+ |
| | [] INV004.pdf      | Labeled   | Completed  | UI     | 01/17 | |
| |    [10 annotations]| [Preview] | [98%]      | [Used] |       | |
| +--------------------------------------------------------------+ |
|                                                                  |
| Showing 1-20 of 1,543 documents     [<] [1] [2] [3] ... [78] [>] |
|                                                                  |
| [Delete Selected] [Start Training with Selected]                 |
+------------------------------------------------------------------+
```

### 5.2 Document Detail View

```
+------------------------------------------------------------------+
| < Back to Documents                                   INV001.pdf |
+------------------------------------------------------------------+
|                                                                  |
|  +---------------------------+ +-------------------------------+ |
|  |                           | | DOCUMENT INFO                 | |
|  |                           | | Status: Labeled               | |
|  |   [Page 1 Image with      | | Source: API Upload            | |
|  |    Annotation Overlays]   | | Auto-Label: Completed (95%)   | |
|  |                           | | Pages: 1                      | |
|  |   [Manual: Solid border]  | | Uploaded: 2024-01-15          | |
|  |   [Auto: Dashed border]   | |                               | |
|  |                           | | TRAINING HISTORY              | |
|  |                           | | - Run 2024-01 (mAP: 93.5%)    | |
|  |                           | | - Run 2024-02 (mAP: 95.1%)    | |
|  |                           | |                               | |
|  +---------------------------+ +-------------------------------+ |
|                                                                  |
| ANNOTATIONS                           [Add Annotation] [Run OCR] |
| +--------------------------------------------------------------+ |
| | Field            | Value           | Source | Conf  | Actions |
| | +--------------------------------------------------------------+ | | | invoice_number | F2024-001 | Manual | - | [E] [D] | | | +--------------------------------------------------------------+ | | | invoice_date | 2024-01-15 | Auto | 95% | [V] [E][D]| | | +--------------------------------------------------------------+ | | | amount | 1,250.00 | Auto | 98% | [V] [E][D]| | | +--------------------------------------------------------------+ | | | ocr_number | 7350012345 | Auto | 87% | [V] [E][D]| | | +--------------------------------------------------------------+ | | | bankgiro | 123-4567 | Manual | - | [E] [D] | | | +--------------------------------------------------------------+ | | | | [V] = Verify [E] = Edit [D] = Delete | | | | CSV FIELD VALUES (Reference) | | +--------------------------------------------------------------+ | | | InvoiceNumber: F2024-001 | InvoiceDate: 2024-01-15 | | | | Amount: 1250.00 | OCR: 7350012345678 | | | | Bankgiro: 123-4567 | | | | +--------------------------------------------------------------+ | +------------------------------------------------------------------+ ``` ### 5.3 Training Page ``` +------------------------------------------------------------------+ | DOCUMENT ANNOTATION TOOL [User: Admin] [Logout]| +------------------------------------------------------------------+ | [Documents] [Training] [Models] [Settings] | +------------------------------------------------------------------+ | | | TRAINING | | | | DOCUMENT SELECTION Selected: 500 docs | | +--------------------------------------------------------------+ | | | [] Filename | Annotations | Source | Last Modified | | | +--------------------------------------------------------------+ | | | [x] INV001.pdf | 8 (M:3 A:5) | API | 2024-01-15 | | | +--------------------------------------------------------------+ | | | [x] INV002.pdf | 10 (M:2 A:8)| UI | 2024-01-16 | | | +--------------------------------------------------------------+ | | | [ ] INV003.pdf | 5 (M:5 A:0) | UI | 
2024-01-16 | | | +--------------------------------------------------------------+ | | | [x] INV004.pdf | 12 (M:4 A:8)| API | 2024-01-17 | | | +--------------------------------------------------------------+ | | | | [Select All] [Select None] [Select Not Used in Training] | | | | Showing labeled documents only [<] [1] [2] [3] ... [50] [>] | | | | TRAINING CONFIGURATION | | +--------------------------------------------------------------+ | | | Name: [Training Run 2024-01____________] | | | | Description: [First training with 500 documents_________] | | | | | | | | Base Model: [yolo11n.pt v] Epochs: [100] Batch: [16] | | | | Image Size: [640] Device: [GPU 0 v] | | | | | | | | [ ] Schedule for later: [2024-01-20] [22:00] | | | +--------------------------------------------------------------+ | | | | [Start Training] | +------------------------------------------------------------------+ ``` ### 5.4 Model History View ``` +------------------------------------------------------------------+ | DOCUMENT ANNOTATION TOOL [User: Admin] [Logout]| +------------------------------------------------------------------+ | [Documents] [Training] [Models] [Settings] | +------------------------------------------------------------------+ | | | TRAINED MODELS | | | | +--------------------------------------------------------------+ | | | Name | Status | Docs | mAP | Date | | | +--------------------------------------------------------------+ | | | Training Run 2024-03 | Running | 750 | - | 01/25 | | | | | [==== ] | | | | | | | | [View Logs] [Cancel] | | | +--------------------------------------------------------------+ | | | Training Run 2024-02 | Completed | 600 | 95.1% | 01/20 | | | | | P: 94% R: 92% | | | | | [View] [Download] [Use as Base] | | | +--------------------------------------------------------------+ | | | Training Run 2024-01 | Completed | 500 | 93.5% | 01/15 | | | | | P: 92% R: 88% | | | | | [View] [Download] [Use as Base] | | | 
+--------------------------------------------------------------+ | | | Initial Training | Completed | 200 | 85.2% | 01/10 | | | | | P: 84% R: 80% | | | | | [View] [Download] [Use as Base] | | | +--------------------------------------------------------------+ | | | | MODEL DETAIL: Training Run 2024-02 | | +--------------------------------------------------------------+ | | | Created: 2024-01-20 15:00 | Completed: 2024-01-20 18:30 | | | | Duration: 3h 30m | Documents: 600 | | | | | | | | Metrics: | | | | - mAP@0.5: 95.1% | | | | - Precision: 94% | | | | - Recall: 92% | | | | | | | | Configuration: | | | | - Base: yolo11n.pt Epochs: 100 Batch: 16 Size: 640 | | | | | | | | Documents Used: [View 600 documents] | | | +--------------------------------------------------------------+ | +------------------------------------------------------------------+ ``` ### 5.5 Batch Upload Modal ``` +------------------------------------------------------------------+ | BATCH UPLOAD [X] | +------------------------------------------------------------------+ | | | Upload a ZIP file containing: | | - Multiple PDF files | | - (Optional) CSV file for auto-labeling | | | | +--------------------------------------------------------------+ | | | | | | | [Drag and drop ZIP file here] | | | | or | | | | [Browse Files] | | | | | | | +--------------------------------------------------------------+ | | | | [x] Auto-label documents (requires CSV) | | [ ] Process asynchronously | | | | CSV FORMAT REQUIREMENTS: | | Required columns: DocumentId | | Optional: InvoiceNumber, InvoiceDate, Amount, OCR, Bankgiro... | | [View full CSV specification] | | | | [Cancel] [Upload] | +------------------------------------------------------------------+ +------------------------------------------------------------------+ | UPLOAD PROGRESS [X] | +------------------------------------------------------------------+ | | | Processing batch upload... 
| | | | [======================================== ] 80% | | | | Files: 20 / 25 | | Successful: 18 | | Failed: 2 | | | | +--------------------------------------------------------------+ | | | [OK] INV001.pdf - Completed (8 annotations) | | | | [OK] INV002.pdf - Completed (10 annotations) | | | | [!!] INV003.pdf - Failed: Corrupted PDF | | | | [OK] INV004.pdf - Completed (6 annotations) | | | | [...] Processing INV005.pdf... | | | +--------------------------------------------------------------+ | | | | [Cancel] [Close] | +------------------------------------------------------------------+ ```

---

## 6. Implementation Phases

### Phase 1: Database and Core Models (Week 1)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 1.1 | Create database migration script | `src/data/migrations/` | Low |
| 1.2 | Add new SQLModel classes | `src/data/admin_models.py` | Low |
| 1.3 | Update AdminDB with new methods | `src/data/admin_db.py` | Medium |
| 1.4 | Add unit tests for new models | `tests/data/test_admin_models.py` | Low |

**Dependencies**: None

**Risk Assessment**: Low - mostly additive changes to existing structure

### Phase 2: Batch Upload Backend (Week 2)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 2.1 | Create ZIP extraction service | `src/web/batch_upload_service.py` | Medium |
| 2.2 | Add CSV parsing with new format | `src/data/csv_loader.py` | Low |
| 2.3 | Create batch upload routes | `src/web/admin_batch_routes.py` | Medium |
| 2.4 | Add async processing queue | `src/web/batch_queue.py` | High |
| 2.5 | Integration tests | `tests/web/test_batch_upload.py` | Medium |

**Dependencies**: Phase 1

**Risk Assessment**: Medium - ZIP handling and async processing add complexity

### Phase 3: Enhanced Document Management (Week 3)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 3.1 | Add upload source tracking | `src/data/admin_models.py` | Low |
| 3.2 | Update document list endpoint | `src/web/admin_routes.py` | Low |
| 3.3 | Add annotation lock mechanism | `src/web/admin_annotation_routes.py` | Medium |
| 3.4 | Add document status endpoint | `src/web/admin_routes.py` | Low |
| 3.5 | Update auto-label service | `src/web/admin_autolabel.py` | Medium |

**Dependencies**: Phase 1, Phase 2

**Risk Assessment**: Medium - locking mechanism needs careful implementation

### Phase 4: Manual Annotation Enhancement (Week 4)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 4.1 | Add override mechanism | `src/web/admin_annotation_routes.py` | Medium |
| 4.2 | Add annotation history | `src/data/admin_db.py` | Low |
| 4.3 | Add verification endpoint | `src/web/admin_annotation_routes.py` | Low |
| 4.4 | Update schemas with new fields | `src/web/admin_schemas.py` | Low |

**Dependencies**: Phase 3

**Risk Assessment**: Low - extending existing annotation system

### Phase 5: Training Integration (Week 5)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 5.1 | Add document selection for training | `src/web/admin_training_routes.py` | Medium |
| 5.2 | Add training document link table | `src/data/admin_db.py` | Low |
| 5.3 | Add model list endpoint | `src/web/admin_training_routes.py` | Low |
| 5.4 | Update export with selection | `src/web/admin_training_routes.py` | Medium |
| 5.5 | Add metrics extraction | `src/cli/train.py` | Medium |

**Dependencies**: Phase 1, Phase 4

**Risk Assessment**: Medium - integration with training pipeline

### Phase 6: Frontend Implementation (Weeks 6-7)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 6.1 | Create React component structure | `frontend/` | High |
| 6.2 | Implement document list view | `frontend/src/components/` | Medium |
| 6.3 | Implement document detail view | `frontend/src/components/` | High |
| 6.4 | Implement training page | `frontend/src/components/` | Medium |
| 6.5 | Implement batch upload modal | `frontend/src/components/` | Medium |
| 6.6 | Add annotation editor | `frontend/src/components/` | High |

**Dependencies**: Phase 2-5

**Risk Assessment**: High - frontend development is a new component

### Phase 7: Testing and Documentation (Week 8)

| Step | Task | Files | Risk |
|------|------|-------|------|
| 7.1 | Integration tests | `tests/integration/` | Medium |
| 7.2 | E2E tests | `tests/e2e/` | High |
| 7.3 | API documentation | `docs/api/` | Low |
| 7.4 | User guide | `docs/user-guide/` | Low |
| 7.5 | Performance testing | `tests/performance/` | Medium |

**Dependencies**: All phases

**Risk Assessment**: Medium

### Risk Mitigation Strategies

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| ZIP bomb attack | High | Low | Limit max file count, max total size, scan before extraction |
| Async queue failures | Medium | Medium | Implement retry logic, dead letter queue, manual retry endpoint |
| Annotation lock deadlock | Medium | Low | Timeout-based locks, admin override capability |
| Large batch performance | Medium | High | Chunked processing, progress tracking, background workers |
| Database migration issues | High | Low | Backward compatible changes, rollback scripts |
| Frontend complexity | Medium | Medium | Use established UI framework, incremental delivery |

---

## 7. 
State Machine Diagrams ### 7.1 Document Lifecycle States ``` +-------------+ | DELETED | +------^------+ | | delete | +----------+ upload +----------+ | | | --------------> | |--+ | (none) | | PENDING | | | | | +----------+ +----+-----+ | +----------------+-----------------+ | | | trigger auto-label | create manual annotation v | +-------------+ | | | | | AUTO_LABEL- | | | ING | | | | | +------+------+ | | | +---------+---------+ | | | | | complete | fail | v v | +-------------+ +-------------+ | | | | | | | LABELED |<----+ PENDING +<--------------+ | | retry| (failed) | +------+------+ +-------------+ | | export v +-------------+ | | | EXPORTED | | | +-------------+ ``` ### 7.2 Auto-Label Workflow States ``` +-------------+ | MANUAL | | OVERRIDE | +------^------+ | | user edit | +----------+ queue +----------+ | +-----------+ | | --------------> | | | | | | (none) | | QUEUED |--+--->| COMPLETED | | | | | | | +----------+ +----+-----+ +-----^-----+ | | | start | v | +-------------+ | | | | | RUNNING +-----------+ | | success +------+------+ | | error v +-------------+ | | | FAILED | | | +------+------+ | | retry v +-------------+ | | | QUEUED | | | +-------------+ ``` ### 7.3 Batch Upload States ``` +----------+ upload +-------------+ | | --------------> | | | (none) | | PROCESSING | | | | | +----------+ +------+------+ | +---------------+---------------+ | | | | all success | some fail | all fail v v v +-------------+ +-------------+ +-------------+ | | | | | | | COMPLETED | | PARTIAL | | FAILED | | | | | | | +-------------+ +-------------+ +-------------+ ``` ### 7.4 Training Task States ``` +----------+ create +-------------+ | | --------------> | | | (none) | | PENDING | | | | | +----------+ +------+------+ | +-------------+-------------+ | | | immediate | scheduled v v +-------------+ +-------------+ | | | | | RUNNING |<------------+ SCHEDULED | | | trigger | | +------+------+ +------+------+ | | +---------+---------+ | cancel | | v | success | error 
+-------------+ v v | | +-------------+ +-------------+ | CANCELLED | | | | | | | | COMPLETED | | FAILED | +-------------+ | | | | +-------------+ +------+------+ | | retry v +-------------+ | | | PENDING | | | +-------------+ ```

### 7.5 Annotation Lock States

``` +-------------+ | LOCKED | | (auto-label | | running) | +------^------+ | | auto-label starts | +----------+ upload +----------+ | | | --------------> | |--+ | (none) | | UNLOCKED |<---------+ | | | | | +----------+ +----+-----+ | | | | auto-label | auto-label | starts | completes/fails | | v | +-------------+ | | | | | LOCKED +---------+ | (timeout: | | 5 minutes) | +-------------+ ```

---

## Summary

This plan provides:

1. **PRD**: 24 user stories across the 6 original epics, plus Epic 7 (Dashboard Enhancement), with acceptance criteria and priorities
2. **CSV Specification**: 13 columns with detailed validation rules and field mappings
3. **Database Schema**: 4 new tables + modifications to 3 existing tables with full SQLModel definitions
4. **API Specification**: 8 new endpoints + 2 modified endpoints with complete request/response schemas, plus the dashboard endpoints in Section 4.3
5. **UI Wireframes**: 6 text-based wireframes (5.0 through 5.5) covering the dashboard and all major views
6. **Implementation Phases**: 7 phases over 8 weeks with 34 tasks, dependencies, and risk assessments
7. **State Machines**: 5 state diagrams covering document, auto-label, batch, training, and locking workflows

The implementation is incremental: database and backend changes land before frontend development, which limits risk and allows testing throughout the development cycle.
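As an illustration of how the state diagrams could translate into code, the training task states in Section 7.4 can be written as a lookup table of allowed transitions. This is a minimal sketch, not the planned implementation: the names `TaskState`, `TRANSITIONS`, `next_state`, and `replay` are hypothetical, and the lowercase status strings are an assumption modelled on the `status=completed` and `status=failed` filters used for `training_tasks`. Only the transitions drawn in the diagram are included, so `cancel` is allowed from SCHEDULED only.

```python
from enum import Enum


class TaskState(str, Enum):
    """Training task states from the 7.4 diagram (string values are assumed)."""
    PENDING = "pending"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# (current state, event) -> next state, transcribed from the 7.4 diagram.
TRANSITIONS = {
    (TaskState.PENDING, "immediate"): TaskState.RUNNING,
    (TaskState.PENDING, "scheduled"): TaskState.SCHEDULED,
    (TaskState.SCHEDULED, "trigger"): TaskState.RUNNING,
    (TaskState.SCHEDULED, "cancel"): TaskState.CANCELLED,
    (TaskState.RUNNING, "success"): TaskState.COMPLETED,
    (TaskState.RUNNING, "error"): TaskState.FAILED,
    (TaskState.FAILED, "retry"): TaskState.PENDING,
}


class InvalidTransition(ValueError):
    """Raised when an event is not allowed in the current state."""


def next_state(current: TaskState, event: str) -> TaskState:
    """Return the state reached by applying `event`, or raise InvalidTransition."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(
            f"{event!r} is not allowed from {current.value}"
        ) from None


def replay(events, start=TaskState.PENDING):
    """Apply a sequence of events in order and return the final state."""
    state = start
    for event in events:
        state = next_state(state, event)
    return state
```

Keeping the transitions in one table means an endpoint handler can reject an illegal request (for example, retrying a completed task) before touching the database, and the same pattern would fit the document, auto-label, batch, and lock diagrams.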