# Document Annotation Tool - Product Plan v2
## Table of Contents
1. [Product Requirements Document (PRD)](#1-product-requirements-document-prd)
2. [CSV Format Specification](#2-csv-format-specification)
3. [Database Schema Changes](#3-database-schema-changes)
4. [API Specification](#4-api-specification)
5. [UI Wireframes (Text-Based)](#5-ui-wireframes-text-based)
6. [Implementation Phases](#6-implementation-phases)
7. [State Machine Diagrams](#7-state-machine-diagrams)
---
## 1. Product Requirements Document (PRD)
### 1.1 Overview
This enhancement adds five capabilities to the Invoice Master Document Annotation Tool: batch upload, document lifecycle management, a manual annotation workflow that depends on auto-label completion, training management, and an enhanced document detail view.
### 1.2 User Stories
#### Epic 1: Batch Upload (ZIP Support)
| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-1.1 | As a user, I want to upload a ZIP file containing multiple PDFs so that I can process many documents at once | - ZIP file is extracted<br>- Each PDF is registered as a separate document<br>- Document IDs are returned for all files<br>- Invalid files are skipped with error message | P0 |
| US-1.2 | As a user, I want to include a CSV file in my ZIP for auto-labeling so that annotations are created automatically | - CSV is parsed and validated<br>- DocumentId column maps to PDF filenames<br>- Field values are stored for auto-labeling<br>- Invalid CSV rows are logged | P0 |
| US-1.3 | As a user, I want to upload a single PDF with auto-label values via API so that I can integrate with my workflow | - PDF is uploaded<br>- Auto-label values provided in JSON body<br>- Auto-labeling runs automatically<br>- Document ID returned | P0 |
| US-1.4 | As a user, I want clear feedback on batch upload progress so that I know which files succeeded or failed | - Upload progress indicator<br>- Per-file status (success/failed)<br>- Error messages for failed files<br>- Summary count displayed | P1 |
#### Epic 2: Document List and Status
| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-2.1 | As a user, I want to see a list of all uploaded documents so that I can manage my annotations | - Paginated document list<br>- Shows filename, status, date<br>- Sortable columns<br>- Search/filter capability | P0 |
| US-2.2 | As a user, I want to see auto-label status for each document so that I know processing progress | - Status badge: pending, processing, completed, failed<br>- Progress indicator for processing<br>- Error message for failed | P0 |
| US-2.3 | As a user, I want to see the upload source (API vs UI) so that I can track document origin | - Source column in list<br>- Filter by source<br>- Source shown in detail view | P1 |
| US-2.4 | As a user, I want to see annotation preview for completed documents so that I can quickly review | - Thumbnail with overlaid bounding boxes<br>- Annotation count badge<br>- Click to view full detail | P1 |
#### Epic 3: Manual Annotation with Auto-Label Dependency
| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-3.1 | As a user, I want to be blocked from manual annotation if auto-label is pending so that I don't lose work | - Clear message: "Auto-labeling in progress, please wait"<br>- Refresh button to check status<br>- Automatic unlock when complete | P0 |
| US-3.2 | As a user, I want to override auto-generated annotations so that I can correct errors | - Can edit any annotation<br>- Source changes from "auto" to "manual"<br>- Original auto value preserved in history<br>- Override timestamp recorded | P0 |
| US-3.3 | As a user, I want to see which annotations are manual vs auto so that I can review confidence | - Color-coded annotation badges<br>- Manual: solid border<br>- Auto: dashed border with confidence %<br>- Filter by source | P0 |
| US-3.4 | As a user, I want to accept or reject individual auto-annotations so that I can curate training data | - Accept button marks as verified<br>- Reject button removes annotation<br>- Bulk accept/reject actions | P1 |
#### Epic 4: Training Page Features
| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-4.1 | As a user, I want to see all documents available for training so that I can select training data | - Filtered list (only labeled documents)<br>- Shows annotation count per document<br>- Checkbox selection<br>- Select all/none options | P0 |
| US-4.2 | As a user, I want to select specific documents for training so that I can control data quality | - Multi-select with checkboxes<br>- Selection count displayed<br>- Clear selection button<br>- Persisted selection state | P0 |
| US-4.3 | As a user, I want to see all trained models so that I can track model history | - Model list with name, date, status<br>- Document count used<br>- mAP/accuracy metrics<br>- Download model link | P0 |
| US-4.4 | As a user, I want to see which documents were used in training so that I can track data lineage | - "Used in training" badge on documents<br>- Click to see model list<br>- Filter documents by training status | P1 |
| US-4.5 | As a user, I want to start a training job with selected documents so that I can create new models | - Start training button<br>- Training config options<br>- Progress monitoring<br>- Email notification on completion | P0 |
#### Epic 5: Document Detail View (Enhanced)
| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-5.1 | As a user, I want to see all annotations with their source so that I can review data quality | - Annotation list with source column<br>- Confidence score for auto<br>- Edit/delete buttons<br>- Group by field type | P0 |
| US-5.2 | As a user, I want to see training history for a document so that I can understand model lineage | - List of models using this document<br>- Training date and model name<br>- Link to model detail page | P1 |
| US-5.3 | As a user, I want to edit annotations inline so that I can quickly make corrections | - Click to edit bounding box<br>- Drag to resize<br>- Double-click to edit text value<br>- Save/cancel buttons | P0 |
| US-5.4 | As a user, I want to see auto vs manual annotation comparison so that I can evaluate auto-label quality | - Side-by-side comparison view<br>- Highlight differences<br>- Override history timeline | P2 |
#### Epic 6: API Endpoints
| ID | User Story | Acceptance Criteria | Priority |
|----|------------|---------------------|----------|
| US-6.1 | As a developer, I want to upload ZIP/PDF via API so that I can automate document ingestion | - POST endpoint accepts multipart<br>- Returns document IDs array<br>- Async processing option<br>- Webhook callback support | P0 |
| US-6.2 | As a developer, I want to upload PDF with auto-label values so that I can pre-annotate documents | - JSON body with field values<br>- Auto-label runs synchronously or async<br>- Returns annotation IDs | P0 |
| US-6.3 | As a developer, I want to query document status so that I can poll for completion | - GET endpoint with document ID<br>- Returns full status object<br>- Includes annotation summary | P0 |
| US-6.4 | As a developer, I want API-uploaded documents visible in UI so that I can manage all documents centrally | - Same data model for API/UI uploads<br>- Source field distinguishes origin<br>- Full UI functionality available | P0 |
---
## 2. CSV Format Specification
### 2.1 Required Headers
```csv
customer_number,supplier_name,supplier_organisation_number,supplier_accounts,DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,Bankgiro,Plusgiro
```
### 2.2 Column Definitions
| Column | Type | Required | Maps to Class | Description | Validation Rules |
|--------|------|----------|---------------|-------------|------------------|
| `DocumentId` | string | YES | N/A | PDF filename (without .pdf extension) | Non-empty, alphanumeric + underscore/hyphen |
| `customer_number` | string | NO | customer_number (9) | Customer reference number | Max 50 chars |
| `supplier_name` | string | NO | N/A (metadata only) | Supplier company name | Max 255 chars |
| `supplier_organisation_number` | string | NO | supplier_organisation_number (7) | Swedish org number (XXXXXX-XXXX) | Format: 6 digits, hyphen, 4 digits |
| `supplier_accounts` | string | NO | N/A (metadata) | Pipe-separated account numbers | Max 500 chars |
| `InvoiceNumber` | string | NO | invoice_number (0) | Invoice reference | Max 50 chars |
| `InvoiceDate` | date | NO | invoice_date (1) | Invoice issue date | YYYY-MM-DD (ISO 8601), DD/MM/YYYY, or DD.MM.YYYY |
| `InvoiceDueDate` | date | NO | invoice_due_date (2) | Payment due date | YYYY-MM-DD (ISO 8601), DD/MM/YYYY, or DD.MM.YYYY |
| `Amount` | decimal | NO | amount (6) | Invoice total amount | Numeric, max 2 decimal places; accepted input formats are listed in 2.5 |
| `OCR` | string | NO | ocr_number (3) | Swedish OCR payment reference | Numeric string, max 25 chars |
| `Message` | string | NO | N/A (metadata only) | Free-text payment message | Max 140 chars |
| `Bankgiro` | string | NO | bankgiro (4) | Bankgiro account number | Format: XXX-XXXX or 7-8 digits |
| `Plusgiro` | string | NO | plusgiro (5) | Plusgiro account number | Format: XXXXXX-X or 6-8 digits |
### 2.3 Field to Class Mapping
```python
CSV_TO_CLASS_MAPPING = {
    'InvoiceNumber': 0,  # invoice_number
    'InvoiceDate': 1,  # invoice_date
    'InvoiceDueDate': 2,  # invoice_due_date
    'OCR': 3,  # ocr_number
    'Bankgiro': 4,  # bankgiro
    'Plusgiro': 5,  # plusgiro
    'Amount': 6,  # amount
    'supplier_organisation_number': 7,  # supplier_organisation_number
    # 8: payment_line (derived from OCR/Bankgiro/Amount)
    'customer_number': 9,  # customer_number
}
```
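
To illustrate how this mapping might be applied, the following is a minimal sketch that turns CSV text into per-document `{class_id: value}` dictionaries. The helper names `row_to_class_values` and `load_label_rows` are hypothetical and not part of the existing codebase; empty cells are assumed to mean "no value for this field".

```python
import csv
import io

# Copied from the mapping above; class 8 (payment_line) is derived, not read from CSV.
CSV_TO_CLASS_MAPPING = {
    "InvoiceNumber": 0,
    "InvoiceDate": 1,
    "InvoiceDueDate": 2,
    "OCR": 3,
    "Bankgiro": 4,
    "Plusgiro": 5,
    "Amount": 6,
    "supplier_organisation_number": 7,
    "customer_number": 9,
}


def row_to_class_values(row: dict[str, str]) -> dict[int, str]:
    """Return {class_id: value} for the non-empty, mappable columns of one row."""
    values: dict[int, str] = {}
    for column, class_id in CSV_TO_CLASS_MAPPING.items():
        value = (row.get(column) or "").strip()
        if value:
            values[class_id] = value
    return values


def load_label_rows(csv_text: str) -> dict[str, dict[int, str]]:
    """Parse CSV text into {DocumentId: {class_id: value}}, skipping rows without an ID."""
    labels: dict[str, dict[int, str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        document_id = (row.get("DocumentId") or "").strip()
        if document_id:
            labels[document_id] = row_to_class_values(row)
    return labels
```

Metadata-only columns (`supplier_name`, `supplier_accounts`, `Message`) are ignored here because they have no class ID.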
### 2.4 Example CSV
```csv
customer_number,supplier_name,supplier_organisation_number,supplier_accounts,DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,Bankgiro,Plusgiro
C12345,ACME Corp,556677-8899,123-4567|987-6543,INV001,F2024-001,2024-01-15,2024-02-15,1250.00,7350012345678,,123-4567,
C12346,Widget AB,112233-4455,,INV002,F2024-002,2024-01-16,2024-02-16,3450.50,,,987-6543,
```
### 2.5 Validation Rules
1. **DocumentId**: Required, must match the name of a PDF in the ZIP, without the `.pdf` extension
2. **At least one matchable field**: One of InvoiceNumber, OCR, Bankgiro, Plusgiro, Amount, supplier_organisation_number must be non-empty
3. **Date formats**: YYYY-MM-DD, DD/MM/YYYY, DD.MM.YYYY
4. **Amount formats**: 1234.56, 1 234,56, 1234,56 SEK
5. **Swedish org number**: XXXXXX-XXXX pattern
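
The rules above could be checked with helpers along these lines. This is a sketch under stated assumptions, not the project's validator: the function names are hypothetical, amounts are assumed to use either `.` or `,` as the decimal separator (never both), and `SEK` is the only currency suffix handled.

```python
import re
from datetime import date, datetime
from decimal import Decimal

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")
ORG_NUMBER_RE = re.compile(r"\d{6}-\d{4}")
MATCHABLE_FIELDS = (
    "InvoiceNumber", "OCR", "Bankgiro", "Plusgiro",
    "Amount", "supplier_organisation_number",
)


def parse_date(text: str) -> date:
    """Try each accepted date format in turn (rule 3)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unsupported date format: {text!r}")


def parse_amount(text: str) -> Decimal:
    """Normalize '1234.56', '1 234,56' and '1234,56 SEK' to a Decimal (rule 4)."""
    cleaned = text.upper().replace("SEK", "")
    cleaned = re.sub(r"\s", "", cleaned)  # drops regular and non-breaking spaces
    cleaned = cleaned.replace(",", ".")
    if not re.fullmatch(r"\d+(\.\d{1,2})?", cleaned):
        raise ValueError(f"Unsupported amount format: {text!r}")
    return Decimal(cleaned)


def is_valid_org_number(text: str) -> bool:
    """Check the XXXXXX-XXXX pattern (rule 5); no checksum validation."""
    return ORG_NUMBER_RE.fullmatch(text.strip()) is not None


def has_matchable_field(row: dict[str, str]) -> bool:
    """Rule 2: at least one matchable column must be non-empty."""
    return any((row.get(name) or "").strip() for name in MATCHABLE_FIELDS)
```

An input such as `1,234.56` is rejected by this sketch, since the format list above does not include a comma thousands separator.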
---
## 3. Database Schema Changes
### 3.1 New Tables
#### 3.1.1 BatchUpload Table
```sql
CREATE TABLE batch_uploads (
batch_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
admin_token VARCHAR(255) NOT NULL REFERENCES admin_tokens(token),
filename VARCHAR(255) NOT NULL,
file_size INTEGER NOT NULL,
upload_source VARCHAR(20) NOT NULL DEFAULT 'ui', -- 'ui' or 'api'
status VARCHAR(20) NOT NULL DEFAULT 'processing',
-- Status: processing, completed, partial, failed
total_files INTEGER DEFAULT 0,
processed_files INTEGER DEFAULT 0,
successful_files INTEGER DEFAULT 0,
failed_files INTEGER DEFAULT 0,
error_message TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
completed_at TIMESTAMP
);
CREATE INDEX idx_batch_uploads_admin_token ON batch_uploads(admin_token);
CREATE INDEX idx_batch_uploads_status ON batch_uploads(status);
```
#### 3.1.2 BatchUploadFile Table
```sql
CREATE TABLE batch_upload_files (
file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
batch_id UUID NOT NULL REFERENCES batch_uploads(batch_id) ON DELETE CASCADE,
document_id UUID REFERENCES admin_documents(document_id),
filename VARCHAR(255) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
-- Status: pending, processing, completed, failed, skipped
error_message TEXT,
csv_row_data JSONB, -- Parsed CSV row for this file
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
processed_at TIMESTAMP
);
CREATE INDEX idx_batch_upload_files_batch_id ON batch_upload_files(batch_id);
CREATE INDEX idx_batch_upload_files_document_id ON batch_upload_files(document_id);
```
#### 3.1.3 TrainingDocumentLink Table (Junction Table)
```sql
CREATE TABLE training_document_links (
link_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
task_id UUID NOT NULL REFERENCES training_tasks(task_id) ON DELETE CASCADE,
document_id UUID NOT NULL REFERENCES admin_documents(document_id) ON DELETE CASCADE,
annotation_snapshot JSONB, -- Snapshot of annotations at training time
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(task_id, document_id)
);
CREATE INDEX idx_training_doc_links_task_id ON training_document_links(task_id);
CREATE INDEX idx_training_doc_links_document_id ON training_document_links(document_id);
```
#### 3.1.4 AnnotationHistory Table
```sql
CREATE TABLE annotation_history (
history_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
annotation_id UUID NOT NULL REFERENCES admin_annotations(annotation_id) ON DELETE CASCADE,
action VARCHAR(20) NOT NULL, -- 'created', 'updated', 'deleted', 'override'
previous_value JSONB, -- Full annotation state before change
new_value JSONB, -- Full annotation state after change
changed_by VARCHAR(255), -- admin_token
change_reason TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_annotation_history_annotation_id ON annotation_history(annotation_id);
CREATE INDEX idx_annotation_history_created_at ON annotation_history(created_at);
```
### 3.2 Modified Tables
#### 3.2.1 AdminDocument Modifications
```sql
ALTER TABLE admin_documents ADD COLUMN upload_source VARCHAR(20) DEFAULT 'ui';
-- Values: 'ui', 'api'
ALTER TABLE admin_documents ADD COLUMN batch_id UUID REFERENCES batch_uploads(batch_id);
ALTER TABLE admin_documents ADD COLUMN csv_field_values JSONB;
-- Stores original CSV values for reference
ALTER TABLE admin_documents ADD COLUMN auto_label_queued_at TIMESTAMP;
-- When auto-label was queued (for dependency checking)
ALTER TABLE admin_documents ADD COLUMN annotation_lock_until TIMESTAMP;
-- Lock for manual annotation while auto-label runs
CREATE INDEX idx_admin_documents_upload_source ON admin_documents(upload_source);
CREATE INDEX idx_admin_documents_batch_id ON admin_documents(batch_id);
```
#### 3.2.2 AdminAnnotation Modifications
```sql
ALTER TABLE admin_annotations ADD COLUMN is_verified BOOLEAN DEFAULT FALSE;
-- User-verified annotation
ALTER TABLE admin_annotations ADD COLUMN verified_at TIMESTAMP;
ALTER TABLE admin_annotations ADD COLUMN verified_by VARCHAR(255);
ALTER TABLE admin_annotations ADD COLUMN override_source VARCHAR(20);
-- If this annotation overrides another: 'auto' or 'imported'
ALTER TABLE admin_annotations ADD COLUMN original_annotation_id UUID;
-- Reference to the annotation this overrides
CREATE INDEX idx_admin_annotations_source ON admin_annotations(source);
CREATE INDEX idx_admin_annotations_is_verified ON admin_annotations(is_verified);
```
#### 3.2.3 TrainingTask Modifications
```sql
ALTER TABLE training_tasks ADD COLUMN document_count INTEGER DEFAULT 0;
-- Count of documents used in training
ALTER TABLE training_tasks ADD COLUMN document_ids UUID[];
-- Array of document IDs used (for quick reference)
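ALTER TABLE training_tasks ADD COLUMN metrics_mAP FLOAT;
-- Note: in PostgreSQL, unquoted identifiers are folded to lowercase, so this column is stored as metrics_map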
ALTER TABLE training_tasks ADD COLUMN metrics_precision FLOAT;
ALTER TABLE training_tasks ADD COLUMN metrics_recall FLOAT;
-- Extracted metrics for easy querying
CREATE INDEX idx_training_tasks_metrics ON training_tasks(metrics_mAP);
```
### 3.3 SQLModel Definitions
```python
# File: src/data/admin_models.py
from datetime import datetime
from typing import Any
from uuid import UUID, uuid4

from sqlmodel import Field, SQLModel, Column, JSON, ARRAY
from sqlalchemy import String


class BatchUpload(SQLModel, table=True):
    """Batch upload record for ZIP uploads."""

    __tablename__ = "batch_uploads"

    batch_id: UUID = Field(default_factory=uuid4, primary_key=True)
    admin_token: str = Field(foreign_key="admin_tokens.token", max_length=255, index=True)
    filename: str = Field(max_length=255)
    file_size: int
    upload_source: str = Field(default="ui", max_length=20)
    status: str = Field(default="processing", max_length=20, index=True)
    total_files: int = Field(default=0)
    processed_files: int = Field(default=0)
    successful_files: int = Field(default=0)
    failed_files: int = Field(default=0)
    error_message: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    completed_at: datetime | None = Field(default=None)


class BatchUploadFile(SQLModel, table=True):
    """Individual file within a batch upload."""

    __tablename__ = "batch_upload_files"

    file_id: UUID = Field(default_factory=uuid4, primary_key=True)
    batch_id: UUID = Field(foreign_key="batch_uploads.batch_id", index=True)
    document_id: UUID | None = Field(default=None, foreign_key="admin_documents.document_id")
    filename: str = Field(max_length=255)
    status: str = Field(default="pending", max_length=20)
    error_message: str | None = Field(default=None)
    csv_row_data: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    created_at: datetime = Field(default_factory=datetime.utcnow)
    processed_at: datetime | None = Field(default=None)


class TrainingDocumentLink(SQLModel, table=True):
    """Link between training tasks and documents used."""

    __tablename__ = "training_document_links"

    link_id: UUID = Field(default_factory=uuid4, primary_key=True)
    task_id: UUID = Field(foreign_key="training_tasks.task_id", index=True)
    document_id: UUID = Field(foreign_key="admin_documents.document_id", index=True)
    annotation_snapshot: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    created_at: datetime = Field(default_factory=datetime.utcnow)


class AnnotationHistory(SQLModel, table=True):
    """History of annotation changes."""

    __tablename__ = "annotation_history"

    history_id: UUID = Field(default_factory=uuid4, primary_key=True)
    annotation_id: UUID = Field(foreign_key="admin_annotations.annotation_id", index=True)
    action: str = Field(max_length=20)
    previous_value: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    new_value: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    changed_by: str | None = Field(default=None, max_length=255)
    change_reason: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=datetime.utcnow, index=True)
```
---
## 4. API Specification
### 4.1 New Endpoints
#### 4.1.1 Batch Upload (ZIP)
```yaml
POST /api/v1/admin/batch/upload
Content-Type: multipart/form-data
Request:
file: binary (ZIP file)
async: boolean (default: true)
auto_label: boolean (default: true)
Response (202 Accepted):
{
"batch_id": "uuid",
"status": "processing",
"total_files": 25,
"message": "Batch upload started. Use batch_id to check progress.",
"status_url": "/api/v1/admin/batch/{batch_id}"
}
Response (200 OK - sync mode):
{
"batch_id": "uuid",
"status": "completed",
"total_files": 25,
"successful_files": 23,
"failed_files": 2,
"documents": [
{
"document_id": "uuid",
"filename": "INV001.pdf",
"status": "completed",
"auto_label_status": "completed",
"annotations_created": 8
}
],
"errors": [
{
"filename": "invalid.pdf",
"error": "Corrupted PDF file"
}
]
}
```
#### 4.1.2 Batch Status
```yaml
GET /api/v1/admin/batch/{batch_id}
Response:
{
"batch_id": "uuid",
"status": "processing",
"progress": {
"total": 25,
"processed": 15,
"successful": 14,
"failed": 1,
"percentage": 60
},
"files": [
{
"file_id": "uuid",
"filename": "INV001.pdf",
"document_id": "uuid",
"status": "completed"
}
],
"created_at": "2024-01-15T10:00:00Z",
"estimated_completion": "2024-01-15T10:05:00Z"
}
```
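
The `progress` object and the batch `status` could be derived from the counters on `batch_uploads` roughly as follows. The status rules here (all succeeded is `completed`, all failed is `failed`, a mix is `partial`) are an assumption based on the status names in the schema, and `summarize_batch` is a hypothetical helper; skipped files are not modeled.

```python
def summarize_batch(total: int, successful: int, failed: int) -> dict:
    """Build the status and progress fields for a batch from its counters."""
    processed = successful + failed
    if total == 0:
        status = "failed"  # assumption: an empty ZIP counts as a failed batch
    elif processed < total:
        status = "processing"
    elif failed == 0:
        status = "completed"
    elif successful == 0:
        status = "failed"
    else:
        status = "partial"
    percentage = round(100 * processed / total) if total else 0
    return {
        "status": status,
        "progress": {
            "total": total,
            "processed": processed,
            "successful": successful,
            "failed": failed,
            "percentage": percentage,
        },
    }
```

With the numbers from the example response (25 total, 14 successful, 1 failed), this yields 15 processed files and a percentage of 60.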
#### 4.1.3 Upload PDF with Auto-Label Values
```yaml
POST /api/v1/admin/documents/upload-with-labels
Content-Type: multipart/form-data
Request:
file: binary (PDF file)
field_values: JSON string
{
"InvoiceNumber": "F2024-001",
"InvoiceDate": "2024-01-15",
"Amount": "1250.00",
"OCR": "7350012345678",
"Bankgiro": "123-4567"
}
auto_label: boolean (default: true)
wait_for_completion: boolean (default: false)
Response (202 Accepted):
{
"document_id": "uuid",
"filename": "invoice.pdf",
"status": "auto_labeling",
"auto_label_status": "running",
"message": "Document uploaded. Auto-labeling in progress."
}
Response (200 OK - wait_for_completion=true):
{
"document_id": "uuid",
"filename": "invoice.pdf",
"status": "labeled",
"auto_label_status": "completed",
"annotations": [
{
"annotation_id": "uuid",
"class_id": 0,
"class_name": "invoice_number",
"text_value": "F2024-001",
"confidence": 0.95,
"bbox": { "x": 100, "y": 200, "width": 150, "height": 30 }
}
]
}
```
#### 4.1.4 Query Document Status
```yaml
GET /api/v1/admin/documents/{document_id}/status
Response:
{
"document_id": "uuid",
"filename": "invoice.pdf",
"status": "labeled",
"auto_label_status": "completed",
"upload_source": "api",
"annotation_summary": {
"total": 8,
"manual": 2,
"auto": 6,
"verified": 3
},
"can_annotate": true,
"annotation_lock_reason": null,
"training_history": [
{
"task_id": "uuid",
"task_name": "Training Run 2024-01",
"trained_at": "2024-01-20T15:00:00Z"
}
]
}
```
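
A client polling this endpoint might look like the sketch below. To keep it independent of any HTTP library, the request itself is passed in as a callable; `wait_for_auto_label` and its parameters are hypothetical, and the terminal states are assumed to be `completed` and `failed`.

```python
import time
from typing import Any, Callable

TERMINAL_STATES = {"completed", "failed"}


def wait_for_auto_label(
    fetch_status: Callable[[], dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_status() until auto_label_status is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("auto_label_status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Auto-labeling did not finish before the timeout")
        sleep(interval)
```

In practice `fetch_status` would wrap a `GET /api/v1/admin/documents/{document_id}/status` call and return the parsed JSON body.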
#### 4.1.5 Training with Document Selection
```yaml
POST /api/v1/admin/training/tasks
Content-Type: application/json
Request:
{
"name": "Training Run 2024-01",
"description": "First training run with 500 documents",
"document_ids": ["uuid1", "uuid2", "uuid3"],
"config": {
"model_name": "yolo11n.pt",
"epochs": 100,
"batch_size": 16,
"image_size": 640
},
"scheduled_at": "2024-01-20T22:00:00Z"
}
Response:
{
"task_id": "uuid",
"name": "Training Run 2024-01",
"status": "scheduled",
"document_count": 500,
"message": "Training task scheduled for 2024-01-20T22:00:00Z"
}
```
#### 4.1.6 Get Documents for Training
```yaml
GET /api/v1/admin/training/documents
Query Parameters:
- status: labeled (required)
- has_annotations: true
- min_annotation_count: 3
- exclude_used_in_training: boolean
- limit: 100
- offset: 0
Response:
{
"total": 1500,
"documents": [
{
"document_id": "uuid",
"filename": "INV001.pdf",
"annotation_count": 8,
"annotation_sources": { "manual": 3, "auto": 5 },
"used_in_training": ["task_id_1", "task_id_2"],
"last_modified": "2024-01-15T10:00:00Z"
}
]
}
```
#### 4.1.7 Get Model List
```yaml
GET /api/v1/admin/training/models
Query Parameters:
- status: completed
- limit: 20
- offset: 0
Response:
{
"total": 15,
"models": [
{
"task_id": "uuid",
"name": "Training Run 2024-01",
"status": "completed",
"document_count": 500,
"created_at": "2024-01-20T15:00:00Z",
"completed_at": "2024-01-20T18:30:00Z",
"metrics": {
"mAP": 0.935,
"precision": 0.92,
"recall": 0.88
},
"model_path": "runs/train/invoice_fields_20240120/weights/best.pt",
"download_url": "/api/v1/admin/training/models/{task_id}/download"
}
]
}
```
#### 4.1.8 Override Annotation
```yaml
PATCH /api/v1/admin/documents/{document_id}/annotations/{annotation_id}/override
Content-Type: application/json
Request:
{
"bbox": { "x": 110, "y": 205, "width": 145, "height": 28 },
"text_value": "F2024-001-A",
"reason": "Corrected OCR error"
}
Response:
{
"annotation_id": "uuid",
"source": "manual",
"override_source": "auto",
"original_annotation_id": "uuid",
"message": "Annotation overridden successfully",
"history_id": "uuid"
}
```
### 4.2 Modified Endpoints
#### 4.2.1 Document List (Enhanced)
```yaml
GET /api/v1/admin/documents
Query Parameters (additions):
- upload_source: 'ui' | 'api' | null
- has_annotations: boolean
- auto_label_status: 'pending' | 'running' | 'completed' | 'failed'
- used_in_training: boolean
- batch_id: uuid
Response (additions to DocumentItem):
{
"documents": [
{
// ... existing fields ...
"upload_source": "api",
"batch_id": "uuid",
"can_annotate": true,
"training_count": 2
}
]
}
```
#### 4.2.2 Document Detail (Enhanced)
```yaml
GET /api/v1/admin/documents/{document_id}
Response (additions):
{
// ... existing fields ...
"upload_source": "api",
"csv_field_values": {
"InvoiceNumber": "F2024-001",
"Amount": "1250.00"
},
"can_annotate": true,
"annotation_lock_reason": null,
"annotations": [
{
// ... existing fields ...
"is_verified": true,
"verified_at": "2024-01-16T09:00:00Z",
"override_source": null
}
],
"training_history": [
{
"task_id": "uuid",
"name": "Training Run 2024-01",
"trained_at": "2024-01-20T15:00:00Z",
"model_metrics": { "mAP": 0.935 }
}
]
}
```
---
## 5. UI Wireframes (Text-Based)
### 5.1 Document List View
```
+------------------------------------------------------------------+
| DOCUMENT ANNOTATION TOOL [User: Admin] [Logout]|
+------------------------------------------------------------------+
| [Documents] [Training] [Models] [Settings] |
+------------------------------------------------------------------+
| |
| DOCUMENTS |
| +-----------------+ +-----------------------------------------+ |
| | UPLOAD | | FILTERS | |
| | [Single PDF] | | Status: [All v] Source: [All v] | |
| | [ZIP Batch] | | Auto-Label: [All v] Search: [________] | |
| +-----------------+ +-----------------------------------------+ |
| |
| +--------------------------------------------------------------+ |
| | [] Filename | Status | Auto-Label | Source | Date | |
| +--------------------------------------------------------------+ |
| | [] INV001.pdf | Labeled | Completed | API | 01/15 | |
| | [8 annotations] | [Preview] | [95%] | | | |
| +--------------------------------------------------------------+ |
| | [] INV002.pdf | Pending | Running | UI | 01/16 | |
| | [0 annotations] | [Locked] | [==== ] | | | |
| +--------------------------------------------------------------+ |
| | [] INV003.pdf | Labeled | Failed | API | 01/16 | |
| | [5 annotations] | [Preview] | [Retry] | | | |
| +--------------------------------------------------------------+ |
| | [] INV004.pdf | Labeled | Completed | UI | 01/17 | |
| | [10 annotations]| [Preview] | [98%] | [Used] | | |
| +--------------------------------------------------------------+ |
| |
| Showing 1-20 of 1,543 documents [<] [1] [2] [3] ... [78] [>] |
| |
| [Delete Selected] [Start Training with Selected] |
+------------------------------------------------------------------+
```
### 5.2 Document Detail View
```
+------------------------------------------------------------------+
| < Back to Documents INV001.pdf |
+------------------------------------------------------------------+
| |
| +---------------------------+ +-------------------------------+ |
| | | | DOCUMENT INFO | |
| | | | Status: Labeled | |
| | [Page 1 Image with | | Source: API Upload | |
| | Annotation Overlays] | | Auto-Label: Completed (95%) | |
| | | | Pages: 1 | |
| | [Manual: Solid border] | | Uploaded: 2024-01-15 | |
| | [Auto: Dashed border] | | | |
| | | | TRAINING HISTORY | |
| | | | - Run 2024-01 (mAP: 93.5%) | |
| | | | - Run 2024-02 (mAP: 95.1%) | |
| | | | | |
| +---------------------------+ +-------------------------------+ |
| |
| ANNOTATIONS [Add Annotation] [Run OCR] |
| +--------------------------------------------------------------+ |
| | Field | Value | Source | Conf | Actions | |
| +--------------------------------------------------------------+ |
| | invoice_number | F2024-001 | Manual | - | [E] [D] | |
| +--------------------------------------------------------------+ |
| | invoice_date | 2024-01-15 | Auto | 95% | [V] [E][D]| |
| +--------------------------------------------------------------+ |
| | amount | 1,250.00 | Auto | 98% | [V] [E][D]| |
| +--------------------------------------------------------------+ |
| | ocr_number | 7350012345 | Auto | 87% | [V] [E][D]| |
| +--------------------------------------------------------------+ |
| | bankgiro | 123-4567 | Manual | - | [E] [D] | |
| +--------------------------------------------------------------+ |
| |
| [V] = Verify [E] = Edit [D] = Delete |
| |
| CSV FIELD VALUES (Reference) |
| +--------------------------------------------------------------+ |
| | InvoiceNumber: F2024-001 | InvoiceDate: 2024-01-15 | |
| | Amount: 1250.00 | OCR: 7350012345678 | |
| | Bankgiro: 123-4567 | | |
| +--------------------------------------------------------------+ |
+------------------------------------------------------------------+
```
### 5.3 Training Page
```
+------------------------------------------------------------------+
| DOCUMENT ANNOTATION TOOL [User: Admin] [Logout]|
+------------------------------------------------------------------+
| [Documents] [Training] [Models] [Settings] |
+------------------------------------------------------------------+
| |
| TRAINING |
| |
| DOCUMENT SELECTION Selected: 500 docs |
| +--------------------------------------------------------------+ |
| | [] Filename | Annotations | Source | Last Modified | |
| +--------------------------------------------------------------+ |
| | [x] INV001.pdf | 8 (M:3 A:5) | API | 2024-01-15 | |
| +--------------------------------------------------------------+ |
| | [x] INV002.pdf | 10 (M:2 A:8)| UI | 2024-01-16 | |
| +--------------------------------------------------------------+ |
| | [ ] INV003.pdf | 5 (M:5 A:0) | UI | 2024-01-16 | |
| +--------------------------------------------------------------+ |
| | [x] INV004.pdf | 12 (M:4 A:8)| API | 2024-01-17 | |
| +--------------------------------------------------------------+ |
| |
| [Select All] [Select None] [Select Not Used in Training] |
| |
| Showing labeled documents only [<] [1] [2] [3] ... [50] [>] |
| |
| TRAINING CONFIGURATION |
| +--------------------------------------------------------------+ |
| | Name: [Training Run 2024-01____________] | |
| | Description: [First training with 500 documents_________] | |
| | | |
| | Base Model: [yolo11n.pt v] Epochs: [100] Batch: [16] | |
| | Image Size: [640] Device: [GPU 0 v] | |
| | | |
| | [ ] Schedule for later: [2024-01-20] [22:00] | |
| +--------------------------------------------------------------+ |
| |
| [Start Training] |
+------------------------------------------------------------------+
```
### 5.4 Model History View
```
+------------------------------------------------------------------+
| DOCUMENT ANNOTATION TOOL [User: Admin] [Logout]|
+------------------------------------------------------------------+
| [Documents] [Training] [Models] [Settings] |
+------------------------------------------------------------------+
| |
| TRAINED MODELS |
| |
| +--------------------------------------------------------------+ |
| | Name | Status | Docs | mAP | Date | |
| +--------------------------------------------------------------+ |
| | Training Run 2024-03 | Running | 750 | - | 01/25 | |
| | | [==== ] | | | | |
| | | [View Logs] [Cancel] | |
| +--------------------------------------------------------------+ |
| | Training Run 2024-02 | Completed | 600 | 95.1% | 01/20 | |
| | | P: 94% R: 92% | |
| | | [View] [Download] [Use as Base] | |
| +--------------------------------------------------------------+ |
| | Training Run 2024-01 | Completed | 500 | 93.5% | 01/15 | |
| | | P: 92% R: 88% | |
| | | [View] [Download] [Use as Base] | |
| +--------------------------------------------------------------+ |
| | Initial Training | Completed | 200 | 85.2% | 01/10 | |
| | | P: 84% R: 80% | |
| | | [View] [Download] [Use as Base] | |
| +--------------------------------------------------------------+ |
| |
| MODEL DETAIL: Training Run 2024-02 |
| +--------------------------------------------------------------+ |
| | Created: 2024-01-20 15:00 | Completed: 2024-01-20 18:30 | |
| | Duration: 3h 30m | Documents: 600 | |
| | | |
| | Metrics: | |
| | - mAP@0.5: 95.1% | |
| | - Precision: 94% | |
| | - Recall: 92% | |
| | | |
| | Configuration: | |
| | - Base: yolo11n.pt Epochs: 100 Batch: 16 Size: 640 | |
| | | |
| | Documents Used: [View 600 documents] | |
| +--------------------------------------------------------------+ |
+------------------------------------------------------------------+
```
### 5.5 Batch Upload Modal
```
+------------------------------------------------------------------+
| BATCH UPLOAD [X] |
+------------------------------------------------------------------+
| |
| Upload a ZIP file containing: |
| - Multiple PDF files |
| - (Optional) CSV file for auto-labeling |
| |
| +--------------------------------------------------------------+ |
| | | |
| | [Drag and drop ZIP file here] | |
| | or | |
| | [Browse Files] | |
| | | |
| +--------------------------------------------------------------+ |
| |
| [x] Auto-label documents (requires CSV) |
| [ ] Process asynchronously |
| |
| CSV FORMAT REQUIREMENTS: |
| Required columns: DocumentId |
| Optional: InvoiceNumber, InvoiceDate, Amount, OCR, Bankgiro... |
| [View full CSV specification] |
| |
| [Cancel] [Upload] |
+------------------------------------------------------------------+

+------------------------------------------------------------------+
| UPLOAD PROGRESS [X] |
+------------------------------------------------------------------+
| |
| Processing batch upload... |
| |
| [======================================== ] 80% |
| |
| Files: 20 / 25 |
| Successful: 18 |
| Failed: 2 |
| |
| +--------------------------------------------------------------+ |
| | [OK] INV001.pdf - Completed (8 annotations) | |
| | [OK] INV002.pdf - Completed (10 annotations) | |
| | [!!] INV003.pdf - Failed: Corrupted PDF | |
| | [OK] INV004.pdf - Completed (6 annotations) | |
| | [...] Processing INV005.pdf... | |
| +--------------------------------------------------------------+ |
| |
| [Cancel] [Close] |
+------------------------------------------------------------------+
```
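The counters in the progress view (files processed, successful, failed, percentage) can all be derived from a list of per-file results. The following is a minimal sketch under that assumption; `FileResult`, `summarize_progress`, and `format_line` are hypothetical names for illustration and are not part of the plan's API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileResult:
    """Hypothetical per-file outcome; field names are illustrative only."""
    filename: str
    ok: bool
    detail: str  # e.g. "8 annotations" or "Corrupted PDF"


def summarize_progress(results: List[FileResult], total_files: int) -> Dict[str, int]:
    """Derive the counters shown in the progress modal from per-file results."""
    processed = len(results)
    successful = sum(1 for r in results if r.ok)
    # Guard against an empty archive so the percentage never divides by zero.
    percent = round(100 * processed / total_files) if total_files else 0
    return {
        "processed": processed,
        "total": total_files,
        "successful": successful,
        "failed": processed - successful,
        "percent": percent,
    }


def format_line(result: FileResult) -> str:
    """Render one row of the per-file list in the style of the wireframe."""
    marker = "[OK]" if result.ok else "[!!]"
    status = f"Completed ({result.detail})" if result.ok else f"Failed: {result.detail}"
    return f"{marker} {result.filename} - {status}"
```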
---
## 6. Implementation Phases
### Phase 1: Database and Core Models (Week 1)
| Step | Task | Files | Risk |
|------|------|-------|------|
| 1.1 | Create database migration script | `src/data/migrations/` | Low |
| 1.2 | Add new SQLModel classes | `src/data/admin_models.py` | Low |
| 1.3 | Update AdminDB with new methods | `src/data/admin_db.py` | Medium |
| 1.4 | Add unit tests for new models | `tests/data/test_admin_models.py` | Low |
**Dependencies**: None
**Risk Assessment**: Low - mostly additive changes to existing structure
### Phase 2: Batch Upload Backend (Week 2)
| Step | Task | Files | Risk |
|------|------|-------|------|
| 2.1 | Create ZIP extraction service | `src/web/batch_upload_service.py` | Medium |
| 2.2 | Add CSV parsing with new format | `src/data/csv_loader.py` | Low |
| 2.3 | Create batch upload routes | `src/web/admin_batch_routes.py` | Medium |
| 2.4 | Add async processing queue | `src/web/batch_queue.py` | High |
| 2.5 | Integration tests | `tests/web/test_batch_upload.py` | Medium |
**Dependencies**: Phase 1
**Risk Assessment**: Medium - ZIP handling and async processing add complexity
### Phase 3: Enhanced Document Management (Week 3)
| Step | Task | Files | Risk |
|------|------|-------|------|
| 3.1 | Add upload source tracking | `src/data/admin_models.py` | Low |
| 3.2 | Update document list endpoint | `src/web/admin_routes.py` | Low |
| 3.3 | Add annotation lock mechanism | `src/web/admin_annotation_routes.py` | Medium |
| 3.4 | Add document status endpoint | `src/web/admin_routes.py` | Low |
| 3.5 | Update auto-label service | `src/web/admin_autolabel.py` | Medium |
**Dependencies**: Phase 1, Phase 2
**Risk Assessment**: Medium - locking mechanism needs careful implementation
### Phase 4: Manual Annotation Enhancement (Week 4)
| Step | Task | Files | Risk |
|------|------|-------|------|
| 4.1 | Add override mechanism | `src/web/admin_annotation_routes.py` | Medium |
| 4.2 | Add annotation history | `src/data/admin_db.py` | Low |
| 4.3 | Add verification endpoint | `src/web/admin_annotation_routes.py` | Low |
| 4.4 | Update schemas with new fields | `src/web/admin_schemas.py` | Low |
**Dependencies**: Phase 3
**Risk Assessment**: Low - extending existing annotation system
### Phase 5: Training Integration (Week 5)
| Step | Task | Files | Risk |
|------|------|-------|------|
| 5.1 | Add document selection for training | `src/web/admin_training_routes.py` | Medium |
| 5.2 | Add training document link table | `src/data/admin_db.py` | Low |
| 5.3 | Add model list endpoint | `src/web/admin_training_routes.py` | Low |
| 5.4 | Update export with selection | `src/web/admin_training_routes.py` | Medium |
| 5.5 | Add metrics extraction | `src/cli/train.py` | Medium |
**Dependencies**: Phase 1, Phase 4
**Risk Assessment**: Medium - integration with training pipeline
### Phase 6: Frontend Implementation (Weeks 6-7)
| Step | Task | Files | Risk |
|------|------|-------|------|
| 6.1 | Create React component structure | `frontend/` | High |
| 6.2 | Implement document list view | `frontend/src/components/` | Medium |
| 6.3 | Implement document detail view | `frontend/src/components/` | High |
| 6.4 | Implement training page | `frontend/src/components/` | Medium |
| 6.5 | Implement batch upload modal | `frontend/src/components/` | Medium |
| 6.6 | Add annotation editor | `frontend/src/components/` | High |
**Dependencies**: Phases 2-5
**Risk Assessment**: High - the frontend is an entirely new component
### Phase 7: Testing and Documentation (Week 8)
| Step | Task | Files | Risk |
|------|------|-------|------|
| 7.1 | Integration tests | `tests/integration/` | Medium |
| 7.2 | E2E tests | `tests/e2e/` | High |
| 7.3 | API documentation | `docs/api/` | Low |
| 7.4 | User guide | `docs/user-guide/` | Low |
| 7.5 | Performance testing | `tests/performance/` | Medium |
**Dependencies**: Phases 1-6
**Risk Assessment**: Medium - E2E tests (step 7.2) carry the highest risk in this phase
### Risk Mitigation Strategies
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| ZIP bomb attack | High | Low | Limit max file count, max total size, scan before extraction |
| Async queue failures | Medium | Medium | Implement retry logic, dead letter queue, manual retry endpoint |
| Annotation lock deadlock | Medium | Low | Timeout-based locks, admin override capability |
| Large batch performance | Medium | High | Chunked processing, progress tracking, background workers |
| Database migration issues | High | Low | Backward compatible changes, rollback scripts |
| Frontend complexity | Medium | Medium | Use established UI framework, incremental delivery |
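
The ZIP bomb mitigation in the table (limit file count and total size, and check before extraction) could be sketched with Python's standard `zipfile` module. This is an illustration only: `check_zip_limits` is a hypothetical helper and both limit values are assumed, since the plan does not specify them. The sizes read here are the ones declared in the archive headers, so a real service would likely also enforce the byte limit while streaming each member out.

```python
import io
import zipfile
from typing import List

# Assumed limits for illustration; the plan does not define concrete values.
MAX_FILES = 1000
MAX_TOTAL_BYTES = 500 * 1024 * 1024


def check_zip_limits(
    data: bytes,
    max_files: int = MAX_FILES,
    max_total_bytes: int = MAX_TOTAL_BYTES,
) -> List[str]:
    """Inspect archive metadata without extracting anything.

    Returns a list of problems; an empty list means the archive passed.
    """
    try:
        archive = zipfile.ZipFile(io.BytesIO(data))
    except zipfile.BadZipFile:
        return ["not a valid ZIP archive"]

    problems: List[str] = []
    with archive:
        # Directory entries carry no payload, so only count real files.
        members = [info for info in archive.infolist() if not info.is_dir()]
        if len(members) > max_files:
            problems.append(f"too many files: {len(members)} > {max_files}")
        # file_size is the declared uncompressed size of each member.
        total = sum(info.file_size for info in members)
        if total > max_total_bytes:
            problems.append(f"uncompressed size too large: {total} > {max_total_bytes}")
    return problems
```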
---
## 7. State Machine Diagrams
### 7.1 Document Lifecycle States
```
+-------------+
| DELETED |
+------^------+
|
| delete
|
+----------+ upload +----------+ |
| | --------------> | |--+
| (none) | | PENDING |
| | | |
+----------+ +----+-----+
|
+----------------+-----------------+
| |
| trigger auto-label | create manual annotation
v |
+-------------+ |
| | |
| AUTO_LABEL- | |
| ING | |
| | |
+------+------+ |
| |
+---------+---------+ |
| | |
| complete | fail |
v v |
+-------------+ +-------------+ |
| | | | |
| LABELED |<----+ PENDING +<--------------+
| | retry| (failed) |
+------+------+ +-------------+
|
| export
v
+-------------+
| |
| EXPORTED |
| |
+-------------+
```
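The lifecycle above can be enforced with a small transition table. The sketch below is one possible reading of the diagram, not the project's implementation. The enum, the event names, and the `next_status` helper are hypothetical. Two points are assumptions where the drawing is ambiguous: a failed auto-label run returns the document to `PENDING`, and creating a manual annotation moves a `PENDING` document to `LABELED`.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    PENDING = "pending"
    AUTO_LABELING = "auto_labeling"
    LABELED = "labeled"
    EXPORTED = "exported"
    DELETED = "deleted"


class InvalidTransition(ValueError):
    """Raised when an event is not allowed in the current status."""


# (current status, event) -> next status. Event names are illustrative.
TRANSITIONS = {
    (DocumentStatus.PENDING, "delete"): DocumentStatus.DELETED,
    (DocumentStatus.PENDING, "trigger_auto_label"): DocumentStatus.AUTO_LABELING,
    # Assumption: manual annotation takes a pending document straight to LABELED.
    (DocumentStatus.PENDING, "create_manual_annotation"): DocumentStatus.LABELED,
    (DocumentStatus.AUTO_LABELING, "complete"): DocumentStatus.LABELED,
    # Assumption: "PENDING (failed)" is modelled as plain PENDING, ready for retry.
    (DocumentStatus.AUTO_LABELING, "fail"): DocumentStatus.PENDING,
    (DocumentStatus.LABELED, "export"): DocumentStatus.EXPORTED,
}


def next_status(current: DocumentStatus, event: str) -> DocumentStatus:
    """Look up the next status, rejecting anything the table does not list."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} is not allowed from {current.name}") from None
```

Keeping the rules in one dictionary means a route handler only needs a single `next_status` call before writing the new status, and the table can be compared against the diagram line by line.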
### 7.2 Auto-Label Workflow States
```
+-------------+
| MANUAL |
| OVERRIDE |
+------^------+
|
| user edit
|
+----------+ queue +----------+ | +-----------+
| | --------------> | | | | |
| (none) | | QUEUED |--+--->| COMPLETED |
| | | | | |
+----------+ +----+-----+ +-----^-----+
| |
| start |
v |
+-------------+ |
| | |
| RUNNING +-----------+
| | success
+------+------+
|
| error
v
+-------------+
| |
| FAILED |
| |
+------+------+
|
| retry
v
+-------------+
| |
| QUEUED |
| |
+-------------+
```
### 7.3 Batch Upload States
```
+----------+ upload +-------------+
| | --------------> | |
| (none) | | PROCESSING |
| | | |
+----------+ +------+------+
|
+---------------+---------------+
| | |
| all success | some fail | all fail
v v v
+-------------+ +-------------+ +-------------+
| | | | | |
| COMPLETED | | PARTIAL | | FAILED |
| | | | | |
+-------------+ +-------------+ +-------------+
```
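The three outcomes in the diagram follow directly from the per-file success and failure counts. A minimal sketch, assuming those counts are available once processing finishes; `BatchStatus` and `final_batch_status` are hypothetical names, and rejecting an empty batch is an assumption because the diagram does not cover that case.

```python
from enum import Enum


class BatchStatus(str, Enum):
    PROCESSING = "processing"
    COMPLETED = "completed"
    PARTIAL = "partial"
    FAILED = "failed"


def final_batch_status(successful: int, failed: int) -> BatchStatus:
    """Map per-file counts to the terminal batch state from the diagram."""
    if successful < 0 or failed < 0:
        raise ValueError("counts must be non-negative")
    if successful + failed == 0:
        # Assumption: a batch with no processed files is an error, not a state.
        raise ValueError("batch contains no processed files")
    if failed == 0:
        return BatchStatus.COMPLETED  # all success
    if successful == 0:
        return BatchStatus.FAILED  # all fail
    return BatchStatus.PARTIAL  # some fail
```

For the counts shown in the upload progress wireframe (18 successful, 2 failed), this returns `BatchStatus.PARTIAL`.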
### 7.4 Training Task States
```
+----------+ create +-------------+
| | --------------> | |
| (none) | | PENDING |
| | | |
+----------+ +------+------+
|
+-------------+-------------+
| |
| immediate | scheduled
v v
+-------------+ +-------------+
| | | |
| RUNNING |<------------+ SCHEDULED |
| | trigger | |
+------+------+ +------+------+
| |
+---------+---------+ | cancel
| | v
| success | error +-------------+
v v | |
+-------------+ +-------------+ | CANCELLED |
| | | | | |
| COMPLETED | | FAILED | +-------------+
| | | |
+-------------+ +------+------+
|
| retry
v
+-------------+
| |
| PENDING |
| |
+-------------+
```
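One way to keep the training states consistent is an adjacency map from each state to the states it may move to, from which the terminal states fall out automatically. This sketch follows the diagram as drawn, so `cancel` is only reachable from `SCHEDULED`. The names `ALLOWED`, `can_transition`, and `terminal_states` are hypothetical.

```python
from typing import Dict, Set

# state -> states reachable in one step, transcribed from the diagram above.
ALLOWED: Dict[str, Set[str]] = {
    "PENDING": {"RUNNING", "SCHEDULED"},    # immediate / scheduled
    "SCHEDULED": {"RUNNING", "CANCELLED"},  # trigger / cancel
    "RUNNING": {"COMPLETED", "FAILED"},     # success / error
    "FAILED": {"PENDING"},                  # retry
    "COMPLETED": set(),
    "CANCELLED": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram has an arrow from current to target."""
    return target in ALLOWED.get(current, set())


def terminal_states() -> Set[str]:
    """States with no outgoing arrows; tasks in these states never change."""
    return {state for state, targets in ALLOWED.items() if not targets}
```

A check such as `can_transition("RUNNING", "CANCELLED")` returns `False` here, which is worth confirming against the intended behaviour of the `[Cancel]` button shown for a running task in the training wireframe.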
### 7.5 Annotation Lock States
```
+-------------+
| LOCKED |
| (auto-label |
| running) |
+------^------+
|
| auto-label starts
|
+----------+ upload +----------+ |
| | --------------> | |--+
| (none) | | UNLOCKED |<---------+
| | | | |
+----------+ +----+-----+ |
| |
| auto-label | auto-label
| starts | completes/fails
| |
v |
+-------------+ |
| | |
| LOCKED +---------+
| (timeout: |
| 5 minutes) |
+-------------+
```
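The timeout in the diagram suggests that a lock should expire on its own if auto-labeling never reports completion. A minimal sketch of that check, assuming the lock is stored as a timestamp that is set when auto-labeling starts and cleared when it completes or fails; `locked_at`, `is_locked`, and `assert_can_edit` are hypothetical names, and only the 5-minute value comes from the diagram.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

LOCK_TIMEOUT = timedelta(minutes=5)  # timeout value taken from the diagram


def is_locked(locked_at: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """Return True while an auto-label lock is held and has not timed out.

    locked_at is None when the document is unlocked.
    """
    if locked_at is None:
        return False
    current = now if now is not None else datetime.now(timezone.utc)
    # An expired lock counts as unlocked, so a crashed job cannot block edits forever.
    return current - locked_at < LOCK_TIMEOUT


def assert_can_edit(locked_at: Optional[datetime], now: Optional[datetime] = None) -> None:
    """Raise if a manual annotation edit should be rejected because of the lock."""
    if is_locked(locked_at, now):
        raise PermissionError("document is locked while auto-labeling is running")
```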
---
## Summary
This plan provides:
1. **PRD**: 24 user stories across 6 epics with clear acceptance criteria and priorities
2. **CSV Specification**: 13 columns with detailed validation rules and field mappings
3. **Database Schema**: 4 new tables + modifications to 3 existing tables with full SQLModel definitions
4. **API Specification**: 8 new endpoints + 2 modified endpoints with complete request/response schemas
5. **UI Wireframes**: 5 detailed text-based wireframes covering all major views
6. **Implementation Phases**: 7 phases over 8 weeks with 34 tasks, dependencies, and risk assessments
7. **State Machines**: 5 state diagrams covering document, auto-label, batch, training, and locking workflows
The implementation is incremental: database and backend changes (Phases 1-5) come before frontend development (Phase 6), so each backend layer can be tested before the frontend depends on it.