Document Annotation Tool - Product Plan v2
Table of Contents
- Product Requirements Document (PRD)
- CSV Format Specification
- Database Schema Changes
- API Specification
- UI Wireframes (Text-Based)
- Implementation Phases
- State Machine Diagrams
1. Product Requirements Document (PRD)
1.1 Overview
This enhancement adds batch upload capabilities, document lifecycle management, manual annotation workflow with auto-label dependency, comprehensive training management, and enhanced document detail views to the Invoice Master Document Annotation Tool.
1.2 User Stories
Epic 1: Batch Upload (ZIP Support)
| ID | User Story | Acceptance Criteria | Priority |
|---|---|---|---|
| US-1.1 | As a user, I want to upload a ZIP file containing multiple PDFs so that I can process many documents at once | - ZIP file is extracted - Each PDF is registered as a separate document - Document IDs are returned for all files - Invalid files are skipped with error message | P0 |
| US-1.2 | As a user, I want to include a CSV file in my ZIP for auto-labeling so that annotations are created automatically | - CSV is parsed and validated - DocumentId column maps to PDF filenames - Field values are stored for auto-labeling - Invalid CSV rows are logged | P0 |
| US-1.3 | As a user, I want to upload a single PDF with auto-label values via API so that I can integrate with my workflow | - PDF is uploaded - Auto-label values provided in JSON body - Auto-labeling runs automatically - Document ID returned | P0 |
| US-1.4 | As a user, I want clear feedback on batch upload progress so that I know which files succeeded or failed | - Upload progress indicator - Per-file status (success/failed) - Error messages for failed files - Summary count displayed | P1 |
Epic 2: Document List and Status
| ID | User Story | Acceptance Criteria | Priority |
|---|---|---|---|
| US-2.1 | As a user, I want to see a list of all uploaded documents so that I can manage my annotations | - Paginated document list - Shows filename, status, date - Sortable columns - Search/filter capability | P0 |
| US-2.2 | As a user, I want to see auto-label status for each document so that I know processing progress | - Status badge: pending, processing, completed, failed - Progress indicator for processing - Error message for failed | P0 |
| US-2.3 | As a user, I want to see the upload source (API vs UI) so that I can track document origin | - Source column in list - Filter by source - Source shown in detail view | P1 |
| US-2.4 | As a user, I want to see annotation preview for completed documents so that I can quickly review | - Thumbnail with overlaid bounding boxes - Annotation count badge - Click to view full detail | P1 |
Epic 3: Manual Annotation with Auto-Label Dependency
| ID | User Story | Acceptance Criteria | Priority |
|---|---|---|---|
| US-3.1 | As a user, I want to be blocked from manual annotation if auto-label is pending so that I don't lose work | - Clear message: "Auto-labeling in progress, please wait" - Refresh button to check status - Automatic unlock when complete | P0 |
| US-3.2 | As a user, I want to override auto-generated annotations so that I can correct errors | - Can edit any annotation - Source changes from "auto" to "manual" - Original auto value preserved in history - Override timestamp recorded | P0 |
| US-3.3 | As a user, I want to see which annotations are manual vs auto so that I can review confidence | - Color-coded annotation badges - Manual: solid border - Auto: dashed border with confidence % - Filter by source | P0 |
| US-3.4 | As a user, I want to accept or reject individual auto-annotations so that I can curate training data | - Accept button marks as verified - Reject button removes annotation - Bulk accept/reject actions | P1 |
Epic 4: Training Page Features
| ID | User Story | Acceptance Criteria | Priority |
|---|---|---|---|
| US-4.1 | As a user, I want to see all documents available for training so that I can select training data | - Filtered list (only labeled documents) - Shows annotation count per document - Checkbox selection - Select all/none options | P0 |
| US-4.2 | As a user, I want to select specific documents for training so that I can control data quality | - Multi-select with checkboxes - Selection count displayed - Clear selection button - Persisted selection state | P0 |
| US-4.3 | As a user, I want to see all trained models so that I can track model history | - Model list with name, date, status - Document count used - mAP/accuracy metrics - Download model link | P0 |
| US-4.4 | As a user, I want to see which documents were used in training so that I can track data lineage | - "Used in training" badge on documents - Click to see model list - Filter documents by training status | P1 |
| US-4.5 | As a user, I want to start a training job with selected documents so that I can create new models | - Start training button - Training config options - Progress monitoring - Email notification on completion | P0 |
Epic 5: Document Detail View (Enhanced)
| ID | User Story | Acceptance Criteria | Priority |
|---|---|---|---|
| US-5.1 | As a user, I want to see all annotations with their source so that I can review data quality | - Annotation list with source column - Confidence score for auto - Edit/delete buttons - Group by field type | P0 |
| US-5.2 | As a user, I want to see training history for a document so that I can understand model lineage | - List of models using this document - Training date and model name - Link to model detail page | P1 |
| US-5.3 | As a user, I want to edit annotations inline so that I can quickly make corrections | - Click to edit bounding box - Drag to resize - Double-click to edit text value - Save/cancel buttons | P0 |
| US-5.4 | As a user, I want to see auto vs manual annotation comparison so that I can evaluate auto-label quality | - Side-by-side comparison view - Highlight differences - Override history timeline | P2 |
Epic 6: API Endpoints
| ID | User Story | Acceptance Criteria | Priority |
|---|---|---|---|
| US-6.1 | As a developer, I want to upload ZIP/PDF via API so that I can automate document ingestion | - POST endpoint accepts multipart - Returns document IDs array - Async processing option - Webhook callback support | P0 |
| US-6.2 | As a developer, I want to upload PDF with auto-label values so that I can pre-annotate documents | - JSON body with field values - Auto-label runs synchronously or async - Returns annotation IDs | P0 |
| US-6.3 | As a developer, I want to query document status so that I can poll for completion | - GET endpoint with document ID - Returns full status object - Includes annotation summary | P0 |
| US-6.4 | As a developer, I want API-uploaded documents visible in UI so that I can manage all documents centrally | - Same data model for API/UI uploads - Source field distinguishes origin - Full UI functionality available | P0 |
2. CSV Format Specification
2.1 Required Headers
customer_number,supplier_name,supplier_organisation_number,supplier_accounts,DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,Bankgiro,Plusgiro
2.2 Column Definitions
| Column | Type | Required | Maps to Class | Description | Validation Rules |
|---|---|---|---|---|---|
| DocumentId | string | YES | N/A | PDF filename (without .pdf extension) | Non-empty, alphanumeric + underscore/hyphen |
| customer_number | string | NO | customer_number (9) | Customer reference number | Max 50 chars |
| supplier_name | string | NO | N/A (metadata only) | Supplier company name | Max 255 chars |
| supplier_organisation_number | string | NO | supplier_organisation_number (7) | Swedish org number (XXXXXX-XXXX) | Format: 6 digits, hyphen, 4 digits |
| supplier_accounts | string | NO | N/A (metadata only) | Pipe-separated account numbers | Max 500 chars |
| InvoiceNumber | string | NO | invoice_number (0) | Invoice reference | Max 50 chars |
| InvoiceDate | date | NO | invoice_date (1) | Invoice issue date | ISO 8601 or YYYY-MM-DD |
| InvoiceDueDate | date | NO | invoice_due_date (2) | Payment due date | ISO 8601 or YYYY-MM-DD |
| Amount | decimal | NO | amount (6) | Invoice total amount | Numeric, max 2 decimal places |
| OCR | string | NO | ocr_number (3) | Swedish OCR payment reference | Numeric string, max 25 chars |
| Message | string | NO | N/A (metadata only) | Free-text payment message | Max 140 chars |
| Bankgiro | string | NO | bankgiro (4) | Bankgiro account number | Format: XXX-XXXX or 7-8 digits |
| Plusgiro | string | NO | plusgiro (5) | Plusgiro account number | Format: XXXXXX-X or 6-8 digits |
2.3 Field to Class Mapping
CSV_TO_CLASS_MAPPING = {
    'InvoiceNumber': 0,                 # invoice_number
    'InvoiceDate': 1,                   # invoice_date
    'InvoiceDueDate': 2,                # invoice_due_date
    'OCR': 3,                           # ocr_number
    'Bankgiro': 4,                      # bankgiro
    'Plusgiro': 5,                      # plusgiro
    'Amount': 6,                        # amount
    'supplier_organisation_number': 7,  # supplier_organisation_number
    # 8: payment_line (derived from OCR/Bankgiro/Amount)
    'customer_number': 9,               # customer_number
}
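As a minimal sketch of how this mapping might be applied, the helper below turns one parsed CSV row into a dictionary keyed by class ID and skips empty values. The name `row_to_class_values` is hypothetical, and class 8 (payment_line) is left out because the plan does not specify how it is derived.

```python
# Mapping copied from section 2.3 so the example is self-contained.
CSV_TO_CLASS_MAPPING = {
    "InvoiceNumber": 0,
    "InvoiceDate": 1,
    "InvoiceDueDate": 2,
    "OCR": 3,
    "Bankgiro": 4,
    "Plusgiro": 5,
    "Amount": 6,
    "supplier_organisation_number": 7,
    "customer_number": 9,
}


def row_to_class_values(row: dict[str, str]) -> dict[int, str]:
    """Return {class_id: value} for every mapped, non-empty CSV field.

    Hypothetical helper: metadata-only columns (supplier_name, Message, ...)
    are ignored because they have no class ID.
    """
    values: dict[int, str] = {}
    for column, class_id in CSV_TO_CLASS_MAPPING.items():
        value = (row.get(column) or "").strip()
        if value:
            values[class_id] = value
    return values
```

Keeping the result keyed by class ID means the auto-label step would not need to know the CSV column names at all.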
2.4 Example CSV
customer_number,supplier_name,supplier_organisation_number,supplier_accounts,DocumentId,InvoiceNumber,InvoiceDate,InvoiceDueDate,Amount,OCR,Message,Bankgiro,Plusgiro
C12345,ACME Corp,556677-8899,123-4567|987-6543,INV001,F2024-001,2024-01-15,2024-02-15,1250.00,7350012345678,,123-4567,
C12346,Widget AB,112233-4455,,INV002,F2024-002,2024-01-16,2024-02-16,3450.50,,,987-6543,
2.5 Validation Rules
- DocumentId: Required, must match a PDF filename in the ZIP
- At least one matchable field: One of InvoiceNumber, OCR, Bankgiro, Plusgiro, Amount, supplier_organisation_number must be non-empty
- Date formats: YYYY-MM-DD, DD/MM/YYYY, DD.MM.YYYY
- Amount formats: 1234.56, 1 234,56, 1234,56 SEK
- Swedish org number: XXXXXX-XXXX pattern
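The rules above could be checked per row along the following lines. This is a sketch under stated assumptions: the function names (`parse_date`, `parse_amount`, `validate_row`) and error messages are illustrative, and the amount parser only handles the three formats listed above.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

MATCHABLE_FIELDS = ("InvoiceNumber", "OCR", "Bankgiro", "Plusgiro",
                    "Amount", "supplier_organisation_number")
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")
ORG_NUMBER_RE = re.compile(r"^\d{6}-\d{4}$")


def parse_date(text: str) -> date:
    """Try each accepted date format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date {text!r}")


def parse_amount(text: str) -> Decimal:
    """Accept '1234.56', '1 234,56' and '1234,56 SEK'."""
    cleaned = re.sub(r"\s", "", text.upper().replace("SEK", ""))
    try:
        return Decimal(cleaned.replace(",", "."))
    except InvalidOperation:
        raise ValueError(f"unrecognized amount {text!r}") from None


def validate_row(row: dict[str, str], pdf_stems: set[str]) -> list[str]:
    """Return a list of error messages; an empty list means the row is valid."""
    errors: list[str] = []
    doc_id = (row.get("DocumentId") or "").strip()
    if not doc_id:
        errors.append("DocumentId is required")
    elif doc_id not in pdf_stems:
        errors.append(f"DocumentId {doc_id!r} does not match a PDF in the ZIP")
    if not any((row.get(name) or "").strip() for name in MATCHABLE_FIELDS):
        errors.append("at least one matchable field must be non-empty")
    org = (row.get("supplier_organisation_number") or "").strip()
    if org and not ORG_NUMBER_RE.match(org):
        errors.append("supplier_organisation_number must match XXXXXX-XXXX")
    for name, parser in (("InvoiceDate", parse_date),
                         ("InvoiceDueDate", parse_date),
                         ("Amount", parse_amount)):
        value = (row.get(name) or "").strip()
        if value:
            try:
                parser(value)
            except ValueError as exc:
                errors.append(f"{name}: {exc}")
    return errors
```

Returning a list of messages rather than raising on the first problem fits the requirement that invalid CSV rows are logged while the rest of the batch continues.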
3. Database Schema Changes
3.1 New Tables
3.1.1 BatchUpload Table
CREATE TABLE batch_uploads (
    batch_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    admin_token VARCHAR(255) NOT NULL REFERENCES admin_tokens(token),
    filename VARCHAR(255) NOT NULL,
    file_size INTEGER NOT NULL,
    upload_source VARCHAR(20) NOT NULL DEFAULT 'ui', -- 'ui' or 'api'
    status VARCHAR(20) NOT NULL DEFAULT 'processing',
    -- Status: processing, completed, partial, failed
    total_files INTEGER DEFAULT 0,
    processed_files INTEGER DEFAULT 0,
    successful_files INTEGER DEFAULT 0,
    failed_files INTEGER DEFAULT 0,
    error_message TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMP
);
CREATE INDEX idx_batch_uploads_admin_token ON batch_uploads(admin_token);
CREATE INDEX idx_batch_uploads_status ON batch_uploads(status);
3.1.2 BatchUploadFile Table
CREATE TABLE batch_upload_files (
    file_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    batch_id UUID NOT NULL REFERENCES batch_uploads(batch_id) ON DELETE CASCADE,
    document_id UUID REFERENCES admin_documents(document_id),
    filename VARCHAR(255) NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- Status: pending, processing, completed, failed, skipped
    error_message TEXT,
    csv_row_data JSONB, -- Parsed CSV row for this file
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    processed_at TIMESTAMP
);
CREATE INDEX idx_batch_upload_files_batch_id ON batch_upload_files(batch_id);
CREATE INDEX idx_batch_upload_files_document_id ON batch_upload_files(document_id);
3.1.3 TrainingDocumentLink Table (Junction Table)
CREATE TABLE training_document_links (
    link_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    task_id UUID NOT NULL REFERENCES training_tasks(task_id) ON DELETE CASCADE,
    document_id UUID NOT NULL REFERENCES admin_documents(document_id) ON DELETE CASCADE,
    annotation_snapshot JSONB, -- Snapshot of annotations at training time
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    UNIQUE(task_id, document_id)
);
CREATE INDEX idx_training_doc_links_task_id ON training_document_links(task_id);
CREATE INDEX idx_training_doc_links_document_id ON training_document_links(document_id);
3.1.4 AnnotationHistory Table
CREATE TABLE annotation_history (
    history_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    annotation_id UUID NOT NULL REFERENCES admin_annotations(annotation_id) ON DELETE CASCADE,
    action VARCHAR(20) NOT NULL, -- 'created', 'updated', 'deleted', 'override'
    previous_value JSONB, -- Full annotation state before change
    new_value JSONB, -- Full annotation state after change
    changed_by VARCHAR(255), -- admin_token
    change_reason TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_annotation_history_annotation_id ON annotation_history(annotation_id);
CREATE INDEX idx_annotation_history_created_at ON annotation_history(created_at);
3.2 Modified Tables
3.2.1 AdminDocument Modifications
ALTER TABLE admin_documents ADD COLUMN upload_source VARCHAR(20) DEFAULT 'ui';
-- Values: 'ui', 'api'
ALTER TABLE admin_documents ADD COLUMN batch_id UUID REFERENCES batch_uploads(batch_id);
ALTER TABLE admin_documents ADD COLUMN csv_field_values JSONB;
-- Stores original CSV values for reference
ALTER TABLE admin_documents ADD COLUMN auto_label_queued_at TIMESTAMP;
-- When auto-label was queued (for dependency checking)
ALTER TABLE admin_documents ADD COLUMN annotation_lock_until TIMESTAMP;
-- Lock for manual annotation while auto-label runs
CREATE INDEX idx_admin_documents_upload_source ON admin_documents(upload_source);
CREATE INDEX idx_admin_documents_batch_id ON admin_documents(batch_id);
3.2.2 AdminAnnotation Modifications
ALTER TABLE admin_annotations ADD COLUMN is_verified BOOLEAN DEFAULT FALSE;
-- User-verified annotation
ALTER TABLE admin_annotations ADD COLUMN verified_at TIMESTAMP;
ALTER TABLE admin_annotations ADD COLUMN verified_by VARCHAR(255);
ALTER TABLE admin_annotations ADD COLUMN override_source VARCHAR(20);
-- If this annotation overrides another: 'auto' or 'imported'
ALTER TABLE admin_annotations ADD COLUMN original_annotation_id UUID;
-- Reference to the annotation this overrides
CREATE INDEX idx_admin_annotations_source ON admin_annotations(source);
CREATE INDEX idx_admin_annotations_is_verified ON admin_annotations(is_verified);
3.2.3 TrainingTask Modifications
ALTER TABLE training_tasks ADD COLUMN document_count INTEGER DEFAULT 0;
-- Count of documents used in training
ALTER TABLE training_tasks ADD COLUMN document_ids UUID[];
-- Array of document IDs used (for quick reference)
ALTER TABLE training_tasks ADD COLUMN metrics_mAP FLOAT;
ALTER TABLE training_tasks ADD COLUMN metrics_precision FLOAT;
ALTER TABLE training_tasks ADD COLUMN metrics_recall FLOAT;
-- Extracted metrics for easy querying
CREATE INDEX idx_training_tasks_metrics ON training_tasks(metrics_mAP);
3.3 SQLModel Definitions
# File: src/data/admin_models.py
from datetime import datetime
from typing import Any
from uuid import UUID, uuid4

from sqlalchemy import UniqueConstraint
from sqlmodel import JSON, Column, Field, SQLModel


class BatchUpload(SQLModel, table=True):
    """Batch upload record for ZIP uploads."""

    __tablename__ = "batch_uploads"

    batch_id: UUID = Field(default_factory=uuid4, primary_key=True)
    admin_token: str = Field(foreign_key="admin_tokens.token", max_length=255, index=True)
    filename: str = Field(max_length=255)
    file_size: int
    upload_source: str = Field(default="ui", max_length=20)  # 'ui' or 'api'
    status: str = Field(default="processing", max_length=20, index=True)
    total_files: int = Field(default=0)
    processed_files: int = Field(default=0)
    successful_files: int = Field(default=0)
    failed_files: int = Field(default=0)
    error_message: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    completed_at: datetime | None = Field(default=None)


class BatchUploadFile(SQLModel, table=True):
    """Individual file within a batch upload."""

    __tablename__ = "batch_upload_files"

    file_id: UUID = Field(default_factory=uuid4, primary_key=True)
    batch_id: UUID = Field(foreign_key="batch_uploads.batch_id", index=True)
    document_id: UUID | None = Field(default=None, foreign_key="admin_documents.document_id")
    filename: str = Field(max_length=255)
    status: str = Field(default="pending", max_length=20)
    error_message: str | None = Field(default=None)
    csv_row_data: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    created_at: datetime = Field(default_factory=datetime.utcnow)
    processed_at: datetime | None = Field(default=None)


class TrainingDocumentLink(SQLModel, table=True):
    """Link between training tasks and documents used."""

    __tablename__ = "training_document_links"
    # Mirrors UNIQUE(task_id, document_id) from the SQL definition in 3.1.3.
    __table_args__ = (UniqueConstraint("task_id", "document_id"),)

    link_id: UUID = Field(default_factory=uuid4, primary_key=True)
    task_id: UUID = Field(foreign_key="training_tasks.task_id", index=True)
    document_id: UUID = Field(foreign_key="admin_documents.document_id", index=True)
    annotation_snapshot: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    created_at: datetime = Field(default_factory=datetime.utcnow)


class AnnotationHistory(SQLModel, table=True):
    """History of annotation changes."""

    __tablename__ = "annotation_history"

    history_id: UUID = Field(default_factory=uuid4, primary_key=True)
    annotation_id: UUID = Field(foreign_key="admin_annotations.annotation_id", index=True)
    action: str = Field(max_length=20)  # 'created', 'updated', 'deleted', 'override'
    previous_value: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    new_value: dict[str, Any] | None = Field(default=None, sa_column=Column(JSON))
    changed_by: str | None = Field(default=None, max_length=255)  # admin_token
    change_reason: str | None = Field(default=None)
    created_at: datetime = Field(default_factory=datetime.utcnow, index=True)
4. API Specification
4.1 New Endpoints
4.1.1 Batch Upload (ZIP)
POST /api/v1/admin/batch/upload
Content-Type: multipart/form-data
Request:
file: binary (ZIP file)
async: boolean (default: true)
auto_label: boolean (default: true)
Response (202 Accepted):
{
"batch_id": "uuid",
"status": "processing",
"total_files": 25,
"message": "Batch upload started. Use batch_id to check progress.",
"status_url": "/api/v1/admin/batch/{batch_id}"
}
Response (200 OK - sync mode):
{
"batch_id": "uuid",
"status": "completed",
"total_files": 25,
"successful_files": 23,
"failed_files": 2,
"documents": [
{
"document_id": "uuid",
"filename": "INV001.pdf",
"status": "completed",
"auto_label_status": "completed",
"annotations_created": 8
}
],
"errors": [
{
"filename": "invalid.pdf",
"error": "Corrupted PDF file"
}
]
}
4.1.2 Batch Status
GET /api/v1/admin/batch/{batch_id}
Response:
{
"batch_id": "uuid",
"status": "processing",
"progress": {
"total": 25,
"processed": 15,
"successful": 14,
"failed": 1,
"percentage": 60
},
"files": [
{
"file_id": "uuid",
"filename": "INV001.pdf",
"document_id": "uuid",
"status": "completed"
}
],
"created_at": "2024-01-15T10:00:00Z",
"estimated_completion": "2024-01-15T10:05:00Z"
}
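The `progress` object and the overall batch status could be derived from the per-file counters roughly as follows. This is a minimal sketch: `BatchCounters` and `batch_progress` are hypothetical names, and the rules for choosing between `completed`, `partial`, and `failed` (including treating an empty batch as `failed`) are assumptions, since the plan lists the status values without defining the transitions.

```python
from dataclasses import dataclass


@dataclass
class BatchCounters:
    total: int
    successful: int
    failed: int


def batch_progress(counters: BatchCounters) -> dict:
    """Build the progress fields plus a derived batch status."""
    processed = counters.successful + counters.failed
    percentage = round(100 * processed / counters.total) if counters.total else 0
    if counters.total == 0:
        status = "failed"        # assumption: an empty ZIP counts as a failure
    elif processed < counters.total:
        status = "processing"
    elif counters.failed == 0:
        status = "completed"
    elif counters.successful == 0:
        status = "failed"
    else:
        status = "partial"       # finished, with a mix of successes and failures
    return {
        "total": counters.total,
        "processed": processed,
        "successful": counters.successful,
        "failed": counters.failed,
        "percentage": percentage,
        "status": status,
    }
```

With the counters from the example response (25 total, 14 successful, 1 failed), this yields 15 processed files and a percentage of 60.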
4.1.3 Upload PDF with Auto-Label Values
POST /api/v1/admin/documents/upload-with-labels
Content-Type: multipart/form-data
Request:
file: binary (PDF file)
field_values: JSON string
{
"InvoiceNumber": "F2024-001",
"InvoiceDate": "2024-01-15",
"Amount": "1250.00",
"OCR": "7350012345678",
"Bankgiro": "123-4567"
}
auto_label: boolean (default: true)
wait_for_completion: boolean (default: false)
Response (202 Accepted):
{
"document_id": "uuid",
"filename": "invoice.pdf",
"status": "auto_labeling",
"auto_label_status": "running",
"message": "Document uploaded. Auto-labeling in progress."
}
Response (200 OK - wait_for_completion=true):
{
"document_id": "uuid",
"filename": "invoice.pdf",
"status": "labeled",
"auto_label_status": "completed",
"annotations": [
{
"annotation_id": "uuid",
"class_id": 0,
"class_name": "invoice_number",
"text_value": "F2024-001",
"confidence": 0.95,
"bbox": { "x": 100, "y": 200, "width": 150, "height": 30 }
}
]
}
4.1.4 Query Document Status
GET /api/v1/admin/documents/{document_id}/status
Response:
{
"document_id": "uuid",
"filename": "invoice.pdf",
"status": "labeled",
"auto_label_status": "completed",
"upload_source": "api",
"annotation_summary": {
"total": 8,
"manual": 2,
"auto": 6,
"verified": 3
},
"can_annotate": true,
"annotation_lock_reason": null,
"training_history": [
{
"task_id": "uuid",
"task_name": "Training Run 2024-01",
"trained_at": "2024-01-20T15:00:00Z"
}
]
}
4.1.5 Training with Document Selection
POST /api/v1/admin/training/tasks
Content-Type: application/json
Request:
{
"name": "Training Run 2024-01",
"description": "First training run with 500 documents",
"document_ids": ["uuid1", "uuid2", "uuid3"],
"config": {
"model_name": "yolo11n.pt",
"epochs": 100,
"batch_size": 16,
"image_size": 640
},
"scheduled_at": "2024-01-20T22:00:00Z"
}
Response:
{
"task_id": "uuid",
"name": "Training Run 2024-01",
"status": "scheduled",
"document_count": 500,
"message": "Training task scheduled for 2024-01-20T22:00:00Z"
}
4.1.6 Get Documents for Training
GET /api/v1/admin/training/documents
Query Parameters:
- status: labeled (required)
- has_annotations: true
- min_annotation_count: 3
- exclude_used_in_training: boolean
- limit: 100
- offset: 0
Response:
{
"total": 1500,
"documents": [
{
"document_id": "uuid",
"filename": "INV001.pdf",
"annotation_count": 8,
"annotation_sources": { "manual": 3, "auto": 5 },
"used_in_training": ["task_id_1", "task_id_2"],
"last_modified": "2024-01-15T10:00:00Z"
}
]
}
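The query parameters above amount to a filter followed by pagination. The in-memory sketch below illustrates the intended semantics only; a real implementation would presumably push these conditions into the database query. The function name `select_training_documents` and the dictionary keys are illustrative, and the default `min_annotation_count` of 3 is taken from the example value above rather than from a stated default.

```python
def select_training_documents(
    documents: list[dict],
    min_annotation_count: int = 3,
    exclude_used_in_training: bool = False,
    limit: int = 100,
    offset: int = 0,
) -> dict:
    """Filter labeled documents and return one page plus the total count."""
    eligible = []
    for doc in documents:
        if doc["status"] != "labeled":
            continue
        if doc["annotation_count"] < min_annotation_count:
            continue
        # A non-empty used_in_training list means at least one task used the document.
        if exclude_used_in_training and doc["used_in_training"]:
            continue
        eligible.append(doc)
    return {"total": len(eligible), "documents": eligible[offset:offset + limit]}
```

Note that `total` counts every matching document, not only those on the returned page, which is what a paginated selection list needs to render its page controls.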
4.1.7 Get Model List
GET /api/v1/admin/training/models
Query Parameters:
- status: completed
- limit: 20
- offset: 0
Response:
{
"total": 15,
"models": [
{
"task_id": "uuid",
"name": "Training Run 2024-01",
"status": "completed",
"document_count": 500,
"created_at": "2024-01-20T15:00:00Z",
"completed_at": "2024-01-20T18:30:00Z",
"metrics": {
"mAP": 0.935,
"precision": 0.92,
"recall": 0.88
},
"model_path": "runs/train/invoice_fields_20240120/weights/best.pt",
"download_url": "/api/v1/admin/training/models/{task_id}/download"
}
]
}
4.1.8 Override Annotation
PATCH /api/v1/admin/documents/{document_id}/annotations/{annotation_id}/override
Content-Type: application/json
Request:
{
"bbox": { "x": 110, "y": 205, "width": 145, "height": 28 },
"text_value": "F2024-001-A",
"reason": "Corrected OCR error"
}
Response:
{
"annotation_id": "uuid",
"source": "manual",
"override_source": "auto",
"original_annotation_id": "uuid",
"message": "Annotation overridden successfully",
"history_id": "uuid"
}
4.2 Modified Endpoints
4.2.1 Document List (Enhanced)
GET /api/v1/admin/documents
Query Parameters (additions):
- upload_source: 'ui' | 'api' | null
- has_annotations: boolean
- auto_label_status: 'pending' | 'running' | 'completed' | 'failed'
- used_in_training: boolean
- batch_id: uuid
Response (additions to DocumentItem):
{
"documents": [
{
// ... existing fields ...
"upload_source": "api",
"batch_id": "uuid",
"can_annotate": true,
"training_count": 2
}
]
}
4.2.2 Document Detail (Enhanced)
GET /api/v1/admin/documents/{document_id}
Response (additions):
{
// ... existing fields ...
"upload_source": "api",
"csv_field_values": {
"InvoiceNumber": "F2024-001",
"Amount": "1250.00"
},
"can_annotate": true,
"annotation_lock_reason": null,
"annotations": [
{
// ... existing fields ...
"is_verified": true,
"verified_at": "2024-01-16T09:00:00Z",
"override_source": null
}
],
"training_history": [
{
"task_id": "uuid",
"name": "Training Run 2024-01",
"trained_at": "2024-01-20T15:00:00Z",
"model_metrics": { "mAP": 0.935 }
}
]
}
5. UI Wireframes (Text-Based)
5.1 Document List View
+------------------------------------------------------------------+
| DOCUMENT ANNOTATION TOOL [User: Admin] [Logout]|
+------------------------------------------------------------------+
| [Documents] [Training] [Models] [Settings] |
+------------------------------------------------------------------+
| |
| DOCUMENTS |
| +-----------------+ +-----------------------------------------+ |
| | UPLOAD | | FILTERS | |
| | [Single PDF] | | Status: [All v] Source: [All v] | |
| | [ZIP Batch] | | Auto-Label: [All v] Search: [________] | |
| +-----------------+ +-----------------------------------------+ |
| |
| +--------------------------------------------------------------+ |
| | [] Filename | Status | Auto-Label | Source | Date | |
| +--------------------------------------------------------------+ |
| | [] INV001.pdf | Labeled | Completed | API | 01/15 | |
| | [8 annotations] | [Preview] | [95%] | | | |
| +--------------------------------------------------------------+ |
| | [] INV002.pdf | Pending | Running | UI | 01/16 | |
| | [0 annotations] | [Locked] | [==== ] | | | |
| +--------------------------------------------------------------+ |
| | [] INV003.pdf | Labeled | Failed | API | 01/16 | |
| | [5 annotations] | [Preview] | [Retry] | | | |
| +--------------------------------------------------------------+ |
| | [] INV004.pdf | Labeled | Completed | UI | 01/17 | |
| | [10 annotations]| [Preview] | [98%] | [Used] | | |
| +--------------------------------------------------------------+ |
| |
| Showing 1-20 of 1,543 documents [<] [1] [2] [3] ... [78] [>] |
| |
| [Delete Selected] [Start Training with Selected] |
+------------------------------------------------------------------+
5.2 Document Detail View
+------------------------------------------------------------------+
| < Back to Documents INV001.pdf |
+------------------------------------------------------------------+
| |
| +---------------------------+ +-------------------------------+ |
| | | | DOCUMENT INFO | |
| | | | Status: Labeled | |
| | [Page 1 Image with | | Source: API Upload | |
| | Annotation Overlays] | | Auto-Label: Completed (95%) | |
| | | | Pages: 1 | |
| | [Manual: Solid border] | | Uploaded: 2024-01-15 | |
| | [Auto: Dashed border] | | | |
| | | | TRAINING HISTORY | |
| | | | - Run 2024-01 (mAP: 93.5%) | |
| | | | - Run 2024-02 (mAP: 95.1%) | |
| | | | | |
| +---------------------------+ +-------------------------------+ |
| |
| ANNOTATIONS [Add Annotation] [Run OCR] |
| +--------------------------------------------------------------+ |
| | Field | Value | Source | Conf | Actions | |
| +--------------------------------------------------------------+ |
| | invoice_number | F2024-001 | Manual | - | [E] [D] | |
| +--------------------------------------------------------------+ |
| | invoice_date | 2024-01-15 | Auto | 95% | [V] [E][D]| |
| +--------------------------------------------------------------+ |
| | amount | 1,250.00 | Auto | 98% | [V] [E][D]| |
| +--------------------------------------------------------------+ |
| | ocr_number | 7350012345 | Auto | 87% | [V] [E][D]| |
| +--------------------------------------------------------------+ |
| | bankgiro | 123-4567 | Manual | - | [E] [D] | |
| +--------------------------------------------------------------+ |
| |
| [V] = Verify [E] = Edit [D] = Delete |
| |
| CSV FIELD VALUES (Reference) |
| +--------------------------------------------------------------+ |
| | InvoiceNumber: F2024-001 | InvoiceDate: 2024-01-15 | |
| | Amount: 1250.00 | OCR: 7350012345678 | |
| | Bankgiro: 123-4567 | | |
| +--------------------------------------------------------------+ |
+------------------------------------------------------------------+
5.3 Training Page
+------------------------------------------------------------------+
| DOCUMENT ANNOTATION TOOL [User: Admin] [Logout]|
+------------------------------------------------------------------+
| [Documents] [Training] [Models] [Settings] |
+------------------------------------------------------------------+
| |
| TRAINING |
| |
| DOCUMENT SELECTION Selected: 500 docs |
| +--------------------------------------------------------------+ |
| | [] Filename | Annotations | Source | Last Modified | |
| +--------------------------------------------------------------+ |
| | [x] INV001.pdf | 8 (M:3 A:5) | API | 2024-01-15 | |
| +--------------------------------------------------------------+ |
| | [x] INV002.pdf | 10 (M:2 A:8)| UI | 2024-01-16 | |
| +--------------------------------------------------------------+ |
| | [ ] INV003.pdf | 5 (M:5 A:0) | UI | 2024-01-16 | |
| +--------------------------------------------------------------+ |
| | [x] INV004.pdf | 12 (M:4 A:8)| API | 2024-01-17 | |
| +--------------------------------------------------------------+ |
| |
| [Select All] [Select None] [Select Not Used in Training] |
| |
| Showing labeled documents only [<] [1] [2] [3] ... [50] [>] |
| |
| TRAINING CONFIGURATION |
| +--------------------------------------------------------------+ |
| | Name: [Training Run 2024-01____________] | |
| | Description: [First training with 500 documents_________] | |
| | | |
| | Base Model: [yolo11n.pt v] Epochs: [100] Batch: [16] | |
| | Image Size: [640] Device: [GPU 0 v] | |
| | | |
| | [ ] Schedule for later: [2024-01-20] [22:00] | |
| +--------------------------------------------------------------+ |
| |
| [Start Training] |
+------------------------------------------------------------------+
5.4 Model History View
+------------------------------------------------------------------+
| DOCUMENT ANNOTATION TOOL [User: Admin] [Logout]|
+------------------------------------------------------------------+
| [Documents] [Training] [Models] [Settings] |
+------------------------------------------------------------------+
| |
| TRAINED MODELS |
| |
| +--------------------------------------------------------------+ |
| | Name | Status | Docs | mAP | Date | |
| +--------------------------------------------------------------+ |
| | Training Run 2024-03 | Running | 750 | - | 01/25 | |
| | | [==== ] | | | | |
| | | [View Logs] [Cancel] | |
| +--------------------------------------------------------------+ |
| | Training Run 2024-02 | Completed | 600 | 95.1% | 01/20 | |
| | | P: 94% R: 92% | |
| | | [View] [Download] [Use as Base] | |
| +--------------------------------------------------------------+ |
| | Training Run 2024-01 | Completed | 500 | 93.5% | 01/15 | |
| | | P: 92% R: 88% | |
| | | [View] [Download] [Use as Base] | |
| +--------------------------------------------------------------+ |
| | Initial Training | Completed | 200 | 85.2% | 01/10 | |
| | | P: 84% R: 80% | |
| | | [View] [Download] [Use as Base] | |
| +--------------------------------------------------------------+ |
| |
| MODEL DETAIL: Training Run 2024-02 |
| +--------------------------------------------------------------+ |
| | Created: 2024-01-20 15:00 | Completed: 2024-01-20 18:30 | |
| | Duration: 3h 30m | Documents: 600 | |
| | | |
| | Metrics: | |
| | - mAP@0.5: 95.1% | |
| | - Precision: 94% | |
| | - Recall: 92% | |
| | | |
| | Configuration: | |
| | - Base: yolo11n.pt Epochs: 100 Batch: 16 Size: 640 | |
| | | |
| | Documents Used: [View 600 documents] | |
| +--------------------------------------------------------------+ |
+------------------------------------------------------------------+
5.5 Batch Upload Modal
+------------------------------------------------------------------+
| BATCH UPLOAD [X] |
+------------------------------------------------------------------+
| |
| Upload a ZIP file containing: |
| - Multiple PDF files |
| - (Optional) CSV file for auto-labeling |
| |
| +--------------------------------------------------------------+ |
| | | |
| | [Drag and drop ZIP file here] | |
| | or | |
| | [Browse Files] | |
| | | |
| +--------------------------------------------------------------+ |
| |
| [x] Auto-label documents (requires CSV) |
| [ ] Process asynchronously |
| |
| CSV FORMAT REQUIREMENTS: |
| Required columns: DocumentId |
| Optional: InvoiceNumber, InvoiceDate, Amount, OCR, Bankgiro... |
| [View full CSV specification] |
| |
| [Cancel] [Upload] |
+------------------------------------------------------------------+
+------------------------------------------------------------------+
| UPLOAD PROGRESS                                              [X] |
+------------------------------------------------------------------+
|                                                                  |
| Processing batch upload...                                       |
|                                                                  |
| [========================================          ] 80%         |
|                                                                  |
| Files: 20 / 25                                                   |
| Successful: 18                                                   |
| Failed: 2                                                        |
|                                                                  |
| +--------------------------------------------------------------+ |
| | [OK] INV001.pdf - Completed (8 annotations)                  | |
| | [OK] INV002.pdf - Completed (10 annotations)                 | |
| | [!!] INV003.pdf - Failed: Corrupted PDF                      | |
| | [OK] INV004.pdf - Completed (6 annotations)                  | |
| | [...] Processing INV005.pdf...                               | |
| +--------------------------------------------------------------+ |
|                                                                  |
|                                                [Cancel]  [Close] |
+------------------------------------------------------------------+
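The modal states that `DocumentId` is the only required CSV column. A minimal sketch of that header check follows, using only the standard library. The function name `missing_csv_columns` and the whitespace stripping are assumptions for illustration, not part of the plan.

```python
import csv
import io

# From the modal: DocumentId is the only required column.
REQUIRED_COLUMNS = {"DocumentId"}


def missing_csv_columns(csv_text: str) -> set[str]:
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # fieldnames is None when the input has no header row at all.
    header = {name.strip() for name in (reader.fieldnames or [])}
    return REQUIRED_COLUMNS - header


print(missing_csv_columns("InvoiceNumber,Amount\n123,100.00\n"))
```

Running the check before any rows are processed lets the upload be rejected with a single clear message instead of one error per row.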
6. Implementation Phases
Phase 1: Database and Core Models (Week 1)
| Step | Task | Files | Risk |
|---|---|---|---|
| 1.1 | Create database migration script | src/data/migrations/ | Low |
| 1.2 | Add new SQLModel classes | src/data/admin_models.py | Low |
| 1.3 | Update AdminDB with new methods | src/data/admin_db.py | Medium |
| 1.4 | Add unit tests for new models | tests/data/test_admin_models.py | Low |

Dependencies: None
Risk Assessment: Low - mostly additive changes to existing structure
Phase 2: Batch Upload Backend (Week 2)
| Step | Task | Files | Risk |
|---|---|---|---|
| 2.1 | Create ZIP extraction service | src/web/batch_upload_service.py | Medium |
| 2.2 | Add CSV parsing with new format | src/data/csv_loader.py | Low |
| 2.3 | Create batch upload routes | src/web/admin_batch_routes.py | Medium |
| 2.4 | Add async processing queue | src/web/batch_queue.py | High |
| 2.5 | Integration tests | tests/web/test_batch_upload.py | Medium |

Dependencies: Phase 1
Risk Assessment: Medium - ZIP handling and async processing add complexity
Phase 3: Enhanced Document Management (Week 3)
| Step | Task | Files | Risk |
|---|---|---|---|
| 3.1 | Add upload source tracking | src/data/admin_models.py | Low |
| 3.2 | Update document list endpoint | src/web/admin_routes.py | Low |
| 3.3 | Add annotation lock mechanism | src/web/admin_annotation_routes.py | Medium |
| 3.4 | Add document status endpoint | src/web/admin_routes.py | Low |
| 3.5 | Update auto-label service | src/web/admin_autolabel.py | Medium |

Dependencies: Phase 1, Phase 2
Risk Assessment: Medium - locking mechanism needs careful implementation
Phase 4: Manual Annotation Enhancement (Week 4)
| Step | Task | Files | Risk |
|---|---|---|---|
| 4.1 | Add override mechanism | src/web/admin_annotation_routes.py | Medium |
| 4.2 | Add annotation history | src/data/admin_db.py | Low |
| 4.3 | Add verification endpoint | src/web/admin_annotation_routes.py | Low |
| 4.4 | Update schemas with new fields | src/web/admin_schemas.py | Low |

Dependencies: Phase 3
Risk Assessment: Low - extending existing annotation system
Phase 5: Training Integration (Week 5)
| Step | Task | Files | Risk |
|---|---|---|---|
| 5.1 | Add document selection for training | src/web/admin_training_routes.py | Medium |
| 5.2 | Add training document link table | src/data/admin_db.py | Low |
| 5.3 | Add model list endpoint | src/web/admin_training_routes.py | Low |
| 5.4 | Update export with selection | src/web/admin_training_routes.py | Medium |
| 5.5 | Add metrics extraction | src/cli/train.py | Medium |

Dependencies: Phase 1, Phase 4
Risk Assessment: Medium - integration with training pipeline
Phase 6: Frontend Implementation (Weeks 6-7)
| Step | Task | Files | Risk |
|---|---|---|---|
| 6.1 | Create React component structure | frontend/ | High |
| 6.2 | Implement document list view | frontend/src/components/ | Medium |
| 6.3 | Implement document detail view | frontend/src/components/ | High |
| 6.4 | Implement training page | frontend/src/components/ | Medium |
| 6.5 | Implement batch upload modal | frontend/src/components/ | Medium |
| 6.6 | Add annotation editor | frontend/src/components/ | High |

Dependencies: Phases 2-5
Risk Assessment: High - the frontend is a new component
Phase 7: Testing and Documentation (Week 8)
| Step | Task | Files | Risk |
|---|---|---|---|
| 7.1 | Integration tests | tests/integration/ | Medium |
| 7.2 | E2E tests | tests/e2e/ | High |
| 7.3 | API documentation | docs/api/ | Low |
| 7.4 | User guide | docs/user-guide/ | Low |
| 7.5 | Performance testing | tests/performance/ | Medium |

Dependencies: All previous phases
Risk Assessment: Medium
Risk Mitigation Strategies
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| ZIP bomb attack | High | Low | Limit max file count, max total size, scan before extraction |
| Async queue failures | Medium | Medium | Implement retry logic, dead letter queue, manual retry endpoint |
| Annotation lock deadlock | Medium | Low | Timeout-based locks, admin override capability |
| Large batch performance | Medium | High | Chunked processing, progress tracking, background workers |
| Database migration issues | High | Low | Backward compatible changes, rollback scripts |
| Frontend complexity | Medium | Medium | Use established UI framework, incremental delivery |
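The ZIP bomb mitigation in the table (limit file count, limit total size, scan before extraction) can be sketched as a metadata-only check with Python's standard `zipfile` module. The three threshold constants and the function name `find_zip_problems` are hypothetical, since the plan does not give concrete limits. The compression-ratio check is an extra heuristic that the table does not list.

```python
import io
import zipfile

# Hypothetical limits: the plan names the checks but not the thresholds.
MAX_FILE_COUNT = 1000
MAX_TOTAL_UNCOMPRESSED_BYTES = 500 * 1024 * 1024
MAX_COMPRESSION_RATIO = 100


def find_zip_problems(data: bytes) -> list[str]:
    """Inspect ZIP metadata without extracting; an empty list means no problem found."""
    problems: list[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        entries = [info for info in archive.infolist() if not info.is_dir()]

    if len(entries) > MAX_FILE_COUNT:
        problems.append(f"too many files: {len(entries)}")

    total_size = sum(info.file_size for info in entries)
    if total_size > MAX_TOTAL_UNCOMPRESSED_BYTES:
        problems.append(f"total uncompressed size too large: {total_size} bytes")

    for info in entries:
        # A very high declared ratio is a common sign of a decompression bomb.
        if info.compress_size > 0 and info.file_size / info.compress_size > MAX_COMPRESSION_RATIO:
            problems.append(f"suspicious compression ratio: {info.filename}")
    return problems
```

The sizes read here are the values declared in the archive headers, so this check is a first filter only. The extraction step should still cap the number of bytes it actually writes. A malformed archive raises `zipfile.BadZipFile`, which the caller would need to handle.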
7. State Machine Diagrams
7.1 Document Lifecycle States
                                    +-------------+
                                    |   DELETED   |
                                    +------^------+
                                           |
                                           | delete
                                           |
+----------+     upload      +----------+  |
|          | --------------> |          |--+
|  (none)  |                 | PENDING  |
|          |                 |          |
+----------+                 +----+-----+
                                  |
                 +----------------+-----------------+
                 |                                  |
                 | trigger auto-label               | create manual annotation
                 v                                  |
          +-------------+                           |
          |             |                           |
          | AUTO_LABEL- |                           |
          |     ING     |                           |
          |             |                           |
          +------+------+                           |
                 |                                  |
       +---------+---------+                        |
       |                   |                        |
       | complete          | fail                   |
       v                   v                        |
+-------------+     +-------------+                 |
|             |     |             |                 |
|   LABELED   |<----+   PENDING   +<----------------+
|             |retry|  (failed)   |
+------+------+     +-------------+
       |
       | export
       v
+-------------+
|             |
|  EXPORTED   |
|             |
+-------------+
7.2 Auto-Label Workflow States
                                    +-------------+
                                    |   MANUAL    |
                                    |  OVERRIDE   |
                                    +------^------+
                                           |
                                           | user edit
                                           |
+----------+     queue       +----------+  |    +-----------+
|          | --------------> |          |  |    |           |
|  (none)  |                 |  QUEUED  |--+--->| COMPLETED |
|          |                 |          |       |           |
+----------+                 +----+-----+       +-----^-----+
                                  |                   |
                                  | start             |
                                  v                   |
                           +-------------+            |
                           |             |            |
                           |   RUNNING   +------------+
                           |             |  success
                           +------+------+
                                  |
                                  | error
                                  v
                           +-------------+
                           |             |
                           |   FAILED    |
                           |             |
                           +------+------+
                                  |
                                  | retry
                                  v
                           +-------------+
                           |             |
                           |   QUEUED    |
                           |             |
                           +-------------+
7.3 Batch Upload States
+----------+     upload      +-------------+
|          | --------------> |             |
|  (none)  |                 | PROCESSING  |
|          |                 |             |
+----------+                 +------+------+
                                    |
                    +---------------+---------------+
                    |               |               |
                    | all success   | some fail     | all fail
                    v               v               v
             +-------------+ +-------------+ +-------------+
             |             | |             | |             |
             |  COMPLETED  | |   PARTIAL   | |   FAILED    |
             |             | |             | |             |
             +-------------+ +-------------+ +-------------+
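The three terminal batch states follow directly from the per-file outcome counts. A minimal sketch is shown below. The `BatchStatus` enum, the function name, and the choice to reject an empty batch are assumptions for illustration.

```python
from enum import Enum


class BatchStatus(str, Enum):
    PROCESSING = "processing"
    COMPLETED = "completed"
    PARTIAL = "partial"
    FAILED = "failed"


def final_batch_status(succeeded: int, failed: int) -> BatchStatus:
    """Map per-file outcome counts to the terminal batch state."""
    if succeeded < 0 or failed < 0:
        raise ValueError("counts must be non-negative")
    if succeeded + failed == 0:
        # Assumption: an empty batch is rejected rather than marked COMPLETED.
        raise ValueError("batch contains no processed files")
    if failed == 0:
        return BatchStatus.COMPLETED  # all success
    if succeeded == 0:
        return BatchStatus.FAILED     # all fail
    return BatchStatus.PARTIAL        # some fail


print(final_batch_status(succeeded=18, failed=2).value)
```

With the counts from the upload progress wireframe (18 successful, 2 failed), the batch would end in `PARTIAL`.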
7.4 Training Task States
+----------+     create      +-------------+
|          | --------------> |             |
|  (none)  |                 |   PENDING   |
|          |                 |             |
+----------+                 +------+------+
                                    |
                      +-------------+-------------+
                      |                           |
                      | immediate                 | scheduled
                      v                           v
               +-------------+             +-------------+
               |             |             |             |
               |   RUNNING   |<------------+  SCHEDULED  |
               |             |   trigger   |             |
               +------+------+             +------+------+
                      |                           |
            +---------+---------+                 | cancel
            |                   |                 v
            | success           | error    +-------------+
            v                   v          |             |
     +-------------+     +-------------+   |  CANCELLED  |
     |             |     |             |   |             |
     |  COMPLETED  |     |   FAILED    |   +-------------+
     |             |     |             |
     +-------------+     +------+------+
                                |
                                | retry
                                v
                         +-------------+
                         |             |
                         |   PENDING   |
                         |             |
                         +-------------+
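The training task diagram can be encoded as an explicit transition table, so that any move not drawn in the diagram is rejected. The sketch below takes its states and event names from the diagram. The `TrainingStatus` enum and `next_status` helper are hypothetical names, not existing project code.

```python
from enum import Enum


class TrainingStatus(Enum):
    PENDING = "pending"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


S = TrainingStatus

# (current state, event) -> next state, one entry per arrow in the diagram.
TRANSITIONS = {
    (S.PENDING, "immediate"): S.RUNNING,
    (S.PENDING, "scheduled"): S.SCHEDULED,
    (S.SCHEDULED, "trigger"): S.RUNNING,
    (S.SCHEDULED, "cancel"): S.CANCELLED,
    (S.RUNNING, "success"): S.COMPLETED,
    (S.RUNNING, "error"): S.FAILED,
    (S.FAILED, "retry"): S.PENDING,
}


def next_status(current: TrainingStatus, event: str) -> TrainingStatus:
    """Return the state reached by `event`, or raise if the diagram has no such arrow."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {current.name} on {event!r}") from None
```

Keeping the table as data means the same structure can drive both validation in the API layer and the list of actions offered in the UI.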
7.5 Annotation Lock States
                                    +-------------+
                                    |   LOCKED    |
                                    | (auto-label |
                                    |  running)   |
                                    +------^------+
                                           |
                                           | auto-label starts
                                           |
+----------+     upload      +----------+  |
|          | --------------> |          |--+
|  (none)  |                 | UNLOCKED |<---------+
|          |                 |          |          |
+----------+                 +----+-----+          |
                                  |                |
                                  | auto-label     | auto-label
                                  | starts         | completes/fails
                                  |                |
                                  v                |
                           +-------------+         |
                           |             |         |
                           |   LOCKED    +---------+
                           |  (timeout:  |
                           | 5 minutes)  |
                           +-------------+
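The lock diagram, together with the "timeout-based locks" mitigation, suggests a lock that is released either explicitly or once its five-minute timeout passes. The sketch below is an in-memory illustration under that reading. The class `AnnotationLocks` and its method names are hypothetical, and a real deployment would more likely store the expiry time on the document row in the database.

```python
import time
from typing import Callable, Dict

LOCK_TIMEOUT_SECONDS = 5 * 60  # "timeout: 5 minutes" in the diagram


class AnnotationLocks:
    """In-memory sketch of per-document annotation locks with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so tests can control time
        self._expires_at: Dict[str, float] = {}

    def lock(self, document_id: str) -> None:
        """Called when auto-labeling starts for a document."""
        self._expires_at[document_id] = self._clock() + LOCK_TIMEOUT_SECONDS

    def unlock(self, document_id: str) -> None:
        """Called when auto-labeling completes or fails."""
        self._expires_at.pop(document_id, None)

    def is_locked(self, document_id: str) -> bool:
        expires_at = self._expires_at.get(document_id)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            # Stale lock (for example, a crashed worker): treat it as released.
            del self._expires_at[document_id]
            return False
        return True
```

An annotation route would call `is_locked` before accepting a manual edit and reject the request while the lock is held. The timeout keeps a crashed auto-label job from blocking a document indefinitely.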
Summary
This comprehensive plan provides:
- PRD: 24 user stories across 6 epics with clear acceptance criteria and priorities
- CSV Specification: 13 columns with detailed validation rules and field mappings
- Database Schema: 4 new tables + modifications to 3 existing tables with full SQLModel definitions
- API Specification: 8 new endpoints + 2 modified endpoints with complete request/response schemas
- UI Wireframes: 5 detailed text-based wireframes covering all major views
- Implementation Phases: 7 phases over 8 weeks with 30+ tasks, dependencies, and risk assessments
- State Machines: 5 state diagrams covering document, auto-label, batch, training, and locking workflows
The implementation follows an incremental approach starting with database/backend changes before frontend development, minimizing risk and enabling continuous testing throughout the development cycle.