Shared Package

Shared utilities and abstractions for the Invoice Master system.

Storage Abstraction Layer

A unified storage abstraction supporting multiple backends:

  • Local filesystem - Development and testing
  • Azure Blob Storage - Azure cloud deployments
  • AWS S3 - AWS cloud deployments

Installation

# Basic installation (local storage only)
pip install -e packages/shared

# With Azure support
pip install -e "packages/shared[azure]"

# With S3 support
pip install -e "packages/shared[s3]"

# All cloud providers
pip install -e "packages/shared[all]"

Quick Start

from pathlib import Path

from shared.storage import get_storage_backend

# Option 1: From configuration file
storage = get_storage_backend("storage.yaml")

# Option 2: From environment variables
from shared.storage import create_storage_backend_from_env
storage = create_storage_backend_from_env()

# Upload a file
storage.upload(Path("local/file.pdf"), "documents/file.pdf")

# Download a file
storage.download("documents/file.pdf", Path("local/downloaded.pdf"))

# Get pre-signed URL for frontend access
url = storage.get_presigned_url("documents/file.pdf", expires_in_seconds=3600)

Configuration File Format

Create a storage.yaml file with environment variable substitution support:

# Backend selection: local, azure_blob, or s3
backend: ${STORAGE_BACKEND:-local}

# Default pre-signed URL expiry (seconds)
presigned_url_expiry: 3600

# Local storage configuration
local:
  base_path: ${STORAGE_BASE_PATH:-./data/storage}

# Azure Blob Storage configuration
azure:
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: ${AZURE_STORAGE_CONTAINER:-documents}
  create_container: false

# AWS S3 configuration
s3:
  bucket_name: ${AWS_S3_BUCKET}
  region_name: ${AWS_REGION:-us-east-1}
  access_key_id: ${AWS_ACCESS_KEY_ID}
  secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  endpoint_url: ${AWS_ENDPOINT_URL}  # Optional, for S3-compatible services
  create_bucket: false

Environment Variables

Variable                         Backend  Description
-------------------------------  -------  ----------------------------------------------
STORAGE_BACKEND                  All      Backend type: local, azure_blob, s3
STORAGE_BASE_PATH                Local    Base directory path
AZURE_STORAGE_CONNECTION_STRING  Azure    Connection string
AZURE_STORAGE_CONTAINER          Azure    Container name
AWS_S3_BUCKET                    S3       Bucket name
AWS_REGION                       S3       AWS region (default: us-east-1)
AWS_ACCESS_KEY_ID                S3       Access key (optional, uses credential chain)
AWS_SECRET_ACCESS_KEY            S3       Secret key (optional)
AWS_ENDPOINT_URL                 S3       Custom endpoint for S3-compatible services

API Reference

StorageBackend Interface

from abc import ABC
from pathlib import Path


class StorageBackend(ABC):
    def upload(self, local_path: Path, remote_path: str, overwrite: bool = False) -> str:
        """Upload a file to storage."""

    def download(self, remote_path: str, local_path: Path) -> Path:
        """Download a file from storage."""

    def exists(self, remote_path: str) -> bool:
        """Check if a file exists."""

    def list_files(self, prefix: str) -> list[str]:
        """List files with given prefix."""

    def delete(self, remote_path: str) -> bool:
        """Delete a file."""

    def get_url(self, remote_path: str) -> str:
        """Get URL for a file."""

    def get_presigned_url(self, remote_path: str, expires_in_seconds: int = 3600) -> str:
        """Generate a pre-signed URL for temporary access (1-604800 seconds)."""

    def upload_bytes(self, data: bytes, remote_path: str, overwrite: bool = False) -> str:
        """Upload bytes directly."""

    def download_bytes(self, remote_path: str) -> bytes:
        """Download file as bytes."""
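
For unit tests that should touch neither disk nor network, a small in-memory stand-in with the same method names can be handy. The sketch below is hypothetical and not part of the package: `InMemoryStorage` deliberately does not subclass the real `StorageBackend` so that it stays self-contained, it covers only the byte-oriented methods plus `exists`, `list_files`, and `delete`, and its use of the built-in `FileExistsError` and `FileNotFoundError` is an assumption (the real backends presumably raise their own `StorageError` subclasses).

```python
class InMemoryStorage:
    """Hypothetical test double mirroring part of the StorageBackend interface."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def upload_bytes(self, data: bytes, remote_path: str, overwrite: bool = False) -> str:
        # Assumption: an existing path is rejected unless overwrite=True.
        if remote_path in self._files and not overwrite:
            raise FileExistsError(remote_path)
        self._files[remote_path] = data
        # Assumption: the returned string is the remote path.
        return remote_path

    def download_bytes(self, remote_path: str) -> bytes:
        if remote_path not in self._files:
            raise FileNotFoundError(remote_path)
        return self._files[remote_path]

    def exists(self, remote_path: str) -> bool:
        return remote_path in self._files

    def list_files(self, prefix: str) -> list[str]:
        # Sorted so that results are deterministic in tests.
        return sorted(p for p in self._files if p.startswith(prefix))

    def delete(self, remote_path: str) -> bool:
        # True only if a file was actually removed.
        return self._files.pop(remote_path, None) is not None
```

Code that accepts any object with these methods can then be exercised against `InMemoryStorage()` in tests and against a real backend in production.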

Factory Functions

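from pathlib import Path

# Assumption: these names are exported from shared.storage, like the
# functions imported in Quick Start.
from shared.storage import (
    StorageConfig,
    create_storage_backend,
    create_storage_backend_from_env,
    create_storage_backend_from_file,
    get_storage_backend,
)

# Create from configuration file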
storage = create_storage_backend_from_file("storage.yaml")

# Create from environment variables
storage = create_storage_backend_from_env()

# Create from StorageConfig object
config = StorageConfig(backend_type="local", base_path=Path("./data"))
storage = create_storage_backend(config)

# Convenience function with fallback chain: config file -> env vars -> local default
storage = get_storage_backend("storage.yaml")  # or None for env-only
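
The fallback order described for `get_storage_backend` (config file, then environment variables, then a local default) can be pictured as a small decision function. This is only a sketch of that order under stated assumptions: `pick_config_source` is a hypothetical name, and treating a non-empty `STORAGE_BACKEND` as the signal that the environment is configured is a guess, not the package's documented behaviour.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def pick_config_source(
    config_path: Optional[Union[str, Path]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return which source would be used: 'file', 'env', or 'local-default'."""
    values = os.environ if env is None else env

    # 1. An existing config file wins.
    if config_path is not None and Path(config_path).is_file():
        return "file"
    # 2. Otherwise use environment variables (assumed trigger: STORAGE_BACKEND).
    if values.get("STORAGE_BACKEND"):
        return "env"
    # 3. Otherwise fall back to local storage with default settings.
    return "local-default"
```

Under these assumptions, a missing `storage.yaml` combined with `STORAGE_BACKEND=s3` resolves to `"env"`, and a bare environment resolves to `"local-default"`.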

Pre-signed URLs

Pre-signed URLs provide temporary access to files without exposing credentials:

# Generate URL valid for 1 hour (default)
url = storage.get_presigned_url("documents/invoice.pdf")

# Generate URL valid for 24 hours
url = storage.get_presigned_url("documents/invoice.pdf", expires_in_seconds=86400)

# Maximum expiry: 7 days (604800 seconds)
url = storage.get_presigned_url("documents/invoice.pdf", expires_in_seconds=604800)

Note: Local storage returns file:// URLs that don't actually expire.
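
The interface documents an accepted expiry range of 1 to 604800 seconds. What a backend does with an out-of-range value is not stated here. The following sketch assumes a hypothetical helper, `validate_expiry`, that raises `ValueError`; callers could use something similar to fail early before calling `get_presigned_url`.

```python
MAX_EXPIRY_SECONDS = 7 * 24 * 60 * 60  # 604800 seconds, i.e. 7 days


def validate_expiry(expires_in_seconds: int) -> int:
    """Return the value unchanged if it lies in 1..604800, else raise ValueError."""
    if not 1 <= expires_in_seconds <= MAX_EXPIRY_SECONDS:
        raise ValueError(
            f"expires_in_seconds must be between 1 and {MAX_EXPIRY_SECONDS}, "
            f"got {expires_in_seconds}"
        )
    return expires_in_seconds
```

A call such as `storage.get_presigned_url(path, expires_in_seconds=validate_expiry(ttl))` would then reject a zero, negative, or longer-than-7-day `ttl` before any request reaches the backend.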

Error Handling

from pathlib import Path

from shared.storage import (
    StorageError,
    FileNotFoundStorageError,
    PresignedUrlNotSupportedError,
)

try:
    storage.download("nonexistent.pdf", Path("local.pdf"))
except FileNotFoundStorageError as e:
    print(f"File not found: {e}")
except StorageError as e:
    print(f"Storage error: {e}")
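
The order of the `except` clauses above suggests that `FileNotFoundStorageError` is a subclass of `StorageError`, so the more specific handler has to come first. A common pattern built on that is to treat a missing file as a normal outcome and let every other storage failure propagate. The sketch below illustrates it with stand-in exception classes so that it runs on its own; the subclass relationship is an assumption, `download_bytes_or_none` is a hypothetical helper, and in real code the exceptions would be imported from `shared.storage` instead.

```python
from typing import Optional


class StorageError(Exception):
    """Stand-in for shared.storage.StorageError so the sketch runs on its own."""


class FileNotFoundStorageError(StorageError):
    """Stand-in; assumed (not confirmed here) to subclass StorageError."""


def download_bytes_or_none(storage, remote_path: str) -> Optional[bytes]:
    """Return the file contents, or None if the file does not exist.

    Any other StorageError is not caught and reaches the caller.
    """
    try:
        return storage.download_bytes(remote_path)
    except FileNotFoundStorageError:
        return None
```

Here `storage` is any object exposing `download_bytes`, which keeps the helper independent of the chosen backend.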

Testing with MinIO (S3-compatible)

# Start MinIO locally
docker run -p 9000:9000 -p 9001:9001 minio/minio server /data --console-address ":9001"

# Configure environment
export STORAGE_BACKEND=s3
export AWS_S3_BUCKET=test-bucket
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

Module Structure

shared/storage/
├── __init__.py       # Public exports
├── base.py           # Abstract interface and exceptions
├── local.py          # Local filesystem backend
├── azure.py          # Azure Blob Storage backend
├── s3.py             # AWS S3 backend
├── config_loader.py  # YAML configuration loader
└── factory.py        # Backend factory functions