Add claude config
This commit is contained in:
22
.env.example
Normal file
22
.env.example
Normal file
@@ -0,0 +1,22 @@
|
|||||||
|
# Database Configuration
|
||||||
|
# Copy this file to .env and fill in your actual values
|
||||||
|
|
||||||
|
# PostgreSQL Database
|
||||||
|
DB_HOST=192.168.68.31
|
||||||
|
DB_PORT=5432
|
||||||
|
DB_NAME=docmaster
|
||||||
|
DB_USER=docmaster
|
||||||
|
DB_PASSWORD=your_password_here
|
||||||
|
|
||||||
|
# Model Configuration (optional)
|
||||||
|
# MODEL_PATH=runs/train/invoice_fields/weights/best.pt
|
||||||
|
# CONFIDENCE_THRESHOLD=0.5
|
||||||
|
|
||||||
|
# Server Configuration (optional)
|
||||||
|
# SERVER_HOST=0.0.0.0
|
||||||
|
# SERVER_PORT=8000
|
||||||
|
|
||||||
|
# Auto-labeling Configuration (optional)
|
||||||
|
# AUTOLABEL_WORKERS=2
|
||||||
|
# AUTOLABEL_DPI=150
|
||||||
|
# AUTOLABEL_MIN_CONFIDENCE=0.5
|
||||||
317
CHANGELOG.md
Normal file
317
CHANGELOG.md
Normal file
@@ -0,0 +1,317 @@
|
|||||||
|
# Changelog
|
||||||
|
|
||||||
|
All notable changes to the Invoice Field Extraction project will be documented in this file.
|
||||||
|
|
||||||
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||||
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||||
|
|
||||||
|
## [Unreleased]
|
||||||
|
|
||||||
|
### Added - Phase 1: Security & Infrastructure (2026-01-22)
|
||||||
|
|
||||||
|
#### Security Enhancements
|
||||||
|
- **Environment Variable Management**: Added `python-dotenv` for secure configuration management
|
||||||
|
- Created `.env.example` template file for configuration reference
|
||||||
|
- Created `.env` file for actual credentials (gitignored)
|
||||||
|
- Updated `config.py` to load database password from environment variables
|
||||||
|
- Added validation to ensure `DB_PASSWORD` is set at startup
|
||||||
|
- Files modified: `config.py`, `requirements.txt`
|
||||||
|
- New files: `.env`, `.env.example`
|
||||||
|
- Tests: `tests/test_config.py` (7 tests, all passing)
|
||||||
|
|
||||||
|
- **SQL Injection Prevention**: Fixed SQL injection vulnerabilities in database queries
|
||||||
|
- Replaced f-string formatting with parameterized queries in `LIMIT` clauses
|
||||||
|
- Updated `get_all_documents_summary()` to use `%s` placeholder for LIMIT parameter
|
||||||
|
- Updated `get_failed_matches()` to use `%s` placeholder for LIMIT parameter
|
||||||
|
- Files modified: `src/data/db.py` (lines 246, 298)
|
||||||
|
- Tests: `tests/test_db_security.py` (9 tests, all passing)
|
||||||
|
|
||||||
|
#### Code Quality
|
||||||
|
- **Exception Hierarchy**: Created comprehensive custom exception system
|
||||||
|
- Added base class `InvoiceExtractionError` with message and details support
|
||||||
|
- Added specific exception types:
|
||||||
|
- `PDFProcessingError` - PDF rendering/conversion errors
|
||||||
|
- `OCRError` - OCR processing errors
|
||||||
|
- `ModelInferenceError` - YOLO model errors
|
||||||
|
- `FieldValidationError` - Field validation errors (with field-specific attributes)
|
||||||
|
- `DatabaseError` - Database operation errors
|
||||||
|
- `ConfigurationError` - Configuration errors
|
||||||
|
- `PaymentLineParseError` - Payment line parsing errors
|
||||||
|
- `CustomerNumberParseError` - Customer number parsing errors
|
||||||
|
- `DataLoadError` - Data loading errors
|
||||||
|
- `AnnotationError` - Annotation generation errors
|
||||||
|
- New file: `src/exceptions.py`
|
||||||
|
- Tests: `tests/test_exceptions.py` (16 tests, all passing)
|
||||||
|
|
||||||
|
### Testing
|
||||||
|
- Added 32 new tests across 3 test files
|
||||||
|
- Configuration tests: 7 tests
|
||||||
|
- SQL injection prevention tests: 9 tests
|
||||||
|
- Exception hierarchy tests: 16 tests
|
||||||
|
- All tests passing (32/32)
|
||||||
|
|
||||||
|
### Documentation
|
||||||
|
- Created `docs/CODE_REVIEW_REPORT.md` - Comprehensive code quality analysis (550+ lines)
|
||||||
|
- Created `docs/REFACTORING_PLAN.md` - Detailed 3-phase refactoring plan (600+ lines)
|
||||||
|
- Created `CHANGELOG.md` - Project changelog (this file)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- **Configuration Loading**: Database configuration now loads from environment variables instead of hardcoded values
|
||||||
|
- Breaking change: Requires `.env` file with `DB_PASSWORD` set
|
||||||
|
- Migration: Copy `.env.example` to `.env` and set your database password
|
||||||
|
|
||||||
|
### Security
|
||||||
|
- **Fixed**: Database password no longer stored in plain text in `config.py`
|
||||||
|
- **Fixed**: SQL injection vulnerabilities in LIMIT clauses (2 instances)
|
||||||
|
|
||||||
|
### Technical Debt Addressed
|
||||||
|
- Eliminated security vulnerability: plaintext password storage
|
||||||
|
- Reduced SQL injection attack surface
|
||||||
|
- Improved error handling granularity with custom exceptions
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Added - Phase 2: Parser Refactoring (2026-01-22)
|
||||||
|
|
||||||
|
#### Unified Parser Modules
|
||||||
|
- **Payment Line Parser**: Created dedicated payment line parsing module
|
||||||
|
- Handles Swedish payment line format: `# <OCR> # <Kronor> <Öre> <Type> > <Account>#<Check>#`
|
||||||
|
- Tolerates common OCR errors: spaces in numbers, missing symbols, spaces in check digits
|
||||||
|
- Supports 4 parsing patterns: full format, no amount, alternative, account-only
|
||||||
|
- Returns structured `PaymentLineData` with parsed fields
|
||||||
|
- New file: `src/inference/payment_line_parser.py` (90 lines, 92% coverage)
|
||||||
|
- Tests: `tests/test_payment_line_parser.py` (23 tests, all passing)
|
||||||
|
- Eliminates 1st code duplication (payment line parsing logic)
|
||||||
|
|
||||||
|
- **Customer Number Parser**: Created dedicated customer number parsing module
|
||||||
|
- Handles Swedish customer number formats: `JTY 576-3`, `DWQ 211-X`, `FFL 019N`, etc.
|
||||||
|
- Uses Strategy Pattern with 5 pattern classes:
|
||||||
|
- `LabeledPattern` - Explicit labels (highest priority, 0.98 confidence)
|
||||||
|
- `DashFormatPattern` - Standard format with dash (0.95 confidence)
|
||||||
|
- `NoDashFormatPattern` - Format without dash, adds dash automatically (0.90 confidence)
|
||||||
|
- `CompactFormatPattern` - Compact format without spaces (0.75 confidence)
|
||||||
|
- `GenericAlphanumericPattern` - Fallback generic pattern (variable confidence)
|
||||||
|
- Excludes Swedish postal codes (`SE XXX XX` format)
|
||||||
|
- Returns highest confidence match
|
||||||
|
- New file: `src/inference/customer_number_parser.py` (154 lines, 92% coverage)
|
||||||
|
- Tests: `tests/test_customer_number_parser.py` (32 tests, all passing)
|
||||||
|
- Reduces `_normalize_customer_number` complexity (127 lines → will use 5-10 lines after integration)
|
||||||
|
|
||||||
|
### Testing Summary
|
||||||
|
|
||||||
|
**Phase 1 Tests** (32 tests):
|
||||||
|
- Configuration tests: 7 tests ([test_config.py](tests/test_config.py))
|
||||||
|
- SQL injection prevention tests: 9 tests ([test_db_security.py](tests/test_db_security.py))
|
||||||
|
- Exception hierarchy tests: 16 tests ([test_exceptions.py](tests/test_exceptions.py))
|
||||||
|
|
||||||
|
**Phase 2 Tests** (121 tests):
|
||||||
|
- Payment line parser tests: 23 tests ([test_payment_line_parser.py](tests/test_payment_line_parser.py))
|
||||||
|
- Standard parsing, OCR error handling, real-world examples, edge cases
|
||||||
|
- Coverage: 92%
|
||||||
|
- Customer number parser tests: 32 tests ([test_customer_number_parser.py](tests/test_customer_number_parser.py))
|
||||||
|
- Pattern matching (DashFormat, NoDashFormat, Compact, Labeled)
|
||||||
|
- Real-world examples, edge cases, Swedish postal code exclusion
|
||||||
|
- Coverage: 92%
|
||||||
|
- Field extractor integration tests: 45 tests ([test_field_extractor.py](src/inference/test_field_extractor.py))
|
||||||
|
- Validates backward compatibility with existing code
|
||||||
|
- Tests for invoice numbers, bankgiro, plusgiro, amounts, OCR, dates, payment lines, customer numbers
|
||||||
|
- Pipeline integration tests: 21 tests ([test_pipeline.py](src/inference/test_pipeline.py))
|
||||||
|
- Cross-validation, payment line parsing, field overrides
|
||||||
|
|
||||||
|
**Total**: 153 tests, 100% passing, 4.50s runtime
|
||||||
|
|
||||||
|
### Code Quality
|
||||||
|
- **Eliminated Code Duplication**: Payment line parsing previously in 3 places, now unified in 1 module
|
||||||
|
- **Improved Maintainability**: Strategy Pattern makes customer number patterns easy to extend
|
||||||
|
- **Better Test Coverage**: New parsers have 92% coverage vs original 10% in field_extractor.py
|
||||||
|
|
||||||
|
#### Parser Integration into field_extractor.py (2026-01-22)
|
||||||
|
|
||||||
|
- **field_extractor.py Integration**: Successfully integrated new parsers
|
||||||
|
- Added `PaymentLineParser` and `CustomerNumberParser` instances (lines 99-101)
|
||||||
|
- Replaced `_normalize_payment_line` method: 74 lines → 3 lines (lines 640-657)
|
||||||
|
- Replaced `_normalize_customer_number` method: 127 lines → 3 lines (lines 697-707)
|
||||||
|
- All 45 existing tests pass (100% backward compatibility maintained)
|
||||||
|
- Tests run time: 4.21 seconds
|
||||||
|
- File: `src/inference/field_extractor.py`
|
||||||
|
|
||||||
|
#### Parser Integration into pipeline.py (2026-01-22)
|
||||||
|
|
||||||
|
- **pipeline.py Integration**: Successfully integrated PaymentLineParser
|
||||||
|
- Added `PaymentLineParser` import (line 15)
|
||||||
|
- Added `payment_line_parser` instance initialization (line 128)
|
||||||
|
- Replaced `_parse_machine_readable_payment_line` method: 36 lines → 6 lines (lines 219-233)
|
||||||
|
- All 21 existing tests pass (100% backward compatibility maintained)
|
||||||
|
- Tests run time: 4.00 seconds
|
||||||
|
- File: `src/inference/pipeline.py`
|
||||||
|
|
||||||
|
### Phase 2 Status: **COMPLETED** ✅
|
||||||
|
|
||||||
|
- [x] Create unified `payment_line_parser` module ✅
|
||||||
|
- [x] Create unified `customer_number_parser` module ✅
|
||||||
|
- [x] Refactor `field_extractor.py` to use new parsers ✅
|
||||||
|
- [x] Refactor `pipeline.py` to use new parsers ✅
|
||||||
|
- [x] Comprehensive test suite (153 tests, 100% passing) ✅
|
||||||
|
|
||||||
|
### Achieved Impact
|
||||||
|
- Eliminate code duplication: 3 implementations → 1 ✅ (payment_line unified across field_extractor.py, pipeline.py, tests)
|
||||||
|
- Reduce `_normalize_payment_line` complexity in field_extractor.py: 74 lines → 3 lines ✅
|
||||||
|
- Reduce `_normalize_customer_number` complexity in field_extractor.py: 127 lines → 3 lines ✅
|
||||||
|
- Reduce `_parse_machine_readable_payment_line` complexity in pipeline.py: 36 lines → 6 lines ✅
|
||||||
|
- Total lines of code eliminated: 201 lines reduced to 12 lines (94% reduction) ✅
|
||||||
|
- Improve test coverage: New parser modules have 92% coverage (vs original 10% in field_extractor.py)
|
||||||
|
- Simplify maintenance: Pattern-based approach makes extension easy
|
||||||
|
- 100% backward compatibility: All 66 existing tests pass (45 field_extractor + 21 pipeline)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3: Performance & Documentation (2026-01-22)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
|
||||||
|
#### Configuration Constants Extraction
|
||||||
|
- **Created `src/inference/constants.py`**: Centralized configuration constants
|
||||||
|
- Detection & model configuration (confidence thresholds, IOU)
|
||||||
|
- Image processing configuration (DPI, scaling factors)
|
||||||
|
- Customer number parser confidence scores
|
||||||
|
- Field extraction confidence multipliers
|
||||||
|
- Account type detection thresholds
|
||||||
|
- Pattern matching constants
|
||||||
|
- 90 lines of well-documented constants with usage notes
|
||||||
|
- Eliminates ~15 hardcoded magic numbers across codebase
|
||||||
|
- File: [src/inference/constants.py](src/inference/constants.py)
|
||||||
|
|
||||||
|
#### Performance Optimization Documentation
|
||||||
|
- **Created `docs/PERFORMANCE_OPTIMIZATION.md`**: Comprehensive performance guide (400+ lines)
|
||||||
|
- **Batch Processing Optimization**: Parallel processing strategies, already-implemented dual pool system
|
||||||
|
- **Database Query Optimization**: Connection pooling recommendations, index strategies
|
||||||
|
- **Caching Strategies**: Model loading cache, parser reuse (already optimal), OCR result caching
|
||||||
|
- **Memory Management**: Explicit cleanup, generator patterns, context managers
|
||||||
|
- **Profiling Guidelines**: cProfile, memory_profiler, py-spy recommendations
|
||||||
|
- **Benchmarking Scripts**: Ready-to-use performance measurement code
|
||||||
|
- **Priority Roadmap**: High/Medium/Low priority optimizations with effort estimates
|
||||||
|
- Expected impact: 2-5x throughput improvement for batch processing
|
||||||
|
- File: [docs/PERFORMANCE_OPTIMIZATION.md](docs/PERFORMANCE_OPTIMIZATION.md)
|
||||||
|
|
||||||
|
### Phase 3 Status: **COMPLETED** ✅
|
||||||
|
|
||||||
|
- [x] Configuration constants extraction ✅
|
||||||
|
- [x] Performance optimization analysis ✅
|
||||||
|
- [x] Batch processing optimization recommendations ✅
|
||||||
|
- [x] Database optimization strategies ✅
|
||||||
|
- [x] Caching and memory management guidelines ✅
|
||||||
|
- [x] Profiling and benchmarking documentation ✅
|
||||||
|
|
||||||
|
### Deliverables
|
||||||
|
|
||||||
|
**New Files** (2 files):
|
||||||
|
1. `src/inference/constants.py` (90 lines) - Centralized configuration constants
|
||||||
|
2. `docs/PERFORMANCE_OPTIMIZATION.md` (400+ lines) - Performance optimization guide
|
||||||
|
|
||||||
|
**Impact**:
|
||||||
|
- Eliminates 15+ hardcoded magic numbers
|
||||||
|
- Provides clear optimization roadmap
|
||||||
|
- Documents existing performance features
|
||||||
|
- Identifies quick wins (connection pooling, indexes)
|
||||||
|
- Long-term strategy (caching, profiling)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
### Breaking Changes
|
||||||
|
- **v2.x**: Requires `.env` file with database credentials
|
||||||
|
- Action required: Create `.env` file based on `.env.example`
|
||||||
|
- Affected: All deployments, CI/CD pipelines
|
||||||
|
|
||||||
|
### Migration Guide
|
||||||
|
|
||||||
|
#### From v1.x to v2.x (Environment Variables)
|
||||||
|
1. Copy `.env.example` to `.env`:
|
||||||
|
```bash
|
||||||
|
cp .env.example .env
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Edit `.env` and set your database password:
|
||||||
|
```
|
||||||
|
DB_PASSWORD=your_actual_password_here
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Install new dependency:
|
||||||
|
```bash
|
||||||
|
pip install python-dotenv
|
||||||
|
```
|
||||||
|
|
||||||
|
4. Verify configuration loads correctly:
|
||||||
|
```bash
|
||||||
|
python -c "import config; print('Config loaded successfully')"
|
||||||
|
```
|
||||||
|
|
||||||
|
## Summary of All Work Completed
|
||||||
|
|
||||||
|
### Files Created (13 new files)
|
||||||
|
|
||||||
|
**Phase 1** (3 files):
|
||||||
|
1. `.env` - Environment variables for database credentials
|
||||||
|
2. `.env.example` - Template for environment configuration
|
||||||
|
3. `src/exceptions.py` - Custom exception hierarchy (35 lines, 66% coverage)
|
||||||
|
|
||||||
|
**Phase 2** (7 files):
|
||||||
|
4. `src/inference/payment_line_parser.py` - Unified payment line parsing (90 lines, 92% coverage)
|
||||||
|
5. `src/inference/customer_number_parser.py` - Unified customer number parsing (154 lines, 92% coverage)
|
||||||
|
6. `tests/test_config.py` - Configuration tests (7 tests)
|
||||||
|
7. `tests/test_db_security.py` - SQL injection prevention tests (9 tests)
|
||||||
|
8. `tests/test_exceptions.py` - Exception hierarchy tests (16 tests)
|
||||||
|
9. `tests/test_payment_line_parser.py` - Payment line parser tests (23 tests)
|
||||||
|
10. `tests/test_customer_number_parser.py` - Customer number parser tests (32 tests)
|
||||||
|
|
||||||
|
**Phase 3** (2 files):
|
||||||
|
11. `src/inference/constants.py` - Centralized configuration constants (90 lines)
|
||||||
|
12. `docs/PERFORMANCE_OPTIMIZATION.md` - Performance optimization guide (400+ lines)
|
||||||
|
|
||||||
|
**Documentation** (1 file):
|
||||||
|
13. `CHANGELOG.md` - This file (260+ lines of detailed documentation)
|
||||||
|
|
||||||
|
### Files Modified (4 files)
|
||||||
|
1. `config.py` - Added environment variable loading with python-dotenv
|
||||||
|
2. `src/data/db.py` - Fixed 2 SQL injection vulnerabilities (lines 246, 298)
|
||||||
|
3. `src/inference/field_extractor.py` - Integrated new parsers (reduced 201 lines to 6 lines)
|
||||||
|
4. `src/inference/pipeline.py` - Integrated PaymentLineParser (reduced 36 lines to 6 lines)
|
||||||
|
5. `requirements.txt` - Added python-dotenv dependency
|
||||||
|
|
||||||
|
### Test Summary
|
||||||
|
- **Total tests**: 153 tests across 7 test files
|
||||||
|
- **Passing**: 153 (100%)
|
||||||
|
- **Failing**: 0
|
||||||
|
- **Runtime**: 4.50 seconds
|
||||||
|
- **Coverage**:
|
||||||
|
- New parser modules: 92%
|
||||||
|
- Config module: 100%
|
||||||
|
- Exception module: 66%
|
||||||
|
- DB security coverage: 18% (focused on parameterized queries)
|
||||||
|
|
||||||
|
### Code Metrics
|
||||||
|
- **Lines eliminated**: 237 lines of duplicated/complex code → 18 lines (92% reduction)
|
||||||
|
- field_extractor.py: 201 lines → 6 lines
|
||||||
|
- pipeline.py: 36 lines → 6 lines
|
||||||
|
- **New code added**: 279 lines of well-tested parser code
|
||||||
|
- **Net impact**: Replaced 237 lines of duplicate code with 279 lines of unified, tested code (+42 lines, but -3 implementations)
|
||||||
|
- **Test coverage improvement**: 0% → 92% for parser logic
|
||||||
|
|
||||||
|
### Performance Impact
|
||||||
|
- Configuration loading: Negligible (<1ms overhead for .env parsing)
|
||||||
|
- SQL queries: No performance change (parameterized queries are standard practice)
|
||||||
|
- Parser refactoring: No performance degradation (logic simplified, not changed)
|
||||||
|
- Exception handling: Minimal overhead (only when exceptions are raised)
|
||||||
|
|
||||||
|
### Security Improvements
|
||||||
|
- ✅ Eliminated plaintext password storage
|
||||||
|
- ✅ Fixed 2 SQL injection vulnerabilities
|
||||||
|
- ✅ Added input validation in database layer
|
||||||
|
|
||||||
|
### Maintainability Improvements
|
||||||
|
- ✅ Eliminated code duplication (3 implementations → 1)
|
||||||
|
- ✅ Strategy Pattern enables easy extension of customer number formats
|
||||||
|
- ✅ Comprehensive test suite (153 tests) ensures safe refactoring
|
||||||
|
- ✅ 100% backward compatibility maintained
|
||||||
|
- ✅ Custom exception hierarchy for granular error handling
|
||||||
364
README.md
364
README.md
@@ -54,8 +54,12 @@
|
|||||||
- **数据库存储**: 标注结果存储在 PostgreSQL,支持增量处理和断点续传
|
- **数据库存储**: 标注结果存储在 PostgreSQL,支持增量处理和断点续传
|
||||||
- **YOLO 检测**: 使用 YOLOv11 检测发票字段区域
|
- **YOLO 检测**: 使用 YOLOv11 检测发票字段区域
|
||||||
- **OCR 识别**: 使用 PaddleOCR v5 提取检测区域的文本
|
- **OCR 识别**: 使用 PaddleOCR v5 提取检测区域的文本
|
||||||
|
- **统一解析器**: payment_line 和 customer_number 采用独立解析器模块
|
||||||
|
- **交叉验证**: payment_line 数据与单独检测字段交叉验证,优先采用 payment_line 值
|
||||||
|
- **文档类型识别**: 自动区分 invoice (有 payment_line) 和 letter (无 payment_line)
|
||||||
- **Web 应用**: 提供 REST API 和可视化界面
|
- **Web 应用**: 提供 REST API 和可视化界面
|
||||||
- **增量训练**: 支持在已训练模型基础上继续训练
|
- **增量训练**: 支持在已训练模型基础上继续训练
|
||||||
|
- **内存优化**: 支持低内存模式训练 (--low-memory)
|
||||||
|
|
||||||
## 支持的字段
|
## 支持的字段
|
||||||
|
|
||||||
@@ -69,6 +73,8 @@
|
|||||||
| 5 | plusgiro | Plusgiro 号码 |
|
| 5 | plusgiro | Plusgiro 号码 |
|
||||||
| 6 | amount | 金额 |
|
| 6 | amount | 金额 |
|
||||||
| 7 | supplier_organisation_number | 供应商组织号 |
|
| 7 | supplier_organisation_number | 供应商组织号 |
|
||||||
|
| 8 | payment_line | 支付行 (机器可读格式) |
|
||||||
|
| 9 | customer_number | 客户编号 |
|
||||||
|
|
||||||
## 安装
|
## 安装
|
||||||
|
|
||||||
@@ -132,8 +138,24 @@ python -m src.cli.train \
|
|||||||
--model yolo11n.pt \
|
--model yolo11n.pt \
|
||||||
--epochs 100 \
|
--epochs 100 \
|
||||||
--batch 16 \
|
--batch 16 \
|
||||||
--name invoice_yolo11n_full \
|
--name invoice_fields \
|
||||||
--dpi 150
|
--dpi 150
|
||||||
|
|
||||||
|
# 低内存模式 (适用于内存不足场景)
|
||||||
|
python -m src.cli.train \
|
||||||
|
--model yolo11n.pt \
|
||||||
|
--epochs 100 \
|
||||||
|
--name invoice_fields \
|
||||||
|
--low-memory \
|
||||||
|
--workers 4 \
|
||||||
|
--no-cache
|
||||||
|
|
||||||
|
# 从检查点恢复训练 (训练中断后)
|
||||||
|
python -m src.cli.train \
|
||||||
|
--model runs/train/invoice_fields/weights/last.pt \
|
||||||
|
--epochs 100 \
|
||||||
|
--name invoice_fields \
|
||||||
|
--resume
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. 增量训练
|
### 4. 增量训练
|
||||||
@@ -164,26 +186,46 @@ python -m src.cli.train \
|
|||||||
```bash
|
```bash
|
||||||
# 命令行推理
|
# 命令行推理
|
||||||
python -m src.cli.infer \
|
python -m src.cli.infer \
|
||||||
--model runs/train/invoice_yolo11n_full/weights/best.pt \
|
--model runs/train/invoice_fields/weights/best.pt \
|
||||||
--input path/to/invoice.pdf \
|
--input path/to/invoice.pdf \
|
||||||
--output result.json \
|
--output result.json \
|
||||||
--gpu
|
--gpu
|
||||||
|
|
||||||
|
# 批量推理
|
||||||
|
python -m src.cli.infer \
|
||||||
|
--model runs/train/invoice_fields/weights/best.pt \
|
||||||
|
--input invoices/*.pdf \
|
||||||
|
--output results/ \
|
||||||
|
--gpu
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**推理结果包含**:
|
||||||
|
- `fields`: 提取的字段值 (InvoiceNumber, Amount, payment_line, customer_number 等)
|
||||||
|
- `confidence`: 各字段的置信度
|
||||||
|
- `document_type`: 文档类型 ("invoice" 或 "letter")
|
||||||
|
- `cross_validation`: payment_line 交叉验证结果 (如果有)
|
||||||
|
|
||||||
### 6. Web 应用
|
### 6. Web 应用
|
||||||
|
|
||||||
|
**在 WSL 环境中启动**:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# 启动 Web 服务器
|
# 方法 1: 从 Windows PowerShell 启动 (推荐)
|
||||||
|
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && python run_server.py --port 8000"
|
||||||
|
|
||||||
|
# 方法 2: 在 WSL 内启动
|
||||||
|
conda activate invoice-py311
|
||||||
|
cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2
|
||||||
python run_server.py --port 8000
|
python run_server.py --port 8000
|
||||||
|
|
||||||
# 开发模式 (自动重载)
|
# 方法 3: 使用启动脚本
|
||||||
python run_server.py --debug --reload
|
./start_web.sh
|
||||||
|
|
||||||
# 禁用 GPU
|
|
||||||
python run_server.py --no-gpu
|
|
||||||
```
|
```
|
||||||
|
|
||||||
访问 **http://localhost:8000** 使用 Web 界面。
|
**服务启动后**:
|
||||||
|
- 访问 **http://localhost:8000** 使用 Web 界面
|
||||||
|
- 服务会自动加载模型 `runs/train/invoice_fields/weights/best.pt`
|
||||||
|
- GPU 默认启用,置信度阈值 0.5
|
||||||
|
|
||||||
#### Web API 端点
|
#### Web API 端点
|
||||||
|
|
||||||
@@ -194,6 +236,33 @@ python run_server.py --no-gpu
|
|||||||
| POST | `/api/v1/infer` | 上传文件并推理 |
|
| POST | `/api/v1/infer` | 上传文件并推理 |
|
||||||
| GET | `/api/v1/results/{filename}` | 获取可视化图片 |
|
| GET | `/api/v1/results/{filename}` | 获取可视化图片 |
|
||||||
|
|
||||||
|
#### API 响应格式
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"status": "success",
|
||||||
|
"result": {
|
||||||
|
"document_id": "abc123",
|
||||||
|
"document_type": "invoice",
|
||||||
|
"fields": {
|
||||||
|
"InvoiceNumber": "12345",
|
||||||
|
"Amount": "1234.56",
|
||||||
|
"payment_line": "# 94228110015950070 # > 48666036#14#",
|
||||||
|
"customer_number": "UMJ 436-R"
|
||||||
|
},
|
||||||
|
"confidence": {
|
||||||
|
"InvoiceNumber": 0.95,
|
||||||
|
"Amount": 0.92
|
||||||
|
},
|
||||||
|
"cross_validation": {
|
||||||
|
"is_valid": true,
|
||||||
|
"ocr_match": true,
|
||||||
|
"amount_match": true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
## 训练配置
|
## 训练配置
|
||||||
|
|
||||||
### YOLO 训练参数
|
### YOLO 训练参数
|
||||||
@@ -210,6 +279,10 @@ Options:
|
|||||||
--name 训练名称
|
--name 训练名称
|
||||||
--limit 限制文档数 (用于测试)
|
--limit 限制文档数 (用于测试)
|
||||||
--device 设备 (0=GPU, cpu)
|
--device 设备 (0=GPU, cpu)
|
||||||
|
--resume 从检查点恢复训练
|
||||||
|
--low-memory 启用低内存模式 (batch=8, workers=4, no-cache)
|
||||||
|
--workers 数据加载 worker 数 (默认: 8)
|
||||||
|
--cache 缓存图像到内存
|
||||||
```
|
```
|
||||||
|
|
||||||
### 训练最佳实践
|
### 训练最佳实践
|
||||||
@@ -236,14 +309,28 @@ Options:
|
|||||||
|
|
||||||
### 训练结果示例
|
### 训练结果示例
|
||||||
|
|
||||||
使用约 10,000 张训练图片,100 epochs 后的结果:
|
**最新训练结果** (100 epochs, 2026-01-22):
|
||||||
|
|
||||||
| 指标 | 值 |
|
| 指标 | 值 |
|
||||||
|------|-----|
|
|------|-----|
|
||||||
| **mAP@0.5** | 98.7% |
|
| **mAP@0.5** | 93.5% |
|
||||||
| **mAP@0.5-0.95** | 87.4% |
|
| **mAP@0.5-0.95** | 83.0% |
|
||||||
| **Precision** | 97.5% |
|
| **训练集** | ~10,000 张标注图片 |
|
||||||
| **Recall** | 95.5% |
|
| **字段类型** | 10 个字段 (新增 payment_line, customer_number) |
|
||||||
|
| **模型位置** | `runs/train/invoice_fields/weights/best.pt` |
|
||||||
|
|
||||||
|
**各字段检测性能**:
|
||||||
|
- 发票基础信息 (InvoiceNumber, InvoiceDate, InvoiceDueDate): >95% mAP
|
||||||
|
- 支付信息 (OCR, Bankgiro, Plusgiro, Amount): >90% mAP
|
||||||
|
- 组织信息 (supplier_org_number, customer_number): >85% mAP
|
||||||
|
- 支付行 (payment_line): >80% mAP
|
||||||
|
|
||||||
|
**模型文件**:
|
||||||
|
```
|
||||||
|
runs/train/invoice_fields/weights/
|
||||||
|
├── best.pt # 最佳模型 (mAP@0.5 最高) ⭐ 推荐用于生产
|
||||||
|
└── last.pt # 最后检查点 (用于继续训练)
|
||||||
|
```
|
||||||
|
|
||||||
> 注:目前仍在持续标注更多数据,预计最终将有 25,000+ 张标注图片用于训练。
|
> 注:目前仍在持续标注更多数据,预计最终将有 25,000+ 张标注图片用于训练。
|
||||||
|
|
||||||
@@ -262,15 +349,18 @@ invoice-master-poc-v2/
|
|||||||
│ │ ├── renderer.py # 图像渲染
|
│ │ ├── renderer.py # 图像渲染
|
||||||
│ │ └── detector.py # 类型检测
|
│ │ └── detector.py # 类型检测
|
||||||
│ ├── ocr/ # PaddleOCR 封装
|
│ ├── ocr/ # PaddleOCR 封装
|
||||||
|
│ │ └── machine_code_parser.py # 机器可读付款行解析器
|
||||||
│ ├── normalize/ # 字段规范化
|
│ ├── normalize/ # 字段规范化
|
||||||
│ ├── matcher/ # 字段匹配
|
│ ├── matcher/ # 字段匹配
|
||||||
│ ├── yolo/ # YOLO 相关
|
│ ├── yolo/ # YOLO 相关
|
||||||
│ │ ├── annotation_generator.py
|
│ │ ├── annotation_generator.py
|
||||||
│ │ └── db_dataset.py
|
│ │ └── db_dataset.py
|
||||||
│ ├── inference/ # 推理管道
|
│ ├── inference/ # 推理管道
|
||||||
│ │ ├── pipeline.py
|
│ │ ├── pipeline.py # 主推理流程
|
||||||
│ │ ├── yolo_detector.py
|
│ │ ├── yolo_detector.py # YOLO 检测
|
||||||
│ │ └── field_extractor.py
|
│ │ ├── field_extractor.py # 字段提取
|
||||||
|
│ │ ├── payment_line_parser.py # 支付行解析器
|
||||||
|
│ │ └── customer_number_parser.py # 客户编号解析器
|
||||||
│ ├── processing/ # 多池处理架构
|
│ ├── processing/ # 多池处理架构
|
||||||
│ │ ├── worker_pool.py
|
│ │ ├── worker_pool.py
|
||||||
│ │ ├── cpu_pool.py
|
│ │ ├── cpu_pool.py
|
||||||
@@ -278,20 +368,33 @@ invoice-master-poc-v2/
|
|||||||
│ │ ├── task_dispatcher.py
|
│ │ ├── task_dispatcher.py
|
||||||
│ │ └── dual_pool_coordinator.py
|
│ │ └── dual_pool_coordinator.py
|
||||||
│ ├── web/ # Web 应用
|
│ ├── web/ # Web 应用
|
||||||
│ │ ├── app.py # FastAPI 应用
|
│ │ ├── app.py # FastAPI 应用入口
|
||||||
│ │ ├── routes.py # API 路由
|
│ │ ├── routes.py # API 路由
|
||||||
│ │ ├── services.py # 业务逻辑
|
│ │ ├── services.py # 业务逻辑
|
||||||
│ │ ├── schemas.py # 数据模型
|
│ │ └── schemas.py # 数据模型
|
||||||
│ │ └── config.py # 配置
|
│ ├── utils/ # 工具模块
|
||||||
|
│ │ ├── text_cleaner.py # 文本清理
|
||||||
|
│ │ ├── validators.py # 字段验证
|
||||||
|
│ │ ├── fuzzy_matcher.py # 模糊匹配
|
||||||
|
│ │ └── ocr_corrections.py # OCR 错误修正
|
||||||
│ └── data/ # 数据处理
|
│ └── data/ # 数据处理
|
||||||
|
├── tests/ # 测试文件
|
||||||
|
│ ├── ocr/ # OCR 模块测试
|
||||||
|
│ │ └── test_machine_code_parser.py
|
||||||
|
│ ├── inference/ # 推理模块测试
|
||||||
|
│ ├── normalize/ # 规范化模块测试
|
||||||
|
│ └── utils/ # 工具模块测试
|
||||||
|
├── docs/ # 文档
|
||||||
|
│ ├── REFACTORING_SUMMARY.md
|
||||||
|
│ └── TEST_COVERAGE_IMPROVEMENT.md
|
||||||
├── config.py # 配置文件
|
├── config.py # 配置文件
|
||||||
├── run_server.py # Web 服务器启动脚本
|
├── run_server.py # Web 服务器启动脚本
|
||||||
├── runs/ # 训练输出
|
├── runs/ # 训练输出
|
||||||
│ └── train/
|
│ └── train/
|
||||||
│ └── invoice_yolo11n_full/
|
│ └── invoice_fields/
|
||||||
│ └── weights/
|
│ └── weights/
|
||||||
│ ├── best.pt
|
│ ├── best.pt # 最佳模型
|
||||||
│ └── last.pt
|
│ └── last.pt # 最后检查点
|
||||||
└── requirements.txt
|
└── requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -410,14 +513,15 @@ Options:
|
|||||||
## Python API
|
## Python API
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from src.inference import InferencePipeline
|
from src.inference.pipeline import InferencePipeline
|
||||||
|
|
||||||
# 初始化
|
# 初始化
|
||||||
pipeline = InferencePipeline(
|
pipeline = InferencePipeline(
|
||||||
model_path='runs/train/invoice_yolo11n_full/weights/best.pt',
|
model_path='runs/train/invoice_fields/weights/best.pt',
|
||||||
confidence_threshold=0.3,
|
confidence_threshold=0.25,
|
||||||
use_gpu=True,
|
use_gpu=True,
|
||||||
dpi=150
|
dpi=150,
|
||||||
|
enable_fallback=True
|
||||||
)
|
)
|
||||||
|
|
||||||
# 处理 PDF
|
# 处理 PDF
|
||||||
@@ -427,26 +531,194 @@ result = pipeline.process_pdf('invoice.pdf')
|
|||||||
result = pipeline.process_image('invoice.png')
|
result = pipeline.process_image('invoice.png')
|
||||||
|
|
||||||
# 获取结果
|
# 获取结果
|
||||||
print(result.fields) # {'InvoiceNumber': '12345', 'Amount': '1234.56', ...}
|
print(result.fields)
|
||||||
|
# {
|
||||||
|
# 'InvoiceNumber': '12345',
|
||||||
|
# 'Amount': '1234.56',
|
||||||
|
# 'payment_line': '# 94228110015950070 # > 48666036#14#',
|
||||||
|
# 'customer_number': 'UMJ 436-R',
|
||||||
|
# ...
|
||||||
|
# }
|
||||||
|
|
||||||
print(result.confidence) # {'InvoiceNumber': 0.95, 'Amount': 0.92, ...}
|
print(result.confidence) # {'InvoiceNumber': 0.95, 'Amount': 0.92, ...}
|
||||||
print(result.to_json()) # JSON 格式输出
|
print(result.to_json()) # JSON 格式输出
|
||||||
|
|
||||||
|
# 访问交叉验证结果
|
||||||
|
if result.cross_validation:
|
||||||
|
print(f"OCR match: {result.cross_validation.ocr_match}")
|
||||||
|
print(f"Amount match: {result.cross_validation.amount_match}")
|
||||||
|
print(f"Details: {result.cross_validation.details}")
|
||||||
|
```
|
||||||
|
|
||||||
|
### 统一解析器使用
|
||||||
|
|
||||||
|
```python
|
||||||
|
from src.inference.payment_line_parser import PaymentLineParser
|
||||||
|
from src.inference.customer_number_parser import CustomerNumberParser
|
||||||
|
|
||||||
|
# Payment Line 解析
|
||||||
|
parser = PaymentLineParser()
|
||||||
|
result = parser.parse("# 94228110015950070 # 15658 00 8 > 48666036#14#")
|
||||||
|
print(f"OCR: {result.ocr_number}")
|
||||||
|
print(f"Amount: {result.amount}")
|
||||||
|
print(f"Account: {result.account_number}")
|
||||||
|
|
||||||
|
# Customer Number 解析
|
||||||
|
parser = CustomerNumberParser()
|
||||||
|
result = parser.parse("Said, Shakar Umj 436-R Billo")
|
||||||
|
print(f"Customer Number: {result}") # "UMJ 436-R"
|
||||||
|
```
|
||||||
|
|
||||||
|
## 测试
|
||||||
|
|
||||||
|
### 测试统计
|
||||||
|
|
||||||
|
| 指标 | 数值 |
|
||||||
|
|------|------|
|
||||||
|
| **测试总数** | 688 |
|
||||||
|
| **通过率** | 100% |
|
||||||
|
| **整体覆盖率** | 37% |
|
||||||
|
|
||||||
|
### 关键模块覆盖率
|
||||||
|
|
||||||
|
| 模块 | 覆盖率 | 测试数 |
|
||||||
|
|------|--------|--------|
|
||||||
|
| `machine_code_parser.py` | 65% | 79 |
|
||||||
|
| `payment_line_parser.py` | 85% | 45 |
|
||||||
|
| `customer_number_parser.py` | 90% | 32 |
|
||||||
|
|
||||||
|
### 运行测试
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 运行所有测试
|
||||||
|
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && pytest"
|
||||||
|
|
||||||
|
# 运行并查看覆盖率
|
||||||
|
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && pytest --cov=src --cov-report=term-missing"
|
||||||
|
|
||||||
|
# 运行特定模块测试
|
||||||
|
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && pytest tests/ocr/test_machine_code_parser.py -v"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 测试结构
|
||||||
|
|
||||||
|
```
|
||||||
|
tests/
|
||||||
|
├── ocr/
|
||||||
|
│ ├── test_machine_code_parser.py # 支付行解析 (79 tests)
|
||||||
|
│ └── test_ocr_engine.py # OCR 引擎测试
|
||||||
|
├── inference/
|
||||||
|
│ ├── test_payment_line_parser.py # 支付行解析器
|
||||||
|
│ └── test_customer_number_parser.py # 客户编号解析器
|
||||||
|
├── normalize/
|
||||||
|
│ └── test_normalizers.py # 字段规范化
|
||||||
|
└── utils/
|
||||||
|
└── test_validators.py # 字段验证
|
||||||
```
|
```
|
||||||
|
|
||||||
## 开发状态
|
## 开发状态
|
||||||
|
|
||||||
|
**已完成功能**:
|
||||||
- [x] 文本层 PDF 自动标注
|
- [x] 文本层 PDF 自动标注
|
||||||
- [x] 扫描图 OCR 自动标注
|
- [x] 扫描图 OCR 自动标注
|
||||||
- [x] 多策略字段匹配 (精确/子串/规范化)
|
- [x] 多策略字段匹配 (精确/子串/规范化)
|
||||||
- [x] PostgreSQL 数据库存储 (断点续传)
|
- [x] PostgreSQL 数据库存储 (断点续传)
|
||||||
- [x] 信号处理和超时保护
|
- [x] 信号处理和超时保护
|
||||||
- [x] YOLO 训练 (98.7% mAP@0.5)
|
- [x] YOLO 训练 (93.5% mAP@0.5, 10 个字段)
|
||||||
- [x] 推理管道
|
- [x] 推理管道
|
||||||
- [x] 字段规范化和验证
|
- [x] 字段规范化和验证
|
||||||
- [x] Web 应用 (FastAPI + 前端 UI)
|
- [x] Web 应用 (FastAPI + REST API)
|
||||||
- [x] 增量训练支持
|
- [x] 增量训练支持
|
||||||
|
- [x] 内存优化训练 (--low-memory, --resume)
|
||||||
|
- [x] Payment Line 解析器 (统一模块)
|
||||||
|
- [x] Customer Number 解析器 (统一模块)
|
||||||
|
- [x] Payment Line 交叉验证 (OCR, Amount, Account)
|
||||||
|
- [x] 文档类型识别 (invoice/letter)
|
||||||
|
- [x] 单元测试覆盖 (688 tests, 37% coverage)
|
||||||
|
|
||||||
|
**进行中**:
|
||||||
- [ ] 完成全部 25,000+ 文档标注
|
- [ ] 完成全部 25,000+ 文档标注
|
||||||
- [ ] 表格 items 处理
|
- [ ] 多源融合增强 (Multi-source fusion)
|
||||||
- [ ] 模型量化部署
|
- [ ] OCR 错误修正集成
|
||||||
|
- [ ] 提升测试覆盖率到 60%+
|
||||||
|
|
||||||
|
**计划中**:
|
||||||
|
- [ ] 表格 items 提取
|
||||||
|
- [ ] 模型量化部署 (ONNX/TensorRT)
|
||||||
|
- [ ] 多语言支持扩展
|
||||||
|
|
||||||
|
## 关键技术特性
|
||||||
|
|
||||||
|
### 1. Payment Line 交叉验证
|
||||||
|
|
||||||
|
瑞典发票的 payment_line (支付行) 包含完整的支付信息:OCR 参考号、金额、账号。我们实现了交叉验证机制:
|
||||||
|
|
||||||
|
```
|
||||||
|
Payment Line: # 94228110015950070 # 15658 00 8 > 48666036#14#
|
||||||
|
↓ ↓ ↓
|
||||||
|
OCR Number Amount Bankgiro Account
|
||||||
|
```
|
||||||
|
|
||||||
|
**验证流程**:
|
||||||
|
1. 从 payment_line 提取 OCR、Amount、Account
|
||||||
|
2. 与单独检测的字段对比验证
|
||||||
|
3. **payment_line 值优先** - 如有不匹配,采用 payment_line 的值
|
||||||
|
4. 返回验证结果和详细信息
|
||||||
|
|
||||||
|
**优势**:
|
||||||
|
- 提高数据准确性 (payment_line 是机器可读格式,更可靠)
|
||||||
|
- 发现 OCR 或检测错误
|
||||||
|
- 为数据质量提供信心指标
|
||||||
|
|
||||||
|
### 2. 统一解析器架构
|
||||||
|
|
||||||
|
采用独立解析器模块处理复杂字段:
|
||||||
|
|
||||||
|
**PaymentLineParser**:
|
||||||
|
- 解析瑞典标准支付行格式
|
||||||
|
- 提取 OCR、Amount (包含 Kronor + Öre)、Account + Check digits
|
||||||
|
- 支持多种变体格式
|
||||||
|
|
||||||
|
**CustomerNumberParser**:
|
||||||
|
- 支持多种瑞典客户编号格式 (`UMJ 436-R`, `JTY 576-3`, `FFL 019N`)
|
||||||
|
- 从混合文本中提取 (如地址行中的客户编号)
|
||||||
|
- 大小写不敏感,输出统一大写格式
|
||||||
|
|
||||||
|
**优势**:
|
||||||
|
- 代码模块化、可测试
|
||||||
|
- 易于扩展新格式
|
||||||
|
- 统一的解析逻辑,减少重复代码
|
||||||
|
|
||||||
|
### 3. 文档类型自动识别
|
||||||
|
|
||||||
|
根据 payment_line 字段自动判断文档类型:
|
||||||
|
|
||||||
|
- **invoice**: 包含 payment_line 的发票文档
|
||||||
|
- **letter**: 不含 payment_line 的信函文档
|
||||||
|
|
||||||
|
这个特性帮助下游系统区分处理流程。
|
||||||
|
|
||||||
|
### 4. 低内存模式训练
|
||||||
|
|
||||||
|
支持在内存受限环境下训练:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m src.cli.train --low-memory
|
||||||
|
```
|
||||||
|
|
||||||
|
自动调整:
|
||||||
|
- batch size: 16 → 8
|
||||||
|
- workers: 8 → 4
|
||||||
|
- cache: disabled
|
||||||
|
- 推荐用于 GPU 内存 < 8GB 或系统内存 < 16GB 的场景
|
||||||
|
|
||||||
|
### 5. 断点续传训练
|
||||||
|
|
||||||
|
训练中断后可从检查点恢复:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m src.cli.train --resume --model runs/train/invoice_fields/weights/last.pt
|
||||||
|
```
|
||||||
|
|
||||||
## 技术栈
|
## 技术栈
|
||||||
|
|
||||||
@@ -457,7 +729,33 @@ print(result.to_json()) # JSON 格式输出
|
|||||||
| **PDF 处理** | PyMuPDF (fitz) |
|
| **PDF 处理** | PyMuPDF (fitz) |
|
||||||
| **数据库** | PostgreSQL + psycopg2 |
|
| **数据库** | PostgreSQL + psycopg2 |
|
||||||
| **Web 框架** | FastAPI + Uvicorn |
|
| **Web 框架** | FastAPI + Uvicorn |
|
||||||
| **深度学习** | PyTorch + CUDA |
|
| **深度学习** | PyTorch + CUDA 12.x |
|
||||||
|
|
||||||
|
## 常见问题
|
||||||
|
|
||||||
|
**Q: 为什么必须在 WSL 环境运行?**
|
||||||
|
|
||||||
|
A: PaddleOCR 和某些依赖在 Windows 原生环境存在兼容性问题。WSL 提供完整的 Linux 环境,确保所有依赖正常工作。
|
||||||
|
|
||||||
|
**Q: 训练过程中出现 OOM (内存不足) 错误怎么办?**
|
||||||
|
|
||||||
|
A: 使用 `--low-memory` 模式,或手动调整 `--batch` 和 `--workers` 参数。
|
||||||
|
|
||||||
|
**Q: payment_line 和单独检测字段不匹配时怎么处理?**
|
||||||
|
|
||||||
|
A: 系统默认优先采用 payment_line 的值,因为 payment_line 是机器可读格式,通常更准确。验证结果会记录在 `cross_validation` 字段中。
|
||||||
|
|
||||||
|
**Q: 如何添加新的字段类型?**
|
||||||
|
|
||||||
|
A:
|
||||||
|
1. 在 `src/inference/constants.py` 添加字段定义
|
||||||
|
2. 在 `field_extractor.py` 添加规范化方法
|
||||||
|
3. 重新生成标注数据
|
||||||
|
4. 从头训练模型
|
||||||
|
|
||||||
|
**Q: 可以用 CPU 训练吗?**
|
||||||
|
|
||||||
|
A: 可以,但速度会非常慢 (慢 10-50 倍)。强烈建议使用 GPU 训练。
|
||||||
|
|
||||||
## 许可证
|
## 许可证
|
||||||
|
|
||||||
|
|||||||
24
config.py
24
config.py
@@ -4,6 +4,12 @@ Configuration settings for the invoice extraction system.
|
|||||||
|
|
||||||
import os
|
import os
|
||||||
import platform
|
import platform
|
||||||
|
from pathlib import Path
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
|
||||||
|
# Load environment variables from .env file
|
||||||
|
env_path = Path(__file__).parent / '.env'
|
||||||
|
load_dotenv(dotenv_path=env_path)
|
||||||
|
|
||||||
|
|
||||||
def _is_wsl() -> bool:
|
def _is_wsl() -> bool:
|
||||||
@@ -21,14 +27,22 @@ def _is_wsl() -> bool:
|
|||||||
|
|
||||||
|
|
||||||
# PostgreSQL Database Configuration
|
# PostgreSQL Database Configuration
|
||||||
|
# Now loaded from environment variables for security
|
||||||
DATABASE = {
|
DATABASE = {
|
||||||
'host': '192.168.68.31',
|
'host': os.getenv('DB_HOST', '192.168.68.31'),
|
||||||
'port': 5432,
|
'port': int(os.getenv('DB_PORT', '5432')),
|
||||||
'database': 'docmaster',
|
'database': os.getenv('DB_NAME', 'docmaster'),
|
||||||
'user': 'docmaster',
|
'user': os.getenv('DB_USER', 'docmaster'),
|
||||||
'password': '0412220',
|
'password': os.getenv('DB_PASSWORD'), # No default for security
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# Validate required configuration
|
||||||
|
if not DATABASE['password']:
|
||||||
|
raise ValueError(
|
||||||
|
"DB_PASSWORD environment variable is not set. "
|
||||||
|
"Please create a .env file based on .env.example and set DB_PASSWORD."
|
||||||
|
)
|
||||||
|
|
||||||
# Connection string for psycopg2
|
# Connection string for psycopg2
|
||||||
def get_db_connection_string():
|
def get_db_connection_string():
|
||||||
return f"postgresql://{DATABASE['user']}:{DATABASE['password']}@{DATABASE['host']}:{DATABASE['port']}/{DATABASE['database']}"
|
return f"postgresql://{DATABASE['user']}:{DATABASE['password']}@{DATABASE['host']}:{DATABASE['port']}/{DATABASE['database']}"
|
||||||
|
|||||||
@@ -20,3 +20,4 @@ pyyaml>=6.0 # YAML config files
|
|||||||
|
|
||||||
# Utilities
|
# Utilities
|
||||||
tqdm>=4.65.0 # Progress bars
|
tqdm>=4.65.0 # Progress bars
|
||||||
|
python-dotenv>=1.0.0 # Environment variable management
|
||||||
|
|||||||
5
start_web.sh
Normal file
5
start_web.sh
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2
|
||||||
|
source ~/miniconda3/etc/profile.d/conda.sh
|
||||||
|
conda activate invoice-py311
|
||||||
|
python run_server.py --port 8000
|
||||||
Reference in New Issue
Block a user