
OCR Module

Extracts text natively from digital PDFs and falls back to GPU-accelerated PaddleOCR for scanned PDFs.

Features

  • pdfplumber: Extract text from digital PDFs
  • PaddleOCR: GPU-accelerated OCR for scanned PDFs
  • Structured output: chunk type, section, and page number for each extracted chunk
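
The native-first, OCR-fallback flow described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the `Chunk` dataclass, the `source` field, the `min_chars` threshold, and the function names are all assumptions made for the example. Only `pdfplumber.open` and `page.extract_text()` are real library calls; the OCR step is passed in as a plain callable so that the sketch does not depend on a particular PaddleOCR version.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    chunk_type: str          # e.g. "text"; the real set of chunk types is not documented here
    section: Optional[str]   # left as None: section detection is out of scope for this sketch
    page_number: int         # 1-based page index
    text: str
    source: str              # "native" or "ocr" (hypothetical field, for illustration)


def needs_ocr(native_text: Optional[str], min_chars: int = 20) -> bool:
    """Assumed heuristic: a page counts as scanned when its text layer is missing or nearly empty."""
    if native_text is None:
        return True
    # Count non-whitespace characters only.
    return len("".join(native_text.split())) < min_chars


def extract_chunks(
    native_pages: List[Optional[str]],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> List[Chunk]:
    """Use the native text of each page when available, otherwise call the OCR fallback."""
    chunks = []
    for number, native_text in enumerate(native_pages, start=1):
        if needs_ocr(native_text, min_chars):
            text, source = ocr_page(number), "ocr"
        else:
            text, source = native_text, "native"
        chunks.append(Chunk("text", None, number, text.strip(), source))
    return chunks


def read_native_pages(pdf_path: str) -> List[Optional[str]]:
    """Read the embedded text layer of every page with pdfplumber."""
    import pdfplumber  # imported lazily so the rest of the sketch runs without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

In practice, `ocr_page` would wrap whatever PaddleOCR call the project uses for a given page; keeping it as a parameter also makes the fallback decision easy to test with a stub.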

API Endpoints

Method  Endpoint                     Description
POST    /projects/{id}/ocr/process   Run OCR on the project's papers
GET     /projects/{id}/ocr/stats     Return OCR statistics

Paths are relative to the /api/v1 prefix used in the usage example.

Usage Example

bash
# Process papers
curl -X POST http://localhost:8000/api/v1/projects/1/ocr/process

# Get stats
curl http://localhost:8000/api/v1/projects/1/ocr/stats
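
The same two calls can also be made from Python. The following is a minimal sketch using only the standard library. The base URL and the lack of a request body mirror the curl commands above; the function names are hypothetical, and because the response format is not documented here, the body is returned as raw text.

```python
import urllib.request

# Base URL taken from the curl example; adjust for your deployment.
BASE_URL = "http://localhost:8000/api/v1"

# HTTP method for each OCR action, as listed under API Endpoints.
METHODS = {"process": "POST", "stats": "GET"}


def build_request(project_id: int, action: str) -> urllib.request.Request:
    """Build the request for an OCR action ("process" or "stats") without sending it."""
    if action not in METHODS:
        raise ValueError(f"unknown OCR action: {action!r}")
    url = f"{BASE_URL}/projects/{project_id}/ocr/{action}"
    return urllib.request.Request(url, method=METHODS[action])


def call_ocr(project_id: int, action: str, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text."""
    request = build_request(project_id, action)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, `call_ocr(1, "process")` followed by `call_ocr(1, "stats")` corresponds to the two curl commands.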

Released under the MIT License.