OCR API

Base path: /api/v1/projects/{project_id}/ocr

Overview

Provides OCR and text extraction for PDF papers. Native PDFs (those with an embedded text layer) are read with pdfplumber; scanned documents are processed with PaddleOCR.
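The source does not say how the service tells a native PDF from a scanned one. As a rough sketch only, one common heuristic is to look at how much text the embedded text layer yields per page and fall back to OCR when it is nearly empty. The function name `needs_ocr` and the threshold `min_chars_per_page` are hypothetical; the per-page strings would typically come from pdfplumber's `page.extract_text()`.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned, based on its extracted text layer.

    page_texts: one string (or None) per page, e.g. from page.extract_text().
    min_chars_per_page: hypothetical threshold, not taken from the service.
    """
    if not page_texts:
        # No pages or no text layer at all: treat as scanned.
        return True
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    # Average characters per page below the threshold suggests image-only pages.
    return total_chars / len(page_texts) < min_chars_per_page
```

A document whose pages average fewer than 20 extracted characters would be routed to OCR under this assumption; the real service may use a different rule.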

| Method | Path | Description |
|--------|------|-------------|
| POST | `/projects/{id}/ocr/process` | Run OCR on papers |
| GET | `/projects/{id}/ocr/stats` | OCR statistics |

POST /projects/{id}/ocr/process — Extract text from PDFs via OCR.

Query parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `paper_ids` | list[int] | Optional. Specific paper IDs. If omitted, all `pdf_downloaded` papers are processed. |
| `force_ocr` | bool | Re-run OCR even if already processed (default: false) |
| `use_gpu` | bool | Use GPU for PaddleOCR (default: true) |

Response: { processed, failed, total, message? }
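A minimal client sketch for this endpoint, using only the Python standard library. The host in `BASE_URL` and the helper names `build_process_url` and `run_ocr` are hypothetical. It also assumes that list parameters are sent as repeated query keys (`paper_ids=1&paper_ids=2`) and that booleans are sent as `true`/`false`; check both against the running service.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: host and port are hypothetical


def build_process_url(project_id, paper_ids=None, force_ocr=False, use_gpu=True):
    """Build the URL for POST .../ocr/process with its query parameters."""
    params = []
    # Assumption: list values are encoded as repeated keys.
    for paper_id in paper_ids or []:
        params.append(("paper_ids", str(paper_id)))
    params.append(("force_ocr", "true" if force_ocr else "false"))
    params.append(("use_gpu", "true" if use_gpu else "false"))
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}/api/v1/projects/{project_id}/ocr/process?{query}"


def run_ocr(project_id, **kwargs):
    """Send the POST request and return the decoded JSON response body."""
    url = build_process_url(project_id, **kwargs)
    request = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_ocr(7, paper_ids=[12, 15], force_ocr=True)` would re-run OCR on two papers of project 7 and return a dictionary with the `processed`, `failed`, and `total` fields, plus `message` when the service includes it.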

GET /projects/{id}/ocr/stats — Return paper counts by status and total chunk count.

Response: { metadata_only: n, pdf_downloaded: n, ocr_complete: n, indexed: n, error: n, total_chunks: n }
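As an illustration of how a client might read this response, the sketch below totals the per-status counts. It assumes each paper is in exactly one status, so the counts can be summed, and that `pdf_downloaded` papers are the ones still waiting for OCR (consistent with the default behavior of the process endpoint). The helper name `summarize_stats` and the sample numbers are hypothetical.

```python
STATUS_KEYS = ("metadata_only", "pdf_downloaded", "ocr_complete", "indexed", "error")


def summarize_stats(stats):
    """Condense an /ocr/stats response into a few headline numbers."""
    # Assumption: statuses are mutually exclusive, so their counts add up.
    total_papers = sum(stats.get(key, 0) for key in STATUS_KEYS)
    return {
        "total_papers": total_papers,
        "pending_ocr": stats.get("pdf_downloaded", 0),
        "failed": stats.get("error", 0),
        "total_chunks": stats.get("total_chunks", 0),
    }


# Illustrative values only, not real service output.
sample = {
    "metadata_only": 4,
    "pdf_downloaded": 3,
    "ocr_complete": 10,
    "indexed": 2,
    "error": 1,
    "total_chunks": 240,
}
summary = summarize_stats(sample)
```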