Skip to content

OCR API

Base path: /api/v1/projects/{project_id}/ocr

Overview

OCR and text extraction for PDF papers. Uses pdfplumber for native PDFs and PaddleOCR for scanned documents.

Endpoints

MethodPathDescription
POST/projects/{id}/ocr/processRun OCR on papers
GET/projects/{id}/ocr/statsOCR statistics

Process

POST /projects/{id}/ocr/process — Extract text from PDFs via OCR.

Query parameters:

ParameterTypeDescription
paper_idslist[int]Optional. Specific paper IDs. If omitted, all pdf_downloaded papers are processed.
force_ocrboolRe-run OCR even if already processed (default: false)
use_gpuboolUse GPU for PaddleOCR (default: true)

Response: { processed, failed, total, message? }

Stats

GET /projects/{id}/ocr/stats — Return paper counts by status and total chunk count.

Response: { metadata_only: n, pdf_downloaded: n, ocr_complete: n, indexed: n, error: n, total_chunks: n }

Released under the MIT License.