Skip to content

PDF Crawler Module

Multi-channel PDF download: Unpaywall → arXiv → direct URL fallback.

Features

  • Unpaywall: Open-access links (requires UNPAYWALL_EMAIL)
  • arXiv: Preprint PDFs for arXiv papers
  • Direct URL: Fallback when available
  • Async task: Progress and retry support

API Endpoints

MethodEndpointDescription
POST/projects/{id}/crawler/startStart PDF crawl
GET/projects/{id}/crawler/statsCrawl statistics

Usage Example

bash
# Start crawl for project papers
curl -X POST http://localhost:8000/api/v1/projects/1/crawler/start

# Get stats
curl http://localhost:8000/api/v1/projects/1/crawler/stats

Released under the MIT License.