
Crawler API

Crawler module API: downloads PDFs for papers awaiting download, with multi-source fallback (Unpaywall and others).

Base path: `/api/v1/projects/{project_id}/crawl`


Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | `/start` | Start a PDF download task |
| GET | `/stats` | Get download statistics |

POST /start

Starts PDF downloads for the project's papers awaiting download. Only papers in `pending` or `metadata_only` status are processed.

Query Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `priority` | string | `"high"` | Priority: `high` sorts by citation count, `low` by creation time |
| `max_papers` | int | 50 | Maximum number of papers to process per call |

Response

```json
{
  "code": 200,
  "message": "success",
  "data": {
    "total": 10,
    "success": 8,
    "failed": 2,
    "details": [
      {
        "paper_id": 1,
        "success": true,
        "file_path": "/data0/djx/omelette/.../1.pdf"
      }
    ]
  }
}
```

Example

```bash
curl -X POST "http://localhost:8000/api/v1/projects/1/crawl/start?priority=high&max_papers=50"
```
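On the client side, the `/start` response can be reduced to a quick pass/fail summary. The helper below is a minimal sketch, assuming the JSON shape shown above (`data.total`, `data.success`, `data.failed`, and a `details` list with `paper_id`/`success` fields); the function name is hypothetical, not part of the API.

```python
def summarize_start_response(payload: dict) -> dict:
    """Hypothetical helper: condense a POST /start response.

    Assumes the response shape documented above; collects the IDs of
    papers whose download failed so they can be retried or inspected.
    """
    data = payload["data"]
    failed_ids = [d["paper_id"] for d in data["details"] if not d["success"]]
    return {
        "total": data["total"],
        "success": data["success"],
        "failed": data["failed"],
        "failed_paper_ids": failed_ids,
    }
```

A caller would fetch the endpoint (e.g. with `requests.post(...)`), parse the JSON, and pass it to this function to decide whether a retry pass is needed.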

GET /stats

Returns download-related statistics for the project.

Response

```json
{
  "code": 200,
  "message": "success",
  "data": {
    "pending": 20,
    "metadata_only": 5,
    "pdf_downloaded": 80,
    "ocr_complete": 60,
    "indexed": 50,
    "error": 3,
    "storage": {
      "total_mb": 1024,
      "used_mb": 512
    }
  }
}
```
  • Each status field gives the number of papers currently in that status
  • storage: storage statistics (optional, provided by CrawlerService)

Example

```bash
curl "http://localhost:8000/api/v1/projects/1/crawl/stats"
```
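The `data` object above can drive a simple progress readout. This is a sketch under stated assumptions: each status field is an independent paper count, papers still to download are those in `pending` or `metadata_only`, and `storage` may be absent. The function name is hypothetical.

```python
def crawl_progress(data: dict) -> dict:
    """Hypothetical helper: derive progress figures from GET /stats data.

    Treats each status field as a count of papers in that status and
    tolerates a missing optional "storage" block.
    """
    result = {
        # Papers still eligible for POST /start (pending or metadata_only).
        "remaining_to_download": data.get("pending", 0) + data.get("metadata_only", 0),
        "errors": data.get("error", 0),
    }
    storage = data.get("storage")
    if storage and storage.get("total_mb"):
        result["storage_used_pct"] = round(100 * storage["used_mb"] / storage["total_mb"], 1)
    return result
```

With the sample response above, `remaining_to_download` is 25 and `storage_used_pct` is 50.0.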

Error Codes

| Code | Description |
| --- | --- |
| 404 | Project not found |

Released under the MIT License.