Deduplication Module

A three-stage pipeline: exact DOI matching, title similarity, and optional LLM verification.

Features

  • DOI hard dedup: Records with an exact DOI match are removed as duplicates
  • Title similarity: Jaccard or edit-distance comparison for papers without a DOI
  • LLM verify: Optional LLM-assisted judgment for ambiguous pairs
  • Async task: Returns a task_id for progress polling
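The first two stages can be illustrated with a minimal sketch. It is not the module's actual implementation: the function names (`normalize_doi`, `jaccard`, `dedup`), the record shape (dicts with optional `doi` and required `title`), and both thresholds are assumptions chosen for illustration. It uses token-level Jaccard similarity only; edit distance is omitted.

```python
import re


def normalize_doi(doi):
    """Lowercase a DOI and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)
    return doi.strip()


def title_tokens(title):
    """Split a title into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity of two titles: |A & B| / |A | B| over token sets."""
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def dedup(papers, threshold=0.9, ambiguous_floor=0.6):
    """Return (kept, candidates).

    Hypothetical thresholds: a score >= threshold counts as a duplicate,
    a score in [ambiguous_floor, threshold) is recorded as an ambiguous
    pair that a later LLM verification step could judge.
    """
    kept, candidates, seen_dois = [], [], set()
    for paper in papers:
        doi = paper.get("doi")
        if doi:
            # Stage 1: exact match on the normalized DOI.
            key = normalize_doi(doi)
            if key in seen_dois:
                continue
            seen_dois.add(key)
            kept.append(paper)
            continue
        # Stage 2: no DOI, so compare the title against papers kept so far.
        best, best_score = None, 0.0
        for other in kept:
            score = jaccard(paper["title"], other["title"])
            if score > best_score:
                best, best_score = other, score
        if best_score >= threshold:
            continue  # confident duplicate, drop it
        if best_score >= ambiguous_floor:
            candidates.append((paper, best, best_score))
        kept.append(paper)
    return kept, candidates
```

Keeping ambiguous papers while recording them as candidate pairs means nothing is dropped on a borderline score; whether the real module behaves this way is not stated in this page.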

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /projects/{id}/dedup/run | Run deduplication |
| GET | /projects/{id}/dedup/candidates | Preview dedup candidates |
| POST | /projects/{id}/dedup/verify | LLM-verify candidate pair |

Usage Example

```bash
# Preview candidates before running
curl http://localhost:8000/api/v1/projects/1/dedup/candidates

# Run dedup
curl -X POST http://localhost:8000/api/v1/projects/1/dedup/run

# Poll task status (replace {task_id} with the task_id returned for the async task)
curl http://localhost:8000/api/v1/tasks/{task_id}
```
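Polling the task endpoint in a loop could look like the sketch below. The response schema is not documented here, so the `status` field and the terminal values `completed` and `failed` are assumptions, as are the helper names `fetch_task` and `wait_for_task`. Only the `/api/v1/tasks/{task_id}` URL comes from the example above.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def fetch_task(task_id, base_url=BASE_URL):
    """GET the task resource and decode its JSON body."""
    with urllib.request.urlopen(f"{base_url}/tasks/{task_id}") as resp:
        return json.load(resp)


def wait_for_task(task_id, fetch=fetch_task, interval=2.0, timeout=300.0,
                  done_states=("completed", "failed"), sleep=time.sleep):
    """Poll until the task reports a terminal state or the timeout expires.

    Assumes the task JSON has a "status" field; the terminal state names
    are hypothetical and should be adjusted to the real API.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch(task_id)
        if task.get("status") in done_states:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
        sleep(interval)
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running server.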

Released under the MIT License.