Real Endoscopy Results

SurgiPrompt

Open-vocabulary surgical tool detection, segmentation, and tracking for real endoscopic and laparoscopic imagery using Grounding DINO and SAM2, with prompt-grounded boxes, refined masks, and video propagation.

0.2139 bbox mAP
0.3926 mean mask IoU
5.58 video FPS with re-grounding
120 tracked real frames

Pipeline

The repository implements an inference-first surgical tool pipeline on real endoscopy data only. A text prompt grounds candidate tools, SAM2 refines masks, and the same objects are propagated through video.

1. Prompt Grounding

Text prompts such as forceps, grasper, catheter, and guidewire are grounded into tool proposals.
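Grounding DINO is commonly prompted with lowercased category phrases separated by periods. The sketch below builds such a prompt string from a list of tool names. The helper name `build_prompt` is hypothetical and is not taken from this repository.

```python
def build_prompt(tool_names):
    """Join tool names into a lowercased, period-separated prompt string."""
    cleaned = []
    for name in tool_names:
        # Lowercase and collapse any repeated or surrounding whitespace.
        phrase = " ".join(name.lower().split())
        # Skip empty entries and duplicates so each phrase appears once.
        if phrase and phrase not in cleaned:
            cleaned.append(phrase)
    return " . ".join(cleaned) + " ." if cleaned else ""


prompt = build_prompt(["Forceps", "grasper", "catheter", "guidewire"])
print(prompt)  # forceps . grasper . catheter . guidewire .
```

Keeping the prompt construction in one place makes it easier to keep the tool vocabulary consistent between single-image and video runs.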

2. Mask Refinement

SAM2 converts tool boxes into image masks so the output is usable for segmentation-style evaluation instead of box-only inspection.
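Grounding DINO's raw box outputs are normalized (cx, cy, w, h) values, while SAM2's image predictor takes box prompts as pixel (x0, y0, x1, y1) coordinates. A minimal sketch of that conversion follows, assuming the raw normalized boxes are used directly (some post-processing wrappers already return pixel boxes, in which case this step is unnecessary). The function name `cxcywh_to_xyxy` is illustrative only.

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box

    def clamp(value, upper):
        # Keep coordinates inside the image bounds.
        return max(0.0, min(float(upper), value))

    x0 = clamp((cx - w / 2) * image_width, image_width)
    y0 = clamp((cy - h / 2) * image_height, image_height)
    x1 = clamp((cx + w / 2) * image_width, image_width)
    y1 = clamp((cy + h / 2) * image_height, image_height)
    return (x0, y0, x1, y1)


# A centered box covering half the width and height of a 640x480 frame.
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), 640, 480))  # (160.0, 120.0, 480.0, 360.0)
```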

3. Video Tracking

Seed detections from the first frame are propagated through a real Endoscapes frame sequence with optional periodic re-grounding.
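One simple way to schedule periodic re-grounding is to run the detector on frame 0 and then on every k-th frame, propagating masks on all other frames. The sketch below only computes that schedule. It is an assumption about how such a schedule could look, not the repository's actual logic, and `regrounding_frames` is a hypothetical name.

```python
def regrounding_frames(num_frames, interval=None):
    """Return the frame indices on which the detector is run.

    Frame 0 always seeds the tracker. With interval=None the tracker
    relies on that seed alone (tracking-only mode).
    """
    if num_frames <= 0:
        return []
    if interval is None:
        return [0]
    if interval <= 0:
        raise ValueError("interval must be a positive number of frames")
    return list(range(0, num_frames, interval))


print(regrounding_frames(120))               # [0]
print(regrounding_frames(120, interval=30))  # [0, 30, 60, 90]
```

For a 120-frame sequence with an interval of 30, the detector would run four times under this scheme, with the remaining 116 frames handled by propagation.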

4. Real-Data Evaluation

The code reports box mAP, segmentation mAP, mask IoU, and throughput, and saves failure cases, all on real datasets rather than synthetic splits.
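Mask IoU is the intersection over union of a predicted binary mask and its reference mask, and the mean is taken over matched pairs. The sketch below uses plain nested lists to stay dependency-free; a real pipeline would more likely operate on array masks. How the repository matches predictions to references and how it scores empty masks is not stated here, so returning 0.0 when both masks are empty is an assumption.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as 2D sequences."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += bool(p) and bool(t)
            union += bool(p) or bool(t)
    # Assumption: two empty masks score 0.0 rather than 1.0.
    return intersection / union if union else 0.0


def mean_mask_iou(pairs):
    """Average IoU over (pred, target) mask pairs; 0.0 if there are none."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores) if scores else 0.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 0.5 (2 shared pixels out of 4 covered)
```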

Current Results

These numbers come from tracked local runs in this repository on real Endoscapes data. The evaluation row is a smoke run on a small, bounded tool-only subset, not a full benchmark result.

Run                                | bbox mAP | bbox mAP@50 | segm mAP | mean mask IoU | FPS
Endoscapes tool subset             | 0.2139   | 0.2492      | 0.0990   | 0.3926        | 2.1140
Video tracking                     | -        | -           | -        | -             | 5.3907
Video re-grounding every 30 frames | -        | -           | -        | -             | 5.5816

Real inputs used here: one Endoscapes frame, a 120-frame Endoscapes sequence video, and a 9-image tool-only Endoscapes subset for class-aware evaluation.
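The reported FPS values can be read as per-frame latency and as total processing time for the 120-frame sequence. The sketch below is plain arithmetic on the numbers reported above, not an additional measurement, and the helper names are illustrative.

```python
def latency_ms(fps):
    """Average time per frame in milliseconds for a given throughput."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def sequence_seconds(num_frames, fps):
    """Total time in seconds to process num_frames at a given throughput."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# FPS values as reported in the results table.
reported_fps = {
    "Endoscapes tool subset": 2.1140,
    "Video tracking": 5.3907,
    "Video re-grounding every 30 frames": 5.5816,
}

for run, fps in reported_fps.items():
    print(f"{run}: {latency_ms(fps):.1f} ms/frame, "
          f"{sequence_seconds(120, fps):.2f} s per 120 frames")
```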

Figures

Lightweight tracked artifacts from the real runs.

Playable Videos

Real-sequence video overlays exported from the repository runs. The first video uses seeded tracking only; the second adds periodic re-grounding every 30 frames.