SurgiPrompt

Pipeline

This repository implements an inference-first pipeline for surgical tools, run on real endoscopy data only. A text prompt grounds candidate tools as boxes, SAM2 refines those boxes into masks, and the same objects are then propagated through video.

Prompt Grounding

Text prompts such as "forceps", "grasper", "catheter", and "guidewire" are grounded into candidate tool boxes.
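
As a rough illustration of how class names might become a single grounding query: some open-vocabulary detectors (Grounding DINO is one example) take lowercase category names separated by periods. Whether this repository uses that format is an assumption, and `build_prompt` and `match_label` are hypothetical helpers, not functions from the codebase.

```python
# Hypothetical helpers; the period-separated format is an assumption
# borrowed from detectors such as Grounding DINO.
TOOL_LABELS = ["forceps", "grasper", "catheter", "guidewire"]


def build_prompt(labels):
    """Join class names into one period-separated query string."""
    cleaned = [label.strip().lower() for label in labels if label.strip()]
    if not cleaned:
        return ""
    return " . ".join(cleaned) + " ."


def match_label(phrase, labels):
    """Map a phrase returned by a detector back to a known class name.

    Returns None when no known class name occurs in the phrase.
    """
    phrase = phrase.strip().lower()
    for label in labels:
        if label.lower() in phrase:
            return label
    return None


if __name__ == "__main__":
    print(build_prompt(TOOL_LABELS))
    print(match_label("Grasper tip", TOOL_LABELS))
```

Mapping phrases back to a fixed label list is what makes class-aware evaluation possible, since free-text detector output would otherwise not line up with dataset categories.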

Mask Refinement

SAM2 converts tool boxes into image masks so the output is usable for segmentation-style evaluation instead of box-only inspection.

Video Tracking

Seed detections from the first frame are propagated through a real Endoscapes frame sequence with optional periodic re-grounding.
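
The schedule described above can be sketched as a per-frame plan. This is a minimal illustration, not the repository's tracking loop: `plan_frames` and the action names are hypothetical, and treating frame 0 as the seed and re-grounding on exact multiples of the interval is an assumption.

```python
from typing import List, Optional


def plan_frames(num_frames: int, reground_every: Optional[int] = None) -> List[str]:
    """Return one action per frame: 'seed', 'reground', or 'propagate'.

    Assumption: frame 0 seeds the tracker, and re-grounding happens on
    every later frame whose index is a multiple of `reground_every`.
    """
    if reground_every is not None and reground_every <= 0:
        raise ValueError("reground_every must be a positive integer or None")

    actions = []
    for idx in range(num_frames):
        if idx == 0:
            actions.append("seed")
        elif reground_every is not None and idx % reground_every == 0:
            actions.append("reground")
        else:
            actions.append("propagate")
    return actions


if __name__ == "__main__":
    plan = plan_frames(120, reground_every=30)
    print([i for i, action in enumerate(plan) if action == "reground"])
```

Under these assumptions, a 120-frame sequence with an interval of 30 re-grounds on frames 30, 60, and 90, and every other frame after the seed is pure propagation.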

Real-Data Evaluation

The code reports box mAP, segmentation mAP, mask IoU, throughput, and saved failure cases on real datasets rather than synthetic splits.
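
For reference, mask IoU between a predicted and a ground-truth binary mask is the pixel count of their intersection divided by that of their union. The sketch below assumes NumPy boolean masks of equal shape. The function names are illustrative, and how the repository pairs predictions with ground truth or handles empty masks is not stated here, so returning 0.0 for an empty union is an assumption.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection over union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")

    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks score 0.0 rather than 1.0.
        return 0.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection) / float(union)


def mean_mask_iou(pairs):
    """Average mask IoU over already-matched (pred, target) pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    pred = np.zeros((4, 4), dtype=bool)
    pred[0:2, 0:2] = True      # 4 pixels
    target = np.zeros((4, 4), dtype=bool)
    target[0:2, 1:3] = True    # 4 pixels, 2 of them shared with pred
    print(mask_iou(pred, target))  # 2 shared / 6 in the union
```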

Current Results

These numbers come from tracked local runs in this repository on real Endoscapes data. The evaluation row is a smoke run on a small tool-only subset, not a full benchmark claim.

Run	bbox mAP	bbox mAP@50	segm mAP	mean mask IoU	FPS
Endoscapes tool subset	0.2139	0.2492	0.0990	0.3926	2.1140
Video tracking	-	-	-	-	5.3907
Video re-grounding every 30 frames	-	-	-	-	5.5816

Real inputs used here: one Endoscapes frame, a 120-frame Endoscapes sequence video, and a 9-image tool-only Endoscapes subset for class-aware evaluation.
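
Throughput in frames per second is the number of processed frames divided by wall-clock time. The sketch below shows one way to measure it. `measure_fps` is a hypothetical helper, and whether the repository's timings include model loading or frame I/O is not stated here.

```python
import time


def measure_fps(process_frame, frames):
    """Run `process_frame` on every frame and return frames per second."""
    start = time.perf_counter()
    count = 0
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start

    if count == 0:
        return 0.0
    # Guard against a zero reading on very coarse clocks.
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in workload: sleep 10 ms per "frame" instead of running a model.
    fps = measure_fps(lambda frame: time.sleep(0.01), range(5))
    print(f"{fps:.2f} FPS")
```

If the tracking FPS in the table was measured over the whole 120-frame sequence, 5.3907 FPS would correspond to roughly 22.3 seconds of wall-clock time (120 / 5.3907).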

Figures

The figures below are lightweight artifacts from the real runs, tracked in the repository.

Single real Endoscapes frame overlaid with prompt-grounded tool detections and SAM2 masks.

Tracked tool overlay on frame 0 of the real Endoscapes sequence.

Tracked tool overlay on a middle frame of the real Endoscapes sequence, without periodic re-grounding.

Tracked tool overlay on a middle frame of the re-grounded Endoscapes sequence, with re-grounding every 30 frames.

Playable Videos

Real-sequence video overlays exported from the repository runs. The first video uses seeded tracking only; the second adds periodic re-grounding every 30 frames.

Baseline tracked overlay on the 120-frame Endoscapes sequence.

Tracked overlay with re-grounding every 30 frames on the same real sequence.