Triple YOLO Recall with Adaptive Post-Processing

Adaptive Thresholds Unlock Small-Object Detections

Fixed confidence thresholds (default 0.25-0.50) drop distant people in crowded scenes because small boxes (e.g., 4% frame height vs. 30% for close subjects) yield low scores like 0.08, even if person-shaped. Solution: Run YOLO permissively at 0.05 to capture all candidates, then compute a frame-level baseline from the score distribution's low percentile—frames with mostly high scores raise the bar, low-score frames lower it. Scale this threshold inversely by relative box height: tiny boxes need only ~half the confidence of large ones via linear scaling. This alone triples recall from 10-12 to 30+ out of 40 students in a classroom by giving small detections a fair shot without uniform conservatism.

Trade-off: Lower thresholds increase false positives, so compensate with evidence-based validation instead of data-driven retraining or heavy models like SAHI.

Keypoint Rescue Validates Borderline Boxes

For candidates failing the adaptive threshold, check pose keypoints from models like yolov8n-pose: if nose, left shoulder, and right shoulder exceed high confidence (e.g., model-default levels), rescue the box. Bags or chairs lack reliable shoulders; real people show them consistently. Follow with standard NMS to dedupe overlaps. This leverages the model's full output—keypoints were predicted all along but discarded—turning 'uncertain' boxes into reliable detections. In practice, back-row skeletons 'light up' stably, enabling accurate tracking IDs.

Scene-Specific Limits and Extensions

Precision benchmarks like COCO mAP favor conservative thresholds, penalizing false positives more than misses, so defaults stay high. This works best in fixed-camera setups (classrooms) assuming most candidates are real, but fails in chaotic scenes like streets. Pose dependency limits to keypoint models. Broader: Treat outputs as multi-signal conversation—add temporal consistency (lenient if tracked 3 frames), spatial priors (row-based penalties), or auxiliary classifiers. Avoids technical debt vs. retraining (3-6 months) for quick 3x gains.

Adaptive Thresholds Unlock Small-Object Detections

Keypoint Rescue Validates Borderline Boxes

Scene-Specific Limits and Extensions

More from Data Science & Visualization

Batch GEMMs for Fast LSTM in Torch

GPU Bandwidth Limits LLM Speed, Not FLOPS

Word2Vec: Turning Word Neighborhoods into Embeddings

Batched L2 Norm Layer for Torch Neural Nets