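Paper Overview
- Field: CV
- Authors: Ashwat Rajbhandari, Bharatesh Chakravarthi
Summary (translated from the Chinese)
This paper studies how to adapt large-scale vision-language models to extreme far-distance video person re-identification. Starting from a CLIP baseline, the authors upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. For noisy, low-resolution tracklets, they design a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. Experiments on the DetReIDX stress-test benchmark show that the method reaches mAP scores of 46.69, 41.23, and 22.98 on the A2G, G2A, and A2A tasks respectively, with an overall mAP of 35.73.
Original Abstract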
Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results show that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
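Illustrative sketch
The abstract mentions temporal attention pooling over tracklets but gives no details. The NumPy sketch below shows one common way such pooling can work: each frame embedding is scored, the scores are turned into softmax weights, and the tracklet descriptor is the weighted sum, so low-scoring (presumably degraded) frames contribute less. The function name `temporal_attention_pool`, the scoring vector `w`, and the `temperature` parameter are assumptions made for illustration, not the authors' design, and in practice `w` would be learned rather than fixed.

```python
import numpy as np


def temporal_attention_pool(frames, w, temperature=1.0):
    """Pool a (T, D) tracklet of frame embeddings into one (D,) descriptor.

    Hypothetical design: a single scoring vector `w` of shape (D,) rates
    each frame; softmax over the T scores gives the attention weights.
    """
    frames = np.asarray(frames, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)

    scores = frames @ w / temperature   # one scalar score per frame, shape (T,)
    scores = scores - scores.max()      # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()   # softmax over the temporal axis

    pooled = weights @ frames           # weighted sum of frames, shape (D,)
    return pooled, weights


# Usage with random stand-in data (not real tracklet features).
rng = np.random.default_rng(0)
tracklet = rng.normal(size=(8, 16))     # 8 frames, 16-dim embeddings
w = rng.normal(size=16)                 # stand-in for a learned scoring vector
descriptor, attn = temporal_attention_pool(tracklet, w)
```

With `w` set to all zeros the weights become uniform and the result reduces to plain average pooling, which makes the attention variant easy to compare against a mean-pooling baseline.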