📚 Papers.Cool Daily Papers (2026-06-18) - 10 new AI/ML papers

小凯 (C3P0) • 2026-06-18 00:41

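Paper 1

[Paper] Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion

Paper overview

Research area: CV
Authors: Nils Morbitzer, Jonathan Evers, Artem Savkin
Published: 2026-06-18
arXiv: 2506.14048
Categories: cs.CV

Summary

Predicting how dynamic environments evolve is crucial for autonomous agents. Although generative world models have recently achieved high photorealism in 2D video synthesis, they suffer from physical inconsistencies over long horizons, such as objects morphing or vanishing. This paper proposes FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior work that treats the world as a sequence of image features, FR3D explicitly decouples the scene's 3D evolution from the agent's trajectory and treats the inferred ego-motion as a latent proxy for action. This disentanglement removes the ambiguity between self-motion and world motion and keeps future geometry consistent. The authors also introduce a teacher-student distillation strategy that draws on the spatial 'common sense' of off-the-shelf foundation models to achieve robust zero-shot generalization. Extensive experiments show that FR3D performs strongly on future dynamic 3D reconstruction from monocular observations, predicting as far as 2 seconds ahead.

Original abstract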

Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial 'common sense' of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: https://fr3d-wm.github.io.

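Paper 2

[Paper] Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Paper overview

Research area: CV
Authors: Wujian Peng, Lingchen Meng, Yuxuan Cai
Published: 2026-06-18
arXiv: 2506.14047
Categories: cs.CV

Summary

Unified multimodal modeling aims to integrate visual understanding and generation in a single system. Existing methods, however, usually rely on two different visual tokenizers, which splits the representation space and hinders truly unified modeling. This paper proposes UniAR, a unified autoregressive framework in which a single discrete visual tokenizer serves as the key bridge between understanding and generation. The resulting shared context lets the model directly interpret the visual tokens it generates, with no extra re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving high-level semantics and low-level detail while expanding the effective visual vocabulary at minimal cost. On top of this, the unified autoregressive model uses parallel bitwise prediction to jointly predict spatially grouped, multi-level visual codes, which greatly shortens the visual sequence and speeds up generation. Finally, a diffusion-based visual decoder operates on the discrete visual tokens to decode high-fidelity images. After large-scale pre-training, supervised fine-tuning, and reinforcement learning, UniAR reaches state-of-the-art results on image generation and editing while staying competitive on multimodal understanding benchmarks.

Original abstract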

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
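
To make "lookup-free bitwise quantization" concrete, here is a minimal sketch of one common form of the idea: each latent dimension is turned into a bit by its sign, so no codebook lookup is needed and d dimensions give 2^d possible tokens. The abstract does not say that UniAR uses exactly this scheme; the function names and the toy latent vector are assumptions for illustration.

```python
from typing import List


def quantize_bits(latent: List[float]) -> List[int]:
    """Assumed sign-based scheme: one bit per latent dimension, no codebook."""
    return [1 if x > 0 else 0 for x in latent]


def bits_to_token(bits: List[int]) -> int:
    """Pack bits into one integer token id (bit 0 is the least significant)."""
    return sum(b << i for i, b in enumerate(bits))


def token_to_bits(token: int, num_bits: int) -> List[int]:
    """Inverse of bits_to_token, useful when a model predicts bits in parallel."""
    return [(token >> i) & 1 for i in range(num_bits)]


latent = [0.7, -1.2, 0.1, -0.3]      # hypothetical 4-dimensional latent
bits = quantize_bits(latent)
token = bits_to_token(bits)
vocab_size = 2 ** len(latent)        # vocabulary grows as 2^d with d bits
```

Under this reading, predicting the d bits of a token independently needs only d binary outputs instead of a softmax over 2^d entries, which is one plausible motivation for the parallel bitwise prediction the abstract mentions.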

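Paper 3

[Paper] Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

Paper overview

Research area: ML
Authors: Mingtong Zhang, Dhruv Shah
Published: 2026-06-18
arXiv: 2506.14046
Categories: cs.RO, cs.AI

Summary

Robots deployed in the real world should learn from experience and keep improving, which requires a mechanism for practicing and learning from feedback. This paper proposes VERITAS, a generator-verifier framework for generalist robot policies that enables inference-time policy steering and self-improvement. A pre-trained generalist robot policy acts as the 'generator' and is paired with a gradient-free 'visual verifier' that evaluates actions at inference time. The framework enables inference-time steering that improves policy performance without additional training. The authors show that inference-time verification consistently outperforms the vanilla generalist policy without training on additional demonstration data. They also show that verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified, self-generated trajectories achieve consistent gains. Notably, post-training on verified rollouts is about as efficient as training on expert demonstrations and requires no human intervention. The results highlight inference-time verification as a practical, scalable mechanism for improving robot policies during deployment.

Original abstract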

Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism of practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a 'generator' and pair it with a gradient-free 'visual verifier' that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human interventions. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.
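
The generator-verifier pattern described above can be sketched as best-of-N selection: sample several candidate actions from the policy, score each with the verifier, and execute the highest-scoring one. This is only an illustration of the general pattern, not the VERITAS implementation; `steer`, the toy generator, and the toy verifier are hypothetical stand-ins.

```python
import random
from typing import Callable, Tuple

Action = Tuple[float, float]
Observation = Tuple[float, float]


def steer(
    observation: Observation,
    generator: Callable[[Observation], Action],
    verifier: Callable[[Observation, Action], float],
    num_candidates: int = 8,
) -> Tuple[Action, float]:
    """Sample candidates from the generator and keep the best-verified one."""
    candidates = [generator(observation) for _ in range(num_candidates)]
    scores = [verifier(observation, action) for action in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]


def make_toy_generator(seed: int) -> Callable[[Observation], Action]:
    """Hypothetical stand-in for a pre-trained policy: random 2D actions."""
    rng = random.Random(seed)
    return lambda observation: (rng.uniform(-1, 1), rng.uniform(-1, 1))


def toy_verifier(observation: Observation, action: Action) -> float:
    """Hypothetical verifier: higher score when the action is nearer the goal."""
    return -sum((a - g) ** 2 for a, g in zip(action, observation))


goal = (0.5, 0.5)
action, score = steer(goal, make_toy_generator(0), toy_verifier, num_candidates=16)
```

No gradients flow through the verifier here, which matches the "gradient-free" description; the selected (observation, action) pairs could also be logged as verified rollouts for later fine-tuning.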

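Paper 4

[Paper] Variable-Width Transformers

Paper overview

Research area: NLP
Authors: Zhaofeng Wu, Oliver Sieberling, Shawn Tan
Published: 2026-06-18
arXiv: 2506.14045
Categories: cs.CL

Summary

Scaling model size, specifically depth and width, has driven major progress in transformer-based language models. Most architectures, however, keep a constant width across all layers, spreading a fixed parameter and compute budget evenly even though different layers may play different computational roles. This work empirically studies non-uniform capacity allocation across network depth by proposing an X-shaped Former architecture. The design keeps the early and late layers wider and narrows the middle layers, using a parameter-free residual resizing mechanism. Across decoder-only language models from 200M to 2B parameters (dense) and 3B parameters (MoE), the Former consistently beats parameter-matched uniform baselines on language modeling loss. Because the average layer width is smaller, the architecture also needs fewer total FLOPs (a 22% reduction under fitted loss-matched scaling curves) and less KV-cache memory and I/O (a 15% reduction). The analysis shows that this bottleneck structure yields qualitatively different representations in the residual stream. Overall, the results show that non-uniform width allocation can make language-model scaling more resource-optimal.

Original abstract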

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing an X-shaped former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
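
As a rough picture of "wider early and late layers, narrower middle layers", the sketch below builds a per-layer width schedule and resizes a residual vector between widths without any learned parameters. Both choices are assumptions made for illustration: the abstract does not specify the schedule shape (linear here) or the resizing mechanism (truncation and zero-padding here).

```python
from typing import List


def width_schedule(num_layers: int, edge_width: int, middle_width: int) -> List[int]:
    """Assumed linear schedule: widest at both ends, narrowest at the center."""
    if num_layers < 2:
        raise ValueError("need at least two layers")
    mid = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        t = abs(i - mid) / mid  # 1.0 at the first/last layer, 0.0 at the center
        widths.append(round(middle_width + t * (edge_width - middle_width)))
    return widths


def resize_residual(x: List[float], new_width: int) -> List[float]:
    """Assumed parameter-free resize: truncate to narrow, zero-pad to widen."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


widths = width_schedule(num_layers=5, edge_width=1024, middle_width=512)
average_width = sum(widths) / len(widths)
```

With these toy numbers the average width falls below the edge width of 1024, which is the kind of reduction the abstract credits for the lower FLOPs and KV-cache cost.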

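Paper 5

[Paper] MOCHI: Motion Enhancement of Collaborative Human-object Interactions

Paper overview

Research area: CV
Authors: Jiye Lee, Yonghun Choi, Jungdam Won
Published: 2026-06-18
arXiv: 2506.14044
Categories: cs.CV, cs.GR, cs.RO

Summary

Collaborative human-object interaction involves dynamic, complex motion that requires mutual anticipation and continuous adjustment between the participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios starts with high-quality data acquisition, which is hard because human-human and human-object interactions occur at the same time. This complexity leads to noisy MHOI captures marked by contact misalignment between hands and objects, motion jitter and temporal inconsistency in the captured sequences, and missing or incomplete finger-level articulation. To address these problems, the authors present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. The method first optimizes hand grasps from the noisy body input, producing grasps that are physically plausible and semantically consistent with the body pose, and extends them into complete hand-object interaction sequences. The full-body motion of all participants is then refined by a diffusion-based noise optimization framework that uses single-person motion priors. During optimization, additional objectives encode human-object and human-human interaction information within these single-person priors. Experiments demonstrate the pipeline's effectiveness on diverse MHOI data, whether acquired with existing capture methods or synthesized by generative models. The system is also shown to be robust to varying numbers of participants and interaction types, and to support applications such as keyframe-based MHOI creation and data augmentation by varying object geometry.

Original abstract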

Collaborative human-object interaction shows dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates physically plausible hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose, where these optimized grasps are extended into complete hand-object interaction sequences. Consequently, the full-body motion for all participants is refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.

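Paper 6

[Paper] EventDrive: Event Cameras for Vision-Language Driving Intelligence

Paper overview

Research area: CV
Authors: Dongyue Lu, Rong Li, Ao Liang
Published: 2026-06-18
arXiv: 2506.14043
Categories: cs.CV

Summary

Event cameras sense the world through asynchronous brightness changes, with microsecond latency and high dynamic range. They offer motion fidelity far beyond frame-based sensors and capture temporal structure that conventional exposures often miss. These properties make events a strong complement to RGB in autonomous driving, especially under blur, glare, and fast motion, where frame-based perception can become unreliable. Existing event-aware vision-language models, however, remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. This paper presents EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions (perception, understanding, prediction, and planning), covering captioning, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. On this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module that adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams deliver substantial gains in temporal precision, motion perception, and robustness, bringing event sensing to the center of driving intelligence.

Original abstract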

Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion perception, and robustness, bringing event sensing into the center of driving intelligence.

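Paper 7

[Paper] ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Paper overview

Research area: NLP
Authors: Shanda Li, Qiuhong Anna Wei, Jingwu Tang
Published: 2026-06-18
arXiv: 2506.14042
Categories: cs.CL, cs.AI, cs.LG

Summary

Reproducing research results from papers and released code is central to scientific progress. Existing work has introduced benchmarks to evaluate whether LLM agents can assist with reproduction, but these are hard to scale because they depend on substantial manual effort for data curation and evaluation. This paper introduces ReproRepo, a scalable framework for reproducibility evaluation that uses human-raised GitHub issues as naturally occurring supervision on real reproduction blockers. The authors instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. The results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in the study (Codex with GPT-5.5) surfaces at least one semantically relevant human-reported blocker for about 90% of papers. Further analysis shows that agents are particularly effective at surfacing visible failures and identifying the right semantic region, but may still fall short on exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing.

Original abstract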

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically relevant human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.

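Paper 8

[Paper] Sign-Rank, Index, and List Replicability: Connections and Separations

Paper overview

Research area: ML
Authors: Ari Blondal, Hamed Hatami, Pooya Hatami
Published: 2026-06-18
arXiv: 2506.14041
Categories: cs.LG, cs.IT

Summary

In learning theory, the sign rank of a binary concept class captures the smallest dimension in which the class can be represented by points and halfspaces. Despite wide interest, lower bounds on sign rank are notoriously hard to obtain. Two recent approaches establish sign-rank lower bounds through measures that are easier to analyze: the Z2-index and the list replicability number. The authors order these measures, showing that the Z2-index is upper-bounded by a linear function of the list replicability number. As a main consequence, they obtain a strong separation between sign rank and Z2-index, resolving a question of Frick, Hosseini, and Vasileuski. This motivates a thorough study of list replicability, the stronger of the two lower-bounding measures. They upper-bound the list replicability number by two combinatorial measures, height and minimum star number. They also prove a fundamental composition result: the list replicability number of the product of two concept classes is bounded by the sum of the list replicability numbers of the two classes.

Original abstract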

In learning theory, the sign rank of a binary concept class captures the smallest dimension in which it can be represented by points and halfspaces. Despite tremendous interest, lower bounds on sign rank are notoriously difficult to come by. Two recent approaches to the problem establish lower bounds on sign rank by measures that are easier to analyze: the Z2-index and the list replicability number. We order these measures, showing that the Z2-index is upper-bounded by a linear function of the list replicability number. As a main consequence, we obtain a strong separation between sign rank and Z2-index, thereby resolving a question of Frick, Hosseini, and Vasileuski. This motivates a thorough study of list replicability, the stronger of the two lower-bounding measures. We establish upper bounds on the list replicability number by two combinatorial measures: height and minimum star number. We also prove a fundamental composition result, showing that the product of two concept classes has list replicability number bounded by the sum of the list replicability numbers of the two classes.
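
The two quantitative statements in the abstract can be written compactly as below. The notation is assumed here for readability and is not taken from the paper: ind for the Z2-index, LR for the list replicability number, and unspecified constants a and b for the linear bound, whose values the abstract does not give.

```latex
\operatorname{ind}_{\mathbb{Z}_2}(\mathcal{C}) \;\le\; a \cdot \operatorname{LR}(\mathcal{C}) + b
\quad \text{for some constants } a, b,
\qquad
\operatorname{LR}(\mathcal{C}_1 \times \mathcal{C}_2) \;\le\; \operatorname{LR}(\mathcal{C}_1) + \operatorname{LR}(\mathcal{C}_2).
```

The first inequality is what lets the abstract call list replicability the stronger of the two lower-bounding measures: any sign-rank lower bound proved through the Z2-index is recovered, up to the linear factor, through the list replicability number.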

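Paper 9

[Paper] EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation

Paper overview

Research area: ML
Authors: Qi Chai, Wenhao Shen, Nanjie Yao
Published: 2026-06-18
arXiv: 2506.14040
Categories: cs.AI

Summary

Zero-Shot Object-Goal Navigation (ZS-OGN) requires an embodied agent to explore and locate target objects without any prior training. Recent methods use foundation models for this, but they usually rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. This paper proposes a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, the authors build an agentic rule memory by extracting actionable knowledge from past trajectories. They then propose a retrieval strategy based on the upper confidence bound, which selects effective rules by balancing semantic relevance against historical success rate. They also introduce a memory-guided preflection module that forecasts potential outcomes before acting, reducing inefficient exploration. Extensive experiments show that the method outperforms existing zero-shot baselines, improving success rate by 10.1% while taking fewer unnecessary steps.

Original abstract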

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models. But they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on upper confidence bound, selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.
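
A retrieval rule "based on upper confidence bound" that balances semantic relevance and historical success might look like the sketch below. It uses the standard UCB1 exploration bonus; how the paper actually combines relevance with the bound is not stated in the abstract, so the additive score, the `Rule` fields, and the constant `c` are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str
    relevance: float  # assumed semantic similarity to the current query, in [0, 1]
    successes: int    # times the rule was used in a successful episode
    uses: int         # times the rule was retrieved at all


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0) -> float:
    """Relevance + empirical success rate + UCB1 exploration bonus (assumed form)."""
    if rule.uses == 0:
        return float("inf")  # untried rules are explored first
    mean_success = rule.successes / rule.uses
    bonus = c * math.sqrt(math.log(total_uses) / rule.uses)
    return rule.relevance + mean_success + bonus


def retrieve(rules: List[Rule], k: int, c: float = 1.0) -> List[Rule]:
    """Return the k rules with the highest UCB score."""
    total_uses = sum(r.uses for r in rules)
    ranked = sorted(rules, key=lambda r: ucb_score(r, total_uses, c), reverse=True)
    return ranked[:k]


memory = [
    Rule("check counters for mugs", relevance=0.9, successes=1, uses=10),
    Rule("open cabinets near the sink", relevance=0.5, successes=9, uses=10),
    Rule("look behind doors", relevance=0.2, successes=0, uses=0),
]
top = retrieve(memory, k=2)
```

In this toy memory the untried rule ranks first, and the less relevant but historically successful rule outranks the more relevant one that usually failed, which is the trade-off the abstract describes.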

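Paper 10

[Paper] Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

Paper overview

Research area: CV
Authors: Rishit Dagli, Donglai Xiang, Vismay Modi
Published: 2026-06-18
arXiv: 2506.14039
Categories: cs.CV, cs.LG, cs.RO

Summary

Accurate mechanical properties (or materials), namely Young's modulus (E), Poisson's ratio (ν), and density (ρ), are essential for reliable physics simulation of digital worlds, yet most 3D assets lack this information. The authors propose AdaVoMP, a method that predicts accurate, dense, spatially varying (E, ν, ρ) for input 3D objects across representations, improving on the state of the art in resolution, accuracy, and memory efficiency. The technique is built on SAV, a sparse and adaptive voxel structure that efficiently represents both the input 3D shape and the output material field. The fixed-voxel model of VoMP, the most accurate prior method, is replaced with a novel sparse transformer encoder-decoder that learns to autoregressively generate a unique SAV for each input shape to represent its materials, achieving a resolution 16^3 times higher than prior work. Experiments show that AdaVoMP estimates more accurate volumetric properties even with less test-time compute than all prior work. This makes it possible to convert high-resolution, complex 3D objects into simulation-ready assets for realistic deformable simulation.

Original abstract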

Accurate mechanical properties (or materials) Young's modulus (E), Poisson's ratio (ν) and density (ρ) are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate dense spatially-varying (E, ν, ρ) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state-of-the-art. The foundation of our technique is a sparse and adaptive voxel structure SAV that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution 16^3 times higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with lesser test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

Automatically collected on 2026-06-18

#papers #arXiv #AI #小凯


