Monet: Reasoning in Latent Visual Space | A Revolutionary Breakthrough for AI Visual Reasoning in Latent Space

✨步子哥 · 2026-01-08T13:49:53+00:00

Results chart (bar chart, accuracy in %): Baseline (SFT+GRPO) scores 48.5 on standard reasoning tasks and 22.0 on OOD abstract reasoning; Monet (VLPO) scores 54.5 and 33.7 on the same two task groups.

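Joint team: Peking University | Kuaishou | MIT

Core Concept: An "Imagining Eye" Beyond Pixels

Monet aims to free multimodal large language models (MLLMs) from the clumsy "describe what you see" mode of operation and give them something like a human "mind's eye". Rather than stopping at pixel-level recognition, it carries out continuous mental simulation in a high-dimensional "latent visual space".

Manifold hypothesis: data in a high-dimensional space concentrates on a low-dimensional manifold. Like finding the one "oasis road" through a desert, Monet performs its "mental simulation" on that low-dimensional manifold, which sidesteps the curse of dimensionality.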
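The manifold hypothesis can be made concrete with a toy dataset. The sketch below is an illustration only and is not taken from Monet: it embeds a unit circle (a one-dimensional manifold) in a 50-dimensional space with a little noise, then measures how much of the total variance sits in the two coordinates that carry the circle. The function names and parameter values are made up for this example.

```python
import math
import random


def sample_circle_in_high_dim(n=500, dim=50, noise=0.01, seed=0):
    """Sample n points from a unit circle embedded in a dim-dimensional space."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        # Small Gaussian noise in every coordinate...
        point = [rng.gauss(0.0, noise) for _ in range(dim)]
        # ...plus the circle itself, which lives in the first two coordinates.
        point[0] += math.cos(angle)
        point[1] += math.sin(angle)
        points.append(point)
    return points


def variance_per_coordinate(points):
    """Population variance of each coordinate across all points."""
    n = len(points)
    dim = len(points[0])
    variances = []
    for j in range(dim):
        mean = sum(p[j] for p in points) / n
        variances.append(sum((p[j] - mean) ** 2 for p in points) / n)
    return variances


def fraction_in_first_two(points):
    """Share of total variance carried by the two 'manifold' coordinates."""
    variances = variance_per_coordinate(points)
    return sum(variances[:2]) / sum(variances)


if __name__ == "__main__":
    share = fraction_in_first_two(sample_circle_in_high_dim())
    print(f"variance share of the 2 manifold coordinates: {share:.3f}")
```

Although every point has 50 coordinates, almost all of the variation is explained by the two that describe the circle. That is the sense in which data can concentrate on a low-dimensional structure.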

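Core Technical Architecture

SFT + RL framework overview

SFT (distillation fine-tuning)

Stage 1: Warm up by adapting to interleaved image-text reasoning

Stage 2: Obtain high-quality target latent embeddings

Stage 3: Generate embeddings autonomously, without auxiliary images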
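The post does not say how the Stage 2 target embeddings are used to supervise the embeddings generated in Stage 3. One common choice in distillation is to pull each student embedding toward its target with a cosine-based alignment loss. The sketch below assumes that choice purely for illustration. `cosine_similarity` and `alignment_loss` are hypothetical helpers, not functions from the Monet codebase.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(student_latents, target_latents):
    """Mean (1 - cosine similarity) over paired latent embeddings.

    student_latents: embeddings the model produces without auxiliary images.
    target_latents:  embeddings obtained earlier with auxiliary images
                     (assumed fixed, i.e. treated as constants).
    """
    sims = [
        cosine_similarity(s, t)
        for s, t in zip(student_latents, target_latents)
    ]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    targets = [[1.0, 0.0], [0.0, 1.0]]
    print(alignment_loss(targets, targets))           # perfectly aligned
    print(alignment_loss([[1.0, 0.0]], [[0.0, 1.0]]))  # orthogonal
```

The loss is zero when every student embedding points in the same direction as its target and grows as they diverge. A mean-squared-error loss would be an equally plausible assumption.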

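VLPO (policy optimization)

Incorporates continuous latent variables into the reinforcement-learning policy gradient, so that "visual intuition" is optimized directly from the reward signal.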
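To show what a policy gradient over a continuous latent looks like, here is a minimal sketch. It is a generic REINFORCE-style update with a Gaussian policy and a mean-reward baseline, not the VLPO algorithm itself, whose details the post does not give. The toy reward, the target vector, and all hyperparameters are assumptions made for this example.

```python
import math
import random


def reward(latent, target):
    """Toy reward: negative squared distance to a hidden target latent."""
    return -sum((z - t) ** 2 for z, t in zip(latent, target))


def train_latent_policy(target, sigma=0.5, lr=0.05, batch=64, steps=200, seed=0):
    """Optimize the mean of a Gaussian policy over latents from rewards alone.

    Returns (initial_distance, final_distance) between the mean and target.
    """
    rng = random.Random(seed)
    dim = len(target)
    mean = [0.0] * dim
    initial_distance = math.dist(mean, target)
    for _ in range(steps):
        # Sample continuous latents z ~ N(mean, sigma^2 I).
        samples = [[rng.gauss(m, sigma) for m in mean] for _ in range(batch)]
        rewards = [reward(z, target) for z in samples]
        baseline = sum(rewards) / batch  # reduces gradient variance
        grad = [0.0] * dim
        for z, r in zip(samples, rewards):
            for j in range(dim):
                # d/d mean_j of log N(z | mean, sigma^2 I) = (z_j - mean_j) / sigma^2
                grad[j] += (r - baseline) * (z[j] - mean[j]) / sigma ** 2
        # Gradient ascent on the expected reward.
        mean = [m + lr * g / batch for m, g in zip(mean, grad)]
    return initial_distance, math.dist(mean, target)


if __name__ == "__main__":
    start, end = train_latent_policy([1.0, -1.0, 0.5, 2.0])
    print(f"distance to target: {start:.3f} -> {end:.3f}")
```

The policy never sees the target directly. It only receives scalar rewards for sampled latents, yet its mean drifts toward the high-reward region, which is the basic mechanism the description above relies on.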

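Experimental Results and Performance

Monet significantly outperforms baseline models (such as GPT-4V) on both standard reasoning tasks and out-of-distribution (OOD) abstract tasks.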
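The size of the improvement can be worked out from the accuracy values in the chart data that accompanies this post, where the baseline is labeled SFT+GRPO. The short calculation below uses those four numbers as given. The dictionary keys and the `gains` helper are names chosen for this example.

```python
# Accuracy (%) as (baseline SFT+GRPO, Monet VLPO), from the post's chart data.
results = {
    "standard reasoning": (48.5, 54.5),
    "OOD abstract reasoning": (22.0, 33.7),
}


def gains(baseline, monet):
    """Return (absolute gain in points, relative gain in %), rounded to 0.1."""
    absolute = round(monet - baseline, 1)
    relative = round(100.0 * (monet - baseline) / baseline, 1)
    return absolute, relative


if __name__ == "__main__":
    for task, (baseline, monet) in results.items():
        absolute, relative = gains(baseline, monet)
        print(f"{task}: +{absolute} points ({relative}% relative)")
```

On these figures the gain is 6.0 points on standard reasoning and 11.7 points on OOD abstract reasoning, so the larger improvement is on the out-of-distribution tasks.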

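Future Outlook and Applications

Disaster-response robotics: simulate complex environments and plan safe paths

Medical prediction: simulate disease progression to support diagnosis and treatment decisions

Once machines have a "mental model", they will be able to rehearse the consequences of their actions in their heads as humans do, opening a new chapter for AI applications in the physical world.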
