The Quiet Revolution in Tabular Data: From AI's Weak Spot to Tsinghua's Lightweight Sword

Benchmark	Task (Metric)	LimiX-16M	LimiX-2M	XGBoost	CatBoost	AutoGluon	TabPFN-v2
BCCO-CLS	Classification (AUC)	0.871	0.855	0.829	0.822	0.846	0.843
OpenML-CC18	Classification (Accuracy)	0.892	0.878	0.851	0.845	0.867	0.862
BCCO-REG	Regression (R²)	0.794	0.772	0.764	0.758	0.781	0.777
TALENT-REG	Regression (RMSE, lower is better)	0.386	0.402	0.415	0.421	0.398	0.399
TableShift	OOD Generalization (AUC)	0.806	0.792	0.793	0.793	0.797	0.797
Early Diabetes	Imputation (Accuracy)	0.915	0.902	N/A	N/A	0.889 (HyperImpute)	N/A

Introduction

LimiX is the first installment of the LDM (Large Data Model) series designed to bring foundation model capabilities to structured data. It represents a breakthrough in achieving true generality in structured data processing, similar to how LLMs have revolutionized natural language processing.

Traditional approaches require task-specific training for each new dataset or task, creating inefficiency and limiting accessibility. LimiX addresses this challenge by providing a unified foundation-style approach to tabular learning that can handle multiple tasks with a single model.

Architecture

LimiX adopts a transformer architecture optimized for structured data modeling and task generalization. The model processes structured data through several key components:

Features & Targets → Embedding Layer → Dual Attention (Sample & Feature) → Task Heads

Embedding: Features X and targets Y from the prior knowledge base are embedded into token representations
Dual Attention: Attention mechanisms are applied across both sample and feature dimensions to identify salient patterns
Task Heads: High-dimensional representations are passed to regression and classification heads for diverse predictive tasks
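
The three stages above can be illustrated with a minimal NumPy sketch. It is not the LimiX implementation: the per-column linear embedding, the single attention head with identity projections, the residual connections, and the names `embed`, `self_attention`, and `dual_attention_block` are all assumptions made for illustration. The only idea taken from the text is that attention runs once across the columns of each row and once across the rows of each column.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Scaled dot-product self-attention over the second-to-last axis.
    # tokens: (..., n, d). Identity Q/K/V projections are an assumption for brevity.
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., n, n)
    return softmax(scores, axis=-1) @ tokens

def embed(table, d, rng):
    # Hypothetical embedding: each scalar cell scales a per-column vector, plus a column bias.
    n_cols = table.shape[1]
    w = rng.normal(size=(n_cols, d))
    b = rng.normal(size=(n_cols, d))
    return table[:, :, None] * w[None] + b[None]  # (n_rows, n_cols, d)

def dual_attention_block(cells):
    # Feature attention: within each row, column tokens attend to one another.
    cells = cells + self_attention(cells)
    # Sample attention: within each column, row tokens attend to one another.
    by_col = np.swapaxes(cells, 0, 1)              # (n_cols, n_rows, d)
    by_col = by_col + self_attention(by_col)
    return np.swapaxes(by_col, 0, 1)               # back to (n_rows, n_cols, d)

rng = np.random.default_rng(0)
table = rng.normal(size=(6, 4))        # 6 rows; 3 feature columns plus 1 target column
cells = embed(table, d=8, rng=rng)
out = dual_attention_block(cells)

# Toy regression head (random weights) reading the target-column token of every row.
head = rng.normal(size=(8,))
pred = out[:, -1, :] @ head
print(out.shape, pred.shape)
```

Swapping the first two axes lets one attention function serve both directions, which is why the sketch needs no separate code path for sample attention.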

Capabilities

LimiX can address a wide range of tabular tasks through query-based conditional prediction via a single model, supporting rapid, training-free adaptation at inference.

Classification

Regression

Missing-value Imputation

Feature Selection

Sample Selection

Causal Inference

The model treats structured data as a joint distribution over variables and missingness, enabling it to handle diverse tasks without task-specific architectures or bespoke training per task.
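
One way to picture this framing is to express every task as a mask over table cells: prediction hides the target cell of the query rows, while imputation queries whichever cells are already missing. The sketch below is only an illustration of that idea. The helper names `make_query` and `fill_masked` are hypothetical, and the column-mean filler is a deliberately trivial stand-in for a learned model, not LimiX's predictor.

```python
import numpy as np

def make_query(table, hidden):
    # A cell is observed if it is neither hidden by the task nor missing in the data.
    observed = ~hidden & ~np.isnan(table)
    values = np.where(observed, table, 0.0)   # zero out unobserved cells
    return values, observed

def fill_masked(values, observed):
    # Stand-in predictor (NOT LimiX): fill each unobserved cell with the
    # mean of the observed cells in the same column.
    counts = observed.sum(axis=0)
    sums = (values * observed).sum(axis=0)
    col_mean = sums / np.maximum(counts, 1)
    return np.where(observed, values, col_mean[None, :])

# Last column is the target; one feature value is genuinely missing.
table = np.array([
    [1.0, 2.0,    0.0],
    [3.0, np.nan, 1.0],
    [5.0, 6.0,    1.0],
    [7.0, 8.0,    1.0],   # query row for the prediction task
])

# Prediction task: hide the target cell of the query row.
hidden_pred = np.zeros_like(table, dtype=bool)
hidden_pred[3, 2] = True

# Imputation task: hide nothing extra; the NaN cell is the query.
hidden_imp = np.zeros_like(table, dtype=bool)

for name, hidden in [("prediction", hidden_pred), ("imputation", hidden_imp)]:
    values, observed = make_query(table, hidden)
    filled = fill_masked(values, observed)
    print(name, "queried cells:", np.argwhere(~observed).tolist())
```

Both tasks pass through the same two functions and differ only in the mask, which mirrors the single-model, query-based design the text describes.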

Model Variants

LimiX is available in two variants to accommodate different computational requirements:

LimiX-16M

Parameters: 16 million

Performance: State-of-the-art results

Use Case: Maximum accuracy requirements

LimiX-2M

Parameters: 2 million

Performance: Competitive with larger models

Use Case: Resource-constrained environments

LimiX-2M uses significantly less GPU memory and runs inference faster while maintaining strong performance, which makes it suitable for deployment on consumer-grade hardware such as an RTX 4090.
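
For a rough sense of scale, the weight footprint of each variant follows directly from the parameter counts above. The snippet assumes 4 bytes per parameter for fp32 and 2 bytes for fp16. It covers weights only: activation memory during inference depends on the size of the table and is not estimated here, and the source gives no figures for it.

```python
def weight_megabytes(n_params, bytes_per_param):
    # Weights only; activation memory during inference is not included.
    return n_params * bytes_per_param / 1e6

variants = {"LimiX-16M": 16_000_000, "LimiX-2M": 2_000_000}

for name, n_params in variants.items():
    fp32 = weight_megabytes(n_params, 4)
    fp16 = weight_megabytes(n_params, 2)
    print(f"{name}: {fp32:.0f} MB in fp32, {fp16:.0f} MB in fp16")
```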

Performance

LimiX has been evaluated on 11 large structured-data benchmarks that span a wide range of sample sizes, feature dimensionalities, class counts, categorical-to-numerical feature ratios, missingness levels, and sample-to-feature ratios.

Key Results:

LimiX-16M achieved SOTA in 58.6% of classification datasets
Combined LimiX family achieved 68.9% win rate in classification
Combined LimiX family achieved 62% win rate in regression
Outperformed traditional methods (XGBoost, CatBoost)
Surpassed specialized deep learning approaches
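
A win rate of this kind is the share of datasets on which a model attains the best score. The sketch below shows the computation using the five comparable rows of the benchmark table at the top of this post as stand-in "datasets". It does not reproduce the 58.6%, 68.9%, or 62% figures, which are computed over per-dataset results that are not listed here. The function name `win_rate` and the rule that ties count as wins are assumptions.

```python
models = ["LimiX-16M", "LimiX-2M", "XGBoost", "CatBoost", "AutoGluon", "TabPFN-v2"]

# (direction, scores in the order of `models`); "min" marks lower-is-better metrics.
rows = {
    "BCCO-CLS":    ("max", [0.871, 0.855, 0.829, 0.822, 0.846, 0.843]),
    "OpenML-CC18": ("max", [0.892, 0.878, 0.851, 0.845, 0.867, 0.862]),
    "BCCO-REG":    ("max", [0.794, 0.772, 0.764, 0.758, 0.781, 0.777]),
    "TALENT-REG":  ("min", [0.386, 0.402, 0.415, 0.421, 0.398, 0.399]),  # RMSE
    "TableShift":  ("max", [0.806, 0.792, 0.793, 0.793, 0.797, 0.797]),
}

def win_rate(rows, models, candidates):
    # Fraction of rows where at least one candidate attains the best score.
    wins = 0
    for direction, values in rows.values():
        best = max(values) if direction == "max" else min(values)
        winners = {m for m, v in zip(models, values) if v == best}
        wins += bool(winners & set(candidates))
    return wins / len(rows)

print("LimiX-16M:", win_rate(rows, models, ["LimiX-16M"]))
print("LimiX family:", win_rate(rows, models, ["LimiX-16M", "LimiX-2M"]))
print("XGBoost:", win_rate(rows, models, ["XGBoost"]))
```

Passing several candidates gives the "combined family" variant of the metric: a dataset counts as a win if any member of the family is best.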

Performance Highlights:

Superior performance across classification, regression, and missing value imputation
Consistent advantages across diverse data characteristics
Strong performance even with limited fine-tuning
Excellent zero-shot capabilities without task-specific training

[Performance comparison chart showing LimiX outperforming traditional methods]

Implications

LimiX represents a significant step toward generalist intelligence for structured data, with several important implications:

Advances the shift from bespoke pipelines to unified foundation models for tabular data
Provides a complementary approach to language and physical world models in the path to AGI
Enables rapid development without task-specific architectures or bespoke training
Democratizes access to high-performance structured data modeling
Opens new research directions in scaling laws for structured data models

Resources

GitHub: github.com/limix-ldm/LimiX

Technical Report: arxiv.org/abs/2509.03505

Project Website: www.limix.ai

License: Apache 2.0


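🔍 AI's "Table Phobia": Why Does Deep Learning Stumble Here?

🌟 The Birth of LimiX: The "Causal Magic" of Cui Peng's Team at Tsinghua

⚡ Unlocking the Causal Chain: How LimiX "Reads the Mind" of a Table

🛡️ Guardian of Robustness: LimiX's Steady Footwork in a Storm of Noise

🚀 Industrial Dawn and Future Blueprint: How LimiX Ignites Countless Applications

🎭 The Fireworks of Controversy and the Wisdom of Hybrids: LimiX's "Double-Edged Sword"

🌈 Epilogue: The Poetry of Tables and AI's Endless Verses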


LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

The First Large Structured-Data Model (LDM) for Generalist Intelligence
