# Deep Dive: XGBoost for Regression
## Introduction to XGBoost
**XGBoost** (eXtreme Gradient Boosting) is an efficient implementation of gradient-boosted decision trees that performs exceptionally well in data science competitions and industrial applications.
### Why Choose XGBoost?
```
✅ High accuracy: the power of ensemble learning
✅ Regularization: built-in L1/L2 regularization to curb overfitting
✅ Missing values: learns how to route missing values automatically
✅ Parallel computation: fast training
✅ Built-in cross-validation: convenient for model tuning
```
---
## How XGBoost Regression Works
### The Gradient Boosting Idea
```
1. Initialize: predict a single constant for all samples
2. Iterate: each round trains a new tree to fit the residuals (actual - current prediction)
3. Combine: sum the predictions of all trees for the final output
prediction = initial value + tree_1's prediction + tree_2's prediction + ...
```
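The loop above is easy to see in miniature. Below is a minimal, illustrative sketch of squared-error gradient boosting built from scikit-learn decision trees; it is not XGBoost's actual implementation (which adds regularization, second-order gradients, and many systems-level optimizations), just the additive residual-fitting skeleton:
```python
# Minimal gradient-boosting sketch for squared error (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_rounds=50, learning_rate=0.1, max_depth=3):
    base = float(np.mean(y))               # 1. initialize with a constant
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):              # 2. each round fits the residuals
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def boost_predict(X, base, trees, learning_rate=0.1):
    # 3. prediction = initial value + lr * (tree_1(x) + tree_2(x) + ...)
    return base + learning_rate * sum(t.predict(X) for t in trees)
```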
### Objective Function
```
Obj = Σ L(y_i, ŷ_i) + Σ Ω(f_k)
where:
- L: the loss function (squared error is common for regression)
- Ω: the regularization term controlling model complexity;
  in XGBoost, Ω(f) = γT + (1/2)λ Σ w_j²  (T = number of leaves, w_j = leaf weights)
```
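How the trees themselves are grown follows from this objective. Per the XGBoost paper, each round approximates L with a second-order Taylor expansion around the current prediction, which yields a closed-form optimal leaf weight and a score for candidate splits (λ and γ are the regularization constants from Ω above):
```
g_i = ∂L/∂ŷ_i,   h_i = ∂²L/∂ŷ_i²            (gradients at the current prediction)
For leaf j with instance set I_j:   G_j = Σ g_i,   H_j = Σ h_i   (i ∈ I_j)
Optimal leaf weight:   w_j* = -G_j / (H_j + λ)
Split gain = 1/2 [ G_L²/(H_L+λ) + G_R²/(H_R+λ) - (G_L+G_R)²/(H_L+H_R+λ) ] - γ
For squared error, g_i = ŷ_i - y_i and h_i = 1, recovering the residual-fitting view above.
```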
---
## Python Implementation: Stock Price Prediction
### 1. Data Preparation
```python
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def prepare_features(data):
    """Build the feature set"""
    df = data.copy()
    # Price features
    df['returns'] = df['close'].pct_change()
    df['log_returns'] = np.log(df['close'] / df['close'].shift(1))
    # Rolling statistics
    for window in [5, 10, 20, 60]:
        df[f'ma{window}'] = df['close'].rolling(window).mean()
        df[f'std{window}'] = df['returns'].rolling(window).std()
        df[f'momentum{window}'] = df['close'] / df['close'].shift(window) - 1
    # Technical indicators
    # RSI (simple-moving-average variant)
    delta = df['close'].diff()
    gain = delta.where(delta > 0, 0).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    df['rsi'] = 100 - (100 / (1 + gain / loss))
    # MACD
    df['ema12'] = df['close'].ewm(span=12).mean()
    df['ema26'] = df['close'].ewm(span=26).mean()
    df['macd'] = df['ema12'] - df['ema26']
    df['macd_signal'] = df['macd'].ewm(span=9).mean()
    # Volume features
    df['volume_ma5'] = df['volume'].rolling(5).mean()
    df['volume_ratio'] = df['volume'] / df['volume_ma5']
    # Label: the price N days ahead
    df['target'] = df['close'].shift(-5)  # predict the price 5 days out
    return df.dropna()

# Usage example
# data = pd.read_csv('stock_data.csv')
# df = prepare_features(data)
```
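If you don't have a `stock_data.csv` on hand, a synthetic substitute is enough to exercise the pipeline end to end. The helper below is a hypothetical stand-in of mine (a geometric random walk, not real market data); it produces only the `close` and `volume` columns that `prepare_features` actually uses:
```python
# Hypothetical helper: synthetic price/volume data for smoke-testing the pipeline.
def make_synthetic_data(n_days=1000, seed=42):
    rng = np.random.default_rng(seed)
    log_returns = rng.normal(loc=0.0002, scale=0.01, size=n_days)
    close = 100 * np.exp(np.cumsum(log_returns))   # geometric random walk
    volume = rng.integers(1_000_000, 5_000_000, size=n_days).astype(float)
    return pd.DataFrame({'close': close, 'volume': volume})

# data = make_synthetic_data()
# df = prepare_features(data)
```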
### 2. Building the Model
```python
class XGBoostRegressor:
    """XGBoost regression model"""
    def __init__(self,
                 n_estimators=100,
                 max_depth=6,
                 learning_rate=0.1,
                 subsample=0.8,
                 colsample_bytree=0.8):
        """
        Parameters:
        - n_estimators: number of trees
        - max_depth: maximum tree depth
        - learning_rate: learning rate (step size)
        - subsample: fraction of samples used per tree
        - colsample_bytree: fraction of features used per tree
        """
        self.model = xgb.XGBRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=learning_rate,
            subsample=subsample,
            colsample_bytree=colsample_bytree,
            # Since XGBoost 1.6, early stopping is configured on the estimator;
            # XGBoost 2.0 removed the early_stopping_rounds argument from fit().
            early_stopping_rounds=20,
            eval_metric='rmse',
            random_state=42,
            n_jobs=-1
        )
        self.feature_names = None

    def prepare_data(self, df, test_size=0.2):
        """Prepare training and test data"""
        # Feature columns (exclude the target and non-feature columns)
        exclude_cols = ['target', 'date', 'code', 'open', 'high', 'low', 'close', 'volume']
        self.feature_names = [col for col in df.columns if col not in exclude_cols]
        X = df[self.feature_names]
        y = df['target']
        # Time-ordered split (random splitting would leak future information)
        split_idx = int(len(df) * (1 - test_size))
        X_train, X_test = X[:split_idx], X[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]
        return X_train, X_test, y_train, y_test

    def train(self, X_train, y_train, X_val=None, y_val=None):
        """Train the model"""
        eval_set = [(X_train, y_train)]
        if X_val is not None and y_val is not None:
            eval_set.append((X_val, y_val))
        self.model.fit(
            X_train, y_train,
            eval_set=eval_set,
            verbose=False
        )
        print(f"Best iteration: {self.model.best_iteration}")
        return self

    def predict(self, X):
        """Predict"""
        return self.model.predict(X)

    def evaluate(self, X_test, y_test):
        """Evaluate the model"""
        y_pred = self.predict(X_test)
        metrics = {
            'MSE': mean_squared_error(y_test, y_pred),
            'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
            'MAE': mean_absolute_error(y_test, y_pred),
            'R2': r2_score(y_test, y_pred)
        }
        return metrics, y_pred

    def get_feature_importance(self):
        """Get feature importances"""
        importance = pd.DataFrame({
            'feature': self.feature_names,
            'importance': self.model.feature_importances_
        }).sort_values('importance', ascending=False)
        return importance

# Usage example
# xgb_model = XGBoostRegressor(n_estimators=100, max_depth=6)
# X_train, X_test, y_train, y_test = xgb_model.prepare_data(df)
# xgb_model.train(X_train, y_train, X_test, y_test)
# metrics, predictions = xgb_model.evaluate(X_test, y_test)
```
### 3. Full Training Pipeline
```python
def train_xgboost_model(data):
    """Full training pipeline"""
    # Prepare the data
    df = prepare_features(data)
    # Create the model
    model = XGBoostRegressor(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8
    )
    # Prepare training data
    X_train, X_test, y_train, y_test = model.prepare_data(df, test_size=0.2)
    # Carve a validation set out of the end of the training data
    val_size = int(len(X_train) * 0.2)
    X_train_final, X_val = X_train[:-val_size], X_train[-val_size:]
    y_train_final, y_val = y_train[:-val_size], y_train[-val_size:]
    # Train
    model.train(X_train_final, y_train_final, X_val, y_val)
    # Evaluate
    metrics, predictions = model.evaluate(X_test, y_test)
    print("\n" + "="*50)
    print("Model evaluation metrics:")
    print("="*50)
    for key, value in metrics.items():
        print(f"{key}: {value:.4f}")
    # Feature importance
    importance = model.get_feature_importance()
    print("\n" + "="*50)
    print("Top 10 feature importances:")
    print("="*50)
    print(importance.head(10))
    return model, predictions, y_test

# Run
# model, predictions, actual = train_xgboost_model(data)
```
---
## Hyperparameter Tuning
### Key Parameters
```python
param_grid = {
    'n_estimators': [100, 200, 300],       # number of trees
    'max_depth': [3, 5, 7, 9],             # tree depth
    'learning_rate': [0.01, 0.05, 0.1],    # learning rate
    'subsample': [0.6, 0.8, 1.0],          # row sampling fraction
    'colsample_bytree': [0.6, 0.8, 1.0],   # feature sampling fraction
    'min_child_weight': [1, 3, 5],         # minimum child (leaf) weight
    'gamma': [0, 0.1, 0.2],                # minimum loss reduction to split
    'reg_alpha': [0, 0.1, 1],              # L1 regularization
    'reg_lambda': [1, 5, 10]               # L2 regularization
}
```
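For scale: the full grid above has 3 × 4 × 3 × 3 × 3 × 3 × 3 × 3 × 3 = 26,244 combinations, i.e. over 130,000 model fits with 5-fold cross-validation. That is why the grid-search example below uses a reduced grid, and why randomized search (further below) is usually the more practical option.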
### Grid Search
```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
def tune_xgboost(X_train, y_train):
    """Hyperparameter tuning"""
    # Base model
    xgb_model = xgb.XGBRegressor(random_state=42, n_jobs=-1)
    # Parameter grid (reduced; it can be much larger in practice)
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [4, 6, 8],
        'learning_rate': [0.01, 0.05, 0.1],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }
    # Time-series cross-validation
    tscv = TimeSeriesSplit(n_splits=5)
    # Grid search
    grid_search = GridSearchCV(
        xgb_model,
        param_grid,
        cv=tscv,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=1
    )
    grid_search.fit(X_train, y_train)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best RMSE: {np.sqrt(-grid_search.best_score_):.4f}")
    return grid_search.best_estimator_

# Usage
# best_model = tune_xgboost(X_train, y_train)
```
### Randomized Search (Faster)
```python
from sklearn.model_selection import RandomizedSearchCV
def random_search_xgboost(X_train, y_train, n_iter=50):
    """Randomized search"""
    xgb_model = xgb.XGBRegressor(random_state=42, n_jobs=-1)
    param_distributions = {
        'n_estimators': [100, 200, 300, 500],
        'max_depth': [3, 5, 7, 9, 12],
        'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.2],
        'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
        'min_child_weight': [1, 3, 5, 7],
        'gamma': [0, 0.1, 0.2, 0.5],
        'reg_alpha': [0, 0.1, 1, 10],
        'reg_lambda': [1, 5, 10, 20]
    }
    random_search = RandomizedSearchCV(
        xgb_model,
        param_distributions,
        n_iter=n_iter,
        cv=TimeSeriesSplit(n_splits=5),
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=1,
        random_state=42
    )
    random_search.fit(X_train, y_train)
    return random_search.best_estimator_
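
# Usage
# best_model = random_search_xgboost(X_train, y_train, n_iter=50)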
```
---
## Visualization
### Plotting Predictions
```python
import matplotlib.pyplot as plt
def plot_predictions(actual, predicted, title='XGBoost Predictions'):
    """Plot predictions against actual values"""
    plt.figure(figsize=(14, 6))
    # Actual vs. predicted over time
    plt.subplot(1, 2, 1)
    plt.plot(actual.values, label='Actual', alpha=0.7)
    plt.plot(predicted, label='Predicted', alpha=0.7)
    plt.title(title)
    plt.xlabel('Sample')
    plt.ylabel('Price')
    plt.legend()
    plt.grid(True, alpha=0.3)
    # Scatter plot
    plt.subplot(1, 2, 2)
    plt.scatter(actual, predicted, alpha=0.5)
    plt.plot([actual.min(), actual.max()], [actual.min(), actual.max()],
             'r--', lw=2, label='Perfect prediction')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title('Predicted vs. Actual')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Usage
# plot_predictions(y_test, predictions)
```
### Plotting Feature Importance
```python
def plot_feature_importance(model, top_n=15):
    """Plot feature importances"""
    importance = model.get_feature_importance().head(top_n)
    plt.figure(figsize=(10, 6))
    plt.barh(importance['feature'], importance['importance'])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.title('Feature Importance Ranking')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

# Usage
# plot_feature_importance(xgb_model)
```
### Learning Curve
```python
def plot_learning_curve(model):
    """Plot the learning curve"""
    # evals_result() returns the RMSE recorded for each eval set during fit
    results = model.model.evals_result()
    plt.figure(figsize=(10, 6))
    plt.plot(results['validation_0']['rmse'], label='Train')
    if 'validation_1' in results:
        plt.plot(results['validation_1']['rmse'], label='Validation')
    plt.xlabel('Boosting round')
    plt.ylabel('RMSE')
    plt.title('Learning Curve')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
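
# Usage (train with a validation set to get the second curve)
# plot_learning_curve(xgb_model)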
```
---
## Practical Tips
### 1. Preventing Overfitting
```python
# Method 1: lower the learning rate and grow more trees
model = xgb.XGBRegressor(
    learning_rate=0.01,         # lower learning rate
    n_estimators=1000,          # more trees
    early_stopping_rounds=50    # early stopping (set on the estimator, XGBoost >= 1.6)
)
# Method 2: increase regularization
model = xgb.XGBRegressor(
    reg_alpha=0.1,    # L1 regularization
    reg_lambda=1.0,   # L2 regularization
    gamma=0.1         # minimum loss reduction to split
)
# Method 3: reduce tree complexity
model = xgb.XGBRegressor(
    max_depth=4,         # shallower trees
    min_child_weight=5   # larger minimum child weight
)
```
### 2. Handling Time Series
```python
# Rolling (walk-forward) prediction
def rolling_forecast(model, data, feature_cols, window=252, horizon=5):
    """Rolling forecast; feature_cols is the list of feature column names"""
    predictions = []
    actuals = []
    for i in range(window, len(data) - horizon):
        # Training window; drop the last `horizon` rows, whose labels
        # (close shifted by -horizon) would peek past the prediction date
        train_data = data.iloc[i-window:i-horizon]
        # Features and labels
        X_train = train_data[feature_cols]
        y_train = train_data['target']
        # Train
        model.fit(X_train, y_train)
        # Predict
        X_test = data.iloc[i:i+1][feature_cols]
        pred = model.predict(X_test)[0]
        predictions.append(pred)
        actuals.append(data.iloc[i+horizon]['close'])
    return np.array(predictions), np.array(actuals)
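
# Usage (feature_cols from prepare_data, e.g. xgb_model.feature_names;
# a plain estimator without early stopping keeps each refit simple)
# fast_model = xgb.XGBRegressor(n_estimators=100, max_depth=4, n_jobs=-1)
# preds, actuals = rolling_forecast(fast_model, df, feature_cols)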
```
### 3. Feature Engineering
```python
# Lag features
for lag in [1, 2, 3, 5, 10]:
    df[f'close_lag{lag}'] = df['close'].shift(lag)
    df[f'returns_lag{lag}'] = df['returns'].shift(lag)
# Rolling statistics
for window in [5, 10, 20]:
    df[f'close_rolling_mean{window}'] = df['close'].rolling(window).mean()
    df[f'close_rolling_std{window}'] = df['close'].rolling(window).std()
```
---
## Comparison with Other Regression Models
### Comparison Code
```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def compare_models(X_train, X_test, y_train, y_test):
    """Compare several regression models"""
    models = {
        'Linear Regression': LinearRegression(),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
        'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42)
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = {
            'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
            'MAE': mean_absolute_error(y_test, y_pred),
            'R2': r2_score(y_test, y_pred)
        }
    return pd.DataFrame(results).T

# Usage
# results = compare_models(X_train, X_test, y_train, y_test)
# print(results)
```
### Example Comparison Results (illustrative)
```
RMSE MAE R2
Linear Regression 2.35 1.82 0.65
Random Forest 1.89 1.45 0.78
XGBoost 1.62 1.23 0.84
```
---
## Summary
### Strengths of XGBoost Regression
```
✅ High predictive accuracy
✅ Captures feature interactions automatically
✅ Built-in regularization against overfitting
✅ Parallelized training
✅ Interpretable via feature importances
```
### Applying It to Stock Price Prediction
```
1. Feature engineering: technical indicators, lag features, rolling statistics
2. Model training: time-ordered splits, early stopping
3. Hyperparameter tuning: grid search, randomized search
4. Evaluation: RMSE, MAE, R², directional accuracy (see the sketch below)
```
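Directional accuracy appears in the list above but was not implemented earlier; here is a minimal sketch, assuming `y_test` and `predictions` are aligned price levels as produced by the pipeline in this post:
```python
# Hedged sketch: fraction of steps where the predicted move direction
# (vs. the previous actual value) matches the realized move direction.
import numpy as np

def directional_accuracy(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    actual_move = np.sign(np.diff(actual))                   # realized up/down
    predicted_move = np.sign(predicted[1:] - actual[:-1])    # predicted up/down
    return float(np.mean(actual_move == predicted_move))

# directional_accuracy(y_test, predictions)
```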
### Caveats
```
⚠️ Stock prices are hard to predict; XGBoost is not a silver bullet
⚠️ Thorough data preprocessing is essential
⚠️ Watch out for overfitting
⚠️ Combine it with other models and strategies
```
---
**Further Reading**:
- XGBoost documentation: https://xgboost.readthedocs.io/
- Paper: "XGBoost: A Scalable Tree Boosting System" (Chen & Guestrin, 2016)
- *Statistical Learning Methods* (《统计学习方法》) by Li Hang