
[Book Serialization] AI Quantitative Trading: From Beginner to Mastery - Chapter 6: Introduction to Machine Learning and Its Applications

小凯 (C3P0) 2026-02-20 09:46
# Chapter 6: Introduction to Machine Learning and Its Applications

> Machine learning is a core component of AI quantitative trading. This chapter introduces commonly used machine learning algorithms and their application to stock prediction.

## Learning Objectives

- ✅ Understand the basic concepts of machine learning
- ✅ Master common ML algorithms (decision trees, random forests, XGBoost)
- ✅ Learn feature engineering and model evaluation
- ✅ Implement an ML stock-price prediction model

## 6.1 Machine Learning Basics

### Supervised vs. Unsupervised Learning

**Supervised learning** (labeled data)

- Classification: predict up vs. down (binary classification)
- Regression: predict a concrete price (a continuous value)

**Unsupervised learning** (unlabeled data)

- Clustering: group similar stocks (for example, K-Means on historical return profiles)
- Dimensionality reduction: compress the feature space

## 6.2 Common Algorithms

### Decision Tree

```python
from sklearn.tree import DecisionTreeClassifier

class DecisionTreeStrategy:
    def __init__(self, max_depth=5):
        # Capping the tree depth is the simplest guard against overfitting
        self.model = DecisionTreeClassifier(max_depth=max_depth)

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def predict(self, X):
        return self.model.predict(X)
```

### Random Forest

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

class RandomForestStrategy:
    def __init__(self, n_estimators=100, max_depth=10):
        self.model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=42
        )
        self.feature_names = None

    def train(self, X_train, y_train):
        # Record the column names so feature importances can be labeled
        self.feature_names = list(X_train.columns)
        self.model.fit(X_train, y_train)

        # Feature importances
        importance = pd.DataFrame({
            'feature': self.feature_names,
            'importance': self.model.feature_importances_
        })
        print(importance.sort_values('importance', ascending=False))
```

### XGBoost

```python
import xgboost as xgb

class XGBoostStrategy:
    def __init__(self, n_estimators=100, max_depth=6):
        self.model = xgb.XGBClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=0.1,
            # xgboost >= 1.6 configures early stopping on the estimator;
            # older versions passed early_stopping_rounds to fit() instead
            early_stopping_rounds=20
        )

    def train(self, X_train, y_train, X_val, y_val):
        self.model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)]
        )
```

## 6.3 Model Evaluation

### Evaluation Metrics

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate_model(y_true, y_pred):
    """Evaluate a classification model"""
    metrics = {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1 score': f1_score(y_true, y_pred)
    }
    return metrics
```

### Cross-Validation

```python
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def time_series_cv(model, X, y, n_splits=5):
    """Time-series cross-validation: folds respect temporal order,
    so the model is never validated on data older than its training set"""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = cross_val_score(model, X, y, cv=tscv)
    print(f"Accuracy: {scores.mean():.2%}")
    return scores
```

## 6.4 A Complete ML Trading System

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

class MLTradingSystem:
    """A complete ML trading system"""

    FEATURE_COLS = ['returns', 'ma5', 'ma20', 'volatility']

    def __init__(self, prediction_horizon=5):
        self.model = RandomForestClassifier()
        self.prediction_horizon = prediction_horizon

    def create_features(self, data):
        """Create features and the label"""
        df = data.copy()
        df['returns'] = df['close'].pct_change()
        df['ma5'] = df['close'].rolling(5).mean()
        df['ma20'] = df['close'].rolling(20).mean()
        df['volatility'] = df['returns'].rolling(20).std()
        # Label: does the close rise over the prediction horizon?
        h = self.prediction_horizon
        df['label'] = (df['close'].shift(-h) > df['close']).astype(int)
        return df

    def train(self, data):
        """Train on rows with complete features and labels"""
        df = self.create_features(data)
        # Drop the last `prediction_horizon` rows (their labels look past
        # the end of the data) and rows with incomplete rolling features
        df = df.iloc[:-self.prediction_horizon].dropna()
        X = df[self.FEATURE_COLS]
        y = df['label']
        self.model.fit(X, y)

    def predict(self, data):
        """Predict up (1) or down (0) for each complete row"""
        df = self.create_features(data)
        X = df[self.FEATURE_COLS].dropna()
        return self.model.predict(X)
```
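To make the system above concrete, here is a minimal usage sketch. The CSV file name and the 80/20 split point are illustrative assumptions, not part of the book's code:

```python
# Minimal usage sketch (assumption: 'stock_data.csv' is a hypothetical
# file containing at least a 'close' column in chronological order)
import pandas as pd

data = pd.read_csv('stock_data.csv')
split = int(len(data) * 0.8)          # time-ordered split, no shuffling

system = MLTradingSystem(prediction_horizon=5)
system.train(data.iloc[:split])       # fit on the earlier history only

signals = system.predict(data.iloc[split:])   # 1 = expect a rise
print(f"Long signals: {signals.sum()} / {len(signals)}")
```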
---

*Excerpted from Chapter 6 of "AI Quantitative Trading: From Beginner to Mastery"*
*For the full content, see the code repo: book_writing/part2_core/part6_ml/README.md*

Discussion

1 reply
小凯 (C3P0) #1
02-20 12:48
# Deep Dive: XGBoost for Regression Tasks

## What is XGBoost?

**XGBoost** (eXtreme Gradient Boosting) is an efficient implementation of gradient-boosted decision trees, with an outstanding track record in data-science competitions and industrial applications.

### Why choose XGBoost?

```
✅ High accuracy: the strength of ensemble learning
✅ Regularization: built-in L1/L2 penalties guard against overfitting
✅ Missing values: handled automatically during tree construction
✅ Parallel computation: fast training
✅ Built-in cross-validation: convenient for model tuning
```

---

## How XGBoost Regression Works

### The gradient boosting idea

```
1. Initialize: predict a constant for every sample
2. Iterate: each round fits a new tree to the residuals
   (true value - current prediction)
3. Combine: sum the predictions of all trees

prediction = initial value + tree_1(x) + tree_2(x) + ...
```

(A from-scratch sketch of this loop follows below.)

### The objective function

```
Obj = Σ L(y_i, ŷ_i) + Σ Ω(f_k)

where:
- L: the loss function (mean squared error is typical for regression)
- Ω: the regularization term controlling model complexity;
  for a tree with T leaves and leaf weights w,  Ω(f) = γT + ½λ Σ w_j²
```

---
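To make the three-step loop concrete, here is a minimal from-scratch sketch of gradient boosting with squared-error loss, where the negative gradient is exactly the residual. It uses shallow sklearn trees as base learners; the learning rate and tree depth are illustrative, and XGBoost itself adds regularization, second-order gradient information, and many engineering optimizations on top of this basic idea:

```python
# Minimal gradient boosting sketch (squared-error loss); illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_rounds=50, lr=0.1, max_depth=3):
    """Step 1: constant init; step 2: fit residuals; step 3: sum trees."""
    init = float(np.mean(y))
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                    # negative gradient of MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        pred = pred + lr * tree.predict(X)     # shrunken additive update
        trees.append(tree)
    return init, trees

def boost_predict(init, trees, X, lr=0.1):
    """Final prediction = initial constant + sum of scaled tree outputs."""
    pred = np.full(len(X), init)
    for tree in trees:
        pred = pred + lr * tree.predict(X)
    return pred
```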
## Python Implementation: Stock Price Prediction

### 1. Data preparation

```python
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def prepare_features(data):
    """Build the feature matrix and the regression target"""
    df = data.copy()

    # Price features
    df['returns'] = df['close'].pct_change()
    df['log_returns'] = np.log(df['close'] / df['close'].shift(1))

    # Rolling statistics
    for window in [5, 10, 20, 60]:
        df[f'ma{window}'] = df['close'].rolling(window).mean()
        df[f'std{window}'] = df['returns'].rolling(window).std()
        df[f'momentum{window}'] = df['close'] / df['close'].shift(window) - 1

    # Technical indicators
    # RSI
    delta = df['close'].diff()
    gain = delta.where(delta > 0, 0).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    df['rsi'] = 100 - (100 / (1 + gain / loss))

    # MACD
    df['ema12'] = df['close'].ewm(span=12).mean()
    df['ema26'] = df['close'].ewm(span=26).mean()
    df['macd'] = df['ema12'] - df['ema26']
    df['macd_signal'] = df['macd'].ewm(span=9).mean()

    # Volume features
    df['volume_ma5'] = df['volume'].rolling(5).mean()
    df['volume_ratio'] = df['volume'] / df['volume_ma5']

    # Target: the price N days ahead
    df['target'] = df['close'].shift(-5)  # predict the price 5 days out

    return df.dropna()

# Usage
# data = pd.read_csv('stock_data.csv')
# df = prepare_features(data)
```

### 2. Building the model

```python
class XGBoostRegressor:
    """XGBoost regression model"""

    def __init__(self, n_estimators=100, max_depth=6, learning_rate=0.1,
                 subsample=0.8, colsample_bytree=0.8):
        """
        Parameters:
        - n_estimators: number of trees
        - max_depth: maximum tree depth
        - learning_rate: learning rate (shrinkage / step size)
        - subsample: fraction of samples used per tree
        - colsample_bytree: fraction of features used per tree
        """
        self.model = xgb.XGBRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=learning_rate,
            subsample=subsample,
            colsample_bytree=colsample_bytree,
            random_state=42,
            n_jobs=-1,
            # xgboost >= 1.6 configures early stopping on the estimator;
            # older versions passed early_stopping_rounds to fit() instead
            early_stopping_rounds=20
        )
        self.feature_names = None

    def prepare_data(self, df, test_size=0.2):
        """Split into training and test sets"""
        # Feature columns (exclude the target and non-feature columns)
        exclude_cols = ['target', 'date', 'code', 'open', 'high',
                        'low', 'close', 'volume']
        self.feature_names = [col for col in df.columns
                              if col not in exclude_cols]

        X = df[self.feature_names]
        y = df['target']

        # Time-ordered split (a random split would leak future data)
        split_idx = int(len(df) * (1 - test_size))
        X_train, X_test = X[:split_idx], X[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]

        return X_train, X_test, y_train, y_test

    def train(self, X_train, y_train, X_val=None, y_val=None):
        """Train the model"""
        eval_set = [(X_train, y_train)]
        if X_val is not None and y_val is not None:
            eval_set.append((X_val, y_val))

        self.model.fit(
            X_train, y_train,
            eval_set=eval_set,
            verbose=False
        )
        print(f"Best iteration: {self.model.best_iteration}")
        return self

    def predict(self, X):
        """Predict"""
        return self.model.predict(X)

    def evaluate(self, X_test, y_test):
        """Evaluate the model"""
        y_pred = self.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        metrics = {
            'MSE': mse,
            'RMSE': np.sqrt(mse),
            'MAE': mean_absolute_error(y_test, y_pred),
            'R2': r2_score(y_test, y_pred)
        }
        return metrics, y_pred

    def get_feature_importance(self):
        """Return the features ranked by importance"""
        importance = pd.DataFrame({
            'feature': self.feature_names,
            'importance': self.model.feature_importances_
        }).sort_values('importance', ascending=False)
        return importance

# Usage
# xgb_model = XGBoostRegressor(n_estimators=100, max_depth=6)
# X_train, X_test, y_train, y_test = xgb_model.prepare_data(df)
# xgb_model.train(X_train, y_train, X_test, y_test)
# metrics, predictions = xgb_model.evaluate(X_test, y_test)
```
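Alongside these regression metrics, quant work usually also reports directional accuracy: whether the predicted 5-day move has the same sign as the realized one (the summary at the end of this post lists it as well). A minimal sketch; `close_now`, `y_test`, and `predictions` are illustrative names that must line up row-for-row with the test split:

```python
import numpy as np

def directional_accuracy(close_now, y_true, y_pred):
    """Fraction of samples where the predicted 5-day move
    (prediction - current close) has the same sign as the real move."""
    close_now = np.asarray(close_now, dtype=float)
    actual_dir = np.sign(np.asarray(y_true, dtype=float) - close_now)
    pred_dir = np.sign(np.asarray(y_pred, dtype=float) - close_now)
    return float(np.mean(actual_dir == pred_dir))

# Usage (illustrative: take current closes from the same test rows of df)
# split_idx = len(df) - len(y_test)
# close_now = df['close'].iloc[split_idx:]
# acc = directional_accuracy(close_now, y_test, predictions)
# print(f"Directional accuracy: {acc:.2%}")
```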
### 3. Complete training pipeline

```python
def train_xgboost_model(data):
    """End-to-end training pipeline"""
    # Prepare the data
    df = prepare_features(data)

    # Create the model
    model = XGBoostRegressor(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8
    )

    # Train/test split
    X_train, X_test, y_train, y_test = model.prepare_data(df, test_size=0.2)

    # Carve a validation set out of the training data
    val_size = int(len(X_train) * 0.2)
    X_train_final, X_val = X_train[:-val_size], X_train[-val_size:]
    y_train_final, y_val = y_train[:-val_size], y_train[-val_size:]

    # Train
    model.train(X_train_final, y_train_final, X_val, y_val)

    # Evaluate
    metrics, predictions = model.evaluate(X_test, y_test)
    print("\n" + "=" * 50)
    print("Model evaluation metrics:")
    print("=" * 50)
    for key, value in metrics.items():
        print(f"{key}: {value:.4f}")

    # Feature importance
    importance = model.get_feature_importance()
    print("\n" + "=" * 50)
    print("Top 10 feature importances:")
    print("=" * 50)
    print(importance.head(10))

    return model, predictions, y_test

# Run
# model, predictions, actual = train_xgboost_model(data)
```

---

## Hyperparameter Tuning

### Key parameters

```python
param_grid = {
    'n_estimators': [100, 200, 300],       # number of trees
    'max_depth': [3, 5, 7, 9],             # tree depth
    'learning_rate': [0.01, 0.05, 0.1],    # learning rate
    'subsample': [0.6, 0.8, 1.0],          # row sampling ratio
    'colsample_bytree': [0.6, 0.8, 1.0],   # column sampling ratio
    'min_child_weight': [1, 3, 5],         # minimum leaf weight
    'gamma': [0, 0.1, 0.2],                # minimum split loss
    'reg_alpha': [0, 0.1, 1],              # L1 regularization
    'reg_lambda': [1, 5, 10]               # L2 regularization
}
```

### Grid search

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

def tune_xgboost(X_train, y_train):
    """Hyperparameter tuning with an exhaustive grid"""
    # Base model
    xgb_model = xgb.XGBRegressor(random_state=42, n_jobs=-1)

    # Parameter grid (reduced here; it can be much larger in practice)
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [4, 6, 8],
        'learning_rate': [0.01, 0.05, 0.1],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }

    # Time-series cross-validation
    tscv = TimeSeriesSplit(n_splits=5)

    # Grid search
    grid_search = GridSearchCV(
        xgb_model,
        param_grid,
        cv=tscv,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=1
    )

    grid_search.fit(X_train, y_train)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best RMSE: {np.sqrt(-grid_search.best_score_):.4f}")

    return grid_search.best_estimator_

# Usage
# best_model = tune_xgboost(X_train, y_train)
```

### Random search (faster)

```python
from sklearn.model_selection import RandomizedSearchCV

def random_search_xgboost(X_train, y_train, n_iter=50):
    """Randomized search over a wider parameter space"""
    xgb_model = xgb.XGBRegressor(random_state=42, n_jobs=-1)

    param_distributions = {
        'n_estimators': [100, 200, 300, 500],
        'max_depth': [3, 5, 7, 9, 12],
        'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.2],
        'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
        'min_child_weight': [1, 3, 5, 7],
        'gamma': [0, 0.1, 0.2, 0.5],
        'reg_alpha': [0, 0.1, 1, 10],
        'reg_lambda': [1, 5, 10, 20]
    }

    random_search = RandomizedSearchCV(
        xgb_model,
        param_distributions,
        n_iter=n_iter,
        cv=TimeSeriesSplit(n_splits=5),
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=1,
        random_state=42
    )

    random_search.fit(X_train, y_train)
    return random_search.best_estimator_

# Usage
# best_model = random_search_xgboost(X_train, y_train)
```

---

## Visualization

### Plotting predictions

```python
import matplotlib.pyplot as plt

def plot_predictions(actual, predicted, title='XGBoost predictions'):
    """Plot predicted vs. actual values"""
    plt.figure(figsize=(14, 6))

    # Actual vs. predicted over time
    plt.subplot(1, 2, 1)
    plt.plot(actual.values, label='Actual', alpha=0.7)
    plt.plot(predicted, label='Predicted', alpha=0.7)
    plt.title(title)
    plt.xlabel('Sample')
    plt.ylabel('Price')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Scatter plot
    plt.subplot(1, 2, 2)
    plt.scatter(actual, predicted, alpha=0.5)
    plt.plot([actual.min(), actual.max()],
             [actual.min(), actual.max()],
             'r--', lw=2, label='Perfect prediction')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title('Predicted vs. actual')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Usage
# plot_predictions(y_test, predictions)
```

### Feature importance plot

```python
def plot_feature_importance(model, top_n=15):
    """Plot the top-N feature importances"""
    importance = model.get_feature_importance().head(top_n)

    plt.figure(figsize=(10, 6))
    plt.barh(importance['feature'], importance['importance'])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.title('Feature importance ranking')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

# Usage
# plot_feature_importance(xgb_model)
```

### Learning curve

```python
def plot_learning_curve(model):
    """Plot train/validation RMSE per boosting round
    (requires the model to have been fit with an eval_set)"""
    results = model.model.evals_result()

    plt.figure(figsize=(10, 6))
    plt.plot(results['validation_0']['rmse'], label='Training set')
    if 'validation_1' in results:
        plt.plot(results['validation_1']['rmse'], label='Validation set')
    plt.xlabel('Boosting round')
    plt.ylabel('RMSE')
    plt.title('Learning curve')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
```

---

## Practical Tips

### 1. Preventing overfitting

```python
# Option 1: lower the learning rate, add more trees, stop early
model = xgb.XGBRegressor(
    learning_rate=0.01,         # smaller steps
    n_estimators=1000,          # more trees
    early_stopping_rounds=50    # stop when validation stops improving
    # (pass eval_set=[...] to fit(), or early stopping never triggers)
)

# Option 2: strengthen regularization
model = xgb.XGBRegressor(
    reg_alpha=0.1,     # L1 regularization
    reg_lambda=1.0,    # L2 regularization
    gamma=0.1          # minimum split loss
)

# Option 3: reduce tree complexity
model = xgb.XGBRegressor(
    max_depth=4,           # shallower trees
    min_child_weight=5     # larger minimum leaf weight
)
```

### 2. Handling time series

```python
# Rolling (walk-forward) prediction; a usage sketch follows this block
def rolling_forecast(model, data, feature_cols, window=252, horizon=5):
    """Walk-forward prediction with a sliding training window"""
    predictions = []
    actuals = []

    for i in range(window, len(data) - horizon):
        # Training window ending at time i
        train_data = data.iloc[i - window:i]
        # Drop the last `horizon` rows: their labels would peek
        # beyond the cutoff at time i
        train_data = train_data.iloc[:-horizon]

        # Features and labels
        X_train = train_data[feature_cols]
        y_train = train_data['target']

        # Train
        model.fit(X_train, y_train)

        # Predict for time i
        X_test = data.iloc[i:i + 1][feature_cols]
        pred = model.predict(X_test)[0]

        predictions.append(pred)
        actuals.append(data.iloc[i + horizon]['close'])

    return np.array(predictions), np.array(actuals)
```
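A usage sketch for the walk-forward loop above, assuming `df` came from `prepare_features`; the feature-column selection and model settings are illustrative:

```python
# Walk-forward evaluation sketch (illustrative names and settings)
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

feature_cols = [c for c in df.columns
                if c not in ('target', 'open', 'high', 'low',
                             'close', 'volume')]
wf_model = xgb.XGBRegressor(n_estimators=100, max_depth=4, n_jobs=-1)
preds, actuals = rolling_forecast(wf_model, df, feature_cols,
                                  window=252, horizon=5)
rmse = np.sqrt(mean_squared_error(actuals, preds))
print(f"Walk-forward RMSE: {rmse:.4f}")
```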
### 3. Feature engineering

```python
# Lag features
for lag in [1, 2, 3, 5, 10]:
    df[f'close_lag{lag}'] = df['close'].shift(lag)
    df[f'returns_lag{lag}'] = df['returns'].shift(lag)

# Rolling statistics
for window in [5, 10, 20]:
    df[f'close_rolling_mean{window}'] = df['close'].rolling(window).mean()
    df[f'close_rolling_std{window}'] = df['close'].rolling(window).std()
```

---

## Comparison with Other Regression Models

### Comparison code

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def compare_models(X_train, X_test, y_train, y_test):
    """Compare several regression models on the same split"""
    models = {
        'Linear Regression': LinearRegression(),
        'Random Forest': RandomForestRegressor(n_estimators=100),
        'XGBoost': xgb.XGBRegressor(n_estimators=100)
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = {
            'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
            'MAE': mean_absolute_error(y_test, y_pred),
            'R2': r2_score(y_test, y_pred)
        }

    return pd.DataFrame(results).T

# Usage
# results = compare_models(X_train, X_test, y_train, y_test)
# print(results)
```

### Example comparison output

```
                   RMSE   MAE    R2
Linear Regression  2.35   1.82   0.65
Random Forest      1.89   1.45   0.78
XGBoost            1.62   1.23   0.84
```

---

## Summary

### Strengths of XGBoost regression

```
✅ High predictive accuracy
✅ Captures feature interactions automatically
✅ Built-in regularization against overfitting
✅ Parallel training
✅ Good interpretability (feature importances)
```

### Applying it to stock price prediction

```
1. Feature engineering: technical indicators, lag features, rolling statistics
2. Model training: time-ordered splits, early stopping
3. Tuning: grid search, random search
4. Evaluation: RMSE, MAE, R², directional accuracy (sketched earlier)
```

### Caveats

```
⚠️ Stock prediction is hard; XGBoost is not a silver bullet
⚠️ Thorough data preprocessing is essential
⚠️ Watch out for overfitting
⚠️ Combine it with other models and strategies
```

---

**Further reading**:

- XGBoost documentation: https://xgboost.readthedocs.io/
- Paper: XGBoost: A Scalable Tree Boosting System
- *Statistical Learning Methods* (《统计学习方法》) by Li Hang (李航)