<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Exploratory Data Analysis (EDA): A Complete Guide</title>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
<link href="https://fonts.googleapis.com/css2?family=Noto+Sans+SC:wght@400;500;700&family=Roboto+Mono:wght@400;500&display=swap" rel="stylesheet">
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: 'Noto Sans SC', sans-serif;
background-color: #f5f7fa;
color: #1a237e;
line-height: 1.6;
}
.poster-container {
width: 720px;
min-height: 960px;
margin: 0 auto;
background: linear-gradient(135deg, #e3f2fd, #bbdefb);
padding: 40px;
position: relative;
overflow: hidden;
}
.background-shape {
position: absolute;
border-radius: 50%;
background: rgba(100, 181, 246, 0.2);
z-index: 0;
}
.shape1 {
width: 300px;
height: 300px;
top: -100px;
right: -100px;
}
.shape2 {
width: 200px;
height: 200px;
bottom: 100px;
left: -50px;
}
.content {
position: relative;
z-index: 1;
}
.header {
text-align: center;
margin-bottom: 40px;
}
.title {
font-size: 52px;
font-weight: 700;
color: #1a237e;
margin-bottom: 20px;
line-height: 1.2;
}
.subtitle {
font-size: 24px;
color: #3949ab;
font-weight: 500;
margin-bottom: 10px;
}
.card {
background-color: rgba(255, 255, 255, 0.85);
border-radius: 16px;
padding: 24px;
margin-bottom: 24px;
box-shadow: 0 4px 20px rgba(0, 0, 0, 0.08);
backdrop-filter: blur(10px);
}
.card-title {
font-size: 28px;
font-weight: 700;
color: #1a237e;
margin-bottom: 16px;
display: flex;
align-items: center;
}
.card-title .material-icons {
margin-right: 12px;
color: #3949ab;
}
.goal-item {
display: flex;
align-items: flex-start;
margin-bottom: 12px;
}
.goal-item .material-icons {
color: #3949ab;
margin-right: 12px;
flex-shrink: 0;
}
.goal-text {
font-size: 18px;
}
.highlight {
background-color: rgba(100, 181, 246, 0.3);
padding: 2px 6px;
border-radius: 4px;
font-weight: 500;
}
.step-title {
font-size: 24px;
font-weight: 700;
color: #1a237e;
margin: 24px 0 16px;
padding-bottom: 8px;
border-bottom: 2px solid #bbdefb;
}
.step-item {
margin-bottom: 16px;
}
.step-name {
font-size: 20px;
font-weight: 500;
color: #3949ab;
margin-bottom: 8px;
}
.step-desc {
font-size: 16px;
margin-left: 24px;
}
.code-block {
background-color: #263238;
color: #eeffff;
border-radius: 8px;
padding: 16px;
margin: 16px 0;
font-family: 'Roboto Mono', monospace;
font-size: 14px;
overflow-x: auto;
}
.comment {
color: #546e7a;
}
.keyword {
color: #c792ea;
}
.string {
color: #c3e88d;
}
.function {
color: #82aaff;
}
.method {
color: #89ddff;
}
.grid-container {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 20px;
margin-top: 20px;
}
.grid-item {
background-color: rgba(255, 255, 255, 0.7);
border-radius: 12px;
padding: 16px;
}
.grid-title {
font-size: 20px;
font-weight: 500;
color: #3949ab;
margin-bottom: 12px;
display: flex;
align-items: center;
}
.grid-title .material-icons {
margin-right: 8px;
font-size: 20px;
}
.footer {
text-align: center;
margin-top: 40px;
color: #3949ab;
font-size: 16px;
}
</style>
</head>
<body>
<div class="poster-container">
<div class="background-shape shape1"></div>
<div class="background-shape shape2"></div>
<div class="content">
<div class="header">
<h1 class="title">Exploratory Data Analysis<br>(EDA): A Complete Guide</h1>
<p class="subtitle">A systematic path from data to insight</p>
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">lightbulb</i>
What is exploratory data analysis (EDA)?
</h2>
<p style="font-size: 18px; margin-bottom: 16px;">
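Exploratory Data Analysis (EDA) is the first step of a data analysis project. Its aim is to <span class="highlight">understand the structure, distribution, and quality of the data</span> and to surface potential patterns or problems. Unlike traditional (confirmatory) statistical analysis, which tests hypotheses stated in advance, EDA concentrates on the actual distribution of the data and on visualization, helping the analyst spot patterns hidden in the data.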
</p>
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">stars</i>
Core goals of EDA
</h2>
<div class="goal-item">
<i class="material-icons">check_circle</i>
<div class="goal-text"><span class="highlight">Understand the data</span>: What does the dataset contain? How many rows and columns does it have?</div>
</div>
<div class="goal-item">
<i class="material-icons">check_circle</i>
<div class="goal-text"><span class="highlight">Assess data quality</span>: Are there missing values or outliers?</div>
</div>
<div class="goal-item">
<i class="material-icons">check_circle</i>
<div class="goal-text"><span class="highlight">Characterize the distributions</span>: What are the central tendency and spread of the data?</div>
</div>
<div class="goal-item">
<i class="material-icons">check_circle</i>
<div class="goal-text"><span class="highlight">Discover potential relationships</span>: Are the variables associated with one another?</div>
</div>
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">search</i>
Step 1: Data overview and quality checks
</h2>
<div class="step-item">
<h3 class="step-name">1. Check the shape of the data</h3>
<p class="step-desc">
- <strong>Rows</strong>: the number of observations (samples)<br>
- <strong>Columns</strong>: the number of features (variables)<br>
- In Python, use <code>df.shape</code>; in R, use <code>dim(df)</code>
</p>
</div>
<div class="step-item">
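<h3 class="step-name">2. Check column names and data types</h3>
<p class="step-desc">
- Understand what each variable represents<br>
- Distinguish <strong>numeric variables</strong> (continuous, such as age or income; discrete, such as number of children) from <strong>categorical variables</strong> (such as gender or country)<br>
- In Python, use <code>df.info()</code> or <code>df.dtypes</code>; in R, use <code>str(df)</code>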
</p>
</div>
<div class="step-item">
<h3 class="step-name">3. Look at the first and last rows</h3>
<p class="step-desc">
- Get an intuitive feel for what the data looks like<br>
- In Python, use <code>df.head()</code> and <code>df.tail()</code>
</p>
</div>
<div class="step-item">
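<h3 class="step-name">4. Check for missing values</h3>
<p class="step-desc">
- This is central to data quality: missing values can seriously distort later analysis<br>
- <strong>Method</strong>: compute the count and the proportion of missing values in each column<br>
- In Python, use <code>df.isnull().sum()</code> for counts and <code>df.isnull().mean()</code> for proportions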
</p>
</div>
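<p class="step-desc">
The count and the proportion of missing values can be collected into one table. The following is a minimal sketch on a made-up DataFrame; the column names and values are illustrative assumptions, not data from this guide.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "city": ["Paris", "Lima", None, "Oslo", "Rome"],
    "income": [3200, 4100, 2800, 5000, 3900],
})

# One row per column: number of missing values and their share in percent.
missing_summary = pd.DataFrame({
    "missing_count": df.isnull().sum(),
    "missing_pct": df.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

<p class="step-desc">
Sorting by percentage puts the most affected columns first, which is usually where a decision about dropping or imputing is needed.
</p>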
<div class="step-item">
<h3 class="step-name">5. Check for duplicates</h3>
<p class="step-desc">
- Check whether any rows are exact duplicates of one another<br>
- In Python, use <code>df.duplicated().sum()</code>
</p>
</div>
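<p class="step-desc">
Beyond counting duplicates, it often helps to look at the duplicated rows before removing them. This is a small sketch on invented data; <code>user_id</code> and <code>plan</code> are hypothetical column names.
</p>

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "plan": ["free", "pro", "pro", "free", "pro"],
})

# Rows that repeat an earlier row (the first occurrence is not counted).
n_duplicates = int(df.duplicated().sum())

# Every copy of a duplicated row, including the first occurrence, for inspection.
duplicate_rows = df[df.duplicated(keep=False)]

# A de-duplicated copy; the original DataFrame is left unchanged.
deduplicated = df.drop_duplicates()

print(n_duplicates)
print(duplicate_rows)
```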
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">analytics</i>
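Step 2: Statistical analysis of numeric variables
</h2>
<div class="step-item">
<h3 class="step-name">1. Descriptive summary statistics</h3>
<p class="step-desc">
- This is the most commonly used step: a single call produces several key statistics<br>
- In Python, use <code>df.describe()</code>, which reports:<br>
- <strong>count</strong>: number of non-null values<br>
- <strong>mean</strong>: arithmetic mean, a measure of central tendency<br>
- <strong>std</strong>: standard deviation, a measure of spread<br>
- <strong>min</strong>: minimum<br>
- <strong>25%</strong>: first quartile<br>
- <strong>50%</strong>: median, which is robust to outliers<br>
- <strong>75%</strong>: third quartile<br>
- <strong>max</strong>: maximum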
</p>
</div>
<div class="step-item">
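<h3 class="step-name">2. Going deeper (beyond .describe())</h3>
<p class="step-desc">
- <strong>Skewness</strong>: measures the asymmetry of the distribution<br>
- Positive (right) skew: typically mean &gt; median; most values sit on the left and the right tail is long<br>
- Negative (left) skew: typically mean &lt; median; most values sit on the right and the left tail is long<br>
- <strong>Kurtosis</strong>: measures how heavy the tails are relative to a normal distribution. High kurtosis means heavier tails, that is, more extreme values, and is often accompanied by a sharper peak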
</p>
</div>
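<p class="step-desc">
Skewness and kurtosis can be read off directly with pandas. The sketch below uses synthetic data (an assumption made purely for illustration): one roughly symmetric column and one right-skewed column. Note that <code>kurt()</code> in pandas reports excess kurtosis, for which a normal distribution scores about 0.
</p>

```python
import numpy as np
import pandas as pd

# Synthetic data, generated only to illustrate the two statistics.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

shape_stats = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "skewness": df.skew(),          # about 0 for a symmetric distribution
    "excess_kurtosis": df.kurt(),   # about 0 for a normal distribution
})

print(shape_stats.round(2))
```

<p class="step-desc">
For the right-skewed column the mean sits above the median and both skewness and excess kurtosis are clearly positive, matching the description above.
</p>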
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">category</i>
Step 3: Statistical analysis of categorical variables
</h2>
<div class="step-item">
<h3 class="step-name">1. Frequency counts</h3>
<p class="step-desc">
- Count how many times each category occurs<br>
- In Python, use <code>df['column_name'].value_counts()</code>
</p>
</div>
<div class="step-item">
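<h3 class="step-name">2. Proportions and percentages</h3>
<p class="step-desc">
- The share of each category in the total often conveys the distribution more directly than raw counts<br>
- Use <code>df['column_name'].value_counts(normalize=True) * 100</code>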
</p>
</div>
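<p class="step-desc">
Counts and percentages are easier to read side by side. The following sketch builds such a table for a hypothetical <code>country</code> column; the data is invented. By default <code>value_counts()</code> leaves out missing values, and passing <code>dropna=False</code> includes them.
</p>

```python
import pandas as pd

# Hypothetical categorical column; values are illustrative only.
df = pd.DataFrame({
    "country": ["CN", "US", "CN", "DE", "CN", "US", "CN", "JP"],
})

counts = df["country"].value_counts()
percent = df["country"].value_counts(normalize=True) * 100

# Combine both views into one frequency table, indexed by category.
frequency_table = pd.DataFrame({
    "count": counts,
    "percent": percent.round(1),
})

print(frequency_table)
```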
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">bar_chart</i>
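Step 4: Data visualization
</h2>
<div class="grid-container">
<div class="grid-item">
<h3 class="grid-title">
<i class="material-icons">show_chart</i>
Numeric variables
</h3>
<p>
- <strong>Histogram</strong>: shows the shape of a single variable's distribution<br>
- <strong>Box plot</strong>: shows the five-number summary and makes outliers quick to spot<br>
- <strong>Violin plot</strong>: combines a box plot with a kernel density estimate to show the detailed shape of the distribution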
</p>
</div>
<div class="grid-item">
<h3 class="grid-title">
<i class="material-icons">pie_chart</i>
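Categorical variables
</h3>
<p>
- <strong>Bar chart</strong>: shows the frequency or proportion of each category<br>
- <strong>Pie chart</strong>: shows each category's share (suitable when there are only a few categories)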
</p>
</div>
<div class="grid-item">
<h3 class="grid-title">
<i class="material-icons">scatter_plot</i>
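Exploring relationships
</h3>
<p>
- <strong>Scatter plot</strong>: explores the relationship between two numeric variables<br>
- <strong>Heatmap</strong>: uses color intensity to display the correlation matrix of several variables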
</p>
</div>
<div class="grid-item">
<h3 class="grid-title">
<i class="material-icons">insights</i>
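Advanced visualization
</h3>
<p>
- <strong>Parallel coordinates plot</strong>: visualizes multidimensional data<br>
- <strong>3D scatter plot</strong>: explores relationships among three variables<br>
- <strong>Interactive charts</strong>: built with Plotly or Bokeh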
</p>
</div>
</div>
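<p class="step-desc">
As an illustration of two of the chart types above, the sketch below draws a violin plot and a scatter plot with plain matplotlib. The data is synthetic, and the column names and the output file name are hypothetical. The <code>Agg</code> backend is selected only so the sketch also runs without a display; in a notebook that line can be dropped and <code>plt.show()</code> used instead of saving.
</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: weight is constructed to rise with height, plus noise.
rng = np.random.default_rng(seed=42)
height = rng.normal(170, 8, size=300)
df = pd.DataFrame({
    "height_cm": height,
    "weight_kg": 0.9 * height - 85 + rng.normal(0, 6, size=300),
})

fig, (ax_violin, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))

# Violin plot: density shape of one numeric variable, with the median marked.
ax_violin.violinplot(df["height_cm"].to_numpy(), showmedians=True)
ax_violin.set_title("Violin plot of height_cm")
ax_violin.set_xticks([])

# Scatter plot: relationship between two numeric variables.
ax_scatter.scatter(df["height_cm"], df["weight_kg"], alpha=0.5)
ax_scatter.set_xlabel("height_cm")
ax_scatter.set_ylabel("weight_kg")
ax_scatter.set_title("height_cm vs weight_kg")

fig.tight_layout()
fig.savefig("eda_relationships.png")  # hypothetical output file name
```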
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">auto_awesome</i>
Automated EDA tools
</h2>
<div class="step-item">
<h3 class="step-name">1. ydata-profiling (formerly Pandas Profiling)</h3>
<p class="step-desc">
- Generates a comprehensive data analysis report in one call<br>
- Covers per-variable statistics, correlation analysis, missing-value analysis, and more<br>
- Install: <code>pip install ydata-profiling</code><br>
- Usage: <code>from ydata_profiling import ProfileReport; ProfileReport(df)</code>
</p>
</div>
<div class="step-item">
<h3 class="step-name">2. Sweetviz</h3>
<p class="step-desc">
- Focuses on comparing datasets and variables<br>
- Generates polished HTML reports<br>
- Install: <code>pip install sweetviz</code><br>
- Usage: <code>import sweetviz as sv; report = sv.analyze(df); report.show_html()</code>
</p>
</div>
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">code</i>
Worked example (corrected Python code)
</h2>
<div class="code-block">
<span class="keyword">import</span> pandas <span class="keyword">as</span> pd<br>
<span class="keyword">import</span> numpy <span class="keyword">as</span> np<br>
<span class="keyword">import</span> matplotlib.pyplot <span class="keyword">as</span> plt<br>
<span class="keyword">import</span> seaborn <span class="keyword">as</span> sns<br>
<br>
<span class="comment"># Load the dataset ("data.csv" is a placeholder path; substitute your own file)</span><br>
df = pd.read_csv(<span class="string">"data.csv"</span>)<br>
<br>
<span class="comment"># 1. Data overview</span><br>
<span class="function">print</span>(<span class="string">"Shape:"</span>, df.shape)<br>
<span class="function">print</span>(<span class="string">"\nData types and info:"</span>)<br>
df.info()  <span class="comment"># info() prints its report itself and returns None</span><br>
<span class="function">print</span>(<span class="string">"\nFirst 5 rows:"</span>)<br>
<span class="function">print</span>(df.head())<br>
<br>
<span class="comment"># 2. Data quality</span><br>
<span class="function">print</span>(<span class="string">"\nMissing values per column:"</span>)<br>
<span class="function">print</span>(df.isnull().sum())<br>
<span class="function">print</span>(<span class="string">"\nNumber of duplicate rows:"</span>)<br>
<span class="function">print</span>(df.duplicated().sum())<br>
<br>
<span class="comment"># 3. Describe the numeric variables</span><br>
<span class="function">print</span>(<span class="string">"\nDescriptive statistics for numeric variables:"</span>)<br>
<span class="function">print</span>(df.describe())<br>
<br>
<span class="comment"># 4. Describe the categorical variables</span><br>
categorical_columns = df.select_dtypes(include=['object', 'category']).columns<br>
<span class="keyword">for</span> col <span class="keyword">in</span> categorical_columns:<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="function">print</span>(<span class="string">f"\nDistribution of '{col}':"</span>)<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="function">print</span>(df[col].value_counts())<br>
<br>
<span class="comment"># 5. Visualization</span><br>
sns.set_theme(style=<span class="string">"whitegrid"</span>)<br>
<br>
<span class="comment"># Histogram and box plot for each numeric variable</span><br>
numerical_columns = df.select_dtypes(include=[np.number]).columns<br>
<span class="keyword">for</span> col <span class="keyword">in</span> numerical_columns:<br>
&nbsp;&nbsp;&nbsp;&nbsp;fig, axes = plt.subplots(1, 2, figsize=(12, 4))<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment"># Histogram with a kernel density estimate overlaid</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;sns.histplot(df[col], kde=True, ax=axes[0])<br>
&nbsp;&nbsp;&nbsp;&nbsp;axes[0].set_title(<span class="string">f'Distribution of {col}'</span>)<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment"># Box plot</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;sns.boxplot(x=df[col], ax=axes[1])<br>
&nbsp;&nbsp;&nbsp;&nbsp;axes[1].set_title(<span class="string">f'Boxplot of {col}'</span>)<br>
&nbsp;&nbsp;&nbsp;&nbsp;plt.show()<br>
<br>
<span class="comment"># Bar chart for each categorical variable</span><br>
<span class="keyword">for</span> col <span class="keyword">in</span> categorical_columns:<br>
&nbsp;&nbsp;&nbsp;&nbsp;plt.figure(figsize=(10, 5))<br>
&nbsp;&nbsp;&nbsp;&nbsp;df[col].value_counts().plot(kind=<span class="string">'bar'</span>)<br>
&nbsp;&nbsp;&nbsp;&nbsp;plt.title(<span class="string">f'Bar Chart of {col}'</span>)<br>
&nbsp;&nbsp;&nbsp;&nbsp;plt.xticks(rotation=45)<br>
&nbsp;&nbsp;&nbsp;&nbsp;plt.show()<br>
<br>
<span class="comment"># Correlation heatmap (numeric columns only; corr() fails on text columns in pandas 2.x)</span><br>
plt.figure(figsize=(10, 8))<br>
correlation_matrix = df[numerical_columns].corr()<br>
sns.heatmap(correlation_matrix, annot=True, cmap=<span class="string">'coolwarm'</span>, fmt=<span class="string">".2f"</span>)<br>
plt.title(<span class="string">'Correlation Heatmap'</span>)<br>
plt.show()
</div>
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">build</i>
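Data preprocessing recommendations
</h2>
<div class="step-item">
<h3 class="step-name">1. Handling missing values</h3>
<p class="step-desc">
- <strong>Drop</strong>: when the proportion of missing values is small (as a rule of thumb, &lt;5%), the affected rows can simply be dropped<br>
- <strong>Impute</strong>: fill in with the mean, median, mode, or a predictive model<br>
- <strong>Flag</strong>: create an indicator variable that marks where values were missing, so that information is preserved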
</p>
</div>
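<p class="step-desc">
The three options can be sketched with pandas as follows. This is an illustration on invented data, not a recommendation for any particular dataset; the column names and the choice of median and mode imputation are assumptions.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 47.0, 38.0],
    "city": ["Paris", "Lima", None, "Lima", "Oslo"],
})

# Drop: remove every row that contains a missing value.
dropped = df.dropna()

# Flag, then impute: record where "age" was missing before filling it in.
imputed = df.copy()
imputed["age_was_missing"] = imputed["age"].isna()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])

print(dropped)
print(imputed)
```

<p class="step-desc">
Creating the indicator column before imputing matters: once the gaps are filled, the information about which values were originally missing is gone.
</p>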
<div class="step-item">
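<h3 class="step-name">2. Handling outliers</h3>
<p class="step-desc">
- <strong>Detect</strong>: use box plots, Z-scores, or the IQR method<br>
- <strong>Treat</strong>: remove, replace, or transform the outliers<br>
- <strong>Robust statistics</strong>: use the median, quartiles, and other statistics that are insensitive to outliers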
</p>
</div>
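<p class="step-desc">
The IQR and Z-score rules can be compared on the same data. The sketch below uses an invented series named <code>response_ms</code> with one obviously extreme value; the 1.5 multiplier and the threshold of 3 are the conventional defaults, taken here as assumptions.
</p>

```python
import pandas as pd

# Hypothetical measurements with a single extreme value (95).
values = pd.Series([12, 13, 12, 14, 13, 15, 14, 13, 12, 95], name="response_ms")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[~values.between(lower, upper)]

# Z-score rule: flag points more than `threshold` standard deviations from the mean.
threshold = 3.0
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs().gt(threshold)]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

<p class="step-desc">
In this small sample the IQR rule flags 95 while the Z-score rule flags nothing, because the extreme value inflates the very mean and standard deviation it is measured against. This is one reason to prefer quartile-based, robust statistics when outliers are suspected.
</p>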
<div class="step-item">
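<h3 class="step-name">3. Data transformation</h3>
<p class="step-desc">
- <strong>Standardization / normalization</strong>: removes the effect of differing units and scales<br>
- <strong>Log transform</strong>: tames right-skewed distributions<br>
- <strong>Categorical encoding</strong>: converts categorical variables to numbers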
</p>
</div>
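<p class="step-desc">
Each of these transformations is a one-liner in pandas and NumPy. The sketch below applies them to invented data; the column names are hypothetical, and one-hot encoding via <code>pd.get_dummies</code> is just one of several possible categorical encodings.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a right-skewed numeric column and a categorical column.
df = pd.DataFrame({
    "income": [2000.0, 3000.0, 4000.0, 5000.0, 36000.0],
    "city": ["Lima", "Oslo", "Lima", "Rome", "Oslo"],
})

income = df["income"]
transformed = pd.DataFrame({
    # Standardization (z-score): mean 0, standard deviation 1.
    "income_standardized": (income - income.mean()) / income.std(ddof=0),
    # Min-max normalization: rescale to the range [0, 1].
    "income_minmax": (income - income.min()) / (income.max() - income.min()),
    # Log transform: log(1 + x) compresses the long right tail.
    "income_log": np.log1p(income),
})

# One-hot encoding: one indicator column per category.
city_dummies = pd.get_dummies(df["city"], prefix="city")

result = pd.concat([transformed, city_dummies], axis=1)
print(result.round(3))
```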
</div>
<div class="card">
<h2 class="card-title">
<i class="material-icons">summarize</i>
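Summary
</h2>
<p style="font-size: 18px;">
Analyzing a dataset's basic statistics is a systematic process that follows the principle of <span class="highlight">"from the whole to the parts, from numbers to graphics"</span>: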
</p>
<div class="goal-item">
<i class="material-icons">looks_one</i>
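<div class="goal-text"><strong>Overall picture</strong>: shape, data types, first and last rows</div>
</div>
<div class="goal-item">
<i class="material-icons">looks_two</i>
<div class="goal-text"><strong>Quality diagnosis</strong>: identify and handle missing values and duplicates</div>
</div>
<div class="goal-item">
<i class="material-icons">looks_3</i>
<div class="goal-text"><strong>Numeric analysis</strong>: use descriptive statistics and visualization to understand distributions and outliers</div>
</div>
<div class="goal-item">
<i class="material-icons">looks_4</i>
<div class="goal-text"><strong>Categorical analysis</strong>: use frequency counts and bar charts to understand distributions</div>
</div>
<div class="goal-item">
<i class="material-icons">looks_5</i>
<div class="goal-text"><strong>Relationship exploration</strong>: use scatter plots and heatmaps to uncover links between variables</div>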
</div>
<p style="font-size: 18px; margin-top: 16px;">
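After completing these steps, you will have a thorough, well-grounded understanding of the dataset, which lays a solid foundation for the data cleaning, feature engineering, and modeling that follow.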
</p>
</div>
<div class="footer">
<p>© 2023 Exploratory Data Analysis Guide | Data-driven decisions start with EDA</p>
</div>
</div>
</div>
</body>
</html>