代码收藏家 (Code Collector) Technical Tutorials 2024-11-16

Python HR Data Analysis Case Study

1. Project Overview

Background

HR analytics, also referred to as people analytics, workforce analytics, or talent analytics, involves gathering, analyzing, and reporting HR data. It is the collection and application of talent data to improve critical talent and business outcomes. It enables an organization to measure the impact of a range of HR metrics on overall business performance and to make decisions based on data. HR analysts are primarily responsible for interpreting and analyzing these large datasets.

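In human resource management, analyzing employees' work-related data and metrics can reveal the trends and causes of attrition, pay equity, employee satisfaction, and career development paths. These insights are essential for optimizing recruitment, retention strategies, performance evaluation systems, and employee development plans. By analyzing these multidimensional data in depth, an organization can design more people-oriented management measures, improve the work environment, and raise employee satisfaction and loyalty, which in turn supports overall performance and competitiveness. The results also help in designing fairer and more motivating compensation systems that attract and retain top talent and support the organization's long-term success and sustainable development.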

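Data Description

Field	Description
EmpID	Unique employee ID
Age	Age
AgeGroup	Age group
Attrition	Whether the employee left the company
BusinessTravel	Business travel: rarely, frequently, none
DailyRate	Daily rate
Department	Department: Research & Development, Sales, Human Resources
DistanceFromHome	Commuting distance from home
Education	Education level
EducationField	Field of study: life sciences, medical, marketing, technical, human resources, other
EnvironmentSatisfaction	Satisfaction with the work environment
Gender	Gender
HourlyRate	Hourly rate
JobInvolvement	Job involvement
JobLevel	Job level
JobRole	Job role
JobSatisfaction	Job satisfaction
MaritalStatus	Marital status
MonthlyIncome	Monthly income
SalarySlab	Salary band
MonthlyRate	Monthly rate
NumCompaniesWorked	Number of companies worked for
PercentSalaryHike	Percentage salary increase
PerformanceRating	Performance rating
RelationshipSatisfaction	Relationship satisfaction
StandardHours	Standard working hours
StockOptionLevel	Stock option level
TotalWorkingYears	Total years of work experience
TrainingTimesLastYear	Number of trainings last year
WorkLifeBalance	Work-life balance rating
YearsAtCompany	Years at the company
YearsInCurrentRole	Years in the current role
YearsSinceLastPromotion	Years since the last promotion
YearsWithCurrManager	Years with the current manager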

Data Source

https://www.kaggle.com/datasets/anshika2301/hr-analytics-dataset

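Problem Statement

Employee attrition analysis

Identify the factors that lead employees to leave (the relationship between Attrition and other fields such as satisfaction, pay, and commuting distance).

Analyze how age group, marital status, and working years relate to the attrition rate.

Pay equity study

Compare pay differences across gender (Gender), education level (Education), and field of study (EducationField).

Explore how job level (JobLevel) and job role (JobRole) relate to monthly income (MonthlyIncome), hourly rate (HourlyRate), and daily rate (DailyRate).

Job satisfaction analysis

Assess the association between job satisfaction (JobSatisfaction), environment satisfaction (EnvironmentSatisfaction), relationship satisfaction (RelationshipSatisfaction), and employee performance (PerformanceRating).

Analyze how the work-life balance rating (WorkLifeBalance) relates to job involvement (JobInvolvement) and years at the company (YearsAtCompany).

Career development and promotion path analysis

Examine the association between promotion history (YearsSinceLastPromotion) and job satisfaction, job level, and performance rating.

Analyze how time in the current role (YearsInCurrentRole) affects job involvement and promotion opportunities.

Training and development needs assessment

Assess the relationship between the number of trainings (TrainingTimesLastYear) and the performance rating.

Analyze the relationship between work experience (TotalWorkingYears) and training needs.

Employee benefits and incentives analysis

Explore the effect of stock option level (StockOptionLevel) on employee retention.

Analyze how the percentage salary increase (PercentSalaryHike) relates to employee satisfaction and performance.

HR planning and forecasting

Predict which factors affect employee retention (such as pay, job satisfaction, and work environment).

Build models on historical data to predict employee promotion paths and potential attrition risk.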

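2. Data Loading and Preprocessing

# Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
hr_data = pd.read_csv("HR_Analytics.csv")

# Check the dimensions of the data
hr_data.shape

(1480, 34)

# Show column types and non-null counts
hr_data.info()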

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 34 columns):
EmpID                       1480 non-null object
Age                         1480 non-null int64
AgeGroup                    1480 non-null object
Attrition                   1480 non-null object
BusinessTravel              1480 non-null object
DailyRate                   1480 non-null int64
Department                  1480 non-null object
DistanceFromHome            1480 non-null int64
Education                   1480 non-null int64
EducationField              1480 non-null object
EnvironmentSatisfaction     1480 non-null int64
Gender                      1480 non-null object
HourlyRate                  1480 non-null int64
JobInvolvement              1480 non-null int64
JobLevel                    1480 non-null int64
JobRole                     1480 non-null object
JobSatisfaction             1480 non-null int64
MaritalStatus               1480 non-null object
MonthlyIncome               1480 non-null int64
SalarySlab                  1480 non-null object
MonthlyRate                 1480 non-null int64
NumCompaniesWorked          1480 non-null int64
PercentSalaryHike           1480 non-null int64
PerformanceRating           1480 non-null int64
RelationshipSatisfaction    1480 non-null int64
StandardHours               1480 non-null int64
StockOptionLevel            1480 non-null int64
TotalWorkingYears           1480 non-null int64
TrainingTimesLastYear       1480 non-null int64
WorkLifeBalance             1480 non-null int64
YearsAtCompany              1480 non-null int64
YearsInCurrentRole          1480 non-null int64
YearsSinceLastPromotion     1480 non-null int64
YearsWithCurrManager        1423 non-null float64
dtypes: float64(1), int64(23), object(10)
memory usage: 393.2+ KB

# Descriptive statistics for all columns
hr_data.describe(include='all')

	EmpID	Age	AgeGroup	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	…	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
count	1480	1480.000000	1480	1480	1480	1480.000000	1480	1480.000000	1480.000000	1480	…	1480.000000	1480.0	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1423.000000
unique	1470	NaN	5	2	4	NaN	3	NaN	NaN	6	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
top	RM1462	NaN	26-35	No	Travel_Rarely	NaN	Research & Development	NaN	NaN	Life Sciences	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
freq	2	NaN	611	1242	1042	NaN	967	NaN	NaN	607	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
mean	NaN	36.917568	NaN	NaN	NaN	801.384459	NaN	9.220270	2.910811	NaN	…	2.708784	80.0	0.791892	11.281757	2.797973	2.760811	7.009459	4.228378	2.182432	4.118060
std	NaN	9.128559	NaN	NaN	NaN	403.126988	NaN	8.131201	1.023796	NaN	…	1.081995	0.0	0.850527	7.770870	1.288791	0.707024	6.117945	3.616020	3.219357	3.555484
min	NaN	18.000000	NaN	NaN	NaN	102.000000	NaN	1.000000	1.000000	NaN	…	1.000000	80.0	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000
25%	NaN	30.000000	NaN	NaN	NaN	465.000000	NaN	2.000000	2.000000	NaN	…	2.000000	80.0	0.000000	6.000000	2.000000	2.000000	3.000000	2.000000	0.000000	2.000000
50%	NaN	36.000000	NaN	NaN	NaN	800.000000	NaN	7.000000	3.000000	NaN	…	3.000000	80.0	1.000000	10.000000	3.000000	3.000000	5.000000	3.000000	1.000000	3.000000
75%	NaN	43.000000	NaN	NaN	NaN	1157.000000	NaN	14.000000	4.000000	NaN	…	4.000000	80.0	1.000000	15.000000	3.000000	3.000000	9.000000	7.000000	3.000000	7.000000
max	NaN	60.000000	NaN	NaN	NaN	1499.000000	NaN	29.000000	5.000000	NaN	…	4.000000	80.0	3.000000	40.000000	6.000000	4.000000	40.000000	18.000000	15.000000	17.000000

11 rows × 34 columns

# Count missing values in each column
hr_data.isna().sum()

EmpID                        0
Age                          0
AgeGroup                     0
Attrition                    0
BusinessTravel               0
DailyRate                    0
Department                   0
DistanceFromHome             0
Education                    0
EducationField               0
EnvironmentSatisfaction      0
Gender                       0
HourlyRate                   0
JobInvolvement               0
JobLevel                     0
JobRole                      0
JobSatisfaction              0
MaritalStatus                0
MonthlyIncome                0
SalarySlab                   0
MonthlyRate                  0
NumCompaniesWorked           0
PercentSalaryHike            0
PerformanceRating            0
RelationshipSatisfaction     0
StandardHours                0
StockOptionLevel             0
TotalWorkingYears            0
TrainingTimesLastYear        0
WorkLifeBalance              0
YearsAtCompany               0
YearsInCurrentRole           0
YearsSinceLastPromotion      0
YearsWithCurrManager        57
dtype: int64

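Note: the YearsWithCurrManager column (years with the current manager) has 57 missing values. Here I assume they equal YearsInCurrentRole (years in the current role) and fill the gaps with that column's values.

# Fill missing YearsWithCurrManager values with YearsInCurrentRole
# (assigning the result back avoids the chained inplace call, which newer pandas versions warn about)
hr_data['YearsWithCurrManager'] = hr_data['YearsWithCurrManager'].fillna(hr_data['YearsInCurrentRole'])

# Check the missing values again
hr_data.isna().sum()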

EmpID                       0
Age                         0
AgeGroup                    0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
SalarySlab                  0
MonthlyRate                 0
NumCompaniesWorked          0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

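# Count fully duplicated rows
hr_data.duplicated().sum()

Note: the dataset contains 7 duplicated rows. Because the later analysis needs each row to represent a distinct record, these duplicates should be removed. The code shown here only counts them, however, and the outputs further down (for example a 296-row test set, which is 20% of 1480) indicate that the results in this tutorial were computed on all 1480 rows.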
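The note above mentions removing duplicates without showing the call. A minimal sketch of that step follows, using a small made-up frame in place of hr_data (the rows and values are assumptions for illustration only). Applying the same call to hr_data would leave 1473 rows, so the printed outputs later in the tutorial would shift slightly.

```python
import pandas as pd

# Toy frame standing in for hr_data; the values are made up for illustration.
toy = pd.DataFrame({
    "EmpID": ["RM001", "RM002", "RM002", "RM003"],
    "Age": [30, 41, 41, 25],
    "Attrition": ["No", "Yes", "Yes", "No"],
})

# duplicated() marks rows that are identical to an earlier row in every column
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index tidies the row labels
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped.shape)
```

On the real data the equivalent line would be hr_data = hr_data.drop_duplicates().reset_index(drop=True), placed right after the duplicate check.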

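3. Employee Attrition Analysis

Correlation analysis for attrition

Identify the factors that lead employees to leave (the relationship between Attrition and other fields such as satisfaction, pay, and commuting distance).

# Convert the Attrition text values to numbers (Yes: 1, No: 0)
hr_data['Attrition'] = hr_data['Attrition'].map({'Yes': 1, 'No': 0})

# Correlation between Attrition and the other numeric columns
# (numeric_only=True skips the text columns, which pandas 2.x would otherwise reject)
correlation_with_attrition = hr_data.corr(numeric_only=True)['Attrition'].sort_values()
correlation_with_attrition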

TotalWorkingYears          -0.168358
JobLevel                   -0.167150
YearsWithCurrManager       -0.161021
YearsInCurrentRole         -0.160968
MonthlyIncome              -0.157672
Age                        -0.155476
StockOptionLevel           -0.135140
YearsAtCompany             -0.135108
JobInvolvement             -0.130769
JobSatisfaction            -0.104232
EnvironmentSatisfaction    -0.101696
WorkLifeBalance            -0.062646
TrainingTimesLastYear      -0.058415
DailyRate                  -0.056976
RelationshipSatisfaction   -0.045387
YearsSinceLastPromotion    -0.032244
Education                  -0.030144
PercentSalaryHike          -0.014603
HourlyRate                 -0.008252
PerformanceRating           0.002531
MonthlyRate                 0.016585
NumCompaniesWorked          0.045336
DistanceFromHome            0.080006
Attrition                   1.000000
StandardHours                    NaN
Name: Attrition, dtype: float64

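Note: reading the correlation coefficients listed above.

Negative correlations: every field from TotalWorkingYears (-0.168) down to HourlyRate (-0.008) is negatively correlated with Attrition. The strongest are TotalWorkingYears, JobLevel, YearsWithCurrManager, YearsInCurrentRole, MonthlyIncome, and Age. This suggests that (1) the more total working years an employee has, the less likely they are to leave, (2) the higher the job level, the less likely they are to leave, (3) the longer they have held the current role, the less likely they are to leave, and (4) the higher the monthly income, the less likely they are to leave. PercentSalaryHike also sits in this group (-0.015): a smaller raise goes with slightly higher attrition, although the relationship is very weak.

Positive correlations: PerformanceRating (0.003), MonthlyRate (0.017), NumCompaniesWorked (0.045), and DistanceFromHome (0.080). The clearest reading is that the farther an employee lives from work, the more likely they are to leave.

StandardHours shows NaN because the column is constant (80 for every employee), so no correlation can be computed. All coefficients are small (below 0.17 in absolute value), so they indicate weak tendencies rather than strong effects, but their directions match common intuition.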

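Attrition rate by age group, marital status, and working years

Analyze how age group, marital status, and working years relate to the attrition rate.

# Set the plotting style
sns.set(style="whitegrid")
# Plot the attrition rate for each age group (the bar height is the mean of the 0/1 Attrition column)
plt.figure(figsize=(12, 6))
sns.barplot(x='AgeGroup', y='Attrition', data=hr_data, errorbar=None)  # errorbar=None replaces the deprecated ci=None (seaborn 0.12+)
plt.title('Attrition Rate by Age Group')
plt.ylabel('Attrition')
plt.xlabel('AgeGroup')
plt.show()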

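The 18-25 age group has a high attrition rate, and so does the 55+ group, while the 36-45 group has a low rate. In other words, the youngest and the oldest employees leave at higher rates than employees in mid-career.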

# Marital status and attrition (countplot shows head counts, not rates)
plt.figure(figsize=(8, 5))
sns.countplot(x='MaritalStatus', hue='Attrition', data=hr_data)
plt.title('Marital Status and Attrition')
plt.xlabel('MaritalStatus')
plt.ylabel('Count')
plt.show()

Single employees have a relatively high attrition rate. Married and divorced employees have relatively low attrition rates.
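The count plot shows head counts, so the rates have to be judged by eye from bar heights. As a hedged sketch, the rate per group can be computed directly as the mean of the 0/1 Attrition column. The toy rows below are invented for illustration; on the real data the same groupby would run on hr_data after Attrition has been mapped to 0/1.

```python
import pandas as pd

# Invented rows standing in for hr_data (Attrition already mapped to 0/1)
toy = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Single",
                      "Married", "Married", "Married", "Married",
                      "Divorced", "Divorced"],
    "Attrition": [1, 1, 0, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the attrition rate per group
rate_by_status = toy.groupby("MaritalStatus")["Attrition"].mean()

print(rate_by_status)
```

Sorting the result with sort_values(ascending=False) gives a quick ranking of groups by attrition rate.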

# Plot the attrition rate against total working years
plt.figure(figsize=(12, 6))
sns.lineplot(x='TotalWorkingYears', y='Attrition', data=hr_data)
plt.title('Attrition Rate by Total Working Years')
plt.ylabel('Attrition')
plt.xlabel('TotalWorkingYears')
plt.show()

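Attrition is high among employees with few working years, especially under 5 years. As working years increase, the attrition rate gradually falls, and it is clearly lower for employees with more than 10 years. The rate at 40 working years is also high, abnormally so.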
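An abnormal spike at one x value can come from a group with very few employees. One way to check, sketched here on invented data, is to report the group size next to each rate. The threshold MIN_GROUP_SIZE is a hypothetical choice, not something taken from the tutorial.

```python
import pandas as pd

# Invented rows standing in for hr_data (Attrition already mapped to 0/1)
toy = pd.DataFrame({
    "TotalWorkingYears": [1, 1, 1, 1, 10, 10, 10, 10, 40],
    "Attrition":         [1, 1, 0, 0, 0, 0, 0, 1, 1],
})

# Rate and head count side by side for every value of TotalWorkingYears
summary = toy.groupby("TotalWorkingYears")["Attrition"].agg(rate="mean", n="count")

# Hypothetical cut-off: treat rates from very small groups as unreliable
MIN_GROUP_SIZE = 3
summary["reliable"] = summary["n"] >= MIN_GROUP_SIZE

print(summary)
```

In this toy example the 40-year group has a rate of 1.0 but contains a single employee, which is the kind of case the extra column is meant to flag.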

4. Pay Equity Study

Pay differences by gender, education level, and field of study

Compare pay differences across gender (Gender), education level (Education), and field of study (EducationField).

# Pay difference analysis

# Compare monthly income by gender
plt.figure(figsize=(8, 5))
sns.boxplot(x='Gender', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Gender')
plt.xlabel('Gender')
plt.ylabel('MonthlyIncome')
plt.show()

# Compare monthly income by education level
plt.figure(figsize=(8, 5))
sns.boxplot(x='Education', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Education Level')
plt.xlabel('Education')
plt.ylabel('MonthlyIncome')
plt.show()

# Compare monthly income by field of study
plt.figure(figsize=(12, 6))
sns.boxplot(x='EducationField', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Field of Study')
plt.xlabel('EducationField')
plt.ylabel('MonthlyIncome')
plt.xticks(rotation=45)
plt.show()

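The income distributions of men and women are about the same. Income rises roughly linearly with education level: the higher the education level, the higher the monthly income. Monthly income also differs by field of study: Marketing is relatively high, and Human Resources is relatively low.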

Job level and job role versus monthly income, hourly rate, and daily rate

Explore how job level (JobLevel) and job role (JobRole) relate to monthly income (MonthlyIncome), hourly rate (HourlyRate), and daily rate (DailyRate).

# Job level vs. monthly income
plt.figure(figsize=(10, 6))
sns.boxplot(x='JobLevel', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Job Level')
plt.xlabel('JobLevel')
plt.ylabel('MonthlyIncome')
plt.show()

# Job role vs. monthly income
plt.figure(figsize=(14, 7))
sns.boxplot(x='JobRole', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Job Role')
plt.xlabel('JobRole')
plt.ylabel('MonthlyIncome')
plt.xticks(rotation=45)
plt.show()

# Job level vs. hourly rate
plt.figure(figsize=(10, 6))
sns.boxplot(x='JobLevel', y='HourlyRate', data=hr_data)
plt.title('Hourly Rate by Job Level')
plt.xlabel('JobLevel')
plt.ylabel('HourlyRate')
plt.show()

# Job role vs. hourly rate
plt.figure(figsize=(14, 7))
sns.boxplot(x='JobRole', y='HourlyRate', data=hr_data)
plt.title('Hourly Rate by Job Role')
plt.xlabel('JobRole')
plt.ylabel('HourlyRate')
plt.xticks(rotation=45)
plt.show()

# Job level vs. daily rate
plt.figure(figsize=(10, 6))
sns.boxplot(x='JobLevel', y='DailyRate', data=hr_data)
plt.title('Daily Rate by Job Level')
plt.xlabel('JobLevel')
plt.ylabel('DailyRate')
plt.show()

# Job role vs. daily rate
plt.figure(figsize=(14, 7))
sns.boxplot(x='JobRole', y='DailyRate', data=hr_data)
plt.title('Daily Rate by Job Role')
plt.xlabel('JobRole')
plt.ylabel('DailyRate')
plt.xticks(rotation=45)
plt.show()

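The job level results match expectations: the higher the level, the higher the pay. The median pay is similar for most job roles, although the upper end of pay for the Human Resources role does appear to be relatively low.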

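5. Job Satisfaction Analysis

Job satisfaction, environment satisfaction, and relationship satisfaction versus performance

Assess the association between job satisfaction (JobSatisfaction), environment satisfaction (EnvironmentSatisfaction), relationship satisfaction (RelationshipSatisfaction), and employee performance (PerformanceRating).

# Draw a pair plot to visualize the relationships among these variables
sns.pairplot(hr_data, vars=["JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction", "PerformanceRating"])
plt.show()

# Compute the correlation coefficients among these variables
correlation_matrix = hr_data[["JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction", "PerformanceRating"]].corr()

# Print the correlation matrix
print(correlation_matrix)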

                          JobSatisfaction  EnvironmentSatisfaction  \
JobSatisfaction                  1.000000                -0.010201   
EnvironmentSatisfaction         -0.010201                 1.000000   
RelationshipSatisfaction        -0.009918                 0.009256   
PerformanceRating                0.002421                -0.031625   

                          RelationshipSatisfaction  PerformanceRating  
JobSatisfaction                          -0.009918           0.002421  
EnvironmentSatisfaction                   0.009256          -0.031625  
RelationshipSatisfaction                  1.000000          -0.031020  
PerformanceRating                        -0.031020           1.000000

# Plot again, this time as a heatmap
selected_columns = ["JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction", "PerformanceRating"]

# Compute the correlation matrix for these columns
correlation_matrix = hr_data[selected_columns].corr()

# Draw a heatmap to visualize the correlations
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

correlation_matrix

	JobSatisfaction	EnvironmentSatisfaction	RelationshipSatisfaction	PerformanceRating
JobSatisfaction	1.000000	-0.010201	-0.009918	0.002421
EnvironmentSatisfaction	-0.010201	1.000000	0.009256	-0.031625
RelationshipSatisfaction	-0.009918	0.009256	1.000000	-0.031020
PerformanceRating	0.002421	-0.031625	-0.031020	1.000000

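The heatmap shows the associations among job satisfaction (JobSatisfaction), environment satisfaction (EnvironmentSatisfaction), relationship satisfaction (RelationshipSatisfaction), and employee performance (PerformanceRating). All pairwise coefficients are close to zero (the largest in magnitude is about -0.03), so there is no meaningful linear relationship among these variables.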
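The default corr() call computes Pearson coefficients, which measure linear association. Rating fields like these take a few ordered levels, so a rank-based coefficient is a common cross-check. The sketch below compares the two on invented ratings. It is offered as an optional check under that assumption, not as the method used in this tutorial.

```python
import pandas as pd

# Invented ordinal ratings standing in for the hr_data columns
toy = pd.DataFrame({
    "JobSatisfaction":   [1, 2, 3, 4, 4, 3, 2, 1],
    "PerformanceRating": [3, 3, 3, 4, 4, 3, 3, 3],
})

# Pearson (the default) next to Spearman rank correlation
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

If both methods give coefficients near zero on the real columns, the conclusion of no meaningful association does not depend on the choice of coefficient.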

Work-life balance versus job involvement and years at the company

Analyze how the work-life balance rating (WorkLifeBalance) relates to job involvement (JobInvolvement) and years at the company (YearsAtCompany).

# Scatter plot of work-life balance against job involvement
plt.figure(figsize=(10, 6))
sns.scatterplot(data=hr_data, x="WorkLifeBalance", y="JobInvolvement", hue="YearsAtCompany", palette="coolwarm", size="YearsAtCompany", sizes=(20, 200))
plt.title("WorkLifeBalance vs. JobInvolvement (Color by YearsAtCompany)")
plt.show()

# Box plot of years at the company for each work-life balance rating
plt.figure(figsize=(8, 6))
sns.boxplot(data=hr_data, x="WorkLifeBalance", y="YearsAtCompany")
plt.title("WorkLifeBalance vs. YearsAtCompany")
plt.show()

These two plots are hard to read, so here is a 3D scatter plot of the same three fields.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions

# Create a new Matplotlib figure with 3D axes
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Draw the 3D scatter plot once and keep the handle for the colorbar
scatter = ax.scatter(hr_data["WorkLifeBalance"], hr_data["JobInvolvement"], hr_data["YearsAtCompany"], c=hr_data["YearsAtCompany"], cmap='coolwarm', s=50)

# Set the axis labels
ax.set_xlabel("WorkLifeBalance")
ax.set_ylabel("JobInvolvement")
ax.set_zlabel("YearsAtCompany")

# Add a colorbar as the color legend
cbar = fig.colorbar(scatter, ax=ax)
cbar.set_label("YearsAtCompany")

plt.title("3D Scatter Plot (WorkLifeBalance, JobInvolvement, YearsAtCompany)")
plt.show()

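Broadly, the plot shows the following. Employees with few years at the company (YearsAtCompany) make up a large share of the data.

1. Low work-life balance (WorkLifeBalance), low job involvement (JobInvolvement), and few years at the company (YearsAtCompany): relatively few employees fall here.
2. Mid-range work-life balance (WorkLifeBalance) and mid-range job involvement (JobInvolvement): a large number of employees fall here.

Possible relationships (hypotheses that the plot alone does not confirm):

Positive relationship: work-life balance and job involvement may be positively related. Employees who feel their work and life are well balanced may find it easier to commit to their work and show higher involvement, because they can manage work pressure better and stay in a positive frame of mind.

Negative relationship: conversely, work-life balance may be negatively related to years at the company. Employees who work long hours and feel the balance is poor may become exhausted and burned out, and may consider leaving or looking for a better balance.

Mediating relationship: work-life balance may also act as a mediator between job involvement and years at the company. Good work-life balance may raise job involvement and eventually lead employees to stay longer.

Individual differences: the relationships may vary from person to person. Some employees care more about work-life balance, while others care more about job involvement. Age, family situation, and career goals may also affect these relationships.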
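Because both ratings take only a few integer values, many points overlap in a scatter plot. A table of counts and group means per rating pair is one way to make statements such as "few employees fall here" checkable. The sketch below uses invented rows; on the real data the same calls would take hr_data.

```python
import pandas as pd

# Invented rows standing in for hr_data
toy = pd.DataFrame({
    "WorkLifeBalance": [1, 1, 2, 2, 3, 3, 3, 4],
    "JobInvolvement":  [1, 2, 2, 3, 3, 3, 4, 4],
    "YearsAtCompany":  [1, 3, 5, 2, 10, 6, 8, 4],
})

# Number of employees in every (WorkLifeBalance, JobInvolvement) cell
counts = pd.crosstab(toy["WorkLifeBalance"], toy["JobInvolvement"])

# Mean years at the company in every cell (NaN where a cell is empty)
mean_years = toy.pivot_table(index="WorkLifeBalance", columns="JobInvolvement",
                             values="YearsAtCompany", aggfunc="mean")

print(counts)
print(mean_years)
```

Either table can be passed to sns.heatmap with annot=True, in the same way the correlation matrices are plotted elsewhere in this tutorial.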

6. Career Development and Promotion Path Analysis

Promotion history versus job satisfaction, job level, and performance rating

Examine the association between promotion history (YearsSinceLastPromotion) and job satisfaction, job level, and performance rating.

# Select the fields
selected_columns = ["YearsSinceLastPromotion", "JobSatisfaction", "JobLevel", "PerformanceRating"]
subset_data = hr_data[selected_columns]

# Compute the correlation matrix for these columns
correlation_matrix = subset_data.corr()

# Draw a heatmap to visualize the correlations
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap (YearsSinceLastPromotion, JobSatisfaction, JobLevel, PerformanceRating)")
plt.show()

correlation_matrix

	YearsSinceLastPromotion	JobSatisfaction	JobLevel	PerformanceRating
YearsSinceLastPromotion	1.000000	-0.014979	0.355518	0.017239
JobSatisfaction	-0.014979	1.000000	-0.001440	0.002421
JobLevel	0.355518	-0.001440	1.000000	-0.021588
PerformanceRating	0.017239	0.002421	-0.021588	1.000000

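Promotion history (YearsSinceLastPromotion) versus the other variables: the correlation with job satisfaction (JobSatisfaction) is -0.014979, close to zero, so there is no clear linear relationship between the two. The correlation with job level (JobLevel) is 0.355518, a moderate positive value: employees at higher job levels tend to have gone longer since their last promotion. The correlation with performance rating (PerformanceRating) is 0.017239, close to zero, so there is no clear linear relationship there either.

Job satisfaction (JobSatisfaction) versus the other variables: the correlation with job level (JobLevel) is -0.001440, close to zero, so the two have almost no linear relationship. The correlation with performance rating (PerformanceRating) is 0.002421, also close to zero.

Job level (JobLevel) versus performance rating (PerformanceRating): the correlation is -0.021588, close to zero, so there is no strong linear relationship between the two.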

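Effect of time in the current role on job involvement and promotion opportunities

Analyze how time in the current role (YearsInCurrentRole) affects job involvement and promotion opportunities.

The code below looks at time in the current role against job satisfaction and years since the last promotion.

# Select the fields and compute their correlation matrix
correlation_matrix_2 = hr_data[['YearsInCurrentRole', 'JobSatisfaction', 'YearsSinceLastPromotion']].corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(8, 5))
sns.heatmap(correlation_matrix_2, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix for YearsInCurrentRole, JobSatisfaction, and YearsSinceLastPromotion")
plt.show()

correlation_matrix_2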

	YearsInCurrentRole	JobSatisfaction	YearsSinceLastPromotion
YearsInCurrentRole	1.000000	-0.001871	0.548418
JobSatisfaction	-0.001871	1.000000	-0.014979
YearsSinceLastPromotion	0.548418	-0.014979	1.000000

Time in the current role (YearsInCurrentRole) versus the other variables: the correlation with job satisfaction (JobSatisfaction) is -0.001871, close to zero, so the two have almost no linear relationship. The correlation with years since the last promotion (YearsSinceLastPromotion) is 0.548418, a positive value: employees who have been in their current role longer have usually also gone longer since their last promotion.

Job satisfaction (JobSatisfaction) versus years since the last promotion (YearsSinceLastPromotion): the correlation is -0.014979, close to zero, so there is no clear linear relationship between the two. Because the matrix is symmetric, the remaining entries repeat the values already discussed.

7. Training and Development Needs Assessment

Number of trainings versus performance rating

Assess the relationship between the number of trainings (TrainingTimesLastYear) and the performance rating.

# Correlation coefficient between the two columns
correlation = hr_data['TrainingTimesLastYear'].corr(hr_data['PerformanceRating'])

# Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='TrainingTimesLastYear', y='PerformanceRating', data=hr_data)
plt.title('Relationship between Training Times Last Year and Performance Rating')
plt.xlabel('Training Times Last Year')
plt.ylabel('Performance Rating')
plt.grid(True)
plt.show()

correlation

-0.019123465918259225

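The correlation between TrainingTimesLastYear (number of trainings last year) and PerformanceRating (performance rating) is very low (about -0.019), which indicates that there is no meaningful linear relationship between the two.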

Work experience versus training needs

Analyze the relationship between work experience (TotalWorkingYears) and training needs.

# Correlation coefficient between the two columns
correlation_work_train = hr_data['TotalWorkingYears'].corr(hr_data['TrainingTimesLastYear'])

# Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='TotalWorkingYears', y='TrainingTimesLastYear', data=hr_data)
plt.title('Relationship between Total Working Years and Training Times Last Year')
plt.xlabel('Total Working Years')
plt.ylabel('Training Times Last Year')
plt.grid(True)
plt.show()

correlation_work_train

-0.0348195234809513

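The correlation between TotalWorkingYears (total working years) and TrainingTimesLastYear (number of trainings last year) is also very low (about -0.035), which likewise indicates that there is no meaningful linear relationship between the two.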

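8. Employee Benefits and Incentives Analysis

Stock option level versus employee retention

Explore the effect of stock option level (StockOptionLevel) on employee retention.

# Group by stock option level and compute the mean of Attrition
# (despite the variable name, this is the attrition rate; retention is 1 minus this value)
retention_by_stock_option = hr_data.groupby('StockOptionLevel')['Attrition'].mean()

# Plot
plt.figure(figsize=(10, 6))
retention_by_stock_option.plot(kind='bar')
plt.title('Employee Attrition Rate by Stock Option Level')
plt.xlabel('Stock Option Level')
plt.ylabel('Attrition Rate (Lower is Better)')
plt.xticks(rotation=0)
plt.grid(True)
plt.show()

retention_by_stock_option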

StockOptionLevel
0    0.242138
1    0.094842
2    0.075949
3    0.176471
Name: Attrition, dtype: float64

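Employees with stock option level 0 have the highest attrition rate, about 24.21%. Employees at levels 1 and 2 have low attrition rates, 9.48% and 7.59% respectively. Employees at level 3 have an attrition rate of about 17.65%, which is higher than levels 1 and 2 but still below level 0.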
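A rate computed from a small group is uncertain, so a gap between two levels may be partly noise. One hedged way to gauge this is a Wilson score interval for each rate. The function below is a standard-library sketch, and the counts passed to it are hypothetical: the tutorial's output does not show how many employees are at each level.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical counts: the same observed rate (about 17.6%) at two group sizes
small_lo, small_hi = wilson_interval(3, 17)
large_lo, large_hi = wilson_interval(30, 170)

print(round(small_lo, 3), round(small_hi, 3))
print(round(large_lo, 3), round(large_hi, 3))
```

The interval for the smaller group is much wider, which is why group sizes matter when comparing attrition rates across stock option levels.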

Percentage salary increase versus employee satisfaction and performance

Analyze how the percentage salary increase (PercentSalaryHike) relates to employee satisfaction and performance.

# Compute the correlation coefficients
correlation_salary_hike_satisfaction = hr_data['PercentSalaryHike'].corr(hr_data['JobSatisfaction'])
correlation_salary_hike_performance = hr_data['PercentSalaryHike'].corr(hr_data['PerformanceRating'])

# Percentage salary increase vs. job satisfaction
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='PercentSalaryHike', y='JobSatisfaction', data=hr_data)
plt.title('Percent Salary Hike vs Job Satisfaction')
plt.xlabel('Percent Salary Hike')
plt.ylabel('Job Satisfaction')
plt.grid(True)

# Percentage salary increase vs. performance rating
plt.subplot(1, 2, 2)
sns.scatterplot(x='PercentSalaryHike', y='PerformanceRating', data=hr_data)
plt.title('Percent Salary Hike vs Performance Rating')
plt.xlabel('Percent Salary Hike')
plt.ylabel('Performance Rating')
plt.grid(True)

plt.tight_layout()
plt.show()

(correlation_salary_hike_satisfaction, correlation_salary_hike_performance)

(0.018850833014910383, 0.7724203035631152)

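Percentage salary increase and satisfaction: the correlation is very low (about 0.019), so there is no meaningful linear relationship between the percentage salary increase and job satisfaction.

Percentage salary increase and performance rating: the correlation is high (about 0.772), so there is a strong positive relationship between the percentage salary increase and the performance rating.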

9. HR Planning and Forecasting

Predicting employee retention

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Select candidate features
features = ['MonthlyIncome', 'JobSatisfaction', 'Department', 'WorkLifeBalance']
target = 'Attrition'

# Check the selected features for missing values
missing_values = hr_data[features].isnull().sum()

# Data preparation
# Encode the categorical column as integers (this overwrites the Department text in hr_data)
le = LabelEncoder()
hr_data['Department'] = le.fit_transform(hr_data['Department'])

# Select the feature and target data
X = hr_data[features]
y = hr_data[target]

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

missing_values, accuracy, report

(MonthlyIncome      0
 JobSatisfaction    0
 Department         0
 WorkLifeBalance    0
 dtype: int64,
 0.777027027027027,
 '              precision    recall  f1-score   support\n\n           0       0.85      0.89      0.87       249\n           1       0.24      0.19      0.21        47\n\n    accuracy                           0.78       296\n   macro avg       0.55      0.54      0.54       296\nweighted avg       0.76      0.78      0.77       296\n')

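The model's accuracy is 77.7%. The classification report shows that the model predicts employees who stayed (label "0") fairly well, with a precision of 85% and a recall of 89%. It performs poorly on employees who left (label "1"), with a precision of 24% and a recall of 19%. The test set is imbalanced (249 stayed versus 47 left), so accuracy alone gives an overly favorable picture.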
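To put the 77.7% accuracy in context, it helps to compare it with a trivial baseline that always predicts "stayed". The sketch below only does arithmetic on the numbers printed in the classification report above. The variable names are illustrative, and the balanced accuracy is approximate because the report rounds recall to two decimals.

```python
# Numbers copied from the classification report above
support_stayed = 249      # test rows with label 0
support_left = 47         # test rows with label 1
model_accuracy = 0.777027027027027
recall_stayed = 0.89
recall_left = 0.19

total = support_stayed + support_left

# Accuracy of a model that always predicts label 0
majority_baseline = support_stayed / total

# Balanced accuracy is the mean of the per-class recalls
balanced_accuracy = (recall_stayed + recall_left) / 2

print(round(majority_baseline, 3))   # 0.841
print(round(balanced_accuracy, 2))   # 0.54
print(model_accuracy < majority_baseline)
```

The baseline reaches about 84.1%, which is above the model's 77.7%, and the balanced accuracy of about 0.54 is close to the 0.5 that random guessing would give. With these four features the model therefore adds little for identifying employees who leave.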

Predicting promotion paths and potential attrition risk

Build models on historical data to predict employee promotion paths and potential attrition risk.

# Define the target variable for promotion
# Assume that an employee was promoted within the last year if 'YearsSinceLastPromotion' is 0
hr_data['RecentlyPromoted'] = hr_data['YearsSinceLastPromotion'].apply(lambda x: 1 if x == 0 else 0)

# Select candidate features that may affect promotion
# For simplicity, use a few common features
promotion_features = ['TotalWorkingYears', 'JobSatisfaction', 'PerformanceRating', 'TrainingTimesLastYear', 'YearsAtCompany']
promotion_target = 'RecentlyPromoted'

# Data preparation
# All selected features are numeric, so no label encoding is needed here

# Select the feature and target data for the promotion model
X_promotion = hr_data[promotion_features]
y_promotion = hr_data[promotion_target]

# Split the dataset into training and test sets for the promotion model
X_train_promo, X_test_promo, y_train_promo, y_test_promo = train_test_split(X_promotion, y_promotion, test_size=0.2, random_state=42)

# Build a random forest classifier for the promotion model
rf_classifier_promo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier_promo.fit(X_train_promo, y_train_promo)

# Predict and evaluate the promotion model
y_pred_promo = rf_classifier_promo.predict(X_test_promo)
accuracy_promo = accuracy_score(y_test_promo, y_pred_promo)
report_promo = classification_report(y_test_promo, y_pred_promo)

accuracy_promo, report_promo

(0.6790540540540541,
 '              precision    recall  f1-score   support\n\n           0       0.73      0.81      0.76       191\n           1       0.56      0.45      0.50       105\n\n    accuracy                           0.68       296\n   macro avg       0.64      0.63      0.63       296\nweighted avg       0.67      0.68      0.67       296\n')

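The model's accuracy is 67.9%. The classification report shows that it predicts employees who were not promoted (label "0") fairly well, with a precision of 73% and a recall of 81%. It performs worse on promoted employees (label "1"), with a precision of 56% and a recall of 45%.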
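Both random forest models above struggle with the minority class. The sketch below shows a few common adjustments on synthetic data: one-hot encoding instead of integer labels for the department, a stratified split, class_weight="balanced", and balanced accuracy as the metric. The data, the toy relationship between income and leaving, and the variable names are all assumptions for illustration. None of this has been run on the HR dataset, so no improvement is claimed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for hr_data (all values are made up)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "MonthlyIncome": rng.integers(1000, 20000, size=n),
    "JobSatisfaction": rng.integers(1, 5, size=n),
    "Department": rng.choice(["Sales", "Research & Development", "Human Resources"], size=n),
})
# Assumed toy rule: lower income means a higher chance of leaving
leave_prob = np.where(toy["MonthlyIncome"] < 5000, 0.4, 0.1)
toy["Attrition"] = (rng.random(n) < leave_prob).astype(int)

# One-hot encode the department so no artificial ordering is implied
X = pd.get_dummies(toy.drop(columns="Attrition"), columns=["Department"])
y = toy["Attrition"]

# stratify=y keeps the share of leavers the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight="balanced" gives the rarer class more weight during training
clf = RandomForestClassifier(n_estimators=100, random_state=42, class_weight="balanced")
clf.fit(X_train, y_train)

bal_acc = balanced_accuracy_score(y_test, clf.predict(X_test))
print(round(bal_acc, 3))
```

Balanced accuracy averages the recall of both classes, so a model that ignores the minority class scores about 0.5 regardless of how imbalanced the data is.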

10. Complete Code for PyCharm

# Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
hr_data = pd.read_csv("HR_Analytics.csv")
# Check the dimensions of the data
print(hr_data.shape)
# Show column types and non-null counts (info() prints by itself and returns None)
hr_data.info()
# Descriptive statistics for all columns
print(hr_data.describe(include='all'))
# Count missing values in each column
print(hr_data.isna().sum())
# Fill missing YearsWithCurrManager values with YearsInCurrentRole
hr_data['YearsWithCurrManager'] = hr_data['YearsWithCurrManager'].fillna(hr_data['YearsInCurrentRole'])

# Check the missing values again
print(hr_data.isna().sum())
# Count fully duplicated rows
print(hr_data.duplicated().sum())

# Convert the Attrition text values to numbers (Yes: 1, No: 0)
hr_data['Attrition'] = hr_data['Attrition'].map({'Yes': 1, 'No': 0})

# Correlation between Attrition and the other numeric columns
correlation_with_attrition = hr_data.corr(numeric_only=True)['Attrition'].sort_values()
print(correlation_with_attrition)

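# Set the plotting style
sns.set(style="whitegrid")
# Only needed when labels contain Chinese text; requires the SimHei font to be installed
plt.rcParams["font.family"] = "SimHei"
# Plot the attrition rate for each age group
plt.figure(figsize=(12, 6))
sns.barplot(x='AgeGroup', y='Attrition', data=hr_data)
plt.title('Attrition Rate by Age Group')
plt.ylabel('Attrition Rate')
plt.xlabel('Age Group')
plt.show()

# Marital status and attrition (countplot shows head counts, not rates)
plt.figure(figsize=(8, 5))
sns.countplot(x='MaritalStatus', hue='Attrition', data=hr_data)
plt.title('Marital Status and Attrition')
plt.xlabel('Marital Status')
plt.ylabel('Count')
plt.show()

# Plot the attrition rate against total working years
plt.figure(figsize=(12, 6))
sns.lineplot(x='TotalWorkingYears', y='Attrition', data=hr_data)
plt.title('Attrition Rate by Total Working Years')
plt.ylabel('Attrition Rate')
plt.xlabel('Total Working Years')
plt.show()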

# Pay difference analysis

# Compare monthly income by gender
plt.figure(figsize=(8, 5))
sns.boxplot(x='Gender', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Gender')
plt.xlabel('Gender')
plt.ylabel('Monthly Income')
plt.show()

# Compare monthly income by education level
plt.figure(figsize=(8, 5))
sns.boxplot(x='Education', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Monthly Income')
plt.show()

# Compare monthly income by field of study
plt.figure(figsize=(12, 6))
sns.boxplot(x='EducationField', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Field of Study')
plt.xlabel('Field of Study')
plt.ylabel('Monthly Income')
plt.xticks(rotation=45)
plt.show()


# Job level vs. monthly income
plt.figure(figsize=(10, 6))
sns.boxplot(x='JobLevel', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Job Level')
plt.xlabel('Job Level')
plt.ylabel('Monthly Income')
plt.show()

# Job role vs. monthly income
plt.figure(figsize=(14, 7))
sns.boxplot(x='JobRole', y='MonthlyIncome', data=hr_data)
plt.title('Monthly Income by Job Role')
plt.xlabel('Job Role')
plt.ylabel('Monthly Income')
plt.xticks(rotation=45)
plt.show()

# Job level vs. hourly rate
plt.figure(figsize=(10, 6))
sns.boxplot(x='JobLevel', y='HourlyRate', data=hr_data)
plt.title('Hourly Rate by Job Level')
plt.xlabel('Job Level')
plt.ylabel('Hourly Rate')
plt.show()

# Job role vs. hourly rate
plt.figure(figsize=(14, 7))
sns.boxplot(x='JobRole', y='HourlyRate', data=hr_data)
plt.title('Hourly Rate by Job Role')
plt.xlabel('Job Role')
plt.ylabel('Hourly Rate')
plt.xticks(rotation=45)
plt.show()

# Job level vs. daily rate
plt.figure(figsize=(10, 6))
sns.boxplot(x='JobLevel', y='DailyRate', data=hr_data)
plt.title('Daily Rate by Job Level')
plt.xlabel('Job Level')
plt.ylabel('Daily Rate')
plt.show()

# Job role vs. daily rate
plt.figure(figsize=(14, 7))
sns.boxplot(x='JobRole', y='DailyRate', data=hr_data)
plt.title('Daily Rate by Job Role')
plt.xlabel('Job Role')
plt.ylabel('Daily Rate')
plt.xticks(rotation=45)
plt.show()

# Draw a pair plot to visualize the relationships among these variables
sns.pairplot(hr_data, vars=["JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction", "PerformanceRating"])
plt.show()

# Compute the correlation coefficients among these variables
correlation_matrix = hr_data[["JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction", "PerformanceRating"]].corr()

# Print the correlation matrix
print(correlation_matrix)

# Plot again, this time as a heatmap
selected_columns = ["JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction", "PerformanceRating"]

# Compute the correlation matrix for these columns
correlation_matrix = hr_data[selected_columns].corr()

# Draw a heatmap to visualize the correlations
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

print(correlation_matrix)

# Scatter plot of work-life balance against job involvement
plt.figure(figsize=(10, 6))
sns.scatterplot(data=hr_data, x="WorkLifeBalance", y="JobInvolvement", hue="YearsAtCompany", palette="coolwarm", size="YearsAtCompany", sizes=(20, 200))
plt.title("WorkLifeBalance vs. JobInvolvement (Color by YearsAtCompany)")
plt.show()

# Box plot of years at the company for each work-life balance rating
plt.figure(figsize=(8, 6))
sns.boxplot(data=hr_data, x="WorkLifeBalance", y="YearsAtCompany")
plt.title("WorkLifeBalance vs. YearsAtCompany")
plt.show()


from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions

# Create a new Matplotlib figure with 3D axes
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Draw the 3D scatter plot once and keep the handle for the colorbar
scatter = ax.scatter(hr_data["WorkLifeBalance"], hr_data["JobInvolvement"], hr_data["YearsAtCompany"], c=hr_data["YearsAtCompany"], cmap='coolwarm', s=50)

# Set the axis labels
ax.set_xlabel("WorkLifeBalance")
ax.set_ylabel("JobInvolvement")
ax.set_zlabel("YearsAtCompany")

# Add a colorbar as the color legend
cbar = fig.colorbar(scatter, ax=ax)
cbar.set_label("YearsAtCompany")

plt.title("3D Scatter Plot (WorkLifeBalance, JobInvolvement, YearsAtCompany)")
plt.show()

# Select the columns of interest
selected_columns = ["YearsSinceLastPromotion", "JobSatisfaction", "JobLevel", "PerformanceRating"]
subset_data = hr_data[selected_columns]

# Compute the correlation matrix for these columns
correlation_matrix = subset_data.corr()

# Visualize the correlations as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap (Years Since Last Promotion, Job Satisfaction, Job Level, Performance Rating)")
plt.show()

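# Correlation matrix for tenure in role, job satisfaction, and time since last promotion
correlation_matrix_2 = hr_data[['YearsInCurrentRole', 'JobSatisfaction', 'YearsSinceLastPromotion']].corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(8, 5))
sns.heatmap(correlation_matrix_2, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix (Years in Current Role, Job Satisfaction, Years Since Last Promotion)")
plt.show()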

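# Pearson correlation coefficient between training times last year and performance rating
# (Series.corr returns a single number, not a matrix)
correlation = hr_data['TrainingTimesLastYear'].corr(hr_data['PerformanceRating'])

# Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='TrainingTimesLastYear', y='PerformanceRating', data=hr_data)
plt.title('Training Times Last Year vs. Performance Rating')
plt.xlabel('Training Times Last Year')
plt.ylabel('Performance Rating')
plt.grid(True)
plt.show()

print(correlation)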

# Pearson correlation coefficient between total working years and training times last year
correlation_work_train = hr_data['TotalWorkingYears'].corr(hr_data['TrainingTimesLastYear'])

# Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='TotalWorkingYears', y='TrainingTimesLastYear', data=hr_data)
plt.title('Total Working Years vs. Training Times Last Year')
plt.xlabel('Total Working Years')
plt.ylabel('Training Times Last Year')
plt.grid(True)
plt.show()
print(correlation_work_train)


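# Attrition rate by stock option level
# The group mean is only a rate if Attrition is numeric, so map 'Yes'/'No' labels to 1/0
# when the column has not already been encoded
attrition_flag = hr_data['Attrition']
if not pd.api.types.is_numeric_dtype(attrition_flag):
    attrition_flag = (attrition_flag == 'Yes').astype(int)
attrition_by_stock_option = attrition_flag.groupby(hr_data['StockOptionLevel']).mean()

# Plot
plt.figure(figsize=(10, 6))
attrition_by_stock_option.plot(kind='bar')
plt.title('Attrition Rate by Stock Option Level')
plt.xlabel('Stock Option Level')
plt.ylabel('Attrition Rate (lower is better)')
plt.xticks(rotation=0)
plt.grid(True)
plt.show()
print(attrition_by_stock_option)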
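
The group mean above is an attrition rate, and the retention rate is its complement. As a minimal sketch of that relationship, assuming the attrition column holds 'Yes'/'No' labels, the example below works through a tiny made-up table (the rows are hypothetical and not taken from the HR dataset).

```python
import pandas as pd

# Hypothetical toy rows for illustration only (not from the HR dataset)
toy = pd.DataFrame({
    "StockOptionLevel": [0, 0, 0, 0, 1, 1, 1, 1],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Encode the label as 1 (left) / 0 (stayed) so that the mean is a proportion
attrition_flag = (toy["Attrition"] == "Yes").astype(int)

# Share of leavers within each stock option level
attrition_rate = attrition_flag.groupby(toy["StockOptionLevel"]).mean()

# Retention is the complement of attrition
retention_rate = 1 - attrition_rate

print(attrition_rate)
print(retention_rate)
```

In this toy table, level 0 has two leavers out of four rows (attrition 0.5) and level 1 has one out of four (attrition 0.25, retention 0.75).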

# Pearson correlation of salary hike percentage with job satisfaction and with performance rating
correlation_salary_hike_satisfaction = hr_data['PercentSalaryHike'].corr(hr_data['JobSatisfaction'])
correlation_salary_hike_performance = hr_data['PercentSalaryHike'].corr(hr_data['PerformanceRating'])

# Salary hike percentage vs. job satisfaction
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='PercentSalaryHike', y='JobSatisfaction', data=hr_data)
plt.title('Salary Hike Percentage vs. Job Satisfaction')
plt.xlabel('Salary Hike Percentage')
plt.ylabel('Job Satisfaction')
plt.grid(True)

# Salary hike percentage vs. performance rating
plt.subplot(1, 2, 2)
sns.scatterplot(x='PercentSalaryHike', y='PerformanceRating', data=hr_data)
plt.title('Salary Hike Percentage vs. Performance Rating')
plt.xlabel('Salary Hike Percentage')
plt.ylabel('Performance Rating')
plt.grid(True)

plt.tight_layout()
plt.show()

print(correlation_salary_hike_satisfaction, correlation_salary_hike_performance)

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Candidate features and target for the attrition model
features = ['MonthlyIncome', 'JobSatisfaction', 'Department', 'WorkLifeBalance']
target = 'Attrition'

# Check the selected features for missing values
missing_values = hr_data[features].isnull().sum()

# Data preparation: encode the categorical Department column as integers.
# Work on a copy so the original department labels in hr_data stay intact.
X = hr_data[features].copy()
le = LabelEncoder()
X['Department'] = le.fit_transform(X['Department'])
y = hr_data[target]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(missing_values)
print(accuracy)
print(report)
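
Accuracy alone can be hard to interpret when one class dominates, which is common for attrition labels. One simple reference point is the accuracy of always predicting the most frequent class. The sketch below defines a hypothetical helper, `majority_baseline_accuracy`, and applies it to a made-up label list; the 84/16 split is an assumption for illustration, not a figure from the HR dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical class split for illustration only
y_example = ["No"] * 84 + ["Yes"] * 16

baseline = majority_baseline_accuracy(y_example)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

Calling the helper on `list(y_test)` would give the figure to compare the model's accuracy against; the per-class recall in the classification report is the more informative number when the two are close.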

# Define the target variable for the promotion model.
# Assumption: YearsSinceLastPromotion == 0 means the employee was promoted within the last year.
hr_data['RecentlyPromoted'] = (hr_data['YearsSinceLastPromotion'] == 0).astype(int)

# Candidate features that may influence promotion.
# For simplicity, only a few common numeric features are used, so no label encoding is needed.
promotion_features = ['TotalWorkingYears', 'JobSatisfaction', 'PerformanceRating', 'TrainingTimesLastYear', 'YearsAtCompany']
promotion_target = 'RecentlyPromoted'

# Select the feature and target data for the promotion model
X_promotion = hr_data[promotion_features]
y_promotion = hr_data[promotion_target]

# Split the data into training and test sets for the promotion model
X_train_promo, X_test_promo, y_train_promo, y_test_promo = train_test_split(X_promotion, y_promotion, test_size=0.2, random_state=42)

# Fit a random forest classifier for the promotion model
rf_classifier_promo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier_promo.fit(X_train_promo, y_train_promo)

# Predict on the test set and evaluate the promotion model
y_pred_promo = rf_classifier_promo.predict(X_test_promo)
accuracy_promo = accuracy_score(y_test_promo, y_pred_promo)
report_promo = classification_report(y_test_promo, y_pred_promo)

print(accuracy_promo)
print(report_promo)