Titanic Survivor Prediction with Python
References:
A Kaggle notebook on Titanic survivor prediction.
Problem Statement
Predict Titanic survivors with Python, walking through feature preprocessing, model selection, and the overall workflow.
Code and Walkthrough
Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.width', 1000)
# Ignore all DeprecationWarnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from scipy.stats import chi2_contingency
train_df = pd.read_csv(r'D:\下载文件\Titanic data\Titanic data\train.csv')
test_df = pd.read_csv(r'D:\下载文件\Titanic data\Titanic data\test.csv')
Data Preprocessing
First, get a basic feel for the data:
print('Train Dataset Shape',train_df.shape)
print('------------------------------------')
print('Test Dataset Shape',test_df.shape)
Result:
train.csv has 891 rows and 12 columns; test.csv has 418 rows and 11 columns (it lacks the Survived column).
Next, concatenate the two tables and add a column recording which file each row came from.
train_df['dataset_type']='train'
test_df['dataset_type']='test'
df=pd.concat([train_df,test_df])
df.head()
print('Combine Dataset Shape',df.shape)
df['dataset_type'].value_counts()
Output:
Check for duplicate rows:
df.duplicated().sum()
The count is 0, so there are no duplicates.
Next, check for missing values:
df.info()
plt.figure(figsize=(10, 5))
records = df.isnull().mean() * 100  # percentage of missing values per column
ax = records.plot(kind='bar')
plt.xlabel('Features')
plt.ylabel('Percentage Of Missing Values')
plt.title('Missing Value Plot')
for i, value in enumerate(records):
    plt.text(i, value + 0.5, f'{value:.2f}%', ha='center', va='bottom')
plt.show()
Result:
The missing-value analysis shows that the Cabin feature is missing for roughly 77% of rows. With that much missing data, imputation is impractical, and cabin location has relatively little bearing on whether a passenger was rescued, so we drop the Cabin column to preserve data quality.
Likewise, Ticket has the largest number of distinct values and the least predictive influence on survival, so we drop the Ticket column as well.
df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)
df.head()
df.info()
df.describe(include='all')
Result:
After the drop, the combined table goes from 13 columns to 11.
Age, Fare, and Embarked still contain missing values, which we fill based on an analysis of the data.
Start with Age and Fare:
plt.figure(figsize=(10,5))
plt.subplot(2,2,1)
df['Age'].plot(kind='box')
plt.subplot(2,2,2)
df['Age'].plot(kind='hist')
plt.subplot(2,2,3)
df['Fare'].plot(kind='box')
plt.subplot(2,2,4)
df['Fare'].plot(kind='kde')
plt.tight_layout()
plt.show()
The plots are shown below.
Box plots:
Together with the histogram and density plot, the box plots show that Fare is strongly right-skewed, so to counter the skew we convert both columns into binned categories.
df[['Age','Fare']].skew()
df['Embarked'].value_counts()
print('Median fare above 100 (used below to fill the missing Fare):', df[df['Fare'] > 100]['Fare'].median())
print('Median age (used below to fill missing ages):', df['Age'].median())
Result:
The most common Embarked (port of embarkation) value is S.
Now fill the missing and null values:
df['Age'] = df['Age'].fillna(df['Age'].median())
# The single missing Fare is filled with the median of fares above 100, per the analysis above
df['Fare'] = df['Fare'].fillna(df[df['Fare'] > 100]['Fare'].median())
df['Embarked'] = df['Embarked'].fillna('S')  # S is by far the most common port
# Bin edges roughly tracking the Fare quartiles; the top edge must cover the
# maximum fare (about 512.33 in this dataset), otherwise pd.cut leaves those rows as NaN
fare_bins = [0, 7.89, 14.45, 31.07, 100, df['Fare'].max()]
# Define labels for each bin
fare_labels = ['Very Low Fare', 'Low Fare', 'Moderate Fare', 'High Fare', 'Very High Fare']
# Create a new column 'Fare_bin' by binning the 'Fare' column
df['Fare_bin'] = pd.cut(df['Fare'], bins=fare_bins, labels=fare_labels, include_lowest=True)
age_bins = [0, 12, 21, 28, 39, 80]
# Define professional labels for each bin
age_labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
# Create a new column 'Age_bin' by binning the 'Age' column
df['Age_bin'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, include_lowest=True)
df.head()
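The fare edges above are hand-picked and roughly track the quartiles of the Fare distribution. As a sanity check, one could derive data-driven edges with pd.qcut instead; a minimal sketch (the 5-quantile split and the Fare_qbin column name are illustrative assumptions, not part of the original pipeline):
# Hypothetical alternative: derive 5 fare bins from quantiles and compare
# the resulting edges with the hand-picked fare_bins above
df['Fare_qbin'], qedges = pd.qcut(df['Fare'], q=5, labels=fare_labels, retbins=True)
print(qedges)  # quantile-based edges, for comparison with fare_bins
df.drop(columns=['Fare_qbin'], inplace=True)  # drop again so the pipeline below is unchanged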
Then drop the original Age and Fare columns:
df.drop(['Fare',"Age"],axis=1,inplace=True)
df.head()
Now apply the same classify-and-consolidate approach to the remaining columns, starting with Name:
df['Name']
def get_unique_titles(names_list):
    # Names look like "Braund, Mr. Owen Harris": the title sits
    # between the comma and the first period
    parts = names_list.split(',')
    if len(parts) > 1:
        title = parts[1].split('.')[0].strip()
        return title
    else:
        return 'Unknown'
df['Title'] = df['Name'].apply(get_unique_titles)
# Titles beyond Mr, Miss, Mrs, and Master each have fewer than 10 records, so we fold them into 'Others'
df['Title'].value_counts()
titles_to_keep = ['Mr', 'Miss', 'Mrs', 'Master']
df['Title'] = df['Title'].apply(lambda x: x if x in titles_to_keep else 'Others')
df['Title'].value_counts()
df.drop(['Name'],axis=1,inplace=True)
df.head()
df['FamilySize']=df['SibSp']+df['Parch']
df.head()
df['FamilySize'].plot(kind='hist')
bins = [-1, 0, 2, 4, 6, 10] # -1 is included to capture 0
labels = ['Single', 'Small Family', 'Medium Family', 'Large Family', 'Very Large Family']
# Create the FamilySize bin feature
df['FamilySizeCategory'] = pd.cut(df['FamilySize'], bins=bins, labels=labels)
df['FamilySizeCategory'].value_counts()
df.drop(['FamilySize','SibSp','Parch'],axis=1,inplace=True)
df.head()
df.info()
Chi-Squared Tests and Contingency Tables
def chi_squared_test(df, categorical_variable):
    # Build a contingency table of the variable against Survived
    # (pd.crosstab drops the test rows, whose Survived is NaN)
    contingency_table = pd.crosstab(df[categorical_variable], df['Survived'])
    # Perform the Chi-Squared test of independence
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return chi2, p, contingency_table
# List of categorical variables to test
categorical_variables = ['Pclass', 'Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
# Perform Chi-Squared test for each variable and store results
results = {}
for variable in categorical_variables:
    chi2_stat, p_value, contingency = chi_squared_test(df, variable)
    results[variable] = {
        'Chi-Squared Statistic': chi2_stat,
        'p-value': p_value,
        'Contingency Table': contingency
    }
# Display results
for variable, result in results.items():
    print(f"Variable: {variable}")
    print(f"Chi-Squared Statistic: {result['Chi-Squared Statistic']:.4f}")
    print(f"p-value: {result['p-value']:.4f}")
    print("Contingency Table:")
    print(result['Contingency Table'])
    print("\n" + "-" * 40 + "\n")
Result:
Interpreting the results
Chi-squared statistic: the larger the value, the stronger the association between the categorical variable and survival.
p-value: conventionally, p < 0.05 indicates a statistically significant association between the variable and survival.
Following this procedure, you can analyze how each categorical factor relates to survival on the Titanic.
The features most significantly associated with survival are:
Sex: women had the highest survival rate.
Title: titles reflect both sex and social status and are strongly tied to survival chances.
Passenger class (Pclass): first-class passengers survived at far higher rates than those in lower classes.
Fare bin (Fare_bin): passengers who paid higher fares survived at higher rates, consistent with the Pclass association.
Port of embarkation (Embarked): the embarkation port also plays a role, with higher survival rates for some ports.
Family size category (FamilySizeCategory): significant, but a weaker effect than the features above.
Age bin (Age_bin): significant, but the weakest association with survival among these features; children had the highest survival rate.
In summary, sex and passenger class are the most important factors for survival on the Titanic, followed by title.
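This ranking can also be pulled programmatically out of the results dictionary built earlier; a minimal sketch:
# Summarize the chi-squared results: sort by statistic and flag p < 0.05
summary = pd.DataFrame(
    {var: {'chi2': res['Chi-Squared Statistic'], 'p_value': res['p-value']}
     for var, res in results.items()}
).T.sort_values('chi2', ascending=False)
summary['significant'] = summary['p_value'] < 0.05
print(summary)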
Data Visualization
Univariate Analysis
categorical_cols = ['Survived', 'Pclass', 'Sex', 'Embarked',
'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
# Set up the subplot grid
n_cols = 2
n_rows = (len(categorical_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 4))
axes = axes.flatten() # Flatten to easily index
# Create bar plots for each categorical column using sns.barplot
for idx, col in enumerate(categorical_cols):
    # Calculate counts
    counts = df[col].value_counts().reset_index()
    counts.columns = [col, 'Count']  # Rename columns for clarity
    # Create the bar plot
    sns.barplot(data=counts, x=col, y='Count', ax=axes[idx], palette='coolwarm')
    axes[idx].set_title(f'Count of {col}')
    axes[idx].set_ylabel('Count')
    axes[idx].set_xlabel(col)
    # Annotate the bars with their values
    for p in axes[idx].patches:
        axes[idx].annotate(f'{int(p.get_height())}',
                           (p.get_x() + p.get_width() / 2., p.get_height()),
                           ha='center', va='bottom',
                           fontsize=10, color='black',
                           xytext=(0, 5),  # 5 points vertical offset
                           textcoords='offset points')
# Remove any empty subplots
for j in range(idx + 1, len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
Bivariate Analysis
Now we analyze the features against survival. Only the 891 training records carry a Survived value; the univariate analysis above counted the combined train and test data, but for the bivariate analysis we rely on the training data alone to identify patterns.
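Note that the crosstabs below are still built from the combined df; this works because pd.crosstab drops rows containing NaN by default, so the test rows (whose Survived is NaN) are excluded automatically. To make the train-only restriction explicit, one could filter first; a minimal sketch:
# Explicit train-only view; equivalent to relying on pd.crosstab
# silently dropping the NaN Survived values of the test rows
train_only = df[df['dataset_type'] == 'train']
print(pd.crosstab(index=train_only['Pclass'],
                  columns=[train_only['Sex'], train_only['Survived']]))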
#Survival Analysis by Demographics:
#Investigate survival rates across different age groups and genders.
Demographics_crosstab=pd.crosstab(index=df['Pclass'],columns=[df['Sex'],df['Age_bin'],df['Survived']])
plt.figure(figsize=(14, 6))
sns.heatmap(Demographics_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Pclass, Sex, and Age')
plt.ylabel('Pclass')
plt.xlabel('Sex, Age bin, and Survival Status')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The chart makes it clear that male passengers in third class (Pclass 3) in the teenager age band had the lowest survival rate.
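This claim can be checked directly with a groupby over the training rows; a minimal sketch:
# Survival rate and group size per (Pclass, Sex, Age_bin), lowest rates first
train_only = df[df['dataset_type'] == 'train']
rates = (train_only
         .groupby(['Pclass', 'Sex', 'Age_bin'], observed=True)['Survived']
         .agg(['mean', 'count'])
         .sort_values('mean'))
print(rates.head())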
#Socioeconomic Impact:
#Analyze how class and fare influence survival rates.
Socioeconomic_crosstab=pd.crosstab(index=df['Pclass'],columns=[df['Fare_bin'],df['Survived']])
plt.figure(figsize=(14, 6))
sns.heatmap(Socioeconomic_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Pclass and Fare')
plt.ylabel('Pclass')
plt.xlabel('Fare bin and Survival Status')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The chart shows that among the third-class (Pclass 3) passengers who did not survive, 264 fall in the two cheapest fare bands: 130 in the 'Very Low Fare' category and 134 in the 'Low Fare' category.
#Family and Social Structure Impact:
#Examine whether family size affected survival chances.
family_crosstab=pd.crosstab(index=df['FamilySizeCategory'],columns=[df['Survived']])
plt.figure(figsize=(5, 3))
sns.heatmap(family_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Family Size')
plt.ylabel('Family Size')
plt.xlabel('Survival Status')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
title_crosstab = pd.crosstab(index=df['Title'], columns=[df['Survived']])
# Travel Origin Analysis
embarked_crosstab = pd.crosstab(index=df['Embarked'], columns=[df['Survived']])
# Set up the figure with 1 row and 2 columns
plt.figure(figsize=(12, 5))
# Subplot for Title Analysis
plt.subplot(1, 2, 1)
sns.heatmap(title_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Title')
plt.ylabel('Title')
plt.xlabel('Survival Status')
plt.xticks(rotation=45)
# Subplot for Travel Origin Analysis
plt.subplot(1, 2, 2)
sns.heatmap(embarked_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Embarkation Point')
plt.ylabel('Embarkation Point')
plt.xlabel('Survival Status')
plt.xticks(rotation=45)
# Adjust layout
plt.tight_layout()
plt.show()
The charts show that among titles, 'Mr' accounts for the most non-survivors, 436 in total, while among embarkation points 'S' has the most non-survivors, with 427 passengers lost.
combined_crosstab = pd.crosstab(index=df['Survived'], columns=[df['Sex'], df['Embarked'], df['Pclass'], df['Fare_bin']])
plt.figure(figsize=(30, 5))
sns.heatmap(combined_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Sex, Embarked, Pclass, and Fare_bin')
plt.ylabel('Survived')
plt.xlabel('Sex, Embarked, Pclass, Fare_bin')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Model Building and Evaluation
df.head()
df.columns
columns = ['Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for column in columns:
    df[column] = label_encoder.fit_transform(df[column])
df.head()
clean_train_df = df[df['dataset_type'] == 'train']
clean_test_df = df[df['dataset_type'] == 'test']
print('Clean Train Dataset Shape', clean_train_df.shape)
print('------------------------------------')
print('Clean Test Dataset Shape', clean_test_df.shape)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
x=clean_train_df[['Pclass','Sex','Embarked','Fare_bin','Age_bin','Title','FamilySizeCategory']]
y = clean_train_df['Survived'].astype(int)  # XGBoost requires integer class labels (concat turned them into 0.0/1.0 floats)
from sklearn.model_selection import train_test_split,GridSearchCV
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
def train_model(model, model_name):
    print(f'Model: {model_name}')
    # Fit the model on the training data
    model.fit(x_train, y_train)
    # Predictions on the training data
    y_train_pred = model.predict(x_train)
    # Predictions on the testing data
    y_test_pred = model.predict(x_test)
    # Calculate accuracy scores
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    print(f'Training Accuracy Score: {train_accuracy:.2f}')
    print(f'Testing Accuracy Score: {test_accuracy:.2f}')
    # Generate classification report for testing data
    report = classification_report(y_test, y_test_pred)
    print('Classification Report:')
    print(report)
    return model
model_list = dict(
    knn=KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2),
    svc=SVC(kernel='linear', random_state=0),
    logistic=LogisticRegression(),
    naive=GaussianNB(),
    tree=DecisionTreeClassifier(criterion='entropy', random_state=0),
    forest=RandomForestClassifier(n_estimators=50, criterion="entropy"),
    xgboost=XGBClassifier(),
    gradientboost=GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=1)
)
for key, value in model_list.items():
    print('*' * 30)
    train_model(value, key)
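To compare the eight models at a glance, the held-out accuracies can be collected into one sorted table; a minimal sketch (it refits each model, so results match the loop above only up to the randomness of the unseeded random forest):
# Tabulate test-set accuracy per model, best first
scores = {name: accuracy_score(y_test, m.fit(x_train, y_train).predict(x_test))
          for name, m in model_list.items()}
print(pd.Series(scores).sort_values(ascending=False))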
Hyperparameter tuning with GridSearchCV to optimize the model
# Define the model
model = XGBClassifier(random_state=42)
# Set up the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 9, 11],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 150, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
scoring='accuracy', cv=5, verbose=0, n_jobs=-1)
# Fit the model
grid_search.fit(x_train, y_train)
# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
# Print the results
print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)
# Optionally, evaluate on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(x_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", test_accuracy)
Prediction
We use the XGBoost model for the final predictions, since it reaches roughly 83% accuracy.
model = XGBClassifier(
    colsample_bytree=1.0,
    learning_rate=0.01,
    max_depth=7,
    n_estimators=100,
    subsample=1.0,
    use_label_encoder=False,  # Optional; this flag was removed in recent XGBoost versions
    eval_metric='logloss'     # 'logloss' for binary classification ('mlogloss' is the multiclass variant)
)
train_model(model,'XGBOOST')
Finally, write out the prediction file:
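The original post stops before showing the export code. A minimal sketch of what it might look like, refitting the tuned model on all training rows and writing a Kaggle-style submission (the submission.csv filename is an assumption):
# Hypothetical export step: predict for the cleaned test rows and write
# a submission file in the competition's PassengerId/Survived format
features = ['Pclass', 'Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
model.fit(x, y)  # refit on the full training data
submission = pd.DataFrame({
    'PassengerId': clean_test_df['PassengerId'].astype(int),
    'Survived': model.predict(clean_test_df[features]).astype(int)
})
submission.to_csv('submission.csv', index=False)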
Author: Z_with_z