Titanic Survivor Prediction with Python
References:
A Kaggle notebook on Titanic survivor prediction.
Problem Statement
Predict Titanic survivors with Python, walking through feature preprocessing, model selection, and the overall workflow.
Code and Walkthrough
Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.width', 1000)
# Ignore all DeprecationWarnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from scipy.stats import chi2_contingency
train_df = pd.read_csv(r'D:\下载文件\Titanic data\Titanic data\train.csv')
test_df = pd.read_csv(r'D:\下载文件\Titanic data\Titanic data\test.csv')
Data Preprocessing
First, get a basic feel for the data:
print('Train Dataset Shape',train_df.shape)
print('------------------------------------')
print('Test Dataset Shape',test_df.shape)
Result:
train.csv has 891 rows and 12 columns; test.csv has 418 rows and 11 columns (it lacks the Survived column).
Next, concatenate the two tables and add a column recording which file each row came from.
train_df['dataset_type']='train'
test_df['dataset_type']='test'
df=pd.concat([train_df,test_df])
df.head()
print('Combine Dataset Shape',df.shape)
df['dataset_type'].value_counts()
Output:
Check for duplicate rows:
df.duplicated().sum()
The count is 0, so there are no duplicates.
Next, check for missing values:
df.info()
plt.figure(figsize=(10, 5))
records = df.isnull().mean() * 100  # percentage of missing values per column
ax = records.plot(kind='bar')
plt.xlabel('Features')
plt.ylabel('Percentage Of Missing Values')
plt.title('Missing Value Plot')
for i, value in enumerate(records):
    plt.text(i, value + 0.5, f'{value:.2f}%', ha='center', va='bottom')
plt.show()
Result:
The missing-value analysis shows that the Cabin feature is missing for roughly 77% of rows. With that much missing data, imputation is impractical, and cabin location has relatively little bearing on whether a passenger was rescued, so we drop the Cabin column to preserve data quality.
Likewise, Ticket has the largest number of distinct values and the least predictive influence on survival, so we drop the Ticket column as well.
df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)
df.head()
df.info()
df.describe(include='all')
Result:
After the drop, the combined table goes from 13 columns to 11.
Age, Fare, and Embarked still contain missing values, which we fill based on an analysis of the data.
Start with Age and Fare:
plt.figure(figsize=(10,5))
plt.subplot(2,2,1)
df['Age'].plot(kind='box')
plt.subplot(2,2,2)
df['Age'].plot(kind='hist')
plt.subplot(2,2,3)
df['Fare'].plot(kind='box')
plt.subplot(2,2,4)
df['Fare'].plot(kind='kde')
plt.tight_layout()
plt.show()
The plots are shown below.
Box plots:
Together with the histogram and density plot, the box plots show that Fare is strongly right-skewed, so to counter the skew we convert both columns into binned categories.
df[['Age','Fare']].skew()
df['Embarked'].value_counts()
print('Median fare above 100 (used below to fill the missing Fare):', df[df['Fare'] > 100]['Fare'].median())
print('Median age (used below to fill missing ages):', df['Age'].median())
Result:
The most common Embarked (port of embarkation) value is S.
Now fill the missing and null values:
df['Age'] = df['Age'].fillna(df['Age'].median())
# The single missing Fare is filled with the median of fares above 100, per the analysis above
df['Fare'] = df['Fare'].fillna(df[df['Fare'] > 100]['Fare'].median())
df['Embarked'] = df['Embarked'].fillna('S')  # S is by far the most common port
# Bin edges roughly tracking the Fare quartiles; the top edge must cover the
# maximum fare (about 512.33 in this dataset), otherwise pd.cut leaves those rows as NaN
fare_bins = [0, 7.89, 14.45, 31.07, 100, df['Fare'].max()]
# Define labels for each bin
fare_labels = ['Very Low Fare', 'Low Fare', 'Moderate Fare', 'High Fare', 'Very High Fare']
# Create a new column 'Fare_bin' by binning the 'Fare' column
df['Fare_bin'] = pd.cut(df['Fare'], bins=fare_bins, labels=fare_labels, include_lowest=True)
age_bins = [0, 12, 21, 28, 39, 80]
# Define professional labels for each bin
age_labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
# Create a new column 'Age_bin' by binning the 'Age' column
df['Age_bin'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, include_lowest=True)
df.head()
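The fare edges above are hand-picked and roughly track the quartiles of the Fare distribution. As a sanity check, one could derive data-driven edges with pd.qcut instead; a minimal sketch (the 5-quantile split and the Fare_qbin column name are illustrative assumptions, not part of the original pipeline):
# Hypothetical alternative: derive 5 fare bins from quantiles and compare
# the resulting edges with the hand-picked fare_bins above
df['Fare_qbin'], qedges = pd.qcut(df['Fare'], q=5, labels=fare_labels, retbins=True)
print(qedges)  # quantile-based edges, for comparison with fare_bins
df.drop(columns=['Fare_qbin'], inplace=True)  # drop again so the pipeline below is unchanged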
Then drop the original Age and Fare columns:
df.drop(['Fare',"Age"],axis=1,inplace=True)
df.head()
Now apply the same classify-and-consolidate approach to the remaining columns, starting with Name:
df['Name']
def get_unique_titles(names_list):
    # Names look like "Braund, Mr. Owen Harris": the title sits
    # between the comma and the first period
    parts = names_list.split(',')
    if len(parts) > 1:
        title = parts[1].split('.')[0].strip()
        return title
    else:
        return 'Unknown'
df['Title'] = df['Name'].apply(get_unique_titles)
# Titles beyond Mr, Miss, Mrs, and Master each have fewer than 10 records, so we fold them into 'Others'
df['Title'].value_counts()
titles_to_keep = ['Mr', 'Miss', 'Mrs', 'Master']
df['Title'] = df['Title'].apply(lambda x: x if x in titles_to_keep else 'Others')
df['Title'].value_counts()
df.drop(['Name'],axis=1,inplace=True)
df.head()
df['FamilySize']=df['SibSp']+df['Parch']
df.head()
df['FamilySize'].plot(kind='hist')
bins = [-1, 0, 2, 4, 6, 10] # -1 is included to capture 0
labels = ['Single', 'Small Family', 'Medium Family', 'Large Family', 'Very Large Family']
# Create the FamilySize bin feature
df['FamilySizeCategory'] = pd.cut(df['FamilySize'], bins=bins, labels=labels)
df['FamilySizeCategory'].value_counts()
df.drop(['FamilySize','SibSp','Parch'],axis=1,inplace=True)
df.head()
df.info()
Chi-Squared Tests and Contingency Tables
def chi_squared_test(df, categorical_variable):
    # Build a contingency table of the variable against Survived
    # (pd.crosstab drops the test rows, whose Survived is NaN)
    contingency_table = pd.crosstab(df[categorical_variable], df['Survived'])
    # Perform the Chi-Squared test of independence
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return chi2, p, contingency_table
# List of categorical variables to test
categorical_variables = ['Pclass', 'Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
# Perform Chi-Squared test for each variable and store results
results = {}
for variable in categorical_variables:
    chi2_stat, p_value, contingency = chi_squared_test(df, variable)
    results[variable] = {
        'Chi-Squared Statistic': chi2_stat,
        'p-value': p_value,
        'Contingency Table': contingency
    }
# Display results
for variable, result in results.items():
    print(f"Variable: {variable}")
    print(f"Chi-Squared Statistic: {result['Chi-Squared Statistic']:.4f}")
    print(f"p-value: {result['p-value']:.4f}")
    print("Contingency Table:")
    print(result['Contingency Table'])
    print("\n" + "-" * 40 + "\n")
Result:
Interpreting the results
Chi-squared statistic: the larger the value, the stronger the association between the categorical variable and survival.
p-value: conventionally, p < 0.05 indicates a statistically significant association between the variable and survival.
Following this procedure, you can analyze how each categorical factor relates to survival on the Titanic.
The features most significantly associated with survival are:
Sex: women had the highest survival rate.
Title: titles reflect both sex and social status and are strongly tied to survival chances.
Passenger class (Pclass): first-class passengers survived at far higher rates than those in lower classes.
Fare bin (Fare_bin): passengers who paid higher fares survived at higher rates, consistent with the Pclass association.
Port of embarkation (Embarked): the embarkation port also plays a role, with higher survival rates for some ports.
Family size category (FamilySizeCategory): significant, but a weaker effect than the features above.
Age bin (Age_bin): significant, but the weakest association with survival among these features; children had the highest survival rate.
In summary, sex and passenger class are the most important factors for survival on the Titanic, followed by title.
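This ranking can also be pulled programmatically out of the results dictionary built earlier; a minimal sketch:
# Summarize the chi-squared results: sort by statistic and flag p < 0.05
summary = pd.DataFrame(
    {var: {'chi2': res['Chi-Squared Statistic'], 'p_value': res['p-value']}
     for var, res in results.items()}
).T.sort_values('chi2', ascending=False)
summary['significant'] = summary['p_value'] < 0.05
print(summary)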
Data Visualization
Univariate Analysis
categorical_cols = ['Survived', 'Pclass', 'Sex', 'Embarked',
'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
# Set up the subplot grid
n_cols = 2
n_rows = (len(categorical_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 4))
axes = axes.flatten() # Flatten to easily index
# Create bar plots for each categorical column using sns.barplot
for idx, col in enumerate(categorical_cols):
    # Calculate counts
    counts = df[col].value_counts().reset_index()
    counts.columns = [col, 'Count']  # Rename columns for clarity
    # Create the bar plot
    sns.barplot(data=counts, x=col, y='Count', ax=axes[idx], palette='coolwarm')
    axes[idx].set_title(f'Count of {col}')
    axes[idx].set_ylabel('Count')
    axes[idx].set_xlabel(col)
    # Annotate the bars with their values
    for p in axes[idx].patches:
        axes[idx].annotate(f'{int(p.get_height())}',
                           (p.get_x() + p.get_width() / 2., p.get_height()),
                           ha='center', va='bottom',
                           fontsize=10, color='black',
                           xytext=(0, 5),  # 5 points vertical offset
                           textcoords='offset points')
# Remove any empty subplots
for j in range(idx + 1, len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
Bivariate Analysis
Now we analyze the features against survival. Only the 891 training records carry a Survived value; the univariate analysis above counted the combined train and test data, but for the bivariate analysis we rely on the training data alone to identify patterns.
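Note that the crosstabs below are still built from the combined df; this works because pd.crosstab drops rows containing NaN by default, so the test rows (whose Survived is NaN) are excluded automatically. To make the train-only restriction explicit, one could filter first; a minimal sketch:
# Explicit train-only view; equivalent to relying on pd.crosstab
# silently dropping the NaN Survived values of the test rows
train_only = df[df['dataset_type'] == 'train']
print(pd.crosstab(index=train_only['Pclass'],
                  columns=[train_only['Sex'], train_only['Survived']]))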
#Survival Analysis by Demographics:
#Investigate survival rates across different age groups and genders.
Demographics_crosstab=pd.crosstab(index=df['Pclass'],columns=[df['Sex'],df['Age_bin'],df['Survived']])
plt.figure(figsize=(14, 6))
sns.heatmap(Demographics_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Pclass, Sex, and Age')
plt.ylabel('Pclass')
plt.xlabel('Sex, Age bin, and Survival Status')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The chart makes it clear that male passengers in third class (Pclass 3) in the teenager age band had the lowest survival rate.
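This claim can be checked directly with a groupby over the training rows; a minimal sketch:
# Survival rate and group size per (Pclass, Sex, Age_bin), lowest rates first
train_only = df[df['dataset_type'] == 'train']
rates = (train_only
         .groupby(['Pclass', 'Sex', 'Age_bin'], observed=True)['Survived']
         .agg(['mean', 'count'])
         .sort_values('mean'))
print(rates.head())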
#Socioeconomic Impact:
#Analyze how class and fare influence survival rates.
Socioeconomic_crosstab=pd.crosstab(index=df['Pclass'],columns=[df['Fare_bin'],df['Survived']])
plt.figure(figsize=(14, 6))
sns.heatmap(Socioeconomic_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Pclass and Fare')
plt.ylabel('Pclass')
plt.xlabel('Fare bin and Survival Status')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The chart shows that among the third-class (Pclass 3) passengers who did not survive, 264 fall in the two cheapest fare bands: 130 in the 'Very Low Fare' category and 134 in the 'Low Fare' category.
#Family and Social Structure Impact:
#Examine whether family size affected survival chances.
family_crosstab=pd.crosstab(index=df['FamilySizeCategory'],columns=[df['Survived']])
plt.figure(figsize=(5, 3))
sns.heatmap(family_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Family Size')
plt.ylabel('Family Size')
plt.xlabel('Survival Status')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
title_crosstab = pd.crosstab(index=df['Title'], columns=[df['Survived']])
# Travel Origin Analysis
embarked_crosstab = pd.crosstab(index=df['Embarked'], columns=[df['Survived']])
# Set up the figure with 1 row and 2 columns
plt.figure(figsize=(12, 5))
# Subplot for Title Analysis
plt.subplot(1, 2, 1)
sns.heatmap(title_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Title')
plt.ylabel('Title')
plt.xlabel('Survival Status')
plt.xticks(rotation=45)
# Subplot for Travel Origin Analysis
plt.subplot(1, 2, 2)
sns.heatmap(embarked_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Embarkation Point')
plt.ylabel('Embarkation Point')
plt.xlabel('Survival Status')
plt.xticks(rotation=45)
# Adjust layout
plt.tight_layout()
plt.show()
The charts show that among titles, 'Mr' accounts for the most non-survivors, 436 in total, while among embarkation points 'S' has the most non-survivors, with 427 passengers lost.
combined_crosstab = pd.crosstab(index=df['Survived'], columns=[df['Sex'], df['Embarked'], df['Pclass'], df['Fare_bin']])
plt.figure(figsize=(30, 5))
sns.heatmap(combined_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
plt.title('Survival Count by Sex, Embarked, Pclass, and Fare_bin')
plt.ylabel('Survived')
plt.xlabel('Sex, Embarked, Pclass, Fare_bin')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Model Building and Evaluation
df.head()
df.columns
columns = ['Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for column in columns:
    df[column] = label_encoder.fit_transform(df[column])
df.head()
clean_train_df = df[df['dataset_type'] == 'train']
clean_test_df = df[df['dataset_type'] == 'test']
print('Clean Train Dataset Shape', clean_train_df.shape)
print('------------------------------------')
print('Clean Test Dataset Shape', clean_test_df.shape)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
x=clean_train_df[['Pclass','Sex','Embarked','Fare_bin','Age_bin','Title','FamilySizeCategory']]
y = clean_train_df['Survived'].astype(int)  # XGBoost requires integer class labels (concat turned them into 0.0/1.0 floats)
from sklearn.model_selection import train_test_split,GridSearchCV
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
def train_model(model, model_name):
    print(f'Model: {model_name}')
    # Fit the model on the training data
    model.fit(x_train, y_train)
    # Predictions on the training data
    y_train_pred = model.predict(x_train)
    # Predictions on the testing data
    y_test_pred = model.predict(x_test)
    # Calculate accuracy scores
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    print(f'Training Accuracy Score: {train_accuracy:.2f}')
    print(f'Testing Accuracy Score: {test_accuracy:.2f}')
    # Generate classification report for testing data
    report = classification_report(y_test, y_test_pred)
    print('Classification Report:')
    print(report)
    return model
model_list = dict(
    knn=KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2),
    svc=SVC(kernel='linear', random_state=0),
    logistic=LogisticRegression(),
    naive=GaussianNB(),
    tree=DecisionTreeClassifier(criterion='entropy', random_state=0),
    forest=RandomForestClassifier(n_estimators=50, criterion="entropy"),
    xgboost=XGBClassifier(),
    gradientboost=GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=1)
)
for key, value in model_list.items():
    print('*' * 30)
    train_model(value, key)
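To compare the eight models at a glance, the held-out accuracies can be collected into one sorted table; a minimal sketch (it refits each model, so results match the loop above only up to the randomness of the unseeded random forest):
# Tabulate test-set accuracy per model, best first
scores = {name: accuracy_score(y_test, m.fit(x_train, y_train).predict(x_test))
          for name, m in model_list.items()}
print(pd.Series(scores).sort_values(ascending=False))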
Hyperparameter tuning with GridSearchCV to optimize the model
# Define the model
model = XGBClassifier(random_state=42)
# Set up the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 9, 11],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 150, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
scoring='accuracy', cv=5, verbose=0, n_jobs=-1)
# Fit the model
grid_search.fit(x_train, y_train)
# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
# Print the results
print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)
# Optionally, evaluate on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(x_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", test_accuracy)
Prediction
We use the XGBoost model for the final predictions, since it reaches roughly 83% accuracy.
model = XGBClassifier(
    colsample_bytree=1.0,
    learning_rate=0.01,
    max_depth=7,
    n_estimators=100,
    subsample=1.0,
    use_label_encoder=False,  # Optional; this flag was removed in recent XGBoost versions
    eval_metric='logloss'     # 'logloss' for binary classification ('mlogloss' is the multiclass variant)
)
train_model(model,'XGBOOST')
Finally, write out the prediction file:
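The original post stops before showing the export code. A minimal sketch of what it might look like, refitting the tuned model on all training rows and writing a Kaggle-style submission (the submission.csv filename is an assumption):
# Hypothetical export step: predict for the cleaned test rows and write
# a submission file in the competition's PassengerId/Survived format
features = ['Pclass', 'Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
model.fit(x, y)  # refit on the full training data
submission = pd.DataFrame({
    'PassengerId': clean_test_df['PassengerId'].astype(int),
    'Survived': model.predict(clean_test_df[features]).astype(int)
})
submission.to_csv('submission.csv', index=False)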
Author: Z_with_z