Titanic Survivor Prediction with Python

References:

A notebook from Kaggle

Titanic Survivor Prediction

  • Problem Description
  • Code and Explanation
  • Preparation
  • Data Preprocessing
  • Chi-Squared Tests and Contingency Tables
  • Data Visualization
  • Univariate Analysis
  • Bivariate Analysis
  • Model Building and Evaluation
  • Prediction Results
    Problem Description

    Predict Titanic survivors with Python, walking through feature preprocessing, algorithm selection, and the rest of the workflow.

    Code and Explanation

    Preparation

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    pd.set_option('display.max_columns', None)  # Show all columns
    pd.set_option('display.width', 1000)
    # Ignore all DeprecationWarnings
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    from scipy.stats import chi2_contingency
    
    train_df = pd.read_csv(r'D:\下载文件\Titanic data\Titanic data\train.csv')
    test_df = pd.read_csv(r'D:\下载文件\Titanic data\Titanic data\test.csv')
    

    Data Preprocessing

    First, get a basic feel for the data:

    print('Train Dataset Shape',train_df.shape)
    print('------------------------------------')
    print('Test Dataset Shape',test_df.shape)
    

    Result:

    As shown, train.csv has 891 rows and 12 columns, while test.csv has 418 rows and 11 columns.
    Merge the two tables and add a column recording which table each row came from:

    train_df['dataset_type']='train'
    test_df['dataset_type']='test'
    df=pd.concat([train_df,test_df])
    df.head()
    
    print('Combine Dataset Shape',df.shape)
    df['dataset_type'].value_counts()
    

    Output:

    Check for duplicate rows:

    df.duplicated().sum()
    

    The result is 0, so there are no duplicates.
    Next, check for missing values:

    df.info()
    plt.figure(figsize=(10, 5))
    records = df.isnull().mean() * 100
    ax = records.plot(kind='bar')  # plot first so the labels below apply to these axes
    plt.xlabel('Features')
    plt.ylabel('Percentage Of Missing Values')
    plt.title('Missing Value Plot')
    
    for i, value in enumerate(records):
        plt.text(i, value + 0.5, f'{value:.2f}%', ha='center', va='bottom')
    
    plt.show()
    

    Result:


    The null-value analysis shows that the Cabin feature is missing about 77% of its values. With that much data absent, imputation is impractical, and cabin location has little bearing on whether a passenger was rescued, so we drop the Cabin column to protect data quality.
    Likewise, Ticket has the largest number of distinct values of any feature and the least predictive influence on survival, so we drop the Ticket column as well.
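
    A quick check behind both claims (a minimal sketch, assuming df as defined above):

    print('Cabin missing fraction:', df['Cabin'].isnull().mean())  # roughly 0.77
    print('Distinct Ticket values:', df['Ticket'].nunique())       # most tickets are distinct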

    df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)
    df.head()
    df.info()
    df.describe(include='all')
    

    Result:

    After the drop, the combined table goes from 13 columns (12 original plus dataset_type) down to 11.
    Age, Fare, and Embarked still contain missing values; we analyze the data to decide how to fill them.
    Start with Age and Fare:

    plt.figure(figsize=(10,5))
    
    plt.subplot(2,2,1)
    df['Age'].plot(kind='box')
    plt.subplot(2,2,2)
    df['Age'].plot(kind='hist')
    
    
    plt.subplot(2,2,3)
    df['Fare'].plot(kind='box')
    plt.subplot(2,2,4)
    df['Fare'].plot(kind='kde')
    
    
    
    plt.tight_layout()
    
    plt.show()
    

    The results are as follows:

    Box plots:

    Combined with the histogram and density plot, Fare is clearly right-skewed, so to blunt the effect of the skew we convert both features into binned categorical data.

    df[['Age','Fare']].skew()
    
    df['Embarked'].value_counts()
    
    print('Median Fare (fares > 100):', df[df['Fare'] > 100]['Fare'].median())
    print('Median Age:', df['Age'].median())
    
    

    Result:


    Embarked (the port of embarkation) is most frequently 'S'.
    Now fill in the missing values:

    df['Age'] = df['Age'].fillna(df['Age'].median())
    
    # The lone missing Fare is filled with the median of the high fares (> 100)
    df['Fare'] = df['Fare'].fillna(df[df['Fare'] > 100]['Fare'].median())
    
    df['Embarked'] = df['Embarked'].fillna('S')
    # Upper edge raised from 263 to 600 so the maximum fare (512.33) is not left unbinned
    fare_bins = [0, 7.89, 14.45, 31.07, 100, 600]
    # Define professional labels for each bin
    fare_labels = ['Very Low Fare', 'Low Fare', 'Moderate Fare', 'High Fare', 'Very High Fare']
    # Create a new column 'Fare_bin' by binning the 'Fare' column
    df['Fare_bin'] = pd.cut(df['Fare'], bins=fare_bins, labels=fare_labels, include_lowest=True)
    
    age_bins = [0, 12, 21, 28, 39, 80]
    
    # Define professional labels for each bin
    age_labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
    
    # Create a new column 'Age_bin' by binning the 'Age' column
    df['Age_bin'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, include_lowest=True)
    df.head()
    
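
    The fare edges (7.89, 14.45, 31.07) sit almost exactly on the quartiles of Fare, which appears to be where they came from, although the notebook does not say so. A quick check, run before Fare is dropped below (a minimal sketch):

    # Fare quartiles of the combined data -- compare with the bin edges above
    print(df['Fare'].quantile([0.25, 0.5, 0.75]))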

    Then drop the original Age and Fare columns:

    df.drop(['Fare',"Age"],axis=1,inplace=True)
    df.head()
    

    Apply the same binning-and-consolidation approach to the remaining features:

    df['Name']
    
    def get_title(name):
        # Titles sit between the comma and the period, e.g. 'Braund, Mr. Owen Harris'
        parts = name.split(',')
        if len(parts) > 1:
            title = parts[1].split('.')[0].strip()
            return title
        else:
            return 'Unknown'
    df['Title'] = df['Name'].apply(get_title)
    # After 'Master', each remaining title has fewer than 10 records, so we group them into 'Others'
    df['Title'].value_counts()
    
    titles_to_keep = ['Mr', 'Miss', 'Mrs', 'Master']
    
    df['Title'] = df['Title'].apply(lambda x: x if x in titles_to_keep else 'Others')
    
    df['Title'].value_counts()
    
    df.drop(['Name'],axis=1,inplace=True)
    df.head()
    df['FamilySize']=df['SibSp']+df['Parch']
    df.head()
    df['FamilySize'].plot(kind='hist')
    
    bins = [-1, 0, 2, 4, 6, 10]  # -1 is included to capture 0
    labels = ['Single', 'Small Family', 'Medium Family', 'Large Family', 'Very Large Family']
    
    # Create the FamilySize bin feature
    df['FamilySizeCategory'] = pd.cut(df['FamilySize'], bins=bins, labels=labels)
    df['FamilySizeCategory'].value_counts()
    
    df.drop(['FamilySize','SibSp','Parch'],axis=1,inplace=True)
    df.head()
    df.info()
    

    Chi-Squared Tests and Contingency Tables

    def chi_squared_test(df, categorical_variable):
        # Create a contingency table
        contingency_table = pd.crosstab(df[categorical_variable], df['Survived'])
    
        # Perform Chi-Squared test
        chi2, p, dof, expected = chi2_contingency(contingency_table)
    
        return chi2, p, contingency_table
    
    
    # List of categorical variables to test
    categorical_variables = ['Pclass', 'Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
    
    # Perform Chi-Squared test for each variable and store results
    results = {}
    for variable in categorical_variables:
        chi2_stat, p_value, contingency = chi_squared_test(df, variable)
        results[variable] = {
            'Chi-Squared Statistic': chi2_stat,
            'p-value': p_value,
            'Contingency Table': contingency
        }
    
    # Display results
    for variable, result in results.items():
        print(f"Variable: {variable}")
        print(f"Chi-Squared Statistic: {result['Chi-Squared Statistic']:.4f}")
        print(f"p-value: {result['p-value']:.4f}")
        print("Contingency Table:")
        print(result['Contingency Table'])
        print("\n" + "-" * 40 + "\n")
    

    Results:


    Interpreting the results:
    Chi-squared statistic: the larger the value, the stronger the association between the categorical variable and survival.
    p-value: conventionally, a p-value below 0.05 indicates a statistically significant association between the categorical variable and survival.

    Following the procedure above, you can analyze how each categorical factor relates to survival on the Titanic; the sketch below adds a sample-size-adjusted comparison.
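
    Because the raw chi-squared statistic grows with sample size and table dimensions, a normalized effect size such as Cramér's V makes "larger means stronger" comparisons fairer across variables. A minimal sketch reusing the chi_squared_test helper above (the cramers_v helper itself is an illustration, not part of the original notebook):

    import numpy as np

    def cramers_v(chi2_stat, contingency_table):
        # Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging over [0, 1]
        n = contingency_table.to_numpy().sum()
        r, c = contingency_table.shape
        return np.sqrt(chi2_stat / (n * (min(r, c) - 1)))

    for variable in categorical_variables:
        chi2_stat, p_value, contingency = chi_squared_test(df, variable)
        print(f'{variable}: V = {cramers_v(chi2_stat, contingency):.3f}')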

    The features most significantly associated with survival on the Titanic are:
    Sex: females had the highest survival rate.
    Title: titles encode both sex and social status and are closely tied to survival chances.
    Pclass: first-class passengers survived at a far higher rate than those in the lower classes.
    Fare_bin: passengers who paid higher fares survived at higher rates, consistent with the Pclass association.
    Embarked: the port of embarkation also plays a role, with some ports showing higher survival rates.
    FamilySizeCategory: significant, but with a smaller effect than the features above.
    Age_bin: significant, yet the weakest association with survival among these features; children had the highest survival rate.
    In short, Sex and Pclass matter most for survival on the Titanic, followed by Title; the sketch below reproduces this ranking from the stored results.
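
    A minimal sketch that sorts the stored results by chi-squared statistic (assuming the results dict built in the loop above):

    # Rank variables by chi-squared statistic, strongest association first
    ranked = sorted(results.items(),
                    key=lambda kv: kv[1]['Chi-Squared Statistic'],
                    reverse=True)
    for variable, result in ranked:
        print(f"{variable}: chi2={result['Chi-Squared Statistic']:.1f}, "
              f"p={result['p-value']:.4f}")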

    Data Visualization

    Univariate Analysis

    categorical_cols = ['Survived', 'Pclass', 'Sex', 'Embarked', 
                        'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
    
    # Set up the subplot grid
    n_cols = 2
    n_rows = (len(categorical_cols) + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 4))
    axes = axes.flatten()  # Flatten to easily index
    
    # Create bar plots for each categorical column using sns.barplot
    for idx, col in enumerate(categorical_cols):
        # Calculate counts
        counts = df[col].value_counts().reset_index()
        counts.columns = [col, 'Count']  # Rename columns for clarity
    
        # Create the bar plot
        sns.barplot(data=counts, x=col, y='Count', ax=axes[idx], palette='coolwarm')
    
        axes[idx].set_title(f'Count of {col}')
        axes[idx].set_ylabel('Count')
        axes[idx].set_xlabel(col)
    
        # Annotate the bars with their values
        for p in axes[idx].patches:
            axes[idx].annotate(f'{int(p.get_height())}', 
                               (p.get_x() + p.get_width() / 2., p.get_height()), 
                               ha='center', va='bottom', 
                               fontsize=10, color='black', 
                               xytext=(0, 5),  # 5 points vertical offset
                               textcoords='offset points')
    
    # Remove any empty subplots
    for j in range(idx + 1, len(axes)):
        fig.delaxes(axes[j])
    
    plt.tight_layout()
    plt.show()
    

    Bivariate Analysis

    Now we analyze the features against survival for more insight. Only the 891 training records carry a Survived value. The univariate analysis above counted over the combined train and test data; for the bivariate analysis we use only the training data to look for patterns.
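
    Because Survived is NaN on the test rows, any crosstab against it silently keeps only the training records; a quick sanity check (a minimal sketch):

    print(df['Survived'].notna().sum())  # expected: 891 training rows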

    #Survival Analysis by Demographics:
    #Investigate survival rates across different age groups and genders.
    Demographics_crosstab=pd.crosstab(index=df['Pclass'],columns=[df['Sex'],df['Age_bin'],df['Survived']])
    plt.figure(figsize=(14, 6))
    sns.heatmap(Demographics_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
    plt.title('Survival Count by Pclass, Sex, and Age Group')
    plt.ylabel('Pclass')
    plt.xlabel('Sex, Age Group, and Survival Status')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    


    The chart above makes it clear that male third-class (Pclass 3) passengers in the teenager age band had the lowest survival rate.

    #Socioeconomic Impact:
    #Analyze how class and fare influence survival rates.
    Socioeconomic_crosstab=pd.crosstab(index=df['Pclass'],columns=[df['Fare_bin'],df['Survived']])
    plt.figure(figsize=(14, 6))
    sns.heatmap(Socioeconomic_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
    plt.title('Survival Count by Pclass and Fare')
    plt.ylabel('Pclass')
    plt.xlabel('Fare and Survival Status')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    


    The chart above shows that among the third-class (Pclass 3) passengers who did not survive, 130 fell in the 'Very Low Fare' band and 134 in the 'Low Fare' band, 264 passengers in total.

    #Family and Social Structure Impact:
    #Examine whether family size affected survival chances.
    
    family_crosstab=pd.crosstab(index=df['FamilySizeCategory'],columns=[df['Survived']])
    plt.figure(figsize=(5, 3))
    sns.heatmap(family_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
    plt.title('Survival Count by Family Size')
    plt.ylabel('Family Size')
    plt.xlabel('Survival Status')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    

    title_crosstab = pd.crosstab(index=df['Title'], columns=[df['Survived']])
    # Travel Origin Analysis
    embarked_crosstab = pd.crosstab(index=df['Embarked'], columns=[df['Survived']])
    
    # Set up the figure with 1 row and 2 columns
    plt.figure(figsize=(12, 5))
    
    # Subplot for Title Analysis
    plt.subplot(1, 2, 1)
    sns.heatmap(title_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
    plt.title('Survival Count by Title')
    plt.ylabel('Title')
    plt.xlabel('Survival Status')
    plt.xticks(rotation=45)
    
    # Subplot for Travel Origin Analysis
    plt.subplot(1, 2, 2)
    sns.heatmap(embarked_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
    plt.title('Survival Count by Embarkation Point')
    plt.ylabel('Embarkation Point')
    plt.xlabel('Survival Status')
    plt.xticks(rotation=45)
    
    # Adjust layout
    plt.tight_layout()
    plt.show()
    


    The charts show that passengers titled 'Mr' account for the largest number of non-survivors, 436 in total, and that among embarkation points 'S' has the most non-survivors, with 427 passengers lost.

    combined_crosstab = pd.crosstab(index=df['Survived'], columns=[df['Sex'], df['Embarked'], df['Pclass'], df['Fare_bin']])
    plt.figure(figsize=(30, 5))
    sns.heatmap(combined_crosstab, annot=True, fmt='d', cmap='viridis', cbar_kws={'label': 'Count'})
    plt.title('Survival Count by Sex, Embarked, Pclass, and Fare_bin')
    plt.ylabel('Survived')
    plt.xlabel('Sex, Embarked, Pclass, Fare_bin')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    

    Model Building and Evaluation

    df.head()
    df.columns
    
    columns = ['Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']
    
    from sklearn.preprocessing import LabelEncoder
    
    label_encoder = LabelEncoder()
    
    for column in columns:
        df[column] = label_encoder.fit_transform(df[column])
    
    df.head()
    
    clean_train_df = df[df['dataset_type'] == 'train']
    clean_test_df = df[df['dataset_type'] == 'test']
    print('Clean Train Dataset Shape', clean_train_df.shape)
    print('------------------------------------')
    print('Clean Test Dataset Shape', clean_test_df.shape)
    
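
    Refitting the one shared LabelEncoder per column works, but it discards each column's mapping as the loop moves on. A variant that keeps the fitted encoders around, so the integer codes stay inspectable and reversible (a sketch, not from the original post; run it instead of the loop above, not after it):

    # One fitted encoder per column; print each learned mapping
    encoders = {col: LabelEncoder().fit(df[col]) for col in columns}
    for col, enc in encoders.items():
        df[col] = enc.transform(df[col])
        print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))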

    from sklearn.linear_model import LogisticRegression  
    from sklearn.neighbors import KNeighborsClassifier  
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier  
    from sklearn.ensemble import RandomForestClassifier  
    from xgboost import XGBClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score, classification_report
    
    x = clean_train_df[['Pclass', 'Sex', 'Embarked', 'Fare_bin', 'Age_bin', 'Title', 'FamilySizeCategory']]
    # Survived became float after the concat (NaN on test rows), so cast back to int for the classifiers
    y = clean_train_df['Survived'].astype(int)
    
    from sklearn.model_selection import train_test_split,GridSearchCV
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
    
    def train_model(model, model_name):
        print(f'Model: {model_name}')
        
        # Fit the model on the training data
        model.fit(x_train, y_train)
        
        # Predictions on the training data
        y_train_pred = model.predict(x_train)
        # Predictions on the testing data
        y_test_pred = model.predict(x_test)
        
        # Calculate accuracy scores
        train_accuracy = accuracy_score(y_train, y_train_pred)
        test_accuracy = accuracy_score(y_test, y_test_pred)
        
        print(f'Training Accuracy Score: {train_accuracy:.2f}')
        print(f'Testing Accuracy Score: {test_accuracy:.2f}')
        
        # Generate classification report for testing data
        report = classification_report(y_test, y_test_pred)
        print('Classification Report:')
        print(report)
        
        return model
    model_list = dict(
        knn=KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2),
        svc=SVC(kernel='linear', random_state=0),
        logistic=LogisticRegression(),
        naive=GaussianNB(),
        tree=DecisionTreeClassifier(criterion='entropy', random_state=0),
        forest=RandomForestClassifier(n_estimators=50, criterion="entropy"),
        xgboost=XGBClassifier(),
        gradientboost=GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=1)
    )
    for key, value in model_list.items():
        print('*'*30)
        train_model(value,key)
    
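
    To compare the eight models at a glance, the same loop can collect the test-set accuracies into one table; a minimal sketch reusing model_list, x_train, and friends from above:

    # Refit each model and gather its test accuracy into a sorted summary
    scores = {name: accuracy_score(y_test, mdl.fit(x_train, y_train).predict(x_test))
              for name, mdl in model_list.items()}
    print(pd.Series(scores).sort_values(ascending=False))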




    Hyperparameter tuning with GridSearchCV to optimize the model

    # Define the model
    model = XGBClassifier(random_state=42)
    
    # Set up the parameter grid
    param_grid = {
        'max_depth': [3, 5, 7, 9, 11],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'n_estimators': [50, 100, 150, 200],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }
    
    # Set up the grid search
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, 
                               scoring='accuracy', cv=5, verbose=0, n_jobs=-1)
    
    # Fit the model
    grid_search.fit(x_train, y_train)
    
    # Get the best parameters and best score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    
    # Print the results
    print("Best Parameters:", best_params)
    print("Best Cross-Validation Score:", best_score)
    
    # Optionally, evaluate on the test set
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(x_test)
    test_accuracy = accuracy_score(y_test, y_pred)
    print("Test Set Accuracy:", test_accuracy)
    

    Prediction Results

    We use the XGBoost model to produce the final predictions, since it reaches roughly 83% accuracy.

    model = XGBClassifier(
        colsample_bytree=1.0,
        learning_rate=0.01,
        max_depth=7,
        n_estimators=100,
        subsample=1.0,
        use_label_encoder=False,  # ignored (and eventually removed) in recent XGBoost versions
        eval_metric='logloss'     # binary log loss; 'mlogloss' is the multi-class variant
    )
    
    train_model(model,'XGBOOST')
    


    Finally, write the output file:
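
    A minimal sketch of writing the Kaggle submission file, assuming the trained model and clean_test_df from above (the filename 'submission.csv' is an assumption, not from the original post):

    x_submit = clean_test_df[['Pclass', 'Sex', 'Embarked', 'Fare_bin',
                              'Age_bin', 'Title', 'FamilySizeCategory']]
    submission = pd.DataFrame({
        'PassengerId': clean_test_df['PassengerId'].astype(int),
        'Survived': model.predict(x_submit).astype(int)
    })
    submission.to_csv('submission.csv', index=False)  # PassengerId,Survived format expected by Kaggle
    print(submission.head())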

    Author: Z_with_z
