技术宅的结界
Training Machine Learning Algorithms on the Titanic Dataset

Posted on 2017-10-21 17:39:26

Titanic Survival Dataset: Classification and Tuning
I. Introduction to the Dataset
The Titanic survival dataset (https://www.kaggle.com/c/titanic/download/train.csv) contains records for 891 passengers. We analyze the following features to look for relationships with survival:
PassengerId: passenger ID
Survived: whether the passenger survived
Pclass: the passenger's cabin class
Name: passenger name
Sex: passenger sex
Age: passenger age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation
Current prediction accuracy on the Titanic dataset is around 80%+.

II. Initial Analysis of the Dataset
First few rows of train.csv:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

[Python]
import pandas as pd
# header=0 specifies that the first row is the header
dataset = pd.read_csv('train.csv', header=0)
print("\n----------------- head -------------------")
print(dataset.head())
print("\n----------------- info -------------------")
print(dataset.info())
print("\n----------------- Sex -------------------")
print(dataset["Sex"].value_counts())
print("\n----------------- Pclass -------------------")
print(dataset["Pclass"].value_counts())
print("\n----------------- SibSp -------------------")
print(dataset["SibSp"].value_counts())
print("\n----------------- Parch -------------------")
print(dataset["Parch"].value_counts())
print("\n----------------- Embarked -------------------")
print(dataset["Embarked"].value_counts())


----------------- head -------------------
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

----------------- info -------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

----------------- Sex -------------------
male      577
female    314
Name: Sex, dtype: int64

----------------- Pclass -------------------
3    491
1    216
2    184
Name: Pclass, dtype: int64

----------------- SibSp -------------------
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

----------------- Parch -------------------
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64

----------------- Embarked -------------------
S    644
C    168
Q     77
Name: Embarked, dtype: int64

From the output above we can conclude:
Age is missing ~20% of its values; Cabin is missing ~77% (too much, so we drop it); Embarked is missing 2 values
PassengerId is certainly unrelated to survival
Survived is the target and must be removed from the inputs
Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked can be encoded as ordinary features and are plausibly related to survival
Name: names may carry some hidden signal, but features are hard to extract from them directly
Ticket and Cabin: we will not extract features from these for now
Sex and Embarked need encoding; Ticket and Cabin may need encoding
Since the dataset is small, classical machine learning algorithms are a good fit
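The missing-value fractions quoted above follow from the info() output (714/891 Age values present, 204/891 Cabin values); they can also be computed directly with pandas. A minimal self-contained sketch on a toy frame (hypothetical values, not the real train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (hypothetical values)
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],   # 50% missing here
    "Cabin": [None, "C85", None, None],      # 75% missing here
})

# Fraction of missing values per column
missing = df.isnull().mean()
print(missing)
```

On the real dataset, `1 - 714/891 ≈ 0.20` and `1 - 204/891 ≈ 0.77`, matching the percentages above.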

[Python]
import warnings
warnings.filterwarnings("ignore")
import time
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import *
from sklearn.tree import *
from sklearn.neighbors import *
from sklearn.ensemble import *
from sklearn.svm import *
from sklearn.naive_bayes import *
from sklearn.preprocessing import *
from sklearn.metrics import accuracy_score
from sklearn.neural_network import *
from sklearn.model_selection import *  # provides train_test_split and GridSearchCV
clfs = [
    [LogisticRegression(max_iter=1000), {}],
    [LogisticRegressionCV(), {}],
    [SVC(), {'C': [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]}],
    [PassiveAggressiveClassifier(), {}],
    [RidgeClassifier(), {}],
    [RidgeClassifierCV(), {}],
    [SGDClassifier(), {}],
    [KNeighborsClassifier(n_neighbors=20), {}],
    [NearestCentroid(), {}],
    [DecisionTreeClassifier(), {}],
    [ExtraTreeClassifier(), {}],
    [AdaBoostClassifier(), {}],
    [BaggingClassifier(), {}],
    [ExtraTreeClassifier(), {}],
    [GradientBoostingClassifier(), {}],
    [RandomForestClassifier(n_estimators=100), {}],
    [BernoulliNB(), {}],
    [GaussianNB(), {}],
]

pipes = [Pipeline([
    ('sc', StandardScaler()),
    ('clf', GridSearchCV(pair[0], param_grid=pair[1]))
]) for pair in clfs]  # standardize the inputs, then classify

def test_classifier():
    for i in range(len(clfs)):
        start = time.time()
        acc_arr = []
        for j in range(testnum):
            # Repeated random 90/10 hold-out splits
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
            pipes[i].fit(X_train, y_train)
            y_pred = pipes[i].predict(X_test)
            acc_arr.append(accuracy_score(y_test, y_pred))
        npacc = np.array(acc_arr)
        end = time.time()
        # Report the minimum accuracy as well, for later tuning
        print('Accuracy:%s meanscore=%.2f minscore=%.2f time=%d' % (type(clfs[i][0]), npacc.mean(), npacc.min(), end - start))

dataset = pd.read_csv('train.csv', header=0)
# Embarked has only 2 missing values; fill with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Age has more missing values; this time fill with the mean
dataset["Age"].fillna(dataset["Age"].mean(), inplace=True)

X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
y = dataset["Survived"].values

testnum=100
test_classifier()


Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.71 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.80 minscore=0.70 time=23
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.83 minscore=0.71 time=34
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.69 minscore=0.36 time=0
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.79 minscore=0.67 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.73 minscore=0.40 time=0
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.80 minscore=0.72 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.77 minscore=0.63 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.77 minscore=0.67 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.76 minscore=0.67 time=0
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.81 minscore=0.73 time=33
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.80 minscore=0.70 time=8
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.77 minscore=0.62 time=0
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.82 minscore=0.72 time=28
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.81 minscore=0.71 time=66
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.77 minscore=0.61 time=0
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.78 minscore=0.70 time=0
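The test_classifier loop is a hand-rolled form of repeated hold-out evaluation. A more concise alternative (a sketch, not the post's code, shown on synthetic data so it runs standalone) is cross_val_score with the same scaler-plus-classifier pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 891x7 Titanic feature matrix
X, y = make_classification(n_samples=891, n_features=7, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=10)  # 10-fold cross-validation
print("meanscore=%.2f minscore=%.2f" % (scores.mean(), scores.min()))
```

Cross-validation guarantees every sample is used for testing exactly once, whereas 100 random 90/10 splits may test some samples many times and others never.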

III. First Round of Tuning
The results above show that, without any tuning, the classifiers reach roughly 80%+ accuracy on the Titanic dataset. Next we adjust parameters and data to push accuracy higher, using a random forest to predict the missing Age values. Common ways to handle missing values:
If the missing samples are a very large fraction of the total, we may simply drop the feature; keeping it could introduce noise and hurt the final result. Alternatively, treat "has a value" and "has no value" as two classes.
If a moderate fraction is missing and the feature is non-continuous (e.g. categorical), treat NaN as a new category and add it to the feature's categories.
If only a few values are missing, we can fit a model on the known values and use it to fill in the rest.
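The second strategy (NaN as its own category) is the one the post does not demonstrate; a minimal sketch of it on a hypothetical categorical column:

```python
import pandas as pd

# Hypothetical categorical column with missing entries
s = pd.Series(["C85", None, "C123", None])

# Treat missing values as their own category, then integer-encode
filled = s.fillna("Missing")
codes = filled.astype("category").cat.codes
print(list(codes))  # categories sort as C123 < C85 < Missing
```

This keeps "missingness" available as a signal to the classifier instead of discarding the rows or inventing a value.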

[Python]
from sklearn.ensemble import RandomForestRegressor

dataset = pd.read_csv('train.csv', header = 0)
dataset.drop("PassengerId", inplace=True, axis=1)
dataset.drop("Name", inplace=True, axis=1)
dataset.drop("Ticket", inplace=True, axis=1)
dataset.drop("Cabin", inplace=True, axis=1)

# Embarked has only 2 missing values; fill with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)

# Use a random forest to predict the missing values, with Age as the target
datasetxx = dataset[["Age", "Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]]
known_age = datasetxx[datasetxx["Age"].notnull()].values
unknown_age = datasetxx[datasetxx["Age"].isnull()].values
yy = known_age[:, 0]
XX = known_age[:, 1:]
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(XX, yy)
# Write the predictions back into the dataset (avoid chained assignment)
dataset.loc[dataset["Age"].isnull(), "Age"] = rfr.predict(unknown_age[:, 1:])

X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
y = dataset["Survived"].values

testnum=100
test_classifier()


Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.71 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.81 minscore=0.67 time=24
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.82 minscore=0.72 time=33
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.73 minscore=0.31 time=0
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.80 minscore=0.68 time=2
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.74 minscore=0.42 time=1
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.81 minscore=0.72 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.77 minscore=0.67 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.79 minscore=0.69 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.79 minscore=0.70 time=0
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.80 minscore=0.70 time=32
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.82 minscore=0.70 time=9
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.78 minscore=0.66 time=0
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.82 minscore=0.72 time=28
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.82 minscore=0.70 time=65
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.78 minscore=0.66 time=1
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.79 minscore=0.69 time=1

IV. Second Round of Tuning
Estimating the missing Age values with a random forest gives results very close to filling with the mean. This time we also extract a feature from the Ticket field.
* Ticket values are roughly a string followed by a number, e.g. "SOTON/OQ 392076"
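As a worked example of that convention, a standalone restatement of the trailing-number extraction (the helper name here is illustrative; the code below calls it get_ticket):

```python
import re

def trailing_number(ticket):
    # Return the trailing digits of a ticket string, or 0 if there are none
    m = re.search(r"[0-9]+$", ticket)
    return int(m.group(0)) if m else 0

print(trailing_number("SOTON/OQ 392076"))  # 392076
print(trailing_number("A/5 21171"))        # 21171
print(trailing_number("LINE"))             # 0 (no trailing digits)
```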

[Python]
import re
def get_ticket(ticket):
    # Return the trailing digits of the ticket string, or 0 if there are none
    out = re.compile("[0-9]+$").search(ticket)
    if out is None:
        return 0
    else:
        return int(out.group(0))

dataset = pd.read_csv('train.csv', header=0)
# Embarked has only 2 missing values; fill with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Age has more missing values; here we fill with the mean again
dataset["Age"].fillna(dataset["Age"].mean(), inplace=True)

dataset['Ticket'] = dataset['Ticket'].apply(get_ticket)

X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Ticket"]].values
y = dataset["Survived"].values

testnum=100
test_classifier()


Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.79 minscore=0.63 time=25
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.82 minscore=0.76 time=41
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.72 minscore=0.43 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.79 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.79 minscore=0.71 time=3
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.74 minscore=0.57 time=1
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.81 minscore=0.71 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.76 minscore=0.64 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.78 minscore=0.69 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.77 minscore=0.67 time=1
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.81 minscore=0.69 time=36
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.82 minscore=0.70 time=14
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.76 minscore=0.63 time=1
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.84 minscore=0.72 time=31
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.84 minscore=0.76 time=66
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.75 minscore=0.61 time=1
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.78 minscore=0.69 time=1

V. Summary
Parameter tuning makes little difference on the Titanic dataset. The better-performing algorithms were:
* RandomForestClassifier
* GradientBoostingClassifier
* AdaBoostClassifier / SVC, etc.
* neural networks
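The summary lists neural networks, but the post never actually runs one (sklearn.neural_network is imported and unused). A minimal sketch of how an MLPClassifier could join the comparison; shown on synthetic data so it runs standalone, with illustrative (untuned) layer sizes:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 891x7 Titanic feature matrix
X, y = make_classification(n_samples=891, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Feature scaling matters for MLPs just as it does for SVC
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0),
)
pipe.fit(X_train, y_train)
acc = accuracy_score(y_test, pipe.predict(X_test))
print("accuracy=%.2f" % acc)
```

On the real dataset, the same pipeline could simply be appended to the clfs list and evaluated by test_classifier like the others.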

Posted on 2017-10-21 18:15:01
Respect to the master!

Posted on 2017-11-8 08:40:20
Impressive, really impressive...
