Training Machine Learning Algorithms on the Titanic Dataset

Posted 2017-10-21 17:39:26
Classification and Tuning on the Titanic Survival Dataset
1. Dataset Overview
The Titanic survival dataset (https://www.kaggle.com/c/titanic/download/train.csv) contains records for 891 passengers. We analyze the following features to look for relationships with survival:
PassengerId: passenger ID
Survived: whether the passenger survived
Pclass: the passenger's cabin class
Name: passenger name
Sex: passenger sex
Age: passenger age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation
Typical prediction accuracy on the Titanic dataset is currently around 80%+.

2. Initial Analysis of the Dataset
The first few rows of train.csv:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S


import pandas as pd

# header=0 treats the first row as the column headers
dataset = pd.read_csv('train.csv', header=0)
print("\n----------------- head -------------------")
print(dataset.head())
print("\n----------------- info -------------------")
print(dataset.info())
print("\n----------------- Sex -------------------")
print(dataset["Sex"].value_counts())
print("\n----------------- Pclass -------------------")
print(dataset["Pclass"].value_counts())
print("\n----------------- SibSp -------------------")
print(dataset["SibSp"].value_counts())
print("\n----------------- Parch -------------------")
print(dataset["Parch"].value_counts())
print("\n----------------- Embarked -------------------")
print(dataset["Embarked"].value_counts())


----------------- head -------------------
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

----------------- info -------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

----------------- Sex -------------------
male      577
female    314
Name: Sex, dtype: int64

----------------- Pclass -------------------
3    491
1    216
2    184
Name: Pclass, dtype: int64

----------------- SibSp -------------------
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

----------------- Parch -------------------
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64

----------------- Embarked -------------------
S    644
C    168
Q     77
Name: Embarked, dtype: int64

From the output above we can see:
Age is missing about 20% of its values and Cabin about 77% (far too many, so Cabin is dropped); Embarked is missing 2 values. These ratios can be verified with the sketch after this list.
PassengerId is certainly unrelated to survival.
Survived is the prediction target, so it must be removed from the inputs.
Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked can be encoded as ordinary numeric features and are plausibly related to survival.
Name: a name may hide some implicit signal, but it is hard to extract features from it directly.
Ticket and Cabin: we will not extract features from these for now.
The Sex and Embarked fields need encoding; the Ticket and Cabin fields may need encoding as well.
Since the dataset is small, classical machine learning algorithms are a good fit.
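
The missing-value ratios quoted above are easy to check directly; a minimal sketch, assuming train.csv is in the working directory:

import pandas as pd

dataset = pd.read_csv('train.csv', header=0)
# Fraction of missing values per column:
# Cabin ≈ 0.77, Age ≈ 0.20, Embarked ≈ 0.002
print(dataset.isnull().mean().sort_values(ascending=False).head(3))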


import warnings
warnings.filterwarnings("ignore")
import time
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import *
from sklearn.tree import *
from sklearn.neighbors import *
from sklearn.ensemble import *
from sklearn.svm import *
from sklearn.naive_bayes import *
from sklearn.preprocessing import *
from sklearn.metrics import accuracy_score
from sklearn.neural_network import *
from sklearn.model_selection import *  # provides train_test_split and GridSearchCV

# Each entry is [classifier, parameter grid for GridSearchCV]
clfs = [
    [LogisticRegression(max_iter=1000), {}],
    [LogisticRegressionCV(), {}],
    [SVC(), {'C': [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]}],
    [PassiveAggressiveClassifier(), {}],
    [RidgeClassifier(), {}],
    [RidgeClassifierCV(), {}],
    [SGDClassifier(), {}],
    [KNeighborsClassifier(n_neighbors=20), {}],
    [NearestCentroid(), {}],
    [DecisionTreeClassifier(), {}],
    [ExtraTreeClassifier(), {}],
    [AdaBoostClassifier(), {}],
    [BaggingClassifier(), {}],
    [ExtraTreeClassifier(), {}],
    [GradientBoostingClassifier(), {}],
    [RandomForestClassifier(n_estimators=100), {}],
    [BernoulliNB(), {}],
    [GaussianNB(), {}],
]

# One pipeline per classifier: standardize the inputs,
# then grid-search the classifier's parameters
pipes = [Pipeline([
    ('sc', StandardScaler()),
    ('clf', GridSearchCV(pair[0], param_grid=pair[1]))
]) for pair in clfs]

def test_classifier():
    for i in range(len(clfs)):
        start = time.time()
        acc_arr = []
        for j in range(testnum):
            # Re-split on every round so the mean/min scores
            # reflect many random train/test partitions
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
            pipes[i].fit(X_train, y_train)
            y_pred = pipes[i].predict(X_test)
            acc_arr.append(accuracy_score(y_test, y_pred))
        npacc = np.array(acc_arr)
        end = time.time()
        print('Accuracy:%s meanscore=%.2f minscore=%.2f time=%d' % (type(clfs[i][0]), npacc.mean(), npacc.min(), end - start))

dataset = pd.read_csv('train.csv', header=0)
# Embarked has only 2 missing values; fill them with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field as integers
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field as integers
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Age has many missing values; this time fill them with the mean
dataset["Age"].fillna(dataset["Age"].mean(), inplace=True)

X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
y = dataset["Survived"].values

testnum = 100
test_classifier()


Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.71 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.80 minscore=0.70 time=23
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.83 minscore=0.71 time=34
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.69 minscore=0.36 time=0
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.79 minscore=0.67 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.73 minscore=0.40 time=0
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.80 minscore=0.72 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.77 minscore=0.63 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.77 minscore=0.67 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.76 minscore=0.67 time=0
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.81 minscore=0.73 time=33
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.80 minscore=0.70 time=8
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.77 minscore=0.62 time=0
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.82 minscore=0.72 time=28
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.81 minscore=0.71 time=66
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.77 minscore=0.61 time=0
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.78 minscore=0.70 time=0

3. First Round of Parameter Tuning
As the results above show, classifiers with no tuning at all already reach about 80%+ accuracy on the Titanic dataset. Below we adjust the classifiers' parameters and the data to push the accuracy higher, starting by predicting the missing Age values with a random forest. When values are missing, there are a few common ways to handle them (a sketch of the second one follows this list):
If the value is missing for a very large share of the samples, we may simply drop the feature; adding it as a feature would probably just introduce noise and hurt the final result. Alternatively, split the samples into two classes: those that have a value and those that do not.
If a moderate share is missing and the feature is categorical rather than continuous, treat NaN as a new category and add it to the existing ones.
If only a few values are missing, we can also fit a model to the known values and use it to fill in the gaps.
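
The code below simply drops Cabin, but as a minimal sketch of the second strategy (an illustration, not part of the original code), the missing cabins could instead be kept as their own category:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

dataset = pd.read_csv('train.csv', header=0)
# Reduce Cabin to its deck letter, with 'U' as a new "unknown" category,
# then encode the categories as integers
dataset["Cabin"] = dataset["Cabin"].str[0].fillna("U")
dataset["Cabin"] = LabelEncoder().fit_transform(dataset["Cabin"].values)
print(dataset["Cabin"].value_counts())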


from sklearn.ensemble import RandomForestRegressor

dataset = pd.read_csv('train.csv', header=0)
dataset.drop("PassengerId", inplace=True, axis=1)
dataset.drop("Name", inplace=True, axis=1)
dataset.drop("Ticket", inplace=True, axis=1)
dataset.drop("Cabin", inplace=True, axis=1)

# Embarked has only 2 missing values; fill them with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field as integers
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field as integers
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)

# Predict the missing values with a random forest, with Age as the regression target
datasetxx = dataset[["Age", "Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]]
known_age = datasetxx[datasetxx["Age"].notnull()].values
unknown_age = datasetxx[datasetxx["Age"].isnull()].values
yy = known_age[:, 0]   # the Age column is the target
XX = known_age[:, 1:]  # the remaining columns are the features
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(XX, yy)
# Write the predicted ages back into the dataset
dataset.loc[dataset["Age"].isnull(), "Age"] = rfr.predict(unknown_age[:, 1:])

X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
y = dataset["Survived"].values

testnum = 100
test_classifier()


Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.71 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.81 minscore=0.67 time=24
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.82 minscore=0.72 time=33
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.73 minscore=0.31 time=0
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.80 minscore=0.68 time=2
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.74 minscore=0.42 time=1
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.81 minscore=0.72 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.77 minscore=0.67 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.79 minscore=0.69 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.79 minscore=0.70 time=0
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.80 minscore=0.70 time=32
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.82 minscore=0.70 time=9
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.78 minscore=0.66 time=0
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.82 minscore=0.72 time=28
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.82 minscore=0.70 time=65
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.78 minscore=0.66 time=1
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.79 minscore=0.69 time=1

4. Second Round of Parameter Tuning
Random-forest imputation of the missing ages turns out to perform about the same as mean imputation. This time we parse feature values out of the Ticket and Cabin fields:
* Ticket values are roughly a string prefix plus a number, such as "SOTON/OQ 392076" (see the quick demo below).
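
A quick check (not in the original post) of what extracting the trailing number looks like on one ticket string:

import re

# Grab the trailing digits; tickets with no trailing number (e.g. "LINE") get 0
match = re.compile("[0-9]+$").search("SOTON/OQ 392076")
print(int(match.group(0)) if match else 0)  # prints 392076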


import re

def get_ticket(ticket):
    # Extract the trailing number from the ticket string; 0 if there is none
    out = re.compile("[0-9]+$").search(ticket)
    if out is None:
        return 0
    else:
        return int(out.group(0))

dataset = pd.read_csv('train.csv', header=0)
# Embarked has only 2 missing values; fill them with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked field as integers
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex field as integers
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Age has many missing values; fill them with the mean again
dataset["Age"].fillna(dataset["Age"].mean(), inplace=True)

dataset['Ticket'] = dataset['Ticket'].apply(get_ticket)

X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Ticket"]].values
y = dataset["Survived"].values

testnum = 100
test_classifier()


Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.79 minscore=0.63 time=25
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.82 minscore=0.76 time=41
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.72 minscore=0.43 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.79 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.79 minscore=0.71 time=3
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.74 minscore=0.57 time=1
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.81 minscore=0.71 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.76 minscore=0.64 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.78 minscore=0.69 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.77 minscore=0.67 time=1
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.81 minscore=0.69 time=36
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.82 minscore=0.70 time=14
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.76 minscore=0.63 time=1
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.84 minscore=0.72 time=31
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.84 minscore=0.76 time=66
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.75 minscore=0.61 time=1
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.78 minscore=0.69 time=1

5. Summary
Parameter tuning makes little difference on the Titanic dataset. The better-performing algorithms were:
* RandomForestClassifier
* GradientBoostingClassifier
* AdaBoostClassifier, SVC, and similar
* Neural networks (a sketch of how one could be tested follows)
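
Neural networks only appear above via the sklearn.neural_network import and were never actually added to clfs. A minimal sketch of how one could be evaluated with the same scale-then-classify setup, assuming the X and y built earlier (the layer sizes are an arbitrary choice for illustration):

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# A small two-layer network behind the same StandardScaler step
# used for the other classifiers
mlp_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('clf', MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000)),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
mlp_pipe.fit(X_train, y_train)
print(accuracy_score(y_test, mlp_pipe.predict(X_test)))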
Reply (2017-10-21 18:15):
Bowing to the master!
Reply (2017-11-8 08:40):
Impressive, very impressive...
