Titanic Survival Dataset: Classification and Tuning
1. Dataset Introduction
The Titanic survival dataset (https://www.kaggle.com/c/titanic/download/train.csv) contains records for 891 passengers. We analyze the following features to look for relationships with survival:
PassengerId: passenger ID
Survived: whether the passenger survived
Pclass: cabin class
Name: passenger name
Sex: passenger sex
Age: passenger age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation
Current prediction accuracy on the Titanic dataset is 80%+.
2. Initial Analysis of the Dataset
First few rows of train.csv:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
[Python]
import pandas as pd
# header=0 specifies that the first row is the header
dataset = pd.read_csv('train.csv', header = 0)
print("\n----------------- head -------------------")
print(dataset.head())
print("\n----------------- info -------------------")
print(dataset.info())
print("\n----------------- Sex -------------------")
print(dataset["Sex"].value_counts())
print("\n----------------- Pclass -------------------")
print(dataset["Pclass"].value_counts())
print("\n----------------- SibSp -------------------")
print(dataset["SibSp"].value_counts())
print("\n----------------- Parch -------------------")
print(dataset["Parch"].value_counts())
print("\n----------------- Embarked -------------------")
print(dataset["Embarked"].value_counts())
----------------- head -------------------
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
----------------- info -------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
----------------- Sex -------------------
male 577
female 314
Name: Sex, dtype: int64
----------------- Pclass -------------------
3 491
1 216
2 184
Name: Pclass, dtype: int64
----------------- SibSp -------------------
0 608
1 209
2 28
4 18
3 16
8 7
5 5
Name: SibSp, dtype: int64
----------------- Parch -------------------
0 678
1 118
2 80
5 5
3 5
4 4
6 1
Name: Parch, dtype: int64
----------------- Embarked -------------------
S 644
C 168
Q 77
Name: Embarked, dtype: int64
From the output above we can conclude:
Age is missing about 20% of its values; Cabin is missing about 77% (too many, so it is dropped); Embarked is missing 2 values
PassengerId is certainly unrelated to survival
Survived is the target, so it must be removed from the input
Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked can be encoded as ordinary features and are plausibly related to survival
Name may carry some hidden signal, but it is hard to extract features from it directly
Ticket and Cabin: we will not extract features from these for now
Sex and Embarked need encoding; Ticket and Cabin may need encoding
Since the dataset is small, classical machine learning algorithms are a good fit
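The missing-value ratios quoted above can be verified with one line of pandas. A minimal sketch, using a tiny inline frame in place of train.csv so it runs standalone:

```python
import pandas as pd
import numpy as np

# Tiny stand-in frame; in the article this would be pd.read_csv('train.csv')
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S", None],
})
# Fraction of missing values per column
missing_ratio = df.isnull().mean()
print(missing_ratio)
```

On the real train.csv this prints roughly 0.199 for Age, 0.771 for Cabin, and 0.002 for Embarked, matching the percentages above.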
[Python]
import warnings
warnings.filterwarnings("ignore")
import time
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import *
from sklearn.tree import *
from sklearn.neighbors import *
from sklearn.ensemble import *
from sklearn.svm import *
from sklearn.naive_bayes import *
from sklearn.preprocessing import *
from sklearn.metrics import accuracy_score
from sklearn.neural_network import *
from sklearn.model_selection import *  # provides train_test_split and GridSearchCV
clfs = [
    [LogisticRegression(max_iter=1000), {}],
    [LogisticRegressionCV(), {}],
    [SVC(), {'C': [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]}],
    [PassiveAggressiveClassifier(), {}],
    [RidgeClassifier(), {}],
    [RidgeClassifierCV(), {}],
    [SGDClassifier(), {}],
    [KNeighborsClassifier(n_neighbors=20), {}],
    [NearestCentroid(), {}],
    [DecisionTreeClassifier(), {}],
    [ExtraTreeClassifier(), {}],
    [AdaBoostClassifier(), {}],
    [BaggingClassifier(), {}],
    [ExtraTreeClassifier(), {}],
    [GradientBoostingClassifier(), {}],
    [RandomForestClassifier(n_estimators=100), {}],
    [BernoulliNB(), {}],
    [GaussianNB(), {}],
]
# Standardize the inputs, then grid-search each classifier
pipes = [Pipeline([
    ('sc', StandardScaler()),
    ('clf', GridSearchCV(pair[0], param_grid=pair[1]))
]) for pair in clfs]
def test_classifier():
    for i in range(0, len(clfs)):
        start = time.time()
        acc_arr = []
        for j in range(0, testnum):
            # Fresh random 90/10 split for each run
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
            pipes[i].fit(X_train, y_train)
            y_pred = pipes[i].predict(X_test)
            acc_arr.append(accuracy_score(y_test, y_pred))
        npacc = np.array(acc_arr)
        end = time.time()
        print('Accuracy:%s meanscore=%.2f minscore=%.2f time=%d' % (type(clfs[i][0]), npacc.mean(), npacc.min(), end - start))
dataset = pd.read_csv('train.csv', header=0)
# Embarked has only 2 missing values, so fill them with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked column
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex column
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Age has many missing values; fill them with the mean for now
dataset["Age"].fillna(dataset["Age"].mean(), inplace=True)
X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
y = dataset["Survived"].values
testnum=100
test_classifier()
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.71 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.80 minscore=0.70 time=23
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.83 minscore=0.71 time=34
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.69 minscore=0.36 time=0
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.79 minscore=0.67 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.73 minscore=0.40 time=0
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.80 minscore=0.72 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.77 minscore=0.63 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.77 minscore=0.67 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.76 minscore=0.67 time=0
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.81 minscore=0.73 time=33
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.80 minscore=0.70 time=8
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.77 minscore=0.62 time=0
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.82 minscore=0.72 time=28
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.81 minscore=0.71 time=66
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.77 minscore=0.61 time=0
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.78 minscore=0.70 time=0
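As an aside, the repeated train_test_split loop in test_classifier is essentially hand-rolled cross-validation; scikit-learn's cross_val_score does the same job with non-overlapping folds. A sketch on synthetic stand-in data (this is an alternative, not the author's method; the data here is not the Titanic set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the Titanic features
rng = np.random.RandomState(0)
X = rng.randn(200, 7)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = Pipeline([("sc", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=10)  # 10 non-overlapping folds
print("mean=%.2f min=%.2f" % (scores.mean(), scores.min()))
```

Unlike random resampling, every sample appears in a test fold exactly once, so the mean score is slightly less optimistic and more reproducible.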
3. First Round of Tuning
As the results above show, classifiers with no tuning at all reach roughly 80%+ accuracy on the Titanic dataset. Below we adjust parameters and data to push the accuracy higher, starting by using a random forest to predict the missing Age values. There are several common strategies for handling missing values:
If the proportion of missing values is very high, we may simply drop the feature: including it would likely just add noise and hurt the final result. Alternatively, we can split samples into a "has a value" group and a "has no value" group.
If the proportion of missing values is moderate and the feature is discrete (e.g. categorical), we can treat NaN as a new category and add it to the feature's categories.
In some cases only a few values are missing, and we can fit a model on the existing values to impute the rest.
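The second strategy (NaN as its own category) could look like the following for the Cabin column; keeping only the deck letter and mapping missing values to "U" (unknown) are my own illustrative assumptions, not choices from the article:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan, "C123"]})
# Keep only the deck letter; treat missing as its own category "U" (unknown)
deck = df["Cabin"].str[0].fillna("U")
df["Deck"] = LabelEncoder().fit_transform(deck)
print(deck.tolist())        # ['C', 'U', 'E', 'U', 'C']
print(df["Deck"].tolist())  # [0, 2, 1, 2, 0]
```

This would let a model learn whether "cabin unknown" itself correlates with survival, instead of dropping the column entirely.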
[Python]
from sklearn.ensemble import RandomForestRegressor
dataset = pd.read_csv('train.csv', header=0)
dataset.drop("PassengerId", inplace=True, axis=1)
dataset.drop("Name", inplace=True, axis=1)
dataset.drop("Ticket", inplace=True, axis=1)
dataset.drop("Cabin", inplace=True, axis=1)
# Embarked has only 2 missing values, so fill them with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked column
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex column
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Use a random forest to impute the missing values, with Age as the regression target
datasetxx = dataset[["Age", "Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]]
known_age = datasetxx[datasetxx["Age"].notnull()].values
unknown_age = datasetxx[datasetxx["Age"].isnull()].values
yy = known_age[:, 0]
XX = known_age[:, 1:]
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(XX, yy)
# Write the predicted ages back into the dataset
dataset.loc[dataset["Age"].isnull(), "Age"] = rfr.predict(unknown_age[:, 1:])
X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
y = dataset["Survived"].values
testnum = 100
test_classifier()
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.71 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.81 minscore=0.67 time=24
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.82 minscore=0.72 time=33
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.73 minscore=0.31 time=0
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.80 minscore=0.68 time=2
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.74 minscore=0.42 time=1
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.81 minscore=0.72 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.77 minscore=0.67 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.79 minscore=0.69 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.79 minscore=0.70 time=0
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.80 minscore=0.70 time=32
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.82 minscore=0.70 time=9
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.78 minscore=0.66 time=0
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.82 minscore=0.72 time=28
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.82 minscore=0.70 time=65
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.78 minscore=0.66 time=1
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.79 minscore=0.69 time=1
4. Second Round of Tuning
Imputing the missing ages with a random forest gives results close to plain mean imputation. This time we extract feature values from the Ticket and Cabin columns.
* Ticket values are roughly a string followed by a number, e.g. "SOTON/OQ 392076"
[Python]
import re
def get_ticket(ticket):
    # Extract the trailing number from the ticket string; 0 if there is none
    out = re.compile("[0-9]+$").search(ticket)
    if out is None:
        return 0
    else:
        return int(out.group(0))
dataset = pd.read_csv('train.csv', header=0)
# Embarked has only 2 missing values, so fill them with the most common value, S
dataset["Embarked"].fillna("S", inplace=True)
# Encode the Embarked column
dataset["Embarked"] = LabelEncoder().fit_transform(dataset["Embarked"].values)
# Encode the Sex column
dataset["Sex"] = LabelEncoder().fit_transform(dataset["Sex"].values)
# Age has many missing values; fill them with the mean this time
dataset["Age"].fillna(dataset["Age"].mean(), inplace=True)
dataset['Ticket'] = dataset['Ticket'].apply(get_ticket)
X = dataset[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Ticket"]].values
y = dataset["Survived"].values
testnum = 100
test_classifier()
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegression'> meanscore=0.80 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.logistic.LogisticRegressionCV'> meanscore=0.79 minscore=0.63 time=25
Accuracy:<class 'sklearn.svm.classes.SVC'> meanscore=0.82 minscore=0.76 time=41
Accuracy:<class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> meanscore=0.72 minscore=0.43 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifier'> meanscore=0.79 minscore=0.69 time=1
Accuracy:<class 'sklearn.linear_model.ridge.RidgeClassifierCV'> meanscore=0.79 minscore=0.71 time=3
Accuracy:<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> meanscore=0.74 minscore=0.57 time=1
Accuracy:<class 'sklearn.neighbors.classification.KNeighborsClassifier'> meanscore=0.81 minscore=0.71 time=4
Accuracy:<class 'sklearn.neighbors.nearest_centroid.NearestCentroid'> meanscore=0.76 minscore=0.64 time=0
Accuracy:<class 'sklearn.tree.tree.DecisionTreeClassifier'> meanscore=0.78 minscore=0.69 time=1
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.77 minscore=0.67 time=1
Accuracy:<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> meanscore=0.81 minscore=0.69 time=36
Accuracy:<class 'sklearn.ensemble.bagging.BaggingClassifier'> meanscore=0.82 minscore=0.70 time=14
Accuracy:<class 'sklearn.tree.tree.ExtraTreeClassifier'> meanscore=0.76 minscore=0.63 time=1
Accuracy:<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> meanscore=0.84 minscore=0.72 time=31
Accuracy:<class 'sklearn.ensemble.forest.RandomForestClassifier'> meanscore=0.84 minscore=0.76 time=66
Accuracy:<class 'sklearn.naive_bayes.BernoulliNB'> meanscore=0.75 minscore=0.61 time=1
Accuracy:<class 'sklearn.naive_bayes.GaussianNB'> meanscore=0.78 minscore=0.69 time=1
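Note that most param_grid entries in the clfs list are empty, so GridSearchCV has nothing to search for those models. A non-trivial grid for RandomForestClassifier might look like the following; the specific values are my own illustrative choices, not tuned settings from the article, and synthetic data stands in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the 7 Titanic feature columns
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3],
}
gs = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
gs.fit(X, y)
print(gs.best_params_)
print("best cv score: %.2f" % gs.best_score_)
```

Dropping a grid like this into the corresponding clfs entry would let the existing pipeline search it automatically.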
5. Summary
Parameter tuning makes little difference on the Titanic dataset. The better-performing algorithms are:
* RandomForestClassifier
* GradientBoostingClassifier
* AdaBoostClassifier / SVC, etc.
* Neural networks
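sklearn.neural_network is imported in the code above but never actually used. A neural-network entry could be added to the clfs list as a simple MLPClassifier; the hidden-layer sizes and iteration count below are my own guesses, not tuned values from the article, and synthetic data again stands in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the 7 Titanic feature columns
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Feature scaling matters for MLPs, so reuse the StandardScaler pipeline pattern
pipe = Pipeline([("sc", StandardScaler()),
                 ("clf", MLPClassifier(hidden_layer_sizes=(32, 16),
                                       max_iter=2000, random_state=0))])
pipe.fit(X, y)
print("train accuracy: %.2f" % pipe.score(X, y))
```

Appended as `[MLPClassifier(max_iter=2000), {}]` in clfs, it would be evaluated by test_classifier alongside the other models.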