First Kaggle Competition: Titanic 0.82275 (Top 3%) & 0.83732 (Top 2%)

Post author: SnailDove
Post link: https://snaildove.github.io/2019/01/06/Titanic_with_name_sex_age_and_ticket_features-0.82275-0.83732/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 3.0 (https://creativecommons.org/licenses/by-nc-sa/3.0/) unless otherwise stated.

Posted on 2019-01-06 | In 中文 (Chinese)


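This post uses data analysis to find patterns in the data. The result beats a whole pile of random forest and xgboost solutions and outperforms many of the ensemble models entered in this competition, ranking at least 156 out of 10021 (Top 2%). In the end I kept only 4 features (name, sex, age, and Ticket), built new features from them, and then predicted with rules, i.e. several nested if-else statements. Once again I felt how powerful feature engineering is. This approach skips missing-value imputation, the other tedious preprocessing and data cleaning, and the later hyperparameter tuning and model ensembling.
Note: you need to write your own custom cross-validation function.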
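To make the idea of "new features plus nested if-else rules" concrete, here is a minimal sketch in plain Python. It is not the rule set from the notebooks: the helper names (`extract_title`, `is_boy`, `ticket_group_survival`, `predict`), the age cutoff of 16, and the specific rules are all assumptions chosen for illustration. It only shows how a title parsed from the name and a group keyed by Ticket could feed a rule-based prediction without imputing missing ages.

```python
import re

def extract_title(name):
    """Pull the honorific (e.g. 'Mr', 'Mrs', 'Master') out of a 'Surname, Title. Given' name."""
    match = re.search(r",\s*([^.,]+)\.", name)
    return match.group(1).strip() if match else ""

def is_boy(passenger):
    """Hypothetical definition: title 'Master', or a male with a known age under 16."""
    if passenger["Sex"] != "male":
        return False
    if extract_title(passenger["Name"]) == "Master":
        return True
    age = passenger.get("Age")  # may be None; no imputation is attempted
    return age is not None and age < 16

def ticket_group_survival(train_rows):
    """Map each ticket to the survival rate of the women and boys seen on it in training data."""
    totals = {}
    for row in train_rows:
        if row["Sex"] == "female" or is_boy(row):
            survived, count = totals.get(row["Ticket"], (0, 0))
            totals[row["Ticket"]] = (survived + row["Survived"], count + 1)
    return {ticket: s / c for ticket, (s, c) in totals.items()}

def predict(passenger, group_rate):
    """Illustrative nested if-else rules; returns 1 for survived, 0 otherwise."""
    rate = group_rate.get(passenger["Ticket"])  # None if the ticket was never seen
    if passenger["Sex"] == "female":
        if rate is not None and rate == 0.0:
            return 0  # every known woman/boy on this ticket died
        return 1
    else:
        if is_boy(passenger):
            if rate is not None and rate == 1.0:
                return 1  # every known woman/boy on this ticket survived
            return 0
        return 0  # adult males default to not survived

train = [
    {"Name": "Doe, Mrs. Jane", "Sex": "female", "Age": 30, "Ticket": "A1", "Survived": 0},
    {"Name": "Doe, Master. Tom", "Sex": "male", "Age": 4, "Ticket": "A1", "Survived": 0},
    {"Name": "Roe, Miss. Ann", "Sex": "female", "Age": None, "Ticket": "B2", "Survived": 1},
]
rates = ticket_group_survival(train)
print(predict({"Name": "Roe, Master. Sam", "Sex": "male", "Age": None, "Ticket": "B2"}, rates))
```

The rows in `train` are made-up records, not Titanic data. Because the rules only test whether an age is present and below a threshold, a missing age simply falls through to the default branch.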
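The post does not say why a custom function is needed. One plausible reason, stated here as an assumption, is that any feature derived from training labels has to be recomputed inside each fold so that held-out rows never leak into it. The sketch below is a generic k-fold loop in plain Python. The names `k_fold_accuracy`, `fit_sex_rule`, and `predict_sex_rule` are hypothetical, and the sex-only baseline is only a stand-in for a real rule set.

```python
import random

def k_fold_accuracy(rows, fit, predict, k=5, seed=0):
    """Average held-out accuracy; `fit` sees only the training part of each fold."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    scores = []
    for held_out in folds:
        if not held_out:
            continue  # can happen when there are fewer rows than folds
        held_out_set = set(held_out)
        train = [rows[i] for i in indices if i not in held_out_set]
        state = fit(train)  # rebuild any label-derived features per fold
        correct = sum(predict(rows[i], state) == rows[i]["Survived"] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def fit_sex_rule(train_rows):
    """Stand-in 'training' step: a fixed rule needs no learned state."""
    return None

def predict_sex_rule(passenger, state):
    """Stand-in rule: predict survival for women only."""
    return 1 if passenger["Sex"] == "female" else 0

# Made-up rows in which the sex-only rule happens to be perfect
rows = [{"Sex": "female", "Survived": 1}, {"Sex": "male", "Survived": 0}] * 5
print(k_fold_accuracy(rows, fit_sex_rule, predict_sex_rule, k=5))
```

Passing `fit` and `predict` as arguments keeps the loop independent of any particular rule set, so the same function can score different versions of the rules.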

For the details of the approach, see my Jupyter notebooks:

Titanic_with_name_sex_age_and_ticket_features-0.82275.ipynb
Titanic_with_name_sex_age_and_ticket_features-0.83732.ipynb