- The programming language used in this project is Python 3.5.
- The packages used in this project are pandas, numpy, sklearn, scipy, matplotlib, and seaborn.

Input files are stored in the `rawdata` folder: `samplesubmission.csv`, `test.csv`, and `train.csv`.

To run this project, first perform data pre-processing and feature engineering:
- In the `dataprocessiing` folder, run `process.py` to pre-process the data. This produces two CSV files, `prefeatures_dropold.csv` and `test_feature.csv`, in the `dataprocessiing/processed data` folder.
- Run `feature ranking.py` in the `dataprocessiing/FeatureEngineering` folder. This produces a CSV file named `slctdfeature .csv` in the `dataprocessiing/processed data` folder. Because the output of `feature ranking.py` varies from run to run, I have uploaded the `slctdfeature .csv` file used by my prediction models to the `dataprocessiing/processed data` folder. To reproduce my final submission, please use that file directly.
- Run `KFold.py` to show the cross-validation results of several different regression models.
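
For readers who want a feel for what a cross-validation comparison of regression models looks like, here is a minimal sketch. It is not the contents of `KFold.py`: the choice of models, the R^2 scoring, the function name `compare_models`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score


def compare_models(X, y, n_splits=5, seed=0):
    """Return {model name: mean R^2 over the folds} for a few regressors.

    Hypothetical helper; the models below are illustrative choices only.
    """
    models = {
        "linear": LinearRegression(),
        "ridge": Ridge(alpha=1.0),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=seed),
    }
    # Shuffle once with a fixed seed so every model sees the same folds.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        for name, model in models.items()
    }


# Synthetic stand-in for the processed feature file (assumption).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

cv_scores = compare_models(X_demo, y_demo)
for name, score in sorted(cv_scores.items()):
    print("{}: mean R^2 = {:.3f}".format(name, score))
```

Fixing `random_state` in both the splitter and the random forest keeps the comparison repeatable between runs.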

Run `randomforest_selctdft.py` in the `submit1` folder to generate the resulting `test.csv` file in that folder.

Run `randomforest_selctdft_model2.py` in the `submit2` folder to generate the resulting `test.csv` file in that folder.

The resulting `test.csv` files that I submitted to Kaggle are stored in the `submit1` and `submit2` folders.
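
As a rough illustration of the final step, the sketch below trains a random forest on a chosen subset of feature columns and builds a prediction table. It does not reproduce `randomforest_selctdft.py` or `randomforest_selctdft_model2.py`: the function name `make_submission`, the column names (`id`, `target`, `f1` to `f3`), and the hyperparameters are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def make_submission(train_df, test_df, feature_cols, target_col, id_col, seed=0):
    """Fit a random forest on the selected features and predict the test rows.

    Hypothetical helper; all column names are supplied by the caller.
    """
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(train_df[feature_cols], train_df[target_col])
    predictions = model.predict(test_df[feature_cols])
    return pd.DataFrame(
        {id_col: test_df[id_col].values, target_col: predictions},
        columns=[id_col, target_col],
    )


# Synthetic stand-ins for the processed train and test feature files (assumption).
rng = np.random.RandomState(0)
train_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
train_demo["target"] = 2.0 * train_demo["f1"] - train_demo["f3"]
test_demo = pd.DataFrame(rng.normal(size=(10, 3)), columns=["f1", "f2", "f3"])
test_demo.insert(0, "id", range(10))

# Use only the "selected" features, here f1 and f3.
submission = make_submission(train_demo, test_demo, ["f1", "f3"], "target", "id")
print(submission.head())
# To save the result: submission.to_csv("test.csv", index=False)
```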