- This repository holds an attempt at the Santander Customer Transaction Prediction Kaggle challenge using a few dimension reduction techniques and three models: Logistic Regression, a Deep Learning neural network, and K-Nearest Neighbors.
Definition of Challenge
- The overarching question this challenge asks is "Can you identify who will make a transaction?"
- Given 200 anonymized features for 200,000 customers, along with a binary target indicating whether each customer made a transaction, use ML/DL to build a model that accurately predicts whether a given customer will make a transaction.
My Approach
- Because of the dataset's high dimensionality, I applied a few dimension reduction algorithms so that the models were fed clean, balanced, and appropriate data.
- Dimension Reduction Techniques
- Principal Component Analysis (PCA)
- Random Forest
- Variance Inflation Factor (VIF)
- Models
- Logistic Regression
- Deep Learning Neural Network
- K-Nearest Neighbors
Performance Achieved
- The highest accuracy was achieved by Logistic Regression at 91.2%, and the highest Kaggle score was achieved by Deep Learning at 0.629. The highest Kaggle score achieved by a competitor was 0.92573.
Data
- Type:
- 200 anonymized features representing customer behavior/history.
- Training Dataset
- 200,000 rows × 202 columns
- Includes target variable
- Testing Dataset
- 200,000 rows × 201 columns
- Omits target variable
- Size: 1.06 GB for both the training and testing datasets
- Train & Test Split after Dimension Reduction:
- Training Dataset
- 200,000 rows × 177 columns
- Testing Dataset
- 200,000 rows × 176 columns
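- For concreteness, a minimal loading sketch with pandas (the `data/` paths are an assumption about local layout, not part of this repository):

```python
import pandas as pd

# Load the competition CSVs (paths are hypothetical).
train = pd.read_csv("data/train.csv")  # 200,000 rows x 202 columns: ID_code, target, var_0..var_199
test = pd.read_csv("data/test.csv")    # 200,000 rows x 201 columns: no target column

print(train.shape, test.shape)
print(train["target"].value_counts(normalize=True))  # check class balance of the binary target
```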
Dimension Reduction
Principal Component Analysis (PCA)
- Due to the nature of the dataset explained in the Data Visualization section, I couldn't use these results.
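- A minimal scikit-learn sketch of this step, assuming the `train` DataFrame from the loading example above; the 95% variance threshold is illustrative, not the project's recorded setting:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Use only the anonymized feature columns, standardized first.
features = [c for c in train.columns if c.startswith("var_")]
X = StandardScaler().fit_transform(train[features])

# Keep enough components to explain 95% of the variance (illustrative threshold).
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.explained_variance_ratio_[:5])
```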
Random Forest
- Due to the nature of the dataset explained in the Data Visualization section, I couldn't use these results.
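- A sketch of ranking features by random forest importance, assuming `train` and `features` from the examples above; the hyperparameters are illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(train[features], train["target"])

# Rank features by impurity-based importance.
importances = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=False)
print(importances.head(10))
```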
Variance Inflation Factor (VIF)
- The features represented in this plot were omitted from model training/testing.
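- A sketch of VIF screening with statsmodels, assuming `train` and `features` from above; the threshold of 5 is a common rule of thumb, not necessarily the cutoff used here:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# One VIF per feature. Each VIF is a regression of that feature on all
# others, so this loop is slow on 200 columns x 200,000 rows; a row
# subsample would speed it up.
X = train[features].to_numpy()
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]

# Features above the threshold are candidates for removal.
high_vif = [f for f, v in zip(features, vifs) if v > 5]
print(f"{len(high_vif)} features exceed the VIF threshold")
```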
Data Visualization
- Dataset Descriptive Statistics
Distribution of pd.describe() statistics
- Training
- Testing
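- These summaries come straight from pandas; a minimal sketch, assuming `train` and `test` from the loading example:

```python
# Per-feature summary statistics for the training and testing sets.
train_stats = train[[c for c in train.columns if c.startswith("var_")]].describe()
test_stats = test[[c for c in test.columns if c.startswith("var_")]].describe()
print(train_stats)
print(test_stats)
```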
Distribution of Target Variable
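- A plot like this can be reproduced with pandas and matplotlib, assuming `train` from the loading example:

```python
import matplotlib.pyplot as plt

# Bar plot of the binary target's class counts.
train["target"].value_counts().plot(kind="bar")
plt.xlabel("target")
plt.ylabel("count")
plt.title("Distribution of Target Variable")
plt.show()
```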
Feature Correlation Heatmap
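- A sketch of such a heatmap, assuming `train` and `features` from the examples above:

```python
import matplotlib.pyplot as plt

# Pairwise Pearson correlations between the anonymized features.
corr = train[features].corr()
plt.figure(figsize=(10, 8))
plt.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.title("Feature Correlation Heatmap")
plt.show()
```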
Problem Formulation
- Define:
- Input / Output
- Models
- Describe the different models you tried and why.
- Loss, Optimizer, other Hyperparameters.
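- As a placeholder illustration of what such a definition might look like, here is a hypothetical Keras binary classifier over the reduced feature set; the architecture, loss, and optimizer are assumptions, not this project's recorded settings:

```python
from tensorflow import keras

def build_model(n_features):
    # Small fully connected network with a sigmoid output for binary classification.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy",
                  optimizer="adam",
                  metrics=[keras.metrics.AUC()])
    return model
```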
Training
- Describe the training:
- How you trained: software and hardware.
- How long did training take.
- Training curves (loss vs epoch for test/train).
- How did you decide to stop training.
- Any difficulties? How did you resolve them?
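- A hypothetical training sketch, assuming `build_model` from the previous block and `train`/`features` from the data examples; early stopping on validation loss is one common way to decide when to stop:

```python
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Hold out a validation split of the reduced training data.
X_tr, X_val, y_tr, y_val = train_test_split(
    train[features], train["target"],
    test_size=0.2, random_state=42, stratify=train["target"])

# Stop when validation loss hasn't improved for 5 epochs.
model = build_model(X_tr.shape[1])
stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
history = model.fit(X_tr, y_tr, validation_data=(X_val, y_val),
                    epochs=100, batch_size=1024, callbacks=[stop])
```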
Performance Comparison
- Clearly define the key performance metric(s).
- Show/compare results in one table.
- Show one (or few) visualization(s) of results, for example ROC curves.
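- For example, a validation ROC curve for the network above could be drawn with scikit-learn and matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Score the validation split with the trained network.
y_score = model.predict(X_val).ravel()
fpr, tpr, _ = roc_curve(y_val, y_score)

plt.plot(fpr, tpr, label=f"Deep Learning (AUC = {roc_auc_score(y_val, y_score):.3f})")
plt.plot([0, 1], [0, 1], "k--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```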
Conclusions
- State any conclusions you can infer from your work. Example: LSTMs work better than GRUs.
Future Work
- What would be the next thing that you would try.
- What are some other studies that can be done starting from here.
How to Reproduce Results
- In this section, provide instructions for at least one of the following:
- Reproduce your results fully, including training.
- Apply this package to other data. For example, how to use the model you trained.
- Use this package to perform a new study.
- Also describe what resources to use for this package, if appropriate. For example, point readers to Google Colab and TPUs.
Overview of Files in Repository
- Describe the directory structure, if any.
- List all relevant files and describe their role in the package.
- An example:
- utils.py: Various functions that are used in cleaning and visualizing data.
- preprocess.ipynb: Takes input data in CSV and writes out data frame after cleanup.
- visualization.ipynb: Creates various visualizations of the data.
- models.py: Contains functions that build the various models.
- training-model-1.ipynb: Trains the first model and saves model during training.
- training-model-2.ipynb: Trains the second model and saves model during training.
- training-model-3.ipynb: Trains the third model and saves model during training.
- performance.ipynb: Loads multiple trained models and compares results.
- inference.ipynb: Loads a trained model and applies it to test data to create a Kaggle submission.
- Note that all of these notebooks should contain enough text for someone to understand what is happening.
Software Setup
- List all of the required packages.
- If not standard, provide or point to instructions for installing the packages.
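- A plausible one-line install for the stack sketched in the examples above (the exact package set is an assumption, not a pinned environment):

```
pip install pandas numpy scikit-learn statsmodels tensorflow matplotlib
```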
- Describe how to install your package.
Data
- Point to where they can download the data.
- Lead them through preprocessing steps, if necessary.
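- The dataset is available from the Kaggle competition page (https://www.kaggle.com/c/santander-customer-transaction-prediction). With the official Kaggle CLI installed and authenticated, one way to fetch it is:

```
kaggle competitions download -c santander-customer-transaction-prediction
```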
Training
- Describe how to train the model.
Performance Evaluation
- Describe how to run the performance evaluation.
Citations
- Provide any references.