- This repository holds an attempt at the Santander Customer Transaction Prediction Kaggle challenge using a few dimension reduction techniques and three models: Logistic Regression, a Deep Learning neural network, and K-Nearest Neighbors.
Definition of Challenge
- The overarching question this challenge asks is "Can you identify who will make a transaction?"
- Given 200 anonymized features for 200,000 customers, along with a binary target indicating whether each customer made a transaction, use ML/DL to build a model that accurately predicts whether a given customer will make a transaction.
My Approach
- Because of the dataset's high dimensionality, I applied a few dimension reduction algorithms so that the models were fed clean, balanced, and appropriate data.
- Dimension Reduction Techniques
- Principal Component Analysis (PCA)
- Random Forest
- Variance Inflation Factor (VIF)
- Models
- Logistic Regression
- Deep Learning Neural Network
- K-Nearest Neighbors
Performance Achieved
- The highest accuracy was achieved by Logistic Regression at 91.2%, and the highest Kaggle score was achieved by Deep Learning at 0.629. The highest Kaggle score achieved by a competitor was 0.92573.
Data
- Type:
- 200 anonymized features representing customer behavior/history.
- Training Dataset
- 200,000 rows × 202 columns
- Includes target variable
- Testing Dataset
- 200,000 rows × 201 columns
- Omits target variable
- Size: 1.06 GB for both the training and testing datasets
- Train & Test Split after Dimension Reduction:
- Training Dataset
- 200,000 rows × 177 columns
- Testing Dataset
- 200,000 rows × 176 columns
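- For concreteness, a minimal loading sketch with pandas (the `data/` paths are an assumption about local layout, not part of this repository):

```python
import pandas as pd

# Load the competition CSVs (paths are hypothetical).
train = pd.read_csv("data/train.csv")  # 200,000 rows x 202 columns: ID_code, target, var_0..var_199
test = pd.read_csv("data/test.csv")    # 200,000 rows x 201 columns: no target column

print(train.shape, test.shape)
print(train["target"].value_counts(normalize=True))  # check class balance of the binary target
```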
Dimension Reduction
Principal Component Analysis (PCA)
- Due to the nature of the dataset explained in the Data Visualization section, I couldn't use these results.
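- A minimal scikit-learn sketch of this step, assuming the `train` DataFrame from the loading example above; the 95% variance threshold is illustrative, not the project's recorded setting:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Use only the anonymized feature columns, standardized first.
features = [c for c in train.columns if c.startswith("var_")]
X = StandardScaler().fit_transform(train[features])

# Keep enough components to explain 95% of the variance (illustrative threshold).
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.explained_variance_ratio_[:5])
```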
Random Forest
- Due to the nature of the dataset explained in the Data Visualization section, I couldn't use these results.
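- A sketch of ranking features by random forest importance, assuming `train` and `features` from the examples above; the hyperparameters are illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(train[features], train["target"])

# Rank features by impurity-based importance.
importances = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=False)
print(importances.head(10))
```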
Variance Inflation Factor (VIF)
- The features represented in this plot were omitted from model training/testing.
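- A sketch of VIF screening with statsmodels, assuming `train` and `features` from above; the threshold of 5 is a common rule of thumb, not necessarily the cutoff used here:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# One VIF per feature. Each VIF is a regression of that feature on all
# others, so this loop is slow on 200 columns x 200,000 rows; a row
# subsample would speed it up.
X = train[features].to_numpy()
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]

# Features above the threshold are candidates for removal.
high_vif = [f for f, v in zip(features, vifs) if v > 5]
print(f"{len(high_vif)} features exceed the VIF threshold")
```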
Data Visualization
- Dataset Descriptive Statistics
Distribution of pd.describe() statistics
- Training
- Testing
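- These summaries come straight from pandas; a minimal sketch, assuming `train` and `test` from the loading example:

```python
# Per-feature summary statistics for the training and testing sets.
train_stats = train[[c for c in train.columns if c.startswith("var_")]].describe()
test_stats = test[[c for c in test.columns if c.startswith("var_")]].describe()
print(train_stats)
print(test_stats)
```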
Distribution of Target Variable
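- A plot like this can be reproduced with pandas and matplotlib, assuming `train` from the loading example:

```python
import matplotlib.pyplot as plt

# Bar plot of the binary target's class counts.
train["target"].value_counts().plot(kind="bar")
plt.xlabel("target")
plt.ylabel("count")
plt.title("Distribution of Target Variable")
plt.show()
```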
Feature Correlation Heatmap
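- A sketch of such a heatmap, assuming `train` and `features` from the examples above:

```python
import matplotlib.pyplot as plt

# Pairwise Pearson correlations between the anonymized features.
corr = train[features].corr()
plt.figure(figsize=(10, 8))
plt.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.title("Feature Correlation Heatmap")
plt.show()
```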
Problem Formulation
- Define:
- Input / Output
- Models
- Describe the different models you tried and why.
- Loss, Optimizer, other Hyperparameters.
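- As a placeholder illustration of what such a definition might look like, here is a hypothetical Keras binary classifier over the reduced feature set; the architecture, loss, and optimizer are assumptions, not this project's recorded settings:

```python
from tensorflow import keras

def build_model(n_features):
    # Small fully connected network with a sigmoid output for binary classification.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy",
                  optimizer="adam",
                  metrics=[keras.metrics.AUC()])
    return model
```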
Training
- Describe the training:
- How you trained: software and hardware.
- How long did training take.
- Training curves (loss vs epoch for test/train).
- How did you decide to stop training.
- Any difficulties? How did you resolve them?
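- A hypothetical training sketch, assuming `build_model` from the previous block and `train`/`features` from the data examples; early stopping on validation loss is one common way to decide when to stop:

```python
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Hold out a validation split of the reduced training data.
X_tr, X_val, y_tr, y_val = train_test_split(
    train[features], train["target"],
    test_size=0.2, random_state=42, stratify=train["target"])

# Stop when validation loss hasn't improved for 5 epochs.
model = build_model(X_tr.shape[1])
stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
history = model.fit(X_tr, y_tr, validation_data=(X_val, y_val),
                    epochs=100, batch_size=1024, callbacks=[stop])
```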
Performance Comparison
- Clearly define the key performance metric(s).
- Show/compare results in one table.
- Show one (or few) visualization(s) of results, for example ROC curves.
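- For example, a validation ROC curve for the network above could be drawn with scikit-learn and matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Score the validation split with the trained network.
y_score = model.predict(X_val).ravel()
fpr, tpr, _ = roc_curve(y_val, y_score)

plt.plot(fpr, tpr, label=f"Deep Learning (AUC = {roc_auc_score(y_val, y_score):.3f})")
plt.plot([0, 1], [0, 1], "k--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```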
Conclusions
- State any conclusions you can infer from your work. Example: LSTMs work better than GRUs.
Future Work
- What would be the next thing that you would try.
- What are some other studies that can be done starting from here.
How to Reproduce Results
- In this section, provide instructions for at least one of the following:
- Reproduce your results fully, including training.
- Apply this package to other data. For example, how to use the model you trained.
- Use this package to perform a new study.
- Also describe what resources to use for this package, if appropriate. For example, point readers to Google Colab and TPUs.
Overview of Files in Repository
- Describe the directory structure, if any.
- List all relevant files and describe their role in the package.
- An example:
- utils.py: Various functions that are used in cleaning and visualizing data.
- preprocess.ipynb: Takes input data in CSV and writes out data frame after cleanup.
- visualization.ipynb: Creates various visualizations of the data.
- models.py: Contains functions that build the various models.
- training-model-1.ipynb: Trains the first model and saves model during training.
- training-model-2.ipynb: Trains the second model and saves model during training.
- training-model-3.ipynb: Trains the third model and saves model during training.
- performance.ipynb: Loads multiple trained models and compares results.
- inference.ipynb: Loads a trained model and applies it to test data to create a Kaggle submission.
- Note that all of these notebooks should contain enough text for someone to understand what is happening.
Software Setup
- List all of the required packages.
- If not standard, provide or point to instructions for installing the packages.
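- A plausible one-line install for the stack sketched in the examples above (the exact package set is an assumption, not a pinned environment):

```
pip install pandas numpy scikit-learn statsmodels tensorflow matplotlib
```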
- Describe how to install your package.
Data
- Point to where they can download the data.
- Lead them through preprocessing steps, if necessary.
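- The dataset is available from the Kaggle competition page (https://www.kaggle.com/c/santander-customer-transaction-prediction). With the official Kaggle CLI installed and authenticated, one way to fetch it is:

```
kaggle competitions download -c santander-customer-transaction-prediction
```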
Training
- Describe how to train the model.
Performance Evaluation
- Describe how to run the performance evaluation.
Citations
- Provide any references.