diff --git a/docs/_quarto.yml b/docs/_quarto.yml
index 471b430..61a1a19 100644
--- a/docs/_quarto.yml
+++ b/docs/_quarto.yml
@@ -268,6 +268,9 @@ website:
         - section:
           href: notes/predictive-modeling/ml-foundations/data-encoding.qmd
           text: "Data Encoding"
+        - section:
+          href: notes/applied-stats/data-scaling.qmd
+          text: "Data Scaling"
         - "--------------"
         - section:
diff --git a/docs/images/features-labels.png b/docs/images/features-labels.png
new file mode 100644
index 0000000..64bcca2
Binary files /dev/null and b/docs/images/features-labels.png differ
diff --git a/docs/images/ml-vs-software-bw.png b/docs/images/ml-vs-software-bw.png
new file mode 100644
index 0000000..91c1e74
Binary files /dev/null and b/docs/images/ml-vs-software-bw.png differ
diff --git a/docs/images/ml-vs-software.png b/docs/images/ml-vs-software.png
new file mode 100644
index 0000000..a298082
Binary files /dev/null and b/docs/images/ml-vs-software.png differ
diff --git a/docs/notes/applied-stats/data-scaling.qmd b/docs/notes/applied-stats/data-scaling.qmd
index be04899..1e9324b 100644
--- a/docs/notes/applied-stats/data-scaling.qmd
+++ b/docs/notes/applied-stats/data-scaling.qmd
@@ -50,7 +50,7 @@ Scaling the data will make it easier to plot all these different series on a gra

 ## Min-Max Scaling

-One scaling approach is by dividing each value over the maximum value in that column, essentially expressing each value as a percentage of the greatest value.
+One scaling approach, called **min-max scaling**, divides each value by the maximum value in that column, essentially expressing each value as a percentage of the greatest value.

 ```{python}
 scaled_df = df.copy()
@@ -70,7 +70,7 @@ When we use min-max scaling, resulting values will be expressed on a scale betwe

 ## Standard Scaling

-An alternative, more rigorous, scaling approach mean-centers the data and normalizes by the standard deviation:
+An alternative, more rigorous approach, called **standard scaling** (or z-score normalization), mean-centers the data and normalizes by the standard deviation:

 ```{python}
 scaled_df = df.copy()
@@ -93,6 +93,6 @@ Now that we have scaled the data, we can more easily compare the movements of al

 ## Importance for Machine Learning

-Data scaling is important in machine learning because many algorithms are sensitive to the range of the input data. Algorithms including gradient descent-based methods (e.g. neural networks, logistic regression) and distance-based models (e.g. k-nearest neighbors, support vector machines) perform better when features are on a similar scale.
+Data scaling is relevant in machine learning when a model uses multiple input features that have different scales. Many algorithms are sensitive to the range of the input data and perform better when all features are on a similar scale. If features are not scaled, those with larger ranges may disproportionately influence the model, leading to biased predictions or slower convergence during training.

-If features are not scaled, those with larger ranges may disproportionately influence the model, leading to biased predictions or slower convergence during training. By scaling data—whether through techniques like min-max scaling or z-score normalization, we ensure that each feature contributes equally, improving model performance, training efficiency, and the accuracy of predictions.
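+
+For example, scikit-learn provides off-the-shelf scalers for both techniques. Here is a minimal sketch (note that scikit-learn's `MinMaxScaler` subtracts the column minimum before dividing by the range, a slight variation on the divide-by-max approach above):
+
+```{python}
+import pandas as pd
+from sklearn.preprocessing import MinMaxScaler, StandardScaler
+
+# a hypothetical dataset with features on very different scales:
+features_df = pd.DataFrame({
+    "income": [40_000, 85_000, 120_000],
+    "age": [25, 40, 60],
+})
+
+# min-max scaling: map each column onto a 0-1 scale
+minmax_scaled = pd.DataFrame(
+    MinMaxScaler().fit_transform(features_df),
+    columns=features_df.columns
+)
+
+# standard scaling: mean-center each column and express values
+# in units of standard deviation (z-scores)
+standard_scaled = pd.DataFrame(
+    StandardScaler().fit_transform(features_df),
+    columns=features_df.columns
+)
+print(standard_scaled)
+```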
+By scaling data using techniques like min-max scaling or standard scaling, we ensure that each feature contributes equally, improving model performance, training efficiency, and the accuracy of predictions.
diff --git a/docs/notes/predictive-modeling/classification/index.qmd b/docs/notes/predictive-modeling/classification/index.qmd
index 4e8066d..d1ed891 100644
--- a/docs/notes/predictive-modeling/classification/index.qmd
+++ b/docs/notes/predictive-modeling/classification/index.qmd
@@ -1 +1,26 @@
 # Classification
+
+## Classification Objectives
+
+
+
+## Classification Models
+
+Classification Models:
+
+ + Logistic Regression (yes, this is a classification model, not a regression model)
+ + Decision Tree
+ + Random Forest
+ + etc.
+
+## Classification Metrics
+
+
+Classification Metrics:
+
+ + Accuracy
+ + Precision
+ + Recall
+ + F1 Score
+ + ROC AUC
+ + etc.
diff --git a/docs/notes/predictive-modeling/dimensionality-reduction/index.qmd b/docs/notes/predictive-modeling/dimensionality-reduction/index.qmd
new file mode 100644
index 0000000..0e9a691
--- /dev/null
+++ b/docs/notes/predictive-modeling/dimensionality-reduction/index.qmd
@@ -0,0 +1,11 @@
+# Dimensionality Reduction
+
+
+## Dimensionality Reduction Models
+
+Dimensionality Reduction Models:
+
+ + Principal Component Analysis (PCA)
+ + t-SNE
+ + UMAP
+ + etc.
diff --git a/docs/notes/predictive-modeling/index.qmd b/docs/notes/predictive-modeling/index.qmd
index 28f60ec..d7847ca 100644
--- a/docs/notes/predictive-modeling/index.qmd
+++ b/docs/notes/predictive-modeling/index.qmd
@@ -1 +1,60 @@
 # Predictive Modeling in Python (for Finance)
+
+## What is Predictive Modeling?
+
+**Predictive modeling** refers to the use of statistical techniques and machine learning algorithms to predict future outcomes based on historical data. It involves creating models that learn patterns from past observations and use them to forecast future trends or behavior. At its core, predictive modeling is about understanding the relationships between variables and using these relationships to make informed predictions.
+
+
+## Predictive Modeling Process
+
+The process of predictive modeling can generally be broken down into several steps:
+
+ 1. **Data Collection and Preparation**: Gather historical data and prepare it for analysis. This involves cleaning the data, handling missing values, and transforming features for better interpretability and accuracy.
+
+ 2. **Model Selection**: Choose the right algorithm for the problem, whether it's a regression model, a classification algorithm, or a time-series forecasting model.
+
+ 3. **Model Training**: Fit the model to the data by using training datasets to find patterns and relationships.
+
+ 4. **Model Evaluation**: Validate the model to ensure that it generalizes well to new data. This typically involves splitting the data into training and testing sets or using cross-validation techniques.
+
+ 5. **Prediction and Forecasting**: Once validated, the model can be used to predict outcomes on new or unseen data, providing valuable foresight for decision-making.
+
+In predictive modeling, the quality of the predictions is closely linked to the quality of the input data, the robustness of the algorithms, and the appropriate handling of uncertainties inherent in the real world.
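+
+To make these steps concrete, here is a minimal end-to-end sketch using scikit-learn, with synthetic data standing in for a real historical dataset:
+
+```{python}
+from sklearn.datasets import make_regression
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import r2_score
+
+# step 1: in practice, gather and clean historical data;
+# here, a synthetic dataset stands in for prepared data:
+X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=99)
+
+# hold out a test set, so we can later evaluate generalization:
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)
+
+# steps 2 and 3: select a model and fit it to the training data:
+model = LinearRegression()
+model.fit(X_train, y_train)
+
+# step 4: evaluate the model on data it has not yet seen:
+print("R^2:", r2_score(y_test, model.predict(X_test)))
+
+# step 5: use the trained model to predict outcomes for new data:
+new_predictions = model.predict(X_test[:5])
+```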
+
+## Relevance of Predictive Modeling in Finance
+
+Predictive modeling has become increasingly relevant in the financial industry due to its ability to analyze vast amounts of data and provide forecasts that support strategic decision-making. In finance, the ability to make accurate predictions can provide a significant competitive edge. Whether it's forecasting stock prices, assessing credit risk, or predicting customer behavior, predictive models have transformed how financial professionals make decisions.
+
+Here are some key areas where predictive modeling is applied in finance:
+
+ + **Risk Management**: Predictive models are used to assess the likelihood of defaults in loans or investments. By analyzing historical data on borrowers, lenders can develop models that predict the probability of default, helping them manage credit risk effectively.
+
+ + **Market Forecasting**: Traders and analysts use predictive models to forecast stock prices, commodity trends, or exchange rates. These models incorporate historical price movements, trading volume, and external factors to provide insights into future market behavior.
+
+ + **Fraud Detection**: Predictive analytics plays a crucial role in identifying fraudulent transactions. By examining patterns in transaction data, algorithms can detect anomalies that suggest fraudulent behavior, enabling financial institutions to respond quickly.
+
+ + **Customer Behavior Prediction**: Banks and other financial institutions use predictive models to anticipate customer needs, identify upsell opportunities, or predict churn. This helps tailor services and products to individual customers, improving retention and profitability.
+
+ + **Portfolio Management**: In portfolio optimization, predictive models forecast asset returns and risks. These models assist investors in making data-driven decisions, optimizing portfolios to meet their risk preferences and return objectives.
+
+## Why Python for Predictive Modeling in Finance?
+
+Python has emerged as a dominant language in the finance industry for predictive modeling due to its simplicity, flexibility, and vast ecosystem of libraries. Here's why Python is particularly suited for finance:
+
+ + **Ease of Use**: Python's simple syntax makes it easy for both beginners and experienced developers to build complex models without getting bogged down by the intricacies of coding.
+
+ + **Rich Libraries**: Python has a comprehensive set of libraries for data analysis and machine learning, such as `pandas`, `numpy`, `scikit-learn`, `statsmodels`, and `tensorflow`. These libraries provide pre-built functions that simplify the implementation of predictive models.
+
+ + **Visualization Capabilities**: With libraries like `matplotlib`, `seaborn`, and `plotly`, Python excels in visualizing financial data, helping professionals interpret and present their findings effectively.
+
+ + **Community Support**: Python has a large, active community that contributes to continuous development and troubleshooting, making it a great choice for building and maintaining predictive models in a fast-moving industry.
+
+ + **Integration with Financial Systems**: Python seamlessly integrates with databases, APIs, and financial platforms, enabling real-time data analysis and model deployment in production environments.
+
+In this course, we'll explore how to leverage Python's capabilities to build predictive models tailored to various financial applications, from data preparation through model validation and prediction.
+
+## Summary
+
+Predictive modeling is a powerful tool for making data-driven decisions, and its importance in finance cannot be overstated. With growing amounts of data and advances in machine learning, the ability to make accurate financial predictions is now more accessible than ever. In the following chapters, we will dive deep into the practical aspects of building these models using Python, from basic concepts to advanced techniques, helping you develop the skills to apply predictive analytics in real-world finance.
diff --git a/docs/notes/predictive-modeling/ml-foundations/index.qmd b/docs/notes/predictive-modeling/ml-foundations/index.qmd
index 5b1d970..0747b7f 100644
--- a/docs/notes/predictive-modeling/ml-foundations/index.qmd
+++ b/docs/notes/predictive-modeling/ml-foundations/index.qmd
@@ -1,153 +1,69 @@
 # Machine Learning Foundations

-its about predicting something, x/y (target and features), supervised vs unsupervised (ground truth labels / test set or not), regression vs classification
+**Machine learning** is a subfield of artificial intelligence that enables systems to automatically learn and improve from experience, without being explicitly programmed. It involves the development of algorithms and statistical models to identify patterns, trends, and relationships in a given dataset, and make predictions or decisions based on that data.
+
+In traditional software development, humans explicitly write the rules or instructions for the computer to follow, to arrive at the desired result or output. But machine learning flips this paradigm. In machine learning, the computer infers or "learns" the rules by examining patterns in the data.
+
+![Machine learning vs software development. [Source](https://www.avenga.com/wp-content/uploads/2021/12/image4-1.png).](../../../images/ml-vs-software.png)
+
+This shift enables machine learning to handle far more complex and nuanced tasks than traditional programming, especially when patterns in the data are subtle or too complicated to capture with simple rules.

-## Predictive Modeling Process
+## Machine Learning Concepts

-Define the Problem:
+A machine learning **model** is a mathematical representation that captures patterns or relationships in the data.

- + What kind of task is this (i.e. regression vs classification)?
- + What is the target output variable (`y`) we want to predict?
- + What are the input features (`x`) we can use to make the prediction?
- + What kind of model(s) should we use to do the predicting?
- + What scoring metrics should we use?
+A model learns these patterns through a process called **training**, which in some cases involves a closed-form calculation, and in other cases an iterative optimization process.

-Prepare the Data:
- + Checking for Nulls
- + Checking for Outliers
- + Examining Relationships
- + Data Scaling
- + Data Encoding
- + Data Splitting
+The model is trained on input data known as **features**, which are the variables or attributes the model uses to make predictions. These features could be anything from numerical values, like a company's annual revenue, to more abstract representations, such as text embeddings or image pixels.
+
+When available, the corresponding output or target variable, called the **label**, serves as the outcome the model is trying to predict.
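+
+Concretely, the features are often collected into a matrix `X` and the labels into a vector `y`. A minimal sketch, using hypothetical loan records (foreshadowing the loan default example below):
+
+```{python}
+import pandas as pd
+
+# hypothetical loan records:
+loans_df = pd.DataFrame({
+    "credit_score": [720, 650, 580],
+    "income": [85_000, 52_000, 31_000],
+    "defaulted": [0, 0, 1],  # 1 means the applicant defaulted
+})
+
+X = loans_df[["credit_score", "income"]]  # features the model learns from
+y = loans_df["defaulted"]                 # label the model tries to predict
+```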
-Train and Evaluate the Model(s): - + Train the model on the training dataset (x and y), so it knows the "right answers" - + Evaluate the model on the testing dataset, which contains data it hasn't yet "seen" +![Features and labels. Source: [Google ML](https://developers.google.com/machine-learning/intro-to-ml/supervised).](../../../images/features-labels.png) -Use Trained Model for Predictions and Forecasting +For example, in a loan default prediction scenario, the features might include an applicant's credit score and income, while the label would indicate whether the applicant defaulted on the loan. +## Types of Machine Learning Approaches +Machine learning can be broadly divided into three categories: supervised learning, unsupervised learning, and reinforcement learning. Each approach has its own set of techniques and applications. +### Supervised Learning +In **supervised learning**, the model is trained on a dataset where both the features and the corresponding labels are known. The system learns to map input features to the correct output labels, allowing it to make predictions or classifications on new data. +Example supervised learning tasks include **regression**, where the variable we are trying to predict is continuous; and **classification**, where the variable we are trying to predict is categorical or discrete. +### Unsupervised Learning +In contrast, **unsupervised learning** deals with data that lacks labeled outcomes. The model is tasked with finding patterns or groupings in the data without any explicit guidance. While supervised learning focuses on predicting specific outcomes, unsupervised learning seeks to uncover hidden structures or relationships within the data. +Example unsupervised learning tasks include **clustering**, where the model tries to arrange similar datapoints into groups; and **dimensionality reduction**, where the model reduces the number of features in a dataset while retaining important information. +### Reinforcement Learning -## Types of Machine Learning Tasks +**Reinforcement learning** is a different type of machine learning approach, where an agent learns to make decisions by interacting with an environment. The agent takes actions and receives feedback in the form of rewards or penalties, adjusting its strategy to maximize the cumulative reward over time. -**Supervised Learning**: when the data contains the "right answers" (a.k.a. "labels", or "target" prediction values). We share some of the right answers with the model, to help it learn what the desired output value is for a given set of inputs. -Supervised Tasks: Regression, Classification, etc. +## Machine Learning Problem Formulation -**Unsupervised Learning**: when the data does not contain the "right answers" (i.e. lack of target prediction values). In these situations it is the model's responsibility to identify patterns in the given set of inputs. +Machine learning problem formulation refers to the process of clearly defining the task that a machine learning model is meant to solve. This step is crucial in guiding the development of the model and ensuring that the right data, techniques, and metrics are applied to achieve the desired outcome. Problem formulation involves several key components: -Unsupervised Tasks: Clustering, Dimensionality Reduction, etc. + + Defining the Objective: Identifying the specific problem to solve, such as predicting future stock prices, classifying emails as spam or not, or detecting fraudulent transactions. 
This is the first step in understanding what the model should accomplish. -Reinforcement Learning: + + Choosing the Type of Problem: Determining whether the problem is one of classification (e.g., categorizing emails), regression (e.g., predicting continuous values like housing prices), clustering (e.g., grouping similar customers), or a decision-making task (e.g., optimizing a trading strategy). + + Identifying Features and Labels: Specifying the input variables (features) that the model will use to make predictions and, in the case of supervised learning, the corresponding output or target variable (label) that the model should predict. + + Data Availability and Quality: Assessing what data is available, its format, and whether it’s sufficient for training a model. Good data is key to a successful formulation, as noisy or incomplete data can lead to poor model performance. + + Evaluation Metrics: Establishing how the model’s success will be measured. This could involve metrics like accuracy, precision, recall for classification problems, or mean squared error for regression problems. -## Supervised Learning Tasks - -**Regression**: when the target variable we wish to predict is continuous - usually numeric. - -Examples: - - + House Prices (in dollars) - + Life Expectancy (in years) - + Employee Salary (in dollars) - + Distance to the Nearest Galaxy (in light years) - -**Classification**: when the target variable we wish to predict is discrete - usually binary or categorical. - -Examples: - - + Spam or Not (binary) - + Success or Failure (binary) - + Handwritten numeric digits (categorical) - + 1-5 star rating scale (categorical???? ) - -## Unsupervised Learning Tasks - -**Dimensionality Reduction**: ___________________ - - -## Model Selection - -Regression Models: - - + Linear Regression - + Ridge Regression - + Lasso Regression - + etc. - -Classification Models: - - + Logistic Regression (yes, this is a classification, not a regression model) - + Decision Tree - + Random Forest - + etc. - -Dimensionality Reduction Models: - - + Principal Component Analysis (PCA) - + T-SNE - + UMAP - + etc. - -## Metric Selection - -Regression Metrics: - - + R^2 Score - + Mean Squared Error - + Mean Absolute Error - + Root Mean Square Error - + etc. - -Classification Metrics: - - + Accuracy - + Precision - + Recall - + F-1 Score - + ROC AUC - + etc. - - -## Data Preprocessing - -Checking for Nulls: When we explore the data, we should pay attention to whether or not there are missing or null values. We might need to either drop rows with null values, or "impute" (a.k.a. fill-in) the null values. For example, we might choose to fill in some missing values using the mean or median of all other values in that column. - -Checking for Outliers: We should also pay attention to whether or not there are any significant outliers, and consider dropping rows that contain these outliers, if it will help improve the performance of our model. - - -Examining Existing Relationships: We might use statistical techniques to examine the relationships between individual variables. This might help us select or exclude certain features as appropriate. If one column has a high correlation with the target column, perhaps we should select it as a feature. However, if the target column was directly derived from other columns, those columns should not be used as features. Also, if multiple feature columns are highly correlated with each other (collinearity), we could consider dropping the redundant ones. 
-
-Scaling Numeric Variables: Pay attention to the range of values for numeric variables. Some models may be more sensitive to the distance between the values, in which case we might choose to scale them into a new domain, for example between 0 and 1.
-
-Encoding Categorical Variables: If we have categorical features, we may need to convert the category values to numeric space. For example, we might use "one-hot encoding" to create a matrix of 0/1 binary values for each word in a sentence, to represent the contents of the sentence in a way the model can understand.
-
-Engineering New Features: Based on the problem definition and characteristics of the available features, it may sometimes be advantageous to create new features.
-
-## Splitting
-
-Generally we aim to split the original raw dataset into two different subsets: "train" and "test". We train the model on the training data ONLY. We use most of the data (~80% of rows) for training, and the remaining (~20%) for test.
-
-Sometimes models can be too well fit to the training data and don't generalize well enough on unseen data. This is why we reserve the test dataset for evaluating the model's performance on data it has not yet seen. A more advanced version of this technique, called "Cross Validation", essentially uses many different combinations of test datasets to prevent overfitting.
-
-We'll want to split our datasets using random sampling, to prevent training issues that may arise from similarities and relationships in the underlying data. Sometimes we will use a specific kind of sampling called stratification, which retains the same proportion of target class values. Stratification may be applicable for classification tasks.
+Proper problem formulation ensures that the right machine learning approach is chosen and that the model development process is aligned with the business or research objectives.
diff --git a/docs/notes/predictive-modeling/ml-foundations/process.qmd b/docs/notes/predictive-modeling/ml-foundations/process.qmd
new file mode 100644
index 0000000..cac1fbc
--- /dev/null
+++ b/docs/notes/predictive-modeling/ml-foundations/process.qmd
@@ -0,0 +1,31 @@
+
+
+
+
+# Predictive Modeling Process
+
+Define the Problem:
+
+ + What is the target output variable (`y`) we want to predict? Do we have the ground truth labels or not?
+ + What are the input features (`x`) we can use to make the prediction?
+ + What kind of model(s) should we use to do the predicting?
+ + What scoring metrics should we use?
+
+Prepare the Data:
+
+ + Checking for Nulls
+ + Checking for Outliers
+ + Examining Relationships
+ + Data Scaling
+ + Data Encoding
+ + Data Splitting
+
+
+Train and Evaluate the Model(s):
+
+ + Choose an appropriate model.
+ + Train the model on the training dataset (x and y), so it knows the "right answers".
+ + Evaluate the model on the testing dataset, which contains data it hasn't yet "seen".
+
+Prediction and Forecasting:
+
+ + Use the model to make predictions on real or unseen data.
diff --git a/docs/notes/predictive-modeling/regression/index.qmd b/docs/notes/predictive-modeling/regression/index.qmd
index 6a71fa9..40fea45 100644
--- a/docs/notes/predictive-modeling/regression/index.qmd
+++ b/docs/notes/predictive-modeling/regression/index.qmd
@@ -1 +1,26 @@
 # Regression
+
+## Regression Objectives
+
+
+
+## Regression Models
+
+Regression Models:
+
+ + Linear Regression
+ + Ridge Regression
+ + Lasso Regression
+ + etc.
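+
+These models all share scikit-learn's common training interface. A minimal sketch, fitting each one to synthetic data (the alpha values are illustrative, not recommendations):
+
+```{python}
+from sklearn.datasets import make_regression
+from sklearn.linear_model import LinearRegression, Ridge, Lasso
+
+# synthetic data stands in for a real prepared dataset:
+X, y = make_regression(n_samples=100, n_features=5, noise=5, random_state=99)
+
+for model in [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]:
+    model.fit(X, y)  # train on features (X) and target (y)
+    print(type(model).__name__, round(model.score(X, y), 3))  # R^2 on training data
+```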
+
+
+## Regression Metrics
+
+
+Regression Metrics:
+
+ + R^2 Score
+ + Mean Squared Error
+ + Mean Absolute Error
+ + Root Mean Square Error
+ + etc.
diff --git a/docs/notes/predictive-modeling/supervised-learning.qmd b/docs/notes/predictive-modeling/supervised-learning.qmd
index 3468ab8..163fe16 100644
--- a/docs/notes/predictive-modeling/supervised-learning.qmd
+++ b/docs/notes/predictive-modeling/supervised-learning.qmd
@@ -1 +1,28 @@
 # Supervised Learning
+
+
+
+## Supervised Learning Tasks
+
+**Regression**: used when the target variable we wish to predict is continuous - usually numeric (e.g., predicting house prices or stock market returns).
+
+Examples:
+
+ + House Prices (in dollars)
+ + Life Expectancy (in years)
+ + Employee Salary (in dollars)
+ + Distance to the Nearest Galaxy (in light years)
+
+
+**Classification**: used when the target variable we wish to predict is discrete - usually binary or categorical (e.g., determining whether a transaction is fraudulent or not).
+
+Examples:
+
+ + Spam or Not (binary)
+ + Success or Failure (binary)
+ + Handwritten numeric digits (categorical)
+ + 1-5 star rating scale (categorical, though arguably ordinal)
diff --git a/docs/notes/predictive-modeling/unsupervised-learning.qmd b/docs/notes/predictive-modeling/unsupervised-learning.qmd
index 7c578f4..5d4c94f 100644
--- a/docs/notes/predictive-modeling/unsupervised-learning.qmd
+++ b/docs/notes/predictive-modeling/unsupervised-learning.qmd
@@ -1 +1,11 @@
 # Unsupervised Learning
+
+
+
+
+## Unsupervised Learning Tasks
+
+
+ + **Clustering**: Groups similar data points together (e.g. customer segmentation). Algorithms like K-means, hierarchical clustering, and DBSCAN are commonly used.
+
+ + **Dimensionality Reduction**: Reduces the number of features in a dataset while retaining important information, making the data easier to visualize or process. Principal Component Analysis (PCA) and t-SNE are common techniques.
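+
+A minimal sketch of both tasks, using scikit-learn with synthetic data:
+
+```{python}
+from sklearn.datasets import make_blobs
+from sklearn.cluster import KMeans
+from sklearn.decomposition import PCA
+
+# synthetic data: 150 points with four features, drawn from three blobs:
+X, _ = make_blobs(n_samples=150, n_features=4, centers=3, random_state=99)
+
+# clustering: assign each datapoint to one of three groups
+clusters = KMeans(n_clusters=3, n_init=10, random_state=99).fit_predict(X)
+
+# dimensionality reduction: compress four features down to two components
+X_2d = PCA(n_components=2).fit_transform(X)
+print(X_2d.shape)  # (150, 2)
+```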