From 413df201d4a09728ffd01084d853f2d82144375e Mon Sep 17 00:00:00 2001 From: MJ Rossetti Date: Sun, 22 Sep 2024 16:13:03 -0400 Subject: [PATCH] Encoding code examples --- .../ml-foundations/data-encoding.qmd | 123 ++++++++++++++++-- 1 file changed, 109 insertions(+), 14 deletions(-) diff --git a/docs/notes/predictive-modeling/ml-foundations/data-encoding.qmd b/docs/notes/predictive-modeling/ml-foundations/data-encoding.qmd index fddd24e..59df51f 100644 --- a/docs/notes/predictive-modeling/ml-foundations/data-encoding.qmd +++ b/docs/notes/predictive-modeling/ml-foundations/data-encoding.qmd @@ -8,11 +8,11 @@ So if we have categorical or textual data, we need to use a **data encoding** st We'll choose an encoding strategy based on the nature of the data. For categorical data, we'll use either ordinal encoding if there's an inherent order among the categories, or one-hot encoding if no such order exists. For time-series data, we can apply time step encoding to represent the temporal sequence of observations. -## Encoding Approaches +## For Categorical Data -### Ordinal Encoding for Categorical Data +### Ordinal Encoding -When the data has a natural order (i.e. where one category is "greater" or "less" than another), we use **ordinal encoding**. This involves converting the categories into a sequence of numbered values. For example, in a dataset containing ticket classes like "upper", "middle", and "lower", we can map these to integers (e.g. `[1, 2, 3]`, or `[3, 2, 1]`), maintaining the ordered relationship. +When the data has a natural order (i.e. where one category is "greater" or "less" than another), we use **ordinal encoding**. This involves converting the categories into a sequence of numbered values. For example, in a dataset containing ticket classes like "business", "priority", and "economy", we can map these to integers (e.g. `[1, 2, 3]`, or `[3, 2, 1]`), maintaining the ordered relationship. ![Example of ordinal encoding (passenger ticket classes).](../../../images/ordinal-encoding-passenger-class.png) @@ -20,41 +20,136 @@ When the data has a natural order (i.e. where one category is "greater" or "less With ordinal encoding, we start with a column of textual category names, and we wind up with a column of numbers. -### One-hot Encoding for Categorical Data +To implement ordinal encoding in Python, you can consider a `DataFrame` mapping operation: + +```{python} +#| code-fold: true +from pandas import DataFrame + +# exmaple dataframe: +passengers_df = DataFrame({ + "passenger_id": ["P1", "P2", "P3", "P4", "P5", "P6"], + "ticket_class": ["BUSINESS", "PRIORITY", "BUSINESS", + "ECONOMY", "ECONOMY", "ECONOMY"] +}) +passengers_df.index = passengers_df["passenger_id"] +passengers_df.index.name = "passenger_id" +passengers_df.drop(columns=["passenger_id"], inplace=True) +#passengers_df +``` + +```{python} +ticket_class_map = {"BUSINESS": 3, "PRIORITY": 2, "ECONOMY": 1} + +x_encoded = passengers_df["ticket_class"].map(ticket_class_map) + +passengers_df["ticket_class_encoded"] = x_encoded +passengers_df +``` -When categorical data has no natural order, we use a technique called **one-hot encoding** to convert it into a numerical format. In one-hot encoding, each category is represented as a row of binary values, where one element is set to 1 to indicate the presence of that category, and the remaining elements are set to 0, indicating the absence of all other categories. + + +### One-hot Encoding + +When categorical data has no natural order, we use a technique called **one-hot encoding** to convert it into a numerical format. In one-hot encoding, each category is represented as a row of binary values, where one element is set to 1 (or True) to indicate the presence of that category, and the remaining elements are set to 0 (or False), indicating the absence of all other categories. For example, if we have five color categories (blue, green, red, purple, yellow), one-hot encoding transforms the single column of colors into five separate columns. Each column corresponds to one color, and for any given row, a 1 appears in the column that matches the color, while the other columns contain 0s to represent the absence of the other colors. ![Example of one-hot encoding for categorical data (diamond colors).](../../../images/one-hot-encoding-diamond-color.png) -With ordinal encoding, we start with a column of textual category names, and we wind up with as many columns as there were unique values in the original column. This can potentially lead to a large number of features if there are a large number of categories present. +With one-hot encoding, we start with a column of textual category names, and we wind up with as many columns as there were unique values in the original column. This can potentially lead to a large number of features if there are a large number of categories present. + +To implement one-hot encoding in Python, we can use the [`get_dummies` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) from `pandas`, specifying a list of columns to encode in this way: + +```{python} +#| code-fold: true +from pandas import DataFrame + +# example dataframe +diamonds_df = DataFrame({ + "diamond_id": ["D1", "D2", "D3", "D4", "D5", "D6"], + "color": ["blue", "green", "red", "purple", "yellow", "blue"] +}) +diamonds_df.index = diamonds_df["diamond_id"] +diamonds_df.index.name = "diamond_id" +diamonds_df.drop(columns=["diamond_id"], inplace=True) +#diamonds_df +``` +```{python} +from pandas import get_dummies as one_hot_encode + +#x_encoded = one_hot_encode(diamonds_df, columns=["color"]) +x_encoded = one_hot_encode(diamonds_df["color"]) +x_encoded = x_encoded.astype("int") +x_encoded.columns = [f"color_{color}" for color in x_encoded.columns] + +diamonds_df.merge(x_encoded, left_index=True, right_index=True) +``` + +```{python} +#diamonds_df.merge(x_encoded, left_index=True, right_index=True) +``` + + -### One-hot Encoding for Textual Data + +#### Bag of Words In natural language processing (NLP), we can use one-hot encoding to represent words (or "tokens") in a sentence. This approach is known as the "bag of words" method. In the bag of words approach, each document is transformed into a row of binary values, where 1 indicates the presence of a given word in that document, and 0 indicates absence of that word. This allows us to turn text into numbers that machine learning models can work with. -![Example of one-hot encoding for textual data.](../../../images/bag-of-words.png) +![Example of one-hot encoding for textual data (i.e. "bag of words" approach).](../../../images/bag-of-words.png) + +The bag of words approach is a simple count-based text embedding approach, however more advanced alternative approaches include Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings. To implement count-based word embedding strategies in Python, we can use the [`CountVectorizer` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), or more commonly the [`TfidfVectorizer` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from `sklearn`. -The bag of words approach is a simple count-based text embedding approach, however more advanced alternative approaches include Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings. +## For Time-series Data -### Time Step Encoding for Time-series Data +### Time Step Encoding -When working with time-series data, it's important to encode dates in a way that preserves the temporal structure. A common method is **time-step encoding**, where each timestamp is assigned a sequential integer value. For example, if our data is recorded daily, we would assign the earliest date a value of 1, the next day a value of 2, and so on. This approach works well for data collected at regular intervals, such as daily, monthly, or yearly. +When working with time-series data, it's important to encode dates in a way that preserves the temporal structure. A common method is **time-step encoding**, where each timestamp is assigned a sequential integer value. This approach works well for data collected at regular intervals (e.g. daily, monthly, quarterly, or yearly). To implement this in Python, we first sort the dataset by date in ascending order, ensuring the earliest date appears first. Then, we create a new column with sequential integers, starting at 1 for the earliest date and incrementing by 1 for each subsequent date: -```python +```{python} +#| code-fold: true + +from pandas import DataFrame + +dates = ["1970-01-01", "1971-01-01", "1972-01-01", "1973-01-01", "1974-01-01", + "1975-01-01", "1976-01-01", "1977-01-01", "1978-01-01", "1979-01-01"] + +df = DataFrame({"date": dates}) +``` + +```{python} df.sort_values(by="date", ascending=True, inplace=True) df["time_step"] = range(1, len(df) + 1) df ``` -In some ways, this is similar to an ordinal encoding, where the order of values is maintained. - ## Summary In summary, different types of data require different encoding strategies. For categorical data, we use ordinal encoding when categories have a natural order, and one-hot encoding when no such order exists. In text data, one-hot encoding is applied through the "bag of words" method to represent the presence or absence of words in documents. For time-series data, time step encoding is used to preserve the temporal order of observations. Choosing the right encoding method ensures that the model can process the data effectively and make accurate predictions.