i can filter columns,
data_set_06 = data_set_05[["Gender","Married"]]
data_set_06
and then use
pd.get_dummies(data_set_06)
so how
obj = OneHotEncoder()
obj.fit_transform(encoded_dataset)
it can give output -
<Compressed Sparse Row sparse matrix of dtype 'float64'
with 2456 stored elements and shape (614, 8)
Because of i already filter two columns => Which break in 2 ,
But, how it access 4 columns => Which break into 8 ?
I understand your question now. Let's clarify how OneHotEncoder
works in relation to the columns you've filtered and encoded. Here’s the step-by-step explanation:
- Data Preparation:
- First, you create a copy of your original dataset (
dataset
) and handle missing values in theGender
andMarried
columns by filling them with the mode.
- First, you create a copy of your original dataset (
# Create a copy of the original dataset
data_set_04 = dataset.copy()
# Fill missing values with mode in 'Gender' and 'Married' columns
data_set_04["Gender"].fillna(data_set_04["Gender"].mode()[0], inplace=True)
data_set_04["Married"].fillna(data_set_04["Married"].mode()[0], inplace=True)
- Check Null Counts:
- After filling missing values, you can check if there are any remaining null values.
# Check null counts in the modified dataset
print(data_set_04.isnull().sum())
- Filter Columns:
- Next, you create a copy (
data_set_05
) of the modified dataset and filter out only theGender
andMarried
columns intodata_set_06
.
- Next, you create a copy (
# Create another copy of the modified dataset
data_set_05 = data_set_04.copy()
# Filter out 'Gender' and 'Married' columns into a new DataFrame
data_set_06 = data_set_05[["Gender", "Married"]]
- Perform One-Hot Encoding:
- Use
pd.get_dummies()
to perform one-hot encoding ondata_set_06
. This will create binary indicators for each unique category inGender
andMarried
.
- Use
# Perform one-hot encoding on data_set_06
encoded_dataset = pd.get_dummies(data_set_06)
# Display the first few rows to show 1s and 0s
print("First few rows with binary indicators:")
print(encoded_dataset.head())
print()
# Display the summary information of the DataFrame
print("Summary information:")
print(encoded_dataset.info())
- Initialize and Fit OneHotEncoder:
- Now, initialize
OneHotEncoder
(obj
) fromsklearn.preprocessing
and fit it toencoded_dataset
. This step adjusts the encoder based on the categories observed inencoded_dataset
.
- Now, initialize
from sklearn.preprocessing import OneHotEncoder
# Initialize OneHotEncoder
obj = OneHotEncoder()
# Fit OneHotEncoder to encoded_dataset
obj.fit(encoded_dataset)
- Transform and Convert into Array:
- Once
OneHotEncoder
is fitted, you can transformencoded_dataset
usingobj.transform()
. This operation converts the categorical data into a sparse matrix representation (<class 'scipy.sparse.csr.csr_matrix'>
).
- Once
# Transform encoded_dataset into a sparse matrix
array_01 = obj.transform(encoded_dataset)
# Display the shape of the sparse matrix
print("Shape of sparse matrix:", array_01.shape)
- Convert Sparse Matrix to DataFrame:
- To create a DataFrame (
df
) with the correct column names and the transformed data, you use theget_feature_names_out()
method to get the column names fromobj
, then convertarray_01
into an array and createdf
.
- To create a DataFrame (
# Get feature names for the columns
column_names = obj.get_feature_names_out(encoded_dataset.columns)
# Convert sparse matrix array_01 to a dense array and then to a DataFrame with column_names
df = pd.DataFrame(array_01.toarray(), columns=column_names)
# Display the DataFrame
print(df.head())
-
Column Filtering: Even though you initially filter down to only
Gender
andMarried
columns (data_set_06
), the one-hot encoding (pd.get_dummies()
) expands these into multiple binary indicators. -
OneHotEncoder Usage: When you initialize and fit
OneHotEncoder
(obj
), it learns the categories fromencoded_dataset
, which includes all unique values derived fromGender
andMarried
after one-hot encoding. -
Sparse Matrix: The output of
obj.transform()
is a sparse matrix (array_01
) that efficiently represents the encoded categorical variables, even though you filtered down to fewer original columns. -
DataFrame Creation: By using
get_feature_names_out()
to obtain column names and convertingarray_01
into a DataFrame (df
), you ensure thatdf
correctly represents the transformed data with the appropriate column names.
This approach ensures that your data preprocessing, filtering, one-hot encoding, and transformation into a format suitable for further analysis or modeling are correctly executed. Adjust variables and methods based on your specific dataset and requirements.