
[FEATURE] Extraction of latent space #121

Open
42elenz opened this issue Sep 11, 2024 · 5 comments · Fixed by #202

42elenz commented Sep 11, 2024

Dear Team,
Thanks for this awesome Repo :)
I would like to suggest a way to extract latent-space embeddings and to backpropagate losses based on them. This could be done with an encode(x) function that returns the latent space.

This could also be interesting for embedding analyses such as feature importance, but mainly I would like to use the embedding space for downstream tasks. A minimal sketch of what I have in mind follows.
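
To make the idea concrete, here is a small, self-contained sketch of the kind of hook I mean (the class and method names are purely illustrative and not your API):

```python
import torch
import torch.nn as nn

class TinyTabularEncoder(nn.Module):
    """Toy model illustrating an encode(x) hook in front of the task head."""

    def __init__(self, n_features: int, d_model: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_features, d_model), nn.ReLU())
        self.head = nn.Linear(d_model, 1)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # latent space: everything up to (but excluding) the task head
        return self.backbone(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encode(x))

model = TinyTabularEncoder(n_features=8)
z = model.encode(torch.randn(4, 8))   # latent embeddings, shape (4, 16)
z.sum().backward()                    # a loss on z backpropagates into the encoder
```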

I hope this description is helpful and clear, and that you take the idea into consideration.

Thx.

42elenz added the enhancement (New feature or request) label Sep 11, 2024
AnFreTh self-assigned this Sep 11, 2024
AnFreTh changed the title from "[FEATURE] Extracting of Embeddings" to "[FEATURE] Extraction of latent space" Sep 11, 2024
AnFreTh added this to the v 0.2.0 HPO milestone Sep 11, 2024
42elenz (Author) commented Nov 20, 2024

Hello!
I wanted to get back to that.
This week I finally had some time to look into it. It seems quite difficult because the data preprocessing and all the modules (Data_Module, Preprocessor, Task_model, and Mambular + Mamba) are tightly entangled.

But I think I am almost there, at least for making it work with categorical features (in my embedding task I don't always have a target I could use for binning the metrical features, or at least I don't want to "pollute" my training with that information yet).

My biggest question at the moment is the input to the model:
I don't want to use the whole preprocessor and all the other modules, because I have my own preprocessing and just want to initialize the Mambular model and feed it data in batches.
I am also using Lightning, but with plain DataLoaders instead of a Data_module. I also need to keep the structure of my data because of the contrastive nature of my task, and the IDs need to be preserved.

Now I have this problem:

What does the Mambular model actually get as input? To keep it simple, I am only interested in what it receives when I use categorical features alone.
In the initialization I pass the dict; that worked and I understood it.
I then checked the shape of the actual input to the model (model(num_features, cat_features)):

[screenshot: printed shapes of the model inputs num_features and cat_features]

  • Why are these lists? (I saw that the length of a list, e.g. cat_features, equals the number of categorical columns.)
  • Why is the shape of a list 10? (Are these "tokens" somehow? Where does the number 10 come from?)
  • How can I change my data into this form without the preprocessor module?

I hope you can help me with that and maybe point me to the crucial preprocessing functions so I can extract them from the modules and add them to my preprocessing steps.

Thank you very much for your work :)

AnFreTh (Collaborator) commented Nov 20, 2024

Hi,

thanks for following up on this. While I have not gotten around to incorporating it yet, I am confident that we will have the option to extract latent spaces within this year.

Why are these lists?

  • We use lists so that each feature, independent of its shape, is easily accessible. I.e., if you had a single tensor instead (as is common for tabular data), you would have to store the corresponding input shape of each feature, since after e.g. PLE or basis expansion during preprocessing the dimension of feature $x_1$ can be anything in $\mathbb{R}^T$.
    However, to clarify: each element in cat_features is a tensor.

If you have multiple categorical features, let's say 3, which are integer encoded, the shape of each tensor in cat_features should be (B, 1), with B being the batch size.

Why is the shape of a list 10? (Are these "tokens" somehow? Where does the number 10 come from?)

  • Hard to tell without seeing your data-generating process. If you use simple categorical features with integer encoding, the length of each feature tensor in cat_features corresponds to the batch size.

How can I change my data into this form without the preprocessor module?

  • The input to the forward pass of Mambular is two lists. If you do not have any numerical features, simply pass an empty list; a minimal sketch follows below.
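
As a rough sketch of building that input by hand (the variable names are placeholders, and the commented-out call just mirrors the model(num_features, cat_features) signature you mentioned):

```python
import torch

B = 32                                  # batch size
raw_cat = torch.randint(0, 5, (B, 3))   # three integer-encoded categorical columns

# one (B, 1) tensor per categorical feature
cat_features = [raw_cat[:, i:i + 1] for i in range(raw_cat.size(1))]
num_features = []                       # no numerical features -> empty list

# out = model(num_features, cat_features)   # forward pass as described above
```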

Maybe it would be much simpler if we integrated your custom preprocessing directly into the library? If you can share your preprocessing code and a minimal example, it should be a rather fast integration on our part.

Otherwise, if you only use the base_models/Mambular version and are not relying on the preprocessing/data_module and Lightning trainer parts, you could also just change the input of that BaseModel altogether. Since you are interested in the embeddings, you could replace self.embedding_layer() with a module of your liking that simply takes a single x as input.

The input to self.mamba is the same as for any transformer, i.e. a tensor of shape (B, S, D) = (batch size) x (sequence length) x (embedding dimension), where the sequence length corresponds to the number of features.
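
If you go that route, a minimal sketch of a replacement embedding module that turns the list of (B, 1) integer tensors into such a (B, S, D) tensor could look like this (cardinalities and dimensions are placeholders, this is not our implementation):

```python
import torch
import torch.nn as nn

class SimpleCatEmbedding(nn.Module):
    def __init__(self, cardinalities, d_model: int):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, d_model) for card in cardinalities
        )

    def forward(self, cat_features):
        # cat_features: list of (B, 1) integer tensors, one per feature
        tokens = [emb(x.squeeze(-1)) for emb, x in zip(self.embeddings, cat_features)]
        return torch.stack(tokens, dim=1)   # (B, S, D), S = number of features

emb = SimpleCatEmbedding(cardinalities=[5, 5, 5], d_model=16)
cat_features = [torch.randint(0, 5, (32, 1)) for _ in range(3)]
print(emb(cat_features).shape)   # torch.Size([32, 3, 16])
```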

I hope this helps :) Feel free to further engage if this does not answer your questions.

42elenz (Author) commented Nov 21, 2024

Thank you for your detailed answer.
I think I managed to use the model as an embedding model.
As you suggested, I implemented my own embedding class that inherits from Mambular.
I reshape the batches coming from my DataLoader into the shape you described:

cat_features = [cat_features[:, i:i+1].int() for i in range(cat_features.size(1))]

This then yields:
len(cat_features) == number of categorical columns
cat_features[i].shape == (batch_size, 1) for each element

The num_features variable is an empty list by default. Maybe I will add them later. For now I think I will just bin them myself in my own preprocessing steps and make them categorical that way.
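
For that binning, I currently have something simple like this in mind (just a sketch, the bin count is arbitrary):

```python
import torch

def bin_numerical(col: torch.Tensor, n_bins: int = 10) -> torch.Tensor:
    """Turn one numerical column into integer codes via fixed-width binning."""
    edges = torch.linspace(col.min().item(), col.max().item(), n_bins + 1)[1:-1]
    return torch.bucketize(col, edges).unsqueeze(-1)   # shape (B, 1), codes in [0, n_bins - 1]

codes = bin_numerical(torch.randn(32), n_bins=10)   # ready to append to cat_features
```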

I then pass this through the model.
At the moment I am extracting the latent space after the MLP and later feed that embedding into my projection head for contrastive learning.

Feel free to comment on that choice of extraction point. But since the projection head is nothing more than a fancy linear layer (see the sketch below), the output after the MLP layer was the best fit shape-wise.
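
For reference, the projection head is basically just a small MLP like this (the dimensions are placeholders):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, d_in: int, d_hidden: int = 128, d_out: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # normalize so the contrastive loss operates on the unit hypersphere
        return nn.functional.normalize(self.net(z), dim=-1)

head = ProjectionHead(d_in=32)
print(head(torch.randn(8, 32)).shape)   # torch.Size([8, 64])
```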

Thanks again for your help and prompt answers :)

AnFreTh (Collaborator) commented Nov 21, 2024

Great :)
If you care to share your preprocessing, we could still include it in the library and also ship the embedding extraction in one of our next releases :)

42elenz (Author) commented Nov 25, 2024

There is no big difference in my preprocessing step (just that I use a Gaussian rank scaler for normalization). The main difference is in the loader, and somehow I could not manage to disentangle the preprocessing from the loading.

The reason I had to use a different data-loading approach was mainly my contrastive task (which involves two modalities). I always had to ensure that their IDs were in the same order. In my loader, I consistently loaded and preprocessed both modalities, then passed them into my training pipeline and their respective encoders, where they ended up in the shared embedding space after their respective projection heads.

One thing I am still working on is how to find the best embeddings when there is no downstream task to guide the encoding of the metric values into the latent space. In that case I have to use fixed binning (e.g. 50 bins), but maybe there is another method to discretize the metrical values besides a tree-based approach.
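
One target-free option besides fixed-width bins would be quantile (equal-frequency) binning, e.g. via scikit-learn's KBinsDiscretizer (just a sketch, the bin count is arbitrary):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.randn(1000, 4)    # metric features, no target required
disc = KBinsDiscretizer(n_bins=50, encode="ordinal", strategy="quantile")
codes = disc.fit_transform(X).astype(int)   # integer codes, one column per feature
```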

Looking forward to your next release!

AnFreTh linked a pull request Jan 18, 2025 that will close this issue