[FEATURE] Extraction of latent space #121
Hello! I think I am almost there, at least to make it work with categorical features (in my embedding task I don't always have a target I could use for binning the numerical features, or at least I don't want to "pollute" my training with this information yet). My biggest question at this moment is the input to the model: what does the Mambular model actually get as input? To keep it simple, I would just be interested in what it gets when I am only using categorical features.
I hope you can help me with that and maybe point me to the crucial preprocessing functions so I can extract them from the modules and add them to my preprocessing steps. Thank you very much for your work :)
Hi, thanks for reiterating on this. While I have not come around to incorporating it yet, I am confident that we will have the option to extract latent spaces within this year.
If you have multiple categorical features - let's say 3 - which are integer encoded, the shape of each tensor in cat_features should be (B, 1), with B being the batch size.
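For concreteness, a minimal sketch of what that input format could look like in PyTorch (the variable names and cardinalities are illustrative, not the library's exact API):

```python
import torch

B = 32  # batch size

# Three integer-encoded categorical features, one tensor per feature,
# each of shape (B, 1) as described above. Cardinalities are made up.
cat_features = [
    torch.randint(0, 10, (B, 1)),  # feature 1: 10 categories
    torch.randint(0, 5, (B, 1)),   # feature 2: 5 categories
    torch.randint(0, 3, (B, 1)),   # feature 3: 3 categories
]
num_features = []  # empty when only categorical features are used
```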
Maybe it would be much simpler if we integrated your custom preprocessing directly into the library? If you can share your preprocessing code and a minimal code example, it should be a rather fast integration on our part. Otherwise, if you only use the base_models/Mambular version and are not relying on the preprocessing/data_module and Lightning trainer parts, you could also just change the input of that BaseModel altogether. Since you are interested in the embeddings, you could change self.embedding_layer() to a module of your liking that simply takes a single x as input. The input to self.mamba is the same as to any transformer, i.e. a tensor of shape (B, S, D): (batch size) × (sequence length) × (embedding dimension), where the sequence length corresponds to the number of features. I hope this helps :) Feel free to further engage if this does not answer your questions.
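To illustrate the swap, here is a minimal sketch of an embedding module one could put in place of self.embedding_layer(); the class and its constructor arguments are hypothetical, only the (B, S, D) output contract comes from the answer above:

```python
import torch
import torch.nn as nn

class SimpleCategoricalEmbedding(nn.Module):
    """Hypothetical drop-in embedding: maps a list of integer-encoded
    categorical features to one (B, S, D) tensor for the Mamba blocks."""

    def __init__(self, cardinalities, d_model):
        super().__init__()
        # One embedding table per categorical feature.
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, d_model) for card in cardinalities
        )

    def forward(self, cat_features):
        # Each feature tensor has shape (B, 1); nn.Embedding appends the
        # embedding dimension, giving (B, 1, D) per feature.
        embedded = [emb(x) for emb, x in zip(self.embeddings, cat_features)]
        # Concatenate along the sequence axis -> (B, S, D), where S is the
        # number of features, matching the transformer-style input.
        return torch.cat(embedded, dim=1)
```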
Thank you for your detailed answer.
This then yields: the num_features variable is by default an empty list. Maybe I will add them later. For now I think I will just bin them myself in my personal preprocessing steps and make them categorical that way. I then pass this through the model. Feel free to comment on that choice of extraction point. But since the projection head is nothing more than a fancy linear layer, the output after the MLP layer was the best fit regarding the shapes. Thanks again for your help and prompt answers :)
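If you want to grab that representation without touching the model code, a forward hook is one option; a minimal sketch, assuming the final head is exposed as a submodule (the attribute name tabular_head and the call signature are guesses and may differ in Mambular):

```python
import torch

latents = {}

def save_head_input(module, inputs, output):
    # The input to the final head is the latent representation we want.
    latents["z"] = inputs[0].detach()

# `model` is the trained base model; the attribute name is an assumption.
handle = model.tabular_head.register_forward_hook(save_head_input)

with torch.no_grad():
    _ = model(num_features=[], cat_features=cat_features)  # assumed signature

embeddings = latents["z"]  # latent space, captured just before the head
handle.remove()
```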
Great :)
There is no big difference in my preprocessing step (just that I use a Gaussian rank scaler for normalizing). The main difference is in the loader, and somehow I could not manage to disentangle the preprocessing from the loading. The reason I had to use a different data-loading approach was mainly my contrastive tasks (which involve two modalities): I always had to ensure that their IDs were in the same order. In my loader, I consistently loaded and preprocessed both modalities, then passed them into my training pipeline and their respective encoders, where they ended up in the shared embedding space after their respective projection heads. One thing I am still working on is how to find the best embeddings when there is no downstream task for encoding the numerical values into the latent space, because then I have to use fixed binning (like 50 bins); maybe there is another method to discretize the numerical values besides a tree-based approach - see the sketch below for one option. Looking forward to your next release!
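On the discretization question, one target-free alternative to fixed-width bins is quantile binning, which spreads samples roughly evenly across bins; a sketch with scikit-learn (bin count and data are illustrative):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X_num = rng.normal(size=(1000, 4))  # stand-in for the numerical features

# strategy="quantile" needs no target, unlike decision-tree binning;
# each of the 50 bins receives roughly the same number of samples.
discretizer = KBinsDiscretizer(n_bins=50, encode="ordinal", strategy="quantile")
X_binned = discretizer.fit_transform(X_num).astype(int)  # integer category codes
```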
Dear Team,
Thanks for this awesome repo :)
I wanted to suggest a way of extracting latent-space embeddings and backpropagating losses based on them. This could be done with an encode(x) function that returns the latent spaces.
Maybe this could also be interesting for using embedding analyses for feature importance, but mainly I would like to use the embedding space for downstream tasks; a rough sketch follows below.
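A minimal sketch of what such an interface could look like (the class, its constructor, and the split into backbone/head are hypothetical, not existing Mambular API):

```python
import torch.nn as nn

class EncoderWithLatents(nn.Module):
    """Hypothetical wrapper sketching the suggested encode(x) interface."""

    def __init__(self, backbone, head):
        super().__init__()
        self.backbone = backbone  # embeddings + sequence model
        self.head = head          # task-specific projection head

    def encode(self, x):
        # Return the latent space: the representation before the head,
        # usable for downstream tasks or for backpropagating custom losses.
        return self.backbone(x)

    def forward(self, x):
        return self.head(self.encode(x))
```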
I hope this description was helpful and clear and that you take this idea into consideration.
Thx.