sample: added --smiles and --save_json options #25
base: master
Conversation
--smiles can be used to encode SMILES strings that are not in the dataset. The dataset is still used to do the charset one-hot encoding correctly. --save_json was also added as an alternate output format.
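A minimal sketch of what the description implies, assuming the repo's convention of padding strings to a fixed width with spaces; the function name and charset below are hypothetical, and the key point is that the charset (and hence the column indices) still comes from the dataset:

```python
def one_hot_encode(smiles, charset, max_length=120):
    """Encode a SMILES string as a (max_length, len(charset)) one-hot matrix.

    `charset` is the character list derived from the dataset, so an
    out-of-dataset SMILES string gets indices consistent with training.
    """
    index = {c: i for i, c in enumerate(charset)}
    padded = smiles.ljust(max_length)  # assumed: pad with spaces to fixed width
    rows = []
    for ch in padded:
        row = [0] * len(charset)
        row[index[ch]] = 1  # raises KeyError if ch is outside the charset
        rows.append(row)
    return rows
```

Characters not present in the dataset's charset cannot be encoded, which is why the dataset is still needed even when the string itself comes from --smiles.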
I've been thinking about this PR and I'm not a big fan of the direction it takes us. Do we add parallel json handling next to h5 in all of the other scripts? Do we let the interfaces skew? This adds global-sounding flags which are only relevant for the … I'd much rather keep hdf5 as the serialization scheme here (json becomes really problematic on larger datasets) and add a script for converting between hdf5 and json if needed. As for …
Hi, sorry for the delay responding. I'll remove … The main contribution here is … The …
Yes, I definitely agree with this. I think the bigger thing that's wrong here is that there's no reason to store the …

cc @dakoner
Yep, I'd want a --charset charset.h5 option. This would work even with legacy files that have the 'charset' data in them (model, encoder output).
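The proposed fallback behavior could look roughly like this; a hedged sketch only, since the actual flag names and file readers would live in the repo's scripts (`resolve_charset` and both reader callables here are hypothetical):

```python
import argparse

def build_parser():
    # Sketch of the proposed interface: --charset points at a standalone
    # charset file; when absent, fall back to the charset embedded in
    # legacy data files (model / encoder output).
    parser = argparse.ArgumentParser(description="sample from a trained model")
    parser.add_argument("data",
                        help="h5 dataset (legacy files may embed 'charset' data)")
    parser.add_argument("--charset", default=None,
                        help="optional standalone charset file, e.g. charset.h5")
    return parser

def resolve_charset(args, read_standalone, read_embedded):
    """Prefer the explicit --charset file; otherwise read the legacy
    embedded charset from the data file. Readers are injected so this
    sketch stays independent of any particular h5 library."""
    if args.charset is not None:
        return read_standalone(args.charset)
    return read_embedded(args.data)
```

Keeping the embedded-charset path as a fallback is what makes the change backward compatible with existing files.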
Sure, I think …
@dribnet You might like to take a look at my PR #43, which hard-codes a charset and provides a helper object for decoding and encoding strings given that charset. We can refactor your … Is this what you were thinking of / is that what is useful?
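A helper object in the spirit described above might be sketched as follows; the class name, padding convention, and charset are assumptions for illustration, not the actual contents of PR #43:

```python
class CharsetCoder:
    """Encode/decode strings against a fixed charset, padding to a
    fixed width. A hypothetical sketch of the helper-object idea."""

    def __init__(self, charset, pad_length=120, pad_char=' '):
        self.charset = list(charset)
        self.pad_length = pad_length
        self.pad_char = pad_char
        self._index = {c: i for i, c in enumerate(self.charset)}

    def encode(self, s):
        """Return a list of charset indices, padded to pad_length."""
        padded = s.ljust(self.pad_length, self.pad_char)
        return [self._index[c] for c in padded]

    def decode(self, indices):
        """Invert encode(), stripping trailing padding."""
        return ''.join(self.charset[i] for i in indices).rstrip(self.pad_char)
```

With a hard-coded charset, both encoding and decoding can live in one place instead of each script re-deriving the mapping from an h5 file.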
I've been a little out of this the last week as my GPU rig is down and I've been traveling, so I've been unable to get to it to fix things. I've landed #43 though, so that should simplify the charset questions here.