
sample: added --smiles and --save_json options #25

Open

dribnet wants to merge 1 commit into master

Conversation

dribnet (Contributor) commented Nov 7, 2016

--smiles can be used to encode SMILES strings that are not in the dataset. The dataset is still used to do the charset one-hot encoding correctly.

--save_json was also added as an alternate output format.

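For context, a rough sketch of the kind of encoding --smiles needs: one-hot encoding a new string against the charset stored with the training data. The fixed length and space padding here are illustrative assumptions, not the PR's actual code.

```python
import numpy as np

def one_hot_encode(smiles, charset, max_length=120):
    """One-hot encode a single SMILES string against an existing charset.

    `charset` is the character list saved alongside the training data;
    `max_length` and the space padding are assumptions for illustration.
    Every character in `smiles` must already appear in `charset`.
    """
    char_to_index = {c: i for i, c in enumerate(charset)}
    one_hot = np.zeros((max_length, len(charset)), dtype=np.float32)
    for pos, char in enumerate(smiles.ljust(max_length)):
        one_hot[pos, char_to_index[char]] = 1.0
    return one_hot
```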
maxhodak (Owner) commented Nov 9, 2016

I've been thinking about this PR and I'm not a big fan of the direction it takes us in. Do we add parallel JSON handling next to HDF5 in all of the other scripts? Do we let the interfaces skew? This adds global-sounding flags that are only relevant for the encoder target.

I'd much rather keep HDF5 as the serialization scheme here (JSON becomes really problematic on larger datasets) and add a script for converting between HDF5 and JSON if needed.

As for --smiles, I'd rather that read from stdin, which is a better non-data-file interface.
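A minimal sketch of the stdin interface suggested here (placeholder code, not the script's actual implementation):

```python
import sys

# One SMILES string per line on stdin, so the script composes with pipes,
# e.g. `cat new.smi | python sample.py model_500k.h5 ...`
for line in sys.stdin:
    smiles = line.strip()
    if not smiles:
        continue
    print(smiles)  # stand-in for encoding `smiles` against the charset
```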

dribnet (Contributor, Author) commented Nov 11, 2016

Hi, sorry for the delay in responding. I'll remove --save_json and update --smiles to use stdin. Longer thoughts below.

The main contribution here is --smiles. It was slightly tricky to get this logic right, but I think it's valuable to be able to reuse a trained model on data not in the original training set, which means encoding the new SMILES string against an existing charset. I've done some experiments and it is working well for me with the provided model_500k.h5 trained model. I agree that using stdin would be a good alternative and can make this change.

--save_json was just added as a small afterthought to make this library compatible with my own plat library. It's only a few lines of code, and I thought it provided slightly cleaner output than the default savetxt. But I'm happy to remove it, since it's fairly trivial to take the savetxt output and transform it into this JSON form.
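For reference, a sketch of that post-hoc conversion: read the whitespace-separated matrix written by numpy.savetxt and re-emit it as JSON (file names are placeholders).

```python
import json
import numpy as np

# Load the savetxt output and dump it as a JSON array of rows.
latent = np.loadtxt("encoded.txt")
with open("encoded.json", "w") as f:
    json.dump(latent.tolist(), f)
```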

dribnet mentioned this pull request Nov 11, 2016
maxhodak (Owner) commented:

> I think it's valuable to be able to reuse a trained model on data not in the original training set, which means encoding the new SMILES string against an existing charset.

Yes, I definitely agree with this. I think the bigger thing that's wrong here is that there's no reason to store the charset with whatever data is being operated on at any given time. I've been thinking for a while about pulling the charset out into a --charset flag, or even making it a required parameter, but it seemed like extra work. What do you think of that, instead of extra flags for specifying new data?

cc @dakoner

dakoner (Contributor) commented Nov 12, 2016

Yep, I'd want a --charset charset.h5 option.

This would work even with legacy files that have the 'charset' data in them (model, encoder output).
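A sketch of what loading from --charset charset.h5 could look like, assuming the charset is stored under a 'charset' dataset as it is in the legacy files (the dataset name and decoding details here are assumptions):

```python
import h5py

def load_charset(path):
    """Load a charset from an HDF5 file.

    Intended to work both with a dedicated charset.h5 and with legacy
    data/model files that carry a 'charset' dataset alongside other data.
    """
    with h5py.File(path, "r") as f:
        raw = f["charset"][:]
    # h5py returns bytes for string datasets; decode to plain characters.
    return [c.decode("utf-8") if isinstance(c, bytes) else c for c in raw]
```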

dribnet (Contributor, Author) commented Nov 12, 2016

Sure, I think --charset is a great idea. I can introduce that when I refactor this pull request.

pechersky (Collaborator) commented:

@dribnet You might like to take a look at my PR #43, which hard-codes a charset and provides a helper object for decoding and encoding strings given that charset. We can refactor your --smiles flag to provide a file-like object that streams in SMILES strings, which get converted on the fly to one-hot matrices and then encoded/autoencoded.

Is this what you were thinking of, and is it what would be useful?
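Something along these lines, as a rough sketch of the streaming refactor being described (names and the fixed length are placeholders, not #43's actual API):

```python
import numpy as np

def smiles_stream(lines, charset, max_length=120):
    """Yield one-hot matrices for SMILES strings read from any file-like
    object (sys.stdin, an open file), one string per line, converting on
    the fly so they can be fed to the encoder/autoencoder."""
    char_to_index = {c: i for i, c in enumerate(charset)}
    for line in lines:
        smiles = line.strip()
        if not smiles:
            continue
        matrix = np.zeros((max_length, len(charset)), dtype=np.float32)
        for pos, char in enumerate(smiles.ljust(max_length)):
            matrix[pos, char_to_index[char]] = 1.0
        yield matrix
```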

maxhodak (Owner) commented:

I've been a little out of this for the last week: my GPU rig is down and I've been traveling, so I've been unable to get to it to fix things. I've landed #43, though, so that should simplify the charset questions here.
