
sample: added --smiles and --save_json options #25

Open

dribnet wants to merge 1 commit into master

Conversation

dribnet (Contributor) commented Nov 7, 2016

--smiles can be used to encode SMILES strings that are not in the dataset. The dataset is still used to do the charset one-hot encoding correctly.

--save_json was also added as an alternate output format.

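For context, a rough sketch of the kind of encoding --smiles needs: one-hot encoding a new string against the charset stored with the training data. The fixed length and space padding here are illustrative assumptions, not the PR's actual code.

```python
import numpy as np

def one_hot_encode(smiles, charset, max_length=120):
    """One-hot encode a single SMILES string against an existing charset.

    `charset` is the character list saved alongside the training data;
    `max_length` and the space padding are assumptions for illustration.
    Every character in `smiles` must already appear in `charset`.
    """
    char_to_index = {c: i for i, c in enumerate(charset)}
    one_hot = np.zeros((max_length, len(charset)), dtype=np.float32)
    for pos, char in enumerate(smiles.ljust(max_length)):
        one_hot[pos, char_to_index[char]] = 1.0
    return one_hot
```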
maxhodak (Owner) commented Nov 9, 2016

I've been thinking about this PR and I'm not a big fan of the direction it takes us in. Do we add parallel JSON handling next to HDF5 in all of the other scripts? Do we let the interfaces skew? This adds global-sounding flags that are only relevant for the encoder target.

I'd much rather keep HDF5 as the serialization scheme here (JSON becomes really problematic on larger datasets) and add a script for converting between HDF5 and JSON if needed.

As for --smiles, I'd rather that read from stdin, which is a better non-data-file interface.
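A minimal sketch of the stdin interface suggested here (placeholder code, not the script's actual implementation):

```python
import sys

# One SMILES string per line on stdin, so the script composes with pipes,
# e.g. `cat new.smi | python sample.py model_500k.h5 ...`
for line in sys.stdin:
    smiles = line.strip()
    if not smiles:
        continue
    print(smiles)  # stand-in for encoding `smiles` against the charset
```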

dribnet (Contributor, Author) commented Nov 11, 2016

Hi, sorry for the delay in responding. I'll remove --save_json and update --smiles to use stdin. Longer thoughts below.

The main contribution here is --smiles. It was slightly tricky to get this logic right, but I think it's valuable to be able to reuse a trained model on data not in the original training set, which means encoding the new SMILES string against an existing charset. I've done some experiments and it is working well for me with the provided model_500k.h5 trained model. I agree that using stdin would be a good alternative and can make this change.

--save_json was just added as a small afterthought to make this library compatible with my own plat library. It's only a few lines of code, and I thought it provided slightly cleaner output than the default savetxt. But I'm happy to remove it, since it's fairly trivial to take the savetxt output and transform it into this JSON form.
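For reference, a sketch of that post-hoc conversion: read the whitespace-separated matrix written by numpy.savetxt and re-emit it as JSON (file names are placeholders).

```python
import json
import numpy as np

# Load the savetxt output and dump it as a JSON array of rows.
latent = np.loadtxt("encoded.txt")
with open("encoded.json", "w") as f:
    json.dump(latent.tolist(), f)
```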

dribnet mentioned this pull request Nov 11, 2016
maxhodak (Owner) commented:

> I think it's valuable to be able to reuse a trained model on data not in the original training set, which means encoding the new SMILES string against an existing charset.

Yes, I definitely agree with this. I think the bigger thing that's wrong here is that there's no reason to store the charset with whatever data is being operated on at any given time. I've been thinking for a while about pulling the charset out into a --charset flag, or even making it a required parameter, but it seemed like extra work. What do you think of that, instead of extra flags for specifying new data?

cc @dakoner

dakoner (Contributor) commented Nov 12, 2016

Yep, I'd want a --charset charset.h5 option.

This would work even with legacy files that have the 'charset' data in them (model, encoder output).
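A sketch of what loading from --charset charset.h5 could look like, assuming the charset is stored under a 'charset' dataset as it is in the legacy files (the dataset name and decoding details here are assumptions):

```python
import h5py

def load_charset(path):
    """Load a charset from an HDF5 file.

    Intended to work both with a dedicated charset.h5 and with legacy
    data/model files that carry a 'charset' dataset alongside other data.
    """
    with h5py.File(path, "r") as f:
        raw = f["charset"][:]
    # h5py returns bytes for string datasets; decode to plain characters.
    return [c.decode("utf-8") if isinstance(c, bytes) else c for c in raw]
```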

dribnet (Contributor, Author) commented Nov 12, 2016

Sure, I think --charset is a great idea. I can introduce that when I refactor this pull request.

pechersky (Collaborator) commented:

@dribnet You might like to take a look at my PR #43, which hard-codes a charset and provides a helper object for decoding and encoding strings given that charset. We can refactor your --smiles flag to provide a file-like object that streams in SMILES strings, which get converted on the fly to one-hot matrices and then encoded/autoencoded.

Is this what you were thinking of, and is it what would be useful?
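Something along these lines, as a rough sketch of the streaming refactor being described (names and the fixed length are placeholders, not #43's actual API):

```python
import numpy as np

def smiles_stream(lines, charset, max_length=120):
    """Yield one-hot matrices for SMILES strings read from any file-like
    object (sys.stdin, an open file), one string per line, converting on
    the fly so they can be fed to the encoder/autoencoder."""
    char_to_index = {c: i for i, c in enumerate(charset)}
    for line in lines:
        smiles = line.strip()
        if not smiles:
            continue
        matrix = np.zeros((max_length, len(charset)), dtype=np.float32)
        for pos, char in enumerate(smiles.ljust(max_length)):
            matrix[pos, char_to_index[char]] = 1.0
        yield matrix
```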

maxhodak (Owner) commented:

I've been a little out of this for the last week: my GPU rig is down and I've been traveling, so I've been unable to get to it to fix things. I've landed #43, though, so that should simplify the charset questions here.
