(issue 724) : bypass pandas using pytables directly to work with HDF5 files #761
base: master
Conversation
I know you didn't ask for it, but FWIW, I will not have time to review this initial POC before the 26th of April. I need a clear head to do this and that's not the case these days... My initial gut feeling is that you are putting too much into the LArray HDF layer (e.g. sort_rows, sort_columns & fill_value). Those are only useful when we are loading HDF files not produced via larray, which is nice to have but is not what I expected this issue to be about. I just wanted a simple binary save of larray objects, and to load them back as fast as possible, thus bypassing pandas entirely. Loading arbitrary HDF objects (not produced via larray) is much more complex and I wouldn't tackle it now, unless you have a clear idea of what is needed.
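For illustration, a minimal sketch of that "simple binary save of larray objects, bypassing pandas entirely" idea, assuming a hypothetical layout of one labels array per axis stored next to the data node (save_array, load_array and the axis{i} naming are all illustrative, not the PR's actual API):

```python
import numpy as np
import tables

def save_array(filepath, key, data, axes):
    """data: NumPy ndarray; axes: list of (name, labels) pairs."""
    with tables.open_file(filepath, mode='a') as h5file:
        group = h5file.create_group('/', key)
        h5file.create_array(group, 'data', obj=data)
        for i, (name, labels) in enumerate(axes):
            labels = np.asarray(labels)
            if labels.dtype.kind == 'U':
                # PyTables stores fixed-size bytes, not NumPy 'U' strings
                labels = labels.astype('S')
            node = h5file.create_array(group, 'axis{}'.format(i), obj=labels)
            node.attrs.name = name

def load_array(filepath, key):
    with tables.open_file(filepath, mode='r') as h5file:
        group = h5file.get_node('/' + key)
        data = group.data.read()
        axes = [(node.attrs.name, node.read())
                for node in h5file.iter_nodes(group)
                if node._v_name != 'data']
    return data, axes
```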
PytablesHDFStore is not yet implemented; the class is almost empty. My first objective is to get backward compatibility with previously produced HDF files. Currently, the PR shows a fully implemented PandasHDFStore which just reproduces what we had before, nothing else. I started the PR to let you know that I'm working on it. I don't expect you to review this soon, but maybe, after the 26/4, to first look at the class structure.
I know. I just wanted to clarify my vision. Not that you have to follow it, but I just wanted to avoid any misunderstanding.
Just to also clarify myself, my wish now is to implement a …
larray/inout/hdf.py (outdated diff)
    return array.astype(str)
try:
    array = np.asarray(array)
    if array.dtype == np.object_ or (not PY2 and array.dtype.kind == 'S'):
Converting object arrays to string seems like a bad idea (imagine an object array containing floats or ints, or a mixed array with both strings and numbers; I think most of our real-life object arrays are of that kind).
Reminder: add a test with an array of dtype=np.object_
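The requested test could look roughly like the sketch below, assuming the 0.31-era larray names (LArray, read_hdf) and a pytest tmp_path fixture; which conversion behavior the test should pin down is exactly what the comment above leaves open:

```python
import numpy as np
import larray as la

def test_object_dtype_axis_roundtrip(tmp_path):
    # mixed object labels: the common real-life case mentioned above
    labels = np.array(['a', 1, 2.5], dtype=np.object_)
    arr = la.LArray(np.arange(3), [la.Axis(labels, 'mixed')])
    fpath = str(tmp_path / 'obj.h5')
    arr.to_hdf(fpath, 'arr')
    result = la.read_hdf(fpath, 'arr')
    # a blanket str conversion in the HDF layer would silently turn 1 into '1'
    assert list(result.mixed.labels) == list(labels)
```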
The pandas bypass and the internals cleanup both seem great. However, I am not sure about making this (LHDFStore) part of the public API yet, because it overlaps heavily with a lazy Session backed by an HDF file. At this point I am unsure what the best API is to offer our users for opening an HDF file, loading/writing some arrays when needed, then closing the file, but I fear that having both lazy sessions and LHDFStore as public API could confuse our users, because that would essentially be two ways to do the same thing (in addition to the current confusion about read_X and sessions). But do not revert anything (except maybe the changes to api.rst); in the worst case LHDFStore will be used by lazy sessions. We might also decide this is a better API than lazy sessions, or that it's worth having both APIs, but I simply cannot tell for now. I would like to avoid you releasing 0.32 with this new API being advertised, then us adding another similar API in the next release and confusing our users. This might be what we do anyway in the end, but it at least needs to be thought through.
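For reference, a hedged sketch of how LHDFStore might be used, assuming it behaves as a dict-like context manager over the file by analogy with pandas.HDFStore (the exact public API was still under discussion at this point; ndtest is larray's test-array helper):

```python
from larray import LHDFStore, ndtest  # LHDFStore as introduced in this PR

# dict-style access is an assumption borrowed from pandas.HDFStore
with LHDFStore('data.h5') as store:
    store['arr'] = ndtest((2, 3))  # write an array when needed
    arr = store['arr']             # ... and read it back
# the file is closed automatically when the block exits
```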
- moved LHDFStore to inout/hdf.py
- implemented PandasStorer and PytablesStorer
- updated LArray/Axis/Group.to_hdf
- removed Metadata.to_hdf and Metadata.from_hdf
- renamed PandasHDFHandler as HDFHandler
    engine = 'tables'
else:
    import tables
    handle = tables.open_file(filepath, mode='r')
wouldn't it be better to use a context manager here?
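What the suggestion amounts to, as a sketch (tables.File supports the with-statement, so the handle gets closed even if the inspection raises; filepath is a placeholder):

```python
import tables

filepath = 'data.h5'  # placeholder
with tables.open_file(filepath, mode='r') as handle:
    # inspect the file here to decide which engine produced it;
    # the handle is closed automatically, even on error
    root = handle.root
```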
I just had a good look at this again with fresh eyes, and I think there are too many abstraction layers in there:
# -> means "uses"
# ( ) means inherits from
Session -> HDFHandler(FileHandler) -> LHDFStore -> PytablesStorer(AbstractStorer)
                                                -> PandasStorer(AbstractStorer)
I imagined something much "flatter":
Session -> PytablesHDFHandler(FileHandler)
        -> PandasHDFHandler(FileHandler)
or (more likely) to avoid a bit of code duplication:
Session -> PytablesHDFHandler(HDFHandler(FileHandler))
        -> PandasHDFHandler(HDFHandler(FileHandler))
Note that nothing forbids us from having extra methods in HDFHandler and/or P*HDFHandler for HDF-specific stuff (if any is actually necessary). We could probably also make a few of the methods in FileHandler public, and add a few extra methods so that it can be used directly as a context manager, like you do with LHDFStore.

I don't understand why we need those two extra abstraction layers. I could understand one extra layer if we could not accommodate the HDF specificities with extra methods/attributes (but off the top of my head, I don't see why that would be the case).
I know this has been a while, but do you remember why you did it this way instead of implementing specific FileHandlers (and enhancing the FileHandler class/protocol as needed)?
PS: The LazySession stuff can come on top of the FileHandler paradigm, I think, so here I am happy I didn't go with Session subclasses for each type of file.
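A minimal sketch of the "flatter" layout proposed above, assuming a FileHandler protocol along the lines of larray's (the method names here are illustrative, not the actual ones):

```python
class FileHandler:
    """Generic open/read/close protocol (simplified)."""
    def __init__(self, fname):
        self.fname = fname

    def _open_for_read(self):
        raise NotImplementedError()

    def _read_item(self, key):
        raise NotImplementedError()

    def close(self):
        # both backends expose a .close() on their handle
        self.handle.close()

class HDFHandler(FileHandler):
    """HDF logic shared by both engines (key handling, metadata, ...)."""

class PytablesHDFHandler(HDFHandler):
    def _open_for_read(self):
        import tables
        self.handle = tables.open_file(self.fname, mode='r')

class PandasHDFHandler(HDFHandler):
    def _open_for_read(self):
        import pandas as pd
        self.handle = pd.HDFStore(self.fname, mode='r')
```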
@@ -6714,6 +6714,9 @@ def to_hdf(self, filepath, key):
         Path where the hdf file has to be written.
     key : str or Group
         Key (path) of the array within the HDF file (see Notes below).
+    engine: {'auto', 'tables', 'pandas'}, optional
+        Dump using `engine`. Use 'pandas' to update an HDF file generated with a LArray version previous to 0.31.
+        Defaults to 'auto' (use default engine if you don't know the LArray version used to produced the HDF file).
change "used to produced" to either "used to produce" or "which produced"
if group is not None:
    self._handle.remove_node(group, recursive=True)
paths = key.split('/')
# recursively create the parent groups
Doesn't the createparents argument do this?
https://www.pytables.org/usersguide/libref/file_class.html#tables.File.create_group
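Per the linked documentation, create_group can create missing parent groups itself, which would replace the manual key.split('/') loop shown above; a minimal example:

```python
import tables

with tables.open_file('data.h5', mode='a') as handle:
    # creates the intermediate /a and /a/b groups on the fly if missing
    group = handle.create_group('/a/b', 'c', createparents=True)
```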
Not yet finished, but I started the PR to open the discussion.