implement Dataset API #1017

Open
gdementen opened this issue Sep 14, 2022 · 0 comments
gdementen commented Sep 14, 2022

Rethink the whole way we interact with data: Session, CheckedSession, FileHandler, LazySession (#727), open_excel, ... See also the refactoring in #761 and #614.

Dataset API:

  • __init__(connect_string, max_memory=None, **kwargs) -- connect_string is a file path or connection string; kwargs are passed to the underlying Dataset implementation (compression option, Excel option, ...). If max_memory is not None, the Dataset will transparently flush some of its content (probably based on LRU) to "disk" when more memory is needed.
  • open(**kwargs) -- open/connect to the underlying storage. Kwargs here override those passed in __init__. Normally called via __enter__.
  • __enter__ and __exit__ (to be usable as a context manager)
  • read(key=None) -- read a single key, multiple keys (when key is a list), or everything (if key is None) and return the values. I am unsure whether this explicit method makes sense. Maybe __getitem__, with an optional load(), is enough.
  • load(key=None) -- load a single key, multiple keys (when key is a list), or everything (if key is None) and return nothing.
  • open_key(key=None) -- in the future, for returning a lazy object which will load data when actually accessed. It could potentially load only part of that key (array/...). This needs further thought.
  • __getattr__ -> forwards to __getitem__
  • __getitem__(key) -> equivalent to load(key) if the key is not loaded yet, then returns the array (or use open_key(key) instead???)
  • __setitem__(key, value) -> add a new value or change an existing one.
  • close() -- close the file/connection to the underlying storage. Normally called via __exit__.
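
To make the list above more concrete, here is a minimal sketch of how such a class could fit together. It is only an illustration under several assumptions: the `_connect` hook, the `_resolve_keys` helper, the `MemoryDataset` subclass and the `_FAKE_FILES` dict are hypothetical names invented for this example, `max_memory` is accepted but not enforced, `__setitem__` writes through to storage immediately, and `open_key` is left out because its design is still open.

```python
class Dataset:
    """Hypothetical sketch of the proposed API, not an actual implementation."""

    def __init__(self, connect_string, max_memory=None, **kwargs):
        self.connect_string = connect_string
        self.max_memory = max_memory  # accepted but NOT enforced in this sketch
        self._init_kwargs = kwargs
        self._storage = None          # mapping-like handle, set by open()
        self._loaded = {}             # keys already read into memory

    def _connect(self, options):
        # Hypothetical hook: subclasses return a mapping-like storage object.
        raise NotImplementedError

    def open(self, **kwargs):
        # kwargs given here override those passed to __init__
        options = {**self._init_kwargs, **kwargs}
        self._storage = self._connect(options)
        return self

    def close(self):
        self._storage = None

    def __enter__(self):
        return self.open()

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions

    def _resolve_keys(self, key):
        if key is None:
            return list(self._storage)
        return key if isinstance(key, list) else [key]

    def load(self, key=None):
        # Load one key, a list of keys, or everything; returns nothing.
        for k in self._resolve_keys(key):
            if k not in self._loaded:
                self._loaded[k] = self._storage[k]

    def read(self, key=None):
        # Same as load(), but returns the values.
        self.load(key)
        if key is None or isinstance(key, list):
            return {k: self._loaded[k] for k in self._resolve_keys(key)}
        return self._loaded[key]

    def __getitem__(self, key):
        if key not in self._loaded:
            self.load(key)
        return self._loaded[key]

    def __setitem__(self, key, value):
        # Assumption: write through to storage immediately.
        self._loaded[key] = value
        self._storage[key] = value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Stand-in for real files or connections, for illustration only.
_FAKE_FILES = {"demo": {"pop": [1, 2, 3], "births": [4, 5]}}


class MemoryDataset(Dataset):
    def _connect(self, options):
        return _FAKE_FILES[self.connect_string]


with MemoryDataset("demo") as ds:
    pop = ds["pop"]        # loaded on first access
    births = ds.births     # attribute access forwards to __getitem__
    ds["deaths"] = [0]     # add a new value
    everything = ds.read() # read all keys
```

In this sketch, values that were loaded stay available after `close()` because they are cached in memory; whether that is desirable is one of the points to decide.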

Misc thoughts:

  • I think excel.Workbook should be a subclass of Dataset
  • We could/should also implement a generic "read" top-level function which would open a dataset, read the array and close it, to replace/complement the read_* functions.
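
The generic top-level function could be as small as the following sketch. Everything here is an assumption made for illustration: the `read` signature, the `dataset_cls` parameter, the `StubDataset` class and the `_FAKE_STORAGE` dict are hypothetical, and the stub only implements the parts of the Dataset API that the function relies on (context manager plus `__getitem__`).

```python
# Stand-in for real storage, for illustration only.
_FAKE_STORAGE = {"demo": {"pop": [1, 2, 3]}}


class StubDataset:
    """Hypothetical stand-in exposing only what read() needs."""

    def __init__(self, connect_string, **kwargs):
        self.connect_string = connect_string
        self.is_open = False

    def __enter__(self):
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.is_open = False
        return False

    def __getitem__(self, key):
        if not self.is_open:
            raise RuntimeError("dataset is closed")
        return _FAKE_STORAGE[self.connect_string][key]


def read(connect_string, key, dataset_cls=StubDataset, **kwargs):
    """Open a dataset, read one key and close the dataset again."""
    with dataset_cls(connect_string, **kwargs) as dataset:
        return dataset[key]


pop = read("demo", "pop")
```

The dataset is closed by the `with` block even if reading the key raises, which is the main reason to build such a function on top of the context manager protocol.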