WIP: New MATLAB loader #4441

Closed
wants to merge 6 commits into from

anntzer
Contributor

@anntzer anntzer commented Jan 20, 2015

Re #4411 (loading of MATLAB classdef objects):
I gave in and just rewrote the entire MAT5 loader in pure Python -- it is now much shorter (I essentially got rid of the 800-line mio5_utils.pyx while keeping mio5.py nearly the same size) -- then added MATLAB object support (based on the notes of @mbauman) on top of that.

Some private APIs have been changed and docstrings still need to be updated.

Some issues with character encoding still need to be clarified.

Changed the API of ZlibInputStream to return empty reads at EOF rather
than raise IOError, like other streams.
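
To illustrate the EOF behaviour described in this commit message, here is a minimal sketch of a zlib-backed stream whose read returns an empty bytes object once the data is exhausted. The class name EmptyAtEOFZlibStream and its constructor arguments are hypothetical; this is not the ZlibInputStream code from the PR.

```python
import io
import zlib


class EmptyAtEOFZlibStream:
    """Hypothetical stand-in for a zlib input stream (not scipy's class)."""

    def __init__(self, fileobj, compressed_len):
        self._fileobj = fileobj
        self._remaining = compressed_len  # compressed bytes still to consume
        self._decomp = zlib.decompressobj()
        self._buffer = b""

    def read(self, n):
        # Decompress more input until n bytes are buffered or input runs out.
        while len(self._buffer) < n and self._remaining > 0:
            chunk = self._fileobj.read(min(8192, self._remaining))
            if not chunk:
                break
            self._remaining -= len(chunk)
            self._buffer += self._decomp.decompress(chunk)
        # A short (possibly empty) result signals EOF, as with file objects.
        out, self._buffer = self._buffer[:n], self._buffer[n:]
        return out


payload = zlib.compress(b"hello world")
stream = EmptyAtEOFZlibStream(io.BytesIO(payload), len(payload))
first = stream.read(5)
rest = stream.read(100)
at_eof = stream.read(10)  # empty bytes rather than an exception
```

Returning an empty read lets callers treat this stream like any other file object in a loop that stops on a short read.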
See test_classdef.py and gen_classdefmat.m.

References to other objects are not implemented yet.
@matthew-brett
Contributor

Did you run the benchmarks?

@anntzer
Contributor Author

anntzer commented Jan 20, 2015

Now I have. Memory usage is basically the same as the Cython version after providing ZlibInputStream with a readinto method and adding copy=False to calls to astype; speed is still pretty bad (~10x slower in the non-compressed case, ~30x slower in the compressed case).
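
A rough sketch of the two memory-saving tricks mentioned here, assuming a stream that offers the standard readinto method. io.BytesIO stands in for ZlibInputStream, and read_array is a hypothetical helper, not a function from the PR.

```python
import io

import numpy as np


def read_array(stream, dtype, count):
    """Read `count` items straight into a preallocated ndarray."""
    out = np.empty(count, dtype=dtype)
    # Expose the array's memory as raw bytes so the stream can fill it
    # in place, without an intermediate bytes object.
    nread = stream.readinto(out.view(np.uint8))
    if nread != out.nbytes:
        raise EOFError("stream ended before the array was filled")
    return out


raw = np.arange(4, dtype=np.float64).tobytes()
arr = read_array(io.BytesIO(raw), np.float64, 4)

# astype(copy=False) hands back the same array when no conversion is
# needed, and only allocates when the dtype really changes.
same = arr.astype(np.float64, copy=False)
converted = arr.astype(np.int32, copy=False)
```

Here `same` is the very same object as `arr`, while `converted` is a new array.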

GenericStream.readinto was made cpdef and its signature changed
accordingly to allow reading into a buffer that'll directly be turned
into an ndarray.
@matthew-brett
Contributor

I think you will find that you won't be able to optimize much more in Python, but I am happy to be proved wrong. I wrote the Cython reader because people were complaining rather vigorously that the original reader was too slow (about the same speed as you have now).

I am happy to port what you did to the current code, if it is not obvious how to do that.

@anntzer
Contributor Author

anntzer commented Jan 20, 2015

Yes, once stream.readinto(buffer) and flags % 0x100 become the bottlenecks I have little hope of doing better.

Re: simplification: I read and yield tags in a big generator loop (which may or may not be a good idea...) in MatFileReader._read_iter, and I got rid of the _matrix_reader and _file_reader attributes, instead passing a different stream argument to a recursive call to MatFile5Reader._read_iter. I also changed unicode decoding to something much simpler, because it seems that the current interpretation by loadmat and savemat is actually not compatible with what MATLAB uses (see #4431) -- just calling (uni)chr on each character seems correct. Anyways, not a big deal but at least this way I could understand what was happening and write the object loading part.
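
For readers unfamiliar with the approach, the "read and yield tags in a generator loop" idea can be sketched roughly as follows. This is not the _read_iter code from the PR: iter_tags is a hypothetical name, and the sketch assumes a little-endian file and the usual MAT5 layout (an 8-byte tag, a "small data element" variant that packs the byte count into the upper 16 bits of the first word, and padding of regular elements to an 8-byte boundary).

```python
import io
import struct


def iter_tags(stream):
    """Yield (mdtype, data) pairs until the stream runs out (sketch only)."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        first, second = struct.unpack("<II", header)
        small_nbytes = first >> 16
        if small_nbytes:
            # Small data element: type in the low 16 bits, payload packed
            # into the last 4 bytes of the tag itself.
            yield first & 0xFFFF, header[4:4 + small_nbytes]
        else:
            # Regular element: `first` is the type, `second` the byte count.
            data = stream.read(second)
            stream.read(-second % 8)  # skip padding to the 8-byte boundary
            yield first, data


regular = struct.pack("<II", 5, 12) + bytes(range(12)) + b"\x00" * 4
small = struct.pack("<I", (3 << 16) | 2) + b"abc\x00"
tags = list(iter_tags(io.BytesIO(regular + small)))
```

Nested elements can then be handled by wrapping the yielded data in a new stream and calling the same generator recursively, which matches the "pass a different stream argument" idea in spirit.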

If you want to port the object loading part to the old version (I honestly don't have the courage to do this now), you may want to have a look at _read_minimat and _parse_props, which are basically a Python implementation of @mbauman's Julia notebook, as well as the matrix_cls == mxOPAQUE_CLASS of the main loop. In that part, you should notice the if self._is_minimat: part: I found that the objects are actually stored using a slightly different format in the subsystem than in the rest of the file.

Or, perhaps we can just cythonize the new code...

Let me know if I can be of any help.

... which I had missed initially.
@anntzer
Contributor Author

anntzer commented Feb 5, 2015

On another topic, I was worried about what to do in case of self-containing objects, but apparently MATLAB avoids creating them so we may be fine:

>> a = simpleclass

a = 

  simpleclass with properties:

    prop: []

>> a.prop = a;
>> a

a = 

  simpleclass with properties:

    prop: [1x1 simpleclass]

>> a.prop

ans = 

  simpleclass with properties:

    prop: []

@anntzer
Contributor Author

anntzer commented Feb 5, 2015

Taking back what I just said; self-containing objects can be created by MATLAB too, one just needs to derive the class from handle (to have reference rather than value semantics). Fortunately, object ndarrays don't have any issues with that either. Not sure I'll bother supporting them though.
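
A tiny illustration of why object ndarrays cope with this: an element of an object array is just a reference, so the array can point back to itself. The one-element array below is only a stand-in and is not how the loader represents MATLAB objects.

```python
import numpy as np

# Stand-in for a handle object whose single property refers to itself.
obj = np.empty(1, dtype=object)
obj[0] = obj  # stores a reference to the array, not a copy

is_self_reference = obj[0] is obj
# Following the reference any number of times lands on the same object.
still_same = obj[0][0][0] is obj
```

Recursive operations on such an array (printing it, deep-copying it naively) are a separate matter and would need care.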

@mbauman

mbauman commented Feb 6, 2015

I'm glad to see you're finding my notes useful. I managed to access the data I needed and then the project fell off my desk. If you have any questions about my notes or run into any inconsistencies you are very welcome to contact me.

It's really exciting to see support for this in any language!

@anntzer
Contributor Author

anntzer commented Feb 7, 2015

Some more fun on MAT files: object references to other objects are represented as uint32 column arrays of 6 elements, with the first element equal to 0xdd000000. I was wondering how to distinguish them from actual uint32 arrays that could also be saved in the file. Answer: MATLAB doesn't distinguish them!

% gen_mat.m
obj1 = simpleclass; % defined as a class with a single property, 'prop'
obj1.prop = uint32([3707764736; 2; 1; 1; 1; 1]); % magic values, refer to first object in file (obj1 itself)
obj2 = simpleclass;
obj2.prop = uint32([3707764736; 0; 0; 0; 0; 0]); % invalid reference
obj3 = simpleclass;
obj3.prop = uint32([3707764737; 0; 0; 0; 0; 0]); % this guy works fine
save tmp obj1 obj2 obj3;
clear;
load tmp;
obj1, obj1.prop, obj2, obj3

$ matlab -nodesktop -r 'gen_mat; exit;'

                               < M A T L A B (R) >
                     Copyright 1984-2014 The MathWorks, Inc.
                       R2014a (8.3.0.532) 64-bit (glnxa64)
                                February 11, 2014


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.


obj1 = 

  simpleclass with properties:

    prop: [1x1 simpleclass]


ans = 

  simpleclass with properties:

    prop: [1x1 simpleclass]


obj2 = 

  simpleclass with properties:

    prop: []


obj3 = 

  simpleclass with properties:

    prop: [6x1 uint32]
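
On the Python side, the same heuristic could be sketched like this. The function name looks_like_object_reference is hypothetical and the PR's actual check is not shown in this thread. Like MATLAB, the check cannot tell a real reference from ordinary uint32 data that happens to start with the magic value.

```python
import numpy as np

REFERENCE_MAGIC = 0xDD000000  # 3707764736, the first element in the example


def looks_like_object_reference(value):
    """Return True for a 6x1 uint32 column whose first entry is the magic."""
    arr = np.asarray(value)
    return bool(
        arr.dtype == np.uint32
        and arr.shape == (6, 1)
        and arr[0, 0] == REFERENCE_MAGIC
    )


ref = np.array([[3707764736], [2], [1], [1], [1], [1]], dtype=np.uint32)
plain = np.array([[3707764737], [0], [0], [0], [0], [0]], dtype=np.uint32)
flags = (looks_like_object_reference(ref), looks_like_object_reference(plain))
```

This mirrors the obj1 and obj3 cases in the MATLAB session: the first array would be taken for a reference, the second would be left alone as plain data.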

@mbauman

mbauman commented Feb 7, 2015

:) Rather remarkable, isn't it? If I remember right they don't even check the length. See JuliaIO/MAT.jl#23 (comment)

@anntzer
Contributor Author

anntzer commented Feb 7, 2015

@mbauman I have finally figured out a case where "segment 5" is populated: when saving classes with dynamic properties, e.g. below.

obj1 = dynamicclass; % dynamicclass derives from dynamicprops
addprop(obj1, 'dprop');
save testdynamic obj1;

Not that I care much about them, and it's not totally clear how to represent such objects (and especially arrays of such objects with varying defined properties) on the numpy side of things.

@anntzer
Contributor Author

anntzer commented Mar 24, 2015

FWIW I repackaged the whole thing as an independent package (http://github.com/anntzer/matloader) for my own use, so that you can have the best of both worlds.

@rgommers rgommers added scipy.io needs-work Items that are pending response from the author labels Mar 24, 2015
@rgommers
Member

No movement on this in 3.5 years and this is available as https://github.com/anntzer/matloader. Hence closing this PR. Thanks @anntzer, all!

@rgommers rgommers closed this Aug 31, 2018
@anntzer anntzer deleted the new-matlab-loader branch August 31, 2018 18:05