Native support for groupInfo slot #64
base: devel
- constructor support, getter, setter for groupInfo
- coercion to/from tbl_df/grouped_df
- show method presents group information
- documentation, example, NEWS updated
- version bump
Thanks for keeping the ball rolling on this. The reason they store the subscripts, of course, is that the grouping only needs to be computed once. In terms of downsides, besides the updating overhead that you mentioned, the subscripts also take up memory. If we did store the subscripts, we would want a compressed list, conceptually equivalent to the return value of |
Oh my! If you want to pursue that route you would need to check at least a dozen of other Bioconductor infrastructure packages that sit on top of S4Vectors (IRanges, GenomicRanges, Biostrings, SummarizedExperiment, etc... see Note that a much less disruptive route would be to use a metadata column to keep track of the grouping columns:
Another benefit of this approach is that you don't need to modify I don't think it's worth changing the interface of the About naming About coding style
Thanks for the speedy feedback, and apologies for my lack of expertise in S4 constructs. I was under the (evidently incorrect) belief from conversations with @lawremi (which I have clearly misunderstood) that a new slot could be added for the group information with a prototype of NULL and not break existing structures. Breakage of anything is not my intention at all; this should be an additional piece of stored information where it can be added but default to NULL for all other instances. Testing with an object serialized from the master branch does not work in this PR so clearly changes are required. My (excessive) changes to the
but going in this direction was purely due to my lack of understanding of how to extend the class. I'm embarrassed that in my many refactorings of my attempt I overlooked a trivial simplification. I can only hope that you'll believe me if I say it didn't start out like that. On to the meat of it - I'm fine with using
I used the name I'll clean this PR up to a) use the simpler change to |
Serialization should not be a problem, since we're adding an "optional" slot, but using the |
@jonocarroll Sounds good. If you move "grouping" to @lawremi After adding a slot all the serialized instances become invalid. So it's serious business, even if the slot is "optional". In any case, an This is a good opportunity for me to stress again the importance of adding a mechanism in R that will allow us to automatically fix out-of-sync serialized S4 instances at load time. It's very easy to do.
Implementation up to date in I've removed any changes to the object or constructors. I've added a bit of safety checking in how |
@hpages How is the patch coming along for object updating? Good to hear that it's easy, because it was giving Gabe major headaches.
Well the approach I'm proposing is simple and should be straightforward to implement. Still waiting for the green light to start working on a patch. Sounds like maybe I should just do it and submit to the Bugzilla tracker.
I propose enabling native support for group information and submit a draft implementation.
One of the advantages of `dplyr` and `tibble` is support for groupwise operations. `dplyr` stores a `tbl_df` of group information: the grouping columns, holding the non-empty combinations of levels (named by their respective column names in the data), and `.rows`, a list of the rows to which each grouping applies. e.g. in `dplyr`:

This approach has the drawback (in my opinion) that the `.rows` listing must be updated by every operation which alters the number of rows (including `[`, `filter`, `slice`, `rbind`, etc.). An alternative (and the approach used here) is to store only the names of the columns which should be involved in further groupwise operations.

I provide accessors and replacement functions for this slot, but leave any processing of the data to whichever code can make use of the information. I have added `dplyr` consistency only in the creation of a `DataFrame` from an already grouped `tbl_df`, or conversion back to one.

This way, methods which act on individual columns of a `DataFrame` can delegate how to perform the groupwise action. Where this may be of significant benefit is e.g. `plyranges` (ping: @sa-lee), for which a `GRanges` column may be better suited to perform the groupwise operation internally rather than being sliced into rows.

I have tested the implementation in this PR with a branch of `DFplyr`, which aims to extend `dplyr` support to `DataFrame`: https://github.com/jonocarroll/DFplyr/tree/native_groupInfo and as far as I can tell this is successful.

This PR is intended as a draft and I welcome feedback. My implementation may not be ideal but hopefully serves as a starting point.
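To make the trade-off between storing per-group row indices and storing only the grouping column names concrete, here is a minimal base R sketch. It does not use `DataFrame`, `dplyr`, or the code in this PR; the toy data and the helper `rows_by_group()` are hypothetical names introduced only for illustration.

```r
# Toy data: `g` is the grouping column (hypothetical example data).
df <- data.frame(g = c("a", "b", "a", "b", "a"), x = 1:5)

# Approach 1: store the row indices of each group up front
# (similar in spirit to a `.rows` list).
stored_rows <- split(seq_len(nrow(df)), df$g)

# Approach 2: store only the names of the grouping columns.
group_cols <- "g"

# Hypothetical helper: recompute the per-group row indices on demand
# from the stored column names.
rows_by_group <- function(data, cols) {
  split(seq_len(nrow(data)), data[cols], drop = TRUE)
}

# Subset the rows, as `[`, filter or slice would.
df2 <- df[df$x > 2, , drop = FALSE]

# `stored_rows` still refers to rows of the original `df` and would have
# to be rebuilt after the subset; `group_cols` needs no update at all.
fresh <- rows_by_group(df2, group_cols)
print(stored_rows)
print(fresh)
```

The cost of the second approach is that the grouping is recomputed each time it is needed, which is the point raised in the review comments about why storing the subscripts can be attractive.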