Native support for groupInfo slot #64

jonocarroll · 2020-02-28T14:10:40Z

I propose enabling native support for group information and submit a draft implementation.

One of the advantages of dplyr and tibble is support for groupwise operations. dplyr stores a tbl_df of group information; grouping columns with the non-empty combinations of levels (named by their respective column name in the data) and .rows which is a list of rownames to which the grouping results. e.g. in dplyr:

group_data(group_by(mtcars, am, gear))
# A tibble: 4 x 3
     am  gear .rows     
  <dbl> <dbl> <list>    
1     0     3 <int [15]>
2     0     4 <int [4]> 
3     1     4 <int [8]> 
4     1     5 <int [5]>

This approach has the drawback (in my opinion) that this listing of .rows must be updated with every operation which alters the number of rows (including [, filter, slice, rbind, etc... ). An alternative (and the approach used here) is to store only the names of the columns which should be involved in further groupwise operations.

str(DataFrame(mtcars[1:4,1:4], groupInfo = c("am", "gear")))
Formal class 'DFrame' [package "S4Vectors"] with 7 slots
  ..@ rownames       : chr [1:4] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
  ..@ nrows          : int 4
  ..@ groupInfo      : chr [1:2] "am" "gear"
  ..@ listData       :List of 4
  .. ..$ mpg : num [1:4] 21 21 22.8 21.4
  .. ..$ cyl : num [1:4] 6 6 4 6
  .. ..$ disp: num [1:4] 160 160 108 258
  .. ..$ hp  : num [1:4] 110 110 93 110
  ..@ elementType    : chr "ANY"
  ..@ elementMetadata: NULL
  ..@ metadata       : list()

I provide accessors and replacement functions for this slot, but leave any processing of the data to whichever code can make use of the information. I have added dplyr consistency only in the creation of a DataFrame from an already grouped tbl_df or conversion back to one

groupInfo(as(group_by(mtcars, am, gear), "DataFrame"))
[1] "am"   "gear"

This way, methods which act on individual columns of a DataFrame can delegate how to perform the groupwise action. Where this may be of significant benefit is e.g. plyranges (ping: @sa-lee) for which a GRanges column may be better suited to perform the groupwise operation internally rather than sliced into rows.

I have tested the implementation in this PR with a branch of DFplyr which aims to extend dplyr support to DataFrame: https://github.com/jonocarroll/DFplyr/tree/native_groupInfo and as far as I can tell this is successful.

This PR is intended as a draft and I welcome feedback. My implementation may not be ideal but hopefully serves as a starting point.

constructor support, getter, setter for groupInfo coercion to/from tbl_df/grouped_df show method presents group information documentation, example, NEWS updated version bump

lawremi · 2020-02-28T21:23:18Z

Thanks for keeping the ball rolling on this.

The reason they store the subscripts, of course, is that the grouping only needs to be computed once. In terms of downsides, besides the updating overhead that you mentioned, the subscripts also take up memory. If we did store the subscripts, we would want a compressed list, conceptually equivalent to the return value of base::grouping().

hpages · 2020-02-28T21:53:00Z

Oh my!
I hope you realize that changing the internal representation of DFrame objects will break all the serialized instances that exist around so will probably break hundreds of Bioconductor packages and a few dozens CRAN packages. Plus of course many standalone scripts that people keep around just in case they need to re-run their analyses. So this cannot be made lightly ;-)

If you want to pursue that route you would need to check at least a dozen of other Bioconductor infrastructure packages that sit on top of S4Vectors (IRanges, GenomicRanges, Biostrings, SummarizedExperiment, etc... see S4Vectors/inst/doc/HTS_core_package_stack.txt) and provide PRs that fix the broken ones.

Note that a much less disruptive route would be to use a metadata column to keep track of the grouping columns:

DF <- DataFrame(mtcars[1:4, ])
mcols(DF)$.grouping <- colnames(DF) %in% c("am", "gear")

DF
# DataFrame with 4 rows and 11 columns
#                      mpg       cyl      disp        hp      drat        wt
#                <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# Mazda RX4           21.0         6       160       110      3.90     2.620
# Mazda RX4 Wag       21.0         6       160       110      3.90     2.875
# Datsun 710          22.8         4       108        93      3.85     2.320
# Hornet 4 Drive      21.4         6       258       110      3.08     3.215
#                     qsec        vs        am      gear      carb
#                <numeric> <numeric> <numeric> <numeric> <numeric>
# Mazda RX4          16.46         0         1         4         4
# Mazda RX4 Wag      17.02         0         1         4         4
# Datsun 710         18.61         1         1         4         1
# Hornet 4 Drive     19.44         1         0         3         1

mcols(DF)
# DataFrame with 11 rows and 1 column
#      .grouping
#      <logical>
# mpg      FALSE
# cyl      FALSE
# disp     FALSE
# hp       FALSE
# drat     FALSE
# wt       FALSE
# qsec     FALSE
# vs       FALSE
# am        TRUE
# gear      TRUE
# carb     FALSE

Another benefit of this approach is that you don't need to modify extractCOLS() to make it aware of the new_groupInfo slot because extractCOLS() will do the right thing out-of-the box.

I don't think it's worth changing the interface of the DataFrame() constructor either. Having accessors to set/get the grouping columns should be enough.

About naming
groupInfo is too vague. I would suggest being explicit about the fact that you are recording the grouping columns. This is important to avoid any confusion with the existing grouping mechanism used for CompressedList derivatives where grouping of the rows is recorded via a Partitioning object (the CompressedList class is implemented in the IRanges package).

About coding style

Please touch only the parts you need to touch so when we look at your PR we don't get distracted by your attempts at improving/reorganizing existing things that are not related to your proposal. So for example, instead of:

-         representation(
-                        rownames = "character_OR_NULL",
-                        nrows = "integer"
-                        ),
-         prototype(rownames = NULL,
-                   nrows = 0L,
-                   listData = structure(list(), names = character())),
+         slots = c(rownames = "character_OR_NULL",
+                   nrows = "integer",
+                   groupInfo = "character_OR_NULL"),

your change should just be:

         representation(
                        rownames = "character_OR_NULL",
-                       nrows = "integer"
+                      nrows = "integer",
+                      groupInfo = "character_OR_NULL"
                        ),

Really?

if (!is.null(groupInfo)) {
  groups <- groupInfo
} else {
  groups <- NULL
}

The way I generally solve this problem is by doing

groups <- groupInfo

jonocarroll · 2020-02-28T23:23:19Z

Thanks for the speedy feedback, and apologies for my lack of expertise in S4 constructs.

I was under the (evidently incorrect) belief from conversations with @lawremi (which I have clearly misunderstood) that a new slot could be added for the group information with a prototype of NULL and not break existing structures. Breakage of anything is not my intention at all; this should be an additional piece of stored information where it can be added but default to NULL for all other instances. Testing with an object serialized from the master branch does not work in this PR so clearly changes are required.

My (excessive) changes to the setClass were from trying to add a slots argument which does not play nice with the representation argument (deprecated from R3.0.0)

Argument "representation" cannot be used if argument "slots" is supplied

but going in this direction was purely due to my lack of understanding of how to extend the class.

I'm embarrassed that in my many refactorings of my attempt I overlooked a trivial simplification. I can only hope that you'll believe me if I say it didn't start out like that.

On to the meat of it - I'm fine with using mcols for this purpose. It seems to solve all the immediate issues. Part of the reason for looking to make a change in this repo was that I was hoping to find a canonical place to store the grouping information - my current master branch of DFplyr stores this in an attribute of listData but the two biggest issues with that were:

show does not know about this. Having a canonical place to store this information means show can be modified in this repo to check for it and display grouping information (as per this PR)
There was no enforced contract for anyone else to use this space to store grouping information, so it was unlikely to become a standard. This means everyone needs to write their own getters and setters.

I used the name groupInfo because originally I was going to follow dplyr's methodology of storing the unique values of the columns and the corresponding rows. With a reduction to just the columns I agree the name should change. I didn't use grouping as I intended the getter to be named the same and this would conflict with base::grouping.

I'll clean this PR up to a) use the simpler change to setClass and b) move "grouping" to mcols(x) then we can hopefully discuss further. Thank you.

lawremi · 2020-02-28T23:29:55Z

Serialization should not be a problem, since we're adding an "optional" slot, but using the mcols()would be a good way to prototype these ideas before making changes that would be difficult to reverse.

hpages · 2020-02-29T00:03:01Z

@jonocarroll Sounds good. If you move "grouping" to mcols(x) you don't need to touch setClass at all.

@lawremi After adding a slot all the serialized instances become invalid. So it's serious business, even if the slot is "optional". In any case, an updateObject() method would be a must.

This is a good opportunity for me to stress again the importance of adding a mechanism in R that will allow us to automatically fix out-of-sync serialized S4 instances at load time. It's very easy to do.

jonocarroll · 2020-02-29T04:35:30Z

Implementation up to date in DFplyr: https://github.com/jonocarroll/DFplyr/tree/native_groupInfo

I've removed any changes to the object or constructors. I've added a bit of safety checking in how groupCols() updates. I'm not too fussed about the naming, so if these names aren't ideal we can change them.

lawremi · 2020-02-29T15:58:08Z

@hpages How is the patch coming along for object updating? Good to hear that it's easy, because it was giving Gabe major headaches.

hpages · 2020-02-29T20:31:34Z

Good to hear that it's easy, because it was giving Gabe major headaches.

Well the approach I'm proposing is simple and should be straightforward to implement. Still waiting for the green light to start working on a patch. Sounds like maybe I should just do it and submit to the Bugzilla tracker.

jonocarroll added 2 commits February 17, 2020 10:27

support for storage of groupInfo in a new slot

ef1c254

constructor support, getter, setter for groupInfo coercion to/from tbl_df/grouped_df show method presents group information documentation, example, NEWS updated version bump

groupInfo should be a character vector

b37d45c

move groupCol information to mcols()

70a072f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Native support for groupInfo slot #64

Native support for groupInfo slot #64

jonocarroll commented Feb 28, 2020

lawremi commented Feb 28, 2020

hpages commented Feb 28, 2020

jonocarroll commented Feb 28, 2020

lawremi commented Feb 28, 2020

hpages commented Feb 29, 2020

jonocarroll commented Feb 29, 2020

lawremi commented Feb 29, 2020

hpages commented Feb 29, 2020

Native support for groupInfo slot #64

Are you sure you want to change the base?

Native support for groupInfo slot #64

Conversation

jonocarroll commented Feb 28, 2020

lawremi commented Feb 28, 2020

hpages commented Feb 28, 2020

jonocarroll commented Feb 28, 2020

lawremi commented Feb 28, 2020

hpages commented Feb 29, 2020

jonocarroll commented Feb 29, 2020

lawremi commented Feb 29, 2020

hpages commented Feb 29, 2020