Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native support for groupInfo slot #64

Open
wants to merge 3 commits into
base: devel
Choose a base branch
from

Conversation

jonocarroll
Copy link

I propose enabling native support for group information and submit a draft implementation.

One of the advantages of dplyr and tibble is support for groupwise operations. dplyr stores a tbl_df of group information; grouping columns with the non-empty combinations of levels (named by their respective column name in the data) and .rows which is a list of rownames to which the grouping results. e.g. in dplyr:

group_data(group_by(mtcars, am, gear))
# A tibble: 4 x 3
     am  gear .rows     
  <dbl> <dbl> <list>    
1     0     3 <int [15]>
2     0     4 <int [4]> 
3     1     4 <int [8]> 
4     1     5 <int [5]> 

This approach has the drawback (in my opinion) that this listing of .rows must be updated with every operation which alters the number of rows (including [, filter, slice, rbind, etc... ). An alternative (and the approach used here) is to store only the names of the columns which should be involved in further groupwise operations.

str(DataFrame(mtcars[1:4,1:4], groupInfo = c("am", "gear")))
Formal class 'DFrame' [package "S4Vectors"] with 7 slots
  ..@ rownames       : chr [1:4] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
  ..@ nrows          : int 4
  ..@ groupInfo      : chr [1:2] "am" "gear"
  ..@ listData       :List of 4
  .. ..$ mpg : num [1:4] 21 21 22.8 21.4
  .. ..$ cyl : num [1:4] 6 6 4 6
  .. ..$ disp: num [1:4] 160 160 108 258
  .. ..$ hp  : num [1:4] 110 110 93 110
  ..@ elementType    : chr "ANY"
  ..@ elementMetadata: NULL
  ..@ metadata       : list()

I provide accessors and replacement functions for this slot, but leave any processing of the data to whichever code can make use of the information. I have added dplyr consistency only in the creation of a DataFrame from an already grouped tbl_df or conversion back to one

groupInfo(as(group_by(mtcars, am, gear), "DataFrame"))
[1] "am"   "gear"

This way, methods which act on individual columns of a DataFrame can delegate how to perform the groupwise action. Where this may be of significant benefit is e.g. plyranges (ping: @sa-lee) for which a GRanges column may be better suited to perform the groupwise operation internally rather than sliced into rows.

I have tested the implementation in this PR with a branch of DFplyr which aims to extend dplyr support to DataFrame: https://github.com/jonocarroll/DFplyr/tree/native_groupInfo and as far as I can tell this is successful.

This PR is intended as a draft and I welcome feedback. My implementation may not be ideal but hopefully serves as a starting point.

constructor support, getter, setter for groupInfo
coercion to/from tbl_df/grouped_df
show method presents group information
documentation, example, NEWS updated
version bump
@lawremi
Copy link
Collaborator

lawremi commented Feb 28, 2020

Thanks for keeping the ball rolling on this.

The reason they store the subscripts, of course, is that the grouping only needs to be computed once. In terms of downsides, besides the updating overhead that you mentioned, the subscripts also take up memory. If we did store the subscripts, we would want a compressed list, conceptually equivalent to the return value of base::grouping().

@hpages
Copy link
Contributor

hpages commented Feb 28, 2020

Oh my!
I hope you realize that changing the internal representation of DFrame objects will break all the serialized instances that exist around so will probably break hundreds of Bioconductor packages and a few dozens CRAN packages. Plus of course many standalone scripts that people keep around just in case they need to re-run their analyses. So this cannot be made lightly ;-)

If you want to pursue that route you would need to check at least a dozen of other Bioconductor infrastructure packages that sit on top of S4Vectors (IRanges, GenomicRanges, Biostrings, SummarizedExperiment, etc... see S4Vectors/inst/doc/HTS_core_package_stack.txt) and provide PRs that fix the broken ones.

Note that a much less disruptive route would be to use a metadata column to keep track of the grouping columns:

DF <- DataFrame(mtcars[1:4, ])
mcols(DF)$.grouping <- colnames(DF) %in% c("am", "gear")

DF
# DataFrame with 4 rows and 11 columns
#                      mpg       cyl      disp        hp      drat        wt
#                <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# Mazda RX4           21.0         6       160       110      3.90     2.620
# Mazda RX4 Wag       21.0         6       160       110      3.90     2.875
# Datsun 710          22.8         4       108        93      3.85     2.320
# Hornet 4 Drive      21.4         6       258       110      3.08     3.215
#                     qsec        vs        am      gear      carb
#                <numeric> <numeric> <numeric> <numeric> <numeric>
# Mazda RX4          16.46         0         1         4         4
# Mazda RX4 Wag      17.02         0         1         4         4
# Datsun 710         18.61         1         1         4         1
# Hornet 4 Drive     19.44         1         0         3         1

mcols(DF)
# DataFrame with 11 rows and 1 column
#      .grouping
#      <logical>
# mpg      FALSE
# cyl      FALSE
# disp     FALSE
# hp       FALSE
# drat     FALSE
# wt       FALSE
# qsec     FALSE
# vs       FALSE
# am        TRUE
# gear      TRUE
# carb     FALSE

Another benefit of this approach is that you don't need to modify extractCOLS() to make it aware of the new_groupInfo slot because extractCOLS() will do the right thing out-of-the box.

I don't think it's worth changing the interface of the DataFrame() constructor either. Having accessors to set/get the grouping columns should be enough.

About naming
groupInfo is too vague. I would suggest being explicit about the fact that you are recording the grouping columns. This is important to avoid any confusion with the existing grouping mechanism used for CompressedList derivatives where grouping of the rows is recorded via a Partitioning object (the CompressedList class is implemented in the IRanges package).

About coding style

  • Please touch only the parts you need to touch so when we look at your PR we don't get distracted by your attempts at improving/reorganizing existing things that are not related to your proposal. So for example, instead of:

    -         representation(
    -                        rownames = "character_OR_NULL",
    -                        nrows = "integer"
    -                        ),
    -         prototype(rownames = NULL,
    -                   nrows = 0L,
    -                   listData = structure(list(), names = character())),
    +         slots = c(rownames = "character_OR_NULL",
    +                   nrows = "integer",
    +                   groupInfo = "character_OR_NULL"),
    

    your change should just be:

             representation(
                            rownames = "character_OR_NULL",
    -                       nrows = "integer"
    +                      nrows = "integer",
    +                      groupInfo = "character_OR_NULL"
                            ),
    
  • Really?

    if (!is.null(groupInfo)) {
      groups <- groupInfo
    } else {
      groups <- NULL
    }
    

    The way I generally solve this problem is by doing

    groups <- groupInfo
    

@jonocarroll
Copy link
Author

Thanks for the speedy feedback, and apologies for my lack of expertise in S4 constructs.

I was under the (evidently incorrect) belief from conversations with @lawremi (which I have clearly misunderstood) that a new slot could be added for the group information with a prototype of NULL and not break existing structures. Breakage of anything is not my intention at all; this should be an additional piece of stored information where it can be added but default to NULL for all other instances. Testing with an object serialized from the master branch does not work in this PR so clearly changes are required.

My (excessive) changes to the setClass were from trying to add a slots argument which does not play nice with the representation argument (deprecated from R3.0.0)

Argument "representation" cannot be used if argument "slots" is supplied

but going in this direction was purely due to my lack of understanding of how to extend the class.

I'm embarrassed that in my many refactorings of my attempt I overlooked a trivial simplification. I can only hope that you'll believe me if I say it didn't start out like that.

On to the meat of it - I'm fine with using mcols for this purpose. It seems to solve all the immediate issues. Part of the reason for looking to make a change in this repo was that I was hoping to find a canonical place to store the grouping information - my current master branch of DFplyr stores this in an attribute of listData but the two biggest issues with that were:

  • show does not know about this. Having a canonical place to store this information means show can be modified in this repo to check for it and display grouping information (as per this PR)
  • There was no enforced contract for anyone else to use this space to store grouping information, so it was unlikely to become a standard. This means everyone needs to write their own getters and setters.

I used the name groupInfo because originally I was going to follow dplyr's methodology of storing the unique values of the columns and the corresponding rows. With a reduction to just the columns I agree the name should change. I didn't use grouping as I intended the getter to be named the same and this would conflict with base::grouping.

I'll clean this PR up to a) use the simpler change to setClass and b) move "grouping" to mcols(x) then we can hopefully discuss further. Thank you.

@lawremi
Copy link
Collaborator

lawremi commented Feb 28, 2020

Serialization should not be a problem, since we're adding an "optional" slot, but using the mcols()would be a good way to prototype these ideas before making changes that would be difficult to reverse.

@hpages
Copy link
Contributor

hpages commented Feb 29, 2020

@jonocarroll Sounds good. If you move "grouping" to mcols(x) you don't need to touch setClass at all.

@lawremi After adding a slot all the serialized instances become invalid. So it's serious business, even if the slot is "optional". In any case, an updateObject() method would be a must.

This is a good opportunity for me to stress again the importance of adding a mechanism in R that will allow us to automatically fix out-of-sync serialized S4 instances at load time. It's very easy to do.

@jonocarroll
Copy link
Author

Implementation up to date in DFplyr: https://github.com/jonocarroll/DFplyr/tree/native_groupInfo

I've removed any changes to the object or constructors. I've added a bit of safety checking in how groupCols() updates. I'm not too fussed about the naming, so if these names aren't ideal we can change them.

@lawremi
Copy link
Collaborator

lawremi commented Feb 29, 2020

@hpages How is the patch coming along for object updating? Good to hear that it's easy, because it was giving Gabe major headaches.

@hpages
Copy link
Contributor

hpages commented Feb 29, 2020

Good to hear that it's easy, because it was giving Gabe major headaches.

Well the approach I'm proposing is simple and should be straightforward to implement. Still waiting for the green light to start working on a patch. Sounds like maybe I should just do it and submit to the Bugzilla tracker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants