This repository has been archived by the owner on Jan 27, 2023. It is now read-only.

CMS NanoAOD interface #45

Open
jpivarski opened this issue Mar 11, 2019 · 9 comments
Labels
enhancement New feature or request

Comments

@jpivarski
Member

There should be a mechanism that recognizes a TTree as NanoAOD and presents a virtual, formatted view of the data, using knowledge of NanoAOD idioms. For instance, Muon_* should be collected into a single jagged table called muons with the muon branches as its columns. It should use VirtualArrays, so that you can carry an array of muons around without having loaded all of the branches. References between particles and jets—expressed as integer indexes in NanoAOD—should be IndexedArrays. I'm on the fence about making them ChunkedArrays at the basket level—that may be too small. Perhaps they could be ChunkedArrays at the file level (or a function for loading them that takes the chunking size as an option).
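A rough sketch of the grouping idea, not the eventual uproot interface: branch names are split on the first underscore into collections, and each column is read only on first access. `group_branches`, `LazyCollection`, and the `loader` callable are hypothetical names standing in for a VirtualArray-backed table, and the branch list is a toy.

```python
from collections import defaultdict


def group_branches(branch_names):
    """Group 'Muon_pt'-style names into {'Muon': ['pt', ...]} (hypothetical helper)."""
    groups = defaultdict(list)
    for name in branch_names:
        prefix, sep, column = name.partition("_")
        if sep:  # names without an underscore (counters, run, event) are skipped
            groups[prefix].append(column)
    return dict(groups)


class LazyCollection:
    """Reads a column only when it is first requested, then caches it."""

    def __init__(self, prefix, columns, loader):
        self._prefix = prefix
        self._columns = set(columns)
        self._loader = loader  # assumed: callable(branch_name) -> array
        self._cache = {}

    def __getitem__(self, column):
        if column not in self._columns:
            raise KeyError(column)
        if column not in self._cache:
            self._cache[column] = self._loader(self._prefix + "_" + column)
        return self._cache[column]


# Toy stand-in for reading a branch from a file; it records what was read.
loaded = []


def fake_loader(branch_name):
    loaded.append(branch_name)
    return [1.0, 2.0, 3.0]


branches = ["nMuon", "Muon_pt", "Muon_eta", "Jet_pt", "run"]
groups = group_branches(branches)
muons = LazyCollection("Muon", groups["Muon"], fake_loader)
muons["pt"]  # triggers one read
muons["pt"]  # served from the cache
```

Only `Muon_pt` is ever read here; `Muon_eta` stays untouched until someone asks for it, which is the property the VirtualArrays are meant to provide.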

This was inspired by scikit-hep/awkward-0.x#95.

@jpivarski
Member Author

The links I'm referring to are for cross-cleaning.

@guitargeek
Contributor

This is such a great proposal! I am doing something similar in my analysis, but unfortunately there is a large overhead when loading NanoAODs because individual columns are spread over several files accessed via xrootd (about 20 per dataset). You may want to keep this in mind when designing a solution. Some day, I hope, CMS data will be stored in a columnar format anyway!

Some more things I observed when working with awkward and NanoAOD:

  1. Cross-cleaning info in NanoAOD is not really necessary, since cross-cleaning is so fast and easy with awkward and uproot-methods, as this example of a quick TnP study on Run 2 data shows: https://github.com/guitargeek/geeksw/blob/master/examples/electron_tnp.ipynb (ln [17]). The cross-cleaning information could be dropped to save space.

  2. One could save space in NanoAOD by saving branches directly in the jagged-table style described here: one flat tree per object type (electrons, muons), with the "starts" and "stops" for each event and object stored only once in the main event tree. Right now, this information is stored in every branch, which is redundant because many branches belong to the same object and share the same starts and stops.
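To illustrate the first point, here is a minimal single-event sketch of cross-cleaning in plain NumPy. It is not the code from the linked notebook; the function names, the toy kinematics, and the ΔR threshold of 0.4 are assumptions chosen for illustration.

```python
import numpy as np


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the phi difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)


def clean_jets(jet_eta, jet_phi, ele_eta, ele_phi, min_dr=0.4):
    """Boolean mask: True for jets farther than min_dr from every electron."""
    if len(ele_eta) == 0:
        return np.ones(len(jet_eta), dtype=bool)
    # Broadcast to a (n_jets, n_electrons) matrix of distances.
    dr = delta_r(jet_eta[:, None], jet_phi[:, None],
                 ele_eta[None, :], ele_phi[None, :])
    return dr.min(axis=1) > min_dr


# Toy event: the first two jets each overlap an electron, the third does not.
jet_eta = np.array([0.10, 1.50, -2.00])
jet_phi = np.array([0.20, -3.10, 1.00])
ele_eta = np.array([0.15, 1.45])
ele_phi = np.array([0.25, 3.10])

mask = clean_jets(jet_eta, jet_phi, ele_eta, ele_phi)
```

The second jet and second electron sit on opposite sides of the phi wrap-around, so the example also exercises the wrapping in `delta_r`.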

All in all, I think that not only should uproot/awkward adapt better to NanoAOD, but NanoAOD could also benefit from the lessons learned in awkward.

@jpivarski
Member Author

Actually, they evolved together—we were talking with each other when NanoAOD and awkward were both being developed. I had some suggestions about the branch type: to use ROOT arrays instead of std::vector (which adds 10 bytes per event per branch). This is the JaggedArray format, almost byte for byte. (ROOT's offsets are byte offsets relative to the TKey, rather than item offsets relative to the start of data, but that's a subtraction and a bit-shift.)
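The "subtraction and a bit-shift" can be sketched as follows. The header length of 64 bytes and the function name are made up for the example; the real value depends on the TKey, and this is not a description of ROOT's exact on-disk layout.

```python
import numpy as np


def byte_offsets_to_item_offsets(byte_offsets, key_length, itemsize):
    """Convert byte offsets relative to the key into item offsets.

    key_length is the assumed number of header bytes before the data;
    itemsize must be a power of two so that division becomes a shift.
    """
    shift = itemsize.bit_length() - 1
    if 1 << shift != itemsize:
        raise ValueError("itemsize must be a power of two")
    return (np.asarray(byte_offsets) - key_length) >> shift


# Three events holding 2, 0, and 3 float32 values after a 64-byte header.
byte_offsets = np.array([64, 72, 72, 84])
item_offsets = byte_offsets_to_item_offsets(byte_offsets, key_length=64, itemsize=4)
```

The resulting item offsets can be used directly to slice the flat data array, one slice per event.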

NanoAOD can save space by storing one set of counts (nMuon) instead of the counts and also the offsets (internally in each of Muon_pt, Muon_eta, etc.), but that's also a ROOT feature, motivated by NanoAOD itself: the TTree::IOFeatures bit that tells ROOT to not store offsets and get everything from the counts. I don't know if this feature is being used in production because it's not backward-compatible in ROOT, but if it is used, uproot can read it and nobody would notice that it's there, apart from a 30% savings in space.
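As a small illustration of why one set of counts is enough, the sketch below derives offsets from a per-event count array once and reuses them for every column of the same collection. The arrays are toy data and `event_slice` is a hypothetical helper.

```python
import numpy as np

# Per-event counts, as a single counter branch would provide them.
counts = np.array([2, 0, 3])

# Offsets are the running sum of the counts, with a leading zero.
offsets = np.zeros(len(counts) + 1, dtype=np.int64)
offsets[1:] = np.cumsum(counts)
starts, stops = offsets[:-1], offsets[1:]

# Flat columns belonging to the same collection share these offsets.
muon_pt = np.array([41.0, 12.5, 33.0, 27.1, 8.4])
muon_eta = np.array([0.3, -1.2, 2.1, 0.0, -0.7])


def event_slice(flat, i):
    """Values of one flat column that belong to event i."""
    return flat[starts[i]:stops[i]]
```

Storing per-branch offsets on top of this would repeat the same `offsets` array once per column, which is the redundancy being discussed.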

What I'm talking about in this issue is not about changing any formats or making anything more efficient—just packaging it up in a more intuitive way. Turning NanoAOD's links into IndexedArrays wouldn't change their speed, but it would make them act like pointers without user intervention. (As though you had a subset of the electron objects nested within their jets, rather than having to do some extra indexing by hand.)
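The "extra indexing by hand" looks roughly like this in plain NumPy. The sketch assumes a link column holding per-event local jet indices (in the style of `Electron_jetIdx`) with -1 meaning "no jet"; the variable names and data are toys, and an IndexedArray would hide these steps from the user.

```python
import numpy as np

n_jet = np.array([2, 1])                 # jets per event
jet_pt = np.array([50.0, 30.0, 80.0])    # flat jet column

n_ele = np.array([1, 2])                 # electrons per event
ele_jet_idx = np.array([1, -1, 0])       # local jet index per electron, -1 = none

# Where each event's jets start in the flat jet column.
jet_starts = np.cumsum(n_jet) - n_jet
# Which event each electron belongs to.
ele_event = np.repeat(np.arange(len(n_ele)), n_ele)

# Turn per-event local indices into indices into the flat jet column.
has_jet = ele_jet_idx >= 0
global_idx = np.where(has_jet, jet_starts[ele_event] + ele_jet_idx, 0)

# Follow the "pointer"; electrons without a jet get NaN.
ele_jet_pt = np.where(has_jet, jet_pt[global_idx], np.nan)
```

With the links wrapped up front, a user would just ask an electron for its jet and get the same values without building `global_idx` themselves.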

@guitargeek
Contributor

guitargeek commented Mar 11, 2019

Thank you for your explanations! As someone relatively new to CMS, I'm always glad when someone explains the context of how things evolved historically. I did not know much of this, so thanks for taking the time to answer, even though my previous comment was not really on topic, as I now see.

@jpivarski
Member Author

That's okay—it's good to hear about the level of interest!

The thread here will be replaced with a PR as soon as I start actually working on it anyway.

@nsmith-
Member

nsmith- commented Apr 11, 2019

Hey, can we use the recursively defined IndexedArray for the gen particle parents? :)

@jpivarski
Member Author

Yes. If the gen particles look something like this:

import awkward  # awkward-array 0.x

tree = awkward.fromiter([
    {"value": 1.23, "left":    1, "right":    2},     # node 0
    {"value": 3.21, "left":    3, "right":    4},     # node 1
    {"value": 9.99, "left":    5, "right":    6},     # node 2
    {"value": 3.14, "left":    7, "right": None},     # node 3
    {"value": 2.71, "left": None, "right":    8},     # node 4
    {"value": 5.55, "left": None, "right": None},     # node 5
    {"value": 8.00, "left": None, "right": None},     # node 6
    {"value": 9.00, "left": None, "right": None},     # node 7
    {"value": 0.00, "left": None, "right": None},     # node 8
])
left = tree.contents["left"].content
right = tree.contents["right"].content
left[(left < 0) | (left > 8)] = 0         # satisfy overzealous validity checks
right[(right < 0) | (right > 8)] = 0
tree.contents["left"].content = awkward.IndexedArray(left, tree)
tree.contents["right"].content = awkward.IndexedArray(right, tree)

tree[0].tolist()

we can make a tree. (That's what the above does: tree[0] is the tree and all other elements of tree are its subtrees. Try the above code: it prints out a nested dict of dicts.)

@nsmith-
Member

nsmith- commented Apr 11, 2019

lol that's a BDT
So NanoAOD is bottom-up rather than top-down: each entry in the list has a reference to the index of its parent entry. I played a bit with the recursive thing after reading your spec and I'm pretty sure it's possible. Just keeping it on the radar. I can devote some time to an implementation if you like.

@jpivarski
Member Author

BDT was a motivating case. Yes, these are top-down arrows, so that you can walk from root to leaf. If gen particle arrows point from leaf to root, then a new calculation would be needed. Since we'd only want to do that on demand, it could be in a VirtualArray.
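One way the "new calculation" could look, as a sketch for a single event in plain NumPy: invert a parent-index array into per-node child lists, stored as offsets plus a flat array. The function name and the toy `parent` array are assumptions, with -1 marking a particle that has no parent.

```python
import numpy as np


def children_from_parents(parent):
    """Invert leaf-to-root links into root-to-leaf links.

    Returns (offsets, children) such that
    children[offsets[i]:offsets[i + 1]] are the indices of node i's children.
    """
    parent = np.asarray(parent)
    n = len(parent)
    has_parent = parent >= 0
    child_ids = np.nonzero(has_parent)[0]
    # A stable sort by parent index groups siblings and keeps their order.
    order = np.argsort(parent[has_parent], kind="stable")
    children = child_ids[order]
    counts = np.bincount(parent[has_parent], minlength=n)
    offsets = np.zeros(n + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(counts)
    return offsets, children


# Node 0 is the root; nodes 1 and 3 are its children, and so on.
parent = np.array([-1, 0, 1, 0, 1, 3])
offsets, children = children_from_parents(parent)


def children_of(i):
    return children[offsets[i]:offsets[i + 1]]
```

Because the inversion costs a sort over the whole array, it is the kind of work that makes sense to defer until someone actually walks the tree from the top.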

There are quite a few good things the CMS NanoAOD extension could have. It's not short-timescale like the awkward/uproot-methods version management, though.

By the way, I lost track of something you said about mocking Methods—I didn't understand and then lost the tab. Could you ping me on that with more explanation?


3 participants