-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DB rp empirical #4
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I believe the empirical RP calculation is correct, but may just want to include more NA handling.
Also, just a heads up that alot of this can also be done straight in pandas
instead of numpy
, which I've just found to be a bit more legible and easier to debug (e.g., don't need to worry about matching the ordering of the arrays and stuff since it's all in the same dataframe).
Just as an example, this is how I've done it to calculate the RPs by admin1, but could be done by any groupby by
:
def calculate_rp(group, col_name):
group[f"rank_{col_name}"] = group[col_name].rank(ascending=False)
group[f"rp_{col_name}"] = (len(group) + 1) / group[f"rank_{col_name}"]
return group
era5_adm1_jas_df = (
era5_adm1_jas_df.groupby("ADM1_PCODE")
.apply(calculate_rp, col_name="mean", include_groups=False)
.reset_index()
.drop(columns=["level_1"])
)
In this example it was for flooding, so high value = bad. But for drought (i.e. low value = bad), you can just set ascending=True
in .rank()
(which is the default behaviour anyways).
yeah this code makes sense. But I don't really get how the |
yeah i switched to your suggestion of using pandas instead of numpy - seems a bit more user-friendly for this use-case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 😎 Looks good to me!
I didn't go into as much detail checking the values, but confirmed the basics:
- higher value means higher RP
- RPs look plausible, and are inf for when RP > 10
- NAs don't generate false RPs
src/utils/return_periods.py
Outdated
df_filt = df[ | ||
~df[by].apply(tuple, axis=1).isin(df_nans.apply(tuple, axis=1)) | ||
] | ||
df_maxima_filt = df_maxima[ | ||
~df_maxima[by] | ||
.apply(tuple, axis=1) | ||
.isin(df_maxima_nans.apply(tuple, axis=1)) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the source of the SettingWithCopyWarning
👀
To avoid those coming up, can just add .copy()
to each of these filtering steps like:
df_filt = df[
~df[by].apply(tuple, axis=1).isin(df_nans.apply(tuple, axis=1))
].copy()
df_maxima_filt = df_maxima[
~df_maxima[by]
.apply(tuple, axis=1)
.isin(df_maxima_nans.apply(tuple, axis=1))
].copy()
Anyways not necessary, up to you if you want to switch it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea i had gotten rid of a bunch of them by using .loc
, but not sure i how i could do that for these operations. Anyways i added the .copy()
for now. good call, thanks!
add empirical RP simplification functions.
src/utils/return_periods.py
exploration/04_add_return_periods_example.py
I've kept the loading functions and processing functions separate for flexibility in how the pipeline will be configured/parameterized.