
DB rp empirical #4

Merged 8 commits on Dec 10, 2024

Conversation

@zackarno (Collaborator) commented Dec 5, 2024

Add empirical RP calculation functions.

  • I've added functions to calculate empirical RPs to src/utils/return_periods.py
  • An example of the implementation is shown in exploration/04_add_return_periods_example.py

I've kept the loading functions and processing functions separate for flexibility in how the pipeline will be configured/parameterized.

@zackarno zackarno requested a review from t-downing December 5, 2024 19:16
@zackarno zackarno mentioned this pull request Dec 5, 2024
@t-downing (Collaborator) left a comment

Nice! I believe the empirical RP calculation is correct, but may just want to include more NA handling.

Also, just a heads up that a lot of this can also be done straight in pandas instead of numpy, which I've found to be a bit more legible and easier to debug (e.g., you don't need to worry about matching the ordering of the arrays since it's all in the same dataframe).

Just as an example, this is how I've done it to calculate the RPs by admin1, but it could be done with any groupby:

```python
def calculate_rp(group, col_name):
    group[f"rank_{col_name}"] = group[col_name].rank(ascending=False)
    group[f"rp_{col_name}"] = (len(group) + 1) / group[f"rank_{col_name}"]
    return group

era5_adm1_jas_df = (
    era5_adm1_jas_df.groupby("ADM1_PCODE")
    .apply(calculate_rp, col_name="mean", include_groups=False)
    .reset_index()
    .drop(columns=["level_1"])
)
```

In this example it was for flooding, so high value = bad. But for drought (i.e. low value = bad), you can just set ascending=True in .rank() (which is the default behaviour anyways).
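To make the formula concrete, here is a self-contained toy run of the same rank-based approach (the data and admin codes below are made up for illustration, not from this PR):

```python
import pandas as pd

def calculate_rp(group, col_name):
    # rank 1 = largest value, so the most extreme year gets the longest RP
    group[f"rank_{col_name}"] = group[col_name].rank(ascending=False)
    # empirical return period: RP = (n + 1) / rank
    group[f"rp_{col_name}"] = (len(group) + 1) / group[f"rank_{col_name}"]
    return group

# toy seasonal means for two made-up admin1 units, 5 "years" each
df = pd.DataFrame({
    "ADM1_PCODE": ["A"] * 5 + ["B"] * 5,
    "mean": [10.0, 30.0, 20.0, 50.0, 40.0, 1.0, 3.0, 2.0, 5.0, 4.0],
})
out = df.groupby("ADM1_PCODE", group_keys=False).apply(calculate_rp, col_name="mean")
# with n = 5, the largest value in each group gets RP (5 + 1) / 1 = 6
```

So within each group the record year comes out at roughly a 1-in-6-"year" event, and the smallest value at (5 + 1) / 5 = 1.2.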

Resolved review threads:

  • requirements.txt (outdated)
  • src/utils/pg.py
  • src/utils/return_periods.py (outdated)
  • src/utils/return_periods.py
@zackarno (Collaborator, Author) commented Dec 9, 2024

> Nice! I believe the empirical RP calculation is correct, but may just want to include more NA handling.
>
> Also, just a heads up that a lot of this can also be done straight in pandas instead of numpy, which I've found to be a bit more legible and easier to debug (e.g., you don't need to worry about matching the ordering of the arrays since it's all in the same dataframe).
>
> Just as an example, this is how I've done it to calculate the RPs by admin1, but it could be done with any groupby:
>
> ```python
> def calculate_rp(group, col_name):
>     group[f"rank_{col_name}"] = group[col_name].rank(ascending=False)
>     group[f"rp_{col_name}"] = (len(group) + 1) / group[f"rank_{col_name}"]
>     return group
>
> era5_adm1_jas_df = (
>     era5_adm1_jas_df.groupby("ADM1_PCODE")
>     .apply(calculate_rp, col_name="mean", include_groups=False)
>     .reset_index()
>     .drop(columns=["level_1"])
> )
> ```
>
> In this example it was for flooding, so high value = bad. But for drought (i.e. low value = bad), you can just set ascending=True in .rank() (which is the default behaviour anyways).

Yeah, this code makes sense, but I don't really get how the `group` arg is being applied in the function implementation. I'm not sure of the details/trade-offs in Python, but in R I always think it's slick when you can vectorize a function so that all it takes is the x vector, letting dplyr::group_by() / pd.groupby handle the grouping aspect independently. There are a lot of advantages to doing this in R, but I'll DM you to chat further, as I'm afraid I might start a circular conversation here due to my lack of Python/pandas knowledge.
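For illustration, the vectorized style described above is also possible in pandas: a function that takes only the value Series, with `groupby().transform()` handling the grouping. This is my own sketch, not code from the PR; the `empirical_rp` name and the NA-aware sample size (`notna().sum()`) are assumptions:

```python
import pandas as pd

def empirical_rp(x: pd.Series, ascending: bool = False) -> pd.Series:
    # Takes just the value vector; the caller's groupby handles grouping.
    # NaNs get NaN ranks by default, so they yield NaN RPs rather than false ones.
    rank = x.rank(ascending=ascending)
    n = x.notna().sum()  # NA-aware sample size (an assumption about NA handling)
    return (n + 1) / rank

df = pd.DataFrame({
    "ADM1_PCODE": ["A", "A", "A", "B", "B", "B"],
    "mean": [10.0, 30.0, 20.0, 1.0, None, 3.0],
})
df["rp_mean"] = df.groupby("ADM1_PCODE")["mean"].transform(empirical_rp)
```

Because `transform` returns a Series aligned to the original index, there is no separate rank column to merge back, which is close to the `dplyr::mutate()`-style workflow described above.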

@zackarno (Collaborator, Author) commented Dec 9, 2024

Yeah, I switched to your suggestion of using pandas instead of numpy - seems a bit more user-friendly for this use case.

@t-downing (Collaborator) left a comment

Nice 😎 Looks good to me!

I didn't go into as much detail checking the values, but confirmed the basics:

  • higher value means higher RP
  • RPs look plausible, and are inf when RP > 10
  • NAs don't generate false RPs
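The checks in that list can be sketched as quick assertions on a toy frame (my own illustration using the rank-based formula from the earlier comment, not the PR's actual test code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mean": [10.0, 40.0, np.nan, 20.0, 30.0]})
n = df["mean"].notna().sum()
df["rp_mean"] = (n + 1) / df["mean"].rank(ascending=False)

# higher value -> higher RP (monotonic over the non-NaN rows)
valid = df.dropna(subset=["mean"]).sort_values("mean")
assert valid["rp_mean"].is_monotonic_increasing

# NAs don't generate false RPs
assert df.loc[df["mean"].isna(), "rp_mean"].isna().all()
```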

Comment on lines 15 to 22
```python
df_filt = df[
    ~df[by].apply(tuple, axis=1).isin(df_nans.apply(tuple, axis=1))
]
df_maxima_filt = df_maxima[
    ~df_maxima[by]
    .apply(tuple, axis=1)
    .isin(df_maxima_nans.apply(tuple, axis=1))
]
```
@t-downing (Collaborator) commented:

I found the source of the SettingWithCopyWarning 👀

To avoid those coming up, can just add .copy() to each of these filtering steps like:

```python
df_filt = df[
    ~df[by].apply(tuple, axis=1).isin(df_nans.apply(tuple, axis=1))
].copy()
df_maxima_filt = df_maxima[
    ~df_maxima[by]
    .apply(tuple, axis=1)
    .isin(df_maxima_nans.apply(tuple, axis=1))
].copy()
```

Anyways not necessary, up to you if you want to switch it.
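For anyone unfamiliar with the warning, here is a minimal sketch of the pattern on a toy frame (column names are made up; not the PR's actual data):

```python
import pandas as pd

df = pd.DataFrame({"ADM1_PCODE": ["A", "A", "B"], "mean": [1.0, 2.0, 3.0]})

# A boolean-mask slice can be view-like; assigning a new column on it may
# raise SettingWithCopyWarning. .copy() makes df_filt an independent frame.
df_filt = df[df["ADM1_PCODE"] == "A"].copy()
df_filt["rp_mean"] = (len(df_filt) + 1) / df_filt["mean"].rank(ascending=False)

# the original frame is untouched by the assignment
assert "rp_mean" not in df.columns
```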

@zackarno (Collaborator, Author) commented Dec 10, 2024

Yeah, I had gotten rid of a bunch of them by using .loc, but I'm not sure how I could do that for these operations. Anyways, I added the .copy() for now. Good call, thanks!

@zackarno zackarno merged commit b48964e into main Dec 10, 2024
2 participants