feat: add min_issue to covidcast meta data #236
Fails integration tests:
@@ -576,6 +576,7 @@ def get_covidcast_meta(self):
ROUND(AVG(`value`),7) AS `mean_value`,
ROUND(STD(`value`),7) AS `stdev_value`,
MAX(`value_updated_timestamp`) AS `last_update`,
MIN(`issue`) as `min_issue`,
i believe this is not going to work as intended, and instead will show the issue date of the least-recently-updated of the issues in a group, which i would argue is meaningless (please correct me if i'm wrong). i think it will require a `MIN(issue)` inside the sub-SELECT and appropriate handling outside of that.
also, some non-trivial test cases would be nice, with different `issue` values.
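To make the concern above concrete, here is a minimal sketch using SQLite from Python. The table layout, the sample rows, and both query variants are simplified assumptions for illustration; they are not the actual `get_covidcast_meta` query, and SQLite stands in for the production MySQL database.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for the covidcast table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE covidcast ("
    "source TEXT, signal TEXT, geo_value TEXT, "
    "time_value INTEGER, issue INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO covidcast VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("src", "sig", "pa", 20200401, 20200402, 1.0),
        ("src", "sig", "pa", 20200401, 20200405, 1.5),  # latest issue for 2020-04-01
        ("src", "sig", "pa", 20200402, 20200403, 2.0),
        ("src", "sig", "pa", 20200402, 20200410, 2.5),  # latest issue for 2020-04-02
    ],
)

# Variant A: MIN(issue) is taken in the outer query only, after the join has
# already discarded every row that is not the latest issue.
LATEST_ONLY_SQL = """
SELECT t.source, t.signal, MIN(t.issue) AS min_issue, MAX(t.issue) AS max_issue
FROM covidcast t
JOIN (
    SELECT source, signal, geo_value, time_value, MAX(issue) AS max_issue
    FROM covidcast
    GROUP BY source, signal, geo_value, time_value
) x ON x.source = t.source AND x.signal = t.signal
   AND x.geo_value = t.geo_value AND x.time_value = t.time_value
   AND x.max_issue = t.issue
GROUP BY t.source, t.signal
"""

# Variant B: the sub-select also computes MIN(issue) over all issues, and the
# outer query aggregates that column instead.
NESTED_MIN_SQL = """
SELECT t.source, t.signal, MIN(x.min_issue) AS min_issue, MAX(t.issue) AS max_issue
FROM covidcast t
JOIN (
    SELECT source, signal, geo_value, time_value,
           MAX(issue) AS max_issue, MIN(issue) AS min_issue
    FROM covidcast
    GROUP BY source, signal, geo_value, time_value
) x ON x.source = t.source AND x.signal = t.signal
   AND x.geo_value = t.geo_value AND x.time_value = t.time_value
   AND x.max_issue = t.issue
GROUP BY t.source, t.signal
"""

latest_only = conn.execute(LATEST_ONLY_SQL).fetchall()
nested_min = conn.execute(NESTED_MIN_SQL).fetchall()

print(latest_only)  # min_issue is the oldest of the *latest* issues: 20200405
print(nested_min)   # min_issue is the oldest issue overall: 20200402
```

With this test data, variant A reports the smallest of the most recent issues, while variant B reports the earliest issue present in the table. Rows with several distinct `issue` values per `time_value`, as above, are the kind of non-trivial test case that separates the two.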
true, I will think about how to fix it. Moreover, we could use the `is_latest_issue` flag to find the max issue
@melange396 can you take a look at the new statements? I split the query into two, since I couldn't think of a fast way to handle both cases at the same time. Moreover, I switched the nested statement to the `is_latest_issue` flag. It is not easy for me to test, since I don't have access to the real database, only local test data
i suppose this will work, but the old SQL query would have done the trick if you added ``min(`issue`) `min_issue` `` to the nested select... i think that would be preferable to adding a bunch of extra lines unless you can think of another benefit that justifies the extra complexity
yes, but I think this version should be more efficient, since it uses `is_latest_issue` instead of a join
after wrestling with and waiting for long-duration queries for the past week, i think it might be wise to test the performance of the one-query vs two-query approaches on the staging server. i can certainly help do this if necessary.
Apologies for being so late to this conversation.
The use case for `min_issue` is briefly described in the source issue for this PR, but the gist is this: A researcher is constructing `as_of` queries and wants to know the earliest date they can pass in and still expect a non-nil result. This means that `min_issue` must not be restricted to rows with `is_latest_issue=1`, since that would hide issues which are not the most recent but are nevertheless accessible using `as_of`.
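A small sketch of that use case, again using SQLite from Python. The table, its rows, and the `query_as_of` helper are hypothetical; the helper only assumes that an `as_of` query returns, for each `time_value`, the most recent issue that is not later than the given date.

```python
import sqlite3

# Hypothetical miniature table; is_latest_issue marks the newest issue per time_value.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE covidcast ("
    "time_value INTEGER, issue INTEGER, value REAL, is_latest_issue INTEGER)"
)
conn.executemany(
    "INSERT INTO covidcast VALUES (?, ?, ?, ?)",
    [
        (20200401, 20200402, 1.0, 0),
        (20200401, 20200405, 1.5, 1),
        (20200402, 20200403, 2.0, 0),
        (20200402, 20200410, 2.5, 1),
    ],
)


def query_as_of(as_of):
    """Return (time_value, issue, value) rows as they looked on date `as_of`.

    Assumed semantics: per time_value, pick the newest issue <= as_of.
    """
    return conn.execute(
        """
        SELECT t.time_value, t.issue, t.value
        FROM covidcast t
        JOIN (
            SELECT time_value, MAX(issue) AS issue
            FROM covidcast
            WHERE issue <= ?
            GROUP BY time_value
        ) x ON x.time_value = t.time_value AND x.issue = t.issue
        ORDER BY t.time_value
        """,
        (as_of,),
    ).fetchall()


# min_issue over every row vs. only over rows flagged as the latest issue.
min_issue_all = conn.execute("SELECT MIN(issue) FROM covidcast").fetchone()[0]
min_issue_latest = conn.execute(
    "SELECT MIN(issue) FROM covidcast WHERE is_latest_issue = 1"
).fetchone()[0]

print(min_issue_all, min_issue_latest)
print(query_as_of(min_issue_all))      # earliest as_of that still returns data
print(query_as_of(min_issue_all - 1))  # one day earlier: nothing
```

In this toy data the unrestricted minimum is 20200402 and `query_as_of(20200402)` already returns a row, whereas the minimum over `is_latest_issue = 1` rows is 20200405 and would tell the researcher to start three days later than necessary.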
i will go ahead and try this and other variants out on the staging copy of the database to check performance and runtimes
Putting this back on your radar @melange396 now that the JHU fires are out
closing this PR since meta generation has changed since then
closes #232