Prepare the MONDO outreach presentation materials #13
Comments
Some slides / topics for the presentation
|
@yonromai it probably makes sense to tailor the next week of work to make sure we have an MVP dataset and analysis for the presentation. I think the main things we're missing are:
How does this sound? If it seems infeasible, we can restrict the scope of the MVP. |
Definitely!
This issue fell through the cracks on my end; I would probably need more context on what's involved.
Sounds good.
Right - it's been bothering me that I have left all the data artifacts out so far. Do you guys use tools like Git LFS and/or DVC internally? (cc: @ravwojdyla)
Sure! In terms of feature importance, does something like this (see "Example of run") work?
Good! Do you think you'd have some time tomorrow / early next week for a quick live catch up on this ^ ? |
For this repo, we'll want to use a method where the data is easily available to a public user, preferably integrated with the repo. I like Git LFS, but the GitHub billing for LFS can be excessive. The method we use for ensembl-genes and nxontology-data is to commit the data directly to an output branch without LFS. This works if all datasets are < 100 MB, which I assume will be the case here. It's kind of hacky, but it gets you version control, gratis storage, and good forkability and community accessibility.
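A minimal sketch of how the < 100 MB condition could be checked before committing to an output branch. The directory layout and the function name `oversized_files` are hypothetical, and the threshold is taken here as 100 * 1024**2 bytes, which is an assumption about how the limit is measured:

```python
from pathlib import Path

# Assumed per-file threshold for committing without LFS (100 MB, as discussed above).
MAX_BYTES = 100 * 1024 * 1024


def oversized_files(root: Path, limit: int = MAX_BYTES) -> list[Path]:
    """Return the files under `root` whose size in bytes exceeds `limit`."""
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )
```

Running something like `oversized_files(Path("output"))` before each commit and expecting an empty list would flag any dataset that needs a different storage method.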
Yes that works, especially if those importance values can be aggregated, so we can show groupings of features.
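As a rough illustration of what aggregating could look like, here is a sketch that sums per-feature importance values into named groups. It assumes the importance values are additive (for example normalized gain or mean absolute SHAP values); the function name, feature names, and group names are all hypothetical and not taken from the repo:

```python
from collections import defaultdict


def aggregate_importance(
    importances: dict[str, float],
    groups: dict[str, str],
) -> dict[str, float]:
    """Sum per-feature importances into groups, sorted from largest to smallest.

    Features without an entry in `groups` are collected under "other".
    """
    totals: defaultdict[str, float] = defaultdict(float)
    for feature, value in importances.items():
        totals[groups.get(feature, "other")] += value
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

For instance, mapping every graph-derived feature to a single "graph" group would let one bar on a slide stand in for many individual features.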
Yes, let's find a time in slack. |
Right, usually I prefer to avoid putting large data files in git since (1) it makes git increasingly slow and (2) it becomes more likely to hit the 1GB GH size limit (unless erasing data file history). If the checked-in data is small and fairly static, then neither of these is an issue. In the case of checking in experiments, training sets, and output models (to increase reproducibility and preserve history), it becomes more of a problem. Using a solution backed by something like S3 or GCS could work long term but is definitely too much work / out of scope for now. If history/reproducibility aren't super important, it's probably okay to keep being mindful about which datasets are persisted and only check in essential datasets (like => I agree that checking the "final" model into git is the way to go forward at this point in time.
Cool! Let's discuss all these points tomorrow |
I put up a rough outline and template slide deck at https://slides.com/dhimmel/efo-disease-precision. |
@dhimmel nice to prepare the seminar in the open. Just FYI, some people in Monarch have already asked that we look at some concrete examples of diseases where it is hard to classify them as grouping (area) or proper disease. Will you mention a few examples of clear-cut subtypes, groupings, and diseases, and a few questionable ones for debate? |
Just a quick note on the difficulties of establishing ground truth. I don't have time for a full summary, but see this as a quick literature proxy (the usual AI caveats apply): https://www.perplexity.ai/search/Is-Parkinson-disease-G_cC1TZOQ121Q2ZWmKjCZw?s=c |
@dhimmel one of these might be worth including: #2 (comment) I have no strong opinion one way or the other, but +1 to having a sampling of terms by precision label somewhere after #30. |
@eric-czech the one area that I don't feel fully confident in presenting is how the initial RS labels that we use for training were calculated. Would you be able to jot down a few notes on that process here? I could then copy that to a slide and then you could present that slide. |
Certainly. The process went like this:
I went through this loop ~5 times and stopped once the top predictions for the |
Thanks @yonromai and @eric-czech for the help presenting. We can leave this issue open until the recording is online and we add a link to the recording/slides to the README. Copying the Zoom chat log here from today, since there are some good suggestions that we should follow up on.
|
On 2023-09-22, we're giving a presentation at the Mondo Workshops / Outreach Call. The current title is "Classifying EFO/MONDO diseases as areas, roots, or subtypes". Daniel Korn from Every Cure will also present, so we should plan to not exceed 25 minutes.
We can use this issue to coordinate the slides and materials.