This CFDE-CC training plan lays out our long-term goals (~3-5 years) as well as our short-term plan of action for the remainder of 2020. It will be followed by another report in October that provides a progress update on 2020 efforts, presents assessment and evaluation results, and details the next steps for training.
The goals of the CFDE training effort are fourfold. First, we want to work with specific CFDE Data Coordinating Centers (DCCs) to develop and run DCC-specific and targeted cross-DCC data set training programs that help their users make better use of their data. Second, we want to provide broad-based training on data analysis in the cloud, to help the entire CFDE user base shift to a more sustainable long-term approach. Third, we plan to work with DCCs to execute hackathons in which more advanced users explore “out of the box” data analysis, in order to help DCCs guide their user experience and pipeline development. And fourth, we expect that broad and deep engagement with a wide range of users will help us identify new and exciting use cases for data reuse that can be brought back to the CF DCCs and the CFDE. Collectively, our training program will train users, help the DCCs lower their support burden, improve user experience, and identify new use cases for data reuse and data integration within and across DCCs.
All training materials produced by the CFDE-CC will be made available through the central nih-cfde.org web site, under CC0 or CC-BY licenses, which will allow them to be used and remixed by any other stakeholders without limitation. Assessment and iteration on the materials will be carried out by the CFDE-CC’s training team during the pilot period, which we expect to continue through the end of 2020; we may engage with external assessment and evaluation as our efforts expand.
The CFDE-CC’s training component is run by Dr. Titus Brown and Dr. Amanda Charbonneau; we are currently recruiting three trainers as well as at least one staff coordinator, all to start in May/June. The training component is closely integrated with the engagement plan, and we expect training to interface with user experience evaluation and iteration across the entire CF and CFDE as well as use case creation and refinement.
In this training plan, we have no specific plans to interface with training efforts outside the Common Fund. However, we are aware of a number of training efforts with similar goals, including Broad’s Terra training program and ANViL’s training focus. Our approaches to sharing materials and running trainings are designed to allow these other training efforts to make use of our materials, and the underlying technologies and approaches we are using (see below) are entirely compatible.
In-person training vs online training
Our initial plan was to run a series of in-person workshops during 2020. However, we are now pivoting to an online strategy because of the COVID-19 pandemic. In particular, we expect no in-person meetings before August, and substantial barriers to travel after that. We will re-evaluate our plans in our updated training plan in October.
Online training is very different from in-person training. In our experience, in-person training offers a natural focus for many learners and can support extended (~4-6 hrs/day) engagement with materials. Moreover, technology problems on the learner’s side can often be fixed by in-person helpers who have direct access to the learner’s computer. Finally, the intensity of in-person workshops justifies the higher cost of travel: in the past we have successfully run many in-person workshops, lasting between 2 days and 2 weeks, where either the instructors or the students traveled significant distances to attend.
Online training requires different affordances. Learner attention span in the absence of interpersonal interaction is much shorter. Remote debugging is possible but much less effective than in-person debugging. And both instructors and learners must manage more technology, including teleconferencing software and chat, often on the same screen as the training materials themselves. These challenges, among others, have limited the effectiveness of online training efforts, including MOOCs; several studies have shown that most learners drop out of MOOCs quickly, and that MOOCs mainly benefit those who already have experience with the material.
In exchange for these downsides, online training offers some opportunities. By delivering material asynchronously, we can accommodate learners’ different schedules, and there is much more time for offline experimentation and challenge exercises. Moreover, online training scales somewhat better and can potentially be offered more cheaply, since it requires no travel or local facilities.
While we believe we can leverage online training effectively, we will need to experiment with formats, try out a variety of technologies, and put more effort into robust material development to offset the challenges of learner debugging. This may delay some of our previously planned training, but should result in robust training materials that meet our original objectives on approximately the same timeline.
Online lesson development approach
We need to transition from current draft materials for an in-person workshop to materials that can be delivered online. Our current plan is to start by breaking lessons up into 5-10 minute video chunks that integrate concepts and technical activities. These chunks can be viewed in “flipped” or offline mode, and will be interspersed with opportunities for virtual attendees to seek technical help, explore their own interests, and ask questions in an individual or group setting.
After our initial revamp, we will deliver each lesson within the training team, and then expand to groups outside our team. Each delivery will result in an iteration on the materials. After 2-3 iterations are delivered to beta users and CF program members, we will set up a formal registration system and encourage adventurous biomedical scientists to attend half-day sessions over a period of a week or two.
During the lesson development and delivery period, we will work closely with each partner DCC to make sure our lessons align with their best practices, as well as conveying any technical challenges with user experience back to the DCCs in order to identify potential improvements in DCC portals.
This lesson development approach is slow and cautious, and provides plenty of opportunity to improve the materials in response to the lived experience of both instructors and learners. We expect to develop a new lesson (~8-16 hours of training) approximately every month with this approach, although we may develop two lessons in tandem. Once we have 3-4 lessons developed, we will switch to offering them on a larger scale for a month or two, then conduct assessments and evaluate our overall approach as well as next steps for specific lesson development.
Assessment approach
Our assessments for this period will be formative, and will focus on improving our impact by better understanding the needs of our learners, areas where our materials can be improved, and techniques for better online delivery of our materials. Assessment will primarily consist of during-training checkpoints, pre- and post-training surveys, and remote interviews with learners both before and after training. The results from these assessments will be reported in the October training update and will also be used to develop larger-scale instruments that we can use to standardize summative assessment in 2021. We will also work with DCCs to measure continued use by learners as one of our longer-term metrics.
We will work with DCCs to build training materials that help their current and future users make use of their data sets. Our primary goals here are to (a) create and expand materials for users, (b) offer regular trainings in collaboration with the DCCs, (c) provide expanded help documentation for users to lower the DCC support burden, and (d) work with the DCCs over time to refine the user experience and further lower the DCC support burden.
In the near term, Kids First, GTEx, and LINCS have all expressed interest in working with us on specific training opportunities. We have already connected with KF and GTEx, and prior to the COVID-19 pandemic were working on RNAseq tutorials for the KF/Cavatica platform as well as the GTEx/ANViL/Terra platform. With Kids First, we have already conducted one alpha presentation at UC Davis and are communicating with them about our results. For GTEx and LINCS we have specific plans but do not yet have the personnel to put them into action.
We have begun a Kids First WGS and RNAseq tutorial and are awaiting the chance to iterate with the KF DCC, which is dependent on their and our hiring processes. Our next focus is a video walkthrough of the KF portal and Cavatica to discuss with the KF DCC. We have not yet started writing a GTEx tutorial; however, it will also be RNAseq-based. We anticipate ramping this up quickly as our staff and GTEx staff come online in May or June. We plan to run at least one full workshop each on KF WGS/RNAseq and GTEx RNAseq between now and the end of August. The exact timeline and number of workshops will depend on how we recruit participants and how quickly our lesson development proceeds.
- Persistent, user-led walkthrough documents (~4 hours of material each)
- Accompanying short videos of difficult sections
- Materials are clinician-focused (KF only)
- Materials are data scientist-focused (GTEx only)
- Materials available at nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with DCC best practices
- Online community space for learner engagement
- Formal registration system
- Code of Conduct
- Moderator Group
- DCC and other expert volunteers to answer questions
- Promotion of materials
- Promotion of online community
- Assessment
- Materials contain breaks for checking understanding
- Pre-training surveys
- Post-training surveys
- Conduct remote interviews with learners both before and after training
- Secure any approvals for human data collection
- Collect contact information from learners
We have plans with LINCS to help with materials development for a graduate-level course called “Data Science with LINCS Data,” to be offered as an online video series. LINCS has already developed draft materials for this course, but has not had time to test or record them. They would like our team to run through the materials, identify areas that need improvement, and help them make those changes. While they would like their own staff to present and record the materials, they have also asked for our assistance with aspects of post-production such as video editing. This effort is ready to begin as soon as our staff and their funding come online.
- Persistent video lessons
- Videos are accessible
- Include written transcripts
- Include closed-captioning
- Materials are graduate-level, research scientist-focused
- Materials available at nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with DCC best practices
- Promotion of materials
- Assessment
- Materials contain breaks for checking understanding
- Pre-training surveys
- Post-training surveys
We will reach out to SPARC to discuss training plans as their funding develops.
We will develop online training materials for biomedical scientists who want to analyze data in the cloud. Many future NIH Common Fund plans for large-scale data analysis rely on analyzing the data on remotely hosted cloud platforms, be they commercial clouds such as Amazon Web Services and Google Cloud Platform (GTEx, KF) or on-premise hosting systems like the Pittsburgh Supercomputing Center (HuBMAP). Working in these systems involves several different technologies around data upload, automated workflows, and statistical analysis/visualization on remote platforms.
Since most biomedical scientists have little or no training in these areas, they will need substantial support to take advantage of cloud computing platforms to do large scale data analysis.
We anticipate running at least one full workshop on cloud bioinformatics between now and the end of August. The exact timeline and number of workshops will depend on how we recruit participants and how quickly our lesson development proceeds.
Our “general bioinformatics in the cloud” tutorials are already available for in-person meetings, but need to be updated and revamped to an online format.
For workflows, there are two primary workflow languages in use, WDL (Workflow Description Language) and CWL (Common Workflow Language). At least one of these (and sometimes both) is supported by every CF program that uses cloud workflow systems. We will develop initial training materials for data-analysis-focused biomedical scientists to make use of these workflow systems, based on our existing workflow materials.
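To give a sense of what these workflow languages look like, here is a minimal sketch of a CWL tool description. The wrapped command (`fastqc`) and file patterns are illustrative examples only, not part of any specific CFDE pipeline or lesson:

```yaml
# Minimal CWL CommandLineTool wrapping a hypothetical quality-control step.
# Tool choice and glob pattern are illustrative, not prescriptive.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: fastqc
inputs:
  reads:
    type: File          # a FASTQ file of sequencing reads
    inputBinding:
      position: 1       # passed as the first command-line argument
outputs:
  report:
    type: File
    outputBinding:
      glob: "*_fastqc.html"   # collect the generated HTML report
```

Descriptions like this are what allow cloud platforms such as Cavatica and Terra to run the same analysis reproducibly across different compute environments.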
For statistics/visualization, there are two commonly used analysis systems, R/RStudio and Python/Jupyter, that are used by almost all of the CF programs. We already have in-person training material for these systems, and will adapt them to online delivery.
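The kind of exploratory analysis these lessons cover can be sketched with a few lines of Python. The expression values below are made up for illustration, not real GTEx or Kids First data:

```python
# Illustrative exploratory statistics of the sort taught in the
# Python/Jupyter (and, analogously, R/RStudio) lessons.
# The data below are fabricated example values, not real CF data.
import statistics

# Hypothetical normalized expression values for one gene across samples
expression = [2.1, 3.4, 2.8, 5.6, 3.0, 4.2, 2.9]

mean_expr = statistics.mean(expression)
median_expr = statistics.median(expression)
stdev_expr = statistics.stdev(expression)

print(f"mean={mean_expr:.2f} median={median_expr:.2f} sd={stdev_expr:.2f}")
```

In a notebook setting, learners would typically follow a summary like this with a plot of the distribution, which is where the visualization portion of the lessons picks up.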
- Persistent, user-led walkthrough documents (~4 hours of material each)
- Accompanying short videos of difficult sections
- Materials available at nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with DCC best practices
- Online community space for learner engagement
- Formal registration system
- Code of Conduct
- Moderator Group
- DCC and other expert volunteers to answer questions
- Promotion of materials
- Promotion of online community
- Assessment
- Materials contain breaks for checking understanding
- Pre-training surveys
- Post-training surveys
- Conduct remote interviews with learners both before and after training
- Secure any approvals for human data collection
- Collect contact information from learners
In tandem with the specific workshops above, we will engage with biomedical scientists who are interested in reusing CF data. This will include members of the CF communities, biomedical scientists who attend our training sessions, and biomedical scientists recruited via social media. These discussions will be used to inform future use case development for data analysis and integration. GTEx in particular is in close contact with their end user community, and has suggested that their user base would be available for engagement.
4. Discuss and plan opportunities for in-person hackathons for technology development and use case brainstorming.
In-person activities such as hackathons and use case brainstorming are highly effective ways to understand what direction technology needs to move in to enable new data analyses and integrations. While in-person meetings are deferred for the moment, the SPARC and KF DCCs (hosted in Philadelphia) and the Metabolomics DCC (at UCSD) have expressed interest in specific events to facilitate technology development. We will connect with all three DCCs to plan events that can be held as travel restrictions ease.
We currently have no dedicated training personnel, due to departures at the beginning of the year. We are hiring three trainers and a staff coordinator, all of whom will start in May or June.