Home
A Knowledge Graph Hub (KG Hub) is software that downloads and transforms data to a central location for building knowledge graphs (KGs) from different combinations of data sources, in an automated, YAML-driven way. The workflow is:
- download data
- transform data for each data source into two TSV files (edges.tsv and nodes.tsv) as specified here
- merge the graphs for each data source of interest using KGX to produce a merged knowledge graph
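To make the transform output concrete, here is a minimal sketch of writing one node table and one edge table as tab-separated text. The column names, identifiers, and the source name are illustrative assumptions, not the project's actual schema; the KGX TSV format description remains the authoritative reference.

```python
import csv
import io

# Assumed column sets, for illustration only; the KGX TSV format
# description defines the authoritative columns.
NODE_COLUMNS = ["id", "name", "category"]
EDGE_COLUMNS = ["subject", "predicate", "object", "provided_by"]


def rows_to_tsv(columns, rows):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder records: the IDs and the source name are made up.
nodes = [
    {"id": "EXAMPLE:protein1", "name": "example protein", "category": "biolink:Protein"},
    {"id": "EXAMPLE:drug1", "name": "example drug", "category": "biolink:Drug"},
]
edges = [
    {
        "subject": "EXAMPLE:drug1",
        "predicate": "biolink:interacts_with",
        "object": "EXAMPLE:protein1",
        "provided_by": "example_source",
    },
]

nodes_tsv = rows_to_tsv(NODE_COLUMNS, nodes)
edges_tsv = rows_to_tsv(EDGE_COLUMNS, edges)
```

Writing `nodes_tsv` and `edges_tsv` to files named nodes.tsv and edges.tsv would give the pair of files that the merge step consumes for one data source.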
To facilitate interoperability of datasets, biolink categories are added to nodes and biolink associations are added to edges during transformation.
The KG-Covid-19 project is the first such KG Hub. It downloads and transforms COVID-19/SARS-CoV-2 and related data and emits a knowledge graph that can be loaded into KGX and used for machine learning or other uses, to produce actionable knowledge.
Download knowledge graph:
A merged knowledge graph comprising data from all available transforms is here:
See here for a description of the KGX TSV format.
Summary of data (Apr 2020):
A detailed summary of data in kg-covid-19 is here, with contents of the knowledge graph broken down by biolink categories and biolink associations for nodes and edges, respectively.
We gratefully acknowledge the Elsevier Coronavirus Information Center for sharing their coronavirus pathway data, and also acknowledge and thank all COVID-19 data providers for making their data available.
A few organizing principles:
- UniProtKB IDs are used for genes and proteins when possible
- For drug/compound IDs, these IDs are preferred, in descending order of preference: CHEBI > CHEMBL > DRUGBANK > PUBCHEM
- Less is more: for each data source, we ingest only the subset of data that is most relevant to the KG-Hub in question (here KG-COVID-19)
- We avoid ingesting data from a source that isn't authoritative for the data in question (e.g. do not ingest protein interaction data from a drug database)
- Each ingest should make an effort to add provenance data by adding a provided_by column in each edge TSV file, populated with the source of each datum
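The drug/compound ID preference above (CHEBI > CHEMBL > DRUGBANK > PUBCHEM) can be sketched as a small selection function. This is an illustration and not the project's code: the function name is hypothetical, and it assumes IDs are CURIEs whose prefix is spelled exactly as in the list.

```python
# Preference order taken from the organizing principles above.
PREFIX_PRIORITY = ["CHEBI", "CHEMBL", "DRUGBANK", "PUBCHEM"]


def preferred_compound_id(curies):
    """Return the CURIE with the highest-ranked prefix, or None if none match.

    Assumes each ID looks like "PREFIX:local_id"; real data may use other
    prefix spellings, which this sketch does not handle.
    """
    ranked = {prefix: rank for rank, prefix in enumerate(PREFIX_PRIORITY)}

    def prefix_of(curie):
        return curie.split(":", 1)[0].upper()

    candidates = [c for c in curies if prefix_of(c) in ranked]
    if not candidates:
        return None
    # Lower rank means more preferred; ties keep the first occurrence.
    return min(candidates, key=lambda c: ranked[prefix_of(c)])


# Placeholder identifiers, not real database entries.
chosen = preferred_compound_id(["DRUGBANK:X1", "PUBCHEM:X2", "CHEBI:X3"])
```

With the placeholder input shown, `chosen` is the CHEBI entry because CHEBI ranks first in the preference list.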
People:
- Justin Reese
- Deepak Unni
- Marcin Joachimiak
- Peter Robinson
- Chris Mungall
- Tiffany Callahan
- Luca Cappelletti
- Vida Ravanmehr
The code:
- Here is the GitHub repo for this project.
- Here is the GitHub repo for Embiggen, an implementation of node2vec and other methods to generate embeddings and apply machine learning to graphs.
Installation:
git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
cd kg-covid-19
pip install .
pip install -r requirements.txt
Running the code:
python run.py download
python run.py transform
python run.py merge
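The three commands depend on each other (merge needs the transformed files, and transform needs the downloads), so it can be convenient to run them in order and stop at the first failure. A minimal sketch, assuming it is executed from the repository root where run.py lives; the helper names are hypothetical and not part of the project:

```python
import subprocess
import sys

# The three subcommands shown above, in the order they must run.
STEPS = ["download", "transform", "merge"]


def build_commands(steps=STEPS, script="run.py"):
    """Build one argument list per step, using the current Python interpreter."""
    return [[sys.executable, script, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True raises on a non-zero exit code,
    so later steps are skipped if an earlier one fails."""
    for command in commands:
        runner(command, check=True)


# To execute the full workflow from the repository root:
#     run_pipeline(build_commands())
```

The `runner` parameter exists only so the sequence can be exercised without launching real processes.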
Querying the graph:
A SPARQL endpoint for the complete, merged graph with all available source data is here. Consider using https://yasgui.triply.cc/ for querying. Here are some example queries: https://github.com/Knowledge-Graph-Hub/kg-covid-19/tree/master/queries/sparql
Contributing:
- Here is a more detailed description, and instructions on how to help.