Home
A Knowledge Graph Hub (KG Hub) is software that downloads data and transforms it into a central location for building knowledge graphs (KGs) from different combinations of data sources, in an automated, YAML-driven way. The workflow is:
- download data
- transform data for each data source into two TSV files (edges.tsv and nodes.tsv) as specified here
- merge the graphs for each data source of interest using KGX to produce a merged knowledge graph
To facilitate interoperability of datasets, biolink categories are added to nodes and biolink associations are added to edges during transformation.
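As a rough illustration of what a transform emits, the sketch below writes a tiny nodes.tsv and edges.tsv pair in memory. The column names, the `write_tsv` helper, the identifiers, and the `example_source` value are all assumptions made for illustration; the KGX TSV format description is the authoritative reference for the real columns.

```python
import csv
import io

# Assumed column layout, for illustration only; check the KGX TSV
# format description for the columns a real transform must emit.
NODE_COLUMNS = ["id", "name", "category"]
EDGE_COLUMNS = ["subject", "predicate", "object", "provided_by"]


def write_tsv(rows, columns):
    """Serialize a list of dicts to a tab-separated string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up identifiers: these are placeholders, not real database entries.
nodes = [
    {"id": "CHEBI:EXAMPLE1", "name": "example drug", "category": "biolink:Drug"},
    {"id": "UniProtKB:EXAMPLE2", "name": "example protein", "category": "biolink:Protein"},
]
edges = [
    {
        "subject": "CHEBI:EXAMPLE1",
        "predicate": "biolink:interacts_with",
        "object": "UniProtKB:EXAMPLE2",
        "provided_by": "example_source",
    },
]

nodes_tsv = write_tsv(nodes, NODE_COLUMNS)
edges_tsv = write_tsv(edges, EDGE_COLUMNS)
print(nodes_tsv)
print(edges_tsv)
```

Every node carries a biolink category and every edge carries a biolink association, which is what lets graphs from different sources be merged on a shared vocabulary.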
The KG-Covid-19 project is the first such KG Hub. It downloads and transforms COVID-19/SARS-CoV-2 and related data and emits a knowledge graph that can be loaded into KGX and used for machine learning or other uses, to produce actionable knowledge.
Download knowledge graph:
A merged knowledge graph comprising data from all available transforms is here:
See here for a description of the KGX TSV format.
Summary of data (Apr 2020):
A detailed, up-to-date summary of data in kg-covid-19 is here, with contents of the knowledge graph broken down by biolink categories and biolink associations for nodes and edges, respectively.
As of July 6, 2020, the graph contains the following data:
count_by_category:
biolink:OntologyClass: 80005
biolink:Publication: 52105
biolink:Drug: 32204
biolink:BiologicalProcess: 30702
biolink:ChemicalSubstance: 29858
biolink:Protein: 21070
biolink:Gene: 19240
biolink:MolecularActivity: 12202
human_phenotype: 10384
biolink:AnatomicalEntity: 4656
biolink:CellularComponent: 4454
unknown: 4340
biolink:Disease: 289
biolink:NamedThing: 227
external: 87
biolink:Assay: 48
sequence: 46
biolink:RNA: 7
biolink:OrganismalEntity: 6
biolink:MolecularEntity: 1
gene_ontology: 1
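A per-category tally like the one above can be approximated from a nodes file with a few lines of Python. This is only a sketch: it assumes a tab-separated file with a `category` column, and the choice to label empty categories as `unknown` is an assumption, not necessarily how the project's own summary is produced.

```python
import csv
import io
from collections import Counter


def count_by_category(tsv_file):
    """Count nodes per category in a tab-separated nodes file.

    Assumes a header row with a 'category' column; rows with a missing
    or empty category are counted as 'unknown' (an assumption).
    """
    counts = Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        counts[row.get("category") or "unknown"] += 1
    return counts


# Small in-memory stand-in for nodes.tsv, using made-up identifiers.
sample = io.StringIO(
    "id\tname\tcategory\n"
    "EX:1\tfirst protein\tbiolink:Protein\n"
    "EX:2\tsecond protein\tbiolink:Protein\n"
    "EX:3\tsome drug\tbiolink:Drug\n"
    "EX:4\tuncategorized node\t\n"
)

counts = count_by_category(sample)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

For a real file, pass `open("nodes.tsv", newline="")` in place of the in-memory sample.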
We gratefully acknowledge the Elsevier Coronavirus Information Center for sharing their coronavirus pathway data, and also acknowledge and thank all COVID-19 data providers for making their data available.
A few organizing principles:
- UniProtKB IDs are used for genes and proteins when possible
- For drugs and compounds, IDs are chosen in descending order of preference: CHEBI > CHEMBL > DRUGBANK > PUBCHEM
- Less is more: for each data source, we ingest only the subset of data that is most relevant to the KG-Hub in question (here KG-COVID-19)
- We avoid ingesting data from a source that isn't authoritative for the data in question (e.g. do not ingest protein interaction data from a drug database)
- Each ingest should make an effort to add provenance data by adding a provided_by column in each edge TSV file, populated with the source of each datum
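The ID preference order for drugs and compounds can be expressed as a small lookup. The sketch below is one possible reading of that rule, not the project's implementation: the function name `preferred_drug_id`, the CURIE-style `PREFIX:local_id` form, and the handling of dotted prefixes are all assumptions, and the identifiers are placeholders.

```python
# Preference order taken from the organizing principles above.
DRUG_PREFIX_PREFERENCE = ["CHEBI", "CHEMBL", "DRUGBANK", "PUBCHEM"]


def preferred_drug_id(curies):
    """Return the candidate ID whose prefix ranks highest, or None.

    Assumes IDs look like 'PREFIX:local_id'. A dotted prefix such as
    'CHEMBL.COMPOUND' is matched on the part before the dot (assumption).
    """
    rank = {prefix: i for i, prefix in enumerate(DRUG_PREFIX_PREFERENCE)}
    best, best_rank = None, len(rank)
    for curie in curies:
        prefix = curie.split(":", 1)[0].split(".", 1)[0].upper()
        candidate_rank = rank.get(prefix)
        if candidate_rank is not None and candidate_rank < best_rank:
            best, best_rank = curie, candidate_rank
    return best


# Placeholder identifiers for the same hypothetical compound.
candidates = ["DRUGBANK:X1", "PUBCHEM.COMPOUND:X3", "CHEBI:X2"]
chosen = preferred_drug_id(candidates)
print(chosen)
```

Returning `None` when no known prefix is present leaves it to the caller to decide whether to skip the record or keep its original ID.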
People:
- Justin Reese
- Deepak Unni
- Marcin Joachimiak
- Peter Robinson
- Chris Mungall
- Tiffany Callahan
- Luca Cappelletti
- Vida Ravanmehr
The code:
- Here is the GitHub repo for this project.
- Here is the GitHub repo for Embiggen, an implementation of node2vec and other methods to generate embeddings and apply machine learning to graphs.
Installation:
git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
cd kg-covid-19
pip install .
pip install -r requirements.txt
Running the code:
python run.py download
python run.py transform
python run.py merge
Querying the graph:
A SPARQL endpoint for the complete, merged graph with all available source data is here. Consider using https://yasgui.triply.cc/ for querying. Here are some example queries: https://github.com/Knowledge-Graph-Hub/kg-covid-19/tree/master/queries/sparql
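For programmatic access, a query can also be sent over the standard SPARQL protocol from Python. The sketch below only builds the HTTP request; the endpoint URL is a placeholder (`https://example.org/sparql`) that must be replaced with the real endpoint, and the query is a generic predicate count that makes no assumption about the graph's schema.

```python
import urllib.parse
import urllib.request

# Generic query: the ten most frequent predicates in the graph.
QUERY = """
SELECT ?p (COUNT(*) AS ?n)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?n)
LIMIT 10
"""


def build_sparql_request(endpoint_url, query):
    """Build a GET request that asks the endpoint for JSON results."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


# Placeholder endpoint; substitute the project's actual SPARQL endpoint.
request = build_sparql_request("https://example.org/sparql", QUERY)
print(request.full_url)

# To run the query against a real endpoint:
# import json
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```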
Contributing:
- Here is a more detailed description, and instructions on how to help.