Justin Reese edited this page Jun 19, 2020 · 68 revisions

Knowledge Graph Hub concept

A Knowledge Graph Hub (KG Hub) is software that downloads data to a central location and transforms it, so that knowledge graphs (KGs) can be built from different combinations of data sources in an automated, YAML-driven way. The workflow is:

  • download data
  • transform data for each data source into two TSV files (edges.tsv and nodes.tsv) as specified here
  • merge the graphs for each data source of interest using KGX to produce a merged knowledge graph

To facilitate interoperability of datasets, biolink categories are added to nodes and biolink associations are added to edges during transformation.
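As a rough illustration of the transform step, the sketch below serializes a few records into the two tab-separated tables. The column names (`id`, `name`, `category`, `subject`, `edge_label`, `object`), the `to_tsv` helper, and the `EXAMPLE:` identifiers are assumptions made for this example; the KGX TSV format description remains the authority on the expected headers.

```python
import csv
import io

# Assumed column names, for illustration only; the KGX TSV format
# description is the authority on which headers are expected.
NODE_COLUMNS = ["id", "name", "category"]
EDGE_COLUMNS = ["subject", "edge_label", "object", "provided_by"]


def to_tsv(columns, rows):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records; the EXAMPLE: identifiers are placeholders, not real IDs.
nodes = [
    {"id": "EXAMPLE:protein1", "name": "example protein",
     "category": "biolink:Protein"},
    {"id": "EXAMPLE:drug1", "name": "example drug",
     "category": "biolink:Drug"},
]
edges = [
    {"subject": "EXAMPLE:drug1",
     "edge_label": "biolink:interacts_with",
     "object": "EXAMPLE:protein1",
     "provided_by": "example_source"},
]

nodes_tsv = to_tsv(NODE_COLUMNS, nodes)  # text that would go into nodes.tsv
edges_tsv = to_tsv(EDGE_COLUMNS, edges)  # text that would go into edges.tsv
print(nodes_tsv)
print(edges_tsv)
```

Every node carries a biolink category and every edge a biolink association, which is what allows tables from different sources to be merged later.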

KG-COVID-19 project

The KG-COVID-19 project is the first such KG Hub. It downloads and transforms COVID-19/SARS-CoV-2 and related data and emits a knowledge graph that can be loaded into KGX and used for machine learning or other purposes to produce actionable knowledge.

Download knowledge graph:

A merged knowledge graph comprised of data from all available transforms is here:

RDF

TSV

See here for a description of the KGX TSV format.

Summary of data ingested:

As of June 19, 2020, the graph contains the following data:

  count_by_category:
    biolink:AnatomicalEntity: 4656
    biolink:BiologicalProcess: 30702
    biolink:CellularComponent: 4454
    biolink:ChemicalSubstance: 22958
    biolink:Disease: 289
    biolink:Drug: 32208
    biolink:Gene: 20464
    biolink:MolecularActivity: 12202
    biolink:MolecularEntity: 1
    biolink:NamedThing: 227
    biolink:OntologyClass: 150348
    biolink:OrganismalEntity: 6
    biolink:Protein: 20738
    biolink:Publication: 52097
    biolink:RNA: 7
    external: 87
    gene_ontology: 1
    human_phenotype: 10384
    sequence: 46
    unknown: 1021

A detailed summary of data in kg-covid-19 is here, with contents of the knowledge graph broken down by biolink categories and biolink associations for nodes and edges, respectively.
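A per-category tally like the one above could be recomputed from a nodes TSV file along the following lines. This is only a sketch: the `category` column name, the `|` separator for multi-valued categories, and the use of `unknown` for empty values are assumptions, and `count_by_category` is a hypothetical helper rather than part of KGX or kg-covid-19.

```python
import csv
import io
from collections import Counter


def count_by_category(handle, column="category", separator="|"):
    """Count nodes per category in a TSV stream that has a header row.

    A node listing several categories is counted once for each of them.
    """
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        value = row.get(column) or "unknown"  # assumed label for empty cells
        for category in value.split(separator):
            counts[category] += 1
    return counts


# Tiny in-memory stand-in for a nodes.tsv file (placeholder identifiers).
sample = io.StringIO(
    "id\tname\tcategory\n"
    "EXAMPLE:1\ta\tbiolink:Protein\n"
    "EXAMPLE:2\tb\tbiolink:Protein|biolink:Gene\n"
    "EXAMPLE:3\tc\t\n"
)
for category, count in sorted(count_by_category(sample).items()):
    print(f"{category}: {count}")
```

With a real file, `open("nodes.tsv", newline="")` would take the place of the `io.StringIO` object.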

We gratefully acknowledge the Elsevier Coronavirus Information Center for sharing their coronavirus pathway data, and also acknowledge and thank all COVID-19 data providers for making their data available.

A few organizing principles:

  • UniProtKB IDs are used for genes and proteins when possible
  • For drugs and compounds, ID prefixes are preferred in this descending order: CHEBI > CHEMBL > DRUGBANK > PUBCHEM
  • Less is more: for each data source, we ingest only the subset of data that is most relevant to the KG-Hub in question (here KG-COVID-19)
  • We avoid ingesting data from a source that isn't authoritative for the data in question (e.g. do not ingest protein interaction data from a drug database)
  • Each ingest should make an effort to add provenance data by adding a provided_by column in each edge TSV file, populated with the source of each datum
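The ID preference rule for drugs and compounds can be sketched as a small selection function. `preferred_id` is a hypothetical helper written for this page, not code from the repository, and it assumes identifiers are CURIEs whose prefix matches the names in the list exactly; the numeric parts in the usage lines are made up.

```python
# Prefixes in descending order of preference, as listed above.
PREFERRED_PREFIXES = ["CHEBI", "CHEMBL", "DRUGBANK", "PUBCHEM"]
_RANK = {prefix: position for position, prefix in enumerate(PREFERRED_PREFIXES)}


def _prefix(curie):
    """Return the upper-cased part of a CURIE before the first colon."""
    return curie.split(":", 1)[0].upper()


def preferred_id(curies):
    """Pick the ID with the most preferred prefix, or None if none is known."""
    candidates = [curie for curie in curies if _prefix(curie) in _RANK]
    return min(candidates, key=lambda curie: _RANK[_prefix(curie)], default=None)


# Placeholder identifiers for the same hypothetical compound.
print(preferred_id(["PUBCHEM:1", "DRUGBANK:1", "CHEMBL:1"]))
print(preferred_id(["UNLISTED:1"]))
```

Returning `None` for unrecognized prefixes leaves the decision about such IDs to the caller instead of silently picking one.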

People:

The code:

  • Here is the GitHub repo for this project.

  • Here is the GitHub repo for Embiggen, an implementation of node2vec and other methods to generate embeddings and apply machine learning to graphs.

Installation:

git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
cd kg-covid-19
pip install .
pip install -r requirements.txt

Running the code:

python run.py download
python run.py transform
python run.py merge
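The three stages depend on each other, so it can be convenient to run them in order and stop at the first failure. The wrapper below is a sketch under that assumption; `run_pipeline` and `STAGES` are hypothetical names, and only the three subcommands shown above are used.

```python
import subprocess
import sys

# The stages in the order given above; each one consumes the previous output.
STAGES = ["download", "transform", "merge"]


def run_pipeline(stages=STAGES, runner=subprocess.run):
    """Run `python run.py <stage>` for each stage, aborting on the first error.

    `runner` is injectable so the sequence can be checked without
    actually launching the commands.
    """
    for stage in stages:
        command = [sys.executable, "run.py", stage]
        # check=True makes subprocess.run raise CalledProcessError
        # when a stage exits with a non-zero status.
        runner(command, check=True)
```

Calling `run_pipeline()` from the root of the cloned repository runs all three stages; `run_pipeline(["transform", "merge"])` would skip the download.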

Querying the graph:

A SPARQL endpoint for the complete, merged graph with all available source data is here. Consider using https://yasgui.triply.cc/ for querying. Here are some example queries: https://github.com/Knowledge-Graph-Hub/kg-covid-19/tree/master/queries/sparql
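For querying from a script instead of a web client, a request following the SPARQL 1.1 Protocol can be built with the Python standard library. This is a minimal sketch: the endpoint URL is a parameter and is not filled in here, `build_sparql_request` and `run_select` are hypothetical helpers, and the query is a generic triple pattern, not one of the project's example queries.

```python
import json
import urllib.parse
import urllib.request


def build_sparql_request(endpoint, query):
    """Build a GET request that asks the endpoint for JSON-formatted results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def run_select(endpoint, query, timeout=30):
    """Send a SELECT query and return the list of result bindings."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


# Generic query for illustration: fetch any ten triples.
EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
```

`run_select(endpoint_url, EXAMPLE_QUERY)` would then return one dictionary per result row, keyed by variable name, with `endpoint_url` set to the address of the SPARQL endpoint.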

Contributing:

  • Here is a more detailed description, and instructions on how to help.