Skip to content

Commit

Permalink
Add script to assign triple-object type checking
Browse files Browse the repository at this point in the history
An issue was found in the test data, where nodes were incorrectly
spelled as RDF literals instead of resource references.  Unfortunately,
the current state of the SHACL implementation did not identify this
typing error.  The rdfs:range defined on the properties needs to be
carried into sh:PropertyShape constraints.

Note that there are eight datatype properties that do not have
sh:datatype constraints generated, because they are currently
implemented in UCO as blank nodes.  Not having a name, they can't be
referenced in multiple locations.  Some of the blank nodes touch on
tickets in UCO's backlog, such as Bug OC-12 (which touches on other
matters).  Some raise issues of hierarchies of RDF-literals, such as
the constrained core:confidence range.  For at least these reasons,
sh:datatype annotations for properties with blank ranges will be left
for a future version of UCO to address.

Here is a transcript of the script's run, listing the just-described
datatype properties:

WARNING:populate_node_kind.py:n_node_shape = https://unifiedcyberontology.org/ontology/uco/core#ConfidenceFacet.
WARNING:populate_node_kind.py:n_path = https://unifiedcyberontology.org/ontology/uco/core#confidence.
WARNING:populate_node_kind.py:1 datatype properties with blank nodes as ranges found, and will not receive sh:datatype constraints.
WARNING:populate_node_kind.py:n_node_shape = https://unifiedcyberontology.org/ontology/uco/observable#ContactSIP.
WARNING:populate_node_kind.py:n_path = https://unifiedcyberontology.org/ontology/uco/observable#contactSIPScope.
WARNING:populate_node_kind.py:n_node_shape = https://unifiedcyberontology.org/ontology/uco/observable#ContactEmail.
WARNING:populate_node_kind.py:n_path = https://unifiedcyberontology.org/ontology/uco/observable#contactEmailScope.
WARNING:populate_node_kind.py:n_node_shape = https://unifiedcyberontology.org/ontology/uco/observable#URLVisitFacet.
WARNING:populate_node_kind.py:n_path = https://unifiedcyberontology.org/ontology/uco/observable#urlTransitionType.
WARNING:populate_node_kind.py:n_node_shape = https://unifiedcyberontology.org/ontology/uco/observable#WhoisContactFacet.
WARNING:populate_node_kind.py:n_path = https://unifiedcyberontology.org/ontology/uco/observable#whoisContactType.
WARNING:populate_node_kind.py:n_node_shape = https://unifiedcyberontology.org/ontology/uco/observable#ContactPhone.
WARNING:populate_node_kind.py:n_path = https://unifiedcyberontology.org/ontology/uco/observable#contactPhoneScope.
WARNING:populate_node_kind.py:n_node_shape = https://unifiedcyberontology.org/ontology/uco/observable#ContactAddress.
WARNING:populate_node_kind.py:n_path = https://unifiedcyberontology.org/ontology/uco/observable#contactAddressScope.
WARNING:populate_node_kind.py:n_node_shape = https://unifiedcyberontology.org/ontology/uco/observable#ContactURL.
WARNING:populate_node_kind.py:n_path = https://unifiedcyberontology.org/ontology/uco/observable#contactURLScope.
WARNING:populate_node_kind.py:7 datatype properties with blank nodes as ranges found, and will not receive sh:datatype constraints.

No analagous issues were found for `rdfs:ObjectProperty`s.

The above transcripts still apply after the cherry-pick from the -v6
tree.

References:
* [OC-12] UCO's idea of "Open vocabulary" does not agree with its
  implementation with owl:oneOf
* [OC-68] (CP-23) Convert current property restrictions and domain
  assertions to SHACL class shapes
* [Feature-CP-23-v6] https://github.com/ajnelson-nist/UCO/tree/archive/Feature-CP-23-v6

Signed-off-by: Alex Nelson <[email protected]>
(cherry picked from commit 8e38bbc)
  • Loading branch information
ajnelson-nist committed Jul 29, 2021
1 parent 6672c16 commit 861e287
Showing 1 changed file with 286 additions and 0 deletions.
286 changes: 286 additions & 0 deletions src/populate_node_kind.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,286 @@
#!/usr/bin/make -f

# This software was developed at the National Institute of Standards
# and Technology by employees of the Federal Government in the course
# of their official duties. Pursuant to title 17 Section 105 of the
# United States Code this software is not subject to copyright
# protection and is in the public domain. NIST assumes no
# responsibility whatsoever for its use by other parties, and makes
# no guarantees, expressed or implied, about its quality,
# reliability, or any other characteristic.
#
# We would appreciate acknowledgement if the software is used.

"""
The purpose of this program is to build definition statements in a SHACL ontology, such that sh:PropertyShapes record the rdfs:range defined on the property definition.

The program's current intent is as a single-purpose utility. Usage:
1. Be in a Python environment that has rdflib installed. (This could be done by enabling the virtual environment under ../tests/venv.)
2. Run this program. It will **overwrite** all ontology files matching the pattern ../uco-*/*.ttl.
3. Re-run 'make check' from the root directory, to re-normalize ontology files.

The outline of this program is:
1. Load all ontology files into dictionary of graphs, keyed by relpath from top_srcdir.
2. Store all property-defining rdf:type and rdfs:range triples from all loaded ontologies into a "properties" graph.
3. For each ontology (by relpath):
3.1 CONSTRUCT triples for each PropertyShape, based on property being ObjectProperty or DatatypeProperty.
3.1.1 DatatypeProperty -> sh:nodeKind = sh:Literal
3.1.2 DatatypeProperty -> sh:datatype = (rdfs:range of property, if an IRI[1])
3.1.3 ObjectProperty -> sh:nodeKind = sh:BlankNodeOrIRI
3.1.4 ObjectProperty -> sh:class = (rdfs:range of property, if an IRI[1])
[1] If a property's rdfs:range is a blank node, currently this script does NOT generate a sh:datatype or sh:class constraint, due to needing to address ontology design issues.
"""

__version__ = "0.1.0"

import argparse
import logging
import os
import pathlib
import typing

import rdflib.plugins.sparql

_logger = logging.getLogger(os.path.basename(__file__))

NS_OWL = rdflib.OWL
NS_RDF = rdflib.RDF
NS_RDFS = rdflib.RDFS
NS_SH = rdflib.SH

def main():
argument_parser = argparse.ArgumentParser()
argument_parser.add_argument("--debug", action="store_true")
argument_parser.add_argument("--dry-run", action="store_true", help="Count updates, but do not overwrite ontology files.")
args = argument_parser.parse_args()

logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)

# 0. Self-orient.
top_srcdir = pathlib.Path(os.path.dirname(__file__)) / ".."

# Sanity check.
assert (top_srcdir / ".git").exists(), "Hard-coded top_srcdir discovery is no longer correct."

# 1. Load all ontology files into dictionary of graphs.

# The extra filtering step loop to keep from picking up CI files. Path.glob returns dot files, unlike shell's glob.
# The uco.ttl file is also skipped because the Python output removes supplementary prefix statements.
ontology_filepaths : typing.List[pathlib.Path] = []
for x in top_srcdir.glob("uco-*/*.ttl"):
if ".check-" in str(x):
continue
if "uco.ttl" in str(x):
continue
ontology_filepaths.append(x)
assert len(ontology_filepaths) > 0, "Hard-coded relative paths to ontology files is no longer correct."

filepath_to_graph : typing.Dict[pathlib.Path, rdflib.Graph] = dict()

for ontology_filepath in sorted(ontology_filepaths):
_logger.debug("Loading %s...", ontology_filepath)
filepath_to_graph[ontology_filepath] = rdflib.Graph()
ontology_filepath_str = str(ontology_filepath)
filepath_to_graph[ontology_filepath].parse(ontology_filepath_str, format=rdflib.util.guess_format(ontology_filepath_str))
_logger.debug("Loaded.")

# Build global nsdict.
nsdict = dict()
for ontology_filepath in sorted(filepath_to_graph.keys()):
tmp_nsdict = {k:v for (k,v) in filepath_to_graph[ontology_filepath].namespace_manager.namespaces()}
for key in tmp_nsdict:
if key in nsdict:
try:
assert nsdict[key] == tmp_nsdict[key]
except:
_logger.error("ontology_filepath = %s.", ontology_filepath)
_logger.error("key = %r.", key)
_logger.error("nsdict[key] = %r.", nsdict[key])
_logger.error("tmp_nsdict[key] = %r.", tmp_nsdict[key])
raise
nsdict[key] = tmp_nsdict[key]

# 2. Store all property-defining rdf:type and rdfs:range triples from all loaded ontologies into a "properties" graph.

properties_graph = rdflib.Graph()
_logger.debug("Building properties graph...")
for ontology_filepath in sorted(filepath_to_graph.keys()):
for n_type_value in [NS_OWL.DatatypeProperty, NS_OWL.ObjectProperty]:
for triple_0 in filepath_to_graph[ontology_filepath].triples((
None,
NS_RDF.type,
n_type_value
)):
properties_graph.add(triple_0)
for triple_1 in filepath_to_graph[ontology_filepath].triples((
triple_0[0],
NS_RDFS.range,
None
)):
properties_graph.add(triple_1)
_logger.debug("Built.")

#3. For each ontology (by relpath):
# 3.1 CONSTRUCT triples for each PropertyShape, based on property being ObjectProperty or DatatypeProperty.

# 3.1.0.1 DatatypeProperty, rdfs:range a bnode -> Warn.
select_datatype_range_bnode_query = rdflib.plugins.sparql.prepareQuery("""\
SELECT ?nNodeShape ?nPath
WHERE {
?nNodeShape
a sh:NodeShape ;
sh:property ?nPropertyShape ;
.

?nPropertyShape
sh:path ?nPath ;
.

?nPath
a owl:DatatypeProperty ;
rdfs:range ?nRange ;
.

FILTER isBlank(?nRange)
}
""", initNs=nsdict)

# 3.1.0.2 ObjectProperty, rdfs:range a bnode -> Warn.
select_object_range_bnode_query = rdflib.plugins.sparql.prepareQuery("""\
SELECT ?nNodeShape ?nPath
WHERE {
?nNodeShape
a sh:NodeShape ;
sh:property ?nPropertyShape ;
.

?nPropertyShape
sh:path ?nPath ;
.

?nPath
a owl:ObjectProperty ;
rdfs:range ?nRange ;
.

FILTER isBlank(?nRange)
}
""", initNs=nsdict)

# 3.1.1 DatatypeProperty -> sh:nodeKind = sh:Literal
# 3.1.2 DatatypeProperty -> sh:datatype = (rdfs:range of property)
construct_datatype_property_query = rdflib.plugins.sparql.prepareQuery("""\
CONSTRUCT {
?nPropertyShape
sh:datatype ?nRange ;
sh:nodeKind sh:Literal ;
.
}
WHERE {
?nNodeShape
a sh:NodeShape ;
sh:property ?nPropertyShape ;
.

?nPropertyShape
sh:path ?nPath ;
.

?nPath
a owl:DatatypeProperty ;
.

OPTIONAL {
?nPath
rdfs:range ?nRange ;
.

FILTER isIRI(?nRange)
}
}
""", initNs=nsdict)

# 3.1.3 ObjectProperty -> sh:nodeKind = sh:BlankNodeOrIRI
# 3.1.4 ObjectProperty -> sh:class = (rdfs:range of property) (NOT performed currently)
construct_object_property_query = rdflib.plugins.sparql.prepareQuery("""\
CONSTRUCT {
?nPropertyShape
sh:class ?nRange ;
sh:nodeKind sh:BlankNodeOrIRI ;
.
}
WHERE {
?nNodeShape
a sh:NodeShape ;
sh:property ?nPropertyShape ;
.

?nPropertyShape
sh:path ?nPath ;
.

?nPath
a owl:ObjectProperty ;
.

OPTIONAL {
?nPath
rdfs:range ?nRange ;
.
FILTER isIRI(?nRange)
}
}
""", initNs=nsdict)

for ontology_filepath in sorted(filepath_to_graph.keys()):
_logger.debug("Augmenting %s...", ontology_filepath)

_logger.debug("len(base_graph) = %d.", len(filepath_to_graph[ontology_filepath]))

base_and_properties_graph = properties_graph + filepath_to_graph[ontology_filepath]
_logger.debug("len(base_and_properties_graph) = %d.", len(base_and_properties_graph))

constructed_graph = rdflib.Graph()

_logger.debug("Finding datatype properties with blank nodes as ranges ...")
num_found = 0
for result in base_and_properties_graph.query(select_datatype_range_bnode_query):
(n_node_shape, n_path) = result
_logger.warning("n_node_shape = %s.", n_node_shape)
_logger.warning("n_path = %s.", n_path)
num_found += 1
if num_found == 0:
_logger.debug("None found.")
else:
_logger.warning("%d datatype properties with blank nodes as ranges found, and will not receive sh:datatype constraints.", num_found)

_logger.debug("Finding object properties with blank nodes as ranges ...")
num_found = 0
for result in base_and_properties_graph.query(select_object_range_bnode_query):
(n_node_shape, n_path) = result
_logger.error("n_node_shape = %s.", n_node_shape)
_logger.error("n_path = %s.", n_path)
num_found += 1
if num_found == 0:
_logger.debug("None found.")
else:
_logger.error("%d object properties with blank nodes as ranges found, and will not receive sh:class constraints.", num_found)

for result in base_and_properties_graph.query(construct_datatype_property_query):
constructed_graph.add(result)
_logger.debug("len(constructed_graph (+d)) = %d.", len(constructed_graph))

for result in base_and_properties_graph.query(construct_object_property_query):
constructed_graph.add(result)
_logger.debug("len(constructed_graph (+o)) = %d.", len(constructed_graph))

filepath_to_graph[ontology_filepath] += constructed_graph
_logger.debug("len(base_graph) = %d.", len(filepath_to_graph[ontology_filepath]))

if not args.dry_run:
filepath_to_graph[ontology_filepath].serialize(str(ontology_filepath), format="turtle")

_logger.debug("Augmented.")

if __name__ == "__main__":
main()

0 comments on commit 861e287

Please sign in to comment.